0

I have this df with email headers. I need to eliminate all duplicates where Subject is the same AND Source is different. I have spent hours trying to figure out a solution or find a similar case...

Date From Subject Source
12/06/21 Sender1 Test123 Inbox
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender1 Test123 Sent
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox

Practically from the table above the rows with subject = 'Test123' should be dropped.

Date From Subject Source
12/06/21 Sender2 Confirm Inbox
12/06/21 Sender3 Test_on Inbox
12/06/21 Sender3 Test_on Inbox
1
  • something like df[df['Subject'].duplicated(keep=False) & ~df['Source'].duplicated(keep=False)]? Commented Dec 6, 2021 at 22:10

2 Answers 2

1

You can use set to determine for each sender if there are multiple source. If yes, drop the row.

>>> df.loc[df.groupby('From')['Source'].transform(lambda x: len(set(x)) == 1)]

       Date     From  Subject Source
1  12/06/21  Sender2  Confirm  Inbox
3  12/06/21  Sender3  Test_on  Inbox
4  12/06/21  Sender3  Test_on  Inbox
Sign up to request clarification or add additional context in comments.

1 Comment

And it works... Thank you Corralien!
0
duplicated_subject = df.duplicated('Subject', keep=False)
duplicated_subject_and_source = df.duplicated(['Subject', 'Source'], keep=False)
df[~duplicated_subject | duplicated_subject_and_source]

eliminate all duplicates where "Subject is the same AND Source is different"

is equivalent to

keep where "Subject is not duplicated OR Subject is duplicated and Source is the same"

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.