1

I'm trying to drop values from a dataframe that fuzzy match items in a list.

I have a dataframe (test_df) that looks like:

   id          email         created_at      
0  1   son@mail_a.com   2017-01-21 18:19:00  
1  2   boy@mail_b.com   2017-01-22 01:19:00  
2  3  girl@mail_c.com   2017-01-22 01:19:00 

I have a list of a few hundred email domains that I am reading in from a txt file that looks like:

mail_a.com
mail_d.com
mail_e.com

I'm trying to drop from the dataframe any row that contains a matching email domain using:

email_domains = open('file.txt', 'r')
to_drop = email_domains.read().splitlines()    
dropped_df = test_df[~test_df['email'].isin(to_drop)]
    print(test_df)

So, the result should look like:

   id          email         created_at       
0  2   boy@mail_b.com   2017-01-22 01:19:00  
1  3  girl@mail_c.com   2017-01-22 01:19:00 

But the first row with "son@mail_a.com" is not dropped. Any suggestions?

4 Answers 4

3

isin looks for exact matches. Your condition is more suitable for endswith or contains:

df[~df['email'].str.endswith(tuple(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

df[~df['email'].str.contains('|'.join(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00
Sign up to request clarification or add additional context in comments.

1 Comment

@MaxU Thank you. :)
2

It's pretty easy to parse domain name from the email, so we can first parse domains using .str.split('@') and then check it using isin() method:

In [12]: df[~df.email.str.split('@').str[1].isin(domains.domain)]
Out[12]:
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

where:

In [13]: domains
Out[13]:
       domain
0  mail_a.com
1  mail_d.com
2  mail_e.com

Comments

0

you can use apply and split the string and use it for your isin

print test_df[~test_df['email'].apply(lambda x: x.split('@')[1]).isin(to_drop)]

results in

            created_at            email
1  2017-01-22 01:19:00   boy@mail_b.com
2  2017-01-22 01:19:00  girl@mail_c.com

Comments

0

Yet another answer...This is a one liner:

exclude = ['mail_a.com','mail_d.com','mail_e.com']
df[df.apply(lambda x: all([x['email'].rfind(ex) < 0 for ex in exclude]), axis=1)]
# OUTPUT
# Out[50]:
#              created_at            email  id
# 1   2017-01-22 01:19:00   boy@mail_b.com   2
# 2   2017-01-22 01:19:00  girl@mail_c.com   3

Here I use the rfind returns -1 if the pattern isn't found.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.