Python Pandas - dropping multiple values based on a list

Question

I'm trying to drop values from a dataframe that fuzzy match items in a list.

I have a dataframe (test_df) that looks like:

   id          email         created_at      
0  1   son@mail_a.com   2017-01-21 18:19:00  
1  2   boy@mail_b.com   2017-01-22 01:19:00  
2  3  girl@mail_c.com   2017-01-22 01:19:00

I have a list of a few hundred email domains that I am reading in from a txt file that looks like:

mail_a.com
mail_d.com
mail_e.com

I'm trying to drop from the dataframe any row that contains a matching email domain using:

with open('file.txt', 'r') as email_domains:
    to_drop = email_domains.read().splitlines()
dropped_df = test_df[~test_df['email'].isin(to_drop)]
print(dropped_df)

So, the result should look like:

   id          email         created_at       
0  2   boy@mail_b.com   2017-01-22 01:19:00  
1  3  girl@mail_c.com   2017-01-22 01:19:00

But the first row with "son@mail_a.com" is not dropped. Any suggestions?

score 3 · Accepted Answer · 2017-04-24 19:25:14Z

isin looks for exact matches, so a full address such as son@mail_a.com never equals the bare domain mail_a.com. Your condition is better suited to endswith or contains:

df[~df['email'].str.endswith(tuple(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

df[~df['email'].str.contains('|'.join(to_drop))]
Out: 
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00
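
For anyone who wants to run both variants, here is a self-contained sketch. The sample frame and the to_drop list are rebuilt from the question. The "@" prefix, the re.escape call and the trailing "$" are additions that are not in the answer above; they assume you want dots in domain names treated literally and want to match the whole domain rather than any address that merely contains it. Passing a tuple to str.endswith is documented from pandas 1.5 onward.

```python
import re

import pandas as pd

# Sample data rebuilt from the question (assumption: the real frame has these columns).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["son@mail_a.com", "boy@mail_b.com", "girl@mail_c.com"],
    "created_at": ["2017-01-21 18:19:00", "2017-01-22 01:19:00", "2017-01-22 01:19:00"],
})
to_drop = ["mail_a.com", "mail_d.com", "mail_e.com"]

# endswith accepts a tuple of suffixes; the "@" prefix ties each suffix to the full domain.
suffixes = tuple("@" + domain for domain in to_drop)
kept_endswith = df[~df["email"].str.endswith(suffixes)]

# contains interprets the pattern as a regex, so escape each domain and anchor it at the end.
pattern = "@(?:" + "|".join(re.escape(domain) for domain in to_drop) + ")$"
kept_contains = df[~df["email"].str.contains(pattern)]

print(kept_endswith)
print(kept_contains)
```

Both filters keep the rows with id 2 and 3 for this sample data.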

edited Apr 24, 2017 at 19:25

answered Apr 24, 2017 at 19:23

user2285236

1 Comment

user2285236 Over a year ago

@MaxU Thank you. :)

MaxU - stand with Ukraine · Accepted Answer · 2017-04-24 19:22:27Z

2

It's easy to extract the domain name from each email address, so we can first split on '@' using .str.split('@') and then check the domain part with the isin() method:

In [12]: df[~df.email.str.split('@').str[1].isin(domains.domain)]
Out[12]:
   id            email           created_at
1   2   boy@mail_b.com  2017-01-22 01:19:00
2   3  girl@mail_c.com  2017-01-22 01:19:00

where:

In [13]: domains
Out[13]:
       domain
0  mail_a.com
1  mail_d.com
2  mail_e.com
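
A runnable sketch of the same idea, including one possible way to build the domains frame. The io.StringIO object stands in for the question's file.txt, and the read_csv arguments assume one domain per line with no header row; neither is taken from the answer above. rsplit with n=1 is used here instead of split so that only the last "@" is considered.

```python
import io

import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["son@mail_a.com", "boy@mail_b.com", "girl@mail_c.com"],
    "created_at": ["2017-01-21 18:19:00", "2017-01-22 01:19:00", "2017-01-22 01:19:00"],
})

# Stand-in for file.txt (assumption: one domain per line, no header).
domain_file = io.StringIO("mail_a.com\nmail_d.com\nmail_e.com\n")
domains = pd.read_csv(domain_file, header=None, names=["domain"])

# Take the text after the last "@" and compare it exactly against the domain list.
email_domain = df["email"].str.rsplit("@", n=1).str[-1]
result = df[~email_domain.isin(domains["domain"])]

print(result)
```

Because isin compares the extracted domain exactly, an address at a longer domain that happens to end in a listed one is not dropped.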

answered Apr 24, 2017 at 19:22

MaxU - stand with Ukraine


plasmon360 · Accepted Answer · 2017-04-24 19:28:54Z

0

You can use apply to split each email on '@' and pass the domain part to isin:

print(test_df[~test_df['email'].apply(lambda x: x.split('@')[1]).isin(to_drop)])

results in

            created_at            email
1  2017-01-22 01:19:00   boy@mail_b.com
2  2017-01-22 01:19:00  girl@mail_c.com

answered Apr 24, 2017 at 19:28

plasmon360


aquil.abdullah · Accepted Answer · 2017-04-24 19:47:11Z

0

Yet another answer. This is a one-liner:

exclude = ['mail_a.com','mail_d.com','mail_e.com']
df[df.apply(lambda x: all([x['email'].rfind(ex) < 0 for ex in exclude]), axis=1)]
# OUTPUT
# Out[50]:
#              created_at            email  id
# 1   2017-01-22 01:19:00   boy@mail_b.com   2
# 2   2017-01-22 01:19:00  girl@mail_c.com   3

This relies on str.rfind returning -1 when the substring isn't found.
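
One thing to keep in mind is that rfind searches for the substring anywhere in the address, so a listed domain also matches when it sits inside a longer one. A small sketch of that behaviour follows; the address kid@notmail_a.com is made up for illustration and does not come from the question.

```python
import pandas as pd

exclude = ["mail_a.com", "mail_d.com", "mail_e.com"]

# rfind gives the index of the last occurrence, or -1 when the substring is absent.
print("son@mail_a.com".rfind("mail_a.com"))  # 4
print("boy@mail_b.com".rfind("mail_a.com"))  # -1

# Hypothetical sample: the third address only contains a listed domain as a substring.
df = pd.DataFrame({"email": ["son@mail_a.com", "boy@mail_b.com", "kid@notmail_a.com"]})

# Same test as the one-liner, applied to the email column instead of whole rows.
mask = df["email"].apply(lambda email: all(email.rfind(ex) < 0 for ex in exclude))
kept = df[mask]

print(kept)
```

Here only boy@mail_b.com survives, because notmail_a.com contains mail_a.com as a substring. If that is not what you want, comparing the extracted domain exactly is the safer choice.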

answered Apr 24, 2017 at 19:47

aquil.abdullah
