How to drop rows by condition on string value in pandas dataframe?

Question

Consider a pandas DataFrame like this:

>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df

Giving:

                   url
0      http://url1.com
1  http://www.url1.com
2  http://www.url2.com
3  http://www.url3.com
4  http://www.url1.com

I want to remove all rows containing url1.com or url2.com, to obtain a DataFrame like:

                   url
0  http://www.url3.com

I tried this:

domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))

But this gives me no result.

Any idea how to solve the above problem?

Edit: Solution

import pandas as pd
import tldextract

df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)

BENY · Accepted Answer · 2020-05-29 15:44:34Z

2

If we are checking the domain, we should look for an exact match on the domain rather than use a substring test, since a subdomain may contain the same keyword as the domain.

import tldextract

s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
s
Out[594]: 
0     True
1     True
2     True
3    False
4     True
Name: url, dtype: bool

df=df[~s]
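To illustrate why matching on the host is safer than a substring test, here is a minimal sketch that uses only the standard library (urllib.parse) instead of tldextract. The helper name is_blocked, the BLOCKED tuple, and the last two sample URLs are assumptions made for illustration; they are not part of the original post.

```python
from urllib.parse import urlsplit

# Hypothetical block list of domains (illustrative, not from the original post).
BLOCKED = ("url1.com", "url2.com")

def is_blocked(url, blocked=BLOCKED):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)

urls = [
    "http://url1.com",
    "http://www.url1.com",
    "http://www.url3.com",
    "http://url1.com.example.org",        # blocked name appears only as subdomain labels
    "http://www.url3.com/?ref=url2.com",  # blocked name appears only in the query string
]

# Plain substring test: also flags the last two URLs.
substring_hits = [any(d in u for d in BLOCKED) for u in urls]

# Host-based test: flags only real url1.com / url2.com hosts.
host_hits = [is_blocked(u) for u in urls]

print(substring_hits)
print(host_hits)
```

With a DataFrame, the same helper could be applied as df[~df['url'].map(is_blocked)]. Unlike tldextract, this sketch does not consult the public suffix list, so it compares against full domain names such as 'url1.com' rather than the bare label 'url1'.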

answered May 29, 2020 at 15:44


Shubham Sharma · Accepted Answer · 2020-05-29 15:46:51Z

1

Use Series.str.contains to create a boolean mask m, then filter the dataframe df with the inverted mask:

m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)

Result:

                   url
0  http://www.url3.com
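One caveat with this approach: str.contains treats the pattern as a regular expression by default, so the '.' in 'url1.com' matches any character. A minimal sketch of escaping each domain with re.escape before joining follows; the fourth URL is a hypothetical row added only to show the difference.

```python
import re
import pandas as pd

df = pd.DataFrame({"url": [
    "http://url1.com",
    "http://www.url2.com",
    "http://www.url3.com",
    "http://www.url1xcom.net",  # hypothetical: matches 'url1.com' only if '.' is a wildcard
]})
domainToCheck = ("url1.com", "url2.com")

# Unescaped pattern: '.' matches any character, so 'url1xcom' is flagged too.
unescaped = df["url"].str.contains("|".join(domainToCheck))

# Escaped pattern: '.' matches only a literal dot.
pattern = "|".join(re.escape(d) for d in domainToCheck)
escaped = df["url"].str.contains(pattern)

# Keep the rows that do NOT match and renumber the index.
kept = df[~escaped].reset_index(drop=True)
print(kept)
```

This is still a substring match, so it flags any URL containing the escaped text anywhere, including the path or query string.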

edited May 29, 2020 at 15:46

answered May 29, 2020 at 15:40


2 Comments

Quang Hoang Over a year ago

I believe url1 and url2 are just dummies. Your pattern will be hard to swallow for OP's actual data.

Shubham Sharma Over a year ago

@QuangHoang Corrected.

Ch3steR · Accepted Answer · 2020-05-29 16:20:21Z

1

You can use pd.Series.str.contains here.

df[~df.url.str.contains('|'.join(domainToCheck))]

                   url
3  http://www.url3.com

If you want to reset the index, use this:

df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)

                   url
0  http://www.url3.com

edited May 29, 2020 at 16:20

answered May 29, 2020 at 15:38


2 Comments

Quang Hoang Over a year ago

I think '|'.join(domainToCheck) is safer.

Ch3steR Over a year ago

@QuangHoang Yes, agreed. Changed the answer. Thank you.
