1

Consider a pandas DataFrame like:

>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df

Giving:

                   url
0      http://url1.com
1  http://www.url1.com
2  http://www.url2.com
3  http://www.url3.com
4  http://www.url1.com

I want to remove all rows whose URL contains url1.com or url2.com, to obtain a DataFrame like:

                   url
0  http://www.url3.com

I tried this:

domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))

But this gives me no result.

Any idea how to solve the above problem?

Edit: Solution

import pandas as pd
import tldextract

df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)

3 Answers

2

When checking domains, we should match the domain exactly rather than test whether the string contains it, since a subdomain may contain the same keyword as the domain.

import tldextract

s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
Out[594]: 
0     True
1     True
2     True
3    False
4     True
Name: url, dtype: bool

df=df[~s]
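
The over-matching that exact matching avoids can be illustrated without tldextract. The following is a minimal sketch, not the method used above: it swaps in the standard library's urllib.parse to compare host names, and the helper host_matches and the third URL are hypothetical, added only for illustration.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical data: the last URL only mentions url1.com inside another host name.
df = pd.DataFrame({'url': ['http://www.url1.com',
                           'http://www.url3.com',
                           'http://url1.com.example.org']})
domainToCheck = ('url1.com', 'url2.com')


def host_matches(url, domains):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in domains)


# Substring test: also flags the unrelated example.org host.
substring_mask = df['url'].apply(lambda u: any(d in u for d in domainToCheck))
# Host-based test: flags only url1.com / url2.com and their subdomains.
host_mask = df['url'].apply(lambda u: host_matches(u, domainToCheck))

kept_by_substring = df[~substring_mask]['url'].tolist()
kept_by_host = df[~host_mask]['url'].tolist()
```

Here the substring test drops http://url1.com.example.org even though its domain is example.org, while the host-based test keeps it.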

1

Use Series.str.contains to create a boolean mask m, then filter the DataFrame df with that mask:

m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)

Result:

                   url
0  http://www.url3.com
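
One caveat: Series.str.contains treats its pattern as a regular expression by default, so the dots in 'url1.com' match any character. A minimal sketch of a stricter variant, assuming the same df and domainToCheck as in the question, escapes each domain with re.escape before joining (the names pattern and result are illustrative):

```python
import re

import pandas as pd

df = pd.DataFrame({'url': ['http://url1.com', 'http://www.url1.com',
                           'http://www.url2.com', 'http://www.url3.com',
                           'http://www.url1.com']})
domainToCheck = ('url1.com', 'url2.com')

# Escape regex metacharacters so '.' only matches a literal dot.
pattern = '|'.join(re.escape(d) for d in domainToCheck)

m = df['url'].str.contains(pattern)
result = df[~m].reset_index(drop=True)
```

This is still a substring match, so it remains looser than comparing extracted domains.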

2 Comments

I believe url1 and url2 are just dummies. Your pattern will be hard to apply to the OP's actual data.
@QuangHoang Corrected.
1

You can use pd.Series.str.contains here.

df[~df.url.str.contains('|'.join(domainToCheck))]

                   url
3  http://www.url3.com

If you want to reset the index, use this:

df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)

                   url
0  http://www.url3.com
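
The same negate-and-index pattern also works with the boolean Series that the question's apply call already produces; that expression only builds the mask and never filters df. A minimal sketch, assuming the df and domainToCheck from the question (the names mask and filtered are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'url': ['http://url1.com', 'http://www.url1.com',
                           'http://www.url2.com', 'http://www.url3.com',
                           'http://www.url1.com']})
domainToCheck = ('url1.com', 'url2.com')

# Same expression as in the question: True where the URL mentions a listed domain.
mask = df['url'].apply(lambda x: any(domain in x for domain in domainToCheck))

# Negate the mask to keep the other rows, then renumber the index from 0.
filtered = df[~mask].reset_index(drop=True)
```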

2 Comments

I think '|'.join(domainToCheck) is safer.
@QuangHoang Yes, agreed. Changed the answer. Thank you.
