
I've seen discussions on partial string match merges here, here, and elsewhere, but nothing on how to do it on the table containing the superstring instead of the substring.

Given sample data like

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index', 
                            'https://bar.net/directory', 
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})
print(df1)

                         uri
0   http://www.foo.com/index
1  https://bar.net/directory
2        www.baz.gov/aboutus
print(df2)

          fqdn
0  www.foo.com
1  www.qux.mil

my end goal is to remove the rows from df1 that contain any substring in df2. In the actual data, df1 has several million rows and df2 has a few hundred, and any given row in df1 will have at most one substring in df2.

Given the sample data, I hope to end up with a dataframe like

                         uri
0  https://bar.net/directory
1        www.baz.gov/aboutus

and by my logic, the intermediate step is to generate

                         uri           fqdn
0   http://www.foo.com/index    www.foo.com
1  https://bar.net/directory            NaN
2        www.baz.gov/aboutus            NaN

but I can't figure out how to check all of df2's values inside of df1.apply().
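For reference, the per-row check I was attempting can be written as a plain nested loop inside apply (a sketch of the naive approach, which is O(rows × patterns) and too slow for millions of rows, but it shows the intent):

```python
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# For each uri, return the first fqdn it contains, or None if there is none
def find_fqdn(uri):
    for fqdn in df2['fqdn']:
        if fqdn in uri:
            return fqdn
    return None

df1['fqdn'] = df1['uri'].apply(find_fqdn)
```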


Edit:


While both answers below work, I got the fastest results in my specific scenario by compiling a regex object and using extract:

import re

# one alternation pattern covering every fqdn, wrapped in a capture group for extract
fqdn_list = re.compile(f"({'|'.join(df2.fqdn)})")

df1['fqdn'] = df1.uri.str.extract(fqdn_list)
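From there, the rows to drop are exactly the ones where extract found a match, so keeping only the NaN rows yields the end goal (a sketch; I use expand=False here so extract returns a Series):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# Compile the alternation once; extract needs a capturing group
fqdn_re = re.compile(f"({'|'.join(df2.fqdn)})")
df1['fqdn'] = df1.uri.str.extract(fqdn_re, expand=False)

# Keep only the rows with no fqdn match, then drop the helper column
result = df1[df1['fqdn'].isna()].drop(columns='fqdn')
```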

2 Answers


Is this what you need? str.findall

df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
Out[192]: 
0    www.foo.com
1            NaN
2            NaN
Name: uri, dtype: object
#df1['fqdn']=df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
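Building on that, one way to reach the asker's end goal (my sketch, not part of the original answer) is to assign the matches and keep the rows with no hit:

```python
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# findall returns a list of matches per row; .str[0] takes the first (NaN if empty)
df1['fqdn'] = df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]

# Drop the rows that matched any fqdn
result = df1[df1['fqdn'].isna()].drop(columns='fqdn')
```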

1 Comment

+1 This works perfectly on the toy example, but I don't know if it scales to the size of my set or if I need something with short-circuiting. I'm getting ahead of myself. I'll accept once I can test in the morning.

Concatenate the strings from df2 with the pipe (|) as separator, which is the or operator in regex. This way we can check whether df1 contains any of these strings with str.contains, then drop those rows using ~, the not operator:

m = ~df1['uri'].str.contains('|'.join(df2['fqdn']))
df1[m]

Output

                         uri
1  https://bar.net/directory
2        www.baz.gov/aboutus
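One caveat worth noting (my addition, not from the answer): the dots in the fqdn strings are regex metacharacters, so `www.foo.com` would also match `wwwXfoo.com`. To be strict about literal matching, you can escape each entry before joining:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# Escape regex metacharacters (e.g. '.') so each fqdn matches literally
pattern = '|'.join(re.escape(f) for f in df2['fqdn'])
m = ~df1['uri'].str.contains(pattern)
result = df1[m]
```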

5 Comments

While this accomplishes what I need without the middle step when tested on the toy set, when applied to even a subset of the real data it never finishes, whereas WeNYoBen's approach runs almost instantly. Is that weird?
Yes, it was weird, and it was because of a typo.
Upon further testing using samples of up to 1M strings, your solution is approximately 20% faster.
That's nice to hear. Glad I could help. @Andy
I managed to get a significant speed boost (~25%) by compiling a regex object, check the bottom of my question if you're interested.
