
I've seen discussions on partial string match merges here, here, and elsewhere, but nothing on how to do it on the table containing the superstring instead of the substring.

Given sample data like

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index', 
                            'https://bar.net/directory', 
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})
print(df1)

                         uri
0   http://www.foo.com/index
1  https://bar.net/directory
2        www.baz.gov/aboutus
print(df2)

          fqdn
0  www.foo.com
1  www.qux.mil

my end goal is to remove the rows from df1 that contain any substring in df2. In the actual data, df1 has several million rows and df2 has a few hundred, and any given row in df1 will have at most one substring in df2.

Given the sample data, I hope to end up with a dataframe like

                         uri
0  https://bar.net/directory
1        www.baz.gov/aboutus

and by my logic, the intermediate step is to generate

                         uri           fqdn
0   http://www.foo.com/index    www.foo.com
1  https://bar.net/directory            NaN
2        www.baz.gov/aboutus            NaN

but I can't figure out how to check all of df2's values inside of df1.apply().
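For reference, the per-row check I was attempting can be written as a plain nested loop inside apply (a sketch of the naive approach, which is O(rows × patterns) and too slow for millions of rows, but it shows the intent):

```python
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# For each uri, return the first fqdn it contains, or None if there is none
def find_fqdn(uri):
    for fqdn in df2['fqdn']:
        if fqdn in uri:
            return fqdn
    return None

df1['fqdn'] = df1['uri'].apply(find_fqdn)
```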


Edit:


While both answers below work, I got the fastest results in my specific scenario by compiling a regex object and using extract:

import re

# one alternation pattern covering every fqdn, wrapped in a capture group for extract
fqdn_list = re.compile(f"({'|'.join(df2.fqdn)})")

df1['fqdn'] = df1.uri.str.extract(fqdn_list)
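From there, the rows to drop are exactly the ones where extract found a match, so keeping only the NaN rows yields the end goal (a sketch; I use expand=False here so extract returns a Series):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# Compile the alternation once; extract needs a capturing group
fqdn_re = re.compile(f"({'|'.join(df2.fqdn)})")
df1['fqdn'] = df1.uri.str.extract(fqdn_re, expand=False)

# Keep only the rows with no fqdn match, then drop the helper column
result = df1[df1['fqdn'].isna()].drop(columns='fqdn')
```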

2 Answers


Is this what you need? str.findall

df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
Out[192]: 
0    www.foo.com
1            NaN
2            NaN
Name: uri, dtype: object
#df1['fqdn']=df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
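Building on that, one way to reach the asker's end goal (my sketch, not part of the original answer) is to assign the matches and keep the rows with no hit:

```python
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# findall returns a list of matches per row; .str[0] takes the first (NaN if empty)
df1['fqdn'] = df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]

# Drop the rows that matched any fqdn
result = df1[df1['fqdn'].isna()].drop(columns='fqdn')
```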

1 Comment

+1 This works perfectly on the toy example, but I don't know if it scales to the size of my set or if I need something with short-circuiting. I'm getting ahead of myself. I'll accept once I can test in the morning.

Concatenate the strings from df2 with the pipe (|) as separator, which is the or operator in regex. This way we can check whether df1 contains any of these strings with str.contains, then drop those rows using ~, the not operator:

m = ~df1['uri'].str.contains('|'.join(df2['fqdn']))
df1[m]

Output

                         uri
1  https://bar.net/directory
2        www.baz.gov/aboutus
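One caveat worth noting (my addition, not from the answer): the dots in the fqdn strings are regex metacharacters, so `www.foo.com` would also match `wwwXfoo.com`. To be strict about literal matching, you can escape each entry before joining:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})

# Escape regex metacharacters (e.g. '.') so each fqdn matches literally
pattern = '|'.join(re.escape(f) for f in df2['fqdn'])
m = ~df1['uri'].str.contains(pattern)
result = df1[m]
```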

5 Comments

While this accomplishes what I need without the middle step when tested on the toy set, when applied to even a subset of the real data it never finishes, whereas WeNYoBen's approach runs almost instantly. Is that weird?
Yes, it was weird, and it was because of a typo.
Upon further testing using samples of up to 1M strings, your solution is approximately 20% faster.
That's nice to hear. Glad I could help. @Andy
I managed to get a significant speed boost (~25%) by compiling a regex object, check the bottom of my question if you're interested.
