Join Pandas DataFrames matching by string and substring

Question

i want to merge two dataframes by partial string match. I have two data frames to combine. First df1 consists of 130.000 rows like this:

id    text                        xc1       xc2
1     adidas men shoes            52465     220
2     vakko men suits             49220     224
3     burberry men shirt          78248     289
4     prada women shoes           45780     789
5     lcwaikiki men sunglasses    34788     745

and second df2 consists of 8000 rows like this:

id    keyword               abc1     abc2
1     men shoes             1000     11
2     men suits             2000     12
3     men shirt             3000     13
4     women socks           4000     14
5     men sunglasses        5000     15

After matching between keyword and text, outputshould look like this:

id    text                        xc1       xc2   keyword         abc1  abc2
1     adidas men shoes            52465     220   men shoes       1000  11
2     vakko men suits             49220     224   men suits       2000  12
3     burberry men shirt          78248     289   men shirt       3000  13
4     lcwaikiki men sunglasses    34788     745   men sunglasses  5000  15

Is it possible the partial string match is at the middle of text in df1 ? e.g. vakko men suits blue ? — SeaBean
– SeaBean, Commented May 23, 2021 at 8:02
@SeaBean Yes. There are such examples when I examine the dataset. — Bushmaster
– Bushmaster, Commented May 23, 2021 at 8:12
@AnuragDabas unfortunately no. the row numbers of both data frames are different — Bushmaster
– Bushmaster, Commented May 23, 2021 at 8:18

SeaBean · Accepted Answer · 2021-05-24 10:09:54Z

3

Let's approach by cross join the 2 dataframes and then filter by matching string with substring, as follows:

df3 = df1.merge(df2, how='cross')    # for Pandas version >= 1.2.0 (released in Dec 2020)

import re
mask = df3.apply(lambda x: (re.search(rf"\b{x['keyword']}\b", str(x['text']))) != None, axis=1)
df_out = df3.loc[mask]

If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:

# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)

After the cross join, we created a boolean mask to filter for the cases that keyword is found within text by using re.search within .apply().

We have to use re.search instead of simple Python substring test like stringA in stringB found in most of the similar answers in StackOverflow. Such kind of test will fail with false match of 'men suits' in keyword with 'women suits' in text since it returns True for test of 'men suits' in 'women suits'.

We use regex with a pair of word boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure matching only for whole word match for text in df1, i.e. men suits in df2 would not match with women suits in df1 since the word women does not have a word boundary between the letters wo and men.

Result:

print(df_out)


    id_x                      text    xc1  xc2  id_y         keyword  abc1  abc2
0      1          adidas men shoes  52465  220     1       men shoes  1000    11
6      2           vakko men suits  49220  224     2       men suits  2000    12
12     3        burberry men shirt  78248  289     3       men shirt  3000    13
24     5  lcwaikiki men sunglasses  34788  745     5  men sunglasses  5000    15

Here, columns id_x and id_y are the original id column in df1 and df2 respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset index to clean up the layout:

df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)

Final outcome

print(df_out)


                       text    xc1  xc2         keyword  abc1  abc2
0          adidas men shoes  52465  220       men shoes  1000    11
1           vakko men suits  49220  224       men suits  2000    12
2        burberry men shirt  78248  289       men shirt  3000    13
3  lcwaikiki men sunglasses  34788  745  men sunglasses  5000    15

edited May 24, 2021 at 10:09

answered May 23, 2021 at 11:33

SeaBean

23.4k3 gold badges16 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Bushmaster Over a year ago

thank you for the answer. I get an error like this: TypeError: expected string or bytes-like object could some of the values be because it is integer?

SeaBean Over a year ago

@muratmert41 In such case, cast x['text'] to string first, using str(x['text']). You can refer to my edited codes above.

SeaBean Over a year ago

@muratmert41 Feel free to let me know if any further thing I can help! :-)

Bushmaster Over a year ago

your answers really help me a lot. I'm not getting an error anymore. but there are some minor problems that prevent me from getting the right result. I'm working on them.

Bushmaster Over a year ago

I'm sorry I forgot that. thank you again.

|

joao · Accepted Answer · 2021-05-23 09:34:58Z

2

Let's start by ordering the keywords longest-first, so that "women suits" matches "before "men suits"

lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)

Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:

def is_match(arr, s):
    for a in arr:
        if a in s:
            return a
    return None

Now we can extract the keyword from each text in df1, and add it to a new column:

df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))

We now have everything we need for a standard merge:

pd.merge(df1, df2, on='keyword')

answered May 23, 2021 at 9:34

joao

2,3032 gold badges13 silver badges16 bronze badges

1 Comment

Bushmaster Over a year ago

thank you for the answer. I guess some values contain int values. for example for example: "16 inch Bicycle Rim". therefore I get an error like this: TypeError: argument of type 'int' is not iterable

Collectives™ on Stack Overflow

Join Pandas DataFrames matching by string and substring

2 Answers 2

6 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related