Python - Search List of Strings for list of Substrings, Return Largest Value From Another Col

Question

df1 contains the larger main strings I want to search. df2 contains a list of substrings, each with an associated value.

import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the frames directly.
df1 = pd.DataFrame({'MainString': ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']})

df2 = pd.DataFrame({
    'Substring': ['bcde', 'bcd', 'mno', 'stuv', 'uvwx', 'stu'],
    'Value': [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})

print(df1)
  MainString
0     abcdef
1     ghijkl
2     mnopqr
3     stuvwx

print(df2)
  Substring  Value
0      bcde    0.5
1       bcd    0.6
2       mno    0.4
3      stuv    0.7
4      uvwx    0.7
5       stu    0.4

I want to search df1['MainString'] for the values in df2['Substring'] and return only the match with the largest Value. If there's a tie (e.g. stuv and uvwx), return the first. The final result would look something like:

  MainString Substring Value
0     abcdef       bcd   0.6
1     ghijkl       NaN   NaN
2     mnopqr       mno   0.4
3     stuvwx      stuv   0.7

I'm not sure whether I need to loop through and evaluate each MainString against each Substring. I've tried adapting this solution, but it returns just the first matched string, not the substring with the highest value:

s_list = list(df2['Substring'])
s_list = '(' + '|'.join(s_list) + ')'
df1['test'] = df1['MainString'].str.extract(s_list, expand=False)

print(df1)
  MainString  test
0     abcdef  bcde
1     ghijkl   NaN
2     mnopqr   mno
3     stuvwx  stuv
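
The result above follows from how regex alternation works: at a given position the alternatives are tried in the order they are written, so bcde is returned because it is listed before bcd. A minimal sketch with Python's re module (the variable names are only illustrative):

```python
import re

# At a given start position, alternatives are tried left to right,
# and the first one that matches is returned, even if a later one would also match.
first_listed_wins = re.search("(bcde|bcd)", "abcdef").group(1)
reordered = re.search("(bcd|bcde)", "abcdef").group(1)

print(first_listed_wins)  # bcde
print(reordered)          # bcd
```

So the order of the substrings in the pattern decides which one is reported when several match at the same position.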

How long is your df2? A loop may help if it's not too crazy long.
– Quang Hoang, Commented Dec 11, 2019 at 16:18
It's not too long, and I could make a loop for this. But I might apply this to another future process, and that table would be much longer (df1 maybe a couple million rows, df2 a few hundred).
– maxutil, Commented Dec 11, 2019 at 16:23

Swier · Accepted Answer · 2019-12-11 16:31:59Z

The code from this answer lets you join two dataframes on a substring match. The regex alternation picks the first alternative that matches, so sort the dataframe of substrings by Value in descending order first; the highest-valued substring is then tried first.

The following code implements this for your example:

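import re

# Sort by Value, highest first, so higher-valued substrings are tried first.
# kind="stable" keeps df2's original order among ties (stuv before uvwx).
ordered = df2.sort_values("Value", ascending=False, kind="stable")
# Escape each substring so regex metacharacters are matched literally.
pattern = "|".join(map(re.escape, ordered["Substring"]))
result = df1.copy()
result.insert(
    0, "Substring", df1["MainString"].str.extract("(" + pattern + ")", expand=False)
)

# Look up the Value that belongs to each matched substring.
result = result.join(df2.set_index("Substring"), on="Substring")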
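
One caveat with the regex approach: a regex search returns the leftmost match in the main string, so a lower-valued substring that starts earlier can still win over a higher-valued one that starts later. For example, with only stu (0.4) and uvwx (0.7) as substrings, stuvwx would match stu. The sketch below is an alternative to the answer's method, not part of it: it loops over the substrings from highest to lowest Value and uses a literal containment test. The helper name best_substring and the use of NumPy arrays are assumptions made for illustration, and it has not been timed on the table sizes mentioned in the comments.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"MainString": ["abcdef", "ghijkl", "mnopqr", "stuvwx"]})
df2 = pd.DataFrame({
    "Substring": ["bcde", "bcd", "mno", "stuv", "uvwx", "stu"],
    "Value": [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})


def best_substring(mains, subs):
    """Hypothetical helper: highest-valued substring contained in each main string."""
    # Stable sort keeps the original order of subs among equal values (ties).
    ordered = subs.sort_values("Value", ascending=False, kind="stable")
    best_sub = np.full(len(mains), None, dtype=object)
    best_val = np.full(len(mains), np.nan)
    unresolved = np.ones(len(mains), dtype=bool)
    for sub, val in zip(ordered["Substring"], ordered["Value"]):
        # Literal (non-regex) containment test, vectorized over all main strings.
        contains = mains.str.contains(sub, regex=False, na=False).to_numpy(dtype=bool)
        hit = unresolved & contains
        best_sub[hit] = sub
        best_val[hit] = val
        unresolved &= ~hit
        if not unresolved.any():
            break  # every main string already has its best match
    return pd.DataFrame(
        {"MainString": mains.to_numpy(), "Substring": best_sub, "Value": best_val},
        index=mains.index,
    )


result = best_substring(df1["MainString"], df2)
print(result)
```

For the example data this gives the same matches as the table in the question. Because each row is resolved by the first (highest-valued) substring it contains, the position of the substring inside the main string no longer matters.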

Nice, this works and it didn't occur to me to sort.
– maxutil, Commented over a year ago
