df1 contains the larger main strings I want to search. df2 contains a list of substrings, each with an associated value.

import pandas as pd

# DataFrame.append was removed in pandas 2.0, so build the frames directly.
df1 = pd.DataFrame({'MainString': ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']})

df2 = pd.DataFrame({
    'Substring': ['bcde', 'bcd', 'mno', 'stuv', 'uvwx', 'stu'],
    'Value': [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})

print(df1)
  MainString
0     abcdef
1     ghijkl
2     mnopqr
3     stuvwx

print(df2)
  Substring  Value
0      bcde    0.5
1       bcd    0.6
2       mno    0.4
3      stuv    0.7
4      uvwx    0.7
5       stu    0.4

I want to search df1['MainString'] for the substrings in df2['Substring'] and, for each main string, return only the matching substring with the largest Value. If there's a tie (e.g. stuv and uvwx), return the first. So the final result would look something like:

  MainString Substring Value
0     abcdef       bcd   0.6
1     ghijkl       NaN   NaN
2     mnopqr       mno   0.4
3     stuvwx      stuv   0.7

I'm not sure whether I need to loop through and evaluate each MainString against each Substring. I've tried adapting this solution, but it returns only the first matched string, not the substring with the highest value:

s_list = list(df2['Substring'])
s_list = '(' + '|'.join(s_list) + ')'
df1['test'] = df1['MainString'].str.extract(s_list, expand=False)

print(df1)
  MainString  test
0     abcdef  bcde
1     ghijkl   NaN
2     mnopqr   mno
3     stuvwx  stuv
  • How long is your df2? A loop may help if it's not too crazy long. Commented Dec 11, 2019 at 16:18
  • It's not too long, and I could make a loop for this. But I might apply this to another future process, and that table would be much longer (df1 maybe a couple million rows, df2 a few hundred). Commented Dec 11, 2019 at 16:23

1 Answer

The code from this answer lets you join two dataframes on a substring match. The regex alternation returns the first alternative that matches, so sort the dataframe of substrings by Value in descending order first; that way the highest-valued substring is tried first.

The following code implements this for your example:

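import re

# Highest Value first; mergesort is stable, so ties keep their df2 order.
ranked = df2.sort_values("Value", ascending=False, kind="mergesort")
# Escape each substring so regex metacharacters are matched literally.
pattern = "|".join(map(re.escape, ranked["Substring"]))
result = df1.copy()
result.insert(
    0, "Substring", df1["MainString"].str.extract("(" + pattern + ")", expand=False)
)

# Look up the Value that belongs to each matched substring.
result = result.join(df2.set_index("Substring"), on="Substring")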
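One caveat with the regex approach above: a regex search returns the leftmost match in the string, and the order of the alternatives only decides between candidates that start at the same position. If a lower-valued substring starts earlier in the main string than a higher-valued one (say 'stu' at 0.4 versus 'uvwx' at 0.7 in 'stuvwx'), the lower-valued one is returned. Here is a minimal sketch of an alternative that avoids this, assuming df2 is small enough to loop over (a few hundred rows, as mentioned in the comments). The helper name best_substring_match is made up for illustration and is not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"MainString": ["abcdef", "ghijkl", "mnopqr", "stuvwx"]})
df2 = pd.DataFrame({
    "Substring": ["bcde", "bcd", "mno", "stuv", "uvwx", "stu"],
    "Value": [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})


def best_substring_match(mains, subs):
    """Hypothetical helper: attach the highest-valued matching substring to each row."""
    # Stable sort so that ties keep their original order in subs (first one wins).
    ranked = subs.sort_values("Value", ascending=False, kind="mergesort")
    result = mains.copy()
    result["Substring"] = None
    result["Value"] = float("nan")
    # Tracks which rows have not been assigned a substring yet.
    unmatched = pd.Series(True, index=result.index)
    for sub, val in zip(ranked["Substring"], ranked["Value"]):
        # Literal (non-regex) containment test, restricted to rows still unmatched.
        hit = unmatched & result["MainString"].str.contains(sub, regex=False, na=False)
        result.loc[hit, "Substring"] = sub
        result.loc[hit, "Value"] = val
        unmatched &= ~hit
    return result


out = best_substring_match(df1, df2)
print(out)
```

This loops once per row of df2, and each pass is a vectorized str.contains over df1, so the cost grows with len(df2) rather than with the number of main strings handled in Python. Because the containment test is literal, no regex escaping is needed.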

1 Comment

Nice, this works and it didn't occur to me to sort.
