df1 contains the main strings I want to search. df2 contains a list of substrings, each with an associated value.
import pandas as pd
# DataFrame.append was removed in pandas 2.0, so build each frame in one call.
df1 = pd.DataFrame({'MainString': ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']})
df2 = pd.DataFrame({
    'Substring': ['bcde', 'bcd', 'mno', 'stuv', 'uvwx', 'stu'],
    'Value': [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})
print(df1)
  MainString
0     abcdef
1     ghijkl
2     mnopqr
3     stuvwx
print(df2)
  Substring  Value
0      bcde    0.5
1       bcd    0.6
2       mno    0.4
3      stuv    0.7
4      uvwx    0.7
5       stu    0.4
I want to search each df1['MainString'] for the substrings in df2['Substring'] and return only the matching substring with the largest Value. If there's a tie (e.g. stuv and uvwx), return the first. The final result would look something like:
  MainString Substring  Value
0     abcdef       bcd    0.6
1     ghijkl       NaN    NaN
2     mnopqr       mno    0.4
3     stuvwx      stuv    0.7
I'm not sure whether I need to loop through and test each MainString against each Substring. I've tried adapting this solution, but it returns just the first matched string, not the substring with the highest value:
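# Build one alternation pattern: (bcde|bcd|mno|stuv|uvwx|stu)
pattern = '(' + '|'.join(df2['Substring']) + ')'
# str.extract returns only the first match found in each string
df1['test'] = df1['MainString'].str.extract(pattern, expand=False)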
print(df1)
  MainString  test
0     abcdef  bcde
1     ghijkl   NaN
2     mnopqr   mno
3     stuvwx  stuv
How large is df2? A loop may help if it isn't too long.
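A loop over both columns could look something like the sketch below. It is a minimal sketch, not a definitive answer. It assumes plain (non-regex) substring matching with Python's `in` operator, and the helper name `best_match` and the `candidates` list are made up for illustration. Every substring is tested against every main string, so the work grows with len(df1) * len(df2).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'MainString': ['abcdef', 'ghijkl', 'mnopqr', 'stuvwx']})
df2 = pd.DataFrame({
    'Substring': ['bcde', 'bcd', 'mno', 'stuv', 'uvwx', 'stu'],
    'Value': [0.5, 0.6, 0.4, 0.7, 0.7, 0.4],
})

# (substring, value) pairs in df2's original row order
candidates = list(zip(df2['Substring'], df2['Value']))


def best_match(main, candidates):
    """Hypothetical helper: return the (substring, value) pair with the
    highest value among the substrings contained in `main`, or
    (NaN, NaN) when nothing matches."""
    best_sub, best_val = np.nan, np.nan
    for sub, val in candidates:
        # Strict ">" keeps the earlier candidate when values tie.
        if sub in main and (pd.isna(best_val) or val > best_val):
            best_sub, best_val = sub, val
    return best_sub, best_val


matches = [best_match(main, candidates) for main in df1['MainString']]
result = df1.assign(
    Substring=[sub for sub, _ in matches],
    Value=[val for _, val in matches],
)
print(result)
```

Sorting df2 by Value before building the regex alternation would not be a reliable shortcut, because the regex engine picks the leftmost match position in the string first and only then the order of alternatives, so a low-valued substring that occurs earlier in the string can still win.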