Find multiple strings in a given column

Question

I'm not sure whether it is possible to do easily.

I have 2 dataframes. In the first one (df1) there is a column with texts ('Texts') and in the second one there are 2 columns, one with some sort texts ('subString') and the second with a score ('Score').

What I want is to sum up all the scores associated to the subString field in the second dataframe when these subString are a substring of the text column in the first dataframe.

For example, if I have a dataframe like this:

df1 = pd.DataFrame({
        'ID':[1,2,3,4,5,6],
        'Texts':['this is a string',
                 'here we have another string',
                 'this one is completly different',
                 'one more',
                 'this is one more',
                 'and the last one'],
        'c':['C','C','C','C','C','C'],
        'd':['D','D','D','D','NaN','NaN']
    }, columns = ['ID','Texts','c','d'])
df1

Out[2]: 
   ID                            Texts  c    d
0   1                 this is a string  C    D
1   2      here we have another string  C    D
2   3  this one is completly different  C    D
3   4                         one more  C    D
4   5                 this is one more  C  NaN
5   6                 and the last one  C  NaN

And another dataframe like this:

df2 = pd.DataFrame({
        'SubString':['This', 'one', 'this is', 'is one'],
        'Score':[0.5, 0.2, 0.75, -0.5]
    }, columns = ['SubString','Score'])
df2

Out[3]: 
  SubString  Score
0      This   0.50
1       one   0.20
2   this is   0.75
3    is one  -0.50

I want to get something like this:

df1['Score'] = 0.0
for index1, row1 in df1.iterrows():
    score = 0
    for index2, row2 in df2.iterrows():
        if row2['SubString'] in row1['Texts']:
            score += row2['Score']
    df1.set_value(index1, 'Score', score)
df1

Out[4]: 
   ID                            Texts  c    d  Score
0   1                 this is a string  C    D   0.75
1   2      here we have another string  C    D   0.00
2   3  this one is completly different  C    D  -0.30
3   4                         one more  C    D   0.20
4   5                 this is one more  C  NaN   0.45
5   6                 and the last one  C  NaN   0.20

Is there a less garbled and faster way to do it?

Thanks!

Zero · Accepted Answer · 2018-01-23 16:36:45Z

1

Option 1

In [691]: np.array([np.where(df1.Texts.str.contains(x.SubString), x.Score, 0) 
                    for _, x in df2.iterrows()]
                   ).sum(axis=0)
Out[691]: array([ 0.75,  0.  , -0.3 ,  0.2 ,  0.45,  0.2 ])

Option 2

In [674]: df1.Texts.apply(lambda x: df2.Score[df2.SubString.apply(lambda y: y in x)].sum())
Out[674]:
0    0.75
1    0.00
2   -0.30
3    0.20
4    0.45
5    0.20
Name: Texts, dtype: float64

Note: apply doesn't get rid of loops, it just hides them.

edited Jan 23, 2018 at 16:36

answered Jan 23, 2018 at 16:26

Zero

77.4k22 gold badges153 silver badges153 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Fernando S. Peregrino Over a year ago

Thanks Zero for your quick and accurate response, it works pretty well!

Collectives™ on Stack Overflow

Find multiple strings in a given column

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related