
I have a dataframe with sentences and a dictionary of terms grouped into topics, and I want to count the number of term matches for each topic.

import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

So far I have created columns for the topics and marked each with 1 if a word is present, by iterating over the dictionary.

df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])


for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1


print(df)

and I get:

                                                 Foo  Score animals games people
0   The quick brown fox was playing basketball today      4       1     1    NaN
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN      1
2                         Bill smells like a wet dog      2     NaN   NaN    NaN
3  Steve threw the football at a deer. But the fo...      7       1     1      1
4                  Sheriff John does not like hockey      8     NaN     1      1

What is the best way to count the number of words from each topic that appear in a sentence? And is there a more efficient way of looping over the dictionary, without resorting to Cython?

  • Do you really need to count your terms for each row, or will you count them for the whole data frame? I.e., what is the end goal? – Commented Apr 4, 2016 at 11:53

2 Answers


You can use split with stack, which is about 5 times faster than the Counter solution:

df1 = (df.Foo.str.split(expand=True)
             .stack()
             .reset_index(level=1, drop=True)
             .reset_index(name='Foo'))

for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(v))

print(df1.groupby('index').sum().astype(int))
       games  animals  people
index                        
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
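For reference, the same per-topic totals can be produced in a single pass per topic with str.count and an alternation regex, with no stacking step (a sketch; like str.contains, this matches substrings, so eagles still counts toward eagle):

```python
import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

# str.count counts every non-overlapping regex match per row,
# so repeated terms (football ... football) are counted twice.
counts = pd.DataFrame({k: df.Foo.str.count('|'.join(v))
                       for k, v in terms.items()})
print(counts)
```

Column order here follows the dictionary's insertion order rather than the grouped output above.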

Timings:

In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop

In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop

Code:

def a(df):
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    return df1.groupby('index').sum().astype(int)

def b(df):
    from collections import Counter

    df1 = pd.DataFrame(terms)

    res = []
    for i,r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)

1 Comment

It just counts one of them; there might be more than one occurrence of a word in a sentence.

I would go for Counter and replace:

from collections import Counter

df1 = pd.DataFrame(terms)

res = []
for i,r in df.iterrows():
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
    res.append(pd.DataFrame(s).T)


In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
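To see what the replace chain is doing, here is the same computation for a single row (a sketch using the answer's own terms and the fourth sentence):

```python
from collections import Counter

import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}
df1 = pd.DataFrame(terms)

sentence = 'Steve threw the football at a deer. But the football missed'
c = Counter(sentence.split())  # e.g. 'football' -> 2, 'Steve' -> 1

# Terms found in the Counter are replaced by their counts; any term
# left as a string is then zeroed out by the regex replace.
s = df1.replace(c).replace(r'\w', 0, regex=True).sum()
print(s)  # the token is 'deer.', not 'deer', so animals is 0
```

This shows the exact-match behaviour noted in the comments: because split produces the token deer., the term deer never matches.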

2 Comments

Which is erroneous because the matching is exact (for example, eagle does not match eagles).
I know, I just want a rough match; I will iron out the code later with regex. Thanks.
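One way to iron that out with regex, as the comments suggest, is to wrap the alternation in \b word boundaries so only whole words match (a sketch; eagles then no longer counts toward eagle, and foxes no longer counts toward fox):

```python
import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({'Foo': [
    'The quick brown fox was playing basketball today',
    'John and Rob visited the eagles nest, the foxes ran away',
    'Bill smells like a wet dog',
    'Steve threw the football at a deer. But the football missed',
    'Sheriff John does not like hockey']})

for k, v in terms.items():
    # \b anchors the alternation to word boundaries, so punctuation
    # such as 'deer.' still matches but 'eagles' does not.
    pattern = r'\b(?:' + '|'.join(v) + r')\b'
    df[k] = df.Foo.str.count(pattern)

print(df[['animals', 'people', 'games']])
```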
