
I have a dataframe with sentences and a dictionary of terms grouped into topics, and I want to count the number of term matches for each topic.

import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

So far I have created columns for the topics and marked each with 1 if a word is present, by iterating over the dictionary.

df = pd.concat([df, pd.DataFrame(columns=list(terms.keys()))])


for k, v in terms.items():
    for val in v:
        df.loc[df.Foo.str.contains(val), k] = 1


print(df)

and I get:

                                                 Foo  Score animals games people
0   The quick brown fox was playing basketball today      4       1     1    NaN
1  John and Rob visited the eagles nest, the foxe...      6       1   NaN      1
2                         Bill smells like a wet dog      2     NaN   NaN    NaN
3  Steve threw the football at a deer. But the fo...      7       1     1      1
4                  Sheriff John does not like hockey      8     NaN     1      1

What is the best way to count the number of words from each topic that appear in a sentence? And is there a more efficient way of looping over the dictionary, without resorting to Cython?

  • Do you really need to count your terms for each row, or will you count them for the whole data frame? I.e., what is the end goal? – Commented Apr 4, 2016 at 11:53

2 Answers


You can use split with stack, which is about 5 times faster than the Counter solution:

df1 = (df.Foo.str.split(expand=True)
             .stack()
             .reset_index(level=1, drop=True)
             .reset_index(name='Foo'))

for k, v in terms.items():
    df1[k] = df1.Foo.str.contains('|'.join(v))

print(df1.groupby('index').sum().astype(int))
       games  animals  people
index                        
0          1        1       0
1          0        2       2
2          0        0       0
3          2        1       1
4          1        0       1
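For reference, the same per-topic totals can be produced in a single pass per topic with str.count and an alternation regex, with no stacking step (a sketch; like str.contains, this matches substrings, so eagles still counts toward eagle):

```python
import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': ['The quick brown fox was playing basketball today',
            'John and Rob visited the eagles nest, the foxes ran away',
            'Bill smells like a wet dog',
            'Steve threw the football at a deer. But the football missed',
            'Sheriff John does not like hockey']
})

# str.count counts every non-overlapping regex match per row,
# so repeated terms (football ... football) are counted twice.
counts = pd.DataFrame({k: df.Foo.str.count('|'.join(v))
                       for k, v in terms.items()})
print(counts)
```

Column order here follows the dictionary's insertion order rather than the grouped output above.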

Timings:

In [233]: %timeit a(df)
100 loops, best of 3: 4.9 ms per loop

In [234]: %timeit b(df)
10 loops, best of 3: 25.2 ms per loop

Code:

def a(df):
    df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
    for k, v in terms.items():
        df1[k] = df1.Foo.str.contains('|'.join(v))
    return df1.groupby('index').sum().astype(int)

def b(df):
    from collections import Counter

    df1 = pd.DataFrame(terms)

    res = []
    for i,r in df.iterrows():
        s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
        res.append(pd.DataFrame(s).T)
    return pd.concat(res)

1 Comment

It just counts one of them; there might be more than one occurrence of a word in a sentence.

I would go for Counter and replace:

from collections import Counter

df1 = pd.DataFrame(terms)

res = []
for i,r in df.iterrows():
    s = df1.replace(Counter(r['Foo'].split())).replace(r'\w', 0, regex=True).sum()
    res.append(pd.DataFrame(s).T)


In [109]: pd.concat(res)
Out[109]:
   animals  games  people
0        1      1       0
0        0      0       2
0        0      0       0
0        0      2       1
0        0      1       1
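To see what the replace chain is doing, here is the same computation for a single row (a sketch using the answer's own terms and the fourth sentence):

```python
from collections import Counter

import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}
df1 = pd.DataFrame(terms)

sentence = 'Steve threw the football at a deer. But the football missed'
c = Counter(sentence.split())  # e.g. 'football' -> 2, 'Steve' -> 1

# Terms found in the Counter are replaced by their counts; any term
# left as a string is then zeroed out by the regex replace.
s = df1.replace(c).replace(r'\w', 0, regex=True).sum()
print(s)  # the token is 'deer.', not 'deer', so animals is 0
```

This shows the exact-match behaviour noted in the comments: because split produces the token deer., the term deer never matches.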

2 Comments

Which is erroneous because the matching is exact (for example, eagle does not match eagles).
I know, I just want a rough match; I will iron out the code later with regex. Thanks.
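One way to iron that out with regex, as the comments suggest, is to wrap the alternation in \b word boundaries so only whole words match (a sketch; eagles then no longer counts toward eagle, and foxes no longer counts toward fox):

```python
import pandas as pd

terms = {'animals': ['fox', 'deer', 'eagle'],
         'people': ['John', 'Rob', 'Steve'],
         'games': ['basketball', 'football', 'hockey']}

df = pd.DataFrame({'Foo': [
    'The quick brown fox was playing basketball today',
    'John and Rob visited the eagles nest, the foxes ran away',
    'Bill smells like a wet dog',
    'Steve threw the football at a deer. But the football missed',
    'Sheriff John does not like hockey']})

for k, v in terms.items():
    # \b anchors the alternation to word boundaries, so punctuation
    # such as 'deer.' still matches but 'eagles' does not.
    pattern = r'\b(?:' + '|'.join(v) + r')\b'
    df[k] = df.Foo.str.count(pattern)

print(df[['animals', 'people', 'games']])
```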
