
I have one data frame df consisting of 2 columns (a word, and the meaning/definition of that word). I want to build a collections.Counter object for each word's definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.

The traditional approach would be to iterate over the data frame using the iterrows() method and do the computation row by row.
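For reference, a minimal sketch of that iterrows() baseline (using the column names word and meaning from the sample table below):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({
    'word': ['Array'],
    'meaning': ['collection of homogeneous datatype'],
})

# Traditional approach: iterate row by row, counting words in each definition
freqs = []
for _, row in df.iterrows():
    freqs.append(Counter(row['meaning'].split()))
df['word_freq'] = freqs

print(dict(df['word_freq'][0]))
# {'collection': 1, 'of': 1, 'homogeneous': 1, 'datatype': 1}
```

This works, but iterrows() is the slowest way to traverse a frame; the answers below avoid the explicit loop.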

Sample output

<table border="1">
  <tbody>
    <tr>
      <td>Word</td>
      <td>Meaning</td>
      <td>Word Freq</td>
    </tr>
    <tr>
      <td>Array</td>
      <td>collection of homogeneous datatype</td>
      <td>{'collection': 1, 'of': 1, ...}</td>
    </tr>
  </tbody>
</table>

3 Answers


I would take advantage of pandas' str accessor methods and do this:

from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())

Some test data:

df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})

print(df)
             definition   word
0  this is a definition   some
1    another definition  words
2  one final definition    yes

Then concatenate the definitions, split on whitespace, and feed the result to Counter:

Counter(df.definition.str.cat(sep=' ').split())

Counter({'a': 1,
         'another': 1,
         'definition': 3,
         'final': 1,
         'is': 1,
         'one': 1,
         'this': 1})

3 Comments

Ted Petrou: thank you for the answer. I would also like to know how to do a similar computation, in the most Pythonic way, for say 1000 word definitions?
This works for any number of definitions of all different word sizes.
I mean to say for 1000 different words, i.e. 1000 rows in the dataframe?

Assuming that df has two columns, 'word' and 'definition', you can simply use the .map method with Counter on the definition series after splitting on spaces, then sum the result.

from collections import Counter

def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
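A quick check of those two lines against the sample frame from the first answer (same hypothetical data) shows what the per-row and overall results look like:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'word': ['some', 'words', 'yes'],
                   'definition': ['this is a definition', 'another definition',
                                  'one final definition']})

# One Counter per row of the definition column...
def_counts = df.definition.map(lambda x: Counter(x.split()))

# ...and summing a Series of Counters merges them into one overall frequency table
all_counts = def_counts.sum()

print(all_counts['definition'])  # 3
```

So def_counts keeps a per-definition Counter (answering the per-row use case from the question) while all_counts matches the corpus-wide Counter from the first answer.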

1 Comment

Thanks James, your suggestion helped a lot.

I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter and @TedPetrou's answer.

Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

a = np.random.choice(list(ascii_lowercase), size=(100000, 5))

definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')

definitions.head()

0    hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1    iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2    ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3    uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4    npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object

Timing

Counter is roughly 1000 times faster than the fastest alternative I could think of.
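A rough, reproducible version of that comparison, using explode() plus value_counts() as the pandas alternative (my choice of alternative, and absolute timings will vary by machine and pandas version):

```python
from collections import Counter
from string import ascii_lowercase
from timeit import timeit

import numpy as np
import pandas as pd

# Rebuild the random-word Series from above
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')

text = definitions.str.cat(sep=' ')

# Plain Counter on the concatenated, split text
t_counter = timeit(lambda: Counter(text.split()), number=3)

# A pandas alternative: value_counts over the exploded word lists
t_pandas = timeit(lambda: definitions.str.split().explode().value_counts(),
                  number=3)

print(f'Counter: {t_counter:.3f}s  value_counts: {t_pandas:.3f}s')
```

Both approaches produce the same frequencies; the gap is purely the overhead of building intermediate pandas objects versus a single pass through a Python list.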

