
I have one data frame df consisting of 2 columns (a word, and the meaning/definition of that word). I want to build a collections.Counter object for each word's definition and count the frequency of the words occurring in that definition, in the most Pythonic way possible.

The traditional approach would be to iterate over the data frame using the iterrows() method and do the computation row by row.
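For reference, a minimal sketch of that iterrows() baseline (using the column names word and meaning from the sample table below):

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({
    'word': ['Array'],
    'meaning': ['collection of homogeneous datatype'],
})

# Traditional approach: iterate row by row, counting words in each definition
freqs = []
for _, row in df.iterrows():
    freqs.append(Counter(row['meaning'].split()))
df['word_freq'] = freqs

print(dict(df['word_freq'][0]))
# {'collection': 1, 'of': 1, 'homogeneous': 1, 'datatype': 1}
```

This works, but iterrows() is the slowest way to traverse a frame; the answers below avoid the explicit loop.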

Sample output

<table border="1">
  <tbody>
    <tr>
      <td>Word</td>
      <td>Meaning</td>
      <td>Word Freq</td>
    </tr>
    <tr>
      <td>Array</td>
      <td>collection of homogeneous datatype</td>
      <td>{'collection': 1, 'of': 1, ...}</td>
    </tr>
  </tbody>
</table>

3 Answers


I would take advantage of pandas' str accessor methods and do this:

from collections import Counter
Counter(df.definition.str.cat(sep=' ').split())

Some test data:

df = pd.DataFrame({'word': ['some', 'words', 'yes'], 'definition': ['this is a definition', 'another definition', 'one final definition']})

print(df)
             definition   word
0  this is a definition   some
1    another definition  words
2  one final definition    yes

Then concatenate the definitions, split on whitespace, and feed the result to Counter:

Counter(df.definition.str.cat(sep=' ').split())

Counter({'a': 1,
         'another': 1,
         'definition': 3,
         'final': 1,
         'is': 1,
         'one': 1,
         'this': 1})

3 Comments

Ted Petrou: thank you for the answer. I would also like to know how to do a similar computation, in the most Pythonic way, for say 1000 word definitions?
This works for any number of definitions of all different word sizes.
I mean to say for 1000 different words, i.e. 1000 rows in the dataframe?

Assuming that df has two columns, 'word' and 'definition', you can simply use the .map method with Counter on the definition series after splitting on spaces, then sum the result.

from collections import Counter

def_counts = df.definition.map(lambda x: Counter(x.split()))
all_counts = def_counts.sum()
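A quick check of those two lines against the sample frame from the first answer (same hypothetical data) shows what the per-row and overall results look like:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'word': ['some', 'words', 'yes'],
                   'definition': ['this is a definition', 'another definition',
                                  'one final definition']})

# One Counter per row of the definition column...
def_counts = df.definition.map(lambda x: Counter(x.split()))

# ...and summing a Series of Counters merges them into one overall frequency table
all_counts = def_counts.sum()

print(all_counts['definition'])  # 3
```

So def_counts keeps a per-definition Counter (answering the per-row use case from the question) while all_counts matches the corpus-wide Counter from the first answer.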

1 Comment

Thanks James, your suggestion helped a lot.

I intend for this answer to be useful but not the chosen answer. In fact, I'm only making an argument for Counter and @TedPetrou's answer.

Create a large example of random words:

import numpy as np
import pandas as pd
from string import ascii_lowercase

a = np.random.choice(list(ascii_lowercase), size=(100000, 5))

definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')

definitions.head()

0    hmwnp okuat sexzr jsxhh bdoyc kdbas nkoov moek...
1    iiuot qnlgs xrmss jfwvw pmogp vkrvl bygit qqon...
2    ftcap ihuto ldxwo bvvch zuwpp bdagx okhtt lqmy...
3    uwmcs nhmxa qeomd ptlbg kggxr hpclc kwnix rlon...
4    npncx lnors gyomb dllsv hyayw xdynr ctwvh nsib...
dtype: object

Timing

Counter is roughly 1000 times faster than the fastest alternative I could think of.
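A rough, reproducible version of that comparison, using explode() plus value_counts() as the pandas alternative (my choice of alternative, and absolute timings will vary by machine and pandas version):

```python
from collections import Counter
from string import ascii_lowercase
from timeit import timeit

import numpy as np
import pandas as pd

# Rebuild the random-word Series from above
a = np.random.choice(list(ascii_lowercase), size=(100000, 5))
definitions = pd.Series(
    pd.DataFrame(a).sum(1).values.reshape(-1, 10).tolist()).str.join(' ')

text = definitions.str.cat(sep=' ')

# Plain Counter on the concatenated, split text
t_counter = timeit(lambda: Counter(text.split()), number=3)

# A pandas alternative: value_counts over the exploded word lists
t_pandas = timeit(lambda: definitions.str.split().explode().value_counts(),
                  number=3)

print(f'Counter: {t_counter:.3f}s  value_counts: {t_pandas:.3f}s')
```

Both approaches produce the same frequencies; the gap is purely the overhead of building intermediate pandas objects versus a single pass through a Python list.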

