Creating set and count dictionaries from a DataFrame column

Question

I am trying to create the set of word tokens and a word count dictionary from a dataframe column.

df = pd.DataFrame({'a':[11,11,11,12,12,12], 'b':['The Effect','effective than','more','more','bark oola','a'], 'c': [1,2,3,5,6,9]})

I am currently creating the tokens from column 'b' with this code:

set(list(itertools.chain.from_iterable(df.b.str.split())))

Is this the most efficient way?

What if I also need the tokens and their counts (the number of times each token appears in the column) in a dictionary?

Ch3steR · Accepted Answer · 2021-02-19 13:46:34Z

You can use str.join with str.split, then convert the result to a set:

set(' '.join(df['b']).split())
# {'Effect', 'The', 'a', 'bark', 'effective', 'more', 'oola', 'than'}

Alternatively, you can use Series.explode followed by Series.unique:

df['b'].str.split().explode().unique()

# array(['The', 'Effect', 'effective', 'than', 'more', 'bark', 'oola', 'a'],
#       dtype=object)
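The question also asks for a dictionary of token counts. A minimal sketch, assuming the column holds only strings (a NaN would make ' '.join raise a TypeError) and that tokens are case-sensitive, as in the examples above. The names token_counts, count_dict, tokens and count_dict_pd are illustrative only:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'a': [11, 11, 11, 12, 12, 12],
                   'b': ['The Effect', 'effective than', 'more', 'more', 'bark oola', 'a'],
                   'c': [1, 2, 3, 5, 6, 9]})

# Join all rows into one string, split on whitespace, and count each token.
token_counts = Counter(' '.join(df['b']).split())

# Counter is a dict subclass; convert if a plain dict is needed.
count_dict = dict(token_counts)

# The keys of the counter are the unique tokens, so the set comes for free.
tokens = set(token_counts)

# A pandas-only alternative that produces the same counts.
count_dict_pd = df['b'].str.split().explode().value_counts().to_dict()
```

Building the Counter once gives both results in a single pass over the text, so there is no need to tokenize the column twice.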

`timeit`s

Benchmarking setup

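s = pd.Series(['this', 'many strings', 'all are humans']*500)
# Originally followed by s.append(['again some more random', 'foo bar']*500), whose
# result was never assigned, so s is unchanged by it. Series.append was also removed
# in pandas 2.0; use pd.concat to extend a Series.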

In [43]: %%timeit
    ...: s.str.split().explode().unique()
    ...:
1.46 ms ± 4.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [44]: %%timeit
    ...: set(list(itertools.chain.from_iterable(s.str.split())))
    ...:
776 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [49]: %timeit np.unique(s.str.split().explode())
2.48 ms ± 62.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [64]: %timeit set(' '.join(s).split())
292 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
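Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. This is only a rough reproduction: it assumes s is the 1500-row Series from the setup above, wraps every candidate in set() so the results are directly comparable (which adds a little work to the explode variant), and uses the hypothetical names candidates, results and timings. Absolute numbers will differ by machine and pandas version.

```python
import itertools
import timeit

import pandas as pd

s = pd.Series(['this', 'many strings', 'all are humans'] * 500)

# Each candidate returns the set of unique tokens in s.
candidates = {
    'explode + unique': lambda: set(s.str.split().explode().unique()),
    'itertools.chain': lambda: set(itertools.chain.from_iterable(s.str.split())),
    'join + split': lambda: set(' '.join(s).split()),
}

# Run each candidate once so the outputs can be compared.
results = {name: fn() for name, fn in candidates.items()}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 20 calls each, converted to seconds per call.
    timings[name] = min(timeit.repeat(fn, number=20, repeat=3)) / 20

# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds * 1e6:.0f} µs per call')
```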

edited Feb 19, 2021 at 13:46

answered Feb 19, 2021 at 13:11

Ch3steR


2 Comments

Ch3steR Over a year ago

@jezrael Added a solution that is 3x faster than the OP's. In general, pandas string operations are slow.

Ch3steR Over a year ago

@jezrael Added timeits too.
