I am trying to create the set of word tokens and a word count dictionary from a dataframe column.

df = pd.DataFrame({'a':[11,11,11,12,12,12], 'b':['The Effect','effective than','more','more','bark oola','a'], 'c': [1,2,3,5,6,9]})

I am now creating tokens from the column 'b' using the code

set(list(itertools.chain.from_iterable(df.b.str.split())))

Is this the most efficient way?

What if I also need each token and its count (the number of times that token appears in the column) in a dictionary?

1 Answer


You can use str.join with str.split, then convert the result to a set:

set(' '.join(df['b']).split())
# {'Effect', 'The', 'a', 'bark', 'effective', 'more', 'oola', 'than'}

Alternatively, use Series.explode followed by Series.unique:

df['b'].str.split().explode().unique()

# array(['The', 'Effect', 'effective', 'than', 'more', 'bark', 'oola', 'a'],
#       dtype=object)
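
For the second part of the question (tokens together with their counts in a dictionary), here is a minimal sketch that builds on the two approaches above. It assumes collections.Counter is acceptable and that tokens are case-sensitive, as in the outputs shown; the names token_counts and token_counts_pd are only illustrative.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'a': [11, 11, 11, 12, 12, 12],
    'b': ['The Effect', 'effective than', 'more', 'more', 'bark oola', 'a'],
    'c': [1, 2, 3, 5, 6, 9],
})

# Join all rows into one string, split on whitespace, and count each token.
token_counts = dict(Counter(' '.join(df['b']).split()))

# pandas-only variant: one token per row, then count occurrences.
token_counts_pd = df['b'].str.split().explode().value_counts().to_dict()

# The keys of the dictionary are the set of unique tokens.
tokens = set(token_counts)
```

If 'The' and 'the' should count as the same token, lower-casing the column first (for example with df['b'].str.lower()) would be one way to do it.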

Timings

Benchmarking setup

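s = pd.Series(['this', 'many strings', 'all are humans']*500)
# Series.append returned a new Series (and was removed in pandas 2.0), so concatenate and reassign
s = pd.concat([s, pd.Series(['again some more random', 'foo bar']*500)], ignore_index=True)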
In [43]: %%timeit
    ...: s.str.split().explode().unique()
1.46 ms ± 4.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [44]: %%timeit
    ...: set(list(itertools.chain.from_iterable(s.str.split())))
776 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [49]: %timeit np.unique(s.str.split().explode())
2.48 ms ± 62.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [64]: %timeit set(' '.join(s).split())
292 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
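
To rerun a comparison like this outside IPython, a rough sketch with the standard timeit module could look like the following. The candidates dictionary and its labels are made up for illustration, the data is a small stand-in series, and the absolute numbers will differ by machine and pandas version.

```python
import itertools
import timeit

import pandas as pd

# Small stand-in data set (an assumption, not the exact series timed above).
s = pd.Series(['this', 'many strings', 'all are humans'] * 500)

# Each candidate returns the set of unique tokens.
candidates = {
    'join + split': lambda: set(' '.join(s).split()),
    'chain.from_iterable': lambda: set(itertools.chain.from_iterable(s.str.split())),
    'explode + unique': lambda: set(s.str.split().explode().unique()),
}

# Sanity check: every approach should produce the same tokens.
results = {name: fn() for name, fn in candidates.items()}
assert len({frozenset(r) for r in results.values()}) == 1

for name, fn in candidates.items():
    # Best of 3 repeats, 100 calls each, reported per call.
    per_call = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f'{name}: {per_call * 1e6:.0f} µs per call')
```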

2 Comments

@jezrael Add a solution which is 3x faster than OP's. In general, pandas string operations are slow.
@jezrael add timeits too
