I have a list of words that I want to count:

word_list = ['one','two','three']

I also have a pandas DataFrame with a text column:

TEXT                                       | USER
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | User 1
"Is it two or one?"                        | User 1
"Mayhaps it be three afterall..."          | User 2
"Three times and it's a charm."            | User 2
"One fish, two fish, red fish, blue fish." | User 2
"There's only one cat in the hat."         | User 3
"One does not simply code into pandas."    | User 3
"Two nights later..."                      | User 1
"Quoth the Raven... nevermore."            | User 2

The desired output is shown below: for each word in word_list, I want the number of unique users whose text in the "TEXT" column contains that word.

Word | Unique User Count
one  |      3          User 1/2/3 here
two  |      2          User 1/2 here
three|      1          User 2 here

Is there a way to do this in Python 2.7?

1 Answer

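import numpy as np
import pandas as pd
# Flag, per row, which words occur in TEXT (case-sensitive substring search), then list and count the users per word
df[word_list]=df.TEXT.apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg([lambda x : ', '.join(x),len])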

Out[31]: 
                        <lambda>  len
level_1                              
one       User 1, User 1, User 3    3
three                     User 2    1
two               User 1, User 2    2

EDIT

# Lower-case the text first so that 'One' and 'one' both match; set() and nunique() collapse repeated users
df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ', '.join(set(x))],'Unique':[lambda x : x.nunique()]})


Out[50]: 
          Unique               User Count
        <lambda>                 <lambda>
level_1                                  
one            3   User 2, User 3, User 1
three          1                   User 2
two            2           User 2, User 1

EDIT 2

df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
Target=df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ', '.join(set(x))],'Unique':[lambda x : x.nunique()]})
# Drop the <lambda> header row from the column MultiIndex
Target.columns=Target.columns.droplevel(1)
# Keep only the unique count and rename the index column to 'Words'
Target.drop('User Count',axis=1).reset_index().rename(columns={'level_1':'Words'})

Out[94]: 
   Words  Unique
0    one       3
1  three       1
2    two       2
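
For readers on a newer pandas, where passing a dict of lists to a groupby agg may no longer be accepted, here is a minimal sketch of a different route to the same final counts. It is not the method used above: it loops over the words and uses str.contains with a whole-word, case-insensitive pattern. The helper name unique_user_counts and the whole-word matching are my own assumptions. The DataFrame is rebuilt from the question's table so that the block runs on its own.

```python
import re
import pandas as pd

word_list = ['one', 'two', 'three']

# Sample data copied from the question
df = pd.DataFrame({
    'TEXT': [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    'USER': ['User 1', 'User 1', 'User 2', 'User 2', 'User 2',
             'User 3', 'User 3', 'User 1', 'User 2'],
})


def unique_user_counts(frame, words):
    """Hypothetical helper: count distinct users per word (whole-word, case-insensitive)."""
    rows = []
    for word in words:
        # \b limits the match to whole words; this is an assumption,
        # the answer above uses plain substring search via str.find
        pattern = r'\b' + re.escape(word) + r'\b'
        mask = frame['TEXT'].str.contains(pattern, case=False, regex=True)
        rows.append({'Word': word,
                     'Unique User Count': frame.loc[mask, 'USER'].nunique()})
    return pd.DataFrame(rows, columns=['Word', 'Unique User Count'])


result = unique_user_counts(df, word_list)
print(result)
```

On the sample data this gives 3, 2 and 1 for one, two and three, which matches the desired output in the question. If substring matches are wanted instead (so that "one" also counts inside "someone"), drop the two \b parts of the pattern.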

8 Comments

Is there a version that doesn't include the column listing which users are in the unique count? I only need the final count.
Can you eliminate the two rows that contain just <lambda> in each column, as well as level_1 and the User Count column? I just need the unique count.
@Leggerless check the edit. Also, you have more than 15 reputation, so you can upvote and accept the answer.
Edit 2 doesn't seem to be working as intended. Edit 1 still does. On #2 -- TypeError: drop() got an unexpected keyword argument 'axis'
@Leggerless it is drop('column',axis=1); write axis without the quotes.
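
If the drop call keeps raising that TypeError, one possible workaround is to select the wanted column instead of dropping the unwanted one. This is only a sketch: the target frame below is a stand-in built by hand from the output shown in the answer, not the result of running the groupby, and the name counts_only is hypothetical.

```python
import pandas as pd

# Stand-in for Target after droplevel(1); values copied from the answer's output
target = pd.DataFrame(
    {'Unique': [3, 1, 2],
     'User Count': ['User 2, User 3, User 1', 'User 2', 'User 2, User 1']},
    index=pd.Index(['one', 'three', 'two'], name='level_1'),
)

# Selecting with a list of column names keeps a DataFrame and avoids drop() entirely
counts_only = target[['Unique']].reset_index().rename(columns={'level_1': 'Words'})
print(counts_only)
```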