Counting number of unique values in column A based on substring filter on column B

Question

I have a list of words that I want to count:

word_list = ['one','two','three']

I also have a pandas DataFrame with a text column and a user column:

TEXT                                       | USER
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | User 1
"Is it two or one?"                        | User 1
"Mayhaps it be three afterall..."          | User 2
"Three times and it's a charm."            | User 2
"One fish, two fish, red fish, blue fish." | User 2
"There's only one cat in the hat."         | User 3
"One does not simply code into pandas."    | User 3
"Two nights later..."                      | User 1
"Quoth the Raven... nevermore."            | User 2
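For anyone who wants to reproduce this, the table can be loaded roughly as follows. This is only a sketch: the column names TEXT and USER come from the table, while the variable name df is an assumption that matches the code in the answer.

```python
import pandas as pd

word_list = ['one', 'two', 'three']

# Sample data copied from the table above.
df = pd.DataFrame({
    'TEXT': [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    'USER': ['User 1', 'User 1', 'User 2', 'User 2', 'User 2',
             'User 3', 'User 3', 'User 1', 'User 2'],
})
```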

My desired output is below: for each word in word_list, I want to count the number of unique users who have at least one entry in the "TEXT" column containing that word.

Word  | Unique User Count
------|------------------
one   | 3  (Users 1, 2, 3)
two   | 2  (Users 1, 2)
three | 1  (User 2)

Is there a way to do this in Python 2.7?

BENY · Accepted Answer · 2017-10-24 20:17:39Z


import numpy as np
import pandas as pd
# One boolean column per word: str.find returns -1 when the word is absent from TEXT
df[word_list]=df.TEXT.apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg([lambda x : ','.join(x),len])

Out[31]: 
                        <lambda>  len
level_1                              
one       User 1, User 1, User 3    3
three                     User 2    1
two               User 1, User 2    2

EDIT: lowercase the text before matching, and count each user only once

df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ','.join(set(x))],'Unique':[lambda x : x.nunique()]})


Out[50]: 
          Unique               User Count
        <lambda>                 <lambda>
level_1                                  
one            3   User 2, User 3, User 1
three          1                   User 2
two            2           User 2, User 1
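The dict-of-names form of agg used above was deprecated in pandas 0.20 and raises a SpecificationError from pandas 1.0 onwards. On pandas 0.25 or later, named aggregation is one possible replacement. The sketch below assumes a hypothetical long-format frame called pairs, standing in for the result of df1.stack().reset_index(); the output column names users and unique are also made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df1.stack().reset_index():
# one row per (matched word, user) pair, duplicates included.
pairs = pd.DataFrame({
    'level_1': ['one', 'one', 'one', 'one', 'one',
                'two', 'two', 'two', 'three', 'three'],
    'USER': ['User 1', 'User 1', 'User 2', 'User 3', 'User 3',
             'User 1', 'User 2', 'User 1', 'User 2', 'User 2'],
})

# Named aggregation: each keyword becomes an output column.
summary = pairs.groupby('level_1')['USER'].agg(
    users=lambda x: ', '.join(sorted(set(x))),  # distinct users, sorted for stable output
    unique='nunique',                           # number of distinct users
)
```

Sorting the set before joining keeps the user list in a stable order, which a bare set does not guarantee.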

EDIT 2: keep only the unique count

df[word_list]=df.TEXT.str.lower().apply(lambda x : pd.Series([x.find(y) for y in word_list])).ne(-1)
df1=df[['USER','one','two','three']].set_index('USER').astype(int).replace({0:np.nan})
Target=df1.stack().reset_index().groupby('level_1').USER.agg({'User Count':[lambda x : ','.join(set(x))],'Unique':[lambda x : x.nunique()]})
Target.columns=Target.columns.droplevel(1)
Target.drop('User Count',axis=1).reset_index().rename(columns={'level_1':'Words'})
Out[94]: 
   Words  Unique
0    one       3
1  three       1
2    two       2
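If only the final counts are needed, a shorter route is to filter the rows per word and call nunique on the matching users. This is a sketch of an alternative, not the method used above; it assumes the sample data from the question, and the names mask, rows and result are illustrative.

```python
import pandas as pd

word_list = ['one', 'two', 'three']
df = pd.DataFrame({
    'TEXT': [
        "Perhaps she'll be the one for me.",
        "Is it two or one?",
        "Mayhaps it be three afterall...",
        "Three times and it's a charm.",
        "One fish, two fish, red fish, blue fish.",
        "There's only one cat in the hat.",
        "One does not simply code into pandas.",
        "Two nights later...",
        "Quoth the Raven... nevermore.",
    ],
    'USER': ['User 1', 'User 1', 'User 2', 'User 2', 'User 2',
             'User 3', 'User 3', 'User 1', 'User 2'],
})

lowered = df['TEXT'].str.lower()  # lowercase once so matching ignores case

rows = []
for word in word_list:
    # Plain substring test (regex=False), so 'one' would also match 'someone'.
    mask = lowered.str.contains(word, regex=False)
    rows.append({'Word': word,
                 'Unique User Count': df.loc[mask, 'USER'].nunique()})

result = pd.DataFrame(rows, columns=['Word', 'Unique User Count'])
```

On the sample data this gives 3, 2 and 1 for one, two and three. Because it is a substring test, a word-boundary regular expression would be needed if matches inside longer words should be excluded.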

edited Oct 24, 2017 at 20:17

answered Oct 24, 2017 at 18:36


8 Comments

Leggerless Over a year ago

Is there one that doesn't include the column of which users are in the unique count? Just the final count only.

Leggerless Over a year ago

Can you eliminate the two rows with just <lambda> in each column, and level_1 too? Also the User Count column? I just need the unique count.

BENY Over a year ago

@Leggerless check the edit. Also, since you have more than 15 reputation, you can upvote and accept the answer.

Leggerless Over a year ago

Edit 2 doesn't seem to be working as intended. Edit 1 still does. On #2 -- TypeError: drop() got an unexpected keyword argument 'axis'

BENY Over a year ago

@Leggerless it is drop('column', axis=1): axis is a keyword argument, so it is written without quotes.
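For reference, a small sketch of the two equivalent ways to drop a column from a DataFrame. It does not diagnose the TypeError reported above; the frame target here is a made-up stand-in for the Target table in the answer, and the columns= keyword assumes pandas 0.21 or later.

```python
import pandas as pd

# Made-up stand-in for the aggregated table in the answer.
target = pd.DataFrame(
    {'Unique': [3, 1, 2],
     'User Count': ['User 1,User 2,User 3', 'User 2', 'User 1,User 2']},
    index=['one', 'three', 'two'],
)

by_axis = target.drop('User Count', axis=1)      # label plus axis=1 (columns)
by_keyword = target.drop(columns='User Count')   # explicit columns= keyword
```

Both calls return a new frame and leave target unchanged.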
