
I have my data in a pandas DataFrame as follows:

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                   'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

So, my data looks like this

----------------------------
index         A        B
0           yes      yes
1           yes       no
2           yes       no
3           yes       no
4            no      yes
5            no      yes
6           yes       no
7           yes      yes
8           yes      yes
9            no       no
-----------------------------

I would like to transform it into another DataFrame. The expected output can be built with the following Python script:

output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})

So, my expected output looks like this

--------------------------------------------
index      A       B       count
--------------------------------------------
0         no       no        1
1         no      yes        2
2        yes       no        4
3        yes      yes        3
--------------------------------------------

I can already find all the combinations and count them using the following command: mytable = df1.groupby(['A','B']).size()

However, the combinations end up stacked in the index rather than spread across columns. I would like to put each value of a combination into its own column and add one more column for the count. Is it possible to do that? May I have your suggestions? Thank you in advance.


6 Answers


You can group by columns 'A' and 'B', call size, and then reset_index and rename the generated column:

In [26]:

df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

update

A little explanation: grouping on the two columns collects the rows where the A and B values are the same, and calling size returns the number of rows in each group:

In[202]:
df1.groupby(['A','B']).size()

Out[202]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

So now to restore the grouped columns, we call reset_index:

In[203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]: 
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This turns the group keys back into columns, but the size aggregation ends up in a generated column named 0, so we have to rename it:

In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]: 
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

groupby does accept the argument as_index, which we could have set to False so that the grouped columns don't become the index, but size still produces a Series here and you'd still have to restore the columns and so on:

In[205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]: 
A    B  
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
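
As a side note, a slightly shorter sketch of the same idea: Series.reset_index accepts a name argument, so the generated column can be named in one step instead of renaming it afterwards. And in more recent pandas releases (roughly 1.1 onwards, an assumption worth checking against your version), as_index=False combined with size() returns a DataFrame directly, with the count in a column called 'size':

df1.groupby(['A','B']).size().reset_index(name='count')

# newer pandas only (assumed 1.1+); rename 'size' if you prefer 'count'
df1.groupby(['A','B'], as_index=False).size().rename(columns={'size': 'count'})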

4 Comments

Note: as described in stackoverflow.com/a/54364400/1224158, you could substitute count() for size() to both ignore NaN values and get a DataFrame back
Maybe a bit clearer: right after size(), use rename("count"). The Series produced by size() then has its name attribute set to "count", and that name becomes the column name of the DataFrame produced by reset_index().
if you would like to avoid explicitly listing the column names and are interested in grouping all columns, you could do instead: df1.groupby(list(df1.columns))
With a recent pandas version (>= 1.1.0), Mykola Zotko's answer is much more readable: df.value_counts now does this out of the box.

Since pandas 1.1.0 you can use the method value_counts with DataFrames:

df.value_counts() # or df[['A', 'B']].value_counts()

Result:

A    B
yes  no     4
     yes    3
no   yes    2
     no     1
dtype: int64

Convert index to columns and sort by value counts:

df.value_counts(ascending=True).reset_index(name='count')

Result:

     A    B  count
0   no   no      1
1   no  yes      2
2  yes  yes      3
3  yes   no      4
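
A small version note (an assumption based on the pandas 2.0 change that names the value_counts result): from pandas 2.0 onwards the Series returned by value_counts is already named count, so a plain reset_index() gives you the count column without passing name=; on 1.x you still need the explicit name.

df.value_counts(ascending=True).reset_index()               # pandas 2.0+ (assumed)
df.value_counts(ascending=True).reset_index(name='count')   # pandas 1.1 - 1.x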

1 Comment

With value_counts I saw performance improve by a factor of ~2 relative to the accepted answer (groupby) when finding the unique rows of two out of ~100 columns in a DataFrame of ~10K rows with ~10K unique rows

Based on the accepted answer and @Bryan P's comment about the differences between count() and size(), I opted for count() for cleaner code, as below:

df1.groupby(['A','B']).count().reset_index()
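
One caveat, sketched below with a hypothetical extra column C (not part of the original data): count() counts the non-NaN values of each remaining column per group, so it needs at least one column besides the group keys, and it skips NaN, whereas size() simply counts rows. On the original two-column df1 there is nothing left to count, so size() is the safer choice there.

import numpy as np

df2 = df1.copy()
df2['C'] = [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10]   # hypothetical value column

df2.groupby(['A','B']).size()          # counts every row in each group, NaN or not
df2.groupby(['A','B'])['C'].count()    # counts only non-NaN values of C,
                                       # e.g. ('yes','no') has size 4 but count 3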



Slightly related, I was looking for the unique combinations and I came up with this method:

def unique_columns(df, columns):
    # For each row, flag whether its combination of `columns` occurs exactly once.
    result = pd.Series(index=df.index, dtype=object)

    groups = df.groupby(by=columns)
    for name, group in groups:
        is_unique = len(group) == 1
        result.loc[group.index] = is_unique

    assert not result.isnull().any()

    return result

And if you only want to assert that all combinations are unique:

df1.set_index(['A','B']).index.is_unique
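
As a quick usage sketch on the sample df1 (output values inferred from the data above, worth re-checking):

flags = unique_columns(df1, ['A','B'])      # True only for row 9, the single ('no','no') combination

df1.set_index(['A','B']).index.is_unique    # False, since most combinations repeat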

1 Comment

Didn't know about set_index(). Kept trying to use groupby() for grouping together rows with a particular common pair of columns. Amazing, thank you!

I haven't timed this, but it was fun to try. Basically: convert the two columns into a single column of tuples, call value_counts() on it to find the unique elements and count them, then zip again to split the tuples back out and put the columns in the order you want. You could probably make the steps more elegant, but working with tuples feels more natural to me for this problem.

b = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                  'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

# Pair A and B row-wise into a single column of tuples.
b['count'] = pd.Series(list(zip(b.A, b.B)))
# Count each distinct tuple; reset_index moves the tuples into an 'index' column
# (on pandas 2.0+ the value_counts naming changes, so this step may need adjusting).
df = pd.DataFrame(b['count'].value_counts().reset_index())
# Unpack the tuples back into separate A and B columns, then tidy up.
df['A'], df['B'] = zip(*df['index'])
df = df.drop(columns='index')[['A','B','count']]
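
For comparison with the question's expected output, sorting the result (a small optional step) lines the rows up the same way:

df.sort_values(['A','B']).reset_index(drop=True)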



This places @EdChum's very nice answer into a function, count_unique_index. The unique method only works on a pandas Series, not on DataFrames. The function below reproduces the behavior of the unique function in R:

unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

And adds a count of the occurrences as requested by the OP.

def count_unique_index(df, by):
    return df.groupby(by).size().reset_index().rename(columns={0:'count'})

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})

count_unique_index(df1, ['A','B'])
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

