I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one two
0   1   1
1   1   2
2   1   1

I imagine the first step is to find all the different unique rows, which I do by:

df.drop_duplicates()

This gives me the following df:

    one two
0   1   1
1   1   2

Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:

Row     Count
[1 1]     2
[1 2]     1

How should I go about doing this last step?

Edit:

Here's a larger example to make it more clear:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

    one three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I want a result that tells me:

       Row           Count
[True True True]       1
[True False False]     2
[False False True]     1

12 Answers

You can groupby on all the columns and call size; the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two  
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
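
For a self-contained run on the question's first (numeric) example, the same idea looks like this (a minimal sketch; without as_index=False the result is a Series whose MultiIndex holds the row values):

import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# group on every column; each group's size is the number of identical rows
counts = df.groupby(df.columns.tolist()).size()
print(counts)

# one  two
# 1.0  1.0    2
#      2.0    1
# dtype: int64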

1 Comment

From Pandas 1.1.0 (released in July 2020) onwards, this code can be fine-tuned to also count duplicate rows with NaN entries. See here for details.

Specific to your question, as the others mentioned, a fast and easy way would be:

df.groupby(df.columns.tolist(),as_index=False).size()

If you like to count duplicates on particular column(s):

len(df['one'])-len(df['one'].drop_duplicates())

If you want to count duplicates on entire dataframe:

len(df)-len(df.drop_duplicates())
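
As a quick sanity check on the question's first example (three rows, two of them identical):

len(df['one']) - len(df['one'].drop_duplicates())   # 2: three 1s, only one unique value
len(df) - len(df.drop_duplicates())                 # 1: row 2 repeats row 0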

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels(by default use all of the columns)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
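
As a minimal sketch of how keep changes the flags, using the question's first example:

import pandas as pd

df = pd.DataFrame({'one': [1., 1, 1], 'two': [1., 2., 1]})

# keep='first' (default): only the later repeats are flagged
df.duplicated(keep='first').tolist()   # [False, False, True]

# keep=False: every member of a duplicated group is flagged
df.duplicated(keep=False).tolist()     # [True, False, True]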

4 Comments

May not be the fastest solution, but it is for sure the clearest to read.
Doesn't answer the question; we want to count the number of duplicates for each duplicate value.
@clg4 Added a solution more specific to the question.
Since you mention EdChum's solution on all columns and a solution on a particular column, I would add that df.groupby(df["my_column"].tolist(), as_index=False).size() can be used to group by a single column.
df.groupby(df.columns.tolist()).size().reset_index().\
    rename(columns={0:'records'})

   one  two  records
0    1    1        2
1    1    2        1

1 Comment

This also works on dask :)

I use:

used_features =[
    "one",
    "two",
    "three"
]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and you can then analyse them via the new column. I didn't see such a solution here.
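
For example, once the flag column exists, the flagged rows can be pulled out for inspection (a small usage sketch):

# show only the rows flagged as duplicates of an earlier row
df[df['is_duplicated']]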



If you just need a count of the unique and duplicate rows (entire row duplicated), this could work:

df.duplicated().value_counts()

output:

False    11398
True       154
dtype: int64

1 Comment

The question was about counting unique combinations of values, not simply counting how many rows are duplicates of other rows.

None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})

# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
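
For what it's worth, the same number should come straight from df.duplicated(), which flags every repeat after its first occurrence (at least for frames without NaN entries, since groupby drops NaN groups by default):

# rows that are repeats of an earlier row, i.e. "should be cut out"
df.duplicated().sum()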

1 Comment

Thanks @olisteadman, that answer should be in position 1.

If you find some counts missing or get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on:

1. Count duplicate rows with NaN entries:

The accepted solution is great and has been helpful to many members. In a recent task, I found it can be further fine-tuned to support complete counting of a dataframe with NaN entries. Pandas represents missing entries or null values as NaN. Let's see what the output is for this use case when our dataframe contains NaN entries:

  Col1  Col2 Col3 Col4
0  ABC   123  XYZ  NaN    # group #1 of 3
1  ABC   123  XYZ  NaN    # group #1 of 3
2  ABC   678  PQR  def    # group #2 of 1
3  MNO   890  EFG  abc    # group #3 of 4
4  MNO   890  EFG  abc    # group #3 of 4
5  CDE   234  567  xyz    # group #4 of 2
6  ABC   123  XYZ  NaN    # group #1 of 3
7  CDE   234  567  xyz    # group #4 of 2
8  MNO   890  EFG  abc    # group #3 of 4
9  MNO   890  EFG  abc    # group #3 of 4
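
For reproducibility, the frame above can be built like this (a sketch inferred from the printed output; the original only shows the rendered frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC', 'MNO', 'MNO', 'CDE', 'ABC', 'CDE', 'MNO', 'MNO'],
    'Col2': [123, 123, 678, 890, 890, 234, 123, 234, 890, 890],
    'Col3': ['XYZ', 'XYZ', 'PQR', 'EFG', 'EFG', '567', 'XYZ', '567', 'EFG', 'EFG'],
    'Col4': [np.nan, np.nan, 'def', 'abc', 'abc', 'xyz', np.nan, 'xyz', 'abc', 'abc'],
})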

Applying the code:

df.groupby(df.columns.tolist(),as_index=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   678  PQR  def     1
1  CDE   234  567  xyz     2
2  MNO   890  EFG  abc     4

Oh, how come the count of Group #1 with 3 duplicate rows is missing?!

For some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements

Solution:

Use the parameter dropna= for the .groupby() function, as follows:

df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   123  XYZ  NaN     3          # <===  count of rows with `NaN`
1  ABC   678  PQR  def     1
2  CDE   234  567  xyz     2
3  MNO   890  EFG  abc     4

The count of duplicate rows with NaN can be successfully output with dropna=False. This parameter has been supported since Pandas version 1.1.0.


2. Alternative Solution

Another way to count duplicate rows with NaN entries is as follows:

df.value_counts(dropna=False).reset_index(name='count')

gives:

  Col1  Col2 Col3 Col4  count
0  MNO   890  EFG  abc      4
1  ABC   123  XYZ  NaN      3
2  CDE   234  567  xyz      2
3  ABC   678  PQR  def      1

Here, we use the .value_counts() function, also with the parameter dropna=False. However, this parameter has only been supported since Pandas version 1.3.0. If your version is older than this, you'll need to use the .groupby() solution if you want to get complete counts for rows with NaN entries.

You will see that the output is in a different order than the previous result. The counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:

df.value_counts(dropna=False, sort=False).reset_index(name='count')

it gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:

  Col1  Col2 Col3 Col4  count
0  ABC   123  XYZ  NaN      3
1  ABC   678  PQR  def      1
2  CDE   234  567  xyz      2
3  MNO   890  EFG  abc      4

Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.

In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()

Hence, for this use case, .value_counts() and the .groupby() solution in the accepted solution are actually doing the same thing. We should be able to use the .value_counts() function to get the desired counts of duplicate rows equally well.

Using the .value_counts() function to get counts of duplicate rows has the additional benefit that its syntax is simpler. You can simply use df.value_counts() or df.value_counts(dropna=False), depending on whether your dataframe contains NaN or not. Chain with .reset_index() if you want the result as a dataframe instead of a Series.

3 Comments

Is there a way to preserve original index of the first previously unseen row occurrence?
Managed to do it like this, but please let me know if there's a simpler way: counts_reordered = df.value_counts(dropna=False, sort=False); df.drop_duplicates(inplace=True); counts = counts_reordered.reindex(tuple(values) for index, values in df.iterrows()).values
Regarding the parameter dropna=False, this also works when there are None values
df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values, index=df['str_list'].value_counts().index, columns=['Count'])

Produces:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be a list, you could take the above code a step further with:

df1.index = df1.index.str.split()

Produces:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1



I ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively), since pandas will infer the data types again after replacing "" with np.nan.

import numpy as np
import pandas as pd

# create test data
df = pd.DataFrame({'test':  ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})

# fill null values with '' to not lose them during groupby,
# group by all columns and take the length of each resulting group,
# rename the resulting series to "group_count",
# reset the index to get a DataFrame,
# replace '' with np.nan (reverting the first operation),
# and sort by "group_count" descending
df = (df.fillna('')
      .groupby(df.columns.tolist()).apply(len)
      .rename('group_count')
      .reset_index()
      .replace('', np.nan)
      .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1



To count rows in a DataFrame, you can use the method value_counts (available since Pandas 1.1.0):

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})

df.value_counts().reset_index(name='counts').query('counts > 1')

Output:

   A   B  counts
0  1  10       2
1  2  20       2



It is as easy as:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})


vc = df.value_counts(sort=False)
rs = pd.DataFrame(vc.index.to_list(), columns=df.columns)
rs["#"] = vc.values


    one     two     three   #
0   False   True    False   1
1   True    False   False   2
2   True    True    True    1

But if you just want to report the counts, this piece of code is enough:

df.value_counts(sort=False)


df.groupby(df.columns.tolist()).size().reset_index(name='count')
   one  two    count
0    1    1        2
1    1    2        1

