If you find some counts missing, or you get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on:
1. Count duplicate rows with NaN entries:
The accepted solution is great and has helped many members. In a recent task, I found it can be fine-tuned to fully count a dataframe with NaN entries. Pandas represents missing or null values as NaN. Let's see the output for this use case when our dataframe contains NaN entries:
Col1 Col2 Col3 Col4
0 ABC 123 XYZ NaN # group #1 of 3
1 ABC 123 XYZ NaN # group #1 of 3
2 ABC 678 PQR def # group #2 of 1
3 MNO 890 EFG abc # group #3 of 4
4 MNO 890 EFG abc # group #3 of 4
5 CDE 234 567 xyz # group #4 of 2
6 ABC 123 XYZ NaN # group #1 of 3
7 CDE 234 567 xyz # group #4 of 2
8 MNO 890 EFG abc # group #3 of 4
9 MNO 890 EFG abc # group #3 of 4
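For reference, the listing above can be reproduced with a short sketch (the DataFrame construction below is an assumption based on the printed values):

```python
import numpy as np
import pandas as pd

# Reconstruct the sample dataframe shown above; NaN entries appear in Col4.
df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC', 'MNO', 'MNO', 'CDE', 'ABC', 'CDE', 'MNO', 'MNO'],
    'Col2': [123, 123, 678, 890, 890, 234, 123, 234, 890, 890],
    'Col3': ['XYZ', 'XYZ', 'PQR', 'EFG', 'EFG', '567', 'XYZ', '567', 'EFG', 'EFG'],
    'Col4': [np.nan, np.nan, 'def', 'abc', 'abc', 'xyz', np.nan, 'xyz', 'abc', 'abc'],
})
print(df)
```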
Applying the code:
df.groupby(df.columns.tolist(), as_index=False).size()
gives:
Col1 Col2 Col3 Col4 size
0 ABC 678 PQR def 1
1 CDE 234 567 xyz 2
2 MNO 890 EFG abc 4
Oh, how come the count of group #1, which has 3 duplicate rows, is missing?!
For some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements
Solution:
Pass the parameter dropna=False to the .groupby() function, as follows:
df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()
gives:
Col1 Col2 Col3 Col4 size
0 ABC 123 XYZ NaN 3 # <=== count of rows with `NaN`
1 ABC 678 PQR def 1
2 CDE 234 567 xyz 2
3 MNO 890 EFG abc 4
With dropna=False, the count of duplicate rows with NaN entries is now included in the output. This parameter has been supported since Pandas version 1.1.0.
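If you are stuck on a Pandas version older than 1.1.0 without dropna= on .groupby(), one common workaround is to fill NaN with a sentinel before grouping and restore it afterwards. A minimal sketch (the sentinel string '__NA__' is an assumption; pick one that cannot collide with real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC'],
    'Col4': [np.nan, np.nan, 'def'],
})

# Fallback for Pandas < 1.1.0: replace NaN with a sentinel so the group
# is not dropped, count, then restore NaN in the result.
sentinel = '__NA__'  # assumption: must not appear in your actual data
counts = (df.fillna(sentinel)
            .groupby(df.columns.tolist())
            .size()
            .reset_index(name='size')
            .replace(sentinel, np.nan))
print(counts)
```

The `.size().reset_index(name='size')` form is used here because it behaves the same across old and new Pandas versions.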
2. Alternative Solution
Another way to count duplicate rows with NaN entries is as follows:
df.value_counts(dropna=False).reset_index(name='count')
gives:
Col1 Col2 Col3 Col4 count
0 MNO 890 EFG abc 4
1 ABC 123 XYZ NaN 3
2 CDE 234 567 xyz 2
3 ABC 678 PQR def 1
Here, we use the .value_counts() function, also with the parameter dropna=False. However, this parameter has only been supported since Pandas version 1.3.0. If your version is older than this, you'll need to use the .groupby() solution to get complete counts for rows with NaN entries.
You will see that the output is in a different order than the previous result: the counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:
df.value_counts(dropna=False, sort=False).reset_index(name='count')
This gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:
Col1 Col2 Col3 Col4 count
0 ABC 123 XYZ NaN 3
1 ABC 678 PQR def 1
2 CDE 234 567 xyz 2
3 MNO 890 EFG abc 4
Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.
In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()
Hence, for this use case, .value_counts() and the .groupby() approach in the accepted solution are doing the same thing, and either can be used to get the desired counts of duplicate rows equally well.
Using .value_counts() to count duplicate rows has the additional benefit of simpler syntax: use df.value_counts() or df.value_counts(dropna=False) depending on whether your dataframe contains NaN entries. Chain .reset_index() if you want the result as a dataframe instead of a Series.
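As a sketch of the equivalence between the two approaches (the small dataframe below is an assumption for illustration; requires Pandas >= 1.3.0 for value_counts(dropna=False)):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'MNO'],
    'Col4': [np.nan, np.nan, 'abc'],
})

# Route 1: value_counts, unsorted, NaN group kept.
via_value_counts = (df.value_counts(dropna=False, sort=False)
                      .reset_index(name='count'))

# Route 2: groupby with dropna=False, size column renamed to match.
via_groupby = (df.groupby(df.columns.tolist(), as_index=False, dropna=False)
                 .size()
                 .rename(columns={'size': 'count'}))

print(via_value_counts)
print(via_groupby)
```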