
Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.

Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.

    if len(df['Student'].unique()) < len(df.index):
        # Code to remove duplicates based on Date column runs

Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?

Here is some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):

    Student Date
0   Joe     December 2017
1   James   January 2018
2   Bob     April 2018
3   Joe     December 2017
4   Jack    February 2018
5   Jack    March 2018
3 Comments

  • Sort your df by date, then df.drop_duplicates('Student') Commented May 8, 2018 at 22:18
  • @Wen Yes this, but maybe convert to datetime and sort after. A quick check would be: any(df['Student'].duplicated()) Commented May 8, 2018 at 22:21
  • A couple of misunderstandings: a) it's never necessary to check len(df[col].unique()); pandas has df[col].nunique(). b) You don't even need that either; you're just looking for df[col].duplicated(...) Commented May 31, 2020 at 3:09

4 Answers


Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check whether the Student column contains duplicates with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True

Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise 
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often skip a few rows and directly access what we need:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
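As a quick sketch of how the three keep values behave with duplicated() (they control which occurrence is marked, mirroring what drop_duplicates would keep), using a small sample frame:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only later occurrences are flagged as duplicates
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': only earlier occurrences are flagged
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every row that has a duplicate anywhere is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())    # [True, False, True]
```

So df[df.duplicated(subset=['Student'], keep=False)] gives you all duplicated rows at once, which is handy for inspecting them before deciding what to drop.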

Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

3 Comments

Thanks, any(df['Student'].duplicated()) was what I was after.
Incidentally, I wasn't able to get the date conversion to work (my existing function did work though). I got the error AttributeError: 'DataFrame' object has no attribute 'Date' for the line df['Date'] = pd.to_datetime(df.Date)
@JeffMitchell df.Date equals df['Date']. It is case-sensitive. Are you sure your column is called Date? You could try df['Date'] too

You can use is_unique:

df['Student'].is_unique

# True if there are no duplicates

Older pandas versions required:

pd.Series(df['Student']).is_unique

2 Comments

You just need to do df['Student'].is_unique. df['Student'] is already a Pandas series
True, older pandas versions however struggled with that. Edited the answer now

If you want to know how many duplicates there are and what they are, use:

df.pivot_table(index=['ColumnName'], aggfunc='size')

df.pivot_table(index=['ColumnName1',.., 'ColumnNameN'], aggfunc='size')

Comments


In addition to DataFrame.duplicated and Series.duplicated, pandas also has DataFrame.any and Series.any.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

With Python ≥3.8, check for duplicates and access some duplicate rows:

if (duplicated := df.duplicated(keep=False)).any():
    some_duplicates = df[duplicated].sort_values(by=df.columns.to_list()).head()
    print(f"Dataframe has one or more duplicated rows, for example:\n{some_duplicates}")

Comments
