
Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.

Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, then there are duplicates and the code runs.

    if len(df['Student'].unique()) < len(df.index):
        # Code to remove duplicates based on Date column runs

Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?

Here is some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):

    Student Date
0   Joe     December 2017
1   James   January 2018
2   Bob     April 2018
3   Joe     December 2017
4   Jack    February 2018
5   Jack    March 2018
3 Comments

  • Sort your df by date, then df.drop_duplicates('Student') Commented May 8, 2018 at 22:18
  • @Wen Yes this, but maybe convert to datetime and sort after. A quick check would be: any(df['Student'].duplicated()) Commented May 8, 2018 at 22:21
  • A couple of misunderstandings: a) it's never necessary to check len(df[col].unique()); pandas has df[col].nunique(). b) You don't even need that either; you're just looking for df[col].duplicated(...) Commented May 31, 2020 at 3:09

4 Answers


Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check whether the Student column contains duplicates with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True

Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise 
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often skip a few rows and directly access what we need:

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
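As a quick sketch of how the three keep values behave with duplicated() (they control which occurrence is marked, mirroring what drop_duplicates would keep), using a small sample frame:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only later occurrences are flagged as duplicates
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': only earlier occurrences are flagged
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every row that has a duplicate anywhere is flagged
print(df.duplicated(subset=['Student'], keep=False).tolist())    # [True, False, True]
```

So df[df.duplicated(subset=['Student'], keep=False)] gives you all duplicated rows at once, which is handy for inspecting them before deciding what to drop.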

Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

3 Comments

Thanks, any(df['Student'].duplicated()) was what I was after.
Incidentally, I wasn't able to get the date conversion to work (my existing function did work though). I got the error AttributeError: 'DataFrame' object has no attribute 'Date' for the line df['Date'] = pd.to_datetime(df.Date)
@JeffMitchell df.Date equals df['Date']. It is case-sensitive. Are you sure your column is called Date? You could try df['Date'] too

You can use is_unique:

df['Student'].is_unique

# True if there are no duplicates

Older pandas versions required:

pd.Series(df['Student']).is_unique

2 Comments

You just need to do df['Student'].is_unique. df['Student'] is already a Pandas series
True, older pandas versions however struggled with that. Edited the answer now

If you want to know how many duplicates there are and what they are, use:

df.pivot_table(index=['ColumnName'], aggfunc='size')

df.pivot_table(index=['ColumnName1',.., 'ColumnNameN'], aggfunc='size')

Comments


In addition to DataFrame.duplicated and Series.duplicated, pandas also has DataFrame.any and Series.any.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

With Python ≥3.8, check for duplicates and access some duplicate rows:

if (duplicated := df.duplicated(keep=False)).any():
    some_duplicates = df[duplicated].sort_values(by=df.columns.to_list()).head()
    print(f"Dataframe has one or more duplicated rows, for example:\n{some_duplicates}")

Comments
