
I have timestamp and id variables in my dataframe (df)

timestamp                    id
2016-06-09 8:33:37           a1  
2016-06-09 8:33:37           a1  
2016-06-09 8:33:38           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:37           b1
2016-06-09 8:33:38           b1
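For reference, a minimal, self-contained construction of this frame (assuming `timestamp` is a datetime column and `id` a string column, as shown):

```python
import pandas as pd

# Reconstruct the example frame from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})
print(df)
```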

An id should not have two identical timestamps. I need to print these duplicated timestamps for each id; in the case above, the output should be rows 1, 2, 4 and 5.

The following code gives the duplicate timestamps on their own (treating the column as a plain list, since `Series.count` takes no value argument):

ts = list(df['timestamp'])
set(x for x in ts if ts.count(x) > 1)

How can I take id into account along with timestamp to get the duplicated rows?


2 Answers


You can use groupby to get a mask of all duplicated values per group with Series.duplicated(keep=False), then select the rows with boolean indexing:

print (df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False)))
0     True
1     True
2    False
3     True
4     True
5    False
6    False
Name: timestamp, dtype: bool

print (df[df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False))])
            timestamp  id
0 2016-06-09 08:33:37  a1
1 2016-06-09 08:33:37  a1
3 2016-06-09 08:33:39  a1
4 2016-06-09 08:33:39  a1
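As a side note, because the rows here are duplicated on the (timestamp, id) pair, `DataFrame.duplicated` over both columns produces the same mask without a groupby (a sketch, assuming the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})

# Mark every row whose (timestamp, id) pair occurs more than once
mask = df.duplicated(subset=["timestamp", "id"], keep=False)
print(df[mask])
```

This avoids the per-group apply, which matters on larger frames.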



If you wish to find all duplicates, use the duplicated method. It only works on columns, while df.index.duplicated works on the index; since the timestamps here sit in the index, do a quick reset_index first to bring them into the columns.

df = df.reset_index()
df.loc[df.duplicated(keep=False)]

    index               id
0   2016-06-09 8:33:37  a1
1   2016-06-09 8:33:37  a1
3   2016-06-09 8:33:39  a1
4   2016-06-09 8:33:39  a1

If you just wish to remove duplicates, use the DataFrame method drop_duplicates.

df = df.reset_index()
df = df.drop_duplicates() # keep='first' by default.

    index               id
0   2016-06-09 8:33:37  a1
2   2016-06-09 8:33:38  a1
3   2016-06-09 8:33:39  a1
5   2016-06-09 8:33:37  b1
6   2016-06-09 8:33:38  b1

If you wish to get back your old index after either of the above, set_index using the default column name 'index':

df.set_index('index')

                    id
index               
2016-06-09 8:33:37  a1
2016-06-09 8:33:38  a1
2016-06-09 8:33:39  a1
2016-06-09 8:33:37  b1
2016-06-09 8:33:38  b1

The above methods let you choose whether to keep the first, the last, or none of the duplicates by setting the keep parameter to 'first', 'last' or False. So to remove every row that has any duplicate in df, use keep=False.
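The three keep settings can be compared side by side (a sketch over the question's original frame, with timestamp as an ordinary column):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})

cols = ["timestamp", "id"]
# keep='first' (default): mark all but the first occurrence of each pair
first = df.duplicated(subset=cols)
# keep='last': mark all but the last occurrence of each pair
last = df.duplicated(subset=cols, keep="last")
# keep=False: mark every occurrence of a duplicated pair
both = df.duplicated(subset=cols, keep=False)
```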

2 Comments

any change in this solution? ix is deprecated?
use loc instead
