
I have timestamp and id variables in my dataframe (df)

timestamp                    id
2016-06-09 8:33:37           a1  
2016-06-09 8:33:37           a1  
2016-06-09 8:33:38           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:39           a1
2016-06-09 8:33:37           b1
2016-06-09 8:33:38           b1
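For reference, a minimal, self-contained construction of this frame (assuming `timestamp` is a datetime column and `id` a string column, as shown):

```python
import pandas as pd

# Reconstruct the example frame from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})
print(df)
```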

An id should not have two identical timestamps. I need to print these duplicated timestamps for each id; in the case above, the output should be rows 1, 2, 4 and 5.

The following code gives the duplicate timestamps on their own (treating the column as a plain list, since `Series.count` takes no value argument):

ts = list(df['timestamp'])
set(x for x in ts if ts.count(x) > 1)

How can I take id into account along with timestamp to get the duplicated rows?


2 Answers


You can use groupby to get a mask of all duplicated values per group with Series.duplicated(keep=False), then select the rows with boolean indexing:

print (df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False)))
0     True
1     True
2    False
3     True
4     True
5    False
6    False
Name: timestamp, dtype: bool

print (df[df.groupby(['id'])['timestamp'].apply(lambda x: x.duplicated(keep=False))])
            timestamp  id
0 2016-06-09 08:33:37  a1
1 2016-06-09 08:33:37  a1
3 2016-06-09 08:33:39  a1
4 2016-06-09 08:33:39  a1
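As a side note, because the rows here are duplicated on the (timestamp, id) pair, `DataFrame.duplicated` over both columns produces the same mask without a groupby (a sketch, assuming the example frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})

# Mark every row whose (timestamp, id) pair occurs more than once
mask = df.duplicated(subset=["timestamp", "id"], keep=False)
print(df[mask])
```

This avoids the per-group apply, which matters on larger frames.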



If you wish to find all duplicates, use the duplicated method. It only works on columns, while df.index.duplicated works on the index; since the timestamps here sit in the index, do a quick reset_index first to bring them into the columns.

df = df.reset_index()
df.loc[df.duplicated(keep=False)]

    index               id
0   2016-06-09 8:33:37  a1
1   2016-06-09 8:33:37  a1
3   2016-06-09 8:33:39  a1
4   2016-06-09 8:33:39  a1

If you just wish to remove duplicates, use the DataFrame method drop_duplicates.

df = df.reset_index()
df = df.drop_duplicates() # keep='first' by default.

    index               id
0   2016-06-09 8:33:37  a1
2   2016-06-09 8:33:38  a1
3   2016-06-09 8:33:39  a1
5   2016-06-09 8:33:37  b1
6   2016-06-09 8:33:38  b1

If you wish to get back your old index after either of the above, set_index using the default column name 'index':

df.set_index('index')

                    id
index               
2016-06-09 8:33:37  a1
2016-06-09 8:33:38  a1
2016-06-09 8:33:39  a1
2016-06-09 8:33:37  b1
2016-06-09 8:33:38  b1

The above methods let you choose whether to keep the first, the last, or none of the duplicates by setting the keep parameter to 'first', 'last' or False. So to remove every row that has any duplicate in df, use keep=False.
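The three keep settings can be compared side by side (a sketch over the question's original frame, with timestamp as an ordinary column):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-06-09 8:33:37", "2016-06-09 8:33:37", "2016-06-09 8:33:38",
        "2016-06-09 8:33:39", "2016-06-09 8:33:39", "2016-06-09 8:33:37",
        "2016-06-09 8:33:38",
    ]),
    "id": ["a1", "a1", "a1", "a1", "a1", "b1", "b1"],
})

cols = ["timestamp", "id"]
# keep='first' (default): mark all but the first occurrence of each pair
first = df.duplicated(subset=cols)
# keep='last': mark all but the last occurrence of each pair
last = df.duplicated(subset=cols, keep="last")
# keep=False: mark every occurrence of a duplicated pair
both = df.duplicated(subset=cols, keep=False)
```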

2 Comments

any change in this solution? ix is deprecated?
use loc instead
