Create a list of duplicate index entries in pandas dataframe

Question

I am trying to identify which timestamps in my index have duplicates. I want to build a list of those timestamps as strings, ideally with a single entry for each timestamp that is duplicated.

# required packages
import pandas as pd
import numpy as np

# create sample time series
header = ['A','B','C','D','E']
period = 5
cols = len(header)

dates = pd.date_range('1/1/2000', periods=period, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=period, freq='10min')
df = pd.DataFrame(np.random.randn(period,cols),index=dates,columns=header)
df0 = pd.DataFrame(np.random.randn(period,cols),index=dates2,columns=header)
df1 = pd.concat([df]*3)                                                         #creates duplicate entries by copying the dataframe
df1 = pd.concat([df1, df0])
df2 = df1.sample(frac=1)                                                        #shuffles the dataframe
df3 = df1.sort_index()                                                          #sorts the dataframe by index

print(df2)
#print(df3)

# Identifying duplicated entries

df4 = df2.duplicated()

print(df4)

I would then like to use that list to pull out all the duplicate entries for each timestamp. From the code above, is there a good way to get the index labels that correspond to a False value in the boolean result?

Edit: added an extra DataFrame to create some unique values and tripled the first DataFrame to create more than a single repeat. Also added more detail to the question.

Scott Boston · Accepted Answer · 2017-09-21 17:50:27Z

IIUC:

df4[~df4]

Output:

2000-01-01 00:10:00    False
2000-01-01 00:00:00    False
2000-01-01 00:40:00    False
2000-01-01 00:30:00    False
2000-01-01 00:20:00    False
dtype: bool

List of timestamps:

df4[~df4].index.tolist()

Output:

[Timestamp('2000-01-01 00:10:00'),
 Timestamp('2000-01-01 00:00:00'),
 Timestamp('2000-01-01 00:40:00'),
 Timestamp('2000-01-01 00:30:00'),
 Timestamp('2000-01-01 00:20:00')]
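
Note that `DataFrame.duplicated()` compares row values rather than index labels, so `df4[~df4]` also returns rows whose timestamp occurs only once (such as the 2022 rows added in the question's edit). If the goal is strictly the timestamps that repeat in the index, a sketch along these lines using `Index.duplicated` may be closer. The small frame and the names `demo`, `dup_mask`, and `dup_stamps` are illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): three timestamps repeated three times, plus two unique ones
repeated = pd.date_range('2000-01-01', periods=3, freq='10min')
unique = pd.date_range('2022-01-01', periods=2, freq='10min')
idx = repeated.append(repeated).append(repeated).append(unique)
demo = pd.DataFrame({'A': np.arange(len(idx))}, index=idx).sample(frac=1, random_state=0)

# Index.duplicated(keep='first') marks every occurrence after the first as True
dup_mask = demo.index.duplicated(keep='first')

# Collapse to one entry per timestamp that occurs more than once
dup_stamps = demo.index[dup_mask].unique().sort_values().tolist()
print(dup_stamps)
```

The two 2022 timestamps are left out because they never appear a second time in the index.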

MaxU - stand with Ukraine · Accepted Answer · 2017-09-21 17:55:54Z

In [46]: df2.drop_duplicates()
Out[46]:
                            A         B         C         D         E
2000-01-01 00:00:00  0.932587 -1.508587 -0.385396 -0.692379  2.083672
2000-01-01 00:40:00  0.237324 -0.321555 -0.448842 -0.983459  0.834747
2000-01-01 00:20:00  1.624815 -0.571193  1.951832 -0.642217  1.744168
2000-01-01 00:30:00  0.079106 -1.290473  2.635966  1.390648  0.206017
2000-01-01 00:10:00  0.760976  0.643825 -1.855477 -1.172241  0.532051

In [47]: df2.drop_duplicates().index.tolist()
Out[47]:
[Timestamp('2000-01-01 00:00:00'),
 Timestamp('2000-01-01 00:40:00'),
 Timestamp('2000-01-01 00:20:00'),
 Timestamp('2000-01-01 00:30:00'),
 Timestamp('2000-01-01 00:10:00')]
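
`drop_duplicates()` keeps one copy of each distinct row, so rows whose timestamp occurs only once also survive. To pull out every row that shares a repeated timestamp, which is the second part of the question, one option is `Index.duplicated(keep=False)` followed by a group-by on the index. This is a minimal sketch on an assumed toy frame; `demo`, `all_dups`, and `groups` are hypothetical names.

```python
import pandas as pd

# Illustrative frame (assumed): two timestamps repeat, the 2022 one does not
idx = pd.to_datetime(['2000-01-01 00:00', '2000-01-01 00:10', '2000-01-01 00:00',
                      '2022-01-01 00:00', '2000-01-01 00:10'])
demo = pd.DataFrame({'val': [1, 2, 3, 4, 5]}, index=idx)

# keep=False flags every occurrence of a repeated index label, including the first
all_dups = demo[demo.index.duplicated(keep=False)]

# Collect the rows belonging to each repeated timestamp
groups = {stamp: grp['val'].tolist() for stamp, grp in all_dups.groupby(level=0)}
for stamp, vals in groups.items():
    print(stamp, vals)
```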

Comment from Moose Drool (over a year ago):

This works really well and is a little more flexible than any of the other answers. Is there an easy way to convert the list of Timestamps to just the strings? I've tried to use to_string, but the list does not have that attribute. Basically, I just want a list of the timestamps like: ['2000-01-01 00:00:00' '2000-01-01 00:40:00' '2000-01-01 00:20:00' '2000-01-01 00:30:00' '2000-01-01 00:10:00']
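
On the string conversion asked about in this comment: a plain Python list has no `to_string`, but each `Timestamp` has `strftime`, and a `DatetimeIndex` can be formatted in one call before it is turned into a list. A small sketch, with the sample values and the format string chosen here as assumptions:

```python
import pandas as pd

# Two sample Timestamps (assumed), standing in for the list from the answer
stamps = [pd.Timestamp('2000-01-01 00:00:00'), pd.Timestamp('2000-01-01 00:40:00')]

# Option 1: format each Timestamp individually
as_strings = [ts.strftime('%Y-%m-%d %H:%M:%S') for ts in stamps]

# Option 2: format a DatetimeIndex in one call, then convert to a list
as_strings_idx = pd.DatetimeIndex(stamps).strftime('%Y-%m-%d %H:%M:%S').tolist()

print(as_strings)  # ['2000-01-01 00:00:00', '2000-01-01 00:40:00']
```

With the answer's own expression, the second form would amount to calling `.strftime(...)` on the index before `.tolist()`.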
