1

I am trying to identify which timestamps in my index have duplicates. I want to create a list of the timestamp strings, ideally with a single entry for each timestamp that has duplicates.

# required packages
import pandas as pd
import numpy as np

# create sample time series
header = ['A','B','C','D','E']
period = 5
cols = len(header)

dates = pd.date_range('1/1/2000', periods=period, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=period, freq='10min')
df = pd.DataFrame(np.random.randn(period,cols),index=dates,columns=header)
df0 = pd.DataFrame(np.random.randn(period,cols),index=dates2,columns=header)
df1 = pd.concat([df]*3)       # create duplicate entries by repeating the dataframe
df1 = pd.concat([df1, df0])   # append the rows with unique timestamps
df2 = df1.sample(frac=1)      # shuffle the dataframe
df3 = df1.sort_index()        # sort the dataframe by index

print(df2)
#print(df3)

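# Identifying duplicated entries
# Note: DataFrame.duplicated() compares the column values of each row, not the index.
# It is False for the first occurrence of a row and True for later repeats.

df4 = df2.duplicated()

print(df4)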

I would then like to use that list to call out all the duplicate entries for each timestamp. From the code above, is there a good way to get the index values that correspond to a boolean value of False?

Edit: added an extra dataframe to create some unique values and tripled the first dataframe to create more than a single repeat. Also added more detail to the question.

2 Answers

2

IIUC:

df4[~df4]

Output:

2000-01-01 00:10:00    False
2000-01-01 00:00:00    False
2000-01-01 00:40:00    False
2000-01-01 00:30:00    False
2000-01-01 00:20:00    False
dtype: bool

List of timestamps:

df4[~df4].index.tolist()

Output:

[Timestamp('2000-01-01 00:10:00'),
 Timestamp('2000-01-01 00:00:00'),
 Timestamp('2000-01-01 00:40:00'),
 Timestamp('2000-01-01 00:30:00'),
 Timestamp('2000-01-01 00:20:00')]
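
Note that df4[~df4] keeps the first occurrence of every distinct row, so with the edited sample data it would also include the 2022 timestamps that never repeat. If the goal is only the timestamps that occur more than once in the index, one possible approach is to test the index itself. This is a minimal sketch, assuming the data looks like the question's sample; the names is_repeated, dup_timestamps and dup_list are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the question's sample:
# 5 timestamps repeated 3 times, plus 5 timestamps that appear once.
header = ['A', 'B', 'C', 'D', 'E']
dates = pd.date_range('1/1/2000', periods=5, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=5, freq='10min')
df = pd.DataFrame(np.random.randn(5, 5), index=dates, columns=header)
df0 = pd.DataFrame(np.random.randn(5, 5), index=dates2, columns=header)
df2 = pd.concat([df] * 3 + [df0]).sample(frac=1)

# keep=False marks every occurrence of a repeated label,
# not just the second and later ones.
is_repeated = df2.index.duplicated(keep=False)

# One entry per repeated timestamp, sorted for readability.
dup_timestamps = df2.index[is_repeated].unique().sort_values()
dup_list = dup_timestamps.tolist()

print(dup_list)
```

Because this looks at the index labels rather than the row values, it still works when two rows share a timestamp but hold different data.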

1
In [46]: df2.drop_duplicates()
Out[46]:
                            A         B         C         D         E
2000-01-01 00:00:00  0.932587 -1.508587 -0.385396 -0.692379  2.083672
2000-01-01 00:40:00  0.237324 -0.321555 -0.448842 -0.983459  0.834747
2000-01-01 00:20:00  1.624815 -0.571193  1.951832 -0.642217  1.744168
2000-01-01 00:30:00  0.079106 -1.290473  2.635966  1.390648  0.206017
2000-01-01 00:10:00  0.760976  0.643825 -1.855477 -1.172241  0.532051

In [47]: df2.drop_duplicates().index.tolist()
Out[47]:
[Timestamp('2000-01-01 00:00:00'),
 Timestamp('2000-01-01 00:40:00'),
 Timestamp('2000-01-01 00:20:00'),
 Timestamp('2000-01-01 00:30:00'),
 Timestamp('2000-01-01 00:10:00')]
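
To go from the timestamps to all the entries that share each one, which the question also asks about, grouping on the index is one option. The sketch below is an assumption about what "call out all the duplicate entries" means, and repeated_rows and rows_by_timestamp are hypothetical names.

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question's.
header = ['A', 'B', 'C', 'D', 'E']
dates = pd.date_range('1/1/2000', periods=5, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=5, freq='10min')
df = pd.DataFrame(np.random.randn(5, 5), index=dates, columns=header)
df0 = pd.DataFrame(np.random.randn(5, 5), index=dates2, columns=header)
df2 = pd.concat([df] * 3 + [df0]).sample(frac=1)

# Keep only the rows whose timestamp occurs more than once in the index.
repeated_rows = df2[df2.index.duplicated(keep=False)]

# Map each repeated timestamp to the sub-frame holding all of its rows.
rows_by_timestamp = {ts: group for ts, group in repeated_rows.groupby(level=0)}

for ts, group in rows_by_timestamp.items():
    print(ts, len(group))
```

For a single timestamp, df2.loc[ts] should return the same rows without building the dictionary.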

1 Comment

This works really well and is a little more flexible than any of the other answers. Is there an easy way to convert the list of Timestamps to just strings? I've tried to use to_string, but the list does not have that attribute. Basically, I just want a list of the timestamps as: ['2000-01-01 00:00:00' '2000-01-01 00:40:00' '2000-01-01 00:20:00' '2000-01-01 00:30:00' '2000-01-01 00:10:00']
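
One way to get plain strings, assuming the index is a DatetimeIndex as in the question, is to format it with strftime before calling tolist(). This is only a sketch: the format string is a guess at the desired layout, and unique_index and as_strings are illustrative names.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's shuffled frame with repeated rows.
dates = pd.date_range('1/1/2000', periods=5, freq='10min')
df = pd.DataFrame(np.random.randn(5, 2), index=dates, columns=['A', 'B'])
df2 = pd.concat([df] * 3).sample(frac=1)

# Index left after dropping repeated rows, as in the answer above.
unique_index = df2.drop_duplicates().index

# DatetimeIndex.strftime formats every element and returns an Index of strings.
as_strings = unique_index.strftime('%Y-%m-%d %H:%M:%S').tolist()

print(as_strings)
```

If the list of Timestamp objects already exists, a list comprehension such as [ts.strftime('%Y-%m-%d %H:%M:%S') for ts in timestamps] should give the same strings.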
