Python pandas change duplicate timestamp to unique

Question

I have a file containing duplicate timestamps, maximum two for each timestamp, actually they are not duplicate, it is just the second timestamp needs to add a millisecond timestamp. For example, I am having these in the file,

....
2011/1/4    9:14:00
2011/1/4    9:15:00
2011/1/4    9:15:01
2011/1/4    9:15:01
2011/1/4    9:15:02
2011/1/4    9:15:02
2011/1/4    9:15:03
2011/1/4    9:15:03
2011/1/4    9:15:04
....

I would like to change them into

2011/1/4    9:14:00
2011/1/4    9:15:00
2011/1/4    9:15:01
2011/1/4    9:15:01.500
2011/1/4    9:15:02
2011/1/4    9:15:02.500
2011/1/4    9:15:03
2011/1/4    9:15:03.500
2011/1/4    9:15:04
....

what is the most efficient way to perform such task?

Jeff · Accepted Answer · 2013-08-10 12:58:34Z

Setup

In [69]: df = DataFrame(dict(time = x))

In [70]: df
Out[70]: 
                 time
0 2013-01-01 09:01:00
1 2013-01-01 09:01:00
2 2013-01-01 09:01:01
3 2013-01-01 09:01:01
4 2013-01-01 09:01:02
5 2013-01-01 09:01:02
6 2013-01-01 09:01:03
7 2013-01-01 09:01:03
8 2013-01-01 09:01:04
9 2013-01-01 09:01:04

Find the locations where the difference in time from the previous row is 0 seconds

In [71]: mask = (df.time-df.time.shift()) == np.timedelta64(0,'s')

In [72]: mask
Out[72]: 
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
9     True
Name: time, dtype: bool

Set theose locations to use an offset of 5 milliseconds (In your question you used 500 but could be anything). This requires numpy >= 1.7. (Not that this syntax will be changing in 0.13 to allow a more direct df.loc[mask,'time'] += pd.offsets.Milli(5)

In [73]: df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5))

In [74]: df
Out[74]: 
                        time
0        2013-01-01 09:01:00
1 2013-01-01 09:01:00.005000
2        2013-01-01 09:01:01
3 2013-01-01 09:01:01.005000
4        2013-01-01 09:01:02
5 2013-01-01 09:01:02.005000
6        2013-01-01 09:01:03
7 2013-01-01 09:01:03.005000
8        2013-01-01 09:01:04
9 2013-01-01 09:01:04.005000

TomAugspurger · Accepted Answer · 2013-08-10 12:28:49Z

So this algorithm should work very well... I'm just having a hell of a time with numpy's datetime datatypes.

In [154]: df
Out[154]: 
                  0
0  2011/1/4 9:14:00
1  2011/1/4 9:15:00
2  2011/1/4 9:15:01
3  2011/1/4 9:15:01
4  2011/1/4 9:15:02
5  2011/1/4 9:15:02
6  2011/1/4 9:15:03
7  2011/1/4 9:15:03
8  2011/1/4 9:15:04


In [155]: ((dt.diff() == 0) * .005)
Out[155]: 
0    0.000
1    0.000
2    0.000
3    0.005
4    0.000
5    0.005
6    0.000
7    0.005
8    0.000
Name: 0, dtype: float64

And the idea is to add those two together. Of course, one is datetime64 and the other is float64. For whatever reasons, np.timedelta64 doesn't operate on arrays? Anyway if you can sort out the dtype issues that will work.

Steve Barnes · Accepted Answer · 2013-08-10 10:55:33Z

0

Assuming - as you have shown in your example that they are sequential:

lasttimestamp = None
for ts = readtimestamp(infile): # I will leave this to you
   if ts == lasttimestamp:
      ts += inc_by  # and this
   lasttimestamp = ts
   writetimestamp(outfile, ts) # and this to

edited Aug 10, 2013 at 10:55

answered Aug 10, 2013 at 8:11

Steve Barnes

28.5k6 gold badges68 silver badges80 bronze badges

2 Comments

Andy Hayden Over a year ago

This is not going to be very efficient.

Steve Barnes Over a year ago

It ought to be nearly as fast as the disk access while still ensuring that only the duplicates get changed.

Collectives™ on Stack Overflow

Python pandas change duplicate timestamp to unique

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related