Add missing times in dataframe column with pandas

Question

I have a dataframe like so:

df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
                   'X':[-5,-4,-2,5,6,10,11],
                   'Y':[3,4,5,9,20,22,23]})

As you can see, the times are strings in HH:MM:SS format and they cross midnight. There is one sample every 5 seconds. My goal, however, is to add empty rows (filled with NaN, for example) so that there is one row per second. Finally, the time column should be converted to a timestamp and set as the index.

Could you please suggest a smart and elegant way to achieve my goal?

Here is what the output should look like:

           X     Y
time   
23:59:45  -5.0   3.0
23:59:46   NaN   NaN
23:59:47   NaN   NaN
23:59:48   NaN   NaN
...        ...   ...
00:00:10  10.0  22.0
00:00:11   NaN   NaN
00:00:12   NaN   NaN
00:00:13   NaN   NaN
00:00:14   NaN   NaN
00:00:15  11.0  23.0

Note: I do not need the dates.

jezrael · Accepted Answer · 2017-10-04 09:29:55Z


Use to_timedelta with reindex by timedelta_range:

df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')

df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
      time    X     Y
0 00:00:00  5.0   9.0
1 00:00:01  NaN   NaN
2 00:00:02  NaN   NaN
3 00:00:03  NaN   NaN
4 00:00:04  NaN   NaN
5 00:00:05  6.0  20.0
6 00:00:06  NaN   NaN
7 00:00:07  NaN   NaN
8 00:00:08  NaN   NaN
9 00:00:09  NaN   NaN

If you need to replace the NaNs:

df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
      time  X   Y
0 00:00:00  5   9
1 00:00:01  0   0
2 00:00:02  0   0
3 00:00:03  0   0
4 00:00:04  0   0
5 00:00:05  6  20
6 00:00:06  0   0
7 00:00:07  0   0
8 00:00:08  0   0
9 00:00:09  0   0
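
The two outputs above differ in more than the fill value: the NaN version shows floats (5.0, 9.0) while the fill_value=0 version keeps integers. A minimal sketch of that difference, using a small illustrative sample rather than the full data, and the lowercase 's' alias that newer pandas releases prefer:

```python
import pandas as pd

# Small sample in the same shape as the question's data (values are illustrative).
df = pd.DataFrame({'time': ['00:00:00', '00:00:05', '00:00:10'],
                   'X': [5, 6, 10],
                   'Y': [9, 20, 22]})
df['time'] = pd.to_timedelta(df['time'])

# One entry per second between the first and last observation.
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='s', name='time')

with_nan = df.set_index('time').reindex(idx)                 # gaps become NaN
with_zero = df.set_index('time').reindex(idx, fill_value=0)  # gaps become 0

print(with_nan.dtypes)   # float columns, because NaN needs a float dtype
print(with_zero.dtypes)  # the integer columns are preserved
```

If 0 is a legitimate measurement in X or Y, filling with it makes real zeros indistinguishable from gaps, so keeping NaN may be the safer default.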

Another solution uses resample, but rows at the end may be missing, because the result stops at the last time present in the data:

df = df.set_index('time').resample('S').first()
print (df.tail(10))
            X    Y
time              
23:59:46  NaN  NaN
23:59:47  NaN  NaN
23:59:48  NaN  NaN
23:59:49  NaN  NaN
23:59:50  NaN  NaN
23:59:51  NaN  NaN
23:59:52  NaN  NaN
23:59:53  NaN  NaN
23:59:54  NaN  NaN
23:59:55 -2.0  5.0
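
To make the difference between the two approaches concrete, here is a small sketch on illustrative data (not the question's full sample). It assumes nothing beyond resample and reindex themselves: resample derives its bins from the data, while reindex follows whatever range is passed in.

```python
import pandas as pd

# Illustrative data: four samples, 5 seconds apart, starting at midnight.
df = pd.DataFrame({'time': ['00:00:00', '00:00:05', '00:00:10', '00:00:15'],
                   'X': [5, 6, 10, 11]})
df['time'] = pd.to_timedelta(df['time'])
ts = df.set_index('time')

# resample builds its bins from the data, so it stops at the last observation.
resampled = ts.resample('s').first()

# reindex follows the range it is given, here every second of the day.
full_day = pd.timedelta_range('00:00:00', '23:59:59', freq='s', name='time')
reindexed = ts.reindex(full_day)

print(len(resampled))  # seconds between the first and last sample, inclusive
print(len(reindexed))  # seconds in a whole day
```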

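EDIT1:

To cover only the span from 23:59:45 to 00:00:15, concatenate two ranges, one on each side of midnight:

import numpy as np

# seconds up to midnight, then seconds after it, joined in display order
idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])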

df['time'] = pd.to_timedelta(df['time'])        
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
      time    X    Y
0 23:59:45 -5.0  3.0
1 23:59:46  NaN  NaN
2 23:59:47  NaN  NaN
3 23:59:48  NaN  NaN
4 23:59:49  NaN  NaN
5 23:59:50  NaN  NaN
6 23:59:51  NaN  NaN
7 23:59:52  NaN  NaN
8 23:59:53  NaN  NaN
9 23:59:54  NaN  NaN

print (df.tail(10))
       time     X     Y
21 00:00:06   NaN   NaN
22 00:00:07   NaN   NaN
23 00:00:08   NaN   NaN
24 00:00:09   NaN   NaN
25 00:00:10  10.0  22.0
26 00:00:11   NaN   NaN
27 00:00:12   NaN   NaN
28 00:00:13   NaN   NaN
29 00:00:14   NaN   NaN
30 00:00:15  11.0  23.0
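
The same idea can be sketched without hard-coded endpoints by reading the first and last time from the data and joining the two ranges with Index.append. This is only a sketch under one assumption: the data crosses midnight exactly once, with the first row before it and the last row after it. The sample values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'time': ['23:59:50', '23:59:55', '00:00:00', '00:00:05'],
                   'X': [-4, -2, 5, 6]})
df['time'] = pd.to_timedelta(df['time'])

start, end = df['time'].iloc[0], df['time'].iloc[-1]

# Assumption: exactly one midnight crossing, `start` before it and `end` after it.
before = pd.timedelta_range(start, '23:59:59', freq='s', name='time')
after = pd.timedelta_range('00:00:00', end, freq='s', name='time')

# Index.append keeps the TimedeltaIndex type and the shared name.
idx = before.append(after)

out = df.set_index('time').reindex(idx)
print(out)
```

Leaving out reset_index keeps time as the index, which is what the question asks for.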

EDIT:

Another solution: shift the times that fall after midnight by a 1-day timedelta:

df['time'] = pd.to_timedelta(df['time'])        

a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
    X   Y            time
0  -5   3 0 days 23:59:45
1  -4   4 0 days 23:49:50
2  -2   5 0 days 23:59:55
3   5   9 1 days 00:00:00
4   6  20 1 days 00:00:05
5  10  22 1 days 00:00:10
6  11  23 1 days 00:00:15

idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')

df = df.set_index('time').reindex(idx).reset_index()

print (df.head(10))
      time    X    Y
0 23:49:50 -4.0  4.0
1 23:49:51  NaN  NaN
2 23:49:52  NaN  NaN
3 23:49:53  NaN  NaN
4 23:49:54  NaN  NaN
5 23:49:55  NaN  NaN
6 23:49:56  NaN  NaN
7 23:49:57  NaN  NaN
8 23:49:58  NaN  NaN
9 23:49:59  NaN  NaN

print (df.tail(10))
               time     X     Y
616 1 days 00:00:06   NaN   NaN
617 1 days 00:00:07   NaN   NaN
618 1 days 00:00:08   NaN   NaN
619 1 days 00:00:09   NaN   NaN
620 1 days 00:00:10  10.0  22.0
621 1 days 00:00:11   NaN   NaN
622 1 days 00:00:12   NaN   NaN
623 1 days 00:00:13   NaN   NaN
624 1 days 00:00:14   NaN   NaN
625 1 days 00:00:15  11.0  23.0
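
A variant of this approach counts negative steps directly and then drops the day part again, so that the index shows only the time of day as in the desired output. It is a sketch under a stated assumption: the times only ever decrease at a midnight crossing. That holds for the regular 5-second grid described in the question, but not for the sample's second value (23:49:50), so the input below uses 23:59:50 instead.

```python
import pandas as pd

# Assumed input: the question's samples on a regular 5-second grid.
df = pd.DataFrame({'time': ['23:59:45', '23:59:50', '23:59:55', '00:00:00',
                            '00:00:05', '00:00:10', '00:00:15'],
                   'X': [-5, -4, -2, 5, 6, 10, 11],
                   'Y': [3, 4, 5, 9, 20, 22, 23]})

t = pd.to_timedelta(df['time'])

# A negative step means the clock wrapped past midnight; count the wraps so far.
wraps = t.diff().lt(pd.Timedelta(0)).cumsum()
elapsed = t + pd.to_timedelta(wraps, unit='D')

# Fill every second between the first and last sample.
idx = pd.timedelta_range(elapsed.min(), elapsed.max(), freq='s', name='time')
out = df.drop(columns='time').set_index(elapsed.rename('time')).reindex(idx)

# Drop the day part by mapping each offset onto an arbitrary date
# and keeping only the time of day (datetime.time objects).
out.index = pd.Index((pd.Timestamp('1970-01-01') + out.index).time, name='time')
print(out)
```

The resulting index holds plain datetime.time values, which is convenient for display but does not support time-based operations such as resample; keep the timedelta index if further time arithmetic is needed.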

edited Oct 4, 2017 at 9:29

answered Oct 4, 2017 at 8:05

jezrael


5 Comments

Federico Gentile Over a year ago

Thanks for the answer. However, there is a problem with it: the times should start at '23:59:45' and end at '00:00:15' (of the day after), so I only need to fill the dataframe between those two times.

jezrael Over a year ago

Hmmm, can you add the desired output?

jezrael Over a year ago

And could there also be more than one midnight?

Federico Gentile Over a year ago

The midnight is just a corner case, so that the example is valid no matter what starting and ending times I choose.

jezrael Over a year ago

Not so easy: you need to concatenate 2 different ranges.
