I have a time-series-like dataframe of train traffic data:
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])
At the stations, the trains are either passing through, or some coaches are attached or detached. As this is time series data, every event is on a separate row.
I have to merge the rows of the same train at the same station where 2 movements (mvt) happen one after the other (second timestamp > first timestamp), put the movements in 2 separate columns (mvt_x and mvt_y), and keep the timestamp of the last operation. For a single-row passage, mvt_y will always be NaN.
Here is the expected result:
   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00     10    NaN
1      1     1001 2020-05-25 13:50:00     -1    2.0
2      2     1000 2020-05-25 13:35:00     20    NaN
3      1     1002 2020-05-25 14:10:00      0    NaN
4      2     1003 2020-05-25 14:00:00      0    NaN
df.time = df.groupby(['train', 'station', 'time']).time.transform('last') would be a starting point, BUT there's the issue of multiple visits to the same station... Is there another column that can take the place of visitID? Or do we have to build it by resetting time on every station change?
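One possible way to build such a visit ID, as a rough sketch rather than a definitive answer: sort each train's events by time and start a new visit whenever the train or the station differs from the previous row. The column names visit_id, pos and order are made up here for illustration and are not in the original data. The sketch also assumes that a "visit" means consecutive events of one train at one station, and that at most two movements per visit matter (a third one would be dropped).

```python
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00', '20200525 13:45:00',
                            '20200525 13:50:00', '20200525 13:35:00',
                            '20200525 14:10:00', '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
})

# Order events chronologically within each train; the original row
# position is kept in 'order' so the input ordering can be restored.
s = df.rename_axis('order').reset_index().sort_values(['train', 'time'])

# Assumption: a new visit starts whenever the train or the station
# changes compared to the previous row (visit_id is a hypothetical helper).
changed = (s['train'] != s['train'].shift()) | (s['station'] != s['station'].shift())
s['visit_id'] = changed.cumsum()

# Position of each event inside its visit: 0 = first, 1 = second, ...
s['pos'] = s.groupby('visit_id').cumcount()

# Keep the movement value only on the second event of a visit, so that
# 'first' (which skips NaN) picks it up, or yields NaN for single-row visits.
s['mvt_y'] = s['mvt'].where(s['pos'] == 1)

out = (
    s.groupby('visit_id')
     .agg(order=('order', 'min'),
          train=('train', 'first'),
          station=('station', 'first'),
          time=('time', 'last'),      # timestamp of the last operation
          mvt_x=('mvt', 'first'),     # first movement of the visit
          mvt_y=('mvt_y', 'first'))   # second movement, NaN if none
     .sort_values('order')
     .drop(columns='order')
     .reset_index(drop=True)
)
print(out)
```

Because the visit ID is derived from consecutive runs rather than from the (train, station) pair alone, a train that returns to the same station later would get a separate row for that second visit.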