I have a time-series DataFrame of train traffic data.

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

At each station, a train either passes through or has coaches attached or detached. Because this is time-series data, every event is on a separate row.

I need to merge the rows for the same train at the same station when two movements (mvt) happen one after the other (second timestamp > first timestamp), put the two movements in separate columns (mvt_x and mvt_y), and keep the timestamp of the last operation. When a train has only a single row at a station, mvt_y is always NaN.

Here is the expected result:

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00     10    NaN
1      1     1001 2020-05-25 13:50:00     -1    2.0
2      2     1000 2020-05-25 13:35:00     20    NaN
3      1     1002 2020-05-25 14:10:00      0    NaN
4      2     1003 2020-05-25 14:00:00      0    NaN
  • Do you have some code to improve? Commented Jul 27, 2020 at 17:13
  • Well, not really. I'm stuck. Of course I could solve it iteratively after sorting the dataset by train, time, and station, but the dataset is rather large (several million rows), so that would not be very efficient. I guess some kind of groupby would be involved. Commented Jul 27, 2020 at 17:19
  • Could you specify what exactly your question is? Do you expect some code that converts any dataframe of the first format to the second one? Or is a general approach enough? Commented Jul 27, 2020 at 17:23
  • Are the rows you have to merge always ordered in a way that they are next to each other? Commented Jul 27, 2020 at 17:25
  • Yes, df.time = df.groupby(['train', 'station']).time.transform('last') would be a starting point, BUT there's the issue of multiple visits to the same station. Is there another column that can serve as a visit ID, or do we have to build one by resetting on every station change? Commented Jul 27, 2020 at 17:28

2 Answers

Create the data frame

import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

Compute a rank within each (train, station) pair to distinguish pairs with one movement from pairs with two. Then reshape the data frame using the rank:

df['rank'] = df.groupby(['train', 'station'])['time'].rank().astype(int)

# re-shape the data frame - 'rank' is part of column label
x = (df.set_index(['train', 'station', 'rank'])
       .unstack(level='rank')
       .reset_index())

# find rows with a time with rank=2 ...
mask = x.loc[:, ('time', 2)].notna()

# ... and replace time-1 with time-2 (keep later time only)
x.loc[mask, ('time', 1)] = x.loc[mask, ('time', 2)]

# drop time-2
x = x.drop(columns=('time', 2))

# re-name columns
x.columns = ['train', 'station', 'time', 'mvt_x', 'mvt_y']

print(x)

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
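
One caveat about the rank step, offered as a small sketch rather than part of the answer above: rank() defaults to method='average', so two events of the same train at the same station with identical timestamps both get rank 1.5. The tied data below is hypothetical and only there to show the effect; method='first' is one way to break such ties by order of appearance.

```python
import pandas as pd

# Hypothetical data: both events of train 1 at station 1001 share a timestamp.
tied = pd.DataFrame({
    'train': [1, 1],
    'station': [1001, 1001],
    'time': pd.to_datetime(['20200525 13:45:00', '20200525 13:45:00']),
    'mvt': [-1, 2],
})

grouped = tied.groupby(['train', 'station'])['time']

# Default method='average': both tied rows receive rank 1.5, so astype(int)
# would map both to 1 and the later unstack would see a duplicate index entry.
default_rank = grouped.rank()

# method='first' breaks ties by order of appearance, giving distinct ranks.
first_rank = grouped.rank(method='first').astype(int)

print(default_rank.tolist())
print(first_rank.tolist())
```

If timestamps are guaranteed unique within each (train, station) pair, the default ranking is enough.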
1 Comment

Wow! This is exactly what I want! You're great, man! :-)
Beat me to the punch... but here's some code for cases with multiple visits to the same station:

# change df.time to the last time of each stop at a station
# sort by time to account for multiple visits to a station
df = df.sort_values(['train', 'time', 'station'])
# number the stops: the counter increases whenever the station
# differs from the previous row, so a later revisit gets a new id
stopid = df.station.ne(df.station.shift()).cumsum()
df.time = df.groupby(['train', 'station', stopid]).time.transform('last')

# create index for mvt on train_station groups
df = df.assign(mvt_id=df.groupby(['train', 'station', 'time']).cumcount())

# reshape df, similar to pivot
df = (
    df.set_index(['train', 'station', 'time', 'mvt_id'])
    .unstack('mvt_id').droplevel(0, axis=1)
    )
df.columns = ['mvt_x', 'mvt_y']  # hardcoded for only 2 movements per station
# generate the column names instead if more than 2 movements are expected

df = df.reset_index()

print(df)

Output

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
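
For stops with more than two movements, the column names can be derived from the unstacked result instead of being hardcoded. This is only a sketch under my own assumptions: the function name merge_movements, the mvt_1, mvt_2, ... naming scheme, and the three-movement example data are hypothetical and not from the question. Stops are numbered by checking whether the station differs from the previous row after sorting.

```python
import pandas as pd

def merge_movements(df):
    """Collapse consecutive events of a train at one station into one row."""
    df = df.sort_values(['train', 'time', 'station'])
    # A new stop starts whenever the station differs from the previous row.
    stop_id = df['station'].ne(df['station'].shift()).cumsum().rename('stop_id')
    # Keep the timestamp of the last event of each stop.
    df = df.assign(
        time=df.groupby(['train', 'station', stop_id])['time'].transform('last'))
    # Position of each movement within its stop: 0, 1, 2, ...
    df = df.assign(mvt_id=df.groupby(['train', 'station', 'time']).cumcount())
    wide = (df.set_index(['train', 'station', 'time', 'mvt_id'])['mvt']
              .unstack('mvt_id'))
    # Hypothetical naming scheme: mvt_1, mvt_2, ... for as many as occur.
    wide.columns = [f'mvt_{i + 1}' for i in wide.columns]
    return wide.reset_index()

# Hypothetical example: train 1 has three movements at station 1001.
example = pd.DataFrame({
    'train': [1, 1, 1, 1],
    'station': [1000, 1001, 1001, 1001],
    'time': pd.to_datetime(['20200525 13:30:00', '20200525 13:45:00',
                            '20200525 13:50:00', '20200525 13:55:00']),
    'mvt': [10, -1, 2, 5],
})
wide = merge_movements(example)
print(wide)
```

Renaming mvt_1 and mvt_2 to mvt_x and mvt_y afterwards would give the layout asked for in the question when no stop has more than two movements.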

1 Comment

Great! Now I'll have to test which solution is more efficient. :-)
