merge a single pandas dataframe multiple rows into one

Question

I have a time-series dataframe of train traffic data.

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

At each station a train either passes through, or some coaches are attached or detached. As this is time-series data, every event is on a separate row.

I have to merge the rows of the same train at the same station where 2 movements (mvt) happen one after the other (the second timestamp > the first timestamp), put the movements in 2 separate columns (mvt_x and mvt_y), and keep the timestamp of the last operation. For a single-row passage, mvt_y will always be NaN.

Here is the expected result:

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00     10    NaN
1      1     1001 2020-05-25 13:50:00     -1    2.0
2      2     1000 2020-05-25 13:35:00     20    NaN
3      1     1002 2020-05-25 14:10:00      0    NaN
4      2     1003 2020-05-25 14:00:00      0    NaN

Well, not really. I'm stuck. Of course I can solve it iteratively after sorting the dataset by train, time, station, but the dataset is rather huge (several million rows), so that would not be very efficient. I guess some kind of groupby would be involved.
– Gabor, Commented Jul 27, 2020 at 17:19
Could you specify what exactly your question is? Do you expect some code that converts any dataframe of the first format to the second one? Or is a general approach enough?
– Manumerous, Commented Jul 27, 2020 at 17:23
Are the rows you have to merge always ordered in a way that they are next to each other?
– Manumerous, Commented Jul 27, 2020 at 17:25
yes df.time = df.groupby(['train', 'station', 'time']).time.transform('last') would be a starting point BUT there's the issue of multiple visits to the same station... is there another column that can take the place of visitID? or do we have to build it by resetting time on every station change...?
– RichieV, Commented Jul 27, 2020 at 17:28
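
One possible way to build the visit ID mentioned in the last comment is sketched below. It assumes that a new visit starts whenever the station differs from the previous row of the same train; the frame `visits` and the column `visit_id` are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# hypothetical data: train 1 returns to station 1000 after visiting 1001
visits = pd.DataFrame({
    'train': [1, 1, 1, 1],
    'station': [1000, 1001, 1000, 1000],
    'time': pd.to_datetime(['2020-05-25 13:30:00',
                            '2020-05-25 13:45:00',
                            '2020-05-25 14:30:00',
                            '2020-05-25 14:35:00']),
})

visits = visits.sort_values(['train', 'time'])

# True where the station differs from the previous row of the same train
# (the first row of each train compares against NaN, so it is True as well)
changed = visits['station'].ne(visits.groupby('train')['station'].shift())

# running count of station changes per train = visit number
visits['visit_id'] = changed.astype(int).groupby(visits['train']).cumsum()

print(visits)
```

Here the two rows of the second stay at station 1000 share one `visit_id`, while the first stay at the same station gets a different one, so grouping by train, station and `visit_id` keeps the two visits apart.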

jsmart · Accepted Answer · 2020-07-27 17:36:51Z

Create the data frame

import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
    },
    columns=['train', 'station', 'time', 'mvt'])

Compute rank, to identify (train-station) pairs with 1 movement vs 2 movements. Then re-shape the data frame, using rank:

df['rank'] = df.groupby(['train', 'station'])['time'].rank().astype(int)

# re-shape the data frame - 'rank' is part of column label
x = (df.set_index(['train', 'station', 'rank'])
       .unstack(level='rank')
       .reset_index())

# find rows with a time with rank=2 ...
mask = x.loc[:, ('time', 2)].notna()

# ... and replace time-1 with time-2 (keep later time only)
x.loc[mask, ('time', 1)] = x.loc[mask, ('time', 2)]

# drop time-2
x = x.drop(columns=('time', 2))

# re-name columns
x.columns = ['train', 'station', 'time', 'mvt_x', 'mvt_y']

print(x)

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
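
For comparison, here is a rough sketch of the same reshaping done with named aggregation instead of unstack. It assumes, as in the sample data, at most two movements per (train, station) pair and no repeat visits to a station; the helper columns `mvt_last` and `n` and the name `out` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
})

# sort by time so 'first' and 'last' follow chronological order in each group
out = (df.sort_values('time')
         .groupby(['train', 'station'])
         .agg(time=('time', 'last'),
              mvt_x=('mvt', 'first'),
              mvt_last=('mvt', 'last'),
              n=('mvt', 'size'))
         .reset_index())

# the second movement only exists where the group has more than one row
out['mvt_y'] = out['mvt_last'].where(out['n'] > 1)
out = out.drop(columns=['mvt_last', 'n'])

print(out)
```

With this variant mvt_x keeps its integer dtype, since only mvt_y ever receives NaN.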

RichieV · Accepted Answer · 2020-07-27 17:55:33Z

Beat me to the punch... but here's code that handles multiple visits to the same station:

# sort by time so that consecutive rows of a train follow its route
df = df.sort_values(['train', 'time', 'station'])
# number the stops: the counter increases whenever the station changes,
# so a later return to the same station gets a new id
stopid = df.station.diff().ne(0).cumsum()
# change df.time to the last time of each stop
df.time = df.groupby(['train', 'station', stopid]).time.transform('last')

# create index for mvt on train_station groups
df = df.assign(mvt_id=df.groupby(['train', 'station', 'time']).cumcount())

# reshape df, similar to pivot
df = (
    df.set_index(['train', 'station', 'time', 'mvt_id'])
    .unstack('mvt_id').droplevel(0, axis=1)
    )
df.columns = ['mvt_x', 'mvt_y'] # hardcoded for only 2 movements per station
# might need a generator if expecting more than 2 mvts

df = df.reset_index()

print(df)

Output

   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
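
If more than two movements per stop can occur, the hardcoded column names could be generated from the mvt_id values instead. The following is a minimal sketch under that assumption; the sample frame `trips` and the `mvt_1`, `mvt_2`, ... naming are made up for illustration and differ from the mvt_x / mvt_y names asked for in the question.

```python
import pandas as pd

# hypothetical data: train 1 has three movements at station 1001
trips = pd.DataFrame({
    'train': [1, 1, 1, 1],
    'station': [1000, 1001, 1001, 1001],
    'time': pd.to_datetime(['2020-05-25 13:30:00',
                            '2020-05-25 13:45:00',
                            '2020-05-25 13:50:00',
                            '2020-05-25 13:55:00']),
    'mvt': [10, -1, 2, 5],
})

trips = trips.sort_values(['train', 'time', 'station'])

# stop counter that increases whenever the station changes
stopid = trips['station'].ne(trips['station'].shift()).cumsum()

# keep only the last timestamp of each stop
trips['time'] = (trips.groupby(['train', 'station', stopid])['time']
                      .transform('last'))

# position of each movement within its stop: 0, 1, 2, ...
trips['mvt_id'] = trips.groupby(['train', 'station', 'time']).cumcount()

wide = (trips.set_index(['train', 'station', 'time', 'mvt_id'])['mvt']
             .unstack('mvt_id'))

# build one column name per movement position instead of hardcoding two
wide.columns = [f'mvt_{i + 1}' for i in wide.columns]
wide = wide.reset_index()

print(wide)
```

Stops with fewer movements than the widest stop end up with NaN in the remaining columns, which is why the movement columns come out as floats.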


Great! Now I'll have to test which solution is more efficient. :-)
– Gabor, Commented over a year ago
