How to combine multiple rows into a single row with python pandas based on the values of multiple columns?

Question

I need to combine multiple rows into a single row. The original dataframe looks like this:

IndividualID    DayID    TripID    JourSequence   TripPurpose
200100000001    1        1         1              3
200100000001    1        2         2              31
200100000001    1        3         3              23
200100000001    1        4         4              5
200100000009    1        55        1              3
200100000009    1        56        2              12
200100000009    1        57        3              4
200100000009    1        58        4              6
200100000009    1        59        5              19
200100000009    1        60        6              2

I was trying to build some sort of 'trip chain', so basically all the journey sequences and trip purposes of one individual on a single day should be in the same row...

Ideally I was trying to convert the table to something like this:

IndividualID    DayID     Seq1   TripPurp1     Seq2   TripPurp2     Seq3   TripPurp3     Seq4   TripPurp4
200100000001    1         1      3             2      31            3      23            4      5
200100000009    1         1      3             2      12            3      4             4      6

If this is not possible, then the following mode would also be fine:

IndividualID    DayID      TripPurposes
200100000001    1          3, 31, 23, 5
200100000009    1          3, 12, 4, 6

Are there any possible solutions? I was thinking of a for loop or while statement, but maybe that is not really a good idea. Thanks in advance!

Possible duplicate of How to combine multiple rows into a single row with pandas
– McRist, Commented Aug 17, 2018 at 18:31
You have a different number of rows for different IDs. How do you want to handle the missing/extra columns? @McRist Not a dupe.
– DYZ, Commented Aug 17, 2018 at 18:33
I would check the maximum number of sequences of the individuals... hopefully no more than 10 sequences. For those having fewer than 10 sequences, is it possible to just leave it blank?
– Steward, Commented Aug 17, 2018 at 18:35
There is no such thing as 'blank'. It has to be a NaN, an empty string or something else.
– DYZ, Commented Aug 17, 2018 at 18:37

Scott Boston · Accepted Answer · 2018-08-17 20:46:47Z

You can try:

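# Number each trip within an (IndividualID, DayID) group, then pivot that counter into columns
trip_no = df.groupby(['IndividualID', 'DayID']).cumcount() + 1
df_out = df.set_index(['IndividualID', 'DayID', trip_no]).unstack().sort_index(level=1, axis=1)
# Flatten the (column, trip number) MultiIndex into names such as TripPurpose_1
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()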

Output:

   IndividualID  DayID  JourSequence_1  TripID_1  TripPurpose_1  \
0  200100000001      1             1.0       1.0            3.0   
1  200100000009      1             1.0      55.0            3.0   

   JourSequence_2  TripID_2  TripPurpose_2  JourSequence_3  TripID_3  \
0             2.0       2.0           31.0             3.0       3.0   
1             2.0      56.0           12.0             3.0      57.0   

   TripPurpose_3  JourSequence_4  TripID_4  TripPurpose_4  JourSequence_5  \
0           23.0             4.0       4.0            5.0             NaN   
1            4.0             4.0      58.0            6.0             5.0   

   TripID_5  TripPurpose_5  JourSequence_6  TripID_6  TripPurpose_6  
0       NaN            NaN             NaN       NaN            NaN  
1      59.0           19.0             6.0      60.0            2.0
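
The values above print as floats because unstacking fills the shorter trip chain with NaN, which forces integer columns to float. The following is a minimal sketch of the same idea that keeps whole numbers, assuming pandas 1.0 or later for the nullable `Int64` dtype; the names `keys`, `pos` and `wide` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'IndividualID': [200100000001] * 4 + [200100000009] * 6,
    'DayID': [1] * 10,
    'TripID': [1, 2, 3, 4, 55, 56, 57, 58, 59, 60],
    'JourSequence': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
    'TripPurpose': [3, 31, 23, 5, 3, 12, 4, 6, 19, 2],
})

keys = ['IndividualID', 'DayID']
# Position of each trip within its (individual, day) group, starting at 1
pos = df.groupby(keys).cumcount() + 1

wide = (
    df.set_index(keys + [pos])      # index: individual, day, trip position
      .unstack()                    # trip position moves into the columns
      .sort_index(level=1, axis=1)  # group the columns by trip position
)
# Flatten the (column name, position) pairs into names like TripPurpose_1
wide.columns = [f'{name}_{n}' for name, n in wide.columns]
# Nullable integers keep 3 as 3 and mark missing trips as <NA>
wide = wide.astype('Int64').reset_index()

print(wide)
```

If an empty string is preferred over `<NA>` for display, the frame could be converted with `wide.astype(object).fillna('')`, at the cost of losing numeric dtypes.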

It_is_Chris · Accepted Answer · 2018-08-17 20:51:43Z

To get your second output you just need to groupby and apply list:

df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list)

                      TripPurpose
IndividualID  DayID 
200100000001    1   [3, 31, 23, 5]
200100000009    1   [3, 12, 4, 6, 19, 2]
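
The question shows the purposes as a comma-separated string rather than a Python list. A small sketch of that variant, rebuilding only the columns it needs from the sample data; the column name `TripPurposes` is taken from the question and `purposes` is an illustrative variable name.

```python
import pandas as pd

# Relevant columns of the sample data from the question
df = pd.DataFrame({
    'IndividualID': [200100000001] * 4 + [200100000009] * 6,
    'DayID': [1] * 10,
    'TripPurpose': [3, 31, 23, 5, 3, 12, 4, 6, 19, 2],
})

purposes = (
    df.groupby(['IndividualID', 'DayID'])['TripPurpose']
      # Convert each group's values to text and join them with ", "
      .agg(lambda s: ', '.join(s.astype(str)))
      .reset_index(name='TripPurposes')
)

print(purposes)
```

Note that the second individual has six trips in the sample data, so its string contains six purposes, not the four shown in the question's desired output.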

To get your first output, you can do something like this (probably not the best approach):

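import pandas as pd

# One list per (IndividualID, DayID); apply(pd.Series) spreads each list over numbered columns
df2 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list))
trip = df2['TripPurpose'].apply(pd.Series).rename(columns=lambda x: 'TripPurpose' + str(x + 1))
df3 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['JourSequence'].apply(list))
seq = df3['JourSequence'].apply(pd.Series).rename(columns=lambda x: 'seq' + str(x + 1))
# Both frames are indexed by (IndividualID, DayID), so merge on those index level names
pd.merge(trip, seq, on=['IndividualID', 'DayID'])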

Note that the output is not sorted: all the TripPurpose columns come first, followed by all the seq columns.
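
One way to get the interleaved column order from the question (Seq1, TripPurp1, Seq2, TripPurp2, ...) is to build the two blocks separately and then reorder the columns by zipping their names. This is only a sketch on the sample data: the helper `expand` and the names `order` and `chain` are hypothetical, and shorter trip chains are padded with NaN.

```python
import pandas as pd

# Relevant columns of the sample data from the question
df = pd.DataFrame({
    'IndividualID': [200100000001] * 4 + [200100000009] * 6,
    'DayID': [1] * 10,
    'JourSequence': [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],
    'TripPurpose': [3, 31, 23, 5, 3, 12, 4, 6, 19, 2],
})

keys = ['IndividualID', 'DayID']


def expand(column, prefix):
    """Spread one column into prefix1, prefix2, ... with one row per group."""
    lists = df.groupby(keys)[column].apply(list)
    # Rows built from shorter lists are padded with NaN
    wide = pd.DataFrame(lists.tolist(), index=lists.index)
    wide.columns = [f'{prefix}{i + 1}' for i in wide.columns]
    return wide


seq = expand('JourSequence', 'Seq')
trip = expand('TripPurpose', 'TripPurp')

# Interleave the names: Seq1, TripPurp1, Seq2, TripPurp2, ...
order = [name for pair in zip(seq.columns, trip.columns) for name in pair]
chain = seq.join(trip)[order].reset_index()

print(chain)
```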
