
I have a large pandas DataFrame with two columns, ride_ID and person_ID:

ride_ID  person_ID  
 ride_1    person1   
 ride_1    person2    
 ride_1    person3    
 ride_2    person1    
 ride_2    person4    
 ride_3    person1    
 ride_3    person5    
 ride_3    person2    
 ride_3    person3  
 .....     ......
 .....     ......

Each unique ride_ID can have any number of person_ID values, for example 2, 20, or 100. I want to group by ride_ID so that the person_ID values are spread across multiple columns named person_ID1 through person_IDn. Expected output:

ride_ID  person_ID1  person_ID2  person_ID3  person_ID4  person_ID5  .......  person_IDn
 ride_1   person1     person2     person3     NaN         NaN         ......
 ride_2   person1     NaN         NaN         person4     NaN         ......
 ride_3   person1     person2     person3     NaN         person5
  • How do you relate "person_ID1" with "person1"? Is it the suffix "1"? Does "person_ID" always have that format? Commented Nov 17, 2022 at 7:21
  • @AzharKhan The column names would be based on the maximum number of persons for any unique ride_ID. Let's say ride_44 has the maximum number of persons, which is 50; then the column names will range from person_ID1 to person_ID50, and for each ride the corresponding persons will be marked. Commented Nov 17, 2022 at 7:25
  • How do you relate "person_ID" with the value in that column? Commented Nov 17, 2022 at 7:26

1 Answer


You can use pivot(). First create a helper column "person_IDx" that numbers the rows within each "ride_ID" group serially as "person_ID1, person_ID2, ..., person_IDn":

import pandas as pd

df = pd.DataFrame(
    data=[["ride_1", "person1"], ["ride_1", "person2"], ["ride_1", "person3"],
          ["ride_2", "person1"], ["ride_2", "person4"],
          ["ride_3", "person1"], ["ride_3", "person5"], ["ride_3", "person2"], ["ride_3", "person3"]],
    columns=["ride_ID", "person_ID"])

# Running count 1, 2, ... within each ride, turned into a column label
df["person_IDx"] = 1
df["person_IDx"] = df.groupby("ride_ID")["person_IDx"].transform("cumsum").apply(lambda x: f"person_ID{x}")

# One row per ride, one column per label
df = df.pivot(index="ride_ID", columns="person_IDx", values="person_ID").reset_index().rename_axis(columns={"person_IDx": ""})

[Out]:
  ride_ID person_ID1 person_ID2 person_ID3 person_ID4
0  ride_1    person1    person2    person3        NaN
1  ride_2    person1    person4        NaN        NaN
2  ride_3    person1    person5    person2    person3
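
Two caveats, with a rough sketch for each. First, pivot() sorts string labels lexicographically, so once a ride has ten or more persons, "person_ID10" would sort before "person_ID2". Keeping the position as an integer until after the pivot avoids that. Second, this output fills the columns by position within the ride, whereas the expected output in the question puts person4 under person_ID4. If that layout is what is wanted, one option is to use the numeric suffix of each value as the column. This assumes every value looks like "person<k>", which the question does not confirm. The names slot, num, by_position and by_person are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ride_ID": ["ride_1"] * 3 + ["ride_2"] * 2 + ["ride_3"] * 4,
        "person_ID": ["person1", "person2", "person3",
                      "person1", "person4",
                      "person1", "person5", "person2", "person3"],
    }
)

# Variant 1: number each person by position within its ride.
# cumcount() is 0-based, so add 1. The slot stays an integer through
# pivot(), so the columns come out in numeric order.
slot = df.groupby("ride_ID").cumcount() + 1
by_position = (
    df.assign(slot=slot)
      .pivot(index="ride_ID", columns="slot", values="person_ID")
      .add_prefix("person_ID")
      .rename_axis(None, axis=1)
      .reset_index()
)

# Variant 2 (assumption: values always look like "person<k>"):
# use the numeric suffix k as the column, as in the question's layout.
num = df["person_ID"].str.extract(r"(\d+)$", expand=False).astype(int)
by_person = (
    df.assign(num=num)
      .pivot(index="ride_ID", columns="num", values="person_ID")
      .add_prefix("person_ID")
      .rename_axis(None, axis=1)
      .reset_index()
)

print(by_position)
print(by_person)
```

Note that pivot() raises a ValueError in the second variant if the same person appears twice in one ride, so duplicates would need to be dropped first.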