5

I have a dataset 'df' with 3 columns.

>> Original Data

    Student Id    Name  Marks
0       id_1    John    112
1       id_2    Rafs    181
2       id_2    Rafs    182
3       id_2    Rafs    183
4       id_3    Juan    222
5       id_3    Juan    312
6       id_3  Roller     21

Trying to keep the columns 'Student_Id' and 'Name' as it is but convert 'Marks' to multiple columns. Such that each unique 'Student_Id' and 'Name' will have a row of all possible Marks. Also we need not create columns manually but it should be dynamically created depending on the values.

>> Expected Output

    Student Id    Name  Marks1  Marks2  Marks3
0       id_1    John     112    <NA>    <NA>
1       id_2    Rafs     181     182     183
2       id_3    Juan     222     312    <NA>
3       id_3  Roller      21    <NA>    <NA>

Sample data to replicate the input

import pandas as pd

data = [
    ["id_1", 'John', 112],
    ["id_2", 'Rafs', 181],
    ["id_2", 'Rafs', 182],
    ["id_2", 'Rafs', 183], 
    ["id_3", 'Juan', 222],
    ["id_3", 'Juan', 312],
    ["id_3", 'Roller', 21]
]
df = pd.DataFrame(data, columns = ['Student Id', 'Name', 'Marks'])

I tried the below but I am not getting the desired output. It gives results in brackets() an also the Marks is missing.

df3 = df.pivot_table(index=['Student Id','Name'], columns='Marks', aggfunc = 'max')

>>Output
Empty DataFrame
Columns: []
Index: [(id_1, John), (id_2, Rafs), (id_3, Juan), (id_3, Roller)]

3 Answers 3

7

Use GroupBy.cumcount for new column for counter columns created MultiIndex in df3:

df['g'] = df.groupby(['Student Id','Name']).cumcount().add(1)

df3 = (df.pivot_table(index=['Student Id','Name'], 
                     columns='g', 
                     values='Marks', 
                     aggfunc = 'max')
        .add_prefix('Marks')
        .rename_axis(None, axis=1)
        .reset_index())
print (df3)
  Student Id    Name  Marks1  Marks2  Marks3
0       id_1    John   112.0     NaN     NaN
1       id_2    Rafs   181.0   182.0   183.0
2       id_3    Juan   222.0   312.0     NaN
3       id_3  Roller    21.0     NaN     NaN

If need integers with missing values:

df['g'] = df.groupby(['Student Id','Name']).cumcount().add(1)

df3 = (df.pivot_table(index=['Student Id','Name'], 
                     columns='g', 
                     values='Marks', 
                     aggfunc = 'max')
        .add_prefix('Marks')
        .astype('Int64')
        .rename_axis(None, axis=1)
        .reset_index())
print (df3)
  Student Id    Name  Marks1  Marks2  Marks3
0       id_1    John     112    <NA>    <NA>
1       id_2    Rafs     181     182     183
2       id_3    Juan     222     312    <NA>
3       id_3  Roller      21    <NA>    <NA>
Sign up to request clarification or add additional context in comments.

Comments

2

another way:

temp = df.groupby(["Student Id", "Name"]).Marks.agg(list)

out = (pd.DataFrame(temp.tolist(), index=temp.index)
           .rename(columns=lambda x: f"Marks{x+1}")
           .reset_index())

temp will be a dataframe with aggregated lists of Marks per (id, name). Then we form a dataframe out of it while renaming the columns to desired format and resetting the index to put id & name to columns back.

to get

  Student Id    Name  Marks1  Marks2  Marks3
0       id_1    John     112     NaN     NaN
1       id_2    Rafs     181   182.0   183.0
2       id_3    Juan     222   312.0     NaN
3       id_3  Roller      21     NaN     NaN

Comments

1

Here is a much easy to understand answer without creating an extra column

#Grouping by Studend Id and Name
new_df = df.groupby(['Student Id','Name'])['Marks'].apply(list).reset_index()

#Now, in the marks columns, we have a list (as seen below).
#So, we convert the list into different columns, using pd.Series

#   Student Id  Name    Marks
# 0 id_1    John    [112]
# 1 id_2    Rafs    [181, 182, 183]
# 2 id_3    Juan    [222, 312]
# 3 id_4    Roller  [21]
temp_df = new_df['Marks'].apply(pd.Series)

#Now, this is all decorative stuff. 

#Converting the column names from 0,1,2 to Marks1, Marks2, Marks3
temp_df.columns = list(map(lambda x: 'Marks'+str(x+1), temp_df.columns))

# Assigning this new temporary df to the original df
new_df[temp_df.columns] = temp_df

#Dropping the Marks column
final_df = new_df.drop('Marks',axis=1)
print(final_df)

Output:

    Student Id  Name    Marks1  Marks2  Marks3
0   id_1    John    112.0   NaN NaN
1   id_2    Rafs    181.0   182.0   183.0
2   id_3    Juan    222.0   312.0   NaN
3   id_4    Roller  21.0    NaN NaN

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.