I have a large class of students separated into sections, with each student having a unique ID. I have the entire roster stored in a dataframe. I also have multiple dataframes, each representing the grades of one section of students on one assignment. I would like to merge all of this information into a single dataframe that represents my gradebook. For example:

import pandas as pd

# Initialize roster
data = [['ab10', 'Ann Big'], ['ca9', 'Carl Ahn'], ['jb19', 'John Brown'], ['cf25', 'Carol Fox']]
roster = pd.DataFrame(data, columns = ['ID', 'Name'])

# Initialize the section grades
data = [['ab10', 95], ['ca9', 72]]
grades0 = pd.DataFrame(data, columns = ['ID', 'Exp1'])

data = [['ab10', 83], ['ca9', 97]]
grades1 = pd.DataFrame(data, columns = ['ID', 'Exp2'])

data = [['jb19', 61], ['cf25', 95]]
grades2 = pd.DataFrame(data, columns = ['ID', 'Exp1'])

# Now merge the section grades with the roster to generate final gradebook
roster = roster.merge(grades0, on = 'ID', how = 'outer')
roster = roster.merge(grades1, on = 'ID', how = 'outer')
roster = roster.merge(grades2, on = 'ID', how = 'outer')

print(roster)

This code generates the following:

     ID        Name  Exp1_x  Exp2  Exp1_y
0  ab10     Ann Big    95.0  83.0     NaN
1   ca9    Carl Ahn    72.0  97.0     NaN
2  jb19  John Brown     NaN   NaN    61.0
3  cf25   Carol Fox     NaN   NaN    95.0

I don't want the duplicated Exp1 columns with the suffixes _x and _y. Instead I want:

     ID        Name    Exp1  Exp2
0  ab10     Ann Big    95.0  83.0 
1   ca9    Carl Ahn    72.0  97.0
2  jb19  John Brown    61.0   NaN
3  cf25   Carol Fox    95.0   NaN

There should be no duplicated data between the grade dataframes (but it would be good practice to raise an error were an overlap to exist).

2 Answers

reduce with combine_first

As there is no duplication between the grades dataframes, we can reduce with combine_first to combine all the dataframes together:

from functools import reduce

reduce(pd.DataFrame.combine_first, 
      [g.set_index('ID') for g in (roster, grades0, grades1, grades2)])

      Exp1  Exp2        Name
ID                          
ab10  95.0  83.0     Ann Big
ca9   72.0  97.0    Carl Ahn
cf25  95.0   NaN   Carol Fox
jb19  61.0   NaN  John Brown
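The question also asks for an error when two grade dataframes overlap, and combine_first on its own silently keeps the first non-null value it sees. A minimal sketch of one way to add that check is shown here; build_gradebook is a hypothetical helper name (not part of pandas), and the sketch assumes every grade dataframe has an ID column plus one or more assignment columns. It also restores the roster's row order, since combine_first returns the rows and columns sorted.

```python
from functools import reduce

import pandas as pd


def build_gradebook(roster, grade_frames):
    """Hypothetical helper: combine grade frames into the roster, raising on overlap."""
    # Long format: one row per (ID, assignment, grade) across all sections.
    long = pd.concat(
        [g.melt(id_vars="ID", var_name="Assignment", value_name="Grade")
         for g in grade_frames],
        ignore_index=True,
    ).dropna(subset=["Grade"])

    # An overlap means one student has two grades for the same assignment.
    dupes = long[long.duplicated(subset=["ID", "Assignment"], keep=False)]
    if not dupes.empty:
        raise ValueError(f"Overlapping grades found:\n{dupes}")

    frames = [roster.set_index("ID")] + [g.set_index("ID") for g in grade_frames]
    combined = reduce(pd.DataFrame.combine_first, frames)

    # Restore the roster's row order. IDs that are missing from the roster
    # are dropped by this reindex.
    combined = combined.reindex(roster["ID"].tolist()).rename_axis("ID")

    # Put the roster columns first, followed by the assignment columns.
    grade_cols = [c for c in combined.columns if c not in roster.columns]
    return combined.reset_index()[list(roster.columns) + grade_cols]


# Sample data from the question.
roster = pd.DataFrame(
    [["ab10", "Ann Big"], ["ca9", "Carl Ahn"], ["jb19", "John Brown"], ["cf25", "Carol Fox"]],
    columns=["ID", "Name"],
)
grades0 = pd.DataFrame([["ab10", 95], ["ca9", 72]], columns=["ID", "Exp1"])
grades1 = pd.DataFrame([["ab10", 83], ["ca9", 97]], columns=["ID", "Exp2"])
grades2 = pd.DataFrame([["jb19", 61], ["cf25", 95]], columns=["ID", "Exp1"])

gradebook = build_gradebook(roster, [grades0, grades1, grades2])
print(gradebook)
```

Checking for overlap in long format keeps the validation independent of how the frames are combined afterwards, so the same check could sit in front of a different combining step.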

2 Comments

combine_first is exactly what I needed! Many thanks.
@Melissa Pleased to help!

I like using pd.concat() with .groupby() for these cases. I think it reads more cleanly, saves a couple of lines of code, and is probably more efficient too (as you won't be making multiple merges). Replace your merge lines with:

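# Select the grade columns with a list (double brackets); recent pandas versions reject the tuple form ['Exp1','Exp2']
roster = pd.concat([roster, grades0, grades1, grades2]).groupby('ID')[['Exp1', 'Exp2']].sum().merge(roster, on='ID')
print(roster)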

Which outputs:

     ID  Exp1  Exp2        Name
0  ab10  95.0  83.0     Ann Big
1   ca9  72.0  97.0    Carl Ahn
2  cf25  95.0   0.0   Carol Fox
3  jb19  61.0   0.0  John Brown

You can then reorder the columns as you prefer. If you would rather have NaN than 0, add .replace(0, np.nan) after the merge(). This needs import numpy as np, and note that it also turns any genuine grade of 0 into NaN.

     ID  Exp1  Exp2        Name
0  ab10  95.0  83.0     Ann Big
1   ca9  72.0  97.0    Carl Ahn
2  cf25  95.0   NaN   Carol Fox
3  jb19  61.0   NaN  John Brown

1 Comment

Thanks, but what if I do not know the names of the assignments a priori? I need to avoid hardcoding ['Exp1','Exp2'].
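One possible way to avoid hardcoding the assignment names is to skip the column selection and let groupby aggregate every non-key column. This is a sketch that assumes each grade dataframe holds only ID plus numeric assignment columns. Passing min_count=1 to sum() keeps NaN for students with no grade at all, so no replace step is needed.

```python
import pandas as pd

# Sample data from the question.
roster = pd.DataFrame(
    [["ab10", "Ann Big"], ["ca9", "Carl Ahn"], ["jb19", "John Brown"], ["cf25", "Carol Fox"]],
    columns=["ID", "Name"],
)
grades0 = pd.DataFrame([["ab10", 95], ["ca9", 72]], columns=["ID", "Exp1"])
grades1 = pd.DataFrame([["ab10", 83], ["ca9", 97]], columns=["ID", "Exp2"])
grades2 = pd.DataFrame([["jb19", 61], ["cf25", 95]], columns=["ID", "Exp1"])

# Stack only the grade frames; an ID may appear in several rows.
stacked = pd.concat([grades0, grades1, grades2], ignore_index=True)

# Aggregate every column other than ID, so no assignment names are listed.
# min_count=1 makes sum() return NaN instead of 0 when all values are missing.
# Caveat: overlapping grades would be silently added together here.
totals = stacked.groupby("ID").sum(min_count=1)

# A left merge keeps the roster's row order and any students without grades.
gradebook = roster.merge(totals.reset_index(), on="ID", how="left")
print(gradebook)
```

Starting the merge from the roster also places the ID and Name columns first, so no column reordering is needed afterwards.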
