I'm not sure how to describe this question precisely, so I'll give some detail and a reproducible example below.
Basically I have two columns and many rows in a Pandas dataframe, and I want to be able to do a transformation where I construct new columns that indicate the presence of at least one value for a given unit.
For example, let's say I have a pandas dataframe with two columns: students and the classes they have taken. Let's say I also have a lookup that maps each class to a subject. I want to create a new dataframe that has one column for student_id and one column for each subject. Each subject column tells me whether the student has taken at least one class in that subject (so the final table is unique at the student_id level). For example:
import pandas as pd
s = {'student_id': pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']),
     'classes': pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 'Algebra',
                           'Intro to Java', 'Chinese 101'])}
c = {'subject': pd.Series(['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages']),
     'classes': pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry',
                           'Intro to Java', 'Chinese 101'])}
students = pd.DataFrame(s, columns=['student_id', 'classes'])
# The lookup table printed below as `classes` was never actually built in my original snippet
classes = pd.DataFrame(c, columns=['subject', 'classes'])
The two dataframes look like this (sorry, I'm not sure how to create tables on Stack Overflow, so I've just pasted the output as code):
students
   student_id          classes
0           A          Algebra
1           A         Geometry
2           A         Topology
3           B  Intro to Python
4           B          Biology
5           B        Chemistry
6           C          Algebra
7           C    Intro to Java
8           C      Chinese 101
classes
     subject          classes
0       Math          Algebra
1       Math         Geometry
2       Math         Topology
3         CS  Intro to Python
4    Science          Biology
5    Science        Chemistry
6         CS    Intro to Java
7  Languages      Chinese 101
Now, I want to create a new dataframe that is a transformation of the students dataframe, with one new column for each subject in the classes dataframe. To be more precise, I would like a new dataframe, perhaps called student_classes, that is unique at the student_id level and has a value of 1 in a subject's column if the student has taken at least one class in that subject, and 0 otherwise. Following this example, I would like:
   student_id  Math  CS  Science  Languages
0           A     1   0        0          0
1           B     0   1        1          0
2           C     1   1        0          1
Here is what I have done, which solves this particular example. The problem is that my actual data has nothing to do with students and the dataframes are much bigger, which makes the following solution very slow and memory intensive. In fact, my IPython Notebook raises a memory error on my bigger tables.
First, I create a dictionary of dictionaries, one inner dictionary per subject:
classes_subject_dict = {
    'Math': {'Algebra': 1, 'Geometry': 1, 'Topology': 1},
    'CS': {'Intro to Python': 1, 'Intro to Java': 1},
    'Science': {'Biology': 1, 'Chemistry': 1},
    'Languages': {'Chinese 101': 1},
}
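Writing that nested dictionary by hand is fine for four subjects but clearly won't scale to my real data. A minimal sketch of building the same structure from the class-to-subject lookup, assuming the lookup is available as a dataframe with `subject` and `classes` columns (the name `lookup` is only for illustration):

```python
import pandas as pd

# Same class-to-subject lookup as in the example (illustrative name).
lookup = pd.DataFrame({
    'subject': ['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Intro to Java', 'Chinese 101'],
})

# One inner dict per subject, mapping each of its classes to the flag value 1.
classes_subject_dict = {
    subject: dict.fromkeys(group, 1)
    for subject, group in lookup.groupby('subject')['classes']
}
```

This only automates the setup; the per-subject `map` calls that follow are unchanged.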
Then I loop through the keys in the dictionary and use the map method of the classes column to put a 1 in the column for each subject wherever a class from that subject appears (every other row gets NaN):
for key in classes_subject_dict:
    students[key] = students.classes.map(classes_subject_dict[key])
Then I take the max value within each subject column per student, drop the classes column, drop duplicates, and fill the remaining NaNs with 0 to get my final table:
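for key in classes_subject_dict:
    # Broadcast each student's max flag to all of that student's rows
    students[key] = students.groupby('student_id')[key].transform('max')
# Keyword form; the positional axis argument (drop('classes', 1)) is rejected by recent pandas
students = students.drop(columns='classes')
students = students.drop_duplicates()
students = students.fillna(0)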
students
   student_id  CS  Languages  Math  Science
0           A   0          0     1        0
3           B   1          0     0        1
6           C   1          1     1        0
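For what it's worth, the transform / drop / drop_duplicates steps can probably be collapsed into a single groupby aggregation. This is only a sketch on the toy data (the names `flags` and `student_flags` are mine), and it still relies on one `map` call per subject, so I don't expect it to fix the underlying slowness:

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Algebra', 'Intro to Java', 'Chinese 101'],
})
classes_subject_dict = {
    'Math': {'Algebra': 1, 'Geometry': 1, 'Topology': 1},
    'CS': {'Intro to Python': 1, 'Intro to Java': 1},
    'Science': {'Biology': 1, 'Chemistry': 1},
    'Languages': {'Chinese 101': 1},
}

# Same mapping step as before: 1 where the class belongs to the subject, NaN elsewhere.
flags = students.copy()
for subject, class_flags in classes_subject_dict.items():
    flags[subject] = flags['classes'].map(class_flags)

# Aggregate straight to one row per student instead of transform + drop_duplicates.
student_flags = (
    flags.drop(columns='classes')
         .groupby('student_id', as_index=False)
         .max()                      # NaN is skipped; all-NaN groups stay NaN
         .fillna(0)
         .astype({subject: int for subject in classes_subject_dict})
)
```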
Again, this works well for this simple example, but my actual data is much bigger in both length and width. While my actual data doesn't have anything to do with students, the analogous description would be something like 300 "subjects" and hundreds of thousands of "students". I noticed that the map step in particular is really slowing my code down, and I was wondering if there is a more efficient way of doing this.
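To frame what I mean by "more efficient", one direction that might avoid the per-subject loop entirely is to join the lookup onto the long table once and then cross-tabulate. This is a minimal sketch on the toy data, not something I have verified at the real scale. It assumes every class appears exactly once in the lookup (otherwise the merge would duplicate rows), and the names `lookup`, `merged` and `student_classes` are only illustrative:

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Algebra', 'Intro to Java', 'Chinese 101'],
})
lookup = pd.DataFrame({
    'subject': ['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Intro to Java', 'Chinese 101'],
})

# Attach a subject to every (student, class) row in a single join.
# Classes missing from the lookup are dropped by this inner merge.
merged = students.merge(lookup, on='classes')

# Count classes per (student, subject), then cap the counts at 1 to get 0/1 flags.
student_classes = (
    pd.crosstab(merged['student_id'], merged['subject'])
      .clip(upper=1)
      .rename_axis(columns=None)   # drop the leftover 'subject' axis label
      .reset_index()
)
```

On the toy data this gives one row per student with CS, Languages, Math and Science columns (crosstab sorts the subjects alphabetically). A student whose classes are all missing from the lookup would not appear in the result at all, which may or may not be what is wanted.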