I'm not sure how to describe the question precisely, so I'll add more detail below and give a reproducible example.

I have a pandas dataframe with two columns and many rows, and I want to transform it so that new columns indicate whether at least one value is present for a given unit.

For example, let's say I have a pandas dataframe of two columns: students and the classes they have taken. Let's say I also have a dictionary that maps each class to a subject. I want to create a new dataframe that has one column for student_id and one column for each subject. Each subject column tells me whether the student has taken at least one class in that subject (so the final table is unique at the student_id level). For example:

import pandas as pd
s = {'student_id' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 'Algebra',
                            'Intro to Java', 'Chinese 101'])}
c = {'subject' : pd.Series(['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry',
                            'Intro to Java', 'Chinese 101'])}
students = pd.DataFrame(s, columns = ['student_id', 'classes'])
classes = pd.DataFrame(c, columns = ['subject', 'classes'])

The two dataframes look like this (sorry, I'm not sure how to create tables on Stack Overflow, so I've put them in as code):

students

  student_id          classes
0          A          Algebra
1          A         Geometry
2          A         Topology
3          B  Intro to Python
4          B          Biology
5          B        Chemistry
6          C          Algebra
7          C    Intro to Java
8          C      Chinese 101

classes

     subject          classes
0       Math          Algebra
1       Math         Geometry
2       Math         Topology
3         CS  Intro to Python
4    Science          Biology
5    Science        Chemistry
6         CS    Intro to Java
7  Languages      Chinese 101

Now, I want to create a new dataframe that transforms the students dataframe by adding a column for each subject in the classes dataframe. More precisely, I would like a new dataframe, perhaps called student_classes, that is unique at the student_id level and has a value of 1 in a subject's column if the student has taken at least one class in that subject. Following this example, I would like:

  student_id  Math  CS  Science  Languages
0          A     1   0        0          0
1          B     0   1        1          0
2          C     1   1        0          1

Here is what I have done, which solves this particular example. The problem is that my actual data has nothing to do with students, and the dataframes are much bigger, which makes the following solution very slow and memory-intensive. In fact, my IPython Notebook raises a memory error on my bigger tables.

What I have actually done is create a dictionary of dictionaries:

classes_subject_dict={'Math': {'Algebra':1,
                               'Geometry':1,
                               'Topology':1,
                              },
                      'CS': {'Intro to Python':1,
                             'Intro to Java':1,
                            },
                      'Science':{'Biology':1,
                                 'Chemistry':1,
                                },
                      'Languages':{'Chinese 101':1
                                  }
                     }
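
With a few hundred subjects, typing this nested dictionary by hand gets tedious. A minimal sketch of building the same structure from the class-to-subject pairs, assuming they are available as two parallel lists (copied here from `c` above); the names `class_names`, `subject_names`, and `built` are illustrative and not part of the original code:

```python
from collections import defaultdict

# Class-to-subject pairs, copied from the `c` dictionary above.
class_names = ['Algebra', 'Geometry', 'Topology', 'Intro to Python',
               'Biology', 'Chemistry', 'Intro to Java', 'Chinese 101']
subject_names = ['Math', 'Math', 'Math', 'CS',
                 'Science', 'Science', 'CS', 'Languages']

# Group classes under their subject: {subject: {class: 1, ...}, ...}
built = defaultdict(dict)
for class_name, subject_name in zip(class_names, subject_names):
    built[subject_name][class_name] = 1

built = dict(built)  # plain dict with the same shape as classes_subject_dict
```

The result has the same shape as `classes_subject_dict`, so it could be used in the loop that follows without other changes.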

Then, I loop through the keys in the dictionary and use the map method (function? I'm not sure what the technical term is here) to map the value of 1 to the column defined by the subject if an appropriate class appeared:

for key in classes_subject_dict.keys():
    students[key]=students.classes.map(classes_subject_dict[key])

Then, I take the max value within each column for each student, drop the classes column, and drop duplicates to get my final table:

for key in classes_subject_dict.keys():
    students[key] = students.groupby('student_id')[key].transform('max')

# drop('classes', 1) relies on a positional axis argument that pandas 2.0 removed
students = students.drop(columns='classes')
students = students.drop_duplicates()
students = students.fillna(0)

students

   student_id   CS  Languages   Math    Science
0   A           0   0            1       0
3   B           1   0            0       1
6   C           1   1            1       0

Again, this works well for this simple example, but my actual data is much bigger in both length and width. While my actual data doesn't really have anything to do with students, the analogous description would be that I have something like 300 "subjects" and hundreds of thousands of "students". I noticed that using the map method is really slowing my code down, and I was wondering if there is a more efficient way of doing this.

  • There is an answered question very similar to this. Let me find it... Commented Jan 27, 2016 at 5:28
  • Here, found it: stackoverflow.com/questions/33553765/… Commented Jan 27, 2016 at 5:32

1 Answer

You can use merge, crosstab and then astype:

import pandas as pd
s = {'student_id' : pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry', 'Algebra',
                            'Intro to Java', 'Chinese 101'])}
c = {'subject' : pd.Series(['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages']),
     'classes' : pd.Series(['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology', 'Chemistry',
                            'Intro to Java', 'Chinese 101'])}
students = pd.DataFrame(s, columns = ['student_id', 'classes'])
classes = pd.DataFrame(c, columns = ['subject', 'classes'])
print(students)
  student_id          classes
0          A          Algebra
1          A         Geometry
2          A         Topology
3          B  Intro to Python
4          B          Biology
5          B        Chemistry
6          C          Algebra
7          C    Intro to Java
8          C      Chinese 101

print(classes)
     subject          classes
0       Math          Algebra
1       Math         Geometry
2       Math         Topology
3         CS  Intro to Python
4    Science          Biology
5    Science        Chemistry
6         CS    Intro to Java
7  Languages      Chinese 101

df = pd.merge(students, classes, on=['classes'])
print(df)
  student_id          classes    subject
0          A          Algebra       Math
1          C          Algebra       Math
2          A         Geometry       Math
3          A         Topology       Math
4          B  Intro to Python         CS
5          B          Biology    Science
6          B        Chemistry    Science
7          C    Intro to Java         CS
8          C      Chinese 101  Languages
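
One thing to watch: the default inner merge silently drops any row of students whose class has no entry in classes. A hedged sketch of one way to check for that, using a left merge with `indicator=True`; the small frames `students_demo` and `classes_demo` and the unmapped class 'Pottery' are made up for illustration:

```python
import pandas as pd

# Toy data (hypothetical): 'Pottery' has no subject in the lookup table.
students_demo = pd.DataFrame({'student_id': ['A', 'B'],
                              'classes': ['Algebra', 'Pottery']})
classes_demo = pd.DataFrame({'subject': ['Math'],
                             'classes': ['Algebra']})

# A left merge keeps every student row; `_merge` records where each row matched.
checked = pd.merge(students_demo, classes_demo, on='classes',
                   how='left', indicator=True)

# Rows found only in the left frame are classes without a subject.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'classes'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge above loses nothing.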

df = pd.crosstab(df['student_id'], df['subject'])
print(df)
subject     CS  Languages  Math  Science
student_id                              
A            0          0     3        0
B            1          0     0        2
C            1          1     1        0

df = (df > 0)
print(df)
subject        CS Languages   Math Science
student_id                                
A           False     False   True   False
B            True     False  False    True
C            True      True   True   False

df = (df > 0).astype(int)
print(df)
subject     CS  Languages  Math  Science
student_id                              
A            0          0     1        0
B            1          0     0        1
C            1          1     1        0
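
Since only presence matters, the counting and thresholding can also be avoided by de-duplicating the (student, subject) pairs before the crosstab. This is a sketch of that variation on the same example data, not a benchmarked solution; the names `pairs` and `indicator` are illustrative, and the `int8` cast is an optional assumption aimed at keeping a wide result small:

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Algebra', 'Intro to Java', 'Chinese 101'],
})
classes = pd.DataFrame({
    'subject': ['Math', 'Math', 'Math', 'CS', 'Science', 'Science', 'CS', 'Languages'],
    'classes': ['Algebra', 'Geometry', 'Topology', 'Intro to Python', 'Biology',
                'Chemistry', 'Intro to Java', 'Chinese 101'],
})

# Keep one row per (student, subject) pair so every count is at most 1.
pairs = (students.merge(classes, on='classes')[['student_id', 'subject']]
         .drop_duplicates())

# Each cell is now 0 or 1; int8 is an optional choice to keep the table small.
indicator = pd.crosstab(pairs['student_id'], pairs['subject']).astype('int8')
print(indicator)
```

The values match the final table above; whether this helps on the real data depends on how many repeated (student, subject) pairs it contains.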

1 Comment

Thanks for the help jezrael! Perhaps a dumb question but how do I keep student_id as a column? For example, I want to just do something like df['student_id'] but that no longer works after the cross tab
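
One way to get student_id back as a regular column is `reset_index()`, which moves the index into the frame. A minimal sketch, assuming `df` looks like the crosstab result above (it is rebuilt by hand here so the snippet runs on its own; `flat` is an illustrative name):

```python
import pandas as pd

# Stand-in for the crosstab result above (same labels and values).
df = pd.DataFrame({'CS': [0, 1, 1], 'Languages': [0, 0, 1],
                   'Math': [1, 0, 1], 'Science': [0, 1, 0]},
                  index=pd.Index(['A', 'B', 'C'], name='student_id'))
df.columns.name = 'subject'

# Move the index back into an ordinary column.
flat = df.reset_index()

# Optional: drop the leftover 'subject' label on the column axis.
flat.columns.name = None

print(flat['student_id'].tolist())
```

After this, `flat['student_id']` works like any other column.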
