2

I have a CSV with 10K rows of movie data.

In the "genre" column, the data looks like this:

Adventure|Science Fiction|Thriller
Action|Adventure|Science Fiction|Fantasy
Action|Crime|Thriller
Western|Drama|Adventure|Thriller

I want to create multiple sub-columns (i.e. action yes/no, adventure yes/no, drama yes/no, etc.) based on the genre column.

Question 1: How can I first determine all the unique genre titles in the genre column?

Question 2: After I determine all the unique genre titles, how do I create all the necessary ['insert genre' yes/no] columns?

2 Answers

3

Use str.get_dummies:

df = df['col'].str.get_dummies('|').replace({0:'no', 1:'yes'})

Or:

d = {0:'no', 1:'yes'}
# In pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map
df = df['col'].str.get_dummies('|').applymap(d.get)

For better performance, use MultiLabelBinarizer from scikit-learn:

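import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

d = {0:'no', 1:'yes'}   # same mapping as above
mlb = MultiLabelBinarizer()
# fit_transform takes one list of genres per row and returns a 0/1 array
df = (pd.DataFrame(mlb.fit_transform(df['col'].str.split('|')),
                   columns=mlb.classes_,
                   index=df.index)
        .applymap(d.get))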

print(df)
  Action Adventure Crime Drama Fantasy Science Fiction Thriller Western
0     no       yes    no    no      no             yes      yes      no
1    yes       yes    no    no     yes             yes       no      no
2    yes        no   yes    no      no              no      yes      no
3     no       yes    no   yes      no              no      yes     yes

Detail:

print(df['col'].str.get_dummies('|'))
   Action  Adventure  Crime  Drama  Fantasy  Science Fiction  Thriller  Western
0       0          1      0      0        0                1         1        0
1       1          1      0      0        1                1         0        0
2       1          0      1      0        0                0         1        0
3       0          1      0      1        0                0         1        1
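
For question 1 specifically, the column names of the indicator frame are already the unique genres. If only the list is needed, a minimal sketch (assuming a pandas version with Series.explode, i.e. 0.25 or later, and the same 'col' column name used above) might look like this:

```python
import pandas as pd

# Sample rows copied from the question; the column name 'col' follows the answer above.
df = pd.DataFrame({'col': ['Adventure|Science Fiction|Thriller',
                           'Action|Adventure|Science Fiction|Fantasy',
                           'Action|Crime|Thriller',
                           'Western|Drama|Adventure|Thriller']})

# Split each cell into a list, put one genre per row, then keep the distinct values.
unique_genres = sorted(df['col'].str.split('|').explode().unique())
print(unique_genres)

# The indicator columns produced by str.get_dummies carry the same set of names.
dummy_columns = list(df['col'].str.get_dummies('|').columns)
print(dummy_columns == unique_genres)
```

Either route gives the same names, so a separate pass is only needed if you want the list without building the indicator columns.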

Timings:

df = pd.concat([df] * 10000, ignore_index=True)


In [361]: %timeit pd.DataFrame(mlb.fit_transform(df['col'].str.split('|')) ,columns=mlb.classes_,  index=df.index)
10 loops, best of 3: 120 ms per loop

In [362]: %timeit df['col'].str.get_dummies('|')
1 loop, best of 3: 324 ms per loop

In [363]: %timeit pd.get_dummies(df['col'].str.split('|').apply(pd.Series).stack()).sum(level=0)
1 loop, best of 3: 7.77 s per loop
2

Assuming your column is called Genres, this is one way.

# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
res = pd.get_dummies(df['Genres'].str.split('|').apply(pd.Series).stack()).groupby(level=0).sum()

#    Action  Adventure  Crime  Drama  Fantasy  Science Fiction  Thriller  Western
# 0       0          1      0      0        0                1         1        0
# 1       1          1      0      0        1                1         0        0
# 2       1          0      1      0        0                0         1        0
# 3       0          1      0      1        0                0         1        1

You can then convert the binary values to "no" / "yes" via pd.DataFrame.applymap:

res = res.applymap({0: 'no', 1: 'yes'}.get)
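
Both answers build a separate frame of indicators. If the goal is to keep them as extra columns next to the original data, one possible sketch is to join the result back on the index. The 'title' column and its values below are invented for illustration, and Series.map is used for the yes/no conversion so the snippet does not depend on applymap:

```python
import pandas as pd

# Hypothetical frame: 'title' and its values are invented for illustration;
# the 'Genres' values are the sample rows from the question.
movies = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'Genres': ['Adventure|Science Fiction|Thriller',
               'Action|Adventure|Science Fiction|Fantasy',
               'Action|Crime|Thriller',
               'Western|Drama|Adventure|Thriller'],
})

# One 0/1 indicator column per genre, then map each column to 'no'/'yes'.
flags = movies['Genres'].str.get_dummies('|')
flags = flags.apply(lambda col: col.map({0: 'no', 1: 'yes'}))

# join aligns on the shared index, so each row keeps its own flags.
result = movies.join(flags)
print(result)
```

Because the join is index-based, it relies on the indicator frame keeping the original index, which str.get_dummies does.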
