Convert dataframe column string values into dummy variable columns

Question

I have the following dataframe (excluded rest of columns):

| customer_id | department                    |
| ----------- | ----------------------------- |
| 11          | ['nail', 'men_skincare']      |
| 23          | ['nail', 'fragrance']         |
| 25          | []                            |
| 45          | ['skincare', 'men_fragrance'] |

I am working on preprocessing my data to be fit into a model. I want to turn the department variable into dummy variables for each unique department category (for however many unique departments there could be, not just limited to what is here).

Want to get this result:

| customer_id | department                    | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ----------                    | ---- | ------------ | --------- | -------- | ------------- |
| 11          | ['nail', 'men_skincare']      | 1    | 1            | 0         | 0        | 0             |
| 23          | ['nail', 'fragrance']         | 1    | 0            | 1         | 0        | 0             |
| 25          | []                            | 0    | 0            | 0         | 0        | 0             |
| 45          | ['skincare', 'men_fragrance'] | 0    | 0            | 0         | 1        | 1             |

I have tried this link, but when I slice the column, each value is treated as a string, so I only get one column per character in the string. What I used:

df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]

I then tried to split the strings and turn into a list using:

df['new_column'] = df['department'].apply(lambda x: x.split(","))

I then tried again, but it still only created a column for each character.

Any suggestions?

Edit: I found the answer using the link that anky sent over. Specifically, I used this one: https://stackoverflow.com/a/29036042

What worked for me:

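# Strip quotes, brackets and spaces so each cell becomes e.g. "nail,men_skincare"
# (regex=False makes "[" and "]" literal characters, not regex syntax)
df['department'] = df['department'].str.replace("'", '', regex=False).str.replace("]", '', regex=False).str.replace("[", '', regex=False).str.replace(' ', '', regex=False)
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how='left')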
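
For comparison, here is a minimal sketch of the same idea that parses the strings instead of stripping characters. It assumes each cell holds the string form of a Python list (which is what the character-by-character slicing above suggests); the sample frame and the names `first_chars`, `parsed`, `dummies` and `result` are made up for illustration.

```python
import ast

import pandas as pd

# Assumed sample data: the lists are stored as strings, not as real lists.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail', 'men_skincare']",
        "['nail', 'fragrance']",
        "[]",
        "['skincare', 'men_fragrance']",
    ],
})

# Indexing a string column with .str[0] returns the first character of each
# string, which is why one column per character appeared.
first_chars = df["department"].str[0]

# Parse each string into a real Python list.
parsed = df["department"].apply(ast.literal_eval)

# One row per (customer, department); empty lists become NaN, which
# get_dummies encodes as a row of zeros.
dummies = pd.get_dummies(parsed.explode()).groupby(level=0).sum()

# Attach the indicator columns to the original rows by index.
result = df.join(dummies)
print(result)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of cell content.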

Welcome to Stack Overflow. Please read how to ask good questions. Make sure your question covers these 3 elements: 1. Problem Statement 2. Your Code (it should be a Minimal, Reproducible Example) 3. Error Message (preferably the full traceback, to help others review and provide feedback). Sometimes the same question may have already been asked. Make sure your question is not a duplicate.
– Joe Ferndz, Commented Apr 25, 2021 at 1:32
@anky, yes, that link was helpful; I specifically used this one: stackoverflow.com/a/29036042
– TealSeal, Commented Apr 25, 2021 at 3:59

Anurag Dabas · Accepted Answer · 2021-04-25 02:30:16Z

3

import pandas as pd

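You can do this with the explode(), fillna() and crosstab() methods. First, explode the lists so each department gets its own row, and label the empty lists as 'empty':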

data=df.explode('department').fillna('empty')

Now use the crosstab() method to count departments per customer:

data=pd.crosstab(data['customer_id'],data['department'])

Since the concat() method is giving you an error, use the merge() method instead, and drop() the helper 'empty' column:

data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])

If you now print data, you will get your desired output.
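
The steps above can be put together as one runnable sketch. It assumes the department column holds real Python lists (not strings), and the sample frame plus the names `long_df`, `counts` and `result` are illustrative only. `errors="ignore"` is an added safeguard for data where no customer has an empty list.

```python
import pandas as pd

# Assumed sample data, modelled on the table in the question.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department); an empty list explodes to NaN,
# which is replaced by the placeholder label "empty".
long_df = df.explode("department").fillna("empty")

# Count how often each department occurs per customer.
counts = pd.crosstab(long_df["customer_id"], long_df["department"])

# Merge the counts back on customer_id and drop the placeholder column.
result = pd.merge(
    df.set_index("customer_id"),
    counts,
    left_index=True,
    right_index=True,
).drop(columns=["empty"], errors="ignore")
print(result)
```

Note that the result is indexed by customer_id; call `reset_index()` if it should be an ordinary column again.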

edited Apr 25, 2021 at 2:30

answered Apr 25, 2021 at 1:47

Anurag Dabas


11 Comments

TealSeal Over a year ago

Wow, how did you know this so quickly! I got an error on the last code snippet: "object of type 'int' has no len()"

Anurag Dabas Over a year ago

in concat() method?

TealSeal Over a year ago

Yes, when running: data=pd.concat((df.set_index('customer_id'),data),axis=1).reset_index()

Anurag Dabas Over a year ago

Updated answer with shorter code...Kindly check :)

TealSeal Over a year ago

Now getting a different error, for the concat code :( ValueError: Shape of passed values is (179910, 5183), indices imply (35990, 5183)


Scott Boston · Accepted Answer · 2021-04-25 04:10:54Z

Try:

# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.merge(pd.get_dummies(df.set_index('customer_id')
                          .explode('department'),
                        prefix='',
                        prefix_sep='').groupby(level=0).sum(),
        left_on='customer_id', right_index=True)

Output:

   customer_id                 department  fragrance  men_fragrance  men_skincare  nail  skincare
0           11       [nail, men_skincare]          0              0             1     1         0
1           23          [nail, fragrance]          1              0             0     1         0
2           25                         []          0              0             0     0         0
3           45  [skincare, men_fragrance]          0              1             0     0         1
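
As a point of comparison, and not the method used in this answer, the same indicator columns can be sketched with the string accessor: join each list into a single delimited string, then split it back into dummies. This assumes the column holds real lists and that no department name contains the `|` character; the sample frame and the names `dummies` and `result` are illustrative.

```python
import pandas as pd

# Assumed sample data, modelled on the table in the question.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# Join each list into "a|b", then let str.get_dummies split on "|"
# and build one 0/1 column per distinct department.
dummies = df["department"].str.join("|").str.get_dummies(sep="|")

# The dummies share the original index, so a plain join lines them up.
result = df.join(dummies)
print(result)
```

An empty list joins to an empty string, which yields zeros in every department column.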
