
I have the following dataframe (excluded rest of columns):

| customer_id | department                    |
| ----------- | ----------------------------- |
| 11          | ['nail', 'men_skincare']      |
| 23          | ['nail', 'fragrance']         |
| 25          | []                            |
| 45          | ['skincare', 'men_fragrance'] |

I am preprocessing my data to feed into a model. I want to turn the department variable into dummy variables, one for each unique department category (for however many unique departments there are, not just the ones shown here).

I want to get this result:

| customer_id | department                    | nail | men_skincare | fragrance | skincare | men_fragrance |
| ----------- | ----------------------------- | ---- | ------------ | --------- | -------- | ------------- |
| 11          | ['nail', 'men_skincare']      | 1    | 1            | 0         | 0        | 0             |
| 23          | ['nail', 'fragrance']         | 1    | 0            | 1         | 0        | 0             |
| 25          | []                            | 0    | 0            | 0         | 0        | 0             |
| 45          | ['skincare', 'men_fragrance'] | 0    | 0            | 0         | 1        | 1             |

I have tried this link, but when I slice the column, each value is treated as a string, so I only get a column for each character in the string. What I used:

df['1st'] = df['department'].str[0]
df['2nd'] = df['department'].str[1]
df['3rd'] = df['department'].str[2]
df['4th'] = df['department'].str[3]
df['5th'] = df['department'].str[4]
df['6th'] = df['department'].str[5]
df['7th'] = df['department'].str[6]
df['8th'] = df['department'].str[7]
df['9th'] = df['department'].str[8]
df['10th'] = df['department'].str[9]

I then tried to split the strings and turn them into lists using:

df['new_column'] = df['department'].apply(lambda x: x.split(","))

Then I tried the slicing again, and it still only created a column for each character.

Any suggestions?

Edit: I found the answer using the link that anky sent over; specifically, I used this one: https://stackoverflow.com/a/29036042

What worked for me:

df['department'] = df['department'].str.replace("'",'').str.replace("]",'').str.replace("[",'').str.replace(' ','')
df['department'] = df['department'].apply(lambda x: x.split(","))
s = df['department']
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1 = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
df = pd.merge(df, df1, right_index=True, left_index=True, how = 'left')
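
For reference, a minimal sketch of why the slicing produced single characters, assuming the column stores each list as text (for example, after being read from a CSV). The sample frame is made up to mirror the table above, and ast.literal_eval from the standard library is shown as one alternative to the chained str.replace calls, not as the method used in the linked answer:

```python
import ast

import pandas as pd

# Hypothetical sample: each 'department' cell is a *string* that looks like a list.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        "['nail', 'men_skincare']",
        "['nail', 'fragrance']",
        "[]",
        "['skincare', 'men_fragrance']",
    ],
})

# Indexing a string returns one character, which is why .str[0] gave '['.
first_char = df["department"].str[0].iloc[0]

# ast.literal_eval parses a Python literal, turning the text into a real list.
df["department"] = df["department"].apply(ast.literal_eval)

# With real lists, .str[0] returns the first element of each list instead.
first_item = df["department"].str[0].iloc[0]
```

Once the cells are real lists, list-based methods such as explode() can be applied to the column directly.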
  • Welcome to Stack Overflow. Please read how to ask good questions. Make sure your question covers these 3 elements: 1. Problem Statement 2. Your Code (it should be a Minimal, Reproducible Example) 3. Error Message (preferably the full Traceback, to help others review and provide feedback). Sometimes the same question may have already been asked. Make sure your question is not a duplicate. Commented Apr 25, 2021 at 1:32
  • Can you share what you have done so far please? Commented Apr 25, 2021 at 1:32
  • @JoeFerndz, sure, I edited the question. Commented Apr 25, 2021 at 1:47
  • Does this help? stackoverflow.com/a/51420716/9840637 Commented Apr 25, 2021 at 2:25
  • @anky, yes, that link was helpful. I specifically used this one: stackoverflow.com/a/29036042 Commented Apr 25, 2021 at 3:59

2 Answers

import pandas as pd

You can do this with the explode() and fillna() methods:

data=df.explode('department').fillna('empty')

Now use the crosstab() function:

data=pd.crosstab(data['customer_id'],data['department'])

Since the concat() method is giving you an error, use the merge() and drop() methods instead:

data=pd.merge(df.set_index('customer_id'),data,left_index=True,right_index=True).drop(columns=['empty'])

Now if you print data, you will get your desired output.
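
Put together, the steps above can be sketched end to end as follows. This assumes the department cells are real Python lists (not strings) and uses a made-up frame that mirrors the question's table; the names long_df, counts and result are only for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the question; 'department' holds real lists.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# One row per (customer, department). An empty list explodes to NaN,
# so a placeholder label keeps customer 25 in the crosstab.
long_df = df.explode("department").fillna({"department": "empty"})

# Count how often each department occurs per customer.
counts = pd.crosstab(long_df["customer_id"], long_df["department"])

# Drop the placeholder; errors="ignore" covers data with no empty lists.
counts = counts.drop(columns=["empty"], errors="ignore")

# Attach the indicator columns to the original frame.
result = df.merge(counts, left_on="customer_id", right_index=True, how="left")
```

Note that crosstab() counts occurrences, so a department listed twice for the same customer would produce a 2 rather than a 1.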


11 Comments

Wow, how did you know this so quickly! I got an error on the last code snippet: "object of type 'int' has no len()"
In the concat() method?
Yes, when running: data=pd.concat((df.set_index('customer_id'),data),axis=1).reset_index()
Updated answer with shorter code...Kindly check :)
Now getting a different error, for the concat code :( ValueError: Shape of passed values is (179910, 5183), indices imply (35990, 5183)

Try:

df.merge(pd.get_dummies(df.set_index('customer_id')
                          .explode('department'), 
                        prefix='', 
                        prefix_sep='').groupby(level=0).sum(),
        left_on='customer_id', right_index=True)

Output:

   customer_id                 department  fragrance  men_fragrance  men_skincare  nail  skincare
0           11       [nail, men_skincare]          0              0             1     1         0
1           23          [nail, fragrance]          1              0             0     1         0
2           25                         []          0              0             0     0         0
3           45  [skincare, men_fragrance]          0              1             0     0         1
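
A different route to the same indicator columns, sketched here as an alternative rather than as part of the answer above, is to join each list into one delimited string and let Series.str.get_dummies split it again. The sample frame is made up to mirror the question, and the sketch assumes that no department name contains the "|" separator:

```python
import pandas as pd

# Hypothetical sample mirroring the question; 'department' holds real lists.
df = pd.DataFrame({
    "customer_id": [11, 23, 25, 45],
    "department": [
        ["nail", "men_skincare"],
        ["nail", "fragrance"],
        [],
        ["skincare", "men_fragrance"],
    ],
})

# Join each list into one string such as "nail|men_skincare" (an empty list
# becomes ""), then split on "|" into one 0/1 column per department.
dummies = df["department"].str.join("|").str.get_dummies(sep="|")

# The dummies share df's index, so a plain join lines the rows up.
result = df.join(dummies)
```

Because the result is built on the original index, no merge key is needed, and a customer with an empty list simply gets zeros in every column.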
