I have a column in a pandas DataFrame called "label" which contains multiple string values (for example: 'label1', 'label2', 'label3', ...).

label
label1
label1
label23
label3
label11

I collect all unique values into a list and then create a new column for each one:

unique_labels = df['label'].unique()

for i in unique_labels: # create new single label variable holders
    df[str(i)] = 0

Now I have

label    label1    label2 .... label23
label1     0         0            0
label23    0         0            0

I want to set each new single-label column to 1 where it matches the value in 'label', as follows:

label    label1    label2 .... label23
label1     1         0            0
label23    0         0            1

Here is my code

def single_label(df):
    for i in range(len(unique_labels)):
        if df['label'] == str(unique_labels[i]):
            df[unique_labels[i]] == 1


df = df.applymap(single_label)

Getting this error

TypeError: ("'int' object is not subscriptable", 'occurred at index Unnamed: 0')

1 Answer

IIUC, you can use pd.get_dummies after dropping duplicates, which will be faster and give cleaner code than filling the columns iteratively:

df.drop_duplicates().join(pd.get_dummies(df.drop_duplicates()))

     label  label_label1  label_label11  label_label23  label_label3
0   label1             1              0              0             0
2  label23             0              0              1             0
3   label3             0              0              0             1
4  label11             0              1              0             0

You can get rid of those label prefixes and underscores using the prefix and prefix_sep arguments:

df.drop_duplicates().join(pd.get_dummies(df.drop_duplicates(),
                                         prefix='', prefix_sep=''))

     label  label1  label11  label23  label3
0   label1       1        0        0       0
2  label23       0        0        1       0
3   label3       0        0        0       1
4  label11       0        1        0       0
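
As a side note on the iterative approach and the TypeError in the question: applymap calls the function once per cell, so inside single_label the argument is a single scalar rather than the DataFrame, and indexing it with ['label'] fails. Below is a minimal sketch of a working loop, assuming the sample data from the question. The name `manual` is mine, and `dtype=int` is added because newer pandas versions return booleans from get_dummies by default.

```python
import pandas as pd

# Sample data rebuilt from the question (an assumption about the real frame).
df = pd.DataFrame({"label": ["label1", "label1", "label23", "label3", "label11"]})

# Working loop: compare the whole column at once and assign with '=',
# not '==' (which only compares and discards the result).
manual = df.copy()
for lab in manual["label"].unique():
    manual[lab] = (manual["label"] == lab).astype(int)

# The same indicator columns from get_dummies, forced to 0/1 integers.
dummies = pd.get_dummies(df["label"], prefix="", prefix_sep="", dtype=int)

print(manual)
```

Both routes should give the same indicator values; get_dummies just does it in one call and returns the columns in sorted order.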

Edit: with a second column, i.e.:

>>> df
     label second_column
0   label1             a
1   label1             b
2  label23             c
3   label3             d
4  label11             e

Just call pd.get_dummies on only the label column:

df.drop_duplicates('label').join(pd.get_dummies(df['label'].drop_duplicates(),
                                                prefix='', prefix_sep=''))

     label second_column  label1  label11  label23  label3
0   label1             a       1        0        0       0
2  label23             c       0        0        1       0
3   label3             d       0        0        0       1
4  label11             e       0        1        0       0

But then you're getting rid of the rows with repeated labels, and I don't think that's what you want (unless I'm mistaken). If it isn't, just omit the drop_duplicates calls:

df.join(pd.get_dummies(df['label'], prefix='', prefix_sep=''))

     label second_column  label1  label11  label23  label3
0   label1             a       1        0        0       0
1   label1             b       1        0        0       0
2  label23             c       0        0        1       0
3   label3             d       0        0        0       1
4  label11             e       0        1        0       0

1 Comment

Thanks, but can you please show how to apply this to only the 'label' column (since my actual data contains multiple columns)? I tried df['label'], but it didn't work.