pandas dataframe label columns encoding

Question

I have a pandas DataFrame with a text column and several string label columns. df looks like:

news                          label1      label2      label3  label4
COVID Hospitalizations ....   health
will pets contract covid....  health      pets
High temperature will cause.. health      weather
...

Expected output

news                          health      pets      weather  tech
COVID Hospitalizations ....   1           0         0        0 
will pets contract covid....  1           1         0        0
High temperature will cause.. 1           0         1        0
...

Currently I use sklearn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# collect the label columns of each row into a single list
df['labels'] = df[['label1','label2','label3','label4']].values.tolist()
mlb.fit(df['labels'])
temp = mlb.transform(df['labels'])
# one indicator column per class found by the binarizer
ff = pd.DataFrame(temp, columns = list(mlb.classes_))
df_final = pd.concat([df['news'],ff], axis=1)

This works so far. Is there a way to do this without sklearn.preprocessing.MultiLabelBinarizer?

jezrael · Accepted Answer · 2022-04-26 06:25:57Z

One idea is to join the label values of each row with | and then use Series.str.get_dummies:

#if the label columns contain missing values (NaN), replace them with empty strings first
#df = df.fillna('')
df_final = df.set_index('news').agg('|'.join, axis=1).str.get_dummies().reset_index()
print (df_final)
                            news  health  pets  weather
0    COVID Hospitalizations ....       1     0        0
1   will pets contract covid....       1     1        0
2  High temperature will cause..       1     0        1
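
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach. The sample frame is a hypothetical reconstruction of the question's data, and it assumes unused label slots hold empty strings (if they are NaN, call fillna('') first as noted above).

```python
import pandas as pd

# Hypothetical sample data mirroring the question;
# assumption: unused label slots are empty strings, not NaN.
df = pd.DataFrame({
    "news": ["COVID Hospitalizations ....",
             "will pets contract covid....",
             "High temperature will cause.."],
    "label1": ["health", "health", "health"],
    "label2": ["", "pets", "weather"],
    "label3": ["", "", ""],
    "label4": ["", "", ""],
})

# Join the label columns of each row into one "|"-separated string.
joined = df.set_index("news").agg("|".join, axis=1)

# Split on "|" and build one indicator column per distinct label;
# empty pieces do not produce a column.
df_final = joined.str.get_dummies(sep="|").reset_index()
print(df_final)
```

A label that never occurs in the data (such as tech in the expected output) gets no column this way; it would have to be added separately, for example with reindex.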

Or use get_dummies:

df_final = (pd.get_dummies(df.set_index('news'), prefix='', prefix_sep='')
              .groupby(level=0,axis=1)
              .max()
              .reset_index())

#the second column's name is an empty string (it comes from the empty label cells), which is the difference from the solution above
print (df_final)
                            news     health  pets  weather
0    COVID Hospitalizations ....  1       1     0        0
1   will pets contract covid....  1       1     1        0
2  High temperature will cause..  1       1     0        1
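
A possible variation for newer pandas versions, offered as a sketch rather than as part of the original answer: passing axis=1 to groupby is deprecated in recent releases, and pd.get_dummies returns boolean columns by default since pandas 2.0. The version below transposes instead of grouping along columns, requests integer output with dtype=int, and drops the empty-string column. The sample frame is a hypothetical reconstruction of the question's data, with empty strings assumed for unused label slots.

```python
import pandas as pd

# Hypothetical sample data mirroring the question;
# assumption: unused label slots are empty strings, not NaN.
df = pd.DataFrame({
    "news": ["COVID Hospitalizations ....",
             "will pets contract covid....",
             "High temperature will cause.."],
    "label1": ["health", "health", "health"],
    "label2": ["", "pets", "weather"],
    "label3": ["", "", ""],
    "label4": ["", "", ""],
})

# One indicator column per (label column, value) pair; with an empty prefix
# the column names are just the values, so names can repeat.
dummies = pd.get_dummies(df.set_index("news"), prefix="", prefix_sep="", dtype=int)

# Merge columns that share a name: transpose, group the row labels, take the max,
# and transpose back.
merged = dummies.T.groupby(level=0).max().T

# Drop the column created by the empty-string placeholders, if there is one.
df_final = merged.drop(columns="", errors="ignore").reset_index()
print(df_final)
```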
