exploding a pandas dataframe column

Question

I have a Pandas Dataframe that looks something like this:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2", 
          "label_1, label_3, label_2", 
          "label_2, label_4", 
          "label_1, label_2, label_5", 
          "label_2, label_3", 
          "label_3, label_5, label_1, label_2", 
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df



   text                              labels
0  abcd                    label_1, label_2
1  efgh           label_1, label_3, label_2
2  ijkl                    label_2, label_4
3  mnop           label_1, label_2, label_5
4  qrst                    label_2, label_3
5  uvwx  label_3, label_5, label_1, label_2
6    yz                    label_1, label_3

I would like to format the dataframe into something like this:

text  label_1  label_2  label_3  label_4  label_5

abcd        1.0      1.0      0.0      0.0      0.0
efgh        1.0      1.0      1.0      0.0      0.0
ijkl        0.0      1.0      0.0      1.0      0.0
mnop        1.0      1.0      0.0      0.0      1.0
qrst        0.0      1.0      1.0      0.0      0.0
uvwx        1.0      1.0      1.0      0.0      1.0
yz          1.0      0.0      1.0      0.0      0.0

How can I accomplish this? (I know I can split the strings in the labels and convert them into lists by doing something like df.labels.str.split(",") but not sure as to how to proceed from there.

(so basically I'd like to convert those keywords in the labels columns into its own columns and fill in 1 whenever they appear as shown in expected output)

Possible duplicate of pandas: How do I split text in a column into multiple rows? — Mohit Motwani
– Mohit Motwani, Commented Aug 8, 2018 at 9:36
@MohitMotwani I've tried that, it does not produce the required solution — ultron
– ultron, Commented Aug 8, 2018 at 9:41

jpp · Accepted Answer · 2018-08-08 10:40:15Z

5

You can use pd.Series.str.get_dummies and combine with the text series:

dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)

print(res)

   text  label_1  label_2  label_3  label_4  label_5
0  abcd        1        1        0        0        0
1  efgh        1        1        1        0        0
2  ijkl        0        1        0        1        0
3  mnop        1        1        0        0        1
4  qrst        0        1        1        0        0
5  uvwx        1        1        1        0        1
6    yz        1        0        1        0        0

answered Aug 8, 2018 at 10:40

jpp

166k37 gold badges301 silver badges363 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Shaido · Accepted Answer · 2018-08-09 00:56:43Z

2

A simle solution would be to use pd.get_dummies as follows:

pd.get_dummies(
    df.set_index('text')['labels'].str.split(', ', expand=True).stack()
).groupby('text').sum()

edited Aug 9, 2018 at 0:56

answered Aug 8, 2018 at 9:49

Shaido

28.6k26 gold badges76 silver badges82 bronze badges

Comments

Nihal · Accepted Answer · 2018-08-09 05:50:12Z

code:

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df = df.drop('labels', axis=1).join(
             df.labels
             .str
             .split(', ', expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('labels')
             )

df['value'] = 1
df_new = df.pivot(values = 'value', index='text', columns = 'labels').fillna(0)
print(df_new)

output:

labels  text  label_1  label_2  label_3  label_4  label_5
0       abcd      1.0      1.0      0.0      0.0      0.0
1       efgh      1.0      1.0      1.0      0.0      0.0
2       ijkl      0.0      1.0      0.0      1.0      0.0
3       mnop      1.0      1.0      0.0      0.0      1.0
4       qrst      0.0      1.0      1.0      0.0      0.0
5       uvwx      1.0      1.0      1.0      0.0      1.0
6         yz      1.0      0.0      1.0      0.0      0.0

in this main thing is split use (,) with space, because of you string format, if you change that format than use appropriate split.

for example:

if you are using split with single comma like this

df = df.drop('labels', axis=1).join(
                 df.labels
                 .str
                 .split(',', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1)
                 .rename('labels')
                 )

then you will need additional code for removing spaces

df['labels'] = df['labels'].str.replace(" ", "")

rest of the code will be same.

Nihal · Accepted Answer · 2019-02-14 11:52:29Z

If number of columns are dynamic, this will help find the possible ones.

unique = df['labels'].apply(lambda x: x.split(", ")).values.tolist()
unique = [i for sublist in unique for i in sublist]
unique = set(unique)

Hence, unique is now.
{'label_1', 'label_2', 'label_3', 'label_4', 'label_5'}

max_label = len(unique)

Which will give us the maximum number of columns.

Answer

def labeller(labels):
    value = [0] * max_label
    for label in labels:
        value[int(label[-1])-1] = 1
    return value

df['labels'] = df['labels'].apply(lambda x: x.split(", ")).apply(labeller)

df[['label_' + str(i+1) for i in range(max_label)]] = df.labels.apply(pd.Series)
df.drop(['labels'], axis=1, inplace=True)

    text    label_1 label_2 label_3 label_4 label_5
0   abcd    1       1       0       0       0
1   efgh    1       1       1       0       0
2   ijkl    0       1       0       1       0
3   mnop    1       1       0       0       1
4   qrst    0       1       1       0       0
5   uvwx    1       1       1       0       1
6   yz      1       0       1       0       0

Collectives™ on Stack Overflow

exploding a pandas dataframe column

4 Answers 4

Comments

Comments

Comments

Answer

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

Comments

Comments

Answer

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related