I have a pandas DataFrame that looks something like this:

import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2", 
          "label_1, label_3, label_2", 
          "label_2, label_4", 
          "label_1, label_2, label_5", 
          "label_2, label_3", 
          "label_3, label_5, label_1, label_2", 
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df



   text                              labels
0  abcd                    label_1, label_2
1  efgh           label_1, label_3, label_2
2  ijkl                    label_2, label_4
3  mnop           label_1, label_2, label_5
4  qrst                    label_2, label_3
5  uvwx  label_3, label_5, label_1, label_2
6    yz                    label_1, label_3

I would like to reshape the DataFrame into something like this:

text    label_1  label_2  label_3  label_4  label_5
abcd        1.0      1.0      0.0      0.0      0.0
efgh        1.0      1.0      1.0      0.0      0.0
ijkl        0.0      1.0      0.0      1.0      0.0
mnop        1.0      1.0      0.0      0.0      1.0
qrst        0.0      1.0      1.0      0.0      0.0
uvwx        1.0      1.0      1.0      0.0      1.0
yz          1.0      0.0      1.0      0.0      0.0

How can I accomplish this? I know I can split the strings in the labels column into lists with something like df.labels.str.split(","), but I am not sure how to proceed from there.

Basically, I'd like to turn the keywords in the labels column into their own columns and fill in 1 wherever they appear, as shown in the expected output.

  • Is there a maximum number of values in the labels column? Commented Aug 8, 2018 at 9:32
  • @MohitMotwani no, it is not fixed and it could vary. Commented Aug 8, 2018 at 9:33
  • Possible duplicate of pandas: How do I split text in a column into multiple rows? Commented Aug 8, 2018 at 9:36
  • @MohitMotwani I've tried that; it does not produce the required result. Commented Aug 8, 2018 at 9:41

4 Answers

You can use pd.Series.str.get_dummies and join the result with the text series:

dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)

print(res)

   text  label_1  label_2  label_3  label_4  label_5
0  abcd        1        1        0        0        0
1  efgh        1        1        1        0        0
2  ijkl        0        1        0        1        0
3  mnop        1        1        0        0        1
4  qrst        0        1        1        0        0
5  uvwx        1        1        1        0        1
6    yz        1        0        1        0        0
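
For reference, here is a self-contained sketch of the same idea that reproduces the float values and text index from the question. The regex-based cleanup and the names cleaned and res are my own choices for illustration, not part of the answer above; the regex trims whitespace only around the commas, so labels that contain internal spaces would not be mangled.

```python
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))

# Collapse any whitespace around the commas so the separator is a bare ",".
cleaned = df["labels"].str.strip().str.replace(r"\s*,\s*", ",", regex=True)

# One 0/1 column per distinct label.
dummies = cleaned.str.get_dummies(sep=",")

# Cast to float and index by text to mirror the layout asked for in the question.
res = df[["text"]].join(dummies.astype(float)).set_index("text")
print(res)
```

If integer indicators are fine, the astype(float) call can simply be dropped.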

A simple solution would be to use pd.get_dummies as follows:

pd.get_dummies(
    df.set_index('text')['labels'].str.split(', ', expand=True).stack()
).groupby('text').sum()
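
To see what that one-liner does, it may help to break it into steps. This is only a sketch on the question's data; the intermediate names (wide, long, indicators, result) are mine, and the explicit dropna() is a precaution I added because, as far as I can tell, stack's handling of missing values differs between pandas versions.

```python
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))

# Step 1: one column per position in the comma-separated list, indexed by text.
wide = df.set_index("text")["labels"].str.split(", ", expand=True)

# Step 2: stack into one long Series with a (text, position) index.
# dropna() removes the padding left by shorter lists; it is a no-op
# if stack has already dropped those entries.
long = wide.stack().dropna()

# Step 3: one indicator column per distinct label, one row per (text, position).
indicators = pd.get_dummies(long)

# Step 4: collapse back to one row per text value; astype(int) gives plain 0/1.
result = indicators.groupby(level="text").sum().astype(int)
print(result)
```

Note that the groupby sorts the rows by text and merges rows that share the same text value, which may or may not be what you want if text is not unique.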


Code:

import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]

labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]

df = pd.DataFrame(dict(text=text, labels=labels))
df = df.drop('labels', axis=1).join(
             df.labels
             .str
             .split(', ', expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('labels')
             )

df['value'] = 1
# reset_index() turns the 'text' index back into a column, matching the output below
df_new = df.pivot(values='value', index='text', columns='labels').fillna(0).reset_index()
print(df_new)

Output:

labels  text  label_1  label_2  label_3  label_4  label_5
0       abcd      1.0      1.0      0.0      0.0      0.0
1       efgh      1.0      1.0      1.0      0.0      0.0
2       ijkl      0.0      1.0      0.0      1.0      0.0
3       mnop      1.0      1.0      0.0      0.0      1.0
4       qrst      0.0      1.0      1.0      0.0      0.0
5       uvwx      1.0      1.0      1.0      0.0      1.0
6         yz      1.0      0.0      1.0      0.0      0.0

The key point is to split on a comma followed by a space (', '), because that is how your strings are formatted. If the format changes, use the matching separator.

For example, if you split on a bare comma like this:

df = df.drop('labels', axis=1).join(
                 df.labels
                 .str
                 .split(',', expand=True)
                 .stack()
                 .reset_index(drop=True, level=1)
                 .rename('labels')
                 )

then you will need an extra step to remove the leading spaces:

df['labels'] = df['labels'].str.replace(" ", "")

The rest of the code stays the same.
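
One caveat with pivot: as far as I know it raises a ValueError when the same text/label pair occurs more than once. A possible alternative, which is a different technique from the code above, is to explode the split lists and count with pd.crosstab. This is a sketch under that assumption; the names exploded, counts and indicator are made up for the example.

```python
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))

# Split each string into a list, then give every label its own row.
exploded = df.assign(labels=df["labels"].str.split(", ")).explode("labels")

# Count how often each label occurs for each text value.
counts = pd.crosstab(exploded["text"], exploded["labels"])

# Cap counts at 1 to get indicators, and turn 'text' back into a column.
indicator = counts.clip(upper=1).reset_index()
print(indicator)
```

DataFrame.explode needs pandas 0.25 or later, and the rows come back sorted by text.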


If the number of columns is dynamic, you can first collect the set of labels that actually occur:

unique = df['labels'].apply(lambda x: x.split(", ")).values.tolist()
unique = [i for sublist in unique for i in sublist]
unique = set(unique)

unique is now:
{'label_1', 'label_2', 'label_3', 'label_4', 'label_5'}

max_label = len(unique)

This gives us the number of label columns.

Answer

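def labeller(labels):
    # One slot per label column, initially 0.
    value = [0] * max_label
    for label in labels:
        # Assumes names like 'label_3': the final digit (1..max_label) picks the slot.
        value[int(label[-1])-1] = 1
    return value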

df['labels'] = df['labels'].apply(lambda x: x.split(", ")).apply(labeller)

df[['label_' + str(i+1) for i in range(max_label)]] = df.labels.apply(pd.Series)
df.drop(['labels'], axis=1, inplace=True)

    text    label_1 label_2 label_3 label_4 label_5
0   abcd    1       1       0       0       0
1   efgh    1       1       1       0       0
2   ijkl    0       1       0       1       0
3   mnop    1       1       0       0       1
4   qrst    0       1       1       0       0
5   uvwx    1       1       1       0       1
6   yz      1       0       1       0       0
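
The labeller above depends on every label ending in a single digit between 1 and max_label. If that does not hold for your data, a variant that looks up each label's position in the sorted set of labels might be safer. This is only a sketch; columns, position, encode and result are names I made up for it.

```python
import pandas as pd

text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
          "label_1, label_3, label_2",
          "label_2, label_4",
          "label_1, label_2, label_5",
          "label_2, label_3",
          "label_3, label_5, label_1, label_2",
          "label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))

split_labels = df["labels"].str.split(", ")

# Sorted list of every label that occurs, and each label's column position.
columns = sorted({label for row in split_labels for label in row})
position = {label: i for i, label in enumerate(columns)}

def encode(row_labels):
    """Return a 0/1 list with a 1 at the position of each label in the row."""
    value = [0] * len(columns)
    for label in row_labels:
        value[position[label]] = 1
    return value

# Build the indicator columns and attach them to the text column.
encoded = pd.DataFrame(split_labels.apply(encode).tolist(),
                       columns=columns, index=df.index)
result = df[["text"]].join(encoded)
print(result)
```

The column order here is plain string sorting, so with ten or more labels 'label_10' would come before 'label_2'.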
