Converting a DataFrame string column into multiple columns and rearranging each column based on the labels

Question

I want to convert a string column containing multiple labels into separate columns, one per label, and rearrange the DataFrame so that identical labels end up in the same column. For example:

ID	Label
0	apple, tom, car
1	apple, car
2	tom, apple

to

ID	Label	0	1	2
0	apple, tom, car	apple	car	tom
1	apple, car	apple	car	None
2	tom, apple	apple	None	tom

df["Label"].str.split(", ", expand=True)

0	1	2
apple	tom	car
apple	car	None
tom	apple	None

I know how to split the string column, but I can't figure out how to sort the label columns, especially since the number of labels per row varies.

constantstranger · Accepted Answer · 2022-04-16 13:14:16Z

Here's a way to do this.

First call df['Label'].apply() to replace the csv strings with lists and also to populate a Python dict mapping labels to new column index values.

Then create a second data frame df2 that fills new label columns as specified in the question.

Finally, concatenate the two DataFrames horizontally and drop the 'Label' column.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID' : [0,1,2],
    'Label' : ['apple, tom, car', 'apple, car', 'tom, apple']
})

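# labelInfo holds the label -> column-index dict and the next free index.
# (The := assignments require Python 3.8+.)
labelInfo = [labels := {}, curLabelIdx := 0]
def foo(x, labelInfo):
    # Split the comma-separated string into a list of stripped labels.
    theseLabels = [s.strip() for s in x.split(',')]
    labels, curLabelIdx = labelInfo
    for label in theseLabels:
        # Assign the next column index to each label seen for the first time.
        if label not in labels:
            labels[label] = curLabelIdx
            curLabelIdx += 1
    labelInfo[1] = curLabelIdx
    return theseLabels
df['Label'] = df['Label'].apply(foo, labelInfo=labelInfo)
# Build one column per label; rows lacking a label get the string 'None'.
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()),
    columns = list(labels.values()))
df = pd.concat([df, df2], axis=1).drop(columns=['Label'])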

print(df)

Output:

   ID      0     1     2
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None

If you'd prefer to have the new columns named using the labels they contain, you can replace the df2 assignment line with this:

df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()), 
    columns = list(labels))

Now the output is:

   ID  apple   tom   car
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None
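
The three steps described above (split, map labels to columns, concatenate) can also be sketched without the mutable labelInfo helper. This is only an illustrative variant, not the answer's own code: the names label_lists, all_labels, and aligned are made up for this sketch, it relies on dict.fromkeys preserving insertion order, and it keeps the 'Label' column (as in the question's expected output) and uses a real missing value instead of the string 'None'.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Label": ["apple, tom, car", "apple, car", "tom, apple"],
})

# Step 1: split each string into a list of stripped labels.
label_lists = df["Label"].apply(lambda s: [part.strip() for part in s.split(",")])

# Step 2: collect distinct labels in first-appearance order
# (dict.fromkeys keeps insertion order and drops repeats).
all_labels = list(dict.fromkeys(label for row in label_lists for label in row))

# Step 3: one column per label, with a missing value where the row lacks it.
aligned = pd.DataFrame(
    [[label if label in row else None for label in all_labels] for row in label_lists],
    columns=range(len(all_labels)),
    index=df.index,
)

result = pd.concat([df, aligned], axis=1)
print(result)
```

Passing columns=all_labels instead of range(len(all_labels)) would name the new columns after the labels, mirroring the second variant shown above.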

Andrej Kesely · Accepted Answer · 2022-04-16 12:36:52Z

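Try exploding the labels into one row each, numbering every distinct label with groupby(...).ngroup(), and unstacking on that number: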

df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
    .unstack(2)
    .droplevel(0, axis=1)
    .reset_index()
)
df.columns.name = None
print(df)

Prints:

   ID            Label      0    1    2
0   0  apple, tom, car  apple  car  tom
1   1       apple, car  apple  car  NaN
2   2       tom, apple  apple  NaN  tom
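
To see why this works, it may help to look at the intermediate frame before the unstack. The sketch below repeats the first two statements on the question's sample data. The variable name long_df is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Label": ["apple, tom, car", "apple, car", "tom, apple"],
})

# One row per (ID, label) pair: the list column is exploded into rows.
long_df = df.assign(xxx=df["Label"].str.split(r"\s*,\s*")).explode("xxx")

# ngroup() numbers the groups in sorted key order,
# so apple -> 0, car -> 1, tom -> 2 regardless of where they appear.
long_df["Col"] = long_df.groupby("xxx").ngroup()

print(long_df)
```

Because every occurrence of a label receives the same number, unstacking on "Col" places identical labels in the same output column and leaves NaN where a row lacks that label.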

1 Comment

constantstranger Over a year ago

This is really clever!

lmielke · Accepted Answer · 2022-04-16 13:21:15Z

I believe what you want is something like this:

import pandas as pd

data = {'Label': ['apple, tom, car', 'apple, car', 'tom, apple']}
df = pd.DataFrame(data)
print(f"df: \n{df}")

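def norm_sort(series):
    # Collect every label from every row, then deduplicate and sort.
    mask = []
    for line in series:
        mask.extend([l.strip() for l in line.split(',')])
    mask = sorted(list(set(mask)))
    labels = []
    for line in series:
        # Compare against whole labels so that e.g. 'car' does not match 'carpet'.
        present = [l.strip() for l in line.split(',')]
        labels.append(', '.join([m if m in present else 'None' for m in mask]))
    return labels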

df.Label = norm_sort(df.loc[:, 'Label'])
df = df.Label.str.split(', ', expand=True)
print(f"df: \n{df}")

1 Comment

lmielke Over a year ago

Looking at my code, I suggest that, for efficiency, you create mask as a set instead of a list and then call mask.update instead of mask.extend.
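
A minimal sketch of what that suggestion might look like, assuming the same sample data as the answer. The helper name present and the whole-label membership test are additions for this sketch, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Label": ["apple, tom, car", "apple, car", "tom, apple"]})

def norm_sort(series):
    # Accumulate distinct labels directly in a set instead of a list.
    mask = set()
    for line in series:
        mask.update(part.strip() for part in line.split(","))
    ordered = sorted(mask)

    labels = []
    for line in series:
        # Set of labels in this row, for whole-label membership tests.
        present = {part.strip() for part in line.split(",")}
        labels.append(", ".join(m if m in present else "None" for m in ordered))
    return labels

df["Label"] = norm_sort(df["Label"])
out = df["Label"].str.split(", ", expand=True)
print(out)
```

The set avoids building a list full of duplicates before deduplicating, which mainly matters when the column is long and the number of distinct labels is small.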

user7375116 · Accepted Answer · 2022-04-16 12:30:39Z

The goal of your program is not clear. If you are curious which elements are present in the different rows, then we can just collect them all and stack the DataFrame like this:

import pandas as pd

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)

>>> final_df
0     apple
1    banana
2     grape
3     apple
4    banana
5    banana
6     grape

At this point we can drop the duplicates or count the occurrence of each, depending on your use case.
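
Both follow-ups might look like the sketch below. The names unique_labels and counts are illustrative, and the explicit dropna() is an added precaution so the result does not depend on whether stack() discards the missing cells produced by the shorter rows.

```python
import pandas as pd

df = pd.DataFrame({"label": ["apple, banana, grape", "apple, banana", "banana, grape"]})

# Stack all labels into a single Series, discarding the padding cells.
final_df = (
    df["label"].str.split(", ", expand=True)
    .stack()
    .dropna()
    .reset_index(drop=True)
)

# Option 1: the distinct labels, in order of first appearance.
unique_labels = final_df.drop_duplicates().tolist()

# Option 2: how often each label occurs across all rows.
counts = final_df.value_counts()

print(unique_labels)
print(counts)
```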
