1

I want to split a string column containing multiple labels into separate columns, one per label, and arrange the DataFrame so that identical labels end up in the same column. For example:

ID Label
0 apple, tom, car
1 apple, car
2 tom, apple

to

ID  Label            0      1     2
0   apple, tom, car  apple  car   tom
1   apple, car       apple  car   None
2   tom, apple       apple  None  tom
df["Label"].str.split(',', n=3, expand=True)

0      1      2
apple  tom    car
apple  car    None
tom    apple  None

I know how to split the string column, but I can't figure out how to sort the label columns, especially since the number of labels per sample varies.

4 Answers

1

Here's a way to do this.

First call df['Label'].apply() to replace the csv strings with lists and also to populate a Python dict mapping labels to new column index values.

Then create a second data frame df2 that fills new label columns as specified in the question.

Finally, concatenate the two DataFrames horizontally and drop the 'Label' column.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID' : [0,1,2],
    'Label' : ['apple, tom, car', 'apple, car', 'tom, apple']
})

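# labelInfo holds shared state for foo: the label -> column-index dict
# and the next unused index (the walrus operator needs Python 3.8+).
labelInfo = [labels := {}, curLabelIdx := 0]
def foo(x, labelInfo):
    # Split the comma-separated string into a list of stripped labels.
    theseLabels = [s.strip() for s in x.split(',')]
    labels, curLabelIdx = labelInfo
    for label in theseLabels:
        if label not in labels:
            # First time this label is seen: give it the next column index.
            labels[label] = curLabelIdx
            curLabelIdx += 1
    labelInfo[1] = curLabelIdx
    return theseLabels
df['Label'] = df['Label'].apply(foo, labelInfo=labelInfo)
# For each row, put every known label in its own column, or the string 'None' if absent.
df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()), 
    columns = list(labels.values()))
df = pd.concat([df, df2], axis=1).drop(columns=['Label'])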

print(df)

Output:

   ID      0     1     2
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None

If you'd prefer to have the new columns named using the labels they contain, you can replace the df2 assignment line with this:

df2 = pd.DataFrame(np.array(df['Label'].apply(lambda x: [s if s in x else 'None' for s in labels]).to_list()), 
    columns = list(labels))

Now the output is:

   ID  apple   tom   car
0   0  apple   tom   car
1   1  apple  None   car
2   2  apple   tom  None

1

Try:

df = df.assign(xxx=df.Label.str.split(r"\s*,\s*")).explode("xxx")
df["Col"] = df.groupby("xxx").ngroup()
df = (
    df.set_index(["ID", "Label", "Col"])
    .unstack(2)
    .droplevel(0, axis=1)
    .reset_index()
)
df.columns.name = None
print(df)

Prints:

   ID            Label      0    1    2
0   0  apple, tom, car  apple  car  tom
1   1       apple, car  apple  car  NaN
2   2       tom, apple  apple  NaN  tom
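
To see why the columns line up, it may help to look at the intermediate long-format frame before the unstack. This is only a sketch of the first two steps of the snippet above, rebuilt on the question's sample data; the name long_df is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Label": ["apple, tom, car", "apple, car", "tom, apple"],
})

# One row per (ID, label) pair; "xxx" is the same scratch column name as above.
long_df = df.assign(xxx=df["Label"].str.split(r"\s*,\s*")).explode("xxx")

# With groupby's default sort=True, ngroup() numbers the groups in sorted key
# order, so every occurrence of the same label gets the same column number.
long_df["Col"] = long_df.groupby("xxx").ngroup()

print(long_df[["ID", "xxx", "Col"]])
```

Because the group number depends only on the label, unstacking on Col places each label in a fixed column and leaves NaN wherever a row lacks that label.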

1 Comment

This is really clever!

1

I believe what you want is something like this:

import pandas as pd

data = {'Label': ['apple, tom, car', 'apple, car', 'tom, apple']}
df = pd.DataFrame(data)
print(f"df: \n{df}")

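def norm_sort(series):
    # Collect every label that appears anywhere in the column.
    mask = []
    for line in series:
        mask.extend([l.strip() for l in line.split(',')])
    # De-duplicate and sort so each label gets a fixed position.
    mask = sorted(set(mask))
    labels = []
    for line in series:
        # Compare against whole labels, not substrings of the raw string.
        present = {l.strip() for l in line.split(',')}
        labels.append(', '.join([m if m in present else 'None' for m in mask]))
    return labels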

df.Label = norm_sort(df.loc[:, 'Label'])
df = df.Label.str.split(', ', expand=True)
print(f"df: \n{df}")

1 Comment

Looking at my code again: for efficiency, it would be better to create mask as a set instead of a list, and then call mask.update instead of mask.extend.
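
A rough sketch of what that change might look like, assuming the same sample data as in the answer. The per-row present set is an extra assumption that goes beyond the comment: it compares whole labels rather than substrings.

```python
import pandas as pd

def norm_sort(series):
    # Build the label universe as a set, so duplicates never accumulate.
    mask = set()
    for line in series:
        mask.update(l.strip() for l in line.split(','))
    mask = sorted(mask)
    labels = []
    for line in series:
        # Assumption: match whole labels via a per-row set, not substrings.
        present = {l.strip() for l in line.split(',')}
        labels.append(', '.join(m if m in present else 'None' for m in mask))
    return labels

df = pd.DataFrame({'Label': ['apple, tom, car', 'apple, car', 'tom, apple']})
normalized = norm_sort(df['Label'])
wide = pd.Series(normalized, index=df.index).str.split(', ', expand=True)
print(wide)
```

As in the answer, missing labels come out as the string 'None' rather than a real missing value.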

0

The goal of your program is not clear. If you just want to know which elements are present across the rows, you can collect them all by splitting and stacking the DataFrame like this:

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})
final_df = df['label'].str.split(', ', expand=True).stack()
final_df.reset_index(drop=True, inplace=True)
>>> final_df
0     apple
1    banana
2     grape
3     apple
4    banana
5    banana
6     grape

At this point we can drop the duplicates or count the occurrence of each, depending on your use case.
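
Both follow-ups could look something like the sketch below, using the same sample frame. The names unique_labels and counts are made up for illustration, and the explicit dropna() is a precaution in case the installed pandas version keeps the padding values that split(expand=True) produces for shorter rows.

```python
import pandas as pd

df = pd.DataFrame({'label': ['apple, banana, grape', 'apple, banana', 'banana, grape']})

# Stack all labels into one Series; drop any padding left by shorter rows.
final_df = df['label'].str.split(', ', expand=True).stack().dropna().reset_index(drop=True)

# Distinct labels, in order of first appearance.
unique_labels = final_df.drop_duplicates().tolist()

# Number of occurrences of each label.
counts = final_df.value_counts()

print(unique_labels)
print(counts)
```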
