Compare each string element in a dataframe to a list and assign it to a column, python pandas

Question

How to rearrange my dataframe according to column names while searching for specific strings in cells?

My dataframe:

0	1	2	3	4
apple pie	banana bread	orange juice	nan	nan
apple cookies	orange lemonade	nan	nan	nan
banana muffin	orange ice	berry candy	nan	nan
berry juice	nan	nan	nan	nan

I want to arrange the rows according to a list of column names, which look for specific strings of text.

apple	banana	orange	berry	lemon
apple pie	banana bread	orange juice	nan	nan
apple cookies	nan	orange lemonade	nan	nan
nan	banana muffin	orange ice	berry candy	nan
nan	nan	nan	berry juice	nan

I have tried to create a column/list for each fruit, searching for the right string and adding the cell if it matches. However, I do not know how to iterate through the dataframe and assign values. I just get a column of NaNs.

col_names = ['apple', 'banana', 'orange', 'berry', 'lemonade']
apples = np.where(df_fruits.str.contains("apple", case=False, na=False), df_fruits, np.nan)
bananas = np.where(df_fruits.str.contains("banana", case=False, na=False), df_fruits, np.nan)
etc...

Edit: I got the dataframe from a csv-file, so the original data format is in rows of string: "apple pie, banana bread, orange juice, nan, nan" etc.
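
For readers starting from the same raw rows, here is a minimal sketch of how such strings could be split into the column layout shown above. The `raw` Series and its contents are assumptions reconstructed from the description in the edit, not the asker's actual file.

```python
import pandas as pd

# Hypothetical raw rows, as they might look straight out of the CSV file.
raw = pd.Series([
    "apple pie, banana bread, orange juice, nan, nan",
    "apple cookies, orange lemonade, nan, nan, nan",
    "banana muffin, orange ice, berry candy, nan, nan",
    "berry juice, nan, nan, nan, nan",
])

# Split on commas into one column per item, then strip stray whitespace.
df_fruits = raw.str.split(",", expand=True).apply(lambda col: col.str.strip())

# Turn the literal text "nan" into a real missing value.
df_fruits = df_fruits.mask(df_fruits == "nan")

print(df_fruits)
```

After this step the missing cells are real NaN values, so string methods such as `.str.contains(..., na=False)` can skip them.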

How do you get the input dataframe in the first place? Are you reading it in from a file? It would probably be easier to construct your expected dataframe directly rather than dissect the input dataframe and reconstruct it.
– Mortz, Commented Aug 5, 2022 at 11:20
@Mortz I get the input dataframe from a csv file, so originally the rows are in string format: "apple pie, banana bread, orange juice, nan, nan" etc.
– heurn, Commented Aug 5, 2022 at 11:34

Umar.H · Accepted Answer · 2022-08-05 11:40:26Z

2

We can reshape the data using .stack, .str.extractall and .unstack:

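# One alternation pattern that matches any of the column names
pat = '|'.join(col_names)

# Long format: one entry per cell, indexed by (row, original column)
s = df.stack()

# Attach to every cell the list of names found in it (new column 0)
s1 = s.to_frame('vals').join(
      s.str.extractall(f'({pat})').groupby(level=[0,1]).agg(list))

# One row per (cell, name), then pivot the names into columns
out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True).unstack(-1)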

print(out)

            vals
0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

# To drop the outer 'vals' level of the column MultiIndex:
out.columns = out.columns.droplevel(None)

0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN
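
For anyone who wants to run this end to end, below is a self-contained sketch of the same stack and extractall idea. It is a variant rather than the exact code above: the sample frame is rebuilt from the question, the helper column names `row`, `col` and `fruit` are my own labels, and `pivot` replaces the `unstack` chain so that the result has flat columns.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame([
    ["apple pie", "banana bread", "orange juice", np.nan, np.nan],
    ["apple cookies", "orange lemonade", np.nan, np.nan, np.nan],
    ["banana muffin", "orange ice", "berry candy", np.nan, np.nan],
    ["berry juice", np.nan, np.nan, np.nan, np.nan],
])
col_names = ["apple", "banana", "orange", "berry", "lemonade"]
pat = "|".join(col_names)

# Long format: one line per non-missing cell, with explicit row/col labels.
long = (
    df.stack().dropna()
      .rename("vals")
      .rename_axis(["row", "col"])
      .reset_index()
)

# Every name found in every cell; a cell with two names yields two entries.
matches = (
    long["vals"].str.extractall(f"({pat})")[0]
        .rename("fruit")
        .reset_index(level="match", drop=True)
)

# Repeat a cell once per name it contains, then spread the names over columns.
long = long.join(matches).dropna(subset=["fruit"])
out = long.pivot(index="row", columns="fruit", values="vals")
out = out.reindex(columns=col_names)  # same column order as col_names

print(out)
```

As in the answer, "orange lemonade" lands in both the orange and the lemonade column. Note that `pivot` raises a ValueError if one row contains two cells that match the same name.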


5 Comments

Mohamed Thasin ah Over a year ago

Good attempt. orange lemonade present in both orange and lemonade :) +1

Umar.H Over a year ago

@MohamedThasinah I don't make the rules, just following OP logic, his output didn't align with the request

Mohamed Thasin ah Over a year ago

Yeah I can understand that. Just for fun

heurn Over a year ago

Thank you for the quick answer! I have tried to use your method and it works really well for my purpose, but I am stuck with "ValueError: Index contains duplicate entries, cannot reshape" with .unstack.

heurn Over a year ago

Managed to fix the bug by renaming the indexes and then grouping them before unstacking. Now it works, thanks!

out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True)
out.index.names = ['fruit_list', 'fruit_group']
out = out.groupby(['fruit_list', 'fruit_group']).vals.first().unstack(-1)
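
The error discussed in these comments appears when one row has two cells that match the same name. The sketch below reproduces it on a small, made-up long-format table (the names `row`, `fruit` and `vals` are illustrative) and shows two possible ways to resolve the duplicates before reshaping.

```python
import pandas as pd

# Hypothetical long-format data: row 0 has two cells that match "apple".
long = pd.DataFrame({
    "row":   [0, 0, 0, 1],
    "fruit": ["apple", "apple", "banana", "berry"],
    "vals":  ["apple pie", "apple cookies", "banana bread", "berry juice"],
})

indexed = long.set_index(["row", "fruit"])["vals"]

# unstack needs unique (row, fruit) pairs, so this raises ValueError.
try:
    indexed.unstack("fruit")
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Option 1: keep only the first match per (row, fruit).
first = long.groupby(["row", "fruit"])["vals"].first().unstack("fruit")

# Option 2: keep every match, joined into one string.
joined = long.groupby(["row", "fruit"])["vals"].agg(", ".join).unstack("fruit")

print(first)
print(joined)
```

Which option fits depends on whether later matches in a row can be discarded; `first()` silently drops them.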

Mouad Slimane · Accepted Answer · 2022-08-05 11:52:26Z

0

Try this:

# flatten every cell of the dataframe into one list
list_values = [item for value in df_fruits.values for item in value]
list_series = []
for col in col_names:
    # x * True keeps the string, x * False gives '' (NaN stays NaN)
    list_series.append(pd.Series([x * (col in str(x)) for x in list_values]))

The first line flattens all the values of your dataframe into a single list. Next, we create a pandas Series for every fruit type and append it to a list. Finally, we build a new dataframe from those series:

new_df=pd.concat(list_series,axis=1)


1 Comment

heurn Over a year ago

Thanks @to_data! This works otherwise except I lose the original row ids (e.g. apple pie on row 1, banana bread on row 2, etc).
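
One possible way to keep the original row ids is to search row by row instead of flattening the frame. This is a sketch of my own rather than part of the answer above: the helper `first_match` is hypothetical, and it assumes that only the first matching cell per row and name is wanted.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df_fruits = pd.DataFrame([
    ["apple pie", "banana bread", "orange juice", np.nan, np.nan],
    ["apple cookies", "orange lemonade", np.nan, np.nan, np.nan],
    ["banana muffin", "orange ice", "berry candy", np.nan, np.nan],
    ["berry juice", np.nan, np.nan, np.nan, np.nan],
])
col_names = ["apple", "banana", "orange", "berry", "lemonade"]


def first_match(row, fruit):
    """Return the first cell in the row whose text contains fruit, else NaN."""
    for cell in row.dropna():
        if fruit in cell.lower():
            return cell
    return np.nan


# One column per name; apply(axis=1) keeps the original row index.
new_df = pd.DataFrame({
    fruit: df_fruits.apply(first_match, axis=1, args=(fruit,))
    for fruit in col_names
})

print(new_df)
```

Because each column is computed per row, the index of `new_df` is the index of `df_fruits`, so "apple pie" stays on its original row.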
