1

How to rearrange my dataframe according to column names while searching for specific strings in cells?

My dataframe:

0 1 2 3 4
apple pie banana bread orange juice nan nan
apple cookies orange lemonade nan nan nan
banana muffin orange ice berry candy nan nan
berry juice nan nan nan nan

I want to arrange the rows according to a list of column names, which look for specific strings of text.

apple banana orange berry lemon
apple pie banana bread orange juice nan nan
apple cookies nan orange lemonade nan nan
nan banana muffin orange ice berry candy nan
nan nan nan berry juice nan

I have tried to create a column/list for each fruit, searching for the right string and adding the cell if it matches, however I do not know how to iterate through the dataframe and assign values. I just get a column of Nan's.

col_names = ['apple', 'banana', 'orange', 'berry', 'lemonade']
apples = np.where(df_fruits.str.contains("apple", case=False, na=False), df_fruits, np.nan)
bananas = np.where(df_fruits.str.contains("banana", case=False, na=False), df_fruits, np.nan)
etc...

Edit: I got the dataframe from a csv-file, so the original data format is in rows of string: "apple pie, banana bread, orange juice, nan, nan" etc.

2
  • How do you get the input dataframe in the first place? Are you reading it in from a file? It would probably be easier to construct your expected dataframe directly rather than dissect the input dataframe and reconstruct it Commented Aug 5, 2022 at 11:20
  • 1
    @Morzt I get the input dataframe from a csv file, so originally the rows are in string format: "apple pie, babana bread, orange juice, nan, nan" etc. Commented Aug 5, 2022 at 11:34

2 Answers 2

2

we can do some re-shaping using .unstack and .str.extractall

pat = '|'.join(col_names)

s = df.stack()

s1 = s.to_frame('vals').join(
      s.str.extractall(f'({pat})').groupby(level=[0,1]).agg(list))


out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True).unstack(-1)

print(out)

            vals
0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN

# if you want to drop the level on the multi index.
out.columns = out.columns.droplevel(None)

0          apple         banana        berry         lemonade           orange
0      apple pie   banana bread          NaN              NaN     orange juice
1  apple cookies            NaN          NaN  orange lemonade  orange lemonade
2            NaN  banana muffin  berry candy              NaN       orange ice
3            NaN            NaN  berry juice              NaN              NaN
Sign up to request clarification or add additional context in comments.

5 Comments

Good attempt. orange lemonade present in both orange and lemonade :) +1
@MohamedThasinah I don't make the rules, just following OP logic, his output didn't align with the request
Yeah I can understand that. Just for fun
Thank you for the quick answer! I have tried to use your method and it works really well for my purpose, but I am stuck with "ValueError: Index contains duplicate entries, cannot reshape" with .unstack.
Managed to fix the bug by renaming the indexes and then grouping them before unstacking. Now it works, thanks! out = s1.explode(0).set_index(0,append=True).reset_index(1,drop=True) out.index.names = ['fruit_list', 'fruit_group'] out = out.groupby(['fruit_list', 'fruit_group']).vals.first().unstack(-1)
0

Try this:

list_values=[item for value in df_fruits.values for item in value]
list_series=[] 
for col in col_names:
   locals()[col+"series"]=pd.Series(map(lambda x:x*(col in str(x)),list_values)
   list_series.append(eval(col+"series"))

the first row is the get all your dataframe colums values into a list next we create a pandas series for every fruit type and append it into a list after we create a new data frame

new_df=pd.concat(list_series,axis=1)

1 Comment

Thanks @to_data! This works otherwise except I lose the original row ids (e.g. apple pie on row 1, banana bread on row 2, etc).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.