How to extract specific content from a DataFrame based on a condition in Python

Question

Consider a pandas DataFrame with an ingredients_text column. Here are two example values of ingredients_text:

farine de blé 34% (france), pépites de chocolat 20g (ue) (sucre, pâte de cacao, beurre de cacao, émulsifiant lécithines (tournesol), arôme) (cacao : 44% minimum), matière grasse végétale (palme), sucre, 8,5% chocolat(sucre, pâte de cacao, cacao et cacao maigre en poudre) (cacao: 38% minimum), 5,5% éclats de noix de pécan (non ue), poudres à lever : diphosphates carbonates de sodium, blancs d’œufs, fibres d'acacia, lactose et protéines de lait, sel. dont lait.

oignon 18g oil hell: kartoffelstirke, milchzucker, maltodextrin, reismehl. 100g produkt enthalten: 1559KJ ,energie 369 kcal lt;0.5g lt;0.1g 909 fett davon gesättigte fettsāuren kohlenhydrate davon ,zucker 26g

I split the ingredients of each row into a list with the following code:

# Split each row's text on commas; assigning the whole column avoids chained indexing
df["ingredients_text"] = df["ingredients_text"].str.split(',')

Any idea how to extract the ingredients with % and g from the text into another column called 'ingredient'? For instance, the desired output should be:

['farine de blé 34%', 'pépites de chocolat 20g','cacao : 44%' ,'8,5% chocolat' ,'cacao: 38%', '5,5% éclats de noix de pécan']
['oignon 18g oil hell', '100g produkt enthalten', 'lt;0.5g', 'lt;0.1g' , '26g zucker']

What if g is present in the ingredient but doesn't represent grams, as in 'ginger'?
– Rajesh, commented Apr 16, 2021 at 14:36

Rajesh · Accepted Answer · 2021-04-16 14:52:25Z


import pandas as pd

df = pd.DataFrame({'ingredient_text': ['a%bgC, abc, a%, cg', 'xyx']})

      ingredient_text
0  a%bgC, abc, a%, cg
1                 xyx

Split the ingredients into a list

df['ingredient_text'] = df['ingredient_text'].str.split(',')

           ingredient_text
0  [a%bgC,  abc,  a%,  cg]
1                    [xyx]
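
A plain split on ',' keeps the leading spaces visible in the output above, and on text like the question's it would also cut a decimal comma such as '8,5%' in two. The following is only a sketch of one possible workaround, not part of the original answer: it assumes that a comma sitting between two digits is a decimal separator, and the sample row and the SEPARATOR name are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical sample row, loosely modelled on the question's data.
df = pd.DataFrame({'ingredient_text': ['sucre, 8,5% chocolat, sel']})

# Assumption: a comma between two digits is a decimal comma, not a separator.
# Match a comma that is not followed by a digit, or not preceded by one.
SEPARATOR = re.compile(r',(?!\d)|(?<!\d),')

# Split each row and strip the whitespace left around each piece.
df['ingredient_text'] = df['ingredient_text'].apply(
    lambda text: [part.strip() for part in SEPARATOR.split(text)]
)

print(df['ingredient_text'].tolist())
```

This heuristic would still mis-split a list such as '1,2,3', so it may need adjusting for real data.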

Search for your strings in the list

df['ingredient'] = df['ingredient_text'].apply(lambda x: [s for s in x if ('%' in s) or ('g' in s)])

           ingredient_text         ingredient
0  [a%bgC,  abc,  a%,  cg]  [a%bgC,  a%,  cg]
1                    [xyx]                 []



2 Comments

Yoss Over a year ago

Thank you for your help. It works for % but not for ingredients with g (grams). Can you help me, please? Here is what I tried: df['ingredient'] = df['ingredients_text'].apply(lambda x: [s for s in x if ('\d+\s*g' in s) or ('%' in s)])

Rajesh Over a year ago

The in operator only tests for a literal substring, so a regex pattern has no effect there. You can import re and use [s for s in x if ('%' in s) or (len(re.findall(r'\d+g', s)) > 0)]
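
Putting the comments together, the regex-based filter might look like the sketch below. It is one possible reading of the suggestion rather than the answerer's exact code: the sample rows and the QUANTITY name are hypothetical, and the pattern additionally assumes that a quantity is a number (optionally with a decimal point or comma) followed by '%' or by a standalone 'g', so that words like 'ginger' are skipped.

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the original post.
df = pd.DataFrame({'ingredients_text': [
    'farine de blé 34% (france), ginger, oignon 18g, sel',
    'sucre 26 g, eau',
]})

# A number, optional whitespace, then '%' or a 'g' that ends a word.
QUANTITY = re.compile(r'\d+(?:[.,]\d+)?\s*(?:%|g\b)')

# Split each row into a list, then keep only the items containing a quantity.
df['ingredients_text'] = df['ingredients_text'].str.split(',')
df['ingredient'] = df['ingredients_text'].apply(
    lambda items: [s.strip() for s in items if QUANTITY.search(s)]
)

print(df['ingredient'].tolist())
```

Because the pattern requires a digit before the 'g', items such as 'ginger' are dropped while '18g' and '26 g' are kept.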
