Consider a pandas DataFrame with an ingredients_text column.

Here are two example values of ingredients_text:

farine de blé 34% (france), pépites de chocolat 20g (ue) (sucre, pâte de cacao, beurre de cacao, émulsifiant lécithines (tournesol), arôme) (cacao : 44% minimum), matière grasse végétale (palme), sucre, 8,5% chocolat(sucre, pâte de cacao, cacao et cacao maigre en poudre) (cacao: 38% minimum), 5,5% éclats de noix de pécan (non ue), poudres à lever : diphosphates carbonates de sodium, blancs d’œufs, fibres d'acacia, lactose et protéines de lait, sel. dont lait.

oignon 18g oil hell: kartoffelstirke, milchzucker, maltodextrin, reismehl. 100g produkt enthalten: 1559KJ ,energie 369 kcal lt;0.5g lt;0.1g 909 fett davon gesättigte fettsāuren kohlenhydrate davon ,zucker 26g

I split the ingredients of each row on commas with the following code:

# Split every row on commas in one vectorized step; this avoids the
# chained assignment df[col][i] = ..., which may not write back to df
df["ingredients_text"] = df["ingredients_text"].str.split(',')

Any idea how to extract the ingredients that carry a % or g quantity from the text into another column called 'ingredient'? For instance, the desired output would be:

['farine de blé 34%', 'pépites de chocolat 20g','cacao : 44%' ,'8,5% chocolat' ,'cacao: 38%', '5,5% éclats de noix de pécan']
['oignon 18g oil hell', '100g produkt enthalten', 'lt;0.5g', 'lt;0.1g' , '26g zucker']
  • What if g is present in the ingredient but it doesn't really represent grams, like in 'ginger'? Commented Apr 16, 2021 at 14:36
  • The g needs to follow a digit, as in 10g. Commented Apr 18, 2021 at 22:00

1 Answer

import pandas as pd

df = pd.DataFrame({'ingredient_text': ['a%bgC, abc, a%, cg', 'xyx']})

      ingredient_text
0  a%bgC, abc, a%, cg
1                 xyx

Split the ingredients into a list

df['ingredient_text'] = df['ingredient_text'].str.split(',')
           ingredient_text
0  [a%bgC,  abc,  a%,  cg]
1                    [xyx]
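
One caveat with a plain comma split: the sample text in the question contains decimal commas such as "8,5% chocolat", which would be cut in two. A minimal sketch of a split that leaves a comma alone when it sits directly between two digits; the helper name split_ingredients is hypothetical, and the rule is an assumption that would also glue together two list items if one ends with a digit and the next starts with one:

```python
import re

def split_ingredients(text):
    """Split on commas, except a comma directly between two digits."""
    # A comma splits if it is NOT preceded by a digit, or NOT followed by one.
    parts = re.split(r"(?<!\d),|,(?!\d)", text)
    # Drop surrounding whitespace and empty fragments.
    return [p.strip() for p in parts if p.strip()]

print(split_ingredients("sucre, 8,5% chocolat, sel"))
# ['sucre', '8,5% chocolat', 'sel']
```

With the DataFrame above, it could be applied as df['ingredient_text'].apply(split_ingredients) in place of .str.split(',').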

Search for your strings in the list

df['ingredient'] = df['ingredient_text'].apply(lambda x: [s for s in x if ('%' in s) or ('g' in s)])

           ingredient_text         ingredient
0  [a%bgC,  abc,  a%,  cg]  [a%bgC,  a%,  cg]
1                    [xyx]                 []

2 Comments

Thank you for your help. It works for % but not for the ingredients with g (grams). Can you help me please? I tried: df['ingredient'] = df['ingredients_text'].apply(lambda x: [s for s in x if ('\d+\s*g' in s) or ('%' in s)])
The in operator only tests for a literal substring, so it cannot evaluate a regex pattern. You can import re and use [s for s in x if ('%' in s) or (len(re.findall(r'\d+g', s)) > 0)]
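
The suggestion in the comment above can be sketched as a small, self-contained filter. The helper name keep_quantified is hypothetical. The optional whitespace (\s*) comes from the asker's own attempt, and the trailing word boundary (\b) is an added assumption so that "10 grammes" or "ginger" are not treated as gram quantities:

```python
import re

# Assumed pattern: digits, optional spaces, then a standalone 'g'
GRAMS = re.compile(r"\d+\s*g\b")

def keep_quantified(items):
    """Return the items that contain '%' or a number followed by 'g'."""
    return [s.strip() for s in items if "%" in s or GRAMS.search(s)]

text = "farine de blé 34% (france), ginger, pépites de chocolat 20g (ue), sucre"
print(keep_quantified(text.split(",")))
# ['farine de blé 34% (france)', 'pépites de chocolat 20g (ue)']
```

On the DataFrame from the answer, where each row already holds a list, this could be applied as df['ingredient'] = df['ingredient_text'].apply(keep_quantified).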
