1

I have the following DataFrame:

import pandas as pd

test = {'title': ['Undeclared milk in Burnbrae', 'Undeclared milk in certain Bumble', 'Certain cheese products may contain listeria', 'Ocean brand recalled due to Salmonella', 'IQF Raspberries due to Listeria']}
example = pd.DataFrame(test)
example
    title
0   Undeclared milk in Burnbrae
1   Undeclared milk in certain Bumble
2   Certain cheese products may contain listeria
3   Ocean brand recalled due to Salmonella
4   IQF Raspberries due to Listeria

I want to extract the hazard from each title into a new column of the same DataFrame. The result should look like this:

test = {'hazard': ['Undeclared milk', 'Undeclared milk', 'listeria', 'Salmonella', 'Listeria'], 'title': ['Undeclared milk in Burnbrae', 'Undeclared milk in certain Bumble', 'Certain cheese products may contain listeria', 'Ocean brand recalled due to Salmonella', 'IQF Raspberries due to Listeria']}
example2 = pd.DataFrame(test)
example2
            hazard                                         title
0  Undeclared milk                   Undeclared milk in Burnbrae
1  Undeclared milk             Undeclared milk in certain Bumble
2         listeria  Certain cheese products may contain listeria
3       Salmonella        Ocean brand recalled due to Salmonella
4         Listeria               IQF Raspberries due to Listeria

Essentially, my separators are "in", "may contain", and "due to".


example['hazard'] = example['title'].str.extract(r'^(.*?) in\b')
example['hazard'] = example['title'].str.extract(r'\b may contain (.*)$')
example['hazard'] = example['title'].str.extract(r'\b due to (.*)$')

I wrote the code above to test each separator individually, but I would like to extract all of them into the same column.

How can I do this?

I appreciate all the help

2 Answers

3

You can put your separator patterns in a list and combine them with "|".join into a single alternation pattern. Series.str.extract then returns one column per capture group, and stacking those columns and dropping the extra index level reshapes the result to match the original Series.

separators = [r"^(.*?) in\b", r"\b may contain (.*)$", r"\b due to (.*)$"]
sep_pattern = r"|".join(separators)

example["hazard"] = (example["title"].str.extract(sep_pattern)
                       .stack()
                       .droplevel(1))

print(example)
                                          title           hazard
0                   Undeclared milk in Burnbrae  Undeclared milk
1             Undeclared milk in certain Bumble  Undeclared milk
2  Certain cheese products may contain listeria         listeria
3        Ocean brand recalled due to Salmonella       Salmonella
4               IQF Raspberries due to Listeria         Listeria
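
As a variation on the reshaping step above, here is a minimal sketch that collapses the capture-group columns with bfill instead of stack()/droplevel(1). This is an alternative I am suggesting, not part of the original snippet; it assumes at most one group matches per row, and it avoids relying on stack() dropping null values.

```python
import pandas as pd

example = pd.DataFrame({"title": [
    "Undeclared milk in Burnbrae",
    "Undeclared milk in certain Bumble",
    "Certain cheese products may contain listeria",
    "Ocean brand recalled due to Salmonella",
    "IQF Raspberries due to Listeria",
]})

sep_pattern = r"^(.*?) in\b|\b may contain (.*)$|\b due to (.*)$"

# One column per capture group; only the group that matched is non-null.
groups = example["title"].str.extract(sep_pattern)

# Back-fill across columns so the first column holds the matched group,
# then keep only that column. Rows with no match at all stay null.
example["hazard"] = groups.bfill(axis=1).iloc[:, 0]

print(example)
```

Printing `groups` before the last step is a quick way to see which separator matched on each row.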

3 Comments

It's returning a null value in all rows because it is not joining the "|" to the list of strings, so it's not recognizing the "or" condition.
Just re-tested this snippet and it still works for me on pandas 1.1.5. Not sure why it's giving you nulls.
I updated my pandas version and it works now, weird. Do you know why writing example['title'].str.extract(r'^(.*?) in\\b|\\b may contain (.*)$|\\b due to (.*)$') instead doesn't work? Thank you for the help!
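
One possible explanation for the last comment, assuming the pattern was typed exactly as shown (doubled backslashes inside a raw string): in a raw string, `\\b` is an escaped backslash followed by the letter "b", not the word-boundary token `\b`. A small sketch with the standard re module, using one of the titles from the question:

```python
import re

title = "Ocean brand recalled due to Salmonella"

# In a raw string, \b reaches the regex engine as the word-boundary token.
print(re.search(r"\b due to (.*)$", title).group(1))  # Salmonella

# In a raw string, \\b is a literal backslash followed by "b", so the regex
# looks for a backslash character in the title and finds none.
print(re.search(r"\\b due to (.*)$", title))  # None
```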
2

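A more first-principles approach that gets the same outcome:

import re
import numpy as np

def func(s: str):
    # Try each separator pattern; the first one that matches wins below.
    check1 = re.search(r'^(.*?) in\b', s)
    check2 = re.search(r'\b may contain (.*)$', s)
    check3 = re.search(r'\b due to (.*)$', s)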
    if check1:
        return check1.group(1)
    elif check2:
        return check2.group(1)
    elif check3:
        return check3.group(1)
    else:
        return np.nan

example["hazard"] = example["title"].apply(func)
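
If more separators get added later, the same idea can be written as a loop over precompiled patterns. This is only a sketch of one way to generalize the function above; the names PATTERNS and extract_hazard are hypothetical and not from the original answer.

```python
import re

import numpy as np
import pandas as pd

# Patterns are tried in order; the first one that matches wins.
PATTERNS = [
    re.compile(r"^(.*?) in\b"),
    re.compile(r"\b may contain (.*)$"),
    re.compile(r"\b due to (.*)$"),
]


def extract_hazard(title: str):
    """Return the captured hazard text, or NaN if no separator matches."""
    for pattern in PATTERNS:
        match = pattern.search(title)
        if match:
            return match.group(1)
    return np.nan


example = pd.DataFrame({"title": [
    "Undeclared milk in Burnbrae",
    "Undeclared milk in certain Bumble",
    "Certain cheese products may contain listeria",
    "Ocean brand recalled due to Salmonella",
    "IQF Raspberries due to Listeria",
]})
example["hazard"] = example["title"].apply(extract_hazard)
print(example)
```

Because the loop returns on the first match, later patterns are not evaluated once a hazard has been found.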
