How to extract strings from a list in a column in a python pandas dataframe?

Question

Let's say I have a list

lst = ["fi", "ap", "ko", "co", "ex"]

and we have this series

       Explanation 

a      "fi doesn't work correctly" 
b      "apples are cool" 
c      "this works but translation is ko"

and I'm looking to get something like this:

        Explanation                         Explanation Extracted

a      "fi doesn't work correctly"          "fi"
b      "apples are cool"                    "N/A"
c      "this works but translation is ko"   "ko"

What would be the return for "fi ex"?

– ramzeek, Commented Apr 2, 2022 at 21:32

Timus · Accepted Answer · 2022-04-04 08:24:30Z

1

With a dataframe like

import pandas as pd

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)

you can use .str.extract() to do

lst = ["fi", "ap", "ko", "co", "ex"]

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)

to get

                        Explanation Explanation Extracted
a      fi doesn't co work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko

The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for one of the lst items at the beginning of the string with whitespace after it, in the middle with whitespace before and after it, or at the end with whitespace before it. str.extract() returns the capture group (the part in the middle in ()) of the first match. Without a match, the result is NaN.
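To see the three cases on plain strings, here is a minimal sketch that uses Python's re module directly instead of pandas. The helper first_match and the sentence "this co works" are made up for illustration; the other sentences come from the question.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"

def first_match(text):
    # Hypothetical helper: returns the capture group of the first match, or None.
    m = re.search(pattern, text)
    return m.group(1) if m else None

at_start = first_match("fi doesn't work correctly")         # item at the beginning
in_middle = first_match("this co works")                    # made-up sentence, item in the middle
at_end = first_match("this works but translation is ko")    # item at the end
no_match = first_match("apples are cool")                   # "ap" is only part of a word
```

Here "apples" and "cool" do not match because the list item is followed by further letters instead of whitespace or the end of the string.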

If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)
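One caveat, sketched with plain re under the assumption that two list items can stand directly next to each other (as in the "fi ex" case raised in the comment on the question): the pattern consumes the whitespace after a match, so an item that follows immediately has no whitespace left in front of it and is skipped. The lookaround variant below is a swapped-in alternative, not the pattern from this answer; lookarounds check the neighbouring characters without consuming them.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
alternatives = "|".join(lst)

# Pattern from the answer: the trailing \s+ consumes the separating whitespace.
consuming = r"(?:^|\s+)(" + alternatives + r")(?:\s+|$)"
# Assumption: the same whitespace-delimited idea expressed with lookarounds.
lookaround = r"(?<!\S)(?:" + alternatives + r")(?!\S)"

adjacent_consuming = re.findall(consuming, "fi ex")    # the second item is skipped
adjacent_lookaround = re.findall(lookaround, "fi ex")  # both items are found
```

The lookaround pattern string could be passed to df.Explanation.str.findall() in the same way as pattern above.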

Alternative without regex:

df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)
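A possible variation on this idea, as a sketch only: instead of converting the index to a categorical so that rows without matches survive the groupby, the grouped result can be reindexed against the original index. This assumes the index labels are unique. It also joins the matches in sentence order and keeps repeated words, whereas agg(set) drops duplicates and does not guarantee an order.

```python
import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"],
)

# One row per word, keeping the original index label for each word.
words = df["Explanation"].str.split().explode()
matches = words[words.isin(lst)]

# Join the matches per original row; rows without matches are absent here.
joined = matches.groupby(level=0).agg(", ".join)

# Reindexing brings the missing rows back as NaN.
df["Explanation Extracted"] = joined.reindex(df.index)
```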

If you only want to match at the beginning or end of the sentences, then replace the first part with:

df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...

edited Apr 4, 2022 at 8:24

answered Apr 2, 2022 at 21:07

Timus


2 Comments

Jon Clements Over a year ago

Always best to re.escape here just in case. And... sort by descending length of the search strings so the most complete match comes first in case of overlaps.

Timus Over a year ago

@JonClements Thanks! The items in the list didn't look like they need escaping, but you are right. Regarding the sorting: I don't think it matters here, since the parts are embedded in string-beginning/end and whitespace (I've run some tests and they seem to confirm my reasoning)?
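The two suggestions from this exchange could be combined roughly as follows. This is a sketch, not code from either commenter: build_pattern is a hypothetical helper, and "a.b" is a made-up list item chosen because it contains a regex metacharacter.

```python
import re

def build_pattern(items):
    # Hypothetical helper: escape each item and try longer items first.
    ordered = sorted(items, key=len, reverse=True)
    alternatives = "|".join(re.escape(item) for item in ordered)
    return r"(?:^|\s+)(" + alternatives + r")(?:\s+|$)"

items = ["a.b", "fi"]  # made-up list; "." is a regex metacharacter

# Without escaping, "." matches any character, so "axb" is a false hit.
unescaped = r"(?:^|\s+)(" + "|".join(items) + r")(?:\s+|$)"
wrong_hit = re.search(unescaped, "axb is fine")

# With escaping, only the literal text "a.b" matches.
no_hit = re.search(build_pattern(items), "axb is fine")
right_hit = re.search(build_pattern(items), "a.b is fine")
```

With the whitespace-delimited wrapper, the sort order should not change the result, as argued above, because the regex engine backtracks to another alternative when the trailing whitespace check fails. Sorting is a harmless precaution that starts to matter if the delimiters around the group are ever dropped.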

Nipuna Upeksha · Accepted Answer · 2022-04-02 21:06:44Z

0

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])

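extracted = []
for index, row in df.iterrows():
    # Collect every whitespace-separated word of this row that is in lst
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    # Join multiple matches with commas, or fall back to 'N/A'
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')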

df['Explanation Extracted'] = extracted

answered Apr 2, 2022 at 21:06

Nipuna Upeksha


Uchiha012 · Accepted Answer · 2022-04-02 21:12:25Z

0

The pandas apply function might be helpful:

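def extract_explanation(row):
    # With axis=1, apply passes one row of the dataframe at a time
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # The column is named 'Explanation' (capital E)
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string  # a later match overwrites an earlier one
    return explanation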

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

The catch is that this assumes only one explanation per row. If multiple explanations are expected, the function can collect the matches into a list instead.
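A minimal sketch of that list-returning variant, written as a plain function on one string so it can be tried without a dataframe. The name extract_explanations and its allowed parameter are hypothetical.

```python
def extract_explanations(text, allowed=("fi", "ap", "ko", "co", "ex")):
    # Hypothetical helper: keep every whitespace-separated word found in `allowed`,
    # in the order the words appear in the sentence.
    return [word for word in text.split() if word in allowed]

several = extract_explanations("fi doesn't co work correctly")
none_found = extract_explanations("apples are cool")
```

It could be applied per row with df['Explanation'].apply(extract_explanations). Rows without a match get an empty list, which can be mapped to "N/A" afterwards if that is the wanted placeholder.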

answered Apr 2, 2022 at 21:12

Uchiha012


Gonçalo Peres · Accepted Answer · 2022-10-07 09:55:44Z

Option 1

Assuming that one wants to extract the exact strings in the list lst, one can start by creating a regex

regex = f'\\b({"|".join(lst)})\\b'

where \b is a word boundary: the position between a word character (letter, digit or underscore) and a non-word character, or the start or end of the string. It ensures that the match is neither preceded nor followed by further word characters. So, with the string ap in the list lst, the word apple in the dataframe is not matched.

Then pass the regex to pandas.Series.str.extract, with flags=re.IGNORECASE to make the match case insensitive

import re

df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   NaN
2   3  this works but translation is ko                    ko
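How \b differs from a whitespace-delimited pattern can be checked with plain re. The sketch below uses made-up sentences and a hypothetical helper first_group; the whitespace-delimited pattern is included only for comparison.

```python
import re

lst = ["fi", "ap", "ko", "co", "ex"]
word_boundary = f'\\b({"|".join(lst)})\\b'
whitespace_only = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"

def first_group(pattern, text):
    # Hypothetical helper: capture group of the first case-insensitive match, or None.
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.group(1) if m else None

# Made-up sentences to show the difference.
punct_boundary = first_group(word_boundary, "translation is KO.")      # "." is a boundary
punct_whitespace = first_group(whitespace_only, "translation is KO.")  # "." is not whitespace
hyphen_boundary = first_group(word_boundary, "the co-op works")        # "-" is a boundary too
inside_word = first_group(word_boundary, "apples are cool")            # no boundary inside a word
```

So \b also accepts items next to punctuation or hyphens, and the extracted text keeps the casing found in the sentence. Whether that is desirable depends on the data.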

Option 2

One can also use pandas.Series.apply with a custom lambda function.

df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   N/A
2   3  this works but translation is ko                    ko

Notes:

.lower() makes the comparison case insensitive.
.split() restricts the comparison to whole words, so although ap is in the list, a word like apple does not put ap in the Explanation Extracted column.
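One detail of the lambda worth knowing: when a sentence contains several list items, it returns the one that comes first in lst, not the one that comes first in the sentence. The sketch below contrasts the two; first_in_text_order is a hypothetical alternative and the sample sentence is made up.

```python
lst = ["fi", "ap", "ko", "co", "ex"]

def first_in_list_order(text):
    # Same logic as the lambda in Option 2.
    return next((i for i in lst if i.lower() in text.lower().split()), 'N/A')

def first_in_text_order(text):
    # Assumption: the earliest matching word in the sentence is wanted instead.
    lowered = [i.lower() for i in lst]
    return next((w for w in text.lower().split() if w in lowered), 'N/A')

sample = "ko is fine but fi is not"  # made-up sentence with two list items
by_list = first_in_list_order(sample)
by_text = first_in_text_order(sample)
```

Note that first_in_text_order returns the lowercased word from the sentence, while the lambda returns the item as it is written in lst.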
