0

Let's say I have a list

lst = ["fi", "ap", "ko", "co", "ex"]

and we have this series

       Explanation 

a      "fi doesn't work correctly" 
b      "apples are cool" 
c      "this works but translation is ko" 

and I'm looking to get something like this:

        Explanation                         Explanation Extracted

a      "fi doesn't work correctly"          "fi"
b      "apples are cool"                    "N/A"
c      "this works but translation is ko"   "ko"
1
  • What would be the return for "fi ex"? Commented Apr 2, 2022 at 21:32

4 Answers 4

1

With a dataframe like

df = pd.DataFrame(
    {"Explanation": ["fi doesn't co work correctly",
                     "apples are cool",
                     "this works but translation is ko"]},
    index=["a", "b", "c"]
)

you can use .str.extract() to do

lst = ["fi", "ap", "ko", "co", "ex"]

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)

to get

                        Explanation Explanation Extracted
a      fi doesn't co work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko

The regex pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" looks for an occurrence of one of the lst items either at the beginning with withespace afterwards, in the middle with whitespace before and after, or at the end with withespace before. str.extract() extracts the capture group (the part in the middle in ()). Without a match the return is NaN.

If you want to extract multiple matches, you could use .str.findall() and then ", ".join the results:

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern).str.join(", ").replace({"": None})
)

Alternative without regex:

df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0).agg(set).str.join(", ").replace({"": None})
)

If you only want to match at the beginning or end of the sentences, then replace the first part with:

df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...
Sign up to request clarification or add additional context in comments.

2 Comments

Always best to re.escape here just in case. And... sort by descending length of the search strings so the most complete match comes first in case of overlaps.
@JonClements Thanks! The items in the list didn't look like they need escaping, but you are right. Regarding the sorting: I don't think it matters here, since the parts are embedded in string-beginning/end and whitespace (I've run some tests and they seem to confirm my reasoning)?
0

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame([["fi doesn't work correctly"],["apples are cool"],["this works but translation is ko"]],columns=["Explanation"])

extracted =[] 
for index, row in df.iterrows():
    tempList =[] 
    rowSplit = row['Explanation'].split(" ")
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList)>0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')

df['Explanation Extracted'] = extracted

Comments

0

apply function of Pandas might be helpful

def extract_explanation(dataframe):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    substrings = dataframe['explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation

df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

The catch here is assumption of only one explanation, but it can be converted into a list, if multiple explanations are expected.

Comments

0

Option 1

Assuming that one wants to extract the exact string in the list lst one can start by creating a regex

regex = f'\\b({"|".join(lst)})\\b'

where \b is the word boundary (beginning or end of a word) that indicates the word is not followed by additional characters, or with characters before. So, considering that one has the string ap in the list lst, if one has the word apple in the dataframe, that won't be considered.

And then, using pandas.Series.str.extract, and, to make it case insensitive, use re.IGNORECASE

import re

df['Explanation Extracted'] = df['Explanation'].str.extract(regex, flags=re.IGNORECASE, expand=False)

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   NaN
2   3  this works but translation is ko                    ko

Option 2

One can also use pandas.Series.apply with a custom lambda function.

df['Explanation Extracted'] = df['Explanation'].apply(lambda x: next((i for i in lst if i.lower() in x.lower().split()), 'N/A'))

[Out]:
   ID                       Explanation Explanation Extracted
0   1         fi doesn't work correctly                    fi
1   2                 cap ples are cool                   N/A
2   3  this works but translation is ko                    ko

Notes:

  • .lower() is to make it case insensitive.

  • .split() is one way to prevent that even though ap is in the list, the string apple doesn't appear in the Explanation Extracted column.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.