I have a list of names and a DataFrame with a column of free-form text. I want to scan that column and, whenever a row's text contains a string from the list, put the matching string in an additional column of the DataFrame.

So far I have only found ways to get a binary (0/1) or True/False result in the additional column.

  import pandas as pd

  sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
  data = {'text': ['need help with AAAA system requesting help', 'AD-12 crashed, need support',
                   'fuel system down', '/BBBB needs refresh']}

  df = pd.DataFrame(data)

with the end result being

                text                                System
0   need help with AAAA system requesting help      AAAA
1   AD-12 crashed, need support                     AD-12
2   fuel system down                                  0
3   /BBBB needs refresh                             BBBB

I have tried

  # which gives True or False values
  pattern = '|'.join(sys_list)
  df['System'] = df['text'].str.contains(pattern)

  # which gives 0 or 1
  df['System'] = [int(any(w in sys_list for w in x.split())) for x in df['text']]

2 Answers

import pandas as pd

sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
data = {'text': ['need help with AAAA system requesting help', 'AD-12 crashed, need support', 'fuel system down', '/BBBB needs refresh']}
df = pd.DataFrame(data)

def f(s):
    # Return the first symbol from sys_list found anywhere in s (substring match).
    for symbol in sys_list:
        if symbol in s:
            return symbol
    return 0  # no symbol found

df['System'] = df.text.apply(f)
print(df)

prints

                                         text System
0  need help with AAAA system requesting help   AAAA
1                 AD-12 crashed, need support  AD-12
2                            fuel system down      0
3                         /BBBB needs refresh   BBBB

Remark: this returns only the first symbol in sys_list that occurs in a string, i.e. it assumes that at most one symbol occurs in each text.
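
If several systems can show up in the same text, one possible variation is to collect every symbol that occurs instead of stopping at the first. This is only a sketch: the fourth row of the sample data, the helper name find_all and the column name Systems are made up for illustration.

```python
import pandas as pd

sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A']
df = pd.DataFrame({'text': [
    'need help with AAAA system requesting help',
    'AD-12 crashed, need support',
    'fuel system down',
    'AAAA and /BBBB need refresh',  # hypothetical row mentioning two systems
]})

def find_all(s):
    # Keep every symbol that occurs in s (substring match), in sys_list order.
    return [symbol for symbol in sys_list if symbol in s]

df['Systems'] = df['text'].apply(find_all)
print(df)
```

Rows without any match get an empty list, so the column keeps one type throughout.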


1 Comment

Is there a way to separate the words in the text and look for an exact match? Some of the text uses a word that contains the system value but is not an exact match (e.g. 'the system is DAMAged' when there is a system named DAMA); this approach matches the name even inside a word.

Slightly modifying your second example using the walrus operator := (Python 3.8+):

df["System"] = [
    word
    if any((word := ww) in w for w in x.split() for ww in sys_list)
    else "N/A"
    for x in df["text"]
]


print(df)

Prints:

                                         text System
0  need help with AAAA system requesting help   AAAA
1                 AD-12 crashed, need support  AD-12
2                            fuel system down    N/A
3                         /BBBB needs refresh   BBBB

1 Comment

Is there a way to separate the words in the text and look for an exact match? Some of the text uses a word that contains the system value but is not an exact match (e.g. 'THE SYSTEM IS DAMAGED' when there is a system named DAMA); this approach matches the name even inside a word.
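
One possible way to get the exact-match behaviour asked about in the comment above is a regular expression with lookarounds, so a name only counts when it is not glued to other word characters. This is a sketch rather than a tested solution for every dataset: the DAMA entry and the extra rows are made up for illustration, and it assumes matching should be case-sensitive.

```python
import re
import pandas as pd

sys_list = ['AAAA', 'BBBB', 'AD-12', 'B31-A', 'DAMA']  # DAMA added for illustration
df = pd.DataFrame({'text': [
    'need help with AAAA system requesting help',
    'AD-12 crashed, need support',
    'THE SYSTEM IS DAMAGED',   # contains DAMA only inside a longer word
    '/BBBB needs refresh',
    'DAMA is offline',         # hypothetical row with a real DAMA mention
]})

# Escape each name so characters like '-' are taken literally,
# and try longer names first in case one name is a prefix of another.
names = sorted(sys_list, key=len, reverse=True)
alternatives = '|'.join(re.escape(name) for name in names)

# (?<!\w) and (?!\w) require that no letter, digit or underscore
# touches the name on either side.
pattern = rf'(?<!\w)({alternatives})(?!\w)'

# str.extract returns the first captured match per row, NaN when there is none.
df['System'] = df['text'].str.extract(pattern, expand=False).fillna('N/A')
print(df)
```

The lookarounds are used instead of \b so the pattern would also behave sensibly for names that start or end with a non-word character. Punctuation next to a name, as in '/BBBB', does not block the match.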
