4

I have the following list:

search_list = ['STEEL','IRON','GOLD','SILVER']

which I need to search within a dataframe (df):

      a    b             
0    123   'Blah Blah Steel'
1    456   'Blah Blah Blah'
2    789   'Blah Blah Gold'

and insert the matching rows into a new dataframe (newdf), adding a new column with the matching word from the list:

      a    b                   c
0    123   'Blah Blah Steel'   'STEEL'
1    789   'Blah Blah Gold'    'GOLD'

I can use the following code to extract the matching row:

newdf=df[df['b'].str.upper().str.contains('|'.join(search_list),na=False)]

but I can't figure out how to add the matching word from the list into column c.

I'm thinking that the match somehow needs to capture the index of the matching word in the list and then pull the value using the index number but I can't figure out how to do this.

Any help or pointers would be greatly appreciated

Thanks

7 Answers 7

6

You could use extract and filter out those that are nan (i.e. no match):

search_list = ['STEEL','IRON','GOLD','SILVER']

df['c'] = df.b.str.extract('({0})'.format('|'.join(search_list)), flags=re.IGNORECASE)
result = df[~pd.isna(df.c)]

print(result)

Output

              a       b      c
123 'Blah  Blah  Steel'  Steel
789 'Blah  Blah   Gold'   Gold

Note that you have to import the re module in order to use the re.IGNORECASE flag. As an alternative you could use 2 directly that is the value of the re.IGNORECASE flag.

UPDATE

As mentioned by @user3483203 you can save the import by using:

df['c'] = df.b.str.extract('(?i)({0})'.format('|'.join(search_list)))
Sign up to request clarification or add additional context in comments.

3 Comments

You don't need the extra import, df.b.str.extract('(?i)({0})'.format('|'.join(search_list))) works just fine
Brilliant. Thanks
If the first record said, say, 'Blah Blah Steel Gold', how might you report both of those words in column c? I am currently having this issue where I am hitting multiple words in my search list and would like to record all of them, preferably separated by a comma.
5

You can use set.intersection to find which words appear in column b:

search_list = set(['STEEL','IRON','GOLD','SILVER'])
df['c'] = df['b'].apply(lambda x: set.intersection(set(x.upper().split(' ')), search_list))

Output:

     a                b        c
0  123  Blah Blah Steel  {STEEL}
1  456   Blah Blah Blah       {}
2  789   Blah Blah Gold   {GOLD}

Use df[df['c'].astype(bool)] if you want to get rid of rows without a match

     a                b        c
0  123  Blah Blah Steel  {STEEL}
2  789   Blah Blah Gold   {GOLD}

Comments

2

One way to do it is

def get_word(my_string):
    for word in search_list:
         if word.lower() in my_string.lower():
               return word
    return None

new_df["c"]= new_df["b"].apply(get_word)

You can also do something along the lines of

new_df["c"]= new_df["b"].apply(lambda my_string: [word for word in search_list if word.lower() in my_string.lower()][0])

With the first one, you have the option of adding column c to df first, and then filtering out the Nones, while the second one will throw an error if b doesn't contain any of the words.

You can also see this question: Get the first item from an iterable that matches a condition

Applying the method from the top rated answer would give

new_df["c"]= new_df["b"].apply(lambda my_string: next(word for word in search_list if word.lower() in my_string.lower())

Comments

2

You can also do:

import pandas as pd

search_list = ('STEEL','IRON','GOLD','SILVER')

df = pd.DataFrame({'a':[123,456,789],'b':['blah blah Steel','blah blah blah','blah blah Gold']})

df.assign(c = df['b'].apply(lambda x: [j for j in x.split() if j.upper() in search_list]))

Updated for speed


import pandas as pd

search_list = set(['STEEL','IRON','GOLD','SILVER'])

df = pd.DataFrame({'a':[123,456,789],'b':['blah blah Steel','blah blah blah','blah blah Gold']})

df.assign(c = lambda d: d['b'].str.upper().str.split().map(lambda x: set(x).intersection(search_list)))

Results: enter image description here

7 Comments

Using this answer, I am getting an error "'float' object has no attribute 'split'"-Any idea why this might be?
Substitute x.split with str(x).split
When I make that substitution, I now get the error TypeError: 'builtin_function_or_method' object is not iterable.
What data type is your column data and is there missing values?
There is one blank value which gets automatically populated with nan upon import. Should I populate it with something else? When I do df.dtypes, column 'b' says 'object'. The only manipulation done to that column is str.upper() upon import.
|
1

Using

s=pd.DataFrame(df.b.str.upper().str.strip("'").str.split(' ').tolist())
s.where(s.isin(search_list),'').sum(1)
Out[492]: 
0    STEEL
1         
2     GOLD
dtype: object
df['New']=s.where(s.isin(search_list),'').sum(1)
df
Out[494]: 
     a                  b    New
0  123  'Blah Blah Steel'  STEEL
1  456   'Blah Blah Blah'       
2  789   'Blah Blah Gold'   GOLD

Comments

1

Here, the solution with final result like your display:

search_list = ['STEEL','IRON','GOLD','SILVER']

def process(x):
    for s in search_list:
        if s in x['b'].upper(): print("'"+ s +"'");return "'"+ s +"'"
    return ''

df['c']= df.apply(lambda x: process(x),axis=1)
df = df.drop(df[df['c'] == ''].index).reset_index(drop=True)

print(df)

output:

     a                 b        c
0  123  'Blah Blah Steel  'STEEL'
1  789  'Blah Blah Gold'   'GOLD'

1 Comment

Would there be a way to modify this function so that is there are multiple words from search_list in column b, all of them would be returned in column c, say comma separated?
1

You can use:

search_list = ['STEEL','IRON','GOLD','SILVER']
pat = r'\b|\b'.join(search_list)
pat2 = r'({})'.format('|'.join(search_list))

df_new= df.loc[df.b.str.contains(pat,case=False,na=False)].reset_index(drop=True)
df_new['new_col']=df_new.b.str.upper().str.extract(pat2)
print(df_new)

     a                  b new_col
0  123  'Blah Blah Steel'   STEEL
1  789   'Blah Blah Gold'    GOLD

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.