0

I have a question about matching the strings in a list against a column in a DataFrame.

I read the question "Check if String in List of Strings is in Pandas DataFrame Column" and understand it, but my need is a little different.

Code:

import numpy as np
import pandas as pd

Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4', np.nan],
        'Price': [22000, 25000, 27000, 35000, 29000],
        'Liscence Plate': ['ABC 123', 'XYZ 789', 'CBA 321', 'ZYX 987', 'DEF 456']}

df = pd.DataFrame(Cars, columns=['Brand', 'Price', 'Liscence Plate'])

search_for_these_values = ['Honda', 'Toy', 'Ford Focus', 'Audi A4 2019']
pattern = '|'.join(search_for_these_values)

df['Match'] = df["Brand"].str.contains(pattern, na=False)
print(df)

Output I get:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789   True
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987  False
4             NaN  29000        DEF 456  False

Output I want:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
  • I guess your logic is the reverse (i.e. the target data must be in any element of the list). But then how did Honda Civic get True? Commented Dec 6, 2021 at 8:14
  • If any one word matches, then it is True. For example, Honda is in Honda Civic, and Audi is in Audi A4. But Toy does not match Toyota because it is not a whole word. Commented Dec 6, 2021 at 8:23

2 Answers

1

One way using word match:

pat = "|".join(search_for_these_values).replace(" ", "|")
match = df["Brand"].str.findall(r"\b(%s)\b" % pat)

Output:

0          [Honda]
1               []
2    [Ford, Focus]
3       [Audi, A4]
4              NaN
Name: Brand, dtype: object

You can then assign it back as a boolean column (rows with at least one matched word become True):

df["match"] = match.str.len().ge(1)

Final output:

            Brand  Price Liscence Plate  match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
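If the search strings might contain regex metacharacters, a slightly more defensive variant of the same word-match idea could look like the sketch below. It is only an illustration: it rebuilds a trimmed copy of the question's data, escapes each word with `re.escape`, and uses `str.contains` with a non-capturing group instead of `findall`. Note that `\b` only behaves as a whole-word anchor when each word starts and ends with a letter, digit, or underscore.

```python
import re

import numpy as np
import pandas as pd

# Trimmed copy of the question's data (only the column that is searched).
df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", np.nan]}
)
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Split every search string into individual words and escape each one,
# so characters such as "+" or "." would be matched literally.
words = {word for value in search_for_these_values for word in value.split()}
pat = "|".join(re.escape(word) for word in sorted(words))

# (?:...) is a non-capturing group; \b anchors the alternatives to whole words.
# na=False turns the missing Brand into False instead of NaN.
df["match"] = df["Brand"].str.contains(rf"\b(?:{pat})\b", na=False)
print(df)
```

As in the original snippet, the word list still contains "2019", so a Brand that includes that year as a separate word would also be flagged.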

0

If we use the rule you outlined, 'If one word is true, then true', then a row in the Brand column that contains '2019' would also return True, which I believe is not what we want.

That said, you can build a new list by splitting each entry of search_for_these_values into words and excluding the years with a list comprehension, and then use isin with any:

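# list comprehension: split each search string into words and drop the years
import re
s = [word for cars in search_for_these_values for word in cars.split() if not re.search(r'\d{4}', word)]

# Assign True / False (axis must be passed by keyword in recent pandas versions)
df['Match'] = df['Brand'].str.split(expand=True).isin(s).any(axis=1)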

Prints back:

            Brand  Price Liscence Plate  Match
0     Honda Civic  22000        ABC 123   True
1  Toyota Corolla  25000        XYZ 789  False
2      Ford Focus  27000        CBA 321   True
3         Audi A4  35000        ZYX 987   True
4             NaN  29000        DEF 456  False
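To make the effect of dropping the years visible, here is a self-contained sketch of the same idea. Several things in it are my own assumptions rather than part of the question: the extra "Kia 2019" row is hypothetical, the helper `has_allowed_word` is an illustrative name, and I use `re.fullmatch` so that only standalone four-digit words are treated as years. It also swaps `split(expand=True).isin(...)` for a plain set lookup per row, which is just one possible alternative.

```python
import re

import numpy as np
import pandas as pd

# Question's Brand column plus one hypothetical row ("Kia 2019") to show the year filter.
df = pd.DataFrame(
    {"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4", "Kia 2019", np.nan]}
)
search_for_these_values = ["Honda", "Toy", "Ford Focus", "Audi A4 2019"]

# Keep every word except standalone four-digit numbers (assumed to be years).
allowed = {
    word
    for value in search_for_these_values
    for word in value.split()
    if not re.fullmatch(r"\d{4}", word)
}

def has_allowed_word(brand):
    # Missing values are not strings, so they never match.
    return isinstance(brand, str) and not allowed.isdisjoint(brand.split())

df["Match"] = df["Brand"].map(has_allowed_word)
print(df)
```

With this filter, "Kia 2019" stays False even though "2019" appears in one of the search strings, while "Audi A4" is still matched through "Audi" and "A4".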
