Check if Multiple Strings are present in a DataFrame Column

Question

I would like to check whether the items in a list are present in a column of my DataFrame.

The basics were straightforward:

fruit = ['apple','banana']    # These items should be in the column
fruit = ', '.join(fruit)      # I think this is the point where it goes wrong...

fruit_results = df['all_fruit'].str.contains(fruit) # Check if the column contains fruit
df_new = df[fruit_results]   # Filter so that we only keep the Trues

This works, but not completely. It only works when the items appear in this specific order, but I would like it to work in any order: if a row of the column contains ALL items from the list, I would like to keep it; otherwise, remove it.

df['all_fruit']

Apple, Banana             #Return! Because it contains apple and banana
Banana                    # Do not return 
Banana, Apple             #Return! Because it contains apple and banana    
Apple                     # Do not return
Apple, Banana, Peer       #Return! Because it contains apple and banana

Thanks a lot in advance!

jezrael · Accepted Answer · 2021-08-16 10:26:47Z

2

Keep fruit as the original list (skip the ', '.join step), convert the values to lowercase, split them into lists, and test each list with issubset after converting fruit to a set:

df1 = df[df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer
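For reference, the same idea can be written as a self-contained sketch. The helper name contains_all is hypothetical (it is not part of the answer above), and stripping whitespace around each token is an added assumption so that spacing differences after the commas do not matter:

```python
import pandas as pd

df = pd.DataFrame({"all_fruit": [
    "Apple, Banana",
    "Banana",
    "Banana, Apple",
    "Apple",
    "Apple, Banana, Peer",
]})
fruit = ["apple", "banana"]  # keep this as a list, do not join it


def contains_all(cell, wanted):
    """Hypothetical helper: True if every item in `wanted` is one of the
    comma-separated tokens in `cell` (case-insensitive, whitespace ignored)."""
    tokens = {token.strip().lower() for token in cell.split(",")}
    return set(wanted).issubset(tokens)


mask = df["all_fruit"].map(lambda cell: contains_all(cell, fruit))
df1 = df[mask]
print(df1)
```

Because the comparison is done on sets, the order of the items in each row does not matter.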

Your solution, with a list of boolean masks passed to np.logical_and.reduce:

df1 = df[np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer
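One difference between the two approaches may be worth keeping in mind: str.contains does substring matching (and treats the pattern as a regular expression by default), while the split/issubset version compares whole items. The small sketch below uses a made-up row, 'Pineapple, Banana', which is not in the original data, to illustrate this:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row contains "apple" only as part of "Pineapple"
df = pd.DataFrame({"all_fruit": ["Pineapple, Banana", "Apple, Banana"]})
fruit = ["apple", "banana"]

# Substring approach: "apple" is found inside "Pineapple"
substring_mask = np.logical_and.reduce(
    [df["all_fruit"].str.contains(f, case=False) for f in fruit]
)

# Whole-item approach: each comma-separated item must equal a wanted item
token_mask = df["all_fruit"].str.lower().str.split(", ").map(set(fruit).issubset)

print(substring_mask)
print(token_mask)
```

If partial matches like this are not wanted, the set-based version is probably the safer choice.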

edited Aug 16, 2021 at 10:26

answered Aug 16, 2021 at 10:18

jezrael


5 Comments

R overflow Over a year ago

This is perfect. Quick one (sorry to bother, hope you can help) - would it also be possible to add a column to the original DF, which shows FALSE, except when the match is True? Then it shows True?

jezrael Over a year ago

@Roverflow - For first solution df['test'] = ~df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset), for second df['test'] = ~np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])

jezrael Over a year ago

@Roverflow - So it means False, True, False, true, False ?

R overflow Over a year ago

thanks a lot! The other way around... True, False, True, False, True :-)

jezrael Over a year ago

@Roverflow - Then remove ~ for invert mask
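Putting the comment thread together, a minimal sketch of the flag column without the ~ inversion might look like this. The column name test is taken from the comment above, and the sample DataFrame is the one from the question:

```python
import pandas as pd

df = pd.DataFrame({"all_fruit": [
    "Apple, Banana",
    "Banana",
    "Banana, Apple",
    "Apple",
    "Apple, Banana, Peer",
]})
fruit = ["apple", "banana"]

# True where the row contains every item in `fruit`, False otherwise
df["test"] = df["all_fruit"].str.lower().str.split(", ").map(set(fruit).issubset)
print(df)
```

The original rows are all kept; only the new column tells the matches apart.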

Behzad Shayegh · Accepted Answer · 2021-08-16 10:21:22Z

1

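import pandas as pd

df = pd.DataFrame({'all_fruit': [
    'Apple, Banana',
    'Banana',
    'Banana, Apple',
    'Apple',
    'Apple, Banana, Peer',
]})
fruit = ['apple','banana']
# One case-insensitive boolean mask per fruit
have_fruits = [df.all_fruit.str.contains(f, case=False) for f in fruit]
# Combine the masks with logical AND so a row must contain every fruit
indexes = pd.Series(True, index=df.index)
for f in have_fruits:
    indexes = indexes & f
df[indexes]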

answered Aug 16, 2021 at 10:21

Behzad Shayegh


U13-Forward · Accepted Answer · 2021-08-16 10:26:27Z

1

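Try this code:

# Split on ', ' so that no item keeps a leading space
x = df['all_fruit'].str.split(', ', expand=True)
# Keep rows where some cell equals 'Apple' and some cell equals 'Banana'
print(df[x.eq('Apple').any(axis=1) & x.eq('Banana').any(axis=1)])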

Output:

             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

answered Aug 16, 2021 at 10:26

U13-Forward

