2

I would like to check if items in a List are in a Column from my DF.

The basics where straightforward:

fruit = ['apple','banana']    # This items should be in the column 
fruit = ', '.join(fruit)      # Think this is the point where it goes wrong... 

fruit_resulst = df['all_fruit'].str.contains(fruit) # Check if column contains fruit 
df_new = df[fruit_resulst]   # Filter so that we only keep the TRUEs 

This works, but not completely. It only works in this specific order, but I would like to have it working in all orders (e.g., if a column row contains ALL items from the list, then I would like to keep them. Else, remove.

df['all_fruit']

Apple, Banana             #Return! Because it contains apple and banana
Banana                    # Do not return 
Banana, Apple             #Return! Because it contains apple and banana    
Apple                     # Do not return
Apple, Banana, Peer       #Return! Because it contains apple and banana

Thanks a lot in advance!

0

3 Answers 3

2

Convert values to lowercase, then split to lists and test issubset by convert fruit to set:

df1 = df[df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset)]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Your solution with list of boolean masks passed to np.logical_and.reduce:

df1 = df[np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])]
print (df1)
             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer
Sign up to request clarification or add additional context in comments.

5 Comments

This is perfect. Quick one (sorry to bother, hope you can help) - would it also be possible to add a column to the original DF, which shows FALSE, except when the match is True? Then it shows True?
@Roverflow - For first solution df['test'] = ~df.all_fruit.str.lower().str.split(', ').map(set(fruit).issubset), for second df['test'] = ~np.logical_and.reduce([df.all_fruit.str.contains(f, case=False) for f in fruit])
@Roverflow - So it means False, True, False, true, False ?
thanks a lot! The other way around... True, False, True, False, True :-)
@Roverflow - Then remove ~ for invert mask
1
df = pd.DataFrame({'all_fruit': [
    'Apple, Banana',
    'Banana',
    'Banana, Apple',
    'Apple',
    'Apple, Banana, Peer',
]})
fruit = ['apple','banana']
have_fruits = [df.all_fruit.str.contains(f, case=False) for f in fruit]
indexes = True
for f in have_fruits:
    indexes = indexes * f
df[indexes]

Comments

1

Try this code:

x = df['all_fruit'].str.split(',', expand=True)
print(df[x.replace('Apple', '').ne(x).any(1) & x.replace(' Banana', '').ne(x).any(1)])

Output:

             all_fruit
0        Apple, Banana
2        Banana, Apple
4  Apple, Banana, Peer

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.