How can I create a new pandas data frame from an existing data frame based on multiple partial string matches of values in one column?

For example, if I had a data frame with one column that contains the partial strings "Commercial", "Corporate" and "Private", I would like to create a new data frame with only the rows that contain the partial strings "Commercial" and "Corporate", while ignoring the rows that contain the partial string "Private".

Currently I am trying this code:

 df = df[(df['text'].str.contains("Commercial") or 
    df['text'].str.contains("Corporate") or 
    df['text'].str.contains("SME"))]

but it gives me this error:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

  • Explain your question a little more using some code; it will help us understand it much better. Commented Oct 12, 2022 at 12:33
  • Please provide enough code so others can better understand or reproduce the problem. Commented Oct 12, 2022 at 13:50
  • I've added some of the code I have tried using. Commented Oct 13, 2022 at 14:28

1 Answer

I interpreted your question as wanting to match the words 'Commercial' AND 'Corporate' AND NOT 'Private'.

data:

import pandas as pd
wantedWords = ['Commercial', 'Corporate']
notWantedWords = ['Private']
df = pd.DataFrame(['Commercial, Corporate, Private',
                   'Commercial, Corporate', 
                   'Commercial', 
                   'Corporate', 
                   'none of the words'], columns=['text'])

using regex:

reg = r'^{}'
ex = '(?=.*{})'
wantedWordMatch = reg.format(''.join(ex.format(w) for w in wantedWords))
notWantedWordMatch = reg.format(''.join(ex.format(w) for w in notWantedWords))

df['text'].str.contains(wantedWordMatch, regex=True)

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

~df['text'].str.contains(notWantedWordMatch, regex=True)

0    False
1     True
2     True
3     True
4     True
Name: text, dtype: bool

df[(df['text'].str.contains(wantedWordMatch, regex=True) & (~df['text'].str.contains(notWantedWordMatch, regex=True)))]

    text
1   Commercial, Corporate
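
For reference, the wanted pattern built above comes out as `^(?=.*Commercial)(?=.*Corporate)`: one lookahead per word, so every word has to appear somewhere in the string. Below is a minimal sketch of the same idea with two changes that are my own assumptions rather than part of the code above: `re.escape` guards against words that contain regex metacharacters, and the unwanted words are joined with `|` so that any single one of them excludes a row (the lookahead form only excludes rows containing all of them at once, which matters as soon as the list has more than one word). The names `wantedPattern`, `notWantedPattern` and `mask` are just illustrative.

```python
import re
import pandas as pd

wantedWords = ['Commercial', 'Corporate']
notWantedWords = ['Private']
df = pd.DataFrame(['Commercial, Corporate, Private',
                   'Commercial, Corporate',
                   'Commercial',
                   'Corporate',
                   'none of the words'], columns=['text'])

# One lookahead per wanted word: all of them must occur in the string.
wantedPattern = '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in wantedWords)

# Alternation: a single unwanted word is enough to exclude the row.
notWantedPattern = '|'.join(re.escape(w) for w in notWantedWords)

# Combine the two boolean Series element-wise with & and ~.
mask = (df['text'].str.contains(wantedPattern, regex=True)
        & ~df['text'].str.contains(notWantedPattern, regex=True))
result = df[mask]
print(result)
```

On the sample data this should keep only the 'Commercial, Corporate' row, the same as the filter above.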

using all()/any():

df.text.apply(lambda string: all(word in string for word in wantedWords))

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

df.text.apply(lambda string: not any(word in string for word in notWantedWords))

0    False
1     True
2     True
3     True
4     True
Name: text, dtype: bool

df[df['text'].apply(lambda string: all(word in string for word in wantedWords) and not any(word in string for word in notWantedWords))]

    text
1   Commercial, Corporate
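
If you actually meant rows containing any of the words (which is what the `or` in your attempt suggests), the error comes from `or` itself: it asks Python for a single truth value of a whole Series. pandas uses the element-wise operators `|`, `&` and `~` instead. A minimal sketch, assuming the same sample data as above plus the "SME" word from your attempt; `anyOfWords`, `maskOr` and `maskRegex` are illustrative names:

```python
import re
import pandas as pd

df = pd.DataFrame(['Commercial, Corporate, Private',
                   'Commercial, Corporate',
                   'Commercial',
                   'Corporate',
                   'none of the words'], columns=['text'])
anyOfWords = ['Commercial', 'Corporate', 'SME']

# Option 1: your original attempt with `or` replaced by the element-wise `|`.
maskOr = (df['text'].str.contains('Commercial', regex=False)
          | df['text'].str.contains('Corporate', regex=False)
          | df['text'].str.contains('SME', regex=False))

# Option 2: a single regex alternation built from the word list.
maskRegex = df['text'].str.contains('|'.join(re.escape(w) for w in anyOfWords),
                                    regex=True)

# Keep rows matching any wanted word, then drop rows that mention 'Private'.
result = df[maskRegex & ~df['text'].str.contains('Private', regex=False)]
print(result)
```

Both masks select the same rows; the alternation version is easier to extend when the word list grows.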

