Creating a new data frame from existing data frame based on multiple partial strings

Question

How can I create a new pandas data frame from an existing data frame based on multiple partial string matches of values in one column?

For example, if I had a data frame with one column that contains the partial strings "Commercial", "Corporate", and "Private", I would like to create a new data frame with only the rows that contain the partial strings "Commercial" and "Corporate", while ignoring the rows that contain the partial string "Private".

Currently I am trying the code

 df = df[(df['text'].str.contains("Commercial") or 
    df['text'].str.contains("Corporate") or 
    df['text'].str.contains("SME"))]

but it gives me an error of

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Please explain your question a little more using some code; it will help us understand it much better.
– Mehmaam, Commented Oct 12, 2022 at 12:33
Please provide enough code so others can better understand or reproduce the problem.
– Community Bot, Commented Oct 12, 2022 at 13:50

misterhuge · Accepted Answer · 2022-10-12 14:07:58Z

I interpreted your question as wanting to match the words 'Commercial' AND 'Corporate' AND NOT 'Private'.

data:

import pandas as pd
wantedWords = ['Commercial', 'Corporate']
notWantedWords = ['Private']
df = pd.DataFrame(['Commercial, Corporate, Private',
                   'Commercial, Corporate', 
                   'Commercial', 
                   'Corporate', 
                   'none of the words'], columns=['text'])

using regex:

reg = r'^{}'
ex = '(?=.*{})'
wantedWordMatch = reg.format(''.join(ex.format(w) for w in wantedWords))
notWantedWordMatch = reg.format(''.join(ex.format(w) for w in notWantedWords))

df['text'].str.contains(wantedWordMatch, regex=True)

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

~df['text'].str.contains(notWantedWordMatch, regex=True)

0    False
1     True
2     True
3     True
4     True
Name: text, dtype: bool

df[(df['text'].str.contains(wantedWordMatch, regex=True) & (~df['text'].str.contains(notWantedWordMatch, regex=True)))]

    text
1   Commercial, Corporate

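One caveat with the regex approach: the words are inserted into the pattern unescaped, so a word containing a regex metacharacter such as '.' or '+' would be interpreted as regex syntax. A minimal sketch of a safer builder follows, assuming a hypothetical helper named all_words_pattern and made-up sample rows (neither is part of the original answer); it uses re.escape from the standard library.

```python
import re
import pandas as pd

def all_words_pattern(words):
    # Hypothetical helper: one lookahead per word, so every word must appear.
    # re.escape keeps characters such as '.' or '+' literal.
    return '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

# Made-up sample data for illustration only
df = pd.DataFrame({'text': ['Commercial, Corporate, Private',
                            'Commercial, Corporate',
                            'Commercial',
                            'A.B Corporate',
                            'AxB Corporate']})

pattern = all_words_pattern(['A.B', 'Corporate'])
mask = df['text'].str.contains(pattern, regex=True)

# Only the row with a literal 'A.B' matches; 'AxB' does not,
# because the dot is escaped.
print(df[mask])
```

Without the escaping, 'A.B' would also match 'AxB', since an unescaped dot matches any character.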
using all()/any():

df.text.apply(lambda string: all(word in string for word in wantedWords))

0     True
1     True
2    False
3    False
4    False
Name: text, dtype: bool

# all(), not any(): a row passes only if none of the unwanted words appear in it
df.text.apply(lambda string: all(word not in string for word in notWantedWords))

0    False
1     True
2     True
3     True
4     True
Name: text, dtype: bool

df[df['text'].apply(lambda string: all(word in string for word in wantedWords) and all(word not in string for word in notWantedWords))]

    text
1   Commercial, Corporate
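As for the error in the question itself: Python's `or` tries to reduce each Series to a single True/False value, which is what raises "The truth value of a Series is ambiguous". For element-wise logic, pandas uses `|` (or), `&` (and) and `~` (not), with each condition wrapped in parentheses. Here is a minimal sketch under the assumption that you want rows matching any of the words from your snippet while dropping rows that mention 'Private'; the variable names mask_or, mask_regex and mask_private are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'text': ['Commercial, Corporate, Private',
                            'Commercial, Corporate',
                            'Commercial',
                            'Corporate',
                            'none of the words']})

# Element-wise OR: combine the boolean Series with | instead of `or`
mask_or = (df['text'].str.contains('Commercial')
           | df['text'].str.contains('Corporate')
           | df['text'].str.contains('SME'))

# Equivalent single call: a regex alternation built from the word list
words = ['Commercial', 'Corporate', 'SME']
mask_regex = df['text'].str.contains('|'.join(words), regex=True)

# Exclude rows that mention 'Private' using ~ (element-wise NOT)
mask_private = df['text'].str.contains('Private')
result = df[mask_or & ~mask_private]
print(result)
```

With this sample data, rows 1, 2 and 3 are kept. If you need all of the words to be present rather than any of them, swap `|` for `&` or use one of the approaches above.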
