Extract rows based on conditions Pandas Python

Question

I need to extract rows that satisfy all of the following conditions:

The column col1 should contain all the words in list_words.
The last word should be Story.
The last word in the next row should be ac.

This is my current code:

import pandas as pd

df = pd.DataFrame({'col1': ['Draft SW Quality Assurance Story', 'alex ac', 'anny ac', 'antoine ac','aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story', 'celine ac','wqas epic', 'karmen ac', 'kameilia ac', 'Update SW Quality Assurance Plan Story', 'joseph ac','Update SW Quality Assurance Plan ac', 'joseph ac'],
                   'col2': ['aa', 'bb', 'cc', 'dd','ee', 'ff', 'gg', 'hh', 'ii', 'jj', 'kk', 'll', 'mm', 'nn', 'oo']}) 
print(df)

list_words="SW Quality Plan Story"
set_words = set(list_words.split())
#check if list_words is in the cell
df['TrueFalse']=pd.concat([df.col1.str.contains(word,regex=False) for word in list_words.split()],axis=1).sum(1) > 1 

print('\n',df)
#extract last word
df["Suffix"] = df["col1"].str.split().str[-1]
print('\n',df)
df['ok']=''
for i in range (len(df)-1):
    if ((df["Suffix"].iloc[i]=='Story') & (df["TrueFalse"].iloc[i]=='True') & (df["Suffix"].iloc[i+1]=='ac')):
        df['ok'].iloc[i+1]=df["Suffix"].iloc[i+1]

print('\n',df)

output:

                                         col1 col2  TrueFalse Suffix ok
0           Draft SW Quality Assurance Story   aa       True  Story   
1                                    alex ac   bb      False     ac   
2                                    anny ac   cc      False     ac   
3                                 antoine ac   dd      False     ac   
4                                   aze epic   ee      False   epic   
5                                   bella ac   ff      False     ac   
6   Complete SW Quality Assurance Plan Story   gg       True  Story   
7                                  celine ac   hh      False     ac   
8                                  wqas epic   ii      False   epic   
9                                  karmen ac   jj      False     ac   
10                               kameilia ac   kk      False     ac   
11    Update SW Quality Assurance Plan Story   ll       True  Story   
12                                 joseph ac   mm      False     ac   
13       Update SW Quality Assurance Plan ac   nn       True     ac   
14                                 joseph ac   oo      False     ac

Row 13 should be set to False.

desired output :

                                        col1 col2  TrueFalse Suffix
1   Complete SW Quality Assurance Plan Story   gg       True  Story
2                                  celine ac   hh       True     ac
3     Update SW Quality Assurance Plan Story   ll       True  Story
4                                  joseph ac   mm       True     ac
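
Two details in the code above may explain the unexpected output: `sum(1) > 1` only requires more than one word to be present (not all of them), and comparing a boolean with the string 'True' is always False, so the loop never fills the ok column. A minimal sketch with illustrative variable names (`hits`, `more_than_one`, `all_present` and `flag` are not in the original code):

```python
import pandas as pd

s = pd.Series(["Update SW Quality Assurance Plan Story",
               "Draft SW Quality Assurance Story"])
words = "SW Quality Plan Story".split()

# One boolean column per word: does the cell contain that word as a substring?
hits = pd.concat([s.str.contains(w, regex=False) for w in words], axis=1)

# "More than one word present" holds for both rows...
more_than_one = hits.sum(axis=1) > 1
# ...whereas "every word present" holds only for the first row ('Plan' is missing in the second).
all_present = hits.all(axis=1)

# A boolean is never equal to the string 'True'.
flag = bool(all_present.iloc[0])
print(flag, flag == 'True')
```

Using `.all(axis=1)` and testing the boolean directly (no `== 'True'`) addresses both points.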

arnaud · Accepted Answer · 2020-04-24 14:22:22Z

1

Here are all of your conditions, computed separately, followed by their intersection:

# Condition 1: every word in set_words must appear in col1
# (set_words minus the words in col1 must be empty)
df["condition_1"] = df.col1.apply(lambda x: not bool(set_words - set(x.split())))

# Condition 2: the last word should be 'Story'
df["condition_2"] = df.col1.str.endswith("Story")

# Condition 3: the last word in the next row should be 'ac'. See `shift(-1)`;
# fill_value=False keeps the column boolean for the last row, which has no next row
df["condition_3"] = df.col1.str.endswith("ac").shift(-1, fill_value=False)

# When all three conditions are met: new column 'conditions'
df["conditions"] = df.condition_1 & df.condition_2 & df.condition_3

# Back to your notation:
# TrueFalse: rows that fulfill all three conditions, along with their next rows
df["TrueFalse"] = df.conditions | df.conditions.shift(1, fill_value=False)
df["Suffix"] = df.col1.apply(lambda x: x.split()[-1])

Now your desired output:

>>> print(df[["col1", "col2", "TrueFalse", "Suffix"]][df.TrueFalse])
                                        col1 col2  TrueFalse Suffix
6   Complete SW Quality Assurance Plan Story   gg       True  Story
7                                  celine ac   hh       True     ac
11    Update SW Quality Assurance Plan Story   ll       True  Story
12                                 joseph ac   mm       True     ac
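
For readers unfamiliar with `shift`, which is used for condition_3 and TrueFalse above, here is a small sketch on made-up data (the series `ends_ac` is purely illustrative). Without `fill_value`, the vacated row holds NaN instead of a boolean.

```python
import pandas as pd

ends_ac = pd.Series([False, True, False, True])

# shift(-1) moves every value up one row: row i now holds the value of row i + 1.
next_ends_ac = ends_ac.shift(-1, fill_value=False)
# shift(1) moves every value down one row: row i now holds the value of row i - 1.
prev_ends_ac = ends_ac.shift(1, fill_value=False)

print(next_ends_ac.tolist())  # [True, False, True, False]
print(prev_ends_ac.tolist())  # [False, False, True, False]
```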

FYI, the full dataframe:

>>> print(df[["col1", "col2", "TrueFalse", "Suffix"]])                                                                                                       
                                        col1 col2  TrueFalse Suffix
0           Draft SW Quality Assurance Story   aa      False  Story
1                                    alex ac   bb      False     ac
2                                    anny ac   cc      False     ac
3                                 antoine ac   dd      False     ac
4                                   aze epic   ee      False   epic
5                                   bella ac   ff      False     ac
6   Complete SW Quality Assurance Plan Story   gg       True  Story
7                                  celine ac   hh       True     ac
8                                  wqas epic   ii      False   epic
9                                  karmen ac   jj      False     ac
10                               kameilia ac   kk      False     ac
11    Update SW Quality Assurance Plan Story   ll       True  Story
12                                 joseph ac   mm       True     ac
13       Update SW Quality Assurance Plan ac   nn      False     ac
14                                 joseph ac   oo      False     ac
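
The same logic can be wrapped into a single mask without helper columns. This is only a sketch under assumptions: the function name `story_ac_mask` and its parameters are hypothetical, and it compares whole words (tokens) rather than string suffixes.

```python
import pandas as pd

def story_ac_mask(col, required_words, head="Story", tail="ac"):
    """Flag rows that contain all required words, end with `head`, and are
    followed by a row ending with `tail`; also flag that following row."""
    tokens = col.str.split()
    last = tokens.str[-1]
    # Whole-word check: every required word must be one of the row's tokens.
    has_all = tokens.apply(lambda t: set(required_words) <= set(t))
    # A "start" row meets all three conditions; the last row has no next row,
    # so its shifted value is NaN and the comparison yields False.
    start = has_all & (last == head) & (last.shift(-1) == tail)
    return start | start.shift(1, fill_value=False)

col1 = pd.Series([
    'Draft SW Quality Assurance Story', 'alex ac', 'anny ac', 'antoine ac',
    'aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story',
    'celine ac', 'wqas epic', 'karmen ac', 'kameilia ac',
    'Update SW Quality Assurance Plan Story', 'joseph ac',
    'Update SW Quality Assurance Plan ac', 'joseph ac'])

mask = story_ac_mask(col1, "SW Quality Plan Story".split())
print(col1[mask].index.tolist())  # [6, 7, 11, 12]
```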

edited Apr 24, 2020 at 14:22

answered Apr 24, 2020 at 13:16


8 Comments

newbie Over a year ago

In row 7, shouldn't condition_3 be set to True?

arnaud Over a year ago

At row 7, celine ac is followed by wqas epic (row 8), which ends with epic, not ac, so condition 3 must be False. Or did I misunderstand your third condition?

newbie Over a year ago

The third condition should be true only if Story is followed by ac.

arnaud Over a year ago

No, from your question, the third condition is that the last word in the next row is ac. What you suggest here (Story followed by ac) is the second and third conditions together: df.condition_2 & df.condition_3, which would indeed keep only indices 0, 6 and 11.

arnaud Over a year ago

@A.khou sorry but I'll be honest here: I did my best to answer all of your updates already, regardless of how confusing the question was in the first place. It works: you could accept the answer. If your required output is now different, it's worth another question. Otherwise this is never ending. Thanks for your understanding.


jawsem · Accepted Answer · 2020-04-24 14:30:31Z

1

Here is one way you can accomplish this.

Use pd.concat together with .all(axis=1) to check that all the words are present.

Check that the same row ends with Story.

Check that the next row (df['col1'].shift(-1)) ends with ac.

Edit: After reading some comments, it appears that you also want the following row that ends in ac to be True.

I added code at the end to handle this condition.

import pandas as pd

df = pd.DataFrame({'col1': ['Draft SW Quality Assurance Story', 'alex ac', 'anny ac', 'antoine ac','aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story', 'celine ac','wqas epic', 'karmen ac', 'kameilia ac', 'Update SW Quality Assurance Plan Story', 'joseph ac','Update SW Quality Assurance Plan ac', 'joseph ac'],
                   'col2': ['aa', 'bb', 'cc', 'dd','ee', 'ff', 'gg', 'hh', 'ii', 'jj', 'kk', 'll', 'mm', 'nn', 'oo']}) 
print(df)

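list_words = "SW Quality Plan Story"
set_words = set(list_words.split())

# Check that every word in set_words appears in the cell
has_all_words = pd.concat(
    [df['col1'].str.contains(word, regex=False) for word in set_words], axis=1
).all(axis=1)
# The row itself must end with 'Story' and the next row with 'ac'
# (na=False handles the NaN that shift(-1) leaves in the last row)
ends_with_story = df['col1'].str.endswith('Story')
next_ends_with_ac = df['col1'].shift(-1).str.endswith('ac', na=False)
df['TrueFalse'] = has_all_words & ends_with_story & next_ends_with_ac

# Also flag the row that directly follows a row meeting the conditions
df['AC_COL'] = df['TrueFalse'].shift(1, fill_value=False)
df['Final_TrueFalse'] = df['TrueFalse'] | df['AC_COL']
print(df[['col1', 'col2', 'Final_TrueFalse']])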

                                        col1 col2  Final_TrueFalse
0           Draft SW Quality Assurance Story   aa            False
1                                    alex ac   bb            False
2                                    anny ac   cc            False
3                                 antoine ac   dd            False
4                                   aze epic   ee            False
5                                   bella ac   ff            False
6   Complete SW Quality Assurance Plan Story   gg             True
7                                  celine ac   hh             True
8                                  wqas epic   ii            False
9                                  karmen ac   jj            False
10                               kameilia ac   kk            False
11    Update SW Quality Assurance Plan Story   ll             True
12                                 joseph ac   mm             True
13       Update SW Quality Assurance Plan ac   nn            False
14                                 joseph ac   oo            False
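
One caveat worth keeping in mind: `str.contains` and `str.endswith` test characters, not whole words. That makes no difference for the sample data above, but it can on other inputs. A small sketch on made-up strings (the series `s` and the variable names are illustrative only):

```python
import pandas as pd

s = pd.Series(["Update SW Quality Planning Story", "sort zodiac"])

# Substring test: 'Plan' is found inside 'Planning'.
substring_hit = s.str.contains("Plan", regex=False)
# Whole-word test: split into tokens and test membership.
word_hit = s.str.split().apply(lambda tokens: "Plan" in tokens)

# Suffix test on the raw string: 'zodiac' ends with the characters 'ac'.
suffix_hit = s.str.endswith("ac")
# Last-word test: compare the final token with 'ac'.
last_word_hit = s.str.split().str[-1] == "ac"

print(substring_hit.tolist(), word_hit.tolist())   # [True, False] [False, False]
print(suffix_hit.tolist(), last_word_hit.tolist())  # [False, True] [False, False]
```

If whole-word semantics are required, the token-based checks are the safer choice.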

edited Apr 24, 2020 at 14:30

answered Apr 24, 2020 at 13:17


5 Comments

newbie Over a year ago

But shouldn't the 7th and 12th lines be True?

arnaud Over a year ago

How could they be true when they don't contain all words in set_words?

newbie Over a year ago

It seems I didn't explain it well in the question, but the conditions are combined: the first row should contain all the words in set_words, and the last word in the next row should be ac.

arnaud Over a year ago

How would that work for the 7th line? It does not contain all the words in set_words, and the 8th line's last word is not ac.

newbie Over a year ago

The 6th line ends with Story and the 7th line ends with ac.
