I need to extract rows based on 3 conditions:

  1. the column col1 should contain all the words in the list list_words.

  2. the first row should end with the word Story

  3. the next rows should end with ac

I've managed to make it work with the help of the question "Extract rows based on conditions Pandas Python", but the problem is that I need to extract every row that ends with Story together with all the rows after it that end with ac. This is my current code:

import pandas as pd

df = pd.DataFrame({'col1': ['Draft SW Quality Assurance Plan Story', 'alex ac', 'anny ac', 'antoine ac','aze epic', 'bella ac', 'Complete SW Quality Assurance Plan Story', 'celine ac','wqas epic', 'karmen ac', 'kameilia ac', 'Update SW Quality Assurance Plan Story', 'joseph ac','Update SW Quality Assurance Plan ac', 'joseph ac'],
                   'col2': ['aa', 'bb', 'cc', 'dd','ee', 'ff', 'gg', 'hh', 'ii', 'jj', 'kk', 'll', 'mm', 'nn', 'oo']}) 
print(df)

list_words="SW Quality Plan Story"
set_words = set(list_words.split())

df["Suffix"] = df.col1.apply(lambda x: x.split()[-1]) 


# Condition 1: every word in set_words must appear in col1 (set_words minus the words in col1 must be empty)
df["condition_1"] = df.col1.apply(lambda x: not bool(set_words - set(x.split())))

# Condition 2: the last word should be 'Story'
df["condition_2"] = df.col1.str.endswith("Story") 

# Condition 3: the last word in the next row should be ac. See `shift(-1)`
df["condition_3"] = df.col1.str.endswith("ac").shift(-1) 

# Condition 4: the last word in the current row should be ac
df["condition_4"] = df.col1.str.endswith("ac")

# When all three conditions are met: new column 'conditions'
df["conditions"] = df.condition_1 & df.condition_2 & df.condition_3

df["conditions&"] = df.conditions | df.conditions.shift(1)

print(df[['condition_1', 'condition_2','condition_3' ,'condition_4']])

df.to_excel('cond.xlsx', sheet_name='Sheet1', index=True)

df["TrueFalse"] = df.conditions | df.conditions.shift(1)                                                                                         

df1=df[["col1", "col2", "TrueFalse", "Suffix"]][df.TrueFalse]
print(df1)

This is my output:

0      Draft SW Quality Assurance Plan Story   aa       True  Story
1                                    alex ac   bb       True     ac
6   Complete SW Quality Assurance Plan Story   gg       True  Story
7                                  celine ac   hh       True     ac
11    Update SW Quality Assurance Plan Story   ll       True  Story
12                                 joseph ac   mm       True     ac

This is the desired output:

0      Draft SW Quality Assurance Plan Story   aa       True  Story
1                                    alex ac   bb       True     ac
2                                    anny ac   cc       True     ac
3                                 antoine ac   dd       True     ac
6   Complete SW Quality Assurance Plan Story   gg       True  Story
7                                  celine ac   hh       True     ac
11    Update SW Quality Assurance Plan Story   ll       True  Story
12                                 joseph ac   mm       True     ac
13       Update SW Quality Assurance Plan ac   nn       True     ac
14                                 joseph ac   oo       True     ac

I need to extract all the rows that end with ac after each row that ends with Story (the 2nd and 3rd rows included), not just the first one. Is it doable?

  • Because row 13 doesn't have all the words in list_words (it ends with ac instead of Story), but you're right, it should be in my desired output. Commented Apr 27, 2020 at 14:36

1 Answer

You can do it by creating one column that is True where a row meets both conditions: it ends with Story and contains all the words. Create another column that is True where the row ends with ac. Then group by the cumsum of the first column, so that each matching Story row starts a new group. Within each group, take any across the two columns 'gr' and 'ac', followed by cummin: once a row in the group fails both conditions, the result stays False for the rest of the group, even for later rows that end with ac. The groupby produces a mask that is True for the rows you want to keep, so use loc with this mask:

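import numpy as np

# True for rows that end with 'Story' and contain every word in set_words
df['gr'] = (df['col1'].str.endswith('Story')
            & df['col1'].apply(lambda x: not bool(set_words - set(x.split()))))
# True for rows that end with 'ac'
df['ac'] = df['col1'].str.endswith('ac')

# Each 'gr' row starts a new group; cummin turns the mask False from the
# first row in a group that is neither 'gr' nor 'ac'
df_f = df.loc[df.groupby(df['gr'].cumsum())
                .apply(lambda x: np.any(x[['gr', 'ac']], axis=1).cummin())
                .to_numpy(), ['col1', 'col2']]
print(df_f)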
                                        col1 col2
0      Draft SW Quality Assurance Plan Story   aa
1                                    alex ac   bb
2                                    anny ac   cc
3                                 antoine ac   dd
6   Complete SW Quality Assurance Plan Story   gg
7                                  celine ac   hh
11    Update SW Quality Assurance Plan Story   ll
12                                 joseph ac   mm
13       Update SW Quality Assurance Plan ac   nn
14                                 joseph ac   oo
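
The same idea can also be sketched without groupby.apply, using the group-wise cummin directly. This is only a rough variant, not a replacement for the code above: the function name select_story_blocks and the intermediate names is_story, is_ac and group_id are made up for illustration, and the extra `group_id > 0` check is an assumption on my part that rows ending with ac before the first Story row should be dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Draft SW Quality Assurance Plan Story', 'alex ac', 'anny ac',
             'antoine ac', 'aze epic', 'bella ac',
             'Complete SW Quality Assurance Plan Story', 'celine ac',
             'wqas epic', 'karmen ac', 'kameilia ac',
             'Update SW Quality Assurance Plan Story', 'joseph ac',
             'Update SW Quality Assurance Plan ac', 'joseph ac'],
    'col2': ['aa', 'bb', 'cc', 'dd', 'ee', 'ff', 'gg', 'hh', 'ii', 'jj',
             'kk', 'll', 'mm', 'nn', 'oo'],
})
set_words = set("SW Quality Plan Story".split())


def select_story_blocks(frame, words):
    """Hypothetical helper: keep each matching Story row and the
    unbroken run of 'ac' rows that directly follows it."""
    # Story rows that contain every required word
    is_story = (frame['col1'].str.endswith('Story')
                & frame['col1'].apply(lambda s: words <= set(s.split())))
    # Rows whose text ends with 'ac'
    is_ac = frame['col1'].str.endswith('ac')
    # Every matching Story row opens a new group (0 means "before any Story")
    group_id = is_story.cumsum()
    # 1 while rows keep qualifying, 0 from the first non-qualifying row on
    keep = (is_story | is_ac).astype(int)
    mask = keep.groupby(group_id).cummin().astype(bool) & (group_id > 0)
    return frame.loc[mask, ['col1', 'col2']]


result = select_story_blocks(df, set_words)
print(result)
```

On the sample data this keeps rows 0 to 3, 6, 7 and 11 to 14. Note that str.endswith('ac') tests the last two characters only, so a row ending in a longer word such as "zodiac" would also count as an ac row.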