python regex extract based on a specific substring

Question

I have a dataframe containing sentences like the following but with more rows:

data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

I would like to extract the sentences containing 'five minutes' in the manner presented below:

desired output:

     first part              desired part     
0    see you in              five minutes.
1    NaN                     NaN
2    she goes to school in   five minutes.

I am using the following code, but it returns NaN for every row:

data.text.str.extract(r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+")

Jan · Accepted Answer · 2020-06-25 08:00:47Z


Your lookahead requires a whitespace character after "minutes", but in your data "minutes" is followed by a period, so nothing matches:

(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+
#                                              ^^^
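To see the effect of that trailing \s in isolation, here is a minimal sketch using only the standard re module. The second sample string is made up for illustration and is not from the question's data.

```python
import re

# Lookahead as written in the question: demands whitespace right after "minutes".
strict = re.compile(r"(?i)(?=five minutes\s)")
# Same lookahead without the whitespace requirement.
relaxed = re.compile(r"(?i)(?=five minutes)")

samples = [
    "see you in five minutes.",       # from the question: "minutes" is followed by "."
    "see you in five minutes or so",  # hypothetical: "minutes" is followed by a space
]

# For each sample, record whether each lookahead can match anywhere in it.
results = [(bool(strict.search(t)), bool(relaxed.search(t))) for t in samples]

for text, (s, r) in zip(samples, results):
    print(f"{text!r}: strict={s}, relaxed={r}")
```

The strict version only succeeds on the made-up string, which is why every row of the original data comes back as NaN.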

Either make that whitespace optional with the star quantifier (\s*, zero or more times) or rethink your expression. The following works:

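import pandas as pd

data = {"text": ["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

df = pd.DataFrame(data)
# before: lazily match everything up to "five minutes" (lookahead, not consumed)
# after:  the rest of the string, starting at "five minutes"
df2 = df.text.str.extract(r"(?i)(?P<before>.*?)(?=five minutes)(?P<after>.*)")
print(df2)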

This yields (note that before keeps its trailing space):

                   before          after
0             see you in   five minutes.
1                     NaN            NaN
2  she goes to school in   five minutes.
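If the trailing space in before is unwanted, one possible variant (a sketch, not part of the answer as posted) is to consume the whitespace outside both groups and match the phrase inside the after group directly. The names pattern and out are chosen here for illustration.

```python
import pandas as pd

data = {"text": ["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}
df = pd.DataFrame(data)

# \s* sits between the two named groups, so the whitespace before
# "five minutes" is matched but not captured in either column.
pattern = r"(?i)(?P<before>.*?)\s*(?P<after>five minutes.*)"
out = df["text"].str.extract(pattern)
print(out)
```

Rows without "five minutes" still come back as NaN in both columns, since str.extract leaves non-matching rows empty.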

edited Jun 25, 2020 at 8:00

answered Jun 25, 2020 at 7:54

Jan
