How can I put multiple conditions for detecting a pattern in pandas using regex

Question

I have a DataFrame like this:

text

Is it possible to apply [NUM] times
Is it possible to apply [NUM] time
Called [NUM] hour ago
waited [NUM] hours
waiting [NUM] minute
waiting [NUM] minutes???
Are you kidding me !
Waiting?

I want to detect rows that contain "[NUM] time", "[NUM] times", "[NUM] minute", "[NUM] minutes", "[NUM] hour", or "[NUM] hours". A row should also match if it contains "!" (one or more) or "??" (at least two question marks).

So the result would look like this:

text                                   available

Is it possible to apply [NUM] times    True
Is it possible to apply [NUM] time     True
Called [NUM] hour ago                  True
waited [NUM] hours                     True
waiting [NUM] minute                   True
waiting [NUM] minutes???               True
Are you kidding me !                   True
Waiting?                               False
I didn't like it                       False

So I want something like this, but I don't know how to put all these conditions together:

df["available"] = df['text'].apply(lambda x: re.match(r'[\!* | \?+ | [NUM] time | [NUM] hour | [NUM] minute]')

Wiktor Stribiżew · Accepted Answer · 2021-09-27 15:58:14Z

You can use Series.str.contains with a regex:

import pandas as pd

df = pd.DataFrame({'text': [
    "Is it possible to apply [NUM] times",
    "Is it possible to apply [NUM] time",
    "Called [NUM] hour ago",
    "waited [NUM] hours",
    "waiting [NUM] minute",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",
    "I didn't like it",
]})
# Flag rows containing "[NUM] hour(s)/minute(s)/time(s)", a "!", or "??"
df['available'] = df['text'].str.contains(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}', regex=True)
# => df['available']
#     0     True
#     1     True
#     2     True
#     3     True
#     4     True
#     5     True
#     6     True
#     7    False
#     8    False

See the regex demo. Details:

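\[NUM] - the literal string [NUM] (the opening bracket is escaped so it does not start a character class)
\s* - zero or more whitespace characters
(?:hour|minute|time) - a non-capturing group matching hour, minute, or time
s? - an optional s
\b - a word boundary
| - or
! - a ! character (str.contains looks anywhere in the string, so one ! also covers a run of them)
| - or
\?{2} - two consecutive question marks (so three or more also match).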
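
The breakdown above can be checked on plain strings with the standard re module. This is only a minimal sketch: the names PATTERN, samples, and flags are illustrative, and the "[NUM] timeshare" string is a made-up input added to show what the word boundary does.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused.
PATTERN = re.compile(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}')

samples = [
    "Is it possible to apply [NUM] times",
    "Called [NUM] hour ago",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",              # a single ? is not enough
    "I didn't like it",
    "[NUM] timeshare",       # hypothetical input: \b rejects "time" inside a longer word
]

# search() scans the whole string, which mirrors what str.contains does.
flags = [bool(PATTERN.search(s)) for s in samples]

for text, flag in zip(samples, flags):
    print(f"{text!r:40} {flag}")
```

Only the first four samples are flagged. Dropping \b from the pattern would make the last one match as well.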

5 Comments

sariii Over a year ago

Do you happen to know why the same code does not work on a plain string rather than on text inside a DataFrame? With text="Been on hold for [NUM] minutes at that number, AFTER it wouldn't let me cancel the reservation." and available = re.match('r\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?\b|!{2}|\?{2}', text), available is None. Sorry for not opening a new question; I'm new to regex and thought this might be an easy one, and I have already received many negative points :((((

Wiktor Stribiżew Over a year ago

@sariii Correct, you would get lots of downvotes on such a question. The answer is "use re.search". See What is the difference between re.search and re.match?
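
For readers who do not follow the link, here is a small sketch of the difference, using a shortened version of the commenter's text. The variable names are illustrative. re.match only tries to match at the very start of the string, while re.search scans the whole string.

```python
import re

text = "Been on hold for [NUM] minutes at that number"
pattern = r'\[NUM]\s*(?:hour|minute|time)s?\b'

# match() anchors at position 0, and the text starts with "Been", not "[NUM]".
at_start = re.match(pattern, text)

# search() looks for the first position where the pattern matches.
anywhere = re.search(pattern, text)

print(at_start)          # None
print(anywhere.group())  # [NUM] minutes
```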

sariii Over a year ago

Yeah, I figured that if I posted that question my score would drop to -1000 :))). Thanks for sharing the link. However, neither search nor match returns any reasonable output; both returned None. As I understood it, the difference between them only matters when a newline or ^ is involved. My case is just one line, so I think both match and search should be able to do the job. Am I missing something here?

Wiktor Stribiżew Over a year ago

@sariii Your regex works, you just made a typo by moving the raw string literal r prefix into the string literal itself. See this Python demo.
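
A sketch of what that typo does, with illustrative names (broken, fixed) and a shortened pattern. With the r inside the quotes, the string is no longer raw: the pattern starts with a literal letter r, and \b becomes a backspace character instead of a word boundary. The broken pattern is spelled out below with explicit escapes so that it has the same characters the mistyped literal would produce.

```python
import re

text = "Been on hold for [NUM] minutes at that number"

# Equivalent to the mistyped 'r\[NUM]\s*...\b' literal: a leading letter "r",
# and "\b" interpreted by Python as a backspace (\x08) before re ever sees it.
broken = 'r\\[NUM]\\s*(?:hour|minute|time)s?\b'

# The r prefix belongs outside the quotes, making this a raw string.
fixed = r'\[NUM]\s*(?:hour|minute|time)s?\b'

print('\b' == '\x08')   # True: in a normal string, \b is one backspace character
print(len(r'\b'))       # 2: in a raw string, it stays backslash + b

broken_result = re.search(broken, text)
fixed_result = re.search(fixed, text)

print(broken_result)         # None
print(fixed_result.group())  # [NUM] minutes
```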

sariii Over a year ago

Ahhhh, that's true. Thanks sooo much :)
