I have a DataFrame like this:

text

Is it possible to apply [NUM] times
Is it possible to apply [NUM] time
Called [NUM] hour ago
waited [NUM] hours
waiting [NUM] minute
waiting [NUM] minutes???
Are you kidding me !
Waiting?
I didn't like it

I want to detect rows that contain "[NUM] time", "[NUM] times", "[NUM] minute", "[NUM] minutes", "[NUM] hour" or "[NUM] hours". A row should also be flagged if it contains "!" (one or more) or "??" (at least two consecutive question marks).

So the result would look like this:

text                                   available

Is it possible to apply [NUM] times    True
Is it possible to apply [NUM] time     True
Called [NUM] hour ago                  True
waited [NUM] hours                     True
waiting [NUM] minute                   True
waiting [NUM] minutes???               True
Are you kidding me !                   True
Waiting?                               False
I didn't like it                       False

I want something like the following, but I don't know how to combine all these conditions:

df["available"] = df['text'].apply(lambda x: re.match(r'[\!* | \?+ | [NUM] time | [NUM] hour | [NUM] minute]')

1 Answer

You can use Series.str.contains with a regex:

import pandas as pd
df = pd.DataFrame({'text':["Is it possible to apply [NUM] times","Is it possible to apply [NUM] time","Called [NUM] hour ago","waited [NUM] hours","waiting [NUM] minute","waiting [NUM] minutes???","Are you kidding me !","Waiting?", "I didn't like it"]})
df['available'] = df['text'].str.contains(r'\[NUM]\s*(?:hour|minute|time)s?\b|!|\?{2}', regex=True)
## => df['available']
#     0     True
#     1     True
#     2     True
#     3     True
#     4     True
#     5     True
#     6     True
#     7    False
#     8    False

Regex details:

  • \[NUM] - [NUM] string
  • \s* - zero or more whitespaces
  • (?:hour|minute|time) - a non-capturing group matching hour, minute or time
  • s? - an optional s
  • \b - a word boundary
  • | - or
  • ! - a ! char
  • | - or
  • \?{2} - two question marks.
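The breakdown above can also be written directly into the pattern. The following is a minimal sketch, not part of the original answer: it assumes the same regex compiled with re.VERBOSE so each piece carries its own comment, and the helper name is_available is made up for illustration. It checks plain strings with re.search, which looks for a match anywhere in the string, as Series.str.contains does.

```python
import re

# Same pattern as in the answer, spelled out in verbose mode.
# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment.
PATTERN = re.compile(
    r"""
    \[NUM]\s*              # the literal [NUM] marker, then optional whitespace
    (?:hour|minute|time)   # one of the three unit words
    s?\b                   # an optional plural s, then a word boundary
    | !                    # or a single exclamation mark
    | \?{2}                # or two consecutive question marks
    """,
    re.VERBOSE,
)


def is_available(text: str) -> bool:
    """Hypothetical helper: True if the pattern occurs anywhere in text."""
    return PATTERN.search(text) is not None


samples = [
    "Called [NUM] hour ago",
    "waiting [NUM] minutes???",
    "Are you kidding me !",
    "Waiting?",
    "I didn't like it",
]
results = [is_available(s) for s in samples]
print(results)
```

With a DataFrame, the compiled pattern could be applied as df['text'].apply(is_available), although str.contains as shown in the answer is the more direct route.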

5 Comments

Do you happen to know why the same code does not work when I use a plain string rather than text inside a data frame? With text="Been on hold for [NUM] minutes at that number, AFTER it wouldn't let me cancel the reservation." and available = re.match('r\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?\b|!{2}|\?{2}', text), available is None. Sorry for not opening a new question; I'm new to regex, I thought this might be an easy question, and I have already received many downvotes :((((
@sariii Correct, you would get lots of downvotes on such a question. The answer is "use re.search". See What is the difference between re.search and re.match?
Yeah, I figured that if I posted that question my score would drop to -1000 :))). Thanks for sharing the link. However, neither search nor match returns any reasonable output; both returned None. The way I understood the difference between them, it only matters when a newline or ^ is involved. My case is just one line, so I think both match and search should be able to do the job. Am I missing something here?
@sariii Your regex works, you just made a typo by moving the raw string literal r prefix into the string literal itself. See this Python demo.
Ahhhh that's true, thanks sooo much :)
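The two problems discussed in these comments can be reproduced in a short sketch. This is an illustration, not the linked demo: the variable names are made up, and the typo pattern is built as 'r' + good to show the stray literal r without triggering invalid-escape warnings. In the comment's actual string, the missing raw prefix would also turn \b into a backspace character.

```python
import re

text = ("Been on hold for [NUM] minutes at that number, "
        "AFTER it wouldn't let me cancel the reservation.")

# Correct: the r prefix sits outside the quotes, making this a raw string literal.
good = r'\[NUM]\s*(?:hour|minute|time|number|hr|Hr)s?\b|!{2}|\?{2}'

# Typo: the r ends up inside the string, so the regex now demands a literal
# letter "r" immediately before "[NUM]".
typo = 'r' + good

# re.match only tries at the start of the string, and the text starts with "Been".
at_start = re.match(good, text)

# re.search scans the whole string and finds the marker in the middle.
anywhere = re.search(good, text)

# With the stray "r", even re.search finds nothing in this text.
with_typo = re.search(typo, text)

print(at_start)          # None
print(anywhere.group())  # [NUM] minutes
print(with_typo)         # None
```

So for a single string, both fixes are needed: keep the r prefix outside the quotes and use re.search instead of re.match.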
