Use regex to extract characters after a substring in Python

Question

I have a string that looks something like this -

text = 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange.'

I need to extract the words after a particular sub-string, in this case, 'sky is'. That is, I want a list that gives me this -

['blue', 'orange']

I have tried the following -

import re
p1 = re.compile(r"is (.+?) ", re.I)
re.findall(p1, text)

But this gives the output only as

['blue']

If, however, my text is

text = 'during the day, the color of the sky is blue at sunset, the color of the sky is orange or yellow.'

and I run

p1 = re.compile(r"is (.+?) ",re.I)
re.findall(p1,text)

I get the output as -

['blue', 'orange']

Please help! I am new to regular expressions and I am stuck!

You are matching a space after the group, and a literal space is significant in a regex. In the first example there is a space after blue, but not after orange. See regex101.com/r/Zvtuyz/1
– The fourth bird, Commented Jul 3, 2020 at 15:02
If you open regex101.com/r/Zvtuyz/1, you will see that in the first example only one match is highlighted, because 'the sky is orange.' ends with a dot. To match either a space or a dot, use \bis (.+?)[ .]; to match only a single word, use \bis (\w+)[ .]
– The fourth bird, Commented Jul 3, 2020 at 15:04
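
To make the comments above concrete, here is a minimal sketch that runs the original pattern next to the two suggested variants on the first example string. The variable names (with_space, space_or_dot, single_word) are only illustrative and do not come from the thread.

```python
import re

text = ('during the day, the color of the sky is blue. '
        'at sunset, the color of the sky is orange.')

# Original pattern: the group must be followed by a literal space.
# 'orange.' is followed by nothing, so only one match is found, and the
# lazy group runs up to the first space, so that match includes the dot.
with_space = re.findall(r"is (.+?) ", text, re.I)

# Variants from the comment: stop at a space or a dot,
# or capture exactly one word with \w+.
space_or_dot = re.findall(r"\bis (.+?)[ .]", text, re.I)
single_word = re.findall(r"\bis (\w+)[ .]", text, re.I)

print(with_space)
print(space_or_dot)
print(single_word)
```

The \b word boundary keeps the pattern from matching an "is" that sits at the end of a longer word such as "this".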

LukasNeugebauer · Accepted Answer · 2020-07-03 15:04:25Z

1

It's not a very general solution, but it works for your string.

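import re
my_str = 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange.'
r = re.compile(r'sky is [a-z]+')
# findall returns the full matches; keep only the last word of each
out = [x.split()[-1] for x in r.findall(my_str)]
# ['blue', 'orange']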
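
A possible generalisation of this idea, offered as a sketch rather than as the answerer's method: put the wanted word in a capture group, so that findall returns it directly and no split is needed. The helper name words_after is hypothetical, and the use of re.escape and \w+ assumes the marker is plain text and the target is a single word.

```python
import re

def words_after(marker, s):
    """Return the word that follows each occurrence of `marker` in `s`.

    Hypothetical helper for illustration; assumes the target is one word.
    """
    # re.escape keeps any special characters in the marker literal;
    # with a single capture group, findall returns only the captured word.
    pattern = re.compile(r"\b" + re.escape(marker) + r"\s+(\w+)")
    return pattern.findall(s)

my_str = ('during the day, the color of the sky is blue. '
          'at sunset, the color of the sky is orange.')
out = words_after("sky is", my_str)
print(out)
```

Because \w+ stops at the first non-word character, this version does not depend on whether the word is followed by a space, a dot, or the end of the string.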

singrium · Accepted Answer · 2020-07-03 16:04:02Z

1

In your regex pattern, you only capture a string that is followed by a blank space; however, 'orange' is followed by a dot '.', which is why it is not captured.
You have to include the dot '.' in your pattern.

p1 = re.compile(r"is (.+?)[ \.]", re.I)
re.findall(p1,text)
# ['blue', 'orange']

Demo:
https://regex101.com/r/B8jhdF/2

EDIT:
If the word is at the end of the sentence and not followed by a dot '.', I suggest this:

text = 'during the day, the color of the sky is blue at sunset, the color of the sky is orange'
p1 = re.compile(r"is (.+?)([ \.]|$)")
found_patterns = re.findall(p1,text)
[elt[0] for elt in found_patterns]
# ['blue', 'orange']
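
As a small variation on the EDIT above (a sketch, not part of the original answer): making the trailing group non-capturing with (?:...) lets findall return plain strings, so the elt[0] step is not needed. The name p_noncap is illustrative only.

```python
import re

text = ('during the day, the color of the sky is blue at sunset, '
        'the color of the sky is orange')

# (?:...) groups the alternatives without capturing them, so the pattern
# has a single capture group and findall returns a list of strings.
p_noncap = re.compile(r"is (.+?)(?:[ .]|$)")
found = p_noncap.findall(text)
print(found)

# The same pattern also handles the dotted version of the sentence.
dotted = ('during the day, the color of the sky is blue. '
          'at sunset, the color of the sky is orange.')
found_dotted = p_noncap.findall(dotted)
print(found_dotted)
```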

edited Jul 3, 2020 at 16:04

answered Jul 3, 2020 at 15:07

3 Comments

ekhumoro Over a year ago

This will fail if the last character isn't a space or dot, or if there are no other characters after the last word.

singrium Over a year ago

That's right, but these are the cases he is dealing with (in his question)

rbc-2019 Over a year ago

Thank you so much! But what if there is nothing after 'orange', i.e. the text is 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange'? How should I extract it then?
