I have a string that looks something like this -

text = 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange.'

I need to extract the words after a particular sub-string, in this case, 'sky is'. That is, I want a list that gives me this -

['blue', 'orange']

I have tried the following -

import re
p1 = re.compile(r"is (.+?) ", re.I)
re.findall(p1, text)

But this gives the output only as

['blue']

If, however, my text is

text = 'during the day, the color of the sky is blue at sunset, the color of the sky is orange or yellow.'

and I run

p1 = re.compile(r"is (.+?) ",re.I)
re.findall(p1,text)

I get the output as -

['blue', 'orange']

Please help! I am new to regular expressions and I am stuck!

  • Your pattern requires a space after the group, and that space is matched literally. There is a space after blue, but there is none after orange in the first example. See regex101.com/r/Zvtuyz/1 Commented Jul 3, 2020 at 15:02
  • Could you please elaborate? Commented Jul 3, 2020 at 15:03
  • Try this: re.compile(r"is (.+?)(?: |\.)", re.I) Commented Jul 3, 2020 at 15:04
  • If you open regex101.com/r/Zvtuyz/1 you will see that in the first example only one match is highlighted in green, because "the sky is orange." ends with a dot rather than a space. To match either a space or a dot, use \bis (.+?)[ .]; to match only a single word, use \bis (\w+)[ .] Commented Jul 3, 2020 at 15:04
  • Just use re.findall(r'(?i)\bsky\s+is\s+(\w+)', text) Commented Jul 3, 2020 at 15:07

2 Answers

1

It's not a very general solution, but it works for your string.

import re
my_str = 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange.'
r = re.compile('sky is [a-z]+')
# Each match looks like 'sky is <word>'; keep only the last word.
out = [x.split()[-1] for x in r.findall(my_str)]
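
A variant of the same idea, as a rough sketch: with a single capture group, re.findall returns only the captured word, so the split() step is not needed. The re.I flag is an assumption here, added to mirror the pattern in the question.

```python
import re

my_str = 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange.'

# With exactly one capture group, findall returns the group's text
# instead of the whole match.
r = re.compile(r'sky is ([a-z]+)', re.I)
out = r.findall(my_str)
print(out)
```

Like the version above, this only handles values made of plain letters.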

1

In your regex pattern, you only capture text that is followed by a space. However, 'orange' is followed by a dot '.', which is why it is not captured.
You have to allow the dot '.' in your pattern as well.

p1 = re.compile(r"is (.+?)[ \.]", re.I)
re.findall(p1,text)
# ['blue', 'orange']

Demo:
https://regex101.com/r/B8jhdF/2

EDIT:
If the word is at the end of the sentence and not followed by a dot '.', I suggest this:

text = 'during the day, the color of the sky is blue at sunset, the color of the sky is orange'
p1 = re.compile(r"is (.+?)([ \.]|$)")
found_patterns = re.findall(p1,text)
[elt[0] for elt in found_patterns]
# ['blue', 'orange']
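
An alternative sketch for the same case: a lookahead checks for a space, a dot, or the end of the string without capturing it, so findall returns plain strings and the extra list comprehension is not needed. The \b before "is" and the re.I flag are assumptions added here; \b keeps the pattern from matching an "is" inside a longer word.

```python
import re

text = 'during the day, the color of the sky is blue at sunset, the color of the sky is orange'

# (?=[ .]|$) asserts that a space, a dot, or the end of the string follows,
# but it is not a capture group, so findall yields strings rather than tuples.
p1 = re.compile(r"\bis (.+?)(?=[ .]|$)", re.I)
words = p1.findall(text)
print(words)
```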

3 Comments

This will fail if the last character isn't a space or dot, or if there are no other characters after the last word.
That's right, but these are the cases he is dealing with (in his question)
Thank you so much! But what if there is nothing after 'orange', as in, the text was 'during the day, the color of the sky is blue. at sunset, the color of the sky is orange'? How should I extract it then?
