Going backwards in regex Python

Question

I have been working on this all day and can't find a solution. Here is my current code:

stranger = re.search(r"Stranger:</strong> <span>.+?</span></p></div></div></div>", html2)

I want the match to be this:

"Stranger:</strong> <span>What now?</span></p></div></div></div>" = True

from a string like this:

"<div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>Wow</span></p></div><div class=\"logitem\"><p class=\"youmsg\"><strong class=\"msgsource\">You:</strong> <span>Eek</span></p></div><div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>What now?</span></p></div></div></div>"

Instead I get this:

"Stranger:</strong> <span>Wow</span></p></div><div class=\"logitem\"><p class=\"youmsg\"><strong class=\"msgsource\">You:</strong> <span>Eek</span></p></div><div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>What now?</span></p></div></div></div>" = True

Basically, I want to match everything before the final "</span></p></div></div></div>" and after the nearest preceding "<span>" (no slash). I've tried all kinds of things without success. Is anyone able to help?

Don't use a regular expression to parse HTML. Use a DOM parser like Beautiful Soup.
– Barmar, Commented Apr 28, 2020 at 23:01
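
To illustrate the parser-based approach from the comment above, here is a minimal sketch. It assumes the chat markup shown in the question and uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without third-party packages. The class name StrangerMessages is hypothetical.

```python
from html.parser import HTMLParser


class StrangerMessages(HTMLParser):
    """Collect the text inside <span> for every <p class="strangermsg">."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._in_stranger_p = False
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Only paragraphs written by the stranger are of interest.
            self._in_stranger_p = dict(attrs).get("class") == "strangermsg"
        elif tag == "span" and self._in_stranger_p:
            self._in_span = True
            self.messages.append("")
        elif tag == "br" and self._in_span:
            # Keep line breaks inside a message as newlines.
            self.messages[-1] += "\n"

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "p":
            self._in_stranger_p = False

    def handle_data(self, data):
        if self._in_span:
            self.messages[-1] += data


html2 = (
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="youmsg"><strong class="msgsource">You:</strong> <span>Eek</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>What now?</span></p></div></div></div>'
)

parser = StrangerMessages()
parser.feed(html2)
parser.close()

last_stranger_message = parser.messages[-1]
print(last_stranger_message)
```

Taking the last collected message avoids having to anchor on the trailing "</div></div></div>" at all.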

Alexander Wu · Accepted Answer · 2020-04-28 23:33:31Z


Try requiring that the text between the two inner tags contains no tag characters. For example:

stranger = re.search(r"Stranger:</strong> <span>[^<>]+?</span></p></div></div></div>", html2)

This means that the text between those two inner tags cannot contain any < or > characters, so the match can no longer start at an earlier "Stranger:" and run across the tags in between.
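
As a quick check of the difference, the following sketch runs both patterns against the sample string from the question. The variable names lazy, strict, and TAIL are only for illustration, and a capture group is added to the second pattern to pull out the message text.

```python
import re

html2 = (
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="youmsg"><strong class="msgsource">You:</strong> <span>Eek</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>What now?</span></p></div></div></div>'
)

TAIL = r"</span></p></div></div></div>"

# Lazy ".+?" still starts at the FIRST "Stranger:" and grows until TAIL fits.
lazy = re.search(r"Stranger:</strong> <span>.+?" + TAIL, html2)

# "[^<>]+?" cannot cross a tag, so only the last message can reach TAIL.
strict = re.search(r"Stranger:</strong> <span>([^<>]+?)" + TAIL, html2)

print(lazy.group(0)[:30])   # begins at the first stranger message
print(strict.group(1))      # text of the last stranger message
```

The lazy quantifier only minimizes where the match ends. It does not move the starting point, which is why the original pattern began at the first "Stranger:".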

edited Apr 28, 2020 at 23:33

answered Apr 28, 2020 at 23:00


3 Comments

Alexander Wu Over a year ago

Oh I see, sorry about that.

ColinTyphoon81 Over a year ago

Thank you for your response. However, the code now fails to match when the last "Stranger:" message contains a line break, for instance the response "Hi<br>I'm Mike!". Is there any way to make an exception for line breaks, only in that particular section of the string?

Alexander Wu Over a year ago

It depends on what exactly you want to match. If the only HTML you want to allow inside the message is <br>, then you could exclude just the closing tags, which begin with "</", by using a negative lookahead: "((?!</).)". However, if your requirements are that complicated, it may be better to use a different parser instead of regex.
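
A minimal sketch of that lookahead idea, assuming the message may contain <br> but never a closing tag. The sample string and the names pattern and last_message are made up for illustration. Note that this pattern admits any opening tag inside the message, not only <br>.

```python
import re

html2 = (
    '<p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> '
    "<span>Hi<br>I'm Mike!</span></p></div></div></div>"
)

# (?:(?!</).)+? consumes one character at a time, but refuses to step
# onto the start of a closing tag ("</"). Opening tags such as <br> pass.
pattern = re.compile(
    r"Stranger:</strong> <span>((?:(?!</).)+?)</span></p></div></div></div>",
    re.DOTALL,
)

match = pattern.search(html2)
last_message = match.group(1)
lines = last_message.split("<br>")
print(lines)
```

re.DOTALL is included so that a literal newline inside a message would not stop the match; it can be dropped if messages never contain one.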
