I have been working on this all day and can't find a solution. Here is my current code:

stranger = re.search(r"Stranger:</strong> <span>.+?</span></p></div></div></div>", html2)

I want a match like this:

"Stranger:</strong> <span>What now?</span></p></div></div></div>" = True

from a string like this:

"<div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>Wow</span></p></div><div class=\"logitem\"><p class=\"youmsg\"><strong class="msgsource">You:</strong> <span>Eek</span></p></div><div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>What now?</span></p></div></div></div>"

Instead I get this:

"Stranger:</strong> <span>Wow</span></p></div><div class=\"logitem\"><p class=\"youmsg\"><strong class=\"msgsource\">You:</strong> <span>Eek</span></p></div><div class=\"logitem\"><p class=\"strangermsg\"><strong class=\"msgsource\">Stranger:</strong> <span>What now?</span></p></div></div></div>" = True

Basically, I want to capture everything before the final "/span p div div div" and after the closest preceding "span" (no /), i.e. only the last Stranger message. I've tried all kinds of things, but nothing has worked. Can anyone help?

Don't use a regular expression to parse HTML. Use a DOM parser like Beautiful Soup. Commented Apr 28, 2020 at 23:01
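
To illustrate the parser-based approach the comment recommends, here is a minimal sketch. It uses the standard library's html.parser rather than Beautiful Soup, purely so that it runs without extra dependencies; the class name StrangerMessages is made up for this example, and the input is assumed to look like the string in the question.

```python
from html.parser import HTMLParser


class StrangerMessages(HTMLParser):
    """Hypothetical helper: collects the <span> text inside each <p class="strangermsg">."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._in_stranger_p = False
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Only paragraphs marked as stranger messages are of interest.
            self._in_stranger_p = dict(attrs).get("class") == "strangermsg"
        elif tag == "span" and self._in_stranger_p:
            self._in_span = True
            self.messages.append("")
        elif tag == "br" and self._in_span:
            # Keep line breaks inside a message as newlines.
            self.messages[-1] += "\n"

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "p":
            self._in_stranger_p = False

    def handle_data(self, data):
        if self._in_span:
            self.messages[-1] += data


# Sample input modelled on the string in the question.
html2 = (
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="youmsg"><strong class="msgsource">You:</strong> <span>Eek</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>What now?</span></p></div></div></div>'
)

parser = StrangerMessages()
parser.feed(html2)
parser.close()

print(parser.messages[-1])  # What now?
```

Because the parser tracks tags itself, the last Stranger message is simply the final entry in the list, and a `<br>` inside a message needs no special handling in a pattern.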

1 Answer

Try restricting what is allowed between the two inner tags so that the match cannot run across other tags. For example,

stranger = re.search(r"Stranger:</strong> <span>[^<>]+?</span></p></div></div></div>", html2)

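This means that the text between those two inner tags cannot contain any < or > characters. Your original pattern fails because re.search takes the leftmost possible starting point: it begins at the first "Stranger:", and the lazy .+? then keeps expanding, across every tag in between, until the closing sequence finally matches at the end of the string. With [^<>] the first "Stranger:" can no longer reach that closing sequence, so the search moves on to the last one.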
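
As a quick check of the difference, here is a small sketch that runs both patterns against a string shaped like the one in the question. The variable names lazy and strict and the added capture group are only for illustration.

```python
import re

# Sample input modelled on the string in the question.
html2 = (
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="youmsg"><strong class="msgsource">You:</strong> <span>Eek</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>What now?</span></p></div></div></div>'
)

# Original pattern: starts at the first "Stranger:" and expands to the very end.
lazy = re.search(r"Stranger:</strong> <span>.+?</span></p></div></div></div>", html2)

# Restricted pattern: the message text may not contain < or >.
# The parentheses are an addition here, to pull out just the message.
strict = re.search(r"Stranger:</strong> <span>([^<>]+?)</span></p></div></div></div>", html2)

print(lazy.group(0).startswith("Stranger:</strong> <span>Wow"))  # True
print(strict.group(1))  # What now?
```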

3 Comments

Oh I see, sorry about that.
Thank you for your response. However, the code now fails to match when the last "Stranger:" message contains a line break, for instance the response "Hi<br>I'm Mike!". Is there any way to make an exception for line breaks, only in that particular section of the string?
It depends on exactly what you want to match. If the only HTML you want to allow inside the message is <br>, then you could exclude just the closing tags, which begin with "</", by using a negative lookahead in front of each character: "((?!</).)". However, if your requirements are that complicated, it may be better to use a proper parser instead of regex.
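
A minimal sketch of how that lookahead could slot into the full pattern, assuming the message may contain `<br>` but never a closing tag. The sample string is made up from the "Hi<br>I'm Mike!" example above, and the grouping is one possible arrangement rather than the only one.

```python
import re

# Hypothetical input: the last Stranger message contains a <br>.
html2 = (
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> <span>Wow</span></p></div>'
    '<div class="logitem"><p class="strangermsg"><strong class="msgsource">Stranger:</strong> '
    "<span>Hi<br>I'm Mike!</span></p></div></div></div>"
)

# (?:(?!</).)+? consumes one character at a time, but refuses to step onto
# the start of any closing tag ("</"). Opening tags such as <br> are allowed.
pattern = re.compile(
    r"Stranger:</strong> <span>((?:(?!</).)+?)</span></p></div></div></div>"
)

match = pattern.search(html2)
print(match.group(1))  # Hi<br>I'm Mike!
```

Note that `.` does not match newline characters unless re.DOTALL is passed, so a message containing a literal newline would need that flag.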
