I'm trying to match only sequential occurrences of a specific tag in a snippet of HTML.
For the test string "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..", I want to only match 'Time', 'Warner' and 'Satan' (either as separate strings or one group, doesn't matter) but not 'BAD'.
My closest attempt so far is (<em>(?P<match>.*?)</em>[\s\.]){2,}, which gives me only 'Satan'. It does seem to enforce the "2 or more" requirement, but it doesn't return every tag in that match. I'm guessing I need a solution involving positive lookaheads, but I can't seem to get anywhere with those.
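For reference, here is a minimal reproduction of that attempt, assuming Python's standard re module (the variable names are just for illustration). As far as I can tell, a named group inside a repeated group only keeps the capture from its final iteration, which would explain why only 'Satan' comes back.

```python
import re

# The test string from the question
text = ("blah <em>BAD</em> blah blah blah <em>Time</em> "
        "<em>Warner</em> <em>Satan</em>. The blah ..")

# My attempt: two or more <em> tags, each followed by whitespace or a period
pattern = re.compile(r"(<em>(?P<match>.*?)</em>[\s\.]){2,}")

m = pattern.search(text)

# A group that is repeated by a quantifier retains only its last iteration
last_capture = m.group("match")

# The full span consumed by all iterations of the repeated group
whole_match = m.group(0)

print(last_capture)
print(whole_match)
```

Printing the whole match also suggests that the lazy .*? is free to stretch across a closing tag when the engine backtracks, so the overall match seems to begin at <em>BAD rather than at <em>Time.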
I've looked at various other related questions but couldn't find a suitable solution. Most of them are simply filled with answers stating that HTML should never be parsed with regex, instead of answering the question. I'd be happy with an lxml/BeautifulSoup solution, as long as it enforces the sequential property of my requirements, but I'm most interested in the regex, even just from a curiosity point of view. I know that what I'm looking for must be possible with regex.
Thanks for your help and input.
Edit: I've realised that I could get around this with a simpler approach: match all instances of the tag with <em>(?P<match>.*?)</em>, iterate over the match objects, and compare the end position of each match with the start position of the next. It'd work, but I'd rather find a neater solution.
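A rough sketch of that workaround, assuming Python's re module. The function name sequential_em, the min_run parameter and the GAP pattern are my own hypothetical choices, and I'm assuming two tags count as sequential when nothing but whitespace or periods sits between them:

```python
import re

EM = re.compile(r"<em>(?P<match>.*?)</em>")
# Assumption: only whitespace and periods may separate sequential tags
GAP = re.compile(r"[\s.]*")


def sequential_em(html, min_run=2):
    """Return lists of <em> contents that occur in runs of at least min_run."""
    runs = []
    current = []
    prev_end = None
    for m in EM.finditer(html):
        # Compare the end of the previous match with the start of this one:
        # if the text in between is not a pure separator, the run is broken.
        if prev_end is not None and not GAP.fullmatch(html, prev_end, m.start()):
            if len(current) >= min_run:
                runs.append(current)
            current = []
        current.append(m.group("match"))
        prev_end = m.end()
    # Flush the final run
    if len(current) >= min_run:
        runs.append(current)
    return runs


text = ("blah <em>BAD</em> blah blah blah <em>Time</em> "
        "<em>Warner</em> <em>Satan</em>. The blah ..")
print(sequential_em(text))  # [['Time', 'Warner', 'Satan']]
```

This keeps 'BAD' out because it forms a run of one, but it is exactly the kind of bookkeeping I was hoping a single pattern could replace.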