I'm trying to match only sequential occurrences of a specific tag in a snippet of html. For the test string "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..", I want to only match 'Time', 'Warner' and 'Satan' (either as separate strings or one group, doesn't matter) but not 'BAD'.

My closest attempt so far is (<em>(?P<match>.*?)</em>[\s\.]){2,}, which gives me only 'Satan'. It does seem to enforce the "2 or more" requirement, but it doesn't return everything in that match. I'm guessing I need a solution involving positive lookaheads, but I can't seem to get anywhere with those.

I've looked at various other related questions but couldn't seem to find a suitable solution. Most related questions are simply filled with answers stating that HTML should never be parsed with regex, instead of answering the question. I'd be happy with an lxml/BeautifulSoup solution, as long as it enforces the sequential property of my requirements but I'm most interested in the regex, even just from a curiosity point of view. I know that what I'm looking for must be possible with regex.
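For the parser route, here is a minimal sketch using the standard library's html.parser rather than lxml or BeautifulSoup (a swapped-in tool, chosen only so the example has no dependencies). The class name EmRunParser and its attributes are made up for illustration, and it assumes flat, non-nested <em> tags separated by nothing but whitespace.

```python
from html.parser import HTMLParser

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."


class EmRunParser(HTMLParser):
    """Collect runs of 2+ <em> elements separated only by whitespace (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_em = False
        self.current = []  # texts of the run currently being built
        self.runs = []     # completed runs with at least two elements

    def _close_run(self):
        # Keep the run only if it satisfies the "2 or more" requirement.
        if len(self.current) >= 2:
            self.runs.append(self.current)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.in_em = True
            self.current.append("")
        else:
            self._close_run()  # any other tag breaks the run

    def handle_endtag(self, tag):
        if tag == "em":
            self.in_em = False
        else:
            self._close_run()

    def handle_data(self, data):
        if self.in_em:
            self.current[-1] += data
        elif data.strip():
            self._close_run()  # non-whitespace text between tags breaks the run

    def close(self):
        super().close()
        self._close_run()  # flush a run that reaches the end of the input


parser = EmRunParser()
parser.feed(html)
parser.close()
print(parser.runs)
```

Each entry in parser.runs is one sequence of adjacent tags, so 'BAD' never shows up because its run has length one.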

Thanks for your help and input.

Edit: I've realised that I could get around this with a simpler approach: match all instances of the tag with <em>(?P<match>.*?)</em>, iterate over the match objects, and compare the start and end positions of neighbouring matches. It'd work, but I'd rather find a neater solution.
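A rough sketch of that position-comparing approach, assuming "sequential" means that only whitespace may sit between one closing tag and the next opening tag. The function name sequential_runs and the min_len parameter are made up for illustration.

```python
import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

tag_rx = re.compile(r"<em>(?P<match>.*?)</em>")


def sequential_runs(text, min_len=2):
    """Group tag bodies into runs of adjacent tags (hypothetical helper)."""
    runs, current, prev_end = [], [], None
    for m in tag_rx.finditer(text):
        # The tag continues the run only if the gap since the previous
        # tag is empty or pure whitespace.
        if prev_end is not None and text[prev_end:m.start()].strip() == "":
            current.append(m.group("match"))
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [m.group("match")]
        prev_end = m.end()
    if len(current) >= min_len:
        runs.append(current)
    return runs


runs = sequential_runs(html)
print(runs)
```

This keeps the regex trivial and moves the "2 or more in a row" rule into plain Python, which also makes it easy to change what counts as an allowed gap.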

  • Obligatory link to a glorious SO post on HTML and regex. Commented Jan 23, 2014 at 12:08
  • Why are you trying to parse HTML with regular expressions? Really, BeautifulSoup is the superior tool to handle HTML. Commented Jan 23, 2014 at 12:09
  • So you want to match sequences of matching tags, with nothing in between them other than whitespace, of at least 2 tags or more? Commented Jan 23, 2014 at 12:10
  • I really don't get your requirements. Why should Time, Warner be matched and BAD shouldn't? Commented Jan 23, 2014 at 12:21
  • @HamZa, I don't think that's what he wants. Example. Commented Jan 23, 2014 at 12:40

1 Answer

If you're curious about a re solution, it might look like this:

import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

rx = r"""(?x)          # extended mode - enable comments
    (                  # match a tag
        <em            # tag name
          [^<>]*       # maybe also attributes
        >              # open tag matched
        (              # now match the tag body
            (?<!</em)  # there must be no closing tag before a character
            .          # a body character
        ) *            # some more characters like this
        </em>          # closing tag
        \s*            # maybe some spaces after it
    ){2,}              # repeat the whole thing twice or more
"""

print(re.sub(rx, r'{{\g<0>}}', html))
# blah <em>BAD</em> blah blah blah {{<em>Time</em> <em>Warner</em> <em>Satan</em>}}. The blah ..

4 Comments

Thank you! At first look this seems like the answer I was looking for. I don't think I'll ultimately use this (I'll probably use the approach mentioned in the edit) but it is precisely the solution I wanted to see. Would you mind updating your answer with a breakdown of the regex expression please? I'll test this a bit later when I have time and accept the answer if I can't break it.
Ok, so I played around with this a bit. I realise now I could've been clearer about the requirements: I don't actually need to support attributes. I am in control of the tags; it's not actually some random HTML input. Anyway, that works fine with re.sub, but it doesn't seem to work with findall. Do you know why that might be?
@IcelandicHoney: you might want to enclose the whole thing into another pair of () so it gets captured by findall. But you'll get the whole block at once ['<em>Time</em> <em>Warner</em> <em>Satan</em>']. I don't see any way to get ['<em>Time</em>', '<em>Warner</em>', '<em>Satan</em>'] as separate matches.
Well that would do, just a bit confused about findall's behaviour with groups. Incidentally, my original attempt works (for my needs) when used in re.sub but I had only been testing with findall.
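On the findall behaviour discussed in these comments: when a pattern contains capturing groups, re.findall returns the groups rather than the whole match, and a repeated group only keeps its last repetition. The sketch below shows that, then a two-pass workaround (whole runs first, then the bodies inside each run). The patterns are simplified variants without attribute support, and the names capturing, run_rx and body_rx are made up for illustration.

```python
import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

# With a capturing group, findall reports only the group, and a repeated
# group holds just its final repetition.
capturing = r"(<em>(?:(?!</em>).)*</em>\s*){2,}"
last_only = re.findall(capturing, html)
print(last_only)

# With non-capturing groups (?:...), findall returns the whole match.
run_rx = re.compile(r"(?:<em>(?:(?!</em>).)*</em>\s*){2,}")
body_rx = re.compile(r"<em>(.*?)</em>")

blocks = run_rx.findall(html)
print(blocks)

# Second pass: pull the individual tag bodies out of each run.
words = [body_rx.findall(block) for block in blocks]
print(words)
```

Using finditer and calling group(0) on each match object is another way to get the whole run without rewriting the groups.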