Python regex match sequential html tags

Question

I'm trying to match only sequential occurrences of a specific tag in a snippet of html. For the test string "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..", I want to match only 'Time', 'Warner' and 'Satan' (either as separate strings or as one group, it doesn't matter) but not 'BAD'.

My closest attempt so far is (<em>(?P<match>.*?)</em>[\s\.]){2,}, which gives me 'Satan'. At least it seems to be enforcing the "2 or more" part, but it isn't returning everything in that match. I'm guessing I need a solution involving positive lookaheads, but I can't seem to get anywhere with those.

I've looked at various other related questions but couldn't seem to find a suitable solution. Most related questions are simply filled with answers stating that HTML should never be parsed with regex, instead of answering the question. I'd be happy with an lxml/BeautifulSoup solution, as long as it enforces the sequential property of my requirements but I'm most interested in the regex, even just from a curiosity point of view. I know that what I'm looking for must be possible with regex.

Thanks for your help and input.

Edit: I've realised that I could get around this with a simpler approach: match all instances of the tag with <em>(?P<match>.*?)</em>, iterate over the match objects, and compare the end position of each match with the start position of the next. It'd work, but I'd rather find a neater solution.
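
For illustration, the position-comparing approach described in the edit could be sketched roughly as follows. This is only a sketch under assumptions: the tag is taken to be <em> without attributes, "sequential" is taken to mean that only whitespace separates neighbouring tags, and the helper name sequential_runs is made up for this example.

```python
import re

html = ("blah <em>BAD</em> blah blah blah "
        "<em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..")

# Assumption: the tag is <em> and carries no attributes.
tag_rx = re.compile(r"<em>(?P<match>.*?)</em>")

def sequential_runs(text, min_run=2):
    """Group tag bodies whose tags are separated only by whitespace."""
    runs, current, prev_end = [], [], None
    for m in tag_rx.finditer(text):
        # Anything other than whitespace between two tags ends the run.
        if prev_end is not None and text[prev_end:m.start()].strip():
            if len(current) >= min_run:
                runs.append(current)
            current = []
        current.append(m.group("match"))
        prev_end = m.end()
    # Flush the final run, if it is long enough.
    if len(current) >= min_run:
        runs.append(current)
    return runs

print(sequential_runs(html))  # [['Time', 'Warner', 'Satan']]
```

Each run comes back as its own list, so the individual words stay separate and 'BAD' is dropped because its run has only one tag.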

Why are you trying to parse HTML with regular expressions? Really, BeautifulSoup is the superior tool to handle HTML.
– Martijn Pieters, Commented Jan 23, 2014 at 12:09
So you want to match sequences of matching tags, with nothing in between them other than whitespace, of at least 2 tags or more?
– Martijn Pieters, Commented Jan 23, 2014 at 12:10
I really don't get your requirements. Why should Time, Warner be matched and BAD shouldn't?
– HamZa, Commented Jan 23, 2014 at 12:21

georg · Accepted Answer · 2014-01-23 13:57:34Z

If you're curious about a re solution, it might look like this:

import re

html = "blah <em>BAD</em> blah blah blah <em>Time</em> <em>Warner</em> <em>Satan</em>. The blah .."

rx = r"""(?x)          # extended mode - enable comments
    (                  # match a tag
        <em            # tag name
          [^<>]*       # maybe also attributes
        >              # open tag matched
        (              # now match the tag body
            (?<!</em)  # there must be no closing tag before a character
            .          # a body character
        ) *            # some more characters like this
        </em>          # closing tag
        \s*            # maybe some spaces after it
    ){2,}              # repeat the whole thing twice or more
"""

print(re.sub(rx, r'{{\g<0>}}', html))
# blah <em>BAD</em> blah blah blah {{<em>Time</em> <em>Warner</em> <em>Satan</em>}}. The blah ..
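
If the individual tag bodies are wanted as separate strings, one possible workaround is a two-step search: first find each run of adjacent tags, then pull the bodies out of that run. The sketch below is not the pattern from this answer verbatim. It swaps the lookbehind for a negative lookahead, (?:(?!</em>).)*, and uses non-capturing groups; the names run_rx and body_rx are made up for the example.

```python
import re

html = ("blah <em>BAD</em> blah blah blah "
        "<em>Time</em> <em>Warner</em> <em>Satan</em>. The blah ..")

# Step 1: a run of two or more <em> tags separated only by whitespace.
run_rx = re.compile(r"(?:<em[^<>]*>(?:(?!</em>).)*</em>\s*){2,}")
# Step 2: the body of a single <em> tag inside such a run.
body_rx = re.compile(r"<em[^<>]*>((?:(?!</em>).)*)</em>")

words = [body
         for run in run_rx.finditer(html)
         for body in body_rx.findall(run.group(0))]

print(words)  # ['Time', 'Warner', 'Satan']
```

The outer pattern enforces the "two or more in a row" rule, and the inner pattern only ever sees text that has already passed it.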

edited Jan 23, 2014 at 13:57

answered Jan 23, 2014 at 12:54

georg


4 Comments

Icelandic Honey Over a year ago

Thank you! At first look this seems like the answer I was looking for. I don't think I'll ultimately use this (I'll probably use the approach mentioned in the edit) but it is precisely the solution I wanted to see. Would you mind updating your answer with a breakdown of the regex expression please? I'll test this a bit later when I have time and accept the answer if I can't break it.

Icelandic Honey Over a year ago

Ok, so I played around with this a bit. I realise now I could've been clearer about the requirements: I don't actually need to support attributes. I am in control of the tags; it's not some random html input. Anyway, that works fine with re.sub, but it doesn't seem to work with findall. Do you know why that might be?

georg Over a year ago

@IcelandicHoney: you might want to enclose the whole thing into another pair of () so it gets captured by findall. But you'll get the whole block at once ['Time Warner Satan']. I don't see any way to get ['Time', 'Warner', 'Satan'] as separate matches.

Icelandic Honey Over a year ago

Well that would do, just a bit confused about findall's behaviour with groups. Incidentally, my original attempt works (for my needs) when used in re.sub but I had only been testing with findall.
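
For readers puzzled by the same findall behaviour, here is a small sketch of what is going on, assuming attribute-free <em> tags as in the question. When a pattern contains capturing groups, re.findall returns the groups rather than the whole match, and a repeated group keeps only its last iteration, which is why only 'Satan' shows up.

```python
import re

text = "<em>Time</em> <em>Warner</em> <em>Satan</em>."

# Capturing groups: findall returns one tuple of groups per match,
# and each repeated group holds only its final iteration.
captured = re.findall(r"(<em>(.*?)</em>\s*){2,}", text)
print(captured)  # [('<em>Satan</em>', 'Satan')]

# Non-capturing group: findall returns the whole match instead.
whole = re.findall(r"(?:<em>.*?</em>\s*){2,}", text)
print(whole)     # ['<em>Time</em> <em>Warner</em> <em>Satan</em>']

# finditer yields match objects, so group(0) is available either way.
via_iter = [m.group(0) for m in re.finditer(r"(<em>(.*?)</em>\s*){2,}", text)]
print(via_iter)  # ['<em>Time</em> <em>Warner</em> <em>Satan</em>']
```

re.sub always replaces the whole match, which would explain why the same pattern looks correct there but not with findall.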
