Python, regex and html: match final tag on line

Question

I'm confused about Python's greedy/non-greedy quantifiers.

"Given multi-line html, return the final tag on each line."

I would think this would be correct:

re.findall('<.*?>$', html, re.MULTILINE)

I'm irked because I expected a list of single tags like:

"</html>", "<ul>", "</td>".

My O'Reilly Pocket Reference says that *? will "match 0 or more times, but as few times as possible."

So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?

You shouldn't be using RegEx to parse HTML. You should be using an (x)html parser like BeautifulSoup or minidom.
– g.d.d.c, Commented Nov 10, 2011 at 20:37
See the top-voted answer to this question: stackoverflow.com/questions/1732348
– Jim Garrison, Commented Nov 10, 2011 at 20:41
In the interest of brevity, I didn't mention that I was just toying around to better understand regex. I didn't realize I accidentally asked one of the most commonly mal-framed questions on SO.
– MockWhy, Commented Nov 10, 2011 at 21:51

Firstrock · Accepted Answer · 2011-11-10 20:52:54Z

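Your problem stems from the end-of-line anchor ('$'). A regex match always begins at the leftmost position where a match is possible, so the engine first locks onto the first '<' on the line. Non-greedy matching only controls how far the match extends from there: '.*?' grows one character at a time until the rest of the pattern succeeds, and the rest of your pattern is a '>' that you have constrained, with the $ anchor, to be at the end of the line. Because '.' also matches '<' and '>', the match runs straight through any intermediate tags until it reaches that final '>'. So a non-greedy * is no different from a greedy * in this situation.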
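To see this concretely, here is a minimal sketch that runs the lazy and greedy versions side by side. The sample string `html` is a made-up input for illustration, not the asker's actual data.

```python
import re

# Hypothetical sample input (not from the original question).
html = "<ul><li>one</li>\n<td>cell</td>\n</html>"

# Lazy quantifier, anchored at end of line (the pattern from the question).
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)

# Greedy quantifier, same anchor.
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<ul><li>one</li>', '<td>cell</td>', '</html>']
print(greedy)  # identical: the anchor forces both to run to the end of the line
```

Only the last line, which holds a single tag, comes back as a single tag. That would explain why some matches look "greedier" than others.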

Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack (see @Mark's answer): keep angle brackets out of the tag body. '<[^><]*>$' will work, because the character class cannot cross a tag boundary.
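A short sketch of the negated-character-class pattern in use, again on a made-up sample string (`html` and `last_tags` are illustrative names only):

```python
import re

# Hypothetical sample input (not from the original question).
html = "<ul><li>one</li>\n<td>cell</td>\n</html>"

# [^><]* cannot consume '<' or '>', so a match can only span a single tag,
# and '$' with re.MULTILINE requires that tag to end the line.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['</li>', '</td>', '</html>']
```

Note that a line with anything after its last tag, even trailing whitespace, produces no match with this pattern, since the '>' must sit directly at the end of the line.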

edited Nov 10, 2011 at 20:52

answered Nov 10, 2011 at 20:44
