I'm confused about Python's greedy and non-greedy quantifiers.

"Given multi-line html, return the final tag on each line."

I would think this would be correct:

re.findall('<.*?>$', html, re.MULTILINE)

I'm irked because I expected a list of single tags like:

"</html>", "<ul>", "</td>".

My O'Reilly Pocket Reference says that *? will "match 0 or more times, but as few times as possible."

So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?

  • You shouldn't be using RegEx to parse HTML. You should be using an (x)html parser like BeautifulSoup or minidom. Commented Nov 10, 2011 at 20:37
  • See the top-voted answer to this question: stackoverflow.com/questions/1732348 Commented Nov 10, 2011 at 20:41
  • In the interest of brevity, I didn't mention that I was just toying around to better understand regex. I didn't realize I accidentally asked one of the most commonly mal-framed questions on SO. Commented Nov 10, 2011 at 21:51

1 Answer

Your problem stems from the end-of-line anchor ('$'). A non-greedy quantifier does not change where a match starts: the engine still begins at the leftmost position where the pattern can succeed, which is the first '<' on the line. From there, '.*?' consumes as few characters as it can, but every '>' it stops at must also be followed by the end of the line (because of the $ anchor), so it keeps expanding until it reaches the final '>'. A non-greedy * is therefore no different from a greedy * in this situation.
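To see this concretely, here is a small sketch. The sample string is made up for illustration (it is not the asker's actual input); it compares the non-greedy pattern from the question with its greedy counterpart.

```python
import re

# Hypothetical sample input, not taken from the original question.
sample_html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# Pattern from the question: non-greedy, anchored at end of line.
lazy_matches = re.findall(r'<.*?>$', sample_html, re.MULTILINE)

# Same pattern with a greedy quantifier.
greedy_matches = re.findall(r'<.*>$', sample_html, re.MULTILINE)

print(lazy_matches)    # ['<html><body>', '<ul><li>item</li></ul>', '</html>']
print(greedy_matches)  # identical to lazy_matches for this input
```

Each match starts at the first '<' on its line and is forced to run to the end of the line, so lines with several tags yield several tags per match, while a line holding a single tag yields just that tag.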

Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack...see @Mark's answer. '<[^><]*>$' will work: the character class cannot match '<' or '>', so a match can never span more than one tag.
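A minimal sketch of the negated-character-class pattern, again on a made-up sample string (an assumption for illustration, not the asker's data):

```python
import re

# Hypothetical sample input, not taken from the original question.
sample_html = "<html><body>\n<ul><li>item</li></ul>\n</html>"

# [^><]* cannot cross a '<' or '>', so only the last tag on each line
# can satisfy the trailing $ anchor.
final_tags = re.findall(r'<[^><]*>$', sample_html, re.MULTILINE)

print(final_tags)  # ['<body>', '</ul>', '</html>']
```

Note that this only matches when the tag is the very last thing on the line; a line with trailing whitespace or text after the final tag would produce no match. As the comments point out, this is fine for toying with regex, but an HTML parser is the better tool for real markup.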
