Need help parsing HTML with a regex in Python

Question

My string is

mystring = """<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees)
100.00</span></td></tr>"""

I need to search this string and extract the total amount. This is what I tried:

test = re.search("(Indian Rupees)(\d{2})(?:\D|$)", mystring)

but test comes back as None. How can I extract the value? The amounts can be values such as 10.00, 100.00, or 1000.00.

Thanks

Eli Bendersky · Accepted Answer · 2010-03-27 05:07:06Z

7

I strongly recommend using a real HTML parser for this instead of a custom regular expression.

Here's an example with the BeautifulSoup library:

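from bs4 import BeautifulSoup  # BeautifulSoup 4 (package name: beautifulsoup4)

html = r'''
<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees) 100.00</span></td></tr>
'''

# Parse with the standard-library backend so no extra parser is required.
soup = BeautifulSoup(html, 'html.parser')

# Find every <span class="para">, then take the last whitespace-separated token.
spans = soup.find_all('span', attrs={'class': 'para'})
amount_tokens = spans[0].text.split()
print(amount_tokens[-1])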
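If installing a third-party library is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration under assumptions: the class name SpanTextParser and the helper total_amount are hypothetical, and the code assumes the amount is the last whitespace-separated token inside a span with class "para", as in the question's markup.

```python
from decimal import Decimal
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text found inside <span class="para"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while we are inside a matching span
        self.chunks = []  # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # Enter a matching span, or track a span nested inside one.
        if self.depth or dict(attrs).get("class") == "para":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def total_amount(html):
    """Return the last token of the span text as a Decimal (hypothetical helper)."""
    parser = SpanTextParser()
    parser.feed(html)
    parser.close()
    tokens = "".join(parser.chunks).split()
    return Decimal(tokens[-1])


html = ("<tr><td><span class='para'><b>Total Amount : </b>"
        "INR (Indian Rupees) 100.00</span></td></tr>")
print(total_amount(html))
```

Converting to Decimal rather than float keeps the two decimal places exact, which is usually what you want for currency.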

twasbrillig · Accepted Answer · 2014-11-13 07:45:31Z

3

I second Eli's response - you'll be better off using an HTML parser.

Personally, I would highly recommend the lxml library for parsing HTML: http://lxml.de/

It's extremely fast and feature-rich.

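from lxml.html import fromstring

s = """
<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees)
100.00</span></td></tr>
"""

doc = fromstring(s)
# cssselect() needs the separate `cssselect` package to be installed.
for span in doc.cssselect('span.para'):
    # The amount is the last whitespace-separated token of the span's text.
    print(span.text_content().split()[-1])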

edited Nov 13, 2014 at 7:45

twasbrillig

answered Mar 27, 2010 at 6:02

Ruslan Spivak

3 Comments

vy32 Over a year ago

lxml is great with well-formatted HTML; BeautifulSoup is great with HTML that isn't.

Ruslan Spivak Over a year ago

lxml can deal with broken HTML pretty well, unless it's complete "tag soup", of course.

Devin Jeanpierre Over a year ago

@vy32 lxml works better than BS on a lot of stuff, and when it doesn't it can use BS's parsing (via lxml.html.soupparser). It can also use html5lib's (lxml.html.html5parser) if you want to use the HTML5 parsing rules. So, use lxml, it gives you the most options, is actually maintained, etc.

Justin Peel · Accepted Answer · 2010-03-27 06:11:29Z

1

I agree that a parser is a great way to go, but since you asked how to do it with regex, here's a way:

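import re

mystring = """<tr><td><span class='para'><b>Total Amount :
</b>INR (Indian Rupees) 100.00</span></td></tr>"""

# Escape the parentheses so they match literally, then capture everything up to the next tag.
test = re.search(r"\(Indian Rupees\) ([^<]+)", mystring)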

Then you'll get the number with:

test.group(1)
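For anyone wondering why the pattern in the question returned None: unescaped parentheses form a capture group, so (Indian Rupees) matches the bare words and \d{2} then has to follow immediately, but the next character in the text is a literal ")". Below is a rough sketch of a slightly stricter variant that captures only the number and tolerates a line break before it. The pattern AMOUNT_RE and the helper find_amount are illustrative names, not part of the answer above.

```python
import re

# Assumed pattern for illustration: escaped parentheses, optional whitespace
# (including a line break), then digits with an optional decimal part.
AMOUNT_RE = re.compile(r"\(Indian Rupees\)\s*(\d+(?:\.\d+)?)")


def find_amount(html):
    """Return the amount as a string, or None when nothing matches."""
    match = AMOUNT_RE.search(html)
    return match.group(1) if match else None


# The pattern from the question: "Indian Rupees" matches, but the next
# character is ")" rather than a digit, so the search fails.
original = re.search(r"(Indian Rupees)(\d{2})(?:\D|$)", "INR (Indian Rupees) 100.00")
print(original)  # None

for value in ("10.00", "100.00", "1000.00"):
    html = "<b>Total Amount : </b>INR (Indian Rupees)\n%s</span>" % value
    print(find_amount(html))
```

This still depends on the exact label text and on the number following it directly, so it carries the same fragility as any regex applied to markup.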

5 Comments

Devin Jeanpierre Over a year ago

You should maybe read weblogs.asp.net/alex_papadimoulis/archive/2005/05/25/…

Justin Peel Over a year ago

@Devin Please read the first line of my answer. I agreed that a (HTML) parser was the way to go (which had already been posted about by others) but showed the asker how to modify his code to make it work in the manner he intended it to work. Hopefully, the asker at least learned a little more about regex which isn't a bad thing. Thanks anyway though I felt that your suggested link was somewhat inappropriate and a bit rude. I got your point, but I hope that you get my point as well.

Devin Jeanpierre Over a year ago

Knowledge isn't always a good thing. If you only wanted to educate, there are far better things to teach. For example, you could explain why parsers are right, and regexes wrong. They don't work! They're fragile and fundamentally incapable of handling the full power of HTML. Instead you put a token half-line about how maybe they aren't the right tool. But you didn't back this up convincingly-- even though you claim that, you spend the rest of the post contradicting it with action: a regex-based solution. It sends the wrong message and enables the wrong choice. It's the wrong answer.

Justin Peel Over a year ago

@Devin Regexes aren't all that bad of a thing to learn. For all I knew, this could have been a Regex exercise like the one in Google's Python Class, code.google.com/edu/languages/google-python-class/exercises/…. Maybe you should write to the maker of that class and correct him and anyone else who makes a regex exercise that uses HTML. My answer was not as well formulated as some of my other ones, but I didn't think that anyone would be so offended by it because it was posted after two other answers were already posted and upvoted and mine was intended as a side note.

Devin Jeanpierre Over a year ago

Argh! I can certainly express my disagreement with what Google's decided to teach, but my words don't hold much weight (argument from authority is so tragic). Anyway, I am not offended; I merely disagree. I can't agree with helping people use regular expressions for irregular problems, or even for regular subsets in a case such as this. Regular expressions are fundamentally incapable of handling the full strength of HTML, and that's why they're wrong, and teaching them for HTML is wrong. It's not like they're worth it, HTML parsers are easy in Python (easier than regex, at any rate).
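The fragility argument in this thread can be made concrete with a small experiment. The sketch below runs the regex from the answer against the original markup and against a hypothetical variant in which the amount is wrapped in an extra <i> tag, and compares it with plain text extraction via the standard library's html.parser. The names TextExtractor, amount_by_regex, and amount_by_parser are made up for this illustration, and the "take the last token" rule is itself a simplifying assumption.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather all text nodes, ignoring the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def amount_by_regex(html):
    # Same pattern as in the regex answer above.
    match = re.search(r"\(Indian Rupees\) ([^<]+)", html)
    return match.group(1) if match else None


def amount_by_parser(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    tokens = "".join(extractor.chunks).split()
    return tokens[-1] if tokens else None


plain = ("<span class='para'><b>Total Amount : </b>"
         "INR (Indian Rupees) 100.00</span>")
# Hypothetical variant: the amount is wrapped in an extra tag.
wrapped = ("<span class='para'><b>Total Amount : </b>"
           "INR (Indian Rupees) <i>100.00</i></span>")

for label, html in (("plain", plain), ("wrapped", wrapped)):
    print(label, amount_by_regex(html), amount_by_parser(html))
```

The regex finds the amount only in the first string, because in the second the character after the label is "<". The parser-based version reads the text regardless of which tags surround it, although a real page would still need a more specific selection than "last token of all text".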
