1

My string is

mystring = "<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees) 100.00</span></td></tr>"

My problem is that I need to search this string and extract the total amount:

test = re.search("(Indian Rupees)(\d{2})(?:\D|$)", mystring)

but test is None. How can I get the value? The values can be 10.00, 100.00, or 1000.00.

Thanks

3 Answers

7

I strongly recommend using a real HTML parser for this, instead of a custom regular-expression.

Here's an example with the BeautifulSoup library:

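from bs4 import BeautifulSoup  # BeautifulSoup 4: pip install beautifulsoup4

# Named html rather than str to avoid shadowing the built-in str type.
html = r'''
<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees) 100.00</span></td></tr>
'''

soup = BeautifulSoup(html, 'html.parser')

# Find every <span class="para"> and take the last whitespace-separated token.
amount = soup.find_all('span', attrs={'class': 'para'})
amount_tokens = amount[0].text.split()
print(amount_tokens[-1])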
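
If installing a third-party library is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not part of the answer above: the class name SpanTextCollector and its attributes are made up for this example, and it assumes the markup has no unclosed tags (such as a bare <br>) inside the span.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside every <span class="para"> element.

    Hypothetical helper for illustration; assumes every tag opened
    inside the span is also closed.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0     # > 0 while inside a matching span
        self._chunks = []   # text pieces of the span being read
        self.texts = []     # one finished string per matching span

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested tag inside the span (for example <b>).
            self._depth += 1
        elif tag == "span" and "para" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The matching span just closed: store its full text.
                self.texts.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


html = "<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees) 100.00</span></td></tr>"

collector = SpanTextCollector()
collector.feed(html)
collector.close()

# The amount is the last whitespace-separated token of the span text.
amount = collector.texts[0].split()[-1]
print(amount)
```

For anything beyond a small, well-formed snippet like this one, a dedicated library handles malformed markup far more gracefully than this hand-rolled collector.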

3

I second Eli's response - you'll be better off using an HTML parser.

Personally, I would highly recommend the lxml library for parsing HTML: http://lxml.de/

It's extremely fast and feature-rich.

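from lxml.html import fromstring

s = """
<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees)
100.00</span></td></tr>
"""

doc = fromstring(s)
# cssselect() needs the separate cssselect package in recent lxml releases.
for span in doc.cssselect('span.para'):
    # The amount is the last whitespace-separated token of the span text.
    print(span.text_content().split()[-1])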

3 Comments

lxml is great with well-formatted HTML; BeautifulSoup is great with HTML that isn't.
lxml can deal with broken HTML pretty well, unless it's a complete "tag soup", of course.
@vy32 lxml works better than BS on a lot of stuff, and when it doesn't it can use BS's parsing (via lxml.html.soupparser). It can also use html5lib's (lxml.html.html5parser) if you want to use the HTML5 parsing rules. So, use lxml, it gives you the most options, is actually maintained, etc.
1

I agree that a parser is a great way to go, but since you asked how to do it with regex, here's a way:

mystring = """<tr><td><span class='para'><b>Total Amount :
</b>INR (Indian Rupees) 100.00</span></td></tr>"""

import re

test = re.search(r"\(Indian Rupees\) ([^<]+)", mystring)

Then you'll get the number with:

test.group(1)
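
For context on why the pattern in the question returned None: its unescaped parentheses form a capture group rather than matching literal brackets, so it looks for "Indian Rupees" immediately followed by two digits, while the actual text has ")" and whitespace in between. A slightly stricter variant of the idea above is sketched here. The names AMOUNT_RE and extract_amount are made up for illustration, and the pattern assumes the amount is always a plain decimal number such as 10.00 or 1000.00.

```python
import re

# Escaped parentheses match the literal "(Indian Rupees)"; \s* tolerates any
# whitespace (including a line break) before the number; the capture group
# takes the digits plus an optional decimal part.
AMOUNT_RE = re.compile(r"\(Indian Rupees\)\s*(\d+(?:\.\d+)?)")


def extract_amount(html):
    """Return the amount as a string, or None if the pattern is absent."""
    match = AMOUNT_RE.search(html)
    return match.group(1) if match else None


template = "<tr><td><span class='para'><b>Total Amount : </b>INR (Indian Rupees) {}</span></td></tr>"

# The three example values from the question.
results = [extract_amount(template.format(value)) for value in ("10.00", "100.00", "1000.00")]
print(results)

# A line break between the label and the number is also accepted.
wrapped = "<span class='para'><b>Total Amount : </b>INR (Indian Rupees) \n100.00</span>"
print(extract_amount(wrapped))
```

Capturing only digits, instead of everything up to the next "<", keeps stray whitespace out of the result, but it is still tied to this exact label text and will break if the markup changes.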

5 Comments

@Devin Please read the first line of my answer. I agreed that a (HTML) parser was the way to go (which had already been posted about by others) but showed the asker how to modify his code to make it work in the manner he intended it to work. Hopefully, the asker at least learned a little more about regex which isn't a bad thing. Thanks anyway though I felt that your suggested link was somewhat inappropriate and a bit rude. I got your point, but I hope that you get my point as well.
Knowledge isn't always a good thing. If you only wanted to educate, there are far better things to teach. For example, you could explain why parsers are right, and regexes wrong. They don't work! They're fragile and fundamentally incapable of handling the full power of HTML. Instead you put a token half-line about how maybe they aren't the right tool. But you didn't back this up convincingly-- even though you claim that, you spend the rest of the post contradicting it with action: a regex-based solution. It sends the wrong message and enables the wrong choice. It's the wrong answer.
@Devin Regexes aren't all that bad of a thing to learn. For all I knew, this could have been a Regex exercise like the one in Google's Python Class, code.google.com/edu/languages/google-python-class/exercises/…. Maybe you should write to the maker of that class and correct him and anyone else who makes a regex exercise that uses HTML. My answer was not as well formulated as some of my other ones, but I didn't think that anyone would be so offended by it because it was posted after two other answers were already posted and upvoted and mine was intended as a side note.
Argh! I can certainly express my disagreement with what Google's decided to teach, but my words don't hold much weight (argument from authority is so tragic). Anyway, I am not offended; I merely disagree. I can't agree with helping people use regular expressions for irregular problems, or even for regular subsets in a case such as this. Regular expressions are fundamentally incapable of handling the full strength of HTML, and that's why they're wrong, and teaching them for HTML is wrong. It's not like they're worth it, HTML parsers are easy in Python (easier than regex, at any rate).
