0

I'm trying to match a line from a webpage with a regular expression. The line is as follows:

<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>

This is what I tried, but it doesn't seem to work. Can anyone help me out? 'htmlbody' contains the HTML page, and no, I did not forget to import 're'.

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody)
print 'Value is', value

3 Answers

4

There is no surefire way to do this with a regex. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why. What you need is an HTML parser like HTMLParser:

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindTDs(HTMLParser):
        def __init__(self):
                HTMLParser.__init__(self)
                self.level = 0

        def handle_starttag(self, tag, attrs):
                if tag == 'td':
                        self.level = self.level + 1

        def handle_endtag(self, tag):
                if tag == 'td':
                        self.level = self.level - 1

        def handle_data(self, data):
                if self.level > 0:
                        print data

find = FindTDs()

html = "<table>\n"
for i in range(3):
        html += "\t<tr>"
        for j in range(5):
                html += "<td>%s.%s</td>" % (i, j)
        html += "</tr>\n"
html += "</table>"

find.feed(html)
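
The same idea can be narrowed to the cell from the question. This is only a sketch under a few assumptions: it is written for Python 3 (where the module is html.parser rather than HTMLParser), it assumes the bgcolor attribute is what identifies the wanted cells, and the class name BoldCellText and the two-row sample page are made up for illustration.

```python
from html.parser import HTMLParser


class BoldCellText(HTMLParser):
    """Collect the text of <b> elements inside <td bgcolor='#ffffcc'> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # currently inside a matching <td>
        self.in_bold = False   # currently inside a <b> within that <td>
        self.values = []       # one string per matching <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # attrs is a list of (name, value) pairs; quotes are already stripped
            self.in_cell = dict(attrs).get("bgcolor") == "#ffffcc"
        elif tag == "b" and self.in_cell:
            self.in_bold = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current value
        if self.in_bold:
            self.values[-1] += data


# Invented sample page: the row from the question, repeated twice
row = ("<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td>"
       "<td align=center width=80>80</td></tr>\n")
parser = BoldCellText()
parser.feed("<table>\n" + row * 2 + "</table>")
parser.close()
print(parser.values)
```

Because the parser looks at tag names and attributes rather than the exact text of the line, it should not care about attribute order, quoting style, or line breaks between the tags.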

1

It sounds like you may want to use findall rather than search:

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.findall(htmlbody)
print 'Found %i match(es)' % len(value)

I have to caution you, though, that regular expressions are notoriously poor at handling HTML. You're better off using a proper parser, such as the HTMLParser module built into Python.
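
To make the difference between search and findall concrete, here is a small sketch in Python 3 syntax. The two-row page is invented sample data, and the names first_value and all_values are just for illustration.

```python
import re

pattern = re.compile(
    r"<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td>"
    r"<td align=center width=80>"
)

# Invented sample page containing two rows of the kind described in the question
htmlbody = (
    "<table>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>first</b></td>"
    "<td align=center width=80>1</td></tr>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>second</b></td>"
    "<td align=center width=80>2</td></tr>\n"
    "</table>"
)

# search() returns a match object for the first hit only (or None)
first = pattern.search(htmlbody)
first_value = first.group(1) if first else None

# findall() returns the captured group of every non-overlapping hit
all_values = pattern.findall(htmlbody)

print(first_value)
print(all_values)
```

Note that search returns a match object, so the captured text has to be pulled out with group(1), whereas findall hands back the captured strings directly.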

1

This

import re

htmlbody = "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>"

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody).group(1)
print 'Value is', value

prints out

Value is random Value

Is this what you want?

2 Comments

Not completely. It works when the <tr>... string is assigned to htmlbody. However, in my script htmlbody is a whole HTML page, and in that case it doesn't seem to work. I forgot to mention: the page contains multiple instances of this line...
Do you mean that <tr> may be on the previous line? Is it possible to exclude it from the regexp? You can try reading all the lines, joining them without line breaks, and searching for all occurrences of the regexp. Or you can try to make the regexp more general.
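
One possible reading of "more general", as a sketch only: drop the <tr> and the second <td> from the pattern, allow whitespace (including line breaks) between the tags, and ignore case. This is Python 3 syntax, the sample htmlbody is invented, and the pattern is still a guess about what the real page looks like, so it remains fragile compared to a parser.

```python
import re

# Looser pattern: any <td> carrying bgcolor=#ffffcc (quoted or not),
# optional whitespace between tags, case-insensitive tag names.
loose = re.compile(
    r"<td[^>]*bgcolor\s*=\s*['\"]?#ffffcc['\"]?[^>]*>\s*<b>\s*([^<]*?)\s*</b>",
    re.IGNORECASE,
)

# Invented sample: one row split across lines, one row in a different style
htmlbody = (
    "<tr>\n"
    "  <td width=60 bgcolor='#ffffcc'>\n"
    "    <b>random Value</b>\n"
    "  </td><td align=center width=80>\n"
    '<TR><TD bgcolor="#FFFFCC" width=60><B>other Value</B></TD>\n'
)

# \s matches newlines too, so the lines do not need to be joined first
values = loose.findall(htmlbody)
print(values)
```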
