Python Regexp problem

Question

I'm trying to match a line from a web page with a regular expression. The line is as follows:

<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>

This is what I tried, but it doesn't seem to work. Can anyone help me out? 'htmlbody' contains the HTML page, and no, I did not forget to import 're'.

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody)
print 'Value is', value

Community · Accepted Answer · 2017-05-23 12:08:37Z

There is no surefire way to do this with a regex; see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for the reasons. What you need is an HTML parser such as HTMLParser:

#!/usr/bin/python
# Python 2 example: print the text found inside every <td> element.

from HTMLParser import HTMLParser

class FindTDs(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.level = 0  # number of <td> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.level += 1

    def handle_endtag(self, tag):
        if tag == 'td':
            self.level -= 1

    def handle_data(self, data):
        # Only report text that appears inside at least one <td>.
        if self.level > 0:
            print data

find = FindTDs()

# Build a small 3x5 test table.
html = "<table>\n"
for i in range(3):
    html += "\t<tr>"
    for j in range(5):
        html += "<td>%s.%s</td>" % (i, j)
    html += "</tr>\n"
html += "</table>"

find.feed(html)
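
As a rough sketch of how the same approach could be pointed at the markup from the question, the following collects the text of every <b> nested inside a <td>. It is written for Python 3, where the module is html.parser rather than HTMLParser. The class name BoldCellText and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class BoldCellText(HTMLParser):
    """Collect the text of every <b> nested inside a <td> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0      # how many <td> elements are currently open
        self.in_bold = False   # True while inside a <b> within a <td>
        self.values = []       # collected text, one entry per <b>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.td_depth += 1
        elif tag == 'b' and self.td_depth > 0:
            self.in_bold = True
            self.values.append('')

    def handle_endtag(self, tag):
        if tag == 'td' and self.td_depth > 0:
            self.td_depth -= 1
        elif tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.values[-1] += data


# Sample input: the line from the question plus a second, made-up row.
sample_page = (
    "<table>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td>"
    "<td align=center width=80>x</td></tr>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>another Value</b></td>"
    "<td align=center width=80>y</td></tr>\n"
    "</table>"
)

parser = BoldCellText()
parser.feed(sample_page)
parser.close()
bold_values = parser.values
print(bold_values)
```

Because the parser only tracks tag names, this keeps working if the width or bgcolor attributes change, which the literal regex from the question would not.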

Ben Blank · Accepted Answer · 2009-04-17 23:26:50Z

It sounds like you may want to use findall rather than search:

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.findall(htmlbody)
print 'Found %i match(es)' % len(value)

I have to caution you, though, that regular expressions are notoriously poor at handling HTML. You're better off with a proper parser, such as the HTMLParser module built into Python.
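
To illustrate the difference: when the pattern has a single capture group, findall returns a list of the captured strings, one per match, so a page with several such rows yields several values. The page below is a made-up stand-in for htmlbody, and the snippet uses Python 3 syntax.

```python
import re

# Made-up stand-in for the real page: the row from the question, twice,
# with different values.
htmlbody = (
    "<table>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>first Value</b></td><td align=center width=80>1</td></tr>\n"
    "<tr><td width=60 bgcolor='#ffffcc'><b>second Value</b></td><td align=center width=80>2</td></tr>\n"
    "</table>"
)

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")

# With exactly one capture group, findall returns the captured text of each match.
values = reg.findall(htmlbody)
print('Found %i match(es): %r' % (len(values), values))
```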

clorz · Accepted Answer · 2009-04-17 22:56:45Z

This

import re

htmlbody = "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>"

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody).group(1)
print 'Value is', value

prints out

Value is random Value

Is this what you want?
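
One caveat with calling .group(1) directly: search returns None when nothing matches, so the call raises AttributeError on a page that lacks the line. A guarded variant might look like the sketch below (Python 3 syntax; the helper name first_value is hypothetical).

```python
import re

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")


def first_value(htmlbody):
    """Return the first captured value, or None if the pattern is absent (hypothetical helper)."""
    match = reg.search(htmlbody)
    if match is None:
        return None
    return match.group(1)


found = first_value("<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>")
missing = first_value("<html><body>no table here</body></html>")
print('Value is', found)
print('Value is', missing)
```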

2 Comments

MarcoW Over a year ago

Not completely. It works when the <tr>... string is assigned to htmlbody. However, in my script htmlbody is a whole HTML page, and in that case it doesn't seem to work. I forgot to mention: the page contains multiple instances of this line...

clorz Over a year ago

Do you mean that the <tr> may be on the previous line? Is it possible to exclude it from the regexp? You can try reading all the lines, gluing them together without line breaks, and searching for all occurrences of the specific regexp. Or you can try to make the regexp more general.
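
One way to read "more general" is to allow optional whitespace, including line breaks, between the tags instead of gluing the lines together first. The sketch below does that with \s* and also ignores case. The sample page and its layout are assumptions, the snippet uses Python 3 syntax, and the pattern is still fragile against markup changes such as reordered attributes.

```python
import re

# Assumed layout: the same row as in the question, but broken across lines
# and indented, as it might appear in a full page.
htmlbody = """<table>
  <tr>
    <td width=60 bgcolor='#ffffcc'><b>random Value</b></td>
    <td align=center width=80>1</td>
  </tr>
  <TR><TD width=60 bgcolor='#ffffcc'><B>other Value</B></TD><TD align=center width=80>2</TD></TR>
</table>"""

# \s* tolerates any whitespace (including newlines) between tags;
# re.IGNORECASE tolerates upper-case tag names.
reg = re.compile(
    r"<tr>\s*<td width=60 bgcolor='#ffffcc'>\s*<b>([^<]*)</b>\s*</td>\s*<td align=center width=80>",
    re.IGNORECASE,
)

values = reg.findall(htmlbody)
print(values)
```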
