I'm new to Stack Overflow and this is my first question.

I'm writing a Python script to parse an HTML page. The page looks like this:

<TABLE style="border: 1px solid black">

<TR>
    <TD colspan="2"><span id="text1" style="color: white">DATA1</span></TD>
</TR>
<TR>    
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename1" class="alsoname">DATA2</span></TD>
</TR>   
<TR>    
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename2" class="alsoname">DATA3</span></TD>
</TR>                                               
<TR>
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename3" class="alsoname">DATA4</span></TD>
</TR>
<TR>
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename4" class="alsoname">DATA5</span></TD>
</TR>
<TR>
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename5" class="alsoname">DATA6</span></TD>
</TR>                                               
<TR>
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename6" class="alsoname">DATA7</span></TD>
</TR>
<TR>
    <TD class="rowLabel" valign="top">Data name</TD>
    <TD valign="top" width="100"><span id="somename7" class="alsoname">DATA8</span></TD>
</TR>
</TABLE>

I would like to collect the DATA values from between the span tags, based on the span's id. For example, if the span id is somename1, I want to store its DATA value in a variable.

So far I have this code:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            for name, value in attrs:
                if name == 'id' and value == 'somename1':
                    print 'ID', value
                elif name == 'id' and value == 'somename2':
                    print 'ID', value
                elif name == 'id' and value == 'somename3':
                    print 'ID', value
                else:
                    print 'NO DATA'

p = MyHTMLParser()
p.feed(flush)  # flush is assumed to hold the page's HTML as a string

Can anybody help me?

2 Answers

I find that using BeautifulSoup with any sort of HTML is much easier.

from BeautifulSoup import BeautifulSoup as bs
from urllib2 import urlopen

data = urlopen('wherever').read()

soup = bs(data)

for span in soup.findAll('span'):
    print span['id'], span.text

You may have to refine some parts of it, since you only provided a table.

2 Comments

This seems much easier and I get the data printed out, but I get an error at the end: 'Traceback (most recent call last): File "g1.py", line 99, in <module> print span['id'], span.text File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 601, in __getitem__ return self._getAttrMap()[key] KeyError: 'id''
It's possible that other <span> tags are present on the page and one or more of them don't have an 'id' attribute. Like I said, you'll have to refine it to match just the table that contains the appropriate <span> tags. Also, I apologize for the delay; I hope this helps.

Overriding the handle_starttag method alone is not enough, because the text inside a tag is delivered separately to handle_data. In my opinion the basic HTMLParser is not very pleasant to use, so you may want to have a look at BeautifulSoup. With HTMLParser you could do it like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # The base class must be initialized, otherwise feed() fails.
        HTMLParser.__init__(self)
        self.collect_data = False
        self.tagname = None
        self.id = None
        self.somevar = None

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            for name, value in attrs:
                if name == 'id' and value == 'somename1':
                    # Remember that the next text chunk belongs to this span.
                    self.collect_data = True
                    self.tagname = tag
                    self.id = value

    def handle_data(self, data):
        if self.collect_data:
            self.somevar = data
            self.collect_data = False

            print "Tag: %s ID: %s" % (self.tagname, self.id)
            print "Data: %s" % data

The collect_data flag states that the next incoming data (in the handle_data method) should be stored in a variable. We turn this flag on when the id is somename1 and turn it off once the data has been collected. Not really beautiful, is it?
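For readers on Python 3, the same flag idea can be sketched with the standard-library html.parser module. This is only an illustration under a few assumptions: the class name SpanCollector, the wanted_ids parameter, the values dict, and the shortened SAMPLE_HTML string are made up for this example and are not part of the original question or answer. Instead of a single boolean, it remembers which wanted span it is currently inside and stores the text per id.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of <span> elements whose id is in wanted_ids.

    Hypothetical helper for illustration; not from the original thread.
    """

    def __init__(self, wanted_ids):
        super().__init__()  # sets up the parser's internal state
        self.wanted_ids = set(wanted_ids)
        self.current_id = None  # id of the wanted span we are inside, if any
        self.values = {}        # maps span id -> collected text

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == 'span':
            span_id = dict(attrs).get('id')  # None when the span has no id
            if span_id in self.wanted_ids:
                self.current_id = span_id
                self.values[span_id] = ''

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self.current_id is not None:
            self.values[self.current_id] += data

    def handle_endtag(self, tag):
        # Stop collecting at the closing span (nested spans are not handled).
        if tag == 'span':
            self.current_id = None


# Shortened stand-in for the page in the question (assumed sample data).
SAMPLE_HTML = """
<TABLE>
<TR><TD colspan="2"><span id="text1" style="color: white">DATA1</span></TD></TR>
<TR><TD class="rowLabel">Data name</TD>
    <TD><span id="somename1" class="alsoname">DATA2</span></TD></TR>
<TR><TD class="rowLabel">Data name</TD>
    <TD><span id="somename2" class="alsoname">DATA3</span></TD></TR>
<TR><TD><span>no id here</span></TD></TR>
</TABLE>
"""

parser = SpanCollector(['somename1', 'somename2'])
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.values)
```

Looking the id up with dict(attrs).get('id') means a span without an id is simply skipped rather than raising an error, and keeping the results in a dict avoids one elif branch per id.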

4 Comments

Maybe this could work. How do I print out the collected DATA together with the somename id it belongs to?
I edited my answer; now you also have the information about the tag/id when you access the data.
Uf, this is ugly :) I think I'm going to do it in Beautiful Soup, as proposed in the answer from @kryptn. Thanks anyway :)
Yep, it is. That's why I recommended BeautifulSoup. Good luck with it :)
