Scraping HTML table in Python lxml

Question

The question may sound easy, but I am having difficulty solving it. I have a table like the following:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

My code is as follows:

from lxml import etree

for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):
    for c in elem.xpath("//td"):
        if(c.getchildren()): # for the <span> thing
            text = c.xpath("//span/text()")
        else:
            text = c.text

But I am unable to iterate over the "td" elements. I have been trying all day, to no avail. I want to get 2003, 1.19, and -0.48.

Kindly help!

unutbu · Accepted Answer · 2014-12-06 13:21:07Z


It looks like you have HTML, not XML. Therefore, use lxml.html rather than lxml.etree to parse the data. If data.html looks like this:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

then

import lxml.html as LH
tree = LH.parse('data.html')
print([td.text_content() for td in tree.xpath('//td')])

yields

['2003', '1.19 ', '-0.48 ']
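
If the values are needed row by row, with the trailing spaces removed, the following is a minimal sketch rather than a definitive solution. It assumes the sample markup from the question (closing tags are added here so the fragment is complete) and parses it from a string instead of data.html. The key detail is the relative path './td': inside a loop over rows, '//td' would search the whole document again rather than only the current row.

```python
import lxml.html as LH

# Sample markup from the question; the closing </tbody></table> tags are
# added here (an assumption) so that the fragment is complete.
html = """
<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>
</tbody></table>
"""

tree = LH.fromstring(html)

rows = []
for tr in tree.xpath('//tr'):
    # './td' is evaluated relative to the current <tr>;
    # '//td' would match every <td> in the document on each iteration.
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    rows.append(cells)

print(rows)
```

For the sample above this prints [['2003', '1.19', '-0.48']]. text_content() collects the text of the cell and of any nested span, so no special case for the span children is needed.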

If

for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):

is not returning any elements, then you need to show us enough HTML to help us debug why this XPath is not working.
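
One possible cause, offered only as a guess since the full page is not shown: an XPath copied from a browser's developer tools can include a tbody step that the browser added to its DOM, while the raw HTML that lxml parses has none. lxml does not insert tbody on its own. The sketch below uses hypothetical markup (not the asker's page) to compare the three variants.

```python
import lxml.html as LH

# Hypothetical markup without <tbody>, standing in for the real page.
html = '<div id="printcontent"><table><tr><td>2003</td></tr></table></div>'
tree = LH.fromstring(html)

# Path that expects a <tbody> step, as a browser-generated XPath might.
with_tbody = tree.xpath('//*[@id="printcontent"]/table/tbody/tr')

# Same path without the <tbody> step.
without_tbody = tree.xpath('//*[@id="printcontent"]/table/tr')

# A looser path that matches rows at any depth below the container.
any_depth = tree.xpath('//*[@id="printcontent"]//tr')

print(len(with_tbody), len(without_tbody), len(any_depth))
```

Here the first path matches nothing and the other two each match the single row, so the output is 0 1 1. A descendant step such as //tr is less sensitive to this difference than a fully spelled-out path.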


1 Comment

user3001408 Over a year ago

Bravo! Yes, I made this XML vs. HTML mistake.
