Extracting lxml xpath for html table

Question

I have an HTML document similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract code/name/high/low information from the table.

I used the following code, adapted from a similar example on Stack Overflow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I get no output. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').

I don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.

I found my problem. Somehow the <tbody> tag got removed. In Firebug, <tbody> does show up right after <table class="quotes"> and before the <tr> tag.
– mkt2012, Commented Apr 7, 2011 at 19:45
Yes, this is a FAQ: browsers add mandatory (X)HTML elements (like head and tbody) to the DOM. By the way, this is exactly what @samplebias' answer says.
– user357812, Commented Apr 7, 2011 at 20:14

samplebias · Accepted Answer · 2011-04-07 19:38:30Z

43

You are probably looking at the HTML in Firebug, correct? The browser will insert the implicit tag <tbody> when it is not present in the document. The lxml library will only process the tags present in the raw HTML string.
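The difference can be checked directly. The following is a minimal sketch in Python 3, assuming lxml is installed; the sample markup is a trimmed-down stand-in for the table in the question, and the variable names are only illustrative.

```python
import lxml.html

# Trimmed-down sample of the table from the question; note there is no <tbody>.
raw_html = """
<html><body>
<table class="quotes">
  <tr><th>Code</th><th>Name</th></tr>
  <tr><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td></tr>
  <tr><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td></tr>
</table>
</body></html>
"""

tree = lxml.html.fromstring(raw_html)

# Path copied from the browser's DOM view: expects a <tbody> that is not in the markup.
with_tbody = tree.xpath('//table[@class="quotes"]/tbody/tr')
# Path that matches the raw markup as lxml parsed it.
without_tbody = tree.xpath('//table[@class="quotes"]/tr')

print(len(with_tbody), len(without_tbody))
```

With this sample, the first query should come back empty and the second should return the three rows, which matches the behavior described in the question.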

Omit the tbody level in your XPath. For example, this works:

import lxml.html

tree = lxml.html.fromstring(raw_html)  # raw_html holds the page source as a string
tree.xpath('//table[@class="quotes"]/tr')
# [<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
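To get the code/name/high/low values the question asks for, one option is to select rows with a descendant step (`//tr` under the table), so the query does not depend on whether a `<tbody>` is present. This is a sketch in Python 3, not the asker's original code; `extract_rows` is a hypothetical helper and the sample markup is a cleaned-up copy of the table from the question.

```python
import lxml.html

# Cleaned-up copy of the sample table from the question (no <tbody>).
raw_html = """
<html><body>
<table class="quotes">
  <tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
  <tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
      <td align="right">45.44</td><td align="right">44.26</td></tr>
  <tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
      <td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</body></html>
"""


def extract_rows(html_text):
    """Return each table row as a list of stripped cell strings."""
    tree = lxml.html.fromstring(html_text)
    rows = []
    # The descendant step '//tr' matches rows with or without an intervening <tbody>.
    for tr in tree.xpath('//table[@class="quotes"]//tr'):
        # text_content() also picks up text nested inside <a> elements.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


for row in extract_rows(raw_html):
    print(" ".join(row))
```

The trade-off is that `//tr` would also match rows of any table nested inside the quotes table, so the explicit `/tr` path from the answer is the stricter choice when the markup is known.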

2 Comments

ficuscr Over a year ago

I experienced this with Chrome as well. Was using its Copy XPath feature in the 'Inspect' right click menu. Kinda goofy.

lajarre Over a year ago

Do you know of any other "path-changing rules" that can happen in FF/Chrome? It would be interesting to compile them.
