Parsing HTML table with LXML in Python

Question

I need to parse an HTML table with the following structure:

<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
 <tbody>
   <tr width="620">
     <th width="620">Smth1</th>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth2</td>
     ...
   </tr>
   <tr bgcolor="E4E4E4" width="620">
     <td width="620">Smth3</td>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth4</td>
     ...
   </tr>
 </tbody>
</table>

Python code:

import requests
import lxml.html

r = requests.post(url, data)
html = lxml.html.document_fromstring(r.text)
# xpath1 was obtained with FireBug
rows = html.xpath(xpath1)[0].findall("tr")
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

But I get this on the line that calls html.xpath(xpath1)[0]:

IndexError: list index out of range

The task is to build a Python dict from this table. The number of rows can vary.

UPD. I changed how I fetch the HTML to rule out problems with the requests library. Now I parse it directly from a URL:

html = lxml.html.parse(test_url)

This shows that everything is OK with the HTML:

lxml.html.open_in_browser(html)

But still the same problem:

rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

Here is the xpath1:

'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'

UPD2. I found experimentally that the XPath stops matching at:

xpath1 = '/html/body/table/tbody'
print(html.xpath(xpath1))
# prints []

With a shorter xpath1 it seems to work: for xpath1 = '/html/body/table' it returns [<Element table at 0x2cbadb0>].

Pro tip: please include the full traceback of Python errors to reduce the need to guess for anyone helping you.
– Martijn Pieters, Commented Jan 17, 2013 at 22:46

Ehsan Kia · Accepted Answer · 2013-01-18 00:20:01Z

5

You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work:

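import requests
import lxml.html

# Relative path: assumes the parsed root element is the <table> itself
xpath1 = "tbody/tr"
r = requests.post(url, data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])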

Note that this produces a list of one-item lists, like this:

[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]

To get a flat list of the values, you can use this code:

xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)

This all assumes that r.text is exactly the HTML you posted above.
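Since the question asks for a dict, here is a minimal sketch of one way to get there. The question does not say what shape the dict should take, so this assumes one dict per data row, keyed by the text of the header cells. The sample markup, the column names "Name" and "Value", and the helper table_to_dicts are all hypothetical and only illustrate the idea.

```python
import lxml.html

# Hypothetical sample with the same shape as the table in the question;
# the column names and cell values are made up for illustration.
SAMPLE = """
<table class="table1">
  <tbody>
    <tr><th>Name</th><th>Value</th></tr>
    <tr bgcolor="ffffff"><td>Smth2</td><td>10</td></tr>
    <tr bgcolor="E4E4E4"><td>Smth3</td><td>20</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(table):
    """Return one dict per data row, keyed by the header row's cell text."""
    # ".//tr" matches rows whether or not a <tbody> wraps them.
    rows = table.xpath(".//tr")
    if not rows:
        return []
    # The first row is assumed to be the header row.
    header = [cell.text_content().strip() for cell in rows[0].xpath("./th|./td")]
    records = []
    for row in rows[1:]:
        values = [cell.text_content().strip() for cell in row.xpath("./th|./td")]
        records.append(dict(zip(header, values)))
    return records


table = lxml.html.fromstring(SAMPLE)
records = table_to_dicts(table)
print(records)
```

text_content() is used instead of .text so that cells containing nested markup still yield their full text. The number of rows does not matter, because every row after the header is handled the same way.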


1 Comment

Anatoly Maltsev Over a year ago

Described all changes in UPD, but the problem is still there

Martijn Pieters · Accepted Answer · 2013-01-17 22:45:47Z

0

Your .xpath(xpath1) XPath expression failed to find any elements. Check that expression for errors.
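One possible cause, which the asker has not confirmed but which would fit UPD2: browsers add <tbody> elements to the DOM even when the raw page source has none, so an XPath copied from FireBug can contain tbody steps that lxml never sees. lxml parses the raw source and does not insert <tbody>. The sketch below uses a hypothetical one-row page to show the difference.

```python
import lxml.html

# Hypothetical raw page source: note there is no <tbody> in the markup.
RAW = "<html><body><table><tr><td>Smth2</td></tr></table></body></html>"

doc = lxml.html.document_fromstring(RAW)

# Path as a browser tool would report it (with tbody): matches nothing.
with_tbody = doc.xpath("/html/body/table/tbody/tr")
# Path that follows the raw source (no tbody): matches the row.
without_tbody = doc.xpath("/html/body/table/tr")
# Using "//" before tr tolerates both layouts.
either = doc.xpath("/html/body/table//tr")

print(with_tbody)
print([td.text for td in without_tbody[0]])
print(len(either))
```

If this is what is happening, removing the tbody steps from xpath1, or replacing each "/tbody/" with "//", should make the expression match. Printing the result of each shortened path in Python is the quickest way to check.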


3 Comments

Anatoly Maltsev Over a year ago

Included XPath1 into description, checked it one more time with FireBug

Martijn Pieters Over a year ago

Run print html.xpath(xpath1) in Python to test it, not in FireBug.

Anatoly Maltsev Over a year ago

Described the situation in UPD2
