I need to parse an HTML table with the following structure:

<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
 <tbody>
   <tr width="620">
     <th width="620">Smth1</th>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth2</td>
     ...
   </tr>
   <tr bgcolor="E4E4E4" width="620">
     <td width="620">Smth3</td>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth4</td>
     ...
   </tr>
 </tbody>
</table>

Python code:

r = requests.post(url,data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
# xpath1 was copied from FireBug
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

But I get this on the third line:

IndexError: list index out of range

The goal is to build a Python dict from this table. The number of rows can vary.

UPD. I changed how I fetch the HTML to rule out problems with the requests library. Now I parse it directly from a URL:

html = lxml.html.parse(test_url)

This shows that everything is OK with the HTML:

lxml.html.open_in_browser(html)

But still the same problem:

rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

Here is the xpath1:

'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'

UPD2. I found experimentally that the XPath stops matching at:

xpath1 = '/html/body/table/tbody'
print html.xpath(xpath1)
# prints []

With a shorter xpath1 it seems to work: xpath1 = '/html/body/table' returns [<Element table at 0x2cbadb0>].

Pro tip: please include the full traceback of Python errors so that anyone helping you does not have to guess. Commented Jan 17, 2013 at 22:46

2 Answers

You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work

xpath1 = "tbody/tr"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

This produces a list of one-item lists, though, like this:

[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]

To get a flat list of the values instead, you can use this code:

xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)

This is all assuming that r.text is exactly what you posted up there.
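
Since the question asks for a dict, here is a minimal sketch of one way to get there, assuming the first row holds the <th> headers and every later row has the same number of cells. The sample markup, the column names "Name" and "Value", and the helper table_to_dicts are made up for illustration; they are not from the question.

```python
import lxml.html

# Hypothetical sample modeled on the table in the question;
# the column names and values are invented for this sketch.
SAMPLE = """
<table class="table1">
 <tbody>
   <tr><th>Name</th><th>Value</th></tr>
   <tr bgcolor="ffffff"><td>Smth2</td><td>10</td></tr>
   <tr bgcolor="E4E4E4"><td>Smth3</td><td>20</td></tr>
 </tbody>
</table>
"""


def table_to_dicts(table):
    """Return one dict per data row, keyed by the header row's cell text."""
    # ".//tr" matches rows with or without a <tbody> wrapper.
    # Note: it would also match rows of nested tables, if there were any.
    rows = table.xpath(".//tr")
    header = [cell.text_content().strip() for cell in rows[0]]
    return [
        dict(zip(header, [cell.text_content().strip() for cell in row]))
        for row in rows[1:]
    ]


doc = lxml.html.fromstring(SAMPLE)
table = doc.xpath("//table[@class='table1']")[0]
records = table_to_dicts(table)
```

text_content() is used rather than .text so that cells containing nested tags (for example a link) still yield their text. Because the rows are collected with xpath, the number of rows does not matter.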

1 Comment

Described all changes in UPD, but the problem is still there

Your .xpath(xpath1) XPath expression failed to find any elements. Check that expression for errors.
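
One possible cause, which the asker has not confirmed: browsers insert <tbody> into the DOM even when the page source has none, so an XPath copied from FireBug can contain /tbody steps that lxml never sees, because lxml parses the raw source and does not add that element. A small sketch with made-up markup that shows the difference:

```python
import lxml.html

# Hypothetical page source written without <tbody>, as many pages are.
SAMPLE = "<html><body><table><tr><td>Smth2</td></tr></table></body></html>"
doc = lxml.html.fromstring(SAMPLE)

# Path as a browser tool would report it: no match in lxml's tree.
with_tbody = doc.xpath("/html/body/table/tbody/tr")

# Path that follows the raw source.
without_tbody = doc.xpath("/html/body/table/tr")

# "//tr" below the table works whether or not <tbody> is present.
either = doc.xpath("/html/body/table//tr")
```

If this is what is happening, dropping the /tbody steps from xpath1, or replacing them with //, should make .xpath(xpath1) return the table.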

3 Comments

Included XPath1 into description, checked it one more time with FireBug
Run print html.xpath(xpath1) in Python to test it, not in FireBug.
Described the situation in UPD2
