Python lxml XPath problem

Question

I'm trying to print or save the HTML of a particular element from a web page.
I got the element's XPath from Firebug.

All I want is to save this element to a file, but I can't get it to work.
(I tried the XPath both with and without /text() at the end.)

I would appreciate any help or past experience.
Thanks, David

import urllib2,StringIO
from lxml import etree

url='http://www.tutiempo.net/en/Climate/Londres_Heathrow_Airport/12-2009/37720.htm'
seite = urllib2.urlopen(url)
html = seite.read()
seite.close()
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)
xpath = "/html/body/table/tbody/tr/td[2]/div/table/tbody/tr[6]/td/table/tbody/tr/td[3]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/text()"
elem = tree.xpath(xpath)


print elem[0].strip().encode("utf-8")

This is a FAQ: Browsers add mandatory (X)HTML elements to the DOM (e.g. head and tbody). Don't trust Firebug; check the source document instead.
– user357812, Commented Mar 17, 2011 at 3:16
Possible duplicate of Problem with lxml xpath for html table extracting
– Yatharth Agarwal, Commented Jan 11, 2015 at 15:59

AndiDog · Accepted Answer · 2011-03-17 00:13:04Z

11

Your XPath is rather long; try shorter ones first and see whether they match. One likely problem is "tbody": browsers create it automatically in the DOM, but the HTML markup usually does not contain it.
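To make the tbody point concrete, here is a minimal sketch in Python 3 (the thread itself uses Python 2). The SOURCE markup and the strip_tbody helper are made up for illustration and are not part of lxml. It assumes the real page, like this sample, has no tbody in its raw source.

```python
import re
from lxml import etree

# Hypothetical markup standing in for the real page; like most raw HTML,
# it has no <tbody> in the source.
SOURCE = "<html><body><table><tr><td>cell text</td></tr></table></body></html>"


def strip_tbody(xpath):
    """Drop the tbody steps (plain or indexed) from a browser-copied XPath."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


root = etree.HTML(SOURCE)  # parsed with lxml's lenient HTML parser

browser_xpath = "/html/body/table/tbody/tr/td/text()"
source_xpath = strip_tbody(browser_xpath)

print(root.xpath(browser_xpath))  # no match: the parsed tree has no tbody
print(source_xpath)
print(root.xpath(source_xpath))
```

If the source document does contain a literal tbody, stripping it would break the path instead, so it is worth checking the raw HTML before applying something like this.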

Here's an example of how to use XPath results:

>>> from lxml import etree
>>> from StringIO import StringIO
>>> doc = etree.parse(StringIO("<html><body>a<something/>b</body></html>"), etree.HTMLParser())
>>> doc.xpath("/html/body/text()")
['a', 'b']

So you could just "".join(...) all text parts together if needed.
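As a rough Python 3 illustration of that join (the markup here is invented, with a br element splitting the text into two nodes):

```python
from lxml import etree

# Hypothetical snippet: the <br> splits the body text into two text nodes.
doc = etree.HTML("<html><body>a<br/>b</body></html>")

parts = doc.xpath("/html/body/text()")  # a list of string-like results
joined = "".join(parts)                 # one plain string

print(parts)
print(joined)
```

The items returned by text() behave like ordinary strings, so the usual strip() and join() calls apply to them.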


1 Comment

Trevor Over a year ago

It's not that it's too long. However, your tbody remark was absolutely correct. I wouldn't have thought of that in a million years. Many thanks!

Jim · Answer · 2011-03-17 00:07:14Z

0

Not sure I completely follow what you are trying to accomplish, but ultimately I think you are looking for:

print etree.tostring(elem[0])

