Missing some text when iterating over XML elements in Python

Question

I am running the following code in Python 2.7.3 on Mac OS X 10.6.8.

import StringIO
from lxml import etree

# Read the whole file into a string
with open('./foo', 'r') as f:
    doc = f.read()

tree = etree.parse(StringIO.StringIO(doc), etree.HTMLParser())
r = tree.xpath('//foo')
for i in r:
    for j in i.iter():
        print j.tag, j.text

And the file foo contains

<foo> AAA <bar> BBB </bar> XXX </foo>

The output is

foo AAA
bar BBB

Why am I not getting the text XXX? How do I access it?

Thanks

mzjn · Accepted Answer · 2012-09-13 18:43:14Z

7

Try this:

from lxml import etree

tree = etree.fromstring("<foo> AAA <bar> BBB </bar> XXX </foo>")
foos = tree.xpath('//foo')

for foo in foos:
    for j in foo.iter():
        print j.tag, j.text, j.tail

Output:

foo  AAA  None
bar  BBB   XXX

The tail attribute holds the text that directly follows the element's end tag, up to the next tag. Here, " XXX " is the tail of bar, not part of the text of foo.

tail is a peculiarity of lxml and ElementTree compared to other XML models, such as DOM. See http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html for more information.
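To make the text/tail split concrete, here is a minimal sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra installs; the text, tail, iter() and itertext() calls are the same in both. The names parts and full_text are only for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<foo> AAA <bar> BBB </bar> XXX </foo>")

# text: what sits between an element's start tag and its first child (or end tag)
# tail: what sits between an element's end tag and the next tag
parts = [(el.tag, el.text, el.tail) for el in root.iter()]
for tag, text, tail in parts:
    print(tag, repr(text), repr(tail))

# itertext() yields the text and tail values below root in document order,
# so joining them gives back all the character data inside <foo>.
full_text = "".join(root.itertext())
print(repr(full_text))
```

If you only need the complete text content and do not care which element it belongs to, itertext() is probably the shortest route.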

edited Sep 13, 2012 at 18:43

answered Sep 13, 2012 at 18:28

mzjn


1 Comment

APE Over a year ago

Thanks! That's an interesting quirk I wasn't aware of.

user2665694 · Accepted Answer · 2012-09-13 18:13:20Z

6

You also have to take node.tail into account (or check for it).
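As a rough sketch of what "checking for it" could look like: both text and tail can be None, so a small helper can skip missing or whitespace-only values. This assumes Python 3 and the standard library's xml.etree.ElementTree (lxml elements expose the same text and tail attributes); text_chunks is a hypothetical helper name, not part of either library.

```python
import xml.etree.ElementTree as ET

def text_chunks(node):
    """Yield the non-empty, stripped text pieces under node in document order."""
    # node.text may be None when the element has no leading text
    if node.text and node.text.strip():
        yield node.text.strip()
    for child in node:
        # Recurse into the child first, then emit what follows its end tag
        for chunk in text_chunks(child):
            yield chunk
        # child.tail may be None when nothing follows the child's end tag
        if child.tail and child.tail.strip():
            yield child.tail.strip()

foo = ET.fromstring("<foo> AAA <bar> BBB </bar> XXX </foo>")
chunks = list(text_chunks(foo))
print(chunks)
```

The tail is read from the child inside the parent's loop, so the tail of the node you start from is deliberately left out, since that text belongs to the enclosing element.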

answered Sep 13, 2012 at 18:13

user2665694
