Parsing XML using lxml, unable to get text when there is another child node

Question

I am parsing an XML file, downloaded from the internet, using lxml. Its structure is similar to this:

<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>

I want to print the text inside the nodes with the following piece of code:

from lxml import etree
doc = etree.parse('some.xml')
root = doc.getroot()
for ch in root:
    print(ch.text)

Output

Some text in A node
None

This does not print the text for <b>. Why? When I change the XML as shown below, with the text first and the child node after it, I get the correct output. Is this something to do with XML syntax or with lxml? Since I cannot control the XML, which is downloaded directly from the internet, I need a way to get the text from the original format.

<root>
    <a>Some text in A node</a>
    <b>Some text in B node<c>Some text in C node</c></b>
</root>

Output

Some text in A node
Some text in B node

falsetru · Accepted Answer · 2014-09-02 08:57:02Z

According to the lxml.etree._Element documentation:

The text property returns the text before the first subelement. This is either a string or the value None, if there was no text.
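
In the question's XML, the text of <b> comes after <c>, so lxml stores it on the child's tail attribute rather than on the parent's text. A minimal sketch of that behaviour, assuming the XML is parsed from an inline string instead of some.xml:

```python
from lxml import etree

# Same shape as the XML in the question: <c> comes before the text of <b>.
root = etree.fromstring(
    "<root>"
    "<a>Some text in A node</a>"
    "<b><c>Some text in C node</c>Some text in B node</b>"
    "</root>"
)

b = root.find("b")
c = b.find("c")

b_text = b.text  # None: nothing precedes the first child of <b>
c_tail = c.tail  # the text that follows </c>, still inside <b>

print(b_text)
print(c_tail)
```

So the text is not lost; it is attached to a different element than one might expect.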

To print the first text node directly inside each tag, wherever it is placed, try the following, which uses xpath to select the child text nodes:

for ch in root:
    # First direct text node of ch, or None if it has none
    print(next(iter(ch.xpath('text()')), None))

or:

# '*/text()' selects the text nodes directly inside each child of root
for ch in root.xpath('*/text()'):
    print(ch)
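
For reference, here is a self-contained sketch of the xpath approach applied to the question's XML. The helper name direct_text is hypothetical, and stripping and joining the text nodes with a space is an assumption about the desired output, not something lxml does on its own:

```python
from lxml import etree


def direct_text(element):
    """Join the text nodes that are direct children of element."""
    # 'text()' matches text before, between and after the child elements.
    parts = (t.strip() for t in element.xpath("text()"))
    return " ".join(p for p in parts if p)


root = etree.fromstring(
    "<root>\n"
    "    <a>Some text in A node</a>\n"
    "    <b><c>Some text in C node</c>Some text in B node</b>\n"
    "</root>"
)

results = [direct_text(ch) for ch in root]
for line in results:
    print(line)
```

Text inside nested elements such as <c> is deliberately left out; only text that belongs directly to each child of the root is collected.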

1 Comment

sk11 Over a year ago

Yes true, the documentation says it. Using xpath() gets the text no matter where it is placed. Thanks.
