I am parsing an XML file, downloaded from the internet, using lxml. Its structure is similar to this:

<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>

I want to print the text inside the nodes with the following piece of code:

from lxml import etree
doc = etree.parse('some.xml')
root = doc.getroot()
for ch in root:
    print(ch.text)

Output

Some text in A node
None

This does not print the text of <b>. Why? When I change the XML as shown below, so that the text comes first and the child node second, I get the expected output. Is this caused by the XML syntax or by lxml? I cannot control the XML because it is downloaded directly from the internet, so I need a way to get the text from the original layout.

<root>
    <a>Some text in A node</a>
    <b>Some text in B node<c>Some text in C node</c></b>
</root>

Output

Some text in A node
Some text in B node

1 Answer

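According to the lxml.etree._Element documentation:

The text property returns the text before the first subelement. This is either a string or the value None, if there was no text.

In your first document, <b> begins with the <c> element, so there is no text before the first subelement and the text property of <b> is None. The text that follows </c> is stored as the tail of <c>.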
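As a minimal sketch of the text/tail split (not part of the original answer), the snippet below parses the question's first document from an inline string. The fallback import is an assumption added so that the example also runs without lxml installed; the standard library's ElementTree uses the same text and tail attributes.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which shares the text/tail model.
    import xml.etree.ElementTree as etree

xml = (
    "<root>"
    "<a>Some text in A node</a>"
    "<b><c>Some text in C node</c>Some text in B node</b>"
    "</root>"
)
root = etree.fromstring(xml)
b = root.find("b")
c = b.find("c")

print(b.text)  # None: nothing precedes <c> inside <b>
print(c.text)  # Some text in C node
print(c.tail)  # Some text in B node: the text after </c> belongs to c.tail
```

So the text is not lost; it is attached to the preceding child element rather than to <b> itself.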

To print the first text node of each child, wherever it sits inside the tag, use XPath to select the element's child text nodes:

for ch in root:
    # First child text node of ch, or None if it has none
    print(next(iter(ch.xpath('text()')), None))

or:

# '*/text()' selects the text nodes that are direct children of root's children
for text in root.xpath('*/text()'):
    print(text)
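If you would rather avoid XPath, the same result can be built from text and tail directly. This is only a sketch: direct_text is a hypothetical helper name, not an lxml function, and the fallback import is an assumption so that the example runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which shares the text/tail model.
    import xml.etree.ElementTree as etree


def direct_text(elem):
    """Join the text that belongs directly to elem, skipping its descendants' text."""
    parts = [elem.text or ""]
    # Text that follows each child element is stored on that child's tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts).strip()


xml = """<root>
    <a>Some text in A node</a>
    <b><c>Some text in C node</c>Some text in B node</b>
</root>"""
root = etree.fromstring(xml)

texts = [direct_text(ch) for ch in root]
for t in texts:
    print(t)
# Some text in A node
# Some text in B node
```

Unlike taking only the first text node, this joins every piece of text that sits directly inside the element, which matters when text appears both before and after a child.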

1 Comment

Yes true, the documentation says it. Using xpath() gets the text no matter where it is placed. Thanks.
