
I am parsing a large XML file that essentially contains a table. The nodes in the XML don't always have names. Nested several tags deep is an HTML-like table whose rows (<TR>) contain <TD> cells with raw numeric data. Before I can iterate through the table, there is a block of metadata tags that I'm not interested in. For instance:

<?xml version="1.0" ?>
<soap:Envelope xmlns:soap="--ommitted--" xmlns:xsi="--ommitted--">
    <soap:Body>
        <FetchReportResponse xmlns="URL1">
            <FetchReportResult xmlns="URL2">
                <REPORT>
                    <TITLE>CROSS VISITING REPORT</TITLE>
                    <SUBTITLE/>
                    <SUMMARY>
                        <GEOGRAPHY>--ommitted--</GEOGRAPHY>
                        <LOCATION>--ommitted--</LOCATION>
                        <TIMEPERIOD>--ommitted--</TIMEPERIOD>
                        <TARGET>--ommitted--</TARGET>
                        <MEDIA>--ommitted--</MEDIA>
                        <DATE>--ommitted--</DATE>
                        <USER>--ommitted--</USER>
                    </SUMMARY>
                    <TABLE>
                        <THEAD>
                            <TR>
                              <TH>--ommitted--</TH>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>
                              <TD>--ommitted--</TD>

I am new to XML parsing, so I'm following the documentation. I have the following code to read an XML file and create an ElementTree object.

import xml.etree.ElementTree as ET

tree = ET.parse('./../filename.xml')
# get the root element (the SOAP Envelope) before searching from it
root = tree.getroot()
print(root)

This understandably prints the following:

<Element '{http://schemas.xmlsoap.org/soap/envelope/}Envelope' at 0x00000230CAC23318>

However, when I try to use the XPath convention to traverse it from here on, I'm unable to. For instance,

print(root.find("./Body"))

prints None, even though <Body> is clearly nested inside <Envelope>.

EDIT: Following Mark Tolonen's answer I was able to get to the Body tag, but how do I get beyond that? More specifically, I want to reach the <TABLE> tag.

2 Answers


In addition to the XPath section, you also need to pay attention to the Namespaces section of the documentation, since your XML contains several namespaces, some with a prefix and some without (the latter is known as a default namespace). Notice that the TABLE element inherits its namespace from the nearest ancestor that declares a default namespace: FetchReportResult. So to find TABLE you need to use the default namespace URI "URL2", either with the curly-brace syntax or with a prefix-to-URI dictionary:

ns = { "u2": "URL2" }
tables = root.findall(".//u2:TABLE", ns)
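
To show how the dictionary approach carries on down to the rows and cells, here is a minimal sketch. It runs against a trimmed stand-in for the question's document: "URL1" and "URL2" are the placeholder URIs from the question, while the SAMPLE string, the cell values, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the question's file; the cell values are invented.
SAMPLE = """<?xml version="1.0" ?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <FetchReportResponse xmlns="URL1">
      <FetchReportResult xmlns="URL2">
        <REPORT>
          <TABLE>
            <THEAD>
              <TR><TH>h</TH><TD>1</TD><TD>2</TD></TR>
            </THEAD>
          </TABLE>
        </REPORT>
      </FetchReportResult>
    </FetchReportResponse>
  </soap:Body>
</soap:Envelope>
"""

root = ET.fromstring(SAMPLE)

# Map a prefix of our own choosing to the default namespace URI.
ns = {"u2": "URL2"}

# TABLE inherits the default namespace declared on FetchReportResult.
table = root.find(".//u2:TABLE", ns)

# Collect the text of every TD, grouped by the TR that contains it.
rows = [
    [td.text for td in tr.findall("u2:TD", ns)]
    for tr in table.iterfind(".//u2:TR", ns)
]
print(rows)
```

TR and TD inherit the same default namespace as TABLE, so every step in the path needs the u2: prefix, not just the first one.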

You need the fully qualified name. Since the element is soap:Body, you want to qualify Body with the xmlns:soap value, which (inferred from your Envelope output) is:

print(root.find("./{http://schemas.xmlsoap.org/soap/envelope/}Body"))
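
The same curly-brace syntax can be chained step by step down to TABLE. The following is only a sketch under the question's placeholders: "URL1" and "URL2" stand in for the real default-namespace URIs, and the SAMPLE string and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"

# Minimal stand-in for the question's file, using its placeholder URIs.
SAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <FetchReportResponse xmlns="URL1">
      <FetchReportResult xmlns="URL2">
        <REPORT>
          <TABLE/>
        </REPORT>
      </FetchReportResult>
    </FetchReportResponse>
  </soap:Body>
</soap:Envelope>
"""

root = ET.fromstring(SAMPLE)

# Each step carries the namespace that is in effect for that element:
# the soap prefix for Body, URL1 for FetchReportResponse, and URL2 for
# FetchReportResult and everything nested inside it.
path = "/".join([
    ".",
    "{%s}Body" % SOAP,
    "{URL1}FetchReportResponse",
    "{URL2}FetchReportResult",
    "{URL2}REPORT",
    "{URL2}TABLE",
])
table = root.find(path)
print(table)
```

On Python 3.8 and later, ElementTree also accepts a {*} wildcard, so root.find(".//{*}TABLE") should match TABLE regardless of its namespace.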

2 Comments

thank you. And what if I want to get to FetchReportResponse and FetchReportResult?
@wrahool Tack on more path steps, e.g. /FetchReportResponse. Note that the remaining tags have no prefix but do sit under default namespaces (xmlns="URL1", xmlns="URL2"), so each step still needs its namespace URI in curly braces.
