Parsing XML data with nested tags in Python

Question

I am trying to parse XML data in the format shown below, using ElementTree:

<dataset>
    <title>Birds of Kafiristan</title>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <creator>
        <organizationName>Bird Conservation Alliance</organizationName>
        <address>
            <deliveryPoint>P.O. Box 999</deliveryPoint>
            <deliveryPoint>Mailstop 1234</deliveryPoint>
            <city>Washington</city>
            <administrativeArea>DC</administrativeArea>
            <postalCode>9999</postalCode>
            <country>USA</country>
        </address>
        <phone phonetype="voice">999-999-9999 x 123</phone>
        <phone phonetype="fax">999-999-9999</phone>
        <electronicMailAddress>[email protected]</electronicMailAddress>
        <onlineUrl>http://www.birds.org/</onlineUrl>
    </creator>
    <contact>
        <individualName>
            <givenName>Josiah</givenName>
            <surName>Harlan</surName>
        </individualName>
    </contact>
    <pubDate>2010</pubDate>
    <abstract>
        <para>This dataset contains the results of a bird survey from Kafiristan</para>
    </abstract>
    <keywordSet>
        <keyword>birds</keyword>
        <keyword>biodiversity</keyword>
        <keyword>animal ecology</keyword>
    </keywordSet>
    <distribution>
        <online>
            <url>http://birds.org/datasets</url>
        </online>
    </distribution>
</dataset>

(This is just a fragment of a much larger dataset that includes other tags, but it is enough to illustrate my question.)

I simply want to get the values of the elements for each tag, using code like:

from xml.etree import ElementTree as ET

rootElement = ET.parse("example.xml").getroot()

for subelement in rootElement:
    for subsub in subelement:
        print(subsub.tag, "-->", subsub.text)  #, subsub.attrib, subsub.items()
        for subsubsub in subsub:
            print(subsubsub.tag, "-->", subsubsub.text)

Running the code snippet above, I get the values of some elements, but not all. In particular, I cannot get the values of nested elements (such as "givenName" and "surName", which are nested inside "individualName", which in turn is nested inside "creator").

Any hints or tips?

As always, thanks in advance for any assistance you can provide!

Do you want to know the text associated with all tags in your document? Or are you looking at specific tags? In the latter case, Element.find or perhaps Element.iter might be helpful...
– mgilson, Commented Oct 13, 2014 at 23:50
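
To illustrate the comment above for the "specific tags" case, here is a minimal sketch using Element.find and Element.iter. The SAMPLE string is a trimmed copy of the question's XML, shortened here for brevity, and the variable names are illustrative only:

```python
from xml.etree import ElementTree as ET

# Trimmed copy of the question's XML (shortened for this sketch).
SAMPLE = """<dataset>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <contact>
        <individualName>
            <givenName>Josiah</givenName>
            <surName>Harlan</surName>
        </individualName>
    </contact>
</dataset>"""

root = ET.fromstring(SAMPLE)

# find() takes a path relative to the element and returns the first match (or None).
first_given = root.find("creator/individualName/givenName")
print(first_given.text)  # James

# iter(tag) yields every element with that tag, however deeply it is nested.
surnames = [el.text for el in root.iter("surName")]
print(surnames)  # ['Brooke', 'Harlan']
```

find() is convenient when the path to one element is known, while iter() avoids hard-coding the nesting depth.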

mgilson · Accepted Answer · 2014-10-13 23:58:36Z

It seems like a defaultdict might be useful here:

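import collections

# rootElement is the root obtained in the question via ET.parse("example.xml").getroot()
d = collections.defaultdict(list)
for element in rootElement.iter():  # iter() visits every element at any depth
    d[element.tag].append(element.text)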

This will give you a mapping from each tag to a list of the "text" values associated with that tag (one item for each element with that tag in the XML).
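
As a self-contained sketch of this approach, the snippet below runs it against a trimmed copy of the question's XML (shortened for brevity). Skipping whitespace-only text is an extra step not in the answer's snippet, added on the assumption that the indentation inside container elements such as "creator" is not wanted. The name texts_by_tag is illustrative:

```python
import collections
from xml.etree import ElementTree as ET

# Trimmed copy of the question's XML (shortened for this sketch).
SAMPLE = """
<dataset>
    <title>Birds of Kafiristan</title>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <contact>
        <individualName>
            <givenName>Josiah</givenName>
            <surName>Harlan</surName>
        </individualName>
    </contact>
    <keywordSet>
        <keyword>birds</keyword>
        <keyword>biodiversity</keyword>
    </keywordSet>
</dataset>
"""

root = ET.fromstring(SAMPLE.strip())

texts_by_tag = collections.defaultdict(list)
for element in root.iter():  # iter() walks the whole tree, at any depth
    text = (element.text or "").strip()
    if text:  # assumption: skip containers whose text is only whitespace
        texts_by_tag[element.tag].append(text)

print(texts_by_tag["givenName"])  # ['James', 'Josiah']
print(texts_by_tag["keyword"])    # ['birds', 'biodiversity']
```

Note that this mapping discards the parent context: the two "givenName" values end up in one list regardless of whether they came from "creator" or "contact".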


1 Comment

maurobio Over a year ago

This has been quite helpful! Why is such a great trick with a dictionary not mentioned in the ElementTree documentation?
