
I am trying to parse XML data in the format shown below, using ElementTree:

<dataset>
    <title>Birds of Kafiristan</title>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <creator>
        <organizationName>Bird Conservation Alliance</organizationName>
        <address>
            <deliveryPoint>P.O. Box 999</deliveryPoint>
            <deliveryPoint>Mailstop 1234</deliveryPoint>
            <city>Washington</city>
            <administrativeArea>DC</administrativeArea>
            <postalCode>9999</postalCode>
            <country>USA</country>
        </address>
        <phone phonetype="voice">999-999-9999 x 123</phone>
        <phone phonetype="fax">999-999-9999</phone>
        <electronicMailAddress>[email protected]</electronicMailAddress>
        <onlineUrl>http://www.birds.org/</onlineUrl>
    </creator>
    <contact>
        <individualName>
            <givenName>Josiah</givenName>
            <surName>Harlan</surName>
        </individualName>
    </contact>
    <pubDate>2010</pubDate>
    <abstract>
         <para>This dataset contains the results of a bird survey from Kafiristan</para>
    </abstract>
    <keywordSet>
         <keyword>birds</keyword>
         <keyword>biodiversity</keyword>
         <keyword>animal ecology</keyword>
    </keywordSet>
    <distribution>
        <online>
           <url>http://birds.org/datasets</url>
        </online>
    </distribution>
</dataset>

(Indeed this is just a fragment of a much larger dataset, which includes other tags, but it will suffice to ask my question.)

I simply want to get the text value of the element for each tag, using code like:

from xml.etree import ElementTree as ET

rootElement = ET.parse("example.xml").getroot()

for subelement in rootElement:
    for subsub in subelement:
        print(subsub.tag, "-->", subsub.text)  # , subsub.attrib, subsub.items()
        for subsubsub in subsub:
            print(subsubsub.tag, "-->", subsubsub.text)

Running the code snippet above, I get the values of some elements, but not all. In particular, I cannot get the values for nested elements (such as "givenName" and "surName", which are nested inside "individualName", which in turn is nested inside "creator").

Any hints or tips?

As always, thanks in advance for any assistance you can provide!

  • Do you want to know the text associated with all tags in your document? Or are you looking for specific tags? In the latter case, Element.find or perhaps Element.iter might be helpful... Commented Oct 13, 2014 at 23:50
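For the "specific tags" case raised in the comment above, here is a minimal sketch of how Element.find, Element.findall, Element.findtext and Element.iter might be used. The SAMPLE string is a hypothetical, trimmed-down copy of the XML in the question, parsed with ET.fromstring instead of reading example.xml, so the snippet runs on its own.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: a trimmed copy of the XML from the question.
SAMPLE = """<dataset>
    <title>Birds of Kafiristan</title>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <creator>
        <organizationName>Bird Conservation Alliance</organizationName>
    </creator>
</dataset>"""

root = ET.fromstring(SAMPLE)

# find() returns the first match for a path relative to the element, or None.
title = root.find("title").text
given = root.find("creator/individualName/givenName").text

# iter(tag) yields every element with that tag, however deeply it is nested.
surnames = [e.text for e in root.iter("surName")]

# findtext() returns the default instead of failing when the path is absent.
orgs = [c.findtext("organizationName", default="") for c in root.findall("creator")]

print(title, given, surnames, orgs)
```

Note that find() only follows the path it is given, while iter() descends through the whole subtree, which is why iter() reaches "surName" without spelling out its parents.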

1 Answer


It seems like a defaultdict might be useful here:

import collections

d = collections.defaultdict(list)
# iter() walks every element in the tree, at any nesting depth
for element in rootElement.iter():
    d[element.tag].append(element.text)

This will give you a mapping from each tag to a list of the "text" values associated with that tag (one item for each element with that tag in the XML).
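As a self-contained illustration of the approach above, here is a sketch that runs on a hypothetical, trimmed-down copy of the XML from the question. The helper name collect_text and the choice to strip and skip whitespace-only text are my own assumptions, not part of the answer: container elements such as "creator" otherwise contribute strings made up only of newlines and indentation.

```python
import collections
from xml.etree import ElementTree as ET

# Hypothetical sample: a trimmed copy of the XML from the question.
SAMPLE = """<dataset>
    <title>Birds of Kafiristan</title>
    <creator>
        <individualName>
            <givenName>James</givenName>
            <surName>Brooke</surName>
        </individualName>
    </creator>
    <contact>
        <individualName>
            <givenName>Josiah</givenName>
            <surName>Harlan</surName>
        </individualName>
    </contact>
    <keywordSet>
        <keyword>birds</keyword>
        <keyword>biodiversity</keyword>
    </keywordSet>
</dataset>"""


def collect_text(root):
    """Map each tag to the list of non-blank text values found under root."""
    values = collections.defaultdict(list)
    for element in root.iter():  # iter() walks the whole tree, at any depth
        text = (element.text or "").strip()
        if text:  # skip container elements whose text is only whitespace
            values[element.tag].append(text)
    return values


root = ET.fromstring(SAMPLE)
d = collect_text(root)
print(d["givenName"])  # ['James', 'Josiah']
print(d["keyword"])    # ['birds', 'biodiversity']
```

One trade-off to be aware of: grouping by tag alone loses the parent context, so the "givenName" of a creator and that of a contact end up in the same list.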


1 Comment

This has been quite helpful! Why is such a great trick with a dictionary not mentioned in the ElementTree documentation?
