Parsing XML with ElementTree

Question

I'm trying to search for tags and attributes in a string of XML using ElementTree. Here is the string:

'<?xml version="1.0" encoding="UTF-8" ?>\n<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">\n\t<status success="true" statusCode="2000"/>\n\t<readCalls>\n\t<classify id="thing">\n\t\t<classification textCoverage="0">\n\t\t\t<class className="Astronomy" p="0.333333"/>\n\t\t\t<class className="Biology" p="0.333333"/>\n\t\t\t<class className="Mathematics" p="0.333333"/>\n\t\t</classification>\n\t</classify>\n\t</readCalls>\n</uclassify>'

Prettified:

<?xml version="1.0" encoding="UTF-8" ?>
<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
    <status success="true" statusCode="2000"/>
    <readCalls>
        <classify id="thing">
            <classification textCoverage="0">
                <class className="Astronomy" p="0.333333"/>
                <class className="Biology" p="0.333333"/>
                <class className="Mathematics" p="0.333333"/>
            </classification>
        </classify>
    </readCalls>
</uclassify>

I used this code to turn the string (stored in the variable a) into a searchable XML tree:

>>> from xml.etree.ElementTree import fromstring, ElementTree
>>> tree = ElementTree(fromstring(a))

I thought that using tree.find('uclassify') would return that element/tag but it seems to return nothing. I also tried:

for i in tree.iter():
    print i

which prints something, but not what I want:

<Element '{http://api.uclassify.com/1/ResponseSchema}uclassify' at 0x1011ec410>
<Element '{http://api.uclassify.com/1/ResponseSchema}status' at 0x1011ec390>
<Element '{http://api.uclassify.com/1/ResponseSchema}readCalls' at 0x1011ec450>
<Element '{http://api.uclassify.com/1/ResponseSchema}classify' at 0x1011ec490>
<Element '{http://api.uclassify.com/1/ResponseSchema}classification' at 0x1011ec4d0>
<Element '{http://api.uclassify.com/1/ResponseSchema}class' at 0x1011ec510>
<Element '{http://api.uclassify.com/1/ResponseSchema}class' at 0x1011ec550>
<Element '{http://api.uclassify.com/1/ResponseSchema}class' at 0x1011ec590>

What's the easiest way to search for tags and attributes, such as in the BeautifulSoup module? For instance, how can I easily retrieve the className and p attributes for the class elements? I keep reading different things about lxml, xml.dom.minidom, and ElementTree, but I must be missing something because I can't seem to get what I want.

stderr · Accepted Answer · 2012-08-09 03:53:04Z

First of all, uclassify is the root node. If you keep the Element that fromstring returns (instead of wrapping it in ElementTree) and print it, you'll see:

>>> tree = fromstring(a)
>>> tree
<Element '{http://api.uclassify.com/1/ResponseSchema}uclassify' at 0x101f56410>

With a plain tag name, find only searches the current node's direct children, so tree.find can only reach the status and readCalls elements, never uclassify itself.

Finally, the XML namespace is prepended to every tag name in the form {uri}tag, so you need to extract the namespace URI from the root tag and use it to build the qualified tag names:

xmlns = tree.tag.split("}")[0][1:]
readCalls = tree.find('{%s}readCalls' % (xmlns,))
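To make the two points above concrete, here is a minimal sketch. It assumes a shortened stand-in for the response string (the variable a below is not the full string from the question) and calls the parsed root element root. It shows that a bare tag name finds nothing, that the qualified name finds a direct child, and that a grandchild needs either a step-by-step walk or a './/' descendant path.

```python
from xml.etree.ElementTree import fromstring

# Shortened stand-in for the response string in the question (assumption).
a = ('<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">'
     '<status success="true" statusCode="2000"/>'
     '<readCalls><classify id="thing"/></readCalls>'
     '</uclassify>')

root = fromstring(a)

# The root tag looks like '{uri}uclassify'; pull out the uri part.
xmlns = root.tag.split("}")[0][1:]

# A bare name does not match, because every tag carries the namespace.
bare = root.find('readCalls')

# The qualified name matches the direct child.
qualified = root.find('{%s}readCalls' % xmlns)

# classify is a grandchild, so a plain find on the root misses it...
grandchild = root.find('{%s}classify' % xmlns)

# ...but a './/' path searches all descendants.
descendant = root.find('.//{%s}classify' % xmlns)

print(bare is None, qualified is not None, grandchild is None)
print(descendant.get('id'))
```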

For example, to get the three class elements, walk down one level at a time (note that the tag is class, not classes):

classify = readCalls.find('{%s}classify' % (xmlns,))
classification = classify.find('{%s}classification' % (xmlns,))
classes = classification.findall('{%s}class' % (xmlns,))
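To answer the original question about reading the className and p attributes, one possible shortcut is to pass a prefix-to-URI mapping to findall and search all descendants with './/'. This is only a sketch: the prefix uc and the names ns, root, and results are arbitrary choices made for illustration, and the XML is the question's document without its declaration line.

```python
from xml.etree.ElementTree import fromstring

# The document from the question, minus the XML declaration.
a = """<uclassify xmlns="http://api.uclassify.com/1/ResponseSchema" version="1.01">
    <status success="true" statusCode="2000"/>
    <readCalls>
        <classify id="thing">
            <classification textCoverage="0">
                <class className="Astronomy" p="0.333333"/>
                <class className="Biology" p="0.333333"/>
                <class className="Mathematics" p="0.333333"/>
            </classification>
        </classify>
    </readCalls>
</uclassify>"""

# 'uc' is an arbitrary prefix chosen here; only the URI has to match the document.
ns = {'uc': 'http://api.uclassify.com/1/ResponseSchema'}

root = fromstring(a)

# './/uc:class' matches every class element at any depth below the root.
results = [(c.get('className'), float(c.get('p')))
           for c in root.findall('.//uc:class', ns)]

for name, p in results:
    print(name, p)
```

Element.get returns None for a missing attribute, so if some class elements might lack p, check for that before converting to float.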
