How can I parse an XML document into a Python object?

Question

I'm trying to consume an XML API, and I'd like to have some Python objects that represent the XML data. I have several XSDs and some example API responses from the documentation.

Here's one example XML response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common"
                         xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
                         xmlns:language="http://www.isan.org/schema/v1.11/common/language"
                         xmlns:country="http://www.isan.org/schema/v1.11/common/country">
    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:SerialHeaderId root="0000-0002-3B9F"/>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
            <title:Language>
                <language:LanguageLabel>French</language:LanguageLabel>
                <language:LanguageCode>
                    <language:CodingSystem>ISO639_2</language:CodingSystem>
                    <language:ISO639_2Code>FRE</language:ISO639_2Code>
                </language:LanguageCode>
            </title:Language>
            <title:TitleKind>ORIGINAL</title:TitleKind>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:TotalSeasons>0</serial:TotalSeasons>
    <serial:MinDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>45</common:TimeValue>
    </serial:MinDuration>
    <serial:MaxDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>144</common:TimeValue>
    </serial:MaxDuration>
    <serial:MinYear>2009</serial:MinYear>
    <serial:MaxYear>2009</serial:MaxYear>
    <serial:MainParticipantList>
        <participant:Participant>
            <participant:FirstName>Frédéric</participant:FirstName>
            <participant:LastName>Schoendoerffer</participant:LastName>
            <participant:RoleCode>DIR</participant:RoleCode>
        </participant:Participant>
        <participant:Participant>
            <participant:FirstName>Karole</participant:FirstName>
            <participant:LastName>Rocher</participant:LastName>
            <participant:RoleCode>ACT</participant:RoleCode>
        </participant:Participant>
    </serial:MainParticipantList>
    <serial:CompanyList>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>R.T.B.F.</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Capa Drama</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Marathon</common:CompanyName>
        </common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>

First I tried ignoring the XSD and running lxml.objectify on the XML returned by the API, but I ran into trouble with namespaces. Having to refer to every child node by its explicit namespace is a real pain and doesn't make for readable code.

from lxml import objectify
obj = objectify.fromstring(response)
print(obj.MainTitles.TitleDetail)
# This will fail to find the element because you need to specify the namespace
print(obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail'])
# Or something like that. I couldn't get it to work, and I'd much rather use attributes and not specify the namespace

So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.

I'm now trying pyxb and this seems much nicer so far. It's generating nicer definitions than generateDS (splitting them into multiple, reusable modules) but it won't parse the XML:

from models import serial
obj = serial.CreateFromDocument(response)

Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>

The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the pyxb source it seems that this error comes about "if the top-level element got processed as a DOM instance" but I don't know what this means or how to prevent it.

I've run out of steam exploring this and I don't know what to do next.

Danny Dircz · Accepted Answer · 2015-06-25 15:01:45Z

I have had a lot of luck parsing XML into Python using Beautiful Soup. It is straightforward to use and well documented. The project page is http://www.crummy.com/software/BeautifulSoup/ and the documentation is at http://www.crummy.com/software/BeautifulSoup/bs4/doc/

4 Comments

WilliamMayor Over a year ago

I've used BeautifulSoup before (but for HTML). I might give it a go. Are there any gotchas to be aware of between how it copes with XML over HTML?

Danny Dircz Over a year ago

Honestly it's pretty easy and does not require much code. Here's an XML example I found to get a basic idea: stackoverflow.com/questions/7785831/…

WilliamMayor Over a year ago

I played around very briefly but I couldn't get very far. bs4 doesn't understand namespaces (at all it seems), there's nothing in the docs about them. This means I can't use normal object access like soup.MainTitles, I have to use soup.find('serial:MainTitles'). This is slightly better than the lxml option, but not much :(

Danny Dircz Over a year ago

Can you please specify what exactly it is you want to parse out from the xml? (Specifically what headers you need to pull from) and I can throw down a quick example for you.

pabigot · Accepted Answer · 2015-06-26 17:34:22Z

UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.

The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.

Either there's an additional schema for that namespace, which you'll need to locate, that provides the elements, or the message you're receiving really doesn't validate. That may be because somebody is expecting an implicit mapping from a complex type to an element of that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.
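
To see this for yourself, you can compare the root element's expanded name with the global declarations in the schema. What follows is a minimal sketch using only the standard library's xml.etree.ElementTree, not anything PyXB provides. The XSD string is a hypothetical, heavily trimmed stand-in for the real serial schema (which is not reproduced here), the document is cut down to one child, and global_names is a helper name invented for the illustration.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# ASSUMPTION: hypothetical, trimmed stand-in for the real serial schema.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.isan.org/schema/v1.21/common/serial">
  <xs:complexType name="SerialHeaderType">
    <xs:sequence>
      <xs:element name="TotalEpisodes" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

# Cut-down version of the API response from the question.
DOC = """<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial">
  <serial:TotalEpisodes>11</serial:TotalEpisodes>
</serial:serialHeaderType>"""


def global_names(xsd_text, kind):
    """Return the {namespace}name of every top-level xs:<kind> declaration."""
    schema = ET.fromstring(xsd_text)
    target = schema.get("targetNamespace", "")
    # A bare tag in findall() matches direct children only, so nested
    # (local) declarations such as TotalEpisodes are not included.
    declared = schema.findall("{%s}%s" % (XS, kind))
    return set("{%s}%s" % (target, node.get("name")) for node in declared)


root_name = ET.fromstring(DOC).tag          # expanded name of the root element
elements = global_names(XSD, "element")     # top-level elements in the schema
types = global_names(XSD, "complexType")    # top-level complex types

print(root_name)
print(root_name in elements)
print(sorted(types))
```

With the stand-in schema, the set of global elements comes back empty while SerialHeaderType shows up among the complex types, which is the situation described above: there is a type, but no element declaration that a validating parser could match the root against.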

UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:

serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)

If you do that, you won't get the UnrecognizedDOMRootNodeError but you will get an IncompleteElementContentError at:

<common:status>
    <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
    <common:ISAN root="0000-0002-3B9F"/>
    <common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>

which provides the following details:

The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
    An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
    An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
    An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed

Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.

So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
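
If validation is off the table, one non-validating option is a thin wrapper that matches children by local name and ignores namespaces, which gives the attribute-style access asked for in the question. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than PyXB or lxml, the class name Node and its all/attr helpers are invented for the example, and the sample is a trimmed copy of the response above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


class Node(object):
    """Hypothetical read-only wrapper; children are matched by local name only.

    Child elements called 'text', 'all' or 'attr' would be shadowed by the
    members below and need the all() helper instead.
    """

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for child elements.
        if name.startswith("_"):
            raise AttributeError(name)
        for child in self._element:
            if local_name(child.tag) == name:
                return Node(child)  # first match wins
        raise AttributeError(name)

    def all(self, name):
        """Return every direct child with the given local name."""
        return [Node(c) for c in self._element if local_name(c.tag) == name]

    def attr(self, name):
        """Return an XML attribute value, or None if it is absent."""
        return self._element.get(name)

    @property
    def text(self):
        return (self._element.text or "").strip()


SAMPLE = """<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
    xmlns:title="http://www.isan.org/schema/v1.11/common/title"
    xmlns:participant="http://www.isan.org/schema/v1.11/common/participant">
  <serial:SerialHeaderId root="0000-0002-3B9F"/>
  <serial:MainTitles>
    <title:TitleDetail>
      <title:Title>Braquo</title:Title>
    </title:TitleDetail>
  </serial:MainTitles>
  <serial:MainParticipantList>
    <participant:Participant>
      <participant:LastName>Schoendoerffer</participant:LastName>
    </participant:Participant>
    <participant:Participant>
      <participant:LastName>Rocher</participant:LastName>
    </participant:Participant>
  </serial:MainParticipantList>
</serial:serialHeaderType>"""

obj = Node(ET.fromstring(SAMPLE))
print(obj.MainTitles.TitleDetail.Title.text)
print(obj.SerialHeaderId.attr("root"))
print([p.LastName.text for p in obj.MainParticipantList.all("Participant")])
```

The trade-off is that two children with the same local name in different namespaces become indistinguishable, and nothing checks the document against the schema, so values arrive as plain strings.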

Thanks pabigot, a frustrating situation but at least I won't waste any more time trying to validate!
