I'm trying to consume an XML API, and I'd like to have some Python objects that represent the XML data. I have several XSDs and some example API responses from the documentation.

Here's one example XML response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common"
                         xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
                         xmlns:language="http://www.isan.org/schema/v1.11/common/language"
                         xmlns:country="http://www.isan.org/schema/v1.11/common/country">
    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:SerialHeaderId root="0000-0002-3B9F"/>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
            <title:Language>
                <language:LanguageLabel>French</language:LanguageLabel>
                <language:LanguageCode>
                    <language:CodingSystem>ISO639_2</language:CodingSystem>
                    <language:ISO639_2Code>FRE</language:ISO639_2Code>
                </language:LanguageCode>
            </title:Language>
            <title:TitleKind>ORIGINAL</title:TitleKind>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:TotalSeasons>0</serial:TotalSeasons>
    <serial:MinDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>45</common:TimeValue>
    </serial:MinDuration>
    <serial:MaxDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>144</common:TimeValue>
    </serial:MaxDuration>
    <serial:MinYear>2009</serial:MinYear>
    <serial:MaxYear>2009</serial:MaxYear>
    <serial:MainParticipantList>
        <participant:Participant>
            <participant:FirstName>Frédéric</participant:FirstName>
            <participant:LastName>Schoendoerffer</participant:LastName>
            <participant:RoleCode>DIR</participant:RoleCode>
        </participant:Participant>
        <participant:Participant>
            <participant:FirstName>Karole</participant:FirstName>
            <participant:LastName>Rocher</participant:LastName>
            <participant:RoleCode>ACT</participant:RoleCode>
        </participant:Participant>
    </serial:MainParticipantList>
    <serial:CompanyList>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>R.T.B.F.</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Capa Drama</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Marathon</common:CompanyName>
        </common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>

I tried simply ignoring the XSDs and using lxml.objectify on the XML returned by the API, but I ran into a problem with namespaces. Having to refer to every child node with its explicit namespace is a real pain and doesn't make for readable code.

from lxml import objectify
obj = objectify.fromstring(response)
print obj.MainTitles.TitleDetail
# This will fail to find the element because you need to specify the namespace
print obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail']
# Or something like that, I couldn't get it to work, and I'd much rather use attributes and not specify the namespace

So then I tried generateDS to create some Python class definitions for me. I've lost the error messages that this attempt gave me but I couldn't get it to work. It would generate a module for each XSD that I gave it but it wouldn't parse the example XML.

I'm now trying PyXB, which seems much nicer so far. It generates cleaner definitions than generateDS (splitting them into multiple, reusable modules), but it won't parse the XML:

from models import serial
obj = serial.CreateFromDocument(response)

Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>

The unrecognised node is the <serial:serialHeaderType> node from the example. Looking at the PyXB source, it seems this error is raised "if the top-level element got processed as a DOM instance", but I don't know what that means or how to prevent it.

I've run out of steam exploring this, and I don't know what to do next.

2 Answers

I have had a lot of luck parsing XML into Python using Beautiful Soup. It is extremely straightforward, and the project provides pretty strong documentation. Check it out here: http://www.crummy.com/software/BeautifulSoup/ and http://www.crummy.com/software/BeautifulSoup/bs4/doc/

4 Comments

I've used BeautifulSoup before (but for HTML). I might give it a go. Are there any gotchas to be aware of in how it handles XML compared with HTML?
Honestly it's pretty easy and does not require much code. Here's an XML example I found to get a basic idea: stackoverflow.com/questions/7785831/…
I played around very briefly but I couldn't get very far. bs4 doesn't seem to understand namespaces at all, and there's nothing in the docs about them. This means I can't use normal object access like soup.MainTitles; I have to use soup.find('serial:MainTitles'). This is slightly better than the lxml option, but not much :(
Can you please specify what exactly it is you want to parse out from the xml? (Specifically what headers you need to pull from) and I can throw down a quick example for you.
1

UnrecognizedDOMRootNodeError indicates that PyXB could not locate the element in a namespace for which it has bindings registered. In your case it fails on the first element, which is {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.
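
To see the expanded name being looked up, one option is to inspect the root tag with the standard library's ElementTree, which reports tags in the same {namespace}local-name form. This is only a sketch: the `response` string below is a trimmed, hypothetical stand-in for the API body, and nothing here involves PyXB itself.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the API response (assumption); only the root matters here.
response = (
    '<serial:serialHeaderType '
    'xmlns:serial="http://www.isan.org/schema/v1.21/common/serial">'
    '<serial:TotalEpisodes>11</serial:TotalEpisodes>'
    '</serial:serialHeaderType>'
)

root = ET.fromstring(response)

# ElementTree stores qualified names as "{namespace-uri}local-name",
# so drop the leading "{" and split on the closing "}".
namespace, _, local_name = root.tag[1:].partition('}')
print(namespace)
print(local_name)
```

The namespace and local name printed here are the pair that would need a matching element binding for the document to be recognized.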

The schema for that namespace defines a complexType named SerialHeaderType but does not define an element with the name serialHeaderType. In fact it defines no top-level elements. So PyXB can't recognize it, and the XML does not validate.
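
One way to check this for a given XSD is to list what it declares directly under the schema root. The following is a rough sketch using only the standard library. The inline schema is a hypothetical fragment shaped like the situation described, not the real ISAN schema, and `top_level_names` is a helper name made up for illustration.

```python
import xml.etree.ElementTree as ET

XS = '{http://www.w3.org/2001/XMLSchema}'

# Hypothetical schema fragment: one complex type, no top-level element.
xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.isan.org/schema/v1.21/common/serial">
  <xs:complexType name="SerialHeaderType">
    <xs:sequence>
      <xs:element name="TotalEpisodes" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>'''


def top_level_names(xsd_text):
    """Return (element names, complexType names) declared directly under xs:schema."""
    schema = ET.fromstring(xsd_text)
    # findall() with a plain tag only matches direct children, so the
    # nested xs:element inside the complex type is not counted.
    elements = [e.get('name') for e in schema.findall(XS + 'element')]
    types = [t.get('name') for t in schema.findall(XS + 'complexType')]
    return elements, types


elements, types = top_level_names(xsd)
print('top-level elements: {}'.format(elements))
print('complex types: {}'.format(types))
```

An empty element list alongside a non-empty type list is the pattern described above: types exist, but nothing can legally appear as a document root.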

Either there's an additional schema for the namespace that you'll need to locate which provides elements, or the document you're parsing really doesn't validate. That may be because somebody's expecting an implicit mapping from a complex type to an element with that type, or because it's a fragment that would normally be found within some other element where that QName is a member element name.

UPDATE: You can hand-craft an element in that namespace by adding the following to the generated bindings in serial.py:

serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)

If you do that, you won't get the UnrecognizedDOMRootNodeError but you will get an IncompleteElementContentError at:

<common:status>
    <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
    <common:ISAN root="0000-0002-3B9F"/>
    <common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>

which provides the following details:

The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
    An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
    An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
    An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed

Reviewing the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing but required.

So it seems these documents are not meant to be validated, and PyXB is probably the wrong technology to use.
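
If validation is off the table, a non-validating wrapper may be enough to get attribute-style access without spelling out namespaces. The sketch below is an assumption-laden illustration, not part of PyXB or any library: `Node` is a hypothetical helper built on the standard library's ElementTree, and the `response` string is a trimmed version of the example document.

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical wrapper: exposes child elements by local name, ignoring namespaces."""

    def __init__(self, element):
        self._element = element

    @staticmethod
    def _local(tag):
        # "{uri}Name" -> "Name"
        return tag.rsplit('}', 1)[-1]

    def __getattr__(self, name):
        # Only called when normal lookup fails; never treat private names as children.
        if name.startswith('_'):
            raise AttributeError(name)
        matches = [Node(c) for c in self._element if self._local(c.tag) == name]
        if not matches:
            raise AttributeError(name)
        # A single match is returned directly, repeated children as a list.
        return matches[0] if len(matches) == 1 else matches

    @property
    def text(self):
        return self._element.text


# Trimmed version of the example response (assumption).
response = '''<serial:serialHeaderType
    xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
    xmlns:title="http://www.isan.org/schema/v1.11/common/title"
    xmlns:participant="http://www.isan.org/schema/v1.11/common/participant">
  <serial:MainTitles>
    <title:TitleDetail>
      <title:Title>Braquo</title:Title>
    </title:TitleDetail>
  </serial:MainTitles>
  <serial:MainParticipantList>
    <participant:Participant>
      <participant:LastName>Schoendoerffer</participant:LastName>
    </participant:Participant>
    <participant:Participant>
      <participant:LastName>Rocher</participant:LastName>
    </participant:Participant>
  </serial:MainParticipantList>
</serial:serialHeaderType>'''

obj = Node(ET.fromstring(response))
print(obj.MainTitles.TitleDetail.Title.text)
print([p.LastName.text for p in obj.MainParticipantList.Participant])
```

Two trade-offs are worth keeping in mind with this approach. Elements that share a local name across different namespaces are merged, and a child that appears once comes back as a single object while a repeated child comes back as a list, so calling code has to know which to expect.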

1 Comment

Thanks pabigot, a frustrating situation but at least I won't waste any more time trying to validate!
