
I am trying to parse a 10k-line XML file with Python. The file, an excerpt of which is shown below, describes some physical properties of a long list of elements. Each of these is described in the XML file by a "ble" element, and I already have Python classes written for each "ble" type I will encounter.

<header>
    <!--SlotModels-->
    <slotModels id="SpokeOptimusSlotModels" xml:base="spoke.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <slotModel id="spokeLEDP">
            <var id="g1" type="double"/>
            <ble id="DR0005" type="drift">
                <d id="l" type="double" unit="mm">240</d>
                <d id="r" type="double" unit="mm">20</d>
                <d id="ry" type="double" unit="mm">0</d>
            </ble>
            <ble id="QD0020" model="Quad310" type="quad">
                <d id="l" type="double" unit="mm">310</d>
                <d id="g" type="double" unit="T/m">g1</d>
                <d id="r" type="double" unit="mm">30</d>
            </ble>
        </slotModel>
        <slotModel id="spokeLwu">
            <var id="g1" type="double"/>
            <ble id="DR0010" type="drift">
                <d id="l" type="double" unit="mm">160</d>
                <d id="r" type="double" unit="mm">30</d>
                <d id="ry" type="double" unit="mm">0</d>
            </ble>
            <ble id="QD0020" model="Quad310" type="quad">
                <d id="l" type="double" unit="mm">310</d>
                <d id="g" type="double" unit="T/m">g1</d>
                <d id="r" type="double" unit="mm">30</d>
            </ble>
        </slotModel>
        <slotModel id="spokeCryomodule">
            <var id="xelmax1" type="double"/>
            <var id="rfpdeg1" type="double"/>
            <ble id="DR0010" type="drift">
                <d id="l" type="double" unit="mm">368.5</d>
                <d id="r" type="double" unit="mm">28</d>
                <d id="ry" type="double" unit="mm">0</d>
            </ble>
            <ble id="FM0020" model="spokeCavity" type="fieldMap">
                <d id="rfpdeg" type="double" unit="deg">rfpdeg1</d>
                <d id="xelmax" type="double" unit="unit">xelmax1</d>
                <d id="radiusmm" type="double" unit="mm">28</d>
                <d id="lengthmm" type="double" unit="mm">994</d>
                <d id="file" type="string" unit="unit">Spoke_F2F</d>
                <d id="scaleFactor" type="double" unit="unit">1.0</d>
            </ble>
        </slotModel>
    </slotModels>
    <slotModels id="medBetaSlotModels" xml:base="medBeta.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <slotModel id="medBetaLwu">
            <var id="g1" type="double"/>
            <ble id="DR0010" type="drift">
                <d id="l" type="double" unit="mm">256.2</d>
                <d id="r" type="double" unit="mm">50</d>
                <d id="ry" type="double" unit="mm">0</d>
            </ble>
            <ble id="QD0020" model="Quad410" type="quad">
                <d id="l" type="double" unit="mm">410</d>
                <d id="g" type="double" unit="T/m">g1</d>
                <d id="r" type="double" unit="mm">50</d>
            </ble>
        </slotModel>
        <slotModel id="medBetaCryomodule">
            <var id="xelmax1" type="double"/>
            <var id="rfpdeg1" type="double"/>
            <ble id="DR0010" type="drift">
                <d id="l" type="double" unit="mm">414.4</d>
                <d id="r" type="double" unit="mm">46.87</d>
                <d id="ry" type="double" unit="mm">0</d>
            </ble>
            <ble id="FM0020" model="medBetaCavity" type="fieldMap">
                <d id="rfpdeg" type="double" unit="deg">rfpdeg1</d>
                <d id="xelmax" type="double" unit="unit">xelmax1</d>
                <d id="radiusmm" type="double" unit="mm">46.87</d>
                <d id="lengthmm" type="double" unit="mm">1258.8</d>
                <d id="file" type="string" unit="unit">MB_F2F</d>
                <d id="scaleFactor" type="double" unit="unit">1.0</d>
            </ble>
        </slotModel>
    </slotModels>
    <cellModels id="SpokeOptimusCellModels" xml:base="spoke.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <cellModel id="spokeLEDPCell">
            <var id="g1" type="double"/>
            <var id="xelmax1" type="double"/>
            <var id="rfpdeg1" type="double"/>
            <slot id="slot010" model="spokeLEDP">
                <d id="g1" type="double">g1</d>
            </slot>
            <slot id="slot020" model="spokeCryomodule">
                <d id="xelmax1" type="double">xelmax1</d>
                <d id="rfpdeg1" type="double">rfpdeg1</d>
            </slot>
        </cellModel>
        <cellModel id="spokeCell">
            <var id="g1" type="double"/>
            <var id="xelmax1" type="double"/>
            <var id="rfpdeg1" type="double"/>
            <slot id="slot010" model="spokeLwu">
                <d id="g1" type="double">g1</d>
            </slot>
            <slot id="slot020" model="spokeCryomodule">
                <d id="xelmax1" type="double">xelmax1</d>
                <d id="rfpdeg1" type="double">rfpdeg1</d>
            </slot>
        </cellModel>
    </cellModels>
    <cellModels id="medBetaCellModels" xml:base="medBeta.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <cellModel id="medBetaCell">
            <var id="g1" type="double"/>
            <var id="xelmax1" type="double"/>
            <var id="rfpdeg1" type="double"/>
            <slot id="slot010" model="medBetaLwu">
                <d id="g1" type="double">g1</d>
            </slot>
            <slot id="slot020" model="medBetaCryomodule">
                <d id="xelmax1" type="double">xelmax1</d>
                <d id="rfpdeg1" type="double">rfpdeg1</d>
            </slot>
        </cellModel>
    </cellModels>
</header>
<linac>
    <section id="SPOK" rfHarmonic="1" xml:base="spoke.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <cell id="cell010" model="spokeLEDPCell">
            <d id="g1" type="double">5.23025</d>
            <d id="g2" type="double">-4.68975</d>
            <d id="xelmax1" type="double">0.868945</d>
            <d id="xelmax2" type="double">0.865525</d>
            <d id="rfpdeg1" type="double">-6.65943</d>
            <d id="rfpdeg2" type="double">4.03247</d>
        </cell>
        <cell id="cell020" model="spokeCell">
            <d id="g1" type="double">4.85226</d>
            <d id="g2" type="double">-4.77927</d>
            <d id="xelmax1" type="double">0.890626</d>
            <d id="xelmax2" type="double">0.891124</d>
            <d id="rfpdeg1" type="double">22.0298</d>
            <d id="rfpdeg2" type="double">31.6618</d>
        </cell>
        <cell id="cell030" model="spokeCell">
            <d id="g1" type="double">4.46164</d>
            <d id="g2" type="double">-4.45154</d>
            <d id="xelmax1" type="double">1</d>
            <d id="xelmax2" type="double">1</d>
            <d id="rfpdeg1" type="double">37.712</d>
            <d id="rfpdeg2" type="double">47.397</d>
        </cell>
    </section>
    <!--SPOKE-->
    <section id="MBL" rfHarmonic="2" xml:base="medBeta.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
        <cell id="cell010" model="medBetaCell">
            <d id="g1" type="double">3.39121</d>
            <d id="g2" type="double">-3.31534</d>
            <d id="xelmax1" type="double">0.44729</d>
            <d id="xelmax2" type="double">0.4477</d>
            <d id="xelmax3" type="double">0.453125</d>
            <d id="xelmax4" type="double">0.453307</d>
            <d id="rfpdeg1" type="double">55.8358</d>
            <d id="rfpdeg2" type="double">61.7858</d>
            <d id="rfpdeg3" type="double">66.1437</d>
            <d id="rfpdeg4" type="double">72.2867</d>
        </cell>
        <cell id="cell020" model="medBetaCell">
            <d id="g1" type="double">3.60124</d>
            <d id="g2" type="double">-3.64339</d>
            <d id="xelmax1" type="double">0.512886</d>
            <d id="xelmax2" type="double">0.512886</d>
            <d id="xelmax3" type="double">0.512886</d>
            <d id="xelmax4" type="double">0.512886</d>
            <d id="rfpdeg1" type="double">77.201</d>
            <d id="rfpdeg2" type="double">84.17</d>
            <d id="rfpdeg3" type="double">91.207</d>
            <d id="rfpdeg4" type="double">98.296</d>
        </cell>
    </section>
</linac>

Note the structure of the XML file: the "slots" and "cells", which are basically lists of ble's with a discernible pattern, are defined in the header, while the data is kept in "linac".

What I would like to do

I would like to scan "linac", and expand each of the cells into a list of fully instantiated objects of the classes I have written to represent each of the ble's.

To do this, I would like to parse the header in a way that returns functions that can be called for each of the slotModels or cellModels. In other words, I would like to automatically generate functions something like the following:

def spokeLEDP(g1):
    # Expand the "spokeLEDP" slotModel into its ble objects.
    bleList = []
    bleList.append(drift(l=240, r=20, ry=0))
    bleList.append(quad(l=310, g=g1, r=30))
    return bleList

def spokeLEDPCell(g1, xelmax1, rfpdeg1):
    # Expand the "spokeLEDPCell" cellModel; each entry is one slot's list.
    myList = []
    myList.append(spokeLEDP(g1))
    myList.append(spokeCryomodule(xelmax1, rfpdeg1))
    return myList

I realise that I would need to flatten the list properly, but I hope you get the idea.

My plan

The only way I can currently see to generate the functions dynamically from the header is a two-step process. First, use Python to create a text file with the necessary functions. Then import this into the working code.

This seems very clunky and unwieldy.

My question

Is there a way to do what I want to do in one step, without a lot of repetitive parsing of the "cell" and "slot" elements in the XML header?

Many thanks for reading this far, and for any help you can offer.

(Note: I have no control over the structure of the XML file.)

  • So basically you want an XML parser that converts XML data into a list of custom objects? Commented Jul 31, 2014 at 20:44
  • This looks interesting enough and may do what you're looking for. I've never tested it though. Commented Jul 31, 2014 at 20:46
  • It's not clear why you need a two-stage parse where the first stage dynamically generates functions. Why can't you parse this file 'traditionally' (e.g. traversing the XML tree) in a single pass? Commented Jul 31, 2014 at 21:05
  • @KronoS Thanks, yes that's what I'm looking for. I tested that function, but it didn't return anything sensible :( I might just have to do this the hard way. Commented Aug 1, 2014 at 14:26
  • @TomDalton In parsing the XML I am going to be returning many times to parse each of the slot/cell models, when all I am doing is updating one or two values in their structure. I was hoping that I could turn them into functions, and then call them each time, thus eliminating repetitive parsing of each slot/cell model. Commented Aug 1, 2014 at 14:28

1 Answer

You could go the DOM parser route, where you set up a handler for each node type: visit each node and run your custom code for it. This is a pretty common pattern.
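A minimal sketch of that handler pattern, using the standard library's xml.etree.ElementTree and one slotModel copied from the question. The Drift and Quad classes, the HANDLERS table, and the resolve and build_ble helpers are hypothetical stand-ins, not part of the asker's code; the sketch assumes each class accepts the "d" ids as keyword arguments.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the asker's existing ble classes.
class Drift:
    def __init__(self, l, r, ry):
        self.l, self.r, self.ry = l, r, ry

class Quad:
    def __init__(self, l, g, r):
        self.l, self.g, self.r = l, g, r

# Dispatch table: value of a <ble>'s "type" attribute -> class to instantiate.
HANDLERS = {"drift": Drift, "quad": Quad}

def resolve(d, variables):
    """Return the value of one <d> element: a variable, a float, or a string."""
    if d.text in variables:
        return variables[d.text]
    return float(d.text) if d.get("type") == "double" else d.text

def build_ble(ble, variables):
    """Instantiate the class registered for this <ble> element's type."""
    kwargs = {d.get("id"): resolve(d, variables) for d in ble.findall("d")}
    return HANDLERS[ble.get("type")](**kwargs)

SLOT_XML = """
<slotModel id="spokeLEDP">
    <var id="g1" type="double"/>
    <ble id="DR0005" type="drift">
        <d id="l" type="double" unit="mm">240</d>
        <d id="r" type="double" unit="mm">20</d>
        <d id="ry" type="double" unit="mm">0</d>
    </ble>
    <ble id="QD0020" model="Quad310" type="quad">
        <d id="l" type="double" unit="mm">310</d>
        <d id="g" type="double" unit="T/m">g1</d>
        <d id="r" type="double" unit="mm">30</d>
    </ble>
</slotModel>
"""

slot = ET.fromstring(SLOT_XML)
# Visit each <ble> node and let the dispatch table pick the handler.
bles = [build_ble(ble, {"g1": 5.23025}) for ble in slot.findall("ble")]
```

Supporting a new ble type then only means registering one more entry in the table; a type missing from the table raises a KeyError instead of being skipped silently.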

There are a couple of XML-to-object solutions for Python. None is 'awesome': some kind of work, others don't really work but are well intentioned (or very limited).

A fairly obscure but interesting package is https://github.com/scieloorg/porteira. An ambitious package that kind of works is http://pyxb.sourceforge.net/, where you can do 'data binding' based on a schema. It's the right idea, but like most Python XML packages it has a few rough edges. I ran into namespace issues (which, granted, are likely a corner case for most Python projects, less so if you're into super hardcore XML with XSDs and all that jazz).

Another solution is just to learn XPath and use it to query the structure, returning an object with properties or a dict. There are so many ways to skin the cat; hopefully I've given you some potential solutions.
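As a rough sketch of the XPath route, the limited XPath subset in xml.etree.ElementTree is enough to read each slotModel and cellModel once and wrap it in a closure, which gives callable-per-model behaviour without writing a module to disk. Everything here is an assumption for illustration: the helper names (template, fill, make_slot, make_cell), the trimmed XML, and the root wrapper element, which is added only because the excerpt has no single root. The closures return (type, kwargs) pairs rather than real ble objects.

```python
import xml.etree.ElementTree as ET

# Trimmed from the question's excerpt; <root> is an assumed wrapper element.
DOC = """
<root>
<header>
    <slotModels id="SpokeOptimusSlotModels">
        <slotModel id="spokeLEDP">
            <var id="g1" type="double"/>
            <ble id="DR0005" type="drift">
                <d id="l" type="double" unit="mm">240</d>
            </ble>
            <ble id="QD0020" model="Quad310" type="quad">
                <d id="l" type="double" unit="mm">310</d>
                <d id="g" type="double" unit="T/m">g1</d>
            </ble>
        </slotModel>
    </slotModels>
    <cellModels id="SpokeOptimusCellModels">
        <cellModel id="spokeLEDPCell">
            <var id="g1" type="double"/>
            <slot id="slot010" model="spokeLEDP">
                <d id="g1" type="double">g1</d>
            </slot>
        </cellModel>
    </cellModels>
</header>
<linac>
    <section id="SPOK">
        <cell id="cell010" model="spokeLEDPCell">
            <d id="g1" type="double">5.23025</d>
            <d id="g2" type="double">-4.68975</d>
        </cell>
    </section>
</linac>
</root>
"""

def template(parent):
    """Pre-extract (id, type, text) for each <d> child, so calls skip the tree."""
    return [(d.get("id"), d.get("type"), d.text) for d in parent.findall("d")]

def fill(tmpl, variables):
    """Turn a template into a dict, substituting variables where text names one."""
    out = {}
    for key, kind, text in tmpl:
        if text in variables:
            out[key] = variables[text]
        elif kind == "double":
            out[key] = float(text)
        else:
            out[key] = text
    return out

def make_slot(slot_model):
    """Read a <slotModel> once; return a function of its variables."""
    bles = [(b.get("type"), template(b)) for b in slot_model.findall("ble")]
    def slot(**variables):
        return [(kind, fill(tmpl, variables)) for kind, tmpl in bles]
    return slot

def make_cell(cell_model, slots):
    """Read a <cellModel> once; return a function yielding a flat ble list."""
    parts = [(slots[s.get("model")], template(s)) for s in cell_model.findall("slot")]
    def cell(**variables):
        result = []
        for slot_fn, tmpl in parts:
            result.extend(slot_fn(**fill(tmpl, variables)))  # extend flattens
        return result
    return cell

root = ET.fromstring(DOC)
slots = {m.get("id"): make_slot(m) for m in root.findall(".//slotModel")}
cells = {m.get("id"): make_cell(m, slots) for m in root.findall(".//cellModel")}

expanded = []
for cell in root.findall("./linac/section/cell"):
    args = {d.get("id"): float(d.text) for d in cell.findall("d")}
    expanded.extend(cells[cell.get("model")](**args))
```

Each (type, kwargs) pair in expanded could then be handed to the matching class, for example through a dispatch table keyed on the type string. Values that a cell supplies but its model never references, such as g2 here, are ignored; a variable the model references but the cell does not supply makes float() raise a ValueError.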
