Parsing blank XML tags with LXML and Python

Question

When parsing XML documents in the format of:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model>Camaro</Model>
</Car>

I use the following code:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Color'])  # Blue

This code does not work if a tag is empty, such as:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
</Car>

Using the same code as above:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Model'])  # KeyError

How would I parse this blank tag?

Charles Duffy · Accepted Answer · 2012-03-08 17:12:45Z

3

You're putting in a [text()] filter which explicitly asks only for elements which have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?

Leave that filter out, and select the child elements with * (a bare node() would also return the whitespace text nodes between the tags), and you'll get your Model element:

>>> import lxml.etree
>>> s='''
... <root>
...   <Car>
...     <Color>Blue</Color>
...     <Make>Chevy</Make>
...     <Model/>
...   </Car>
... </root>'''
>>> e = lxml.etree.fromstring(s)
>>> carData = e.xpath('Car/*')
>>> carData
[<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
>>> dict((child.tag, child.text) for child in carData)
{'Color': 'Blue', 'Make': 'Chevy', 'Model': None}
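
As a rough sketch of the same idea applied to several cars at once (one dict per Car, with empty tags kept as None), something like the following might work. The helper name parse_cars and the sample document are made up for illustration. The import falls back to the standard library's xml.etree.ElementTree, which shares findall() and iter() with lxml, only so the sketch also runs where lxml isn't installed.

```python
try:
    from lxml import etree  # preferred; this is what the question uses
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same findall()/iter() API

# Illustrative sample data, not taken from the original question.
SAMPLE = """<root>
  <Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>
  <Car><Color>Red</Color><Make>Ford</Make><Model>Mustang</Model></Car>
</root>"""


def parse_cars(xml_text):
    """Return one {tag: text} dict per <Car>; empty tags map to None."""
    root = etree.fromstring(xml_text)
    # '*' selects every child element, whether or not it has text.
    return [
        {child.tag: child.text for child in car.findall('*')}
        for car in root.iter('Car')
    ]


cars = parse_cars(SAMPLE)
print(cars[0]['Model'])  # None
print(cars[1]['Model'])  # Mustang
```

Building the dict per Car, rather than over one flat list of fields, keeps the fields of different cars from overwriting each other.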

That said -- if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead. It hands you elements as they are parsed, so if you clear each element once you are done with it, you avoid holding the full tree in memory. For large files that is much cheaper than building the whole tree and then iterating over it with XPath. (Think SAX, but without the insane and painful API.)

Implementing with iterparse could look like this:

import io
import lxml.etree

def get_cars(infile):
    in_car = False
    current_car = {}
    for (event, element) in lxml.etree.iterparse(infile, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'Car':
                in_car = True
                current_car = {}
            continue
        if not in_car: continue
        if element.tag == 'Car':
            in_car = False   # stop collecting until the next <Car> starts
            element.clear()  # free the finished subtree
            yield current_car
            continue
        current_car[element.tag] = element.text

for car in get_cars(infile = io.BytesIO(b'''<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>''')):
  print(car)

...it's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than could fit in memory.

edited Mar 8, 2012 at 17:12

answered Mar 8, 2012 at 15:32


2 Comments

lodkkx Over a year ago

Right now I am doing an element = etree.parse(xmlfile). How would iterparse change my existing code base?

Charles Duffy Over a year ago

@lodkkx Using iterparse looks something like: for (event_type, element) in lxml.etree.iterparse(xmlfile): ..., deciding what action to take with each element in turn (typically by inspecting its tag).
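
To make that concrete, here is a hedged sketch of what switching from etree.parse() to iterparse() could look like: listen only for 'end' events, act on each Car once it is complete, then clear it to release memory. iter_cars and the in-memory sample file are illustrative names, not part of the original code base, and the stdlib fallback import is there only so the sketch runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible iterparse()


def iter_cars(xmlfile):
    """Yield one {tag: text} dict per <Car> as soon as it has been parsed."""
    # An 'end' event fires once an element and all of its children are complete.
    for _event, element in etree.iterparse(xmlfile, events=('end',)):
        if element.tag != 'Car':
            continue
        car = {child.tag: child.text for child in element}
        element.clear()  # drop the children we no longer need
        yield car


# Stand-in for a real file on disk (assumption for the example).
xmlfile = io.BytesIO(
    b'<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>'
)
for car in iter_cars(xmlfile):
    print(car)
```

Code that previously ran XPath queries against the parsed tree would move into the loop body, handling one Car at a time.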

Rik Poggi · Accepted Answer · 2012-03-08 15:32:05Z

1

I don't know if there's a better solution built into lxml, but you could just use .get():

print(parsedCarData[0].get('Model', ''))
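
For reference, a small sketch of how .get() behaves on a dict shaped like the one in the question. The dict here is typed by hand, assuming the Model key was filtered out by the XPath query.

```python
# Hand-written stand-in for parsedCarData[0] when <Model/> has been filtered out.
car = {'Color': 'Blue', 'Make': 'Chevy'}

print(car.get('Model'))      # None: the default when no fallback is given
print(car.get('Model', ''))  # prints an empty line: the fallback is used
print(car.get('Color', ''))  # Blue: existing keys are returned unchanged

model = car.get('Model', '')
```

Note that the fallback only applies when the key is missing. If the key is present with the value None, car.get('Model') or '' is one way to still end up with an empty string.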

answered Mar 8, 2012 at 15:32


Eduardo Ivanec · Accepted Answer · 2012-03-08 16:10:03Z

0

I would catch the exception:

try:
    print(parsedCarData[0]['Model'])
except KeyError:
    print('No model specified')

Exceptions in Python aren't exceptional in the same sense as in other languages, where they are more strictly linked to error conditions. Instead they are frequently part of the normal usage of modules, by design. An iterator raises StopIteration to signal it has reached the end of the iteration, for example.

Edit: If you're sure only this item can be empty @CharlesDuffy has it right in that using get() is probably better. But in general I'd consider using exceptions for handling diverse exceptional output easily.
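
A minimal sketch of that trade-off, using hand-written dicts in place of parsedCarData[0] and a hypothetical describe() helper: one try/except covers several lookups at once, where .get() would need a default on every line.

```python
def describe(car):
    """Format a car, or report the first missing field (hypothetical helper)."""
    try:
        # Each lookup assumes the key exists; any miss lands in the same handler.
        return '{} {} {}'.format(car['Color'], car['Make'], car['Model'])
    except KeyError as missing:
        # missing.args[0] is the key that was not found.
        return 'No {} specified'.format(missing.args[0])


print(describe({'Color': 'Blue', 'Make': 'Chevy', 'Model': 'Camaro'}))  # Blue Chevy Camaro
print(describe({'Color': 'Blue', 'Make': 'Chevy'}))                     # No Model specified
```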

edited Mar 8, 2012 at 16:10

answered Mar 8, 2012 at 15:33


2 Comments

Charles Duffy Over a year ago

Using parsedCarData[0].get('Model') to avoid the exception (returning None in the not-found case) is both shorter and faster than raising and handling an exception... though I think this is silly when removing an unnecessary restriction from the XPath query would make this moot in the first place.

Eduardo Ivanec Over a year ago

@CharlesDuffy: that's true, but I think this approach has advantages. I usually use a try/except block to wrap several lines that each make assumptions about the input. In that case using exceptions seems more natural than changing every line. Also, often you'd have to handle None with an if anyway.

Marcin · Accepted Answer · 2012-03-08 15:33:11Z

-2

The solution: use a try/except block to catch the key error.

answered Mar 8, 2012 at 15:33


4 Comments

Charles Duffy Over a year ago

The error only happens in the first place because he's filtering out the elements with no text. Why catch it when you could avoid it, making your code shorter in the process?!

Marcin Over a year ago

@CharlesDuffy I assume OP has some reason for doing this, perhaps that he is using a data structure created elsewhere.

Charles Duffy Over a year ago

Yes, but the structure is fine -- it's the filter he's using to retrieve portions of the structure which isn't, and that part is clearly his code.

Marcin Over a year ago

@CharlesDuffy Right, but this way, he can explicitly handle that case in whatever way is appropriate.
