lxml xpath in python, how to handle missing tags?

Question

Suppose I want to parse the following XML with an lxml XPath expression:

<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>

which is a variation of what can be found at http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html

How can I parse the different elements so that, once zipped (in the sense of Python's zip or izip function), they give me the following?

[(520,14),(12,None)]

The missing max_count tag in the second packitem is what keeps me from getting what I want.

NiL · Accepted Answer · 2012-10-21 12:11:36Z

3

from lxml import etree

def lxml_empty_str(context, nodes):
    # Replace a missing (None) text with an empty string on each selected element.
    for node in nodes:
        node.text = node.text or ""
    return nodes

# Register the function in its own XPath function namespace.
ns = etree.FunctionNamespace('http://ns.qubic.tv/lxmlfunctions')
ns['lxml_empty_str'] = lxml_empty_str

namespaces = {'i': "http://ns.qubic.tv/2010/item",
              'f': "http://ns.qubic.tv/lxmlfunctions"}

# root is the already-parsed document, e.g. the result of etree.fromstring(...)
packitems_duration = root.xpath('f:lxml_empty_str(//i:pack/i:packitem/i:duration)/text()',
                                namespaces=namespaces)
packitems_max_count = root.xpath('f:lxml_empty_str(//i:pack/i:packitem/i:max_count)/text()',
                                 namespaces=namespaces)
packitems = list(zip(packitems_duration, packitems_max_count))

>>> packitems
[('520','14'), ('','23')]

http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html

answered Oct 21, 2012 at 12:11


unutbu · Accepted Answer · 2012-06-13 20:53:19Z

1

You could use xpath to find the packitems, then call xpath again (or findtext, as I do below) on each one to find the duration and max_count. Having to call xpath more than once may not be terribly speedy, but it works.

import lxml.etree as ET

content = '''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>
'''

def make_int(text):
    try:
        return int(text)
    except TypeError:
        return None

namespaces = {'ns' : 'http://ns.qubic.tv/2010/item'}
doc = ET.fromstring(content)
result = [tuple([make_int(elt.findtext(path, namespaces = namespaces))
                           for path in ('ns:duration', 'ns:max_count')])
          for elt in doc.xpath('//ns:packitem', namespaces = namespaces) ]
print(result)
# [(520, 14), (12, None)]
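
For comparison, here is a minimal sketch of the "call xpath again" variant mentioned above, where each packitem is queried with a relative XPath instead of findtext. The helper name first_int and the NS constant are made up for this illustration; the sample document is the one from the question.

```python
from lxml import etree

content = b'''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem>
        <duration>520</duration>
        <max_count>14</max_count>
    </packitem>
    <packitem>
        <duration>12</duration>
    </packitem>
</pack>
'''

NS = {'ns': 'http://ns.qubic.tv/2010/item'}

def first_int(elt, path):
    # Hypothetical helper: evaluate a relative XPath on one element and
    # return the first match as an int, or None when nothing matches.
    found = elt.xpath(path, namespaces=NS)
    return int(found[0]) if found else None

doc = etree.fromstring(content)
result = [(first_int(item, 'ns:duration/text()'),
           first_int(item, 'ns:max_count/text()'))
          for item in doc.xpath('//ns:packitem', namespaces=NS)]
print(result)
```

Because the inner query runs once per packitem, a missing max_count simply yields an empty result list for that item, which the helper maps to None, so the tuples stay aligned.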

An alternative approach would be to use a SAX parser. That might be a little faster, but it takes a bit more code and the speed difference may not be important if the XML is not huge.
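
A rough sketch of what the SAX route could look like, using the standard library's xml.sax rather than lxml. The class PackItemHandler, the function parse_packitems, and the FIELDS tuple are names invented for this example, and it assumes each field holds either an integer or nothing.

```python
import xml.sax
from io import BytesIO
from xml.sax.handler import ContentHandler, feature_namespaces

ITEM_NS = 'http://ns.qubic.tv/2010/item'
FIELDS = ('duration', 'max_count')

class PackItemHandler(ContentHandler):
    """Collect one (duration, max_count) tuple per packitem element."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._item = None     # dict for the packitem being read, if any
        self._field = None    # name of the field element being read, if any
        self._chunks = []     # text pieces of the current field

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != ITEM_NS:
            return
        if local == 'packitem':
            # Every field starts out as None, so a missing tag stays None.
            self._item = dict.fromkeys(FIELDS)
        elif local in FIELDS and self._item is not None:
            self._field = local
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces; keep them until the tag closes.
        if self._field is not None:
            self._chunks.append(content)

    def endElementNS(self, name, qname):
        uri, local = name
        if uri != ITEM_NS:
            return
        if local == self._field:
            text = ''.join(self._chunks).strip()
            self._item[local] = int(text) if text else None
            self._field = None
        elif local == 'packitem':
            self.result.append(tuple(self._item[f] for f in FIELDS))
            self._item = None

def parse_packitems(xml_bytes):
    handler = PackItemHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # enables the *NS callbacks
    parser.setContentHandler(handler)
    parser.parse(BytesIO(xml_bytes))
    return handler.result

sample = b'''<pack xmlns="http://ns.qubic.tv/2010/item">
    <packitem><duration>520</duration><max_count>14</max_count></packitem>
    <packitem><duration>12</duration></packitem>
</pack>'''
result = parse_packitems(sample)
print(result)
```

The handler never builds a tree, which is where any speed or memory advantage would come from, at the cost of tracking the parsing state by hand.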

edited Jun 13, 2012 at 20:53

answered Jun 13, 2012 at 20:46


1 Comment

NiL Over a year ago

Thank you so much for the time you spent studying my use case. I already had a solution similar to yours and was hoping for a pure XPath approach, if possible. Best regards.
