Python and libxml2: how to iterate over XML nodes with XPath

Question

I have a problem retrieving information from an XML tree.

My XML has this shape:

<?xml version="1.0"?>
<records xmlns="http://www.mysyte.com/foo">
  <record>
    <id>first</id>
    <name>john</name>
    <papers>
      <paper>john_1</paper>
      <paper>john_2</paper>
    </papers>
  </record>
  <record>
    <id>second</id>
    <name>mike</name>
    <papers>
      <paper>mike_a</paper>
      <paper>mike_b</paper>
    </papers>
  </record>
  <record>
    <id>third</id>
    <name>albert</name>
    <papers>
      <paper>paper of al</paper>
      <paper>other paper</paper>
    </papers>
  </record>
</records>

What I want to do is extract tuples of data like the following:

[{'code': 'first', 'name': 'john'}, 
 {'code': 'second', 'name': 'mike'}, 
 {'code': 'third', 'name': 'albert'}]

So far I have written this Python code:

import libxml2

try:
  doc = libxml2.parseDoc(xml)
except (libxml2.parserError, TypeError):
  print("Problems loading XML")

ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")

record_nodes = ctxt.xpathEval('/pre:records/pre:record')

ret_list = []
for record_node in record_nodes:
  id = record_node.xpathEval('id')[0].content
  name = record_node.xpathEval('name')[0].content
  ret_list.append({'code': id, 'name': name})

My problem is that I don't get any results, and I have the impression that I'm doing something wrong with the XPath when I iterate over the nodes.

I also tried these XPath expressions for the id and the name:

/id
/name
/record/id
/record/name
/pre:id
/pre:name

and so on, but without any result (BTW, if I use the prefix in the sub-queries I get an error).

Any idea?

mzjn · Accepted Answer · 2010-07-31 20:34:06Z

7

Here is a suggestion. Note the setContextNode() method:

import libxml2

xml = "test.xml"
doc = libxml2.parseFile(xml)

ctxt = doc.xpathNewContext()
ctxt.xpathRegisterNs("pre", "http://www.mysyte.com/foo")

ret_list = []
record_nodes = ctxt.xpathEval('/pre:records/pre:record')

for node in record_nodes:
    # Make this record the context node so relative paths start from it.
    ctxt.setContextNode(node)
    _id = ctxt.xpathEval('pre:id')[0].content
    name = ctxt.xpathEval('pre:name')[0].content
    ret_list.append({'code': _id, 'name': name})

print(ret_list)
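
For anyone wondering why the unprefixed queries in the question match nothing: in XPath 1.0 a name without a prefix only matches elements in no namespace, while every element in this document inherits the default namespace http://www.mysyte.com/foo. Below is a small sketch of that behaviour. It uses the standard library's xml.etree.ElementTree instead of libxml2 (my assumption, chosen only so the snippet runs without extra packages), and the sample document is trimmed to a single record.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document: one record, default namespace.
XML = """<records xmlns="http://www.mysyte.com/foo">
  <record>
    <id>first</id>
    <name>john</name>
  </record>
</records>"""

# The prefix name is arbitrary; only the URI has to match the document.
NS = {"pre": "http://www.mysyte.com/foo"}

root = ET.fromstring(XML)
record = root.find("pre:record", NS)

# Unprefixed name: looks for an <id> in *no* namespace, so nothing matches.
unqualified = record.find("id")

# Prefixed name, relative to the current record: matches the namespaced <id>.
qualified = record.find("pre:id", NS)

print(unqualified)
print(qualified.text)
```

As far as I can tell, record_node.xpathEval(...) in the libxml2 bindings evaluates in a fresh context that has no prefixes registered, which would explain the error when the prefix was used in the sub-queries. Calling setContextNode() sidesteps that by reusing the context where "pre" is already registered.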


1 Comment

Giovanni Di Milia Over a year ago

Sorry! I forgot to mark this answer as the accepted one! It works exactly the way I want. Thanks!

Dimitre Novatchev · Answer · 2010-07-30 13:05:47Z

0

You can select all the elements you need with a single XPath expression:

/pre:records/pre:record/*[self::pre:id or self::pre:name]

Then just process the selected nodes in Python.


4 Comments

Giovanni Di Milia Over a year ago

Sorry, but this doesn't answer my question.

Dimitre Novatchev Over a year ago

@Giovanni-Di-Milia: This answers the XPath part -- I don't know Python. Having selected all nodes you want, you should be able to process them in Python and to produce the wanted result.

Andre Holzner Over a year ago

Does this guarantee any order in which the nodes are returned? If not, this would add some complication on the Python side in order to keep track of which id belongs to which name.

Dimitre Novatchev Over a year ago

@Andre-Holzner: All XPath engines I know return the selected nodes in document order, and libxml is no exception to this rule.

unutbu · Answer · 2010-07-29 19:25 (edited by Community, 2017-05-23 10:27:43Z)

0

If it is possible to switch to lxml, here is one way it could be done:

import lxml.etree as le

# content is assumed to hold the XML document shown in the question
root = le.XML(content)
result = []
namespaces = {'pre': 'http://www.mysyte.com/foo'}
for record in root:
    id = record.xpath('pre:id', namespaces=namespaces)[0]
    name = record.xpath('pre:name', namespaces=namespaces)[0]
    result.append({'code': id.text, 'name': name.text})
print(result)
# [{'code': 'first', 'name': 'john'}, {'code': 'second', 'name': 'mike'}, {'code': 'third', 'name': 'albert'}]

Building off of Dimitre Novatchev's XPath expression, you could do this:

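# ctxt is the libxml2 XPath context from the question, with the "pre" prefix registered
id_name_nodes = iter(ctxt.xpathEval('/pre:records/pre:record/*[self::pre:id or self::pre:name]'))

ret_list = []
# Passing the same iterator twice makes zip consume the nodes in consecutive pairs.
for id, name in zip(id_name_nodes, id_name_nodes):
    ret_list.append({'code': id.content, 'name': name.content})
print(ret_list)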

This libxml2 code relies on every record having both an id and a name. If an id or name is missing, ret_list will pair the wrong id and name, failing silently. Under the same circumstances, the lxml code would raise an error.
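
To make the silent mispairing concrete, here is a sketch in plain Python. It stands in for the libxml2 nodes with (tag, text) tuples, roughly what node.name and node.content would give you; the helper names pair_blindly and group_by_id are made up for illustration. The second record deliberately has no name.

```python
def pair_blindly(nodes):
    """Pair consecutive items the way zip(it, it) does.

    Only correct if ids and names strictly alternate.
    """
    it = iter(nodes)
    return [{"code": a[1], "name": b[1]} for a, b in zip(it, it)]


def group_by_id(nodes):
    """Start a new record at every 'id'; a missing name stays None."""
    records = []
    for tag, text in nodes:
        if tag == "id":
            records.append({"code": text, "name": None})
        elif tag == "name" and records:
            records[-1]["name"] = text
    return records


# (tag, text) pairs in document order; the second record has no <name>.
nodes = [("id", "first"), ("name", "john"),
         ("id", "second"),
         ("id", "third"), ("name", "albert")]

blind = pair_blindly(nodes)
grouped = group_by_id(nodes)

print(blind)    # 'second' ends up paired with 'third', and albert is dropped
print(grouped)  # 'second' keeps name None, 'third' keeps 'albert'
```

Note that group_by_id only copes with a missing name. A record without an id would still attach its name to the previous record, so iterating record by record is probably the more robust choice when the data can be incomplete.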


answered Jul 29, 2010 at 19:25

unutbu


2 Comments

Giovanni Di Milia Over a year ago

I'm using libxml2 everywhere and would like to keep using it in this case too. Thanks for your answer, though!

Tim McNamara Over a year ago

lxml also uses the libxml2 library (& libxslt). It's basically a layer on top to make tricky things like this easy.

Ilya Kharlamov · Answer · 2011-08-17 16:01:53Z

-1

For some reason libxslt lacks this kind of namespace support, but we can pre-parse the XML file, read the namespace declarations from it, and register them before evaluating the XPath expression:

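import re
import libxml2

def xpath(xml, xpathexpression):
    # Read the raw text so the namespace declarations can be found with a regex.
    with open(xml) as f:
        fcontent = f.read()

    doc = libxml2.parseFile(xml)
    xp = doc.xpathNewContext()
    # Register every declared namespace; the default one gets the prefix "default".
    for nsdeclaration in re.findall(r'xmlns:?\w*="[^"]*"', fcontent):
        m = re.match(r'xmlns:(\w+)=.*', nsdeclaration)
        if m:
            ns = m.group(1)
        else:
            ns = "default"
        url = nsdeclaration[nsdeclaration.find('"')+1:nsdeclaration.rfind('"')]
        xp.xpathRegisterNs(ns, url)
    a = xp.xpathEval(xpathexpression)
    # Return the text content of the first match, or "" if nothing matched.
    if len(a):
        return a[0].content
    return ""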
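
The namespace-scanning step can be tried on its own, without libxml2. This is only a sketch: find_namespaces is a hypothetical helper name, the second namespace URI in the sample is invented, and a regex scan like this is fragile (it ignores single-quoted values and would also pick up declarations inside comments or CDATA).

```python
import re

# Matches xmlns="..." and xmlns:prefix="..." (double-quoted values only).
NS_DECL = re.compile(r'xmlns(?::(\w+))?="([^"]*)"')


def find_namespaces(text, default_prefix="default"):
    """Return {prefix: uri} for every namespace declaration found in text.

    An unprefixed (default) declaration is stored under default_prefix.
    """
    return {prefix or default_prefix: uri
            for prefix, uri in NS_DECL.findall(text)}


# The second URI is made up for the example.
sample = ('<records xmlns="http://www.mysyte.com/foo" '
          'xmlns:x="http://www.example.com/x">')
namespaces = find_namespaces(sample)
print(namespaces)
```

With the question's document this would register the default namespace under the prefix "default", so the query would be written as /default:records/default:record.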


1 Comment

Giovanni Di Milia Over a year ago

I don't think this answers the question or adds anything to what has already been written.
