
I'm trying to process an XML file with Python and xml.etree.ElementTree, and I'm having a problem with multiple "hierarchical" default namespaces. I need to change the text of some of the nodes, then save the file in the identical format.

Maybe an example file will help make it clear...

This is what my code looks like:

from xml.etree import ElementTree

ElementTree.register_namespace('pplv', 'whatever')
ElementTree.register_namespace('', 'blah') # Register the default namespace
parse_tree = ElementTree.parse(infile)

for node in parse_tree.iter():
    if node.tag == '...':
        node.text = '...'
    # ... further tag checks elided ...

# Write once, after the loop has finished
parse_tree.write(outfile)

This is what my source file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<pplv:PPLVDocument xmlns:pplv="whatever">
  <pplv:node1>...</pplv:node1>
  <pplv:node2>...</pplv:node2>
  <pplv:node3 xmlns="blah">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node3>
  <pplv:node4 xmlns="blah2">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node4>
  <pplv:node5 xmlns="blah3">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node5>
</pplv:PPLVDocument>

When I parse it with ElementTree, register the namespaces, and write it back out, I get:

<?xml version="1.0" encoding="UTF-8"?>
<pplv:PPLVDocument xmlns:pplv="whatever" xmlns="blah" xmlns:ns0="blah2" xmlns:ns1="blah3">
  <pplv:node1>...</pplv:node1>
  <pplv:node2>...</pplv:node2>
  <pplv:node3>
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node3>
  <pplv:node4>
    <ns0:node1>...</ns0:node1>
    <ns0:node2>...</ns0:node2>
  </pplv:node4>
  <pplv:node5>
    <ns1:node1>...</ns1:node1>
    <ns1:node2>...</ns1:node2>
  </pplv:node5>
</pplv:PPLVDocument>

As you can see, all the namespace definitions have been "rolled up" into the root node. In my original document, the default namespace keeps getting redefined ("blah", "blah2", "blah3"). I can register a single default namespace ("blah"), but the source document defines multiple default namespaces at different points, and ElementTree doesn't seem to have a way of letting me save the altered file in this "shape".
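To illustrate what seems to be happening, here is a minimal sketch using a cut-down copy of the file above (the inline `SOURCE` string is a stand-in for my real file). After parsing, ElementTree appears to keep only the expanded `{uri}localname` form of each tag and drops the `xmlns` declarations themselves, so the writer has no record of where each default namespace was declared and emits every declaration on the root element.

```python
from xml.etree import ElementTree

# Cut-down stand-in for the real source file (assumption: same structure).
SOURCE = """<pplv:PPLVDocument xmlns:pplv="whatever">
  <pplv:node3 xmlns="blah">
    <node1>...</node1>
  </pplv:node3>
  <pplv:node4 xmlns="blah2">
    <node1>...</node1>
  </pplv:node4>
</pplv:PPLVDocument>"""

ElementTree.register_namespace('pplv', 'whatever')
root = ElementTree.fromstring(SOURCE)

# Tags are stored in expanded form, e.g. '{blah}node1'; prefixes are gone.
tags = [node.tag for node in root.iter()]
print(tags)

# The xmlns="blah" declaration is not kept as an attribute on pplv:node3.
print(root[0].attrib)

# On output, all namespace declarations are collected on the root element.
output = ElementTree.tostring(root, encoding='unicode')
print(output.splitlines()[0])
```

The first line of the printed output carries all three namespace URIs, which matches the "rolled up" result above.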

As you can probably guess, the (off-the-shelf) code that consumes these files won't accept the files I'm creating, but works with the original file structure just fine.

Happy to switch to lxml if that's going to give me a way to resolve this; I just need a fix!

Thanks in advance

1 Answer


Using lxml:

>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> root = etree.parse('in.xml', parser)
>>> root.xpath('//pplv:node2/text()', namespaces={'pplv': 'whatever'})
['...']
>>> root.write('out.xml', pretty_print=True)

$ cat out.xml 
<pplv:PPLVDocument xmlns:pplv="whatever">
  <pplv:node1>...</pplv:node1>
  <pplv:node2>...</pplv:node2>
  <pplv:node3 xmlns="blah">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node3>
  <pplv:node4 xmlns="blah2">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node4>
  <pplv:node5 xmlns="blah3">
    <node1>...</node1>
    <node2>...</node2>
  </pplv:node5>
</pplv:PPLVDocument>
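
Since the question is about editing node text before saving, here is a rough sketch of that step with lxml, assuming the same structure as the file above. The inline `SOURCE` bytes and the replacement value `'changed'` are placeholders for illustration only. Tags are matched in `{uri}localname` form, and the namespace declarations should stay on the elements where they were originally declared.

```python
from lxml import etree

# Cut-down stand-in for in.xml (assumption: same structure as the question).
SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<pplv:PPLVDocument xmlns:pplv="whatever">
  <pplv:node3 xmlns="blah">
    <node1>...</node1>
  </pplv:node3>
  <pplv:node4 xmlns="blah2">
    <node1>...</node1>
  </pplv:node4>
</pplv:PPLVDocument>
"""

root = etree.fromstring(SOURCE)

# Select node1 in the "blah2" default namespace only.
for node in root.iter('{blah2}node1'):
    node.text = 'changed'  # hypothetical replacement value

# Serialize with an XML declaration; xmlns declarations stay where they were.
result = etree.tostring(root, xml_declaration=True, encoding='UTF-8').decode('UTF-8')
print(result)
```

When working from a file, `etree.parse('in.xml')` followed by `tree.write('out.xml', xml_declaration=True, encoding='UTF-8')` should do the same thing.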