I would like to edit the below XML as follows: The 'duplicateAndAddOne' block should be duplicated (name changed to 'newElements') and have all items in it incremented by one. Ideally the elements in it should not have to be read individually, but it should be done as a batch, as there will be many items.

<?xml version="1.0"?>
<data>
    <Strategy name="duplicateAndAddOne">
        <datapoint1>7</datapoint1>
        <datapoint2>9</datapoint2>
    </Strategy>
    <Strategy name="leaveMeAlone">
        <datapoint1>22</datapoint1>
        <datapoint2>23</datapoint2>
    </Strategy>
</data>

1 Answer

This seems to depend on whether you are using the built-in ElementTree, or lxml.

With lxml, you should be able to use copy:

from lxml import etree
e = etree.Element('root')
etree.SubElement(e, 'child1')
etree.SubElement(e, 'child2')

from copy import copy
f = copy(e)
f[0].tag = 'newchild1'
etree.dump(e)
# <root>
#   <child1/>
#   <child2/>
# </root>

etree.dump(f)
# <root>
#   <newchild1/>
#   <child2/>
# </root>

You can see that the new tree is actually separate from the old one: renaming a child of f left e untouched. This is because each lxml element stores a reference to its parent, so an element can't belong to two trees at once, and copying has to create new elements for every child.

ElementTree doesn't keep the parent in the element, and so it's possible for the same element to coexist in several trees at once. That makes its shallow copies risky: copy and element.copy() copy the node, but then connect it to the children from the original node. So changes to the children of the copy will change the original, which is not what you want. copy.deepcopy is the exception: in current Python 3 releases it copies the whole subtree, so it is worth trying first.

A simple way that I've found to work reliably is to serialize to a string, and then deserialize it again. This forces completely new elements to be created. It is pretty slow, but it also always works. Compare the following methods:

import xml.etree.ElementTree as etree
from copy import copy

e = etree.Element('root')
etree.SubElement(e, 'child1')
etree.SubElement(e, 'child2')

# f = copy(e)
# f[0].tag = 'newchild1'
# If you do the above, the first child of e will also be 'newchild1',
# because the shallow copy shares its children with e.
# So that won't work.

# Simple method, slow but complete
In [194]: %timeit f = etree.fromstring(etree.tostring(e))
10000 loops, best of 3: 71.8 µs per loop

# Faster method, but you must implement the recursion - this only
# looks at a single level. (Originally timed with x.copy(), which
# newer Python versions no longer provide; copy(x) is the equivalent.)
In [195]: %%timeit
   .....: f = etree.Element('newparent')
   .....: f.extend([copy(x) for x in e])
   .....:
100000 loops, best of 3: 9.49 µs per loop

This bottom method does create copies of the first-level children, and it is a lot faster than the first version. However, this only works for a single level of nesting; if any of these had children, you'd have to go down and copy those yourself as well. You may be able to write a recursive copy, and it might be faster; the places where I've done this haven't been performance-sensitive so I haven't bothered in my code. The tostring/fromstring routine is fairly inefficient, but straightforward, and always works no matter how deep the tree is.
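
The recursive copy mentioned above might look something like the following. This is only a minimal sketch, not something I have benchmarked: copy_tree is a name made up for illustration, and the sample XML is invented. It also shows copy.deepcopy from the standard library for comparison, which in current Python 3 gives the same kind of independent subtree.

```python
import copy
import xml.etree.ElementTree as ET


def copy_tree(elem):
    """Hypothetical helper: rebuild elem and all its descendants as new Elements."""
    new = ET.Element(elem.tag, dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(copy_tree(child))
    return new


# Invented sample data with two levels of nesting.
root = ET.fromstring("<root><child1><leaf>1</leaf></child1><child2/></root>")

dup = copy_tree(root)
dup[0][0].text = "2"          # edit a nested node in the copy only

deep = copy.deepcopy(root)    # standard-library alternative
deep[0][0].text = "3"

print(root[0][0].text)        # the original leaf keeps its own text
print(dup[0][0].text, deep[0][0].text)
```

If the original leaf still prints its old value after both edits, the copies are independent at every level, which is the property the single-level version lacks.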

2 Comments

but how can I make a selection to only consider duplicateAndAddOne?
You'll have to walk through each element, check its 'name' attribute, and then decide what to do with it. Although, if you have control over this XML, Strategy doesn't appear to be a good name for that section; something like DataFrame with a strategy='duplicate' attribute would seem to be more semantically consistent...
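
To make the selection concrete, here is a rough sketch using the built-in ElementTree and the XML from the question. It assumes every child of the block holds an integer, and it uses copy.deepcopy for the duplicate (the string round trip from the answer would work just as well). As far as I know ElementTree has no bulk arithmetic, so a single loop over the children is about as close to "batch" as it gets.

```python
import copy
import xml.etree.ElementTree as ET

XML = """<data>
    <Strategy name="duplicateAndAddOne">
        <datapoint1>7</datapoint1>
        <datapoint2>9</datapoint2>
    </Strategy>
    <Strategy name="leaveMeAlone">
        <datapoint1>22</datapoint1>
        <datapoint2>23</datapoint2>
    </Strategy>
</data>"""

root = ET.fromstring(XML)

# Pick out only the block whose name attribute matches.
source = root.find("./Strategy[@name='duplicateAndAddOne']")

# Deep-copy it so edits to the duplicate leave the original alone.
new_block = copy.deepcopy(source)
new_block.set("name", "newElements")

# Increment every child in one pass (assumes integer text in each child).
for point in new_block:
    point.text = str(int(point.text) + 1)

root.append(new_block)

values = [int(p.text) for p in new_block]
print(values)  # [8, 10]
```

The find call uses the limited XPath syntax that ElementTree supports; with lxml the same predicate should work through xpath() as well.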
