I would like to modify some XML files programmatically, but I end up introducing unintended changes. For example, consider the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment
-->
<abc:Tag xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:abc="http://www.mycompany.com" xmlns:def="http://www.anothercompany.com">
  <abc:sometext oneattribute="Hello" anotherattribute="World">
    Some random boring text.
  </abc:sometext>
  <def:somecode>
<![CDATA[
if a>=b:
    print(a)
]]>
  </def:somecode>
</abc:Tag>
I am trying to add a simple comment to the code inside the CDATA section. To do so, I am using the following Python script, which handles the namespaces correctly and appends the string. However, the CDATA section is lost in the output:
import sys
from lxml import etree as ET

xml_file = sys.argv[1]
tree = ET.parse(xml_file)
root = tree.getroot()

# Collect every namespace prefix declared in the file.
ns = {}
element_tree = ET.iterparse(xml_file, events=["start-ns"])
try:
    for event, (prefix, qualified_name) in element_tree:
        ET.register_namespace(prefix, qualified_name)
        ns[prefix] = qualified_name
except ET.ParseError as err:
    sys.exit(1)

# Append the comment to the text of each def:somecode element.
for somecode in tree.findall('def:somecode', namespaces=ns):
    somecode.text = somecode.text + "# updated with a comment"

tree.write('output.xml',
           xml_declaration=True,
           encoding="UTF-8")
The resulting output differs from the input in two ways that I didn't expect and don't know how to correct:
- Single quotes are replaced by double
- The code in CDATA is printed as normal text

See the strip_cdata=False parser option: lxml.de/api/lxml.etree.iterparse-class.html
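
A minimal sketch of how that suggestion might be applied, assuming lxml is installed. The helper name append_comment and the inline sample document are hypothetical and only there for illustration. Parsing with strip_cdata=False keeps existing CDATA sections in the tree, but assigning a plain string to .text still produces ordinary escaped text, so the new value is wrapped in etree.CDATA. For the quote style, one possible workaround is to serialize without a declaration and prepend a double-quoted one by hand.

```python
import io
from lxml import etree

# Only the namespace needed for the lookup; taken from the question's XML.
NS = {"def": "http://www.anothercompany.com"}


def append_comment(xml_bytes, suffix="# updated with a comment"):
    """Hypothetical helper: append `suffix` to def:somecode, keeping CDATA."""
    # strip_cdata=False keeps CDATA sections as CDATA instead of plain text.
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse(io.BytesIO(xml_bytes), parser)

    for somecode in tree.findall("def:somecode", namespaces=NS):
        # A plain string would be written as escaped text, so wrap the
        # new value in CDATA explicitly.
        somecode.text = etree.CDATA((somecode.text or "") + suffix)

    # Serialize without a declaration, then add a double-quoted one manually.
    body = etree.tostring(tree, encoding="UTF-8", xml_declaration=False)
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<abc:Tag xmlns:abc="http://www.mycompany.com" xmlns:def="http://www.anothercompany.com">
  <def:somecode><![CDATA[
if a>=b:
    print(a)
]]></def:somecode>
</abc:Tag>
"""

result = append_comment(sample)
print(result.decode("utf-8"))
```

Note that the whole text of the element, including any whitespace that surrounded the original CDATA section, ends up inside a single CDATA section. etree.CDATA raises an error if the text contains "]]>", so input with that sequence would need separate handling.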