I recently realized that XML containing HTML tags in the body text of some elements seems to make parsers like WP All Import choke.
To mitigate this, I tried to write a Python script that writes out proper XML.
It starts with this XML file (this is just an excerpt):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
</Row>
...
</Root>
The desired output is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
</Row>
...
</Root>
Unfortunately, what I'm actually getting contains weird escape sequences, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
...
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (&#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;])</Waterfall_Name>
<File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
<Introduction>introduction-detian-waterfall.html</Introduction>
<Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like &#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;]...]]&gt; </Introduction_Body>
</Row>
...
</Root>
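If it helps, the escaping can be reproduced in isolation. This is only a sketch for Python 3 (my script may behave a little differently on another version), and the sample string is made up rather than taken from the real HTML file:

```python
import xml.etree.ElementTree as ET

body = ET.Element('Introduction_Body')
# Hypothetical sample text standing in for the contents of the HTML file.
body.text = '<![CDATA[<br> 德天]]>'

# tostring() defaults to US-ASCII output: '<' and '>' inside .text are
# escaped, and non-ASCII characters become numeric character references.
ascii_out = ET.tostring(body)

# encoding='unicode' returns a str and keeps the Chinese characters literal,
# but the hand-written CDATA wrapper is still escaped like any other text.
text_out = ET.tostring(body, encoding='unicode')

print(ascii_out)
print(text_out)
```

So as far as I can tell, ElementTree treats the `<![CDATA[` prefix as ordinary character data, which would explain why simply concatenating it onto the text does not produce a real CDATA section.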
So I'd like to fix the following:
1) Output a new XML file that preserves the text as is, including the HTML in the newly introduced "Introduction_Body" tag as well as the text of any other tags like "Waterfall_Name".
2) Is it possible to cleanly pretty-print this (for human readability)? How?
My Python code currently looks like this:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
        intro_tree = directory + introduction
        with open(intro_tree, 'r') as f:  # the with statement eliminates the need for f.close()
            intro_text = f.read()
        intro_body = ET.SubElement(element, 'Introduction_Body')
        # This is the part that comes out escaped in the output
        intro_body.text = '<![CDATA[' + intro_text + ']]>'

#tree.write('new_' + data_file)  # same result but leaves out the xml header
with open('new_' + data_file, 'w') as f:
    f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + ET.tostring(root))
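One direction that might work with only the standard library is `xml.dom.minidom`, which provides `createCDATASection()` and `toprettyxml()`. Below is a rough Python 3 sketch, not a tested drop-in replacement. The helper names (`strip_blank_text_nodes`, `add_cdata_child`, `read_html_file`, `build_output`) are made up for illustration, and it assumes the HTML files are UTF-8 encoded and never contain the sequence `]]>`:

```python
import os
from xml.dom import minidom


def strip_blank_text_nodes(node):
    """Remove whitespace-only text nodes so toprettyxml() adds no blank lines."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_blank_text_nodes(child)


def add_cdata_child(doc, parent, tag, text):
    """Append <tag><![CDATA[text]]></tag> to parent and return the new element."""
    child = doc.createElement(tag)
    # Assumption: text has no ']]>'; minidom raises ValueError on output if it does.
    child.appendChild(doc.createCDATASection(text))
    parent.appendChild(child)
    return child


def read_html_file(directory, filename):
    """Read an HTML fragment from disk (assumed to be UTF-8)."""
    with open(os.path.join(directory, filename), encoding='utf-8') as f:
        return f.read()


def build_output(xml_text, read_html=read_html_file):
    """Return pretty-printed UTF-8 bytes with an Introduction_Body in each Row."""
    doc = minidom.parseString(xml_text.encode('utf-8'))
    strip_blank_text_nodes(doc)
    for row in doc.getElementsByTagName('Row'):
        dirs = row.getElementsByTagName('File_directory')
        intros = row.getElementsByTagName('Introduction')
        if dirs and intros:
            html = read_html(dirs[0].firstChild.data, intros[0].firstChild.data)
            add_cdata_child(doc, row, 'Introduction_Body', html)
    # With an explicit encoding, non-ASCII characters are written literally.
    return doc.toprettyxml(indent='  ', encoding='UTF-8')
```

The result is bytes, so it would need to be written to a file opened in binary mode (`'wb'`). If the `standalone="yes"` attribute matters, `toprettyxml()` accepts `standalone=True` on Python 3.9 and later. The exact whitespace around the CDATA section may vary between Python versions, so I would check the output before feeding it to the importer.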
Thanks, Johnny