How to write an XML file whose tags include HTML tags in their text using Python?

Question

I recently realized that XML containing HTML tags in the body text of some elements seems to make parsers like WP All Import choke.

To mitigate this, I tried to write a Python script that outputs the XML properly.

It starts with this XML file (this is just an excerpt):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
    </Row>
    ...
</Root>

The desired output is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
    </Row>
    ...
</Root>

Unfortunately, I'm getting the following instead, with weird escape characters:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (&#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like &#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;]...]]&gt; </Introduction_Body>
    </Row>
    ...
</Root>

So I'd like to fix the following:
1) Output a new XML file that preserves the text, including the HTML in the newly introduced "Introduction_Body" tag, as well as the text of other tags like "Waterfall_Name".
2) Is it possible to cleanly pretty-print this (for human readability)? How?

My Python code currently looks like this:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
        if element.find('Introduction') is not None:
            introduction = element.find('Introduction').text
            intro_tree = directory+introduction
            with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
                intro_text = f.read()
                intro_body = ET.SubElement(element,'Introduction_Body')
                intro_body.text = '<![CDATA[' + intro_text + ']]>'

#tree.write('new_' + data_file)  #same result but leaves out the xml header

f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + ET.tostring(root))
f.close()

Thanks, Johnny

user2390182 · Accepted Answer · 2016-11-21 17:37:32Z

2

I would recommend you switch to lxml. It is well documented and (almost) completely compatible with Python's own xml package, so you might only have to change your code minimally. lxml supports CDATA very handily:

> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)

<root><![CDATA[abcd]]></root>
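
Applied to the question's data, the same idea might look like the sketch below. It assumes each `Row` has `File_directory` and `Introduction` children as in the excerpt, and that the HTML files are UTF-8 encoded. The helper name `add_intro_bodies` and the temporary demo file are made up for illustration.

```python
import os
import tempfile

from lxml import etree


def add_intro_bodies(root):
    """Attach each row's HTML file as a CDATA child named Introduction_Body."""
    for row in root.iter("Row"):
        directory = row.findtext("File_directory")
        filename = row.findtext("Introduction")
        if directory is None or filename is None:
            continue  # skip rows that lack either piece of the path
        path = os.path.join(directory, filename)
        # Assumption: the HTML files are UTF-8 encoded.
        with open(path, encoding="utf-8") as f:
            html = f.read()
        body = etree.SubElement(row, "Introduction_Body")
        # Wrap the text in CDATA instead of concatenating '<![CDATA[' by hand,
        # so the serializer does not escape < and >.
        # Note: XML does not allow the sequence ']]>' inside a CDATA section.
        body.text = etree.CDATA(html)
    return root


# Hypothetical demo data standing in for the real write-up files.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "intro.html"), "w", encoding="utf-8") as f:
    f.write('<a href="http://blah.com/blah.html">德天瀑布</a><br>\n')

root = etree.Element("Root")
row = etree.SubElement(root, "Row")
etree.SubElement(row, "File_directory").text = workdir + os.sep
etree.SubElement(row, "Introduction").text = "intro.html"

add_intro_bodies(root)
result = etree.tostring(root, encoding="unicode")
print(result)
```

For the real file you would parse it with `etree.parse(data_file)` and pass `tree.getroot()` to the helper instead of building the tree by hand.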

That aside, you should definitely use your XML library not only for parsing XML, but also for writing it! lxml will write the declaration for you:

> print(etree.tostring(elmnt, encoding="utf-8", xml_declaration=True).decode("utf-8"))

<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>
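
For the pretty-printing part of the question, the following is a minimal sketch rather than a definitive recipe. It relies on the `pretty_print`, `xml_declaration` and `standalone` keyword arguments of `etree.tostring`; the small tree is built by hand purely for illustration.

```python
from lxml import etree

# Hand-built stand-in for the parsed document.
root = etree.Element("Root")
row = etree.SubElement(root, "Row")
etree.SubElement(row, "Waterfall_Name").text = "Detian Waterfall (德天瀑布 [Détiān Pùbù])"
body = etree.SubElement(row, "Introduction_Body")
body.text = etree.CDATA('<a href="http://blah.com/blah.html"></a>, <br>')

xml_bytes = etree.tostring(
    root,
    encoding="UTF-8",       # a UTF-8 target keeps 德天瀑布 instead of &#...; references
    xml_declaration=True,   # let lxml write the <?xml ...?> header
    pretty_print=True,      # indent child elements for readability
    standalone=True,        # adds the standalone attribute to the declaration
)
text = xml_bytes.decode("utf-8")
print(text)
```

If the tree comes from an existing, already-indented file, pretty printing may appear to have no effect unless the file is parsed with `etree.XMLParser(remove_blank_text=True)`. To write straight to disk, `etree.ElementTree(root).write(...)` accepts the same keyword arguments.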

edited Nov 21, 2016 at 17:37

answered Nov 21, 2016 at 17:22

user2390182


2 Comments

Johnny Over a year ago

I'm on a Windows machine. I looked at the lxml link provided, but I guess I'm a bit lost on how to install it (I'm not a Python expert nor an IT guy). Is it as simple as a .zip or .exe file I can download and execute? What's the link?

Johnny Over a year ago

I've been playing around with lxml. Can't say it has been free of frustrations, but I upvoted the answer because I came back to this post and saw the etree.CDATA command that helped me get past some of the escape character issues with < and >. But there are still other lxml issues to resolve, which I've posted as a separate topic. Thanks for the answer though!
