0

I recently came to the realization that XML containing HTML tags in body text for some of the tags seem to make parsers like WP All Import choke.

So to mitigate this, I attempted to write a Python script to properly put out XML.

It starts with this XML file (this is just an excerpt):

<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
    </Row>
    ...
</Root>

The desired output is:

<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body><![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like <a href="http://blah.com/blah.html"></a>, <br>, <img src="http://blahimg.jpg">, etc. It should also preserve carriage returns and characters like 德天瀑布 [Détiān Pùbù]...]]> </Introduction_Body>
    </Row>
    ...
</Root>

Unfortunately, I'm getting the following with weird escape characters like:

<?xml version="1.0" encoding="UTF-8" standalone="yes">
<Root>
    ...
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (&#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;])</Waterfall_Name>
        <File_directory>./waterfall_writeups/657_Detian_Waterfall/</File_directory>
        <Introduction>introduction-detian-waterfall.html</Introduction>
        <Introduction_Body>&lt;![CDATA[Stuff parsed in from file './waterfall_writeups/657_Detian_Waterfall/introduction-detian-waterfall.html' as is, which includes html tags like &lt;a href="http://blah.com/blah.html"&gt;&lt;/a&gt;, &lt;br&gt;, &lt;img src="http://blahimg.jpg"&gt;, etc. It should also preserve carriage returns and characters like &#24503;&#22825;&#28689;&#24067; [D&#233;ti&#257;n P&#249;b&#249;]...]]&gt; </Introduction_Body>
    </Row>
    ...
</Root>

So I'd like to fix the following: 1) Output new XML file that preserves the text including the HTML in the newly introduced "Introduction_Body" tag as well as any other tags like "Waterfall_Name" 2) Is it possible to cleanly pretty print this (for human-readability)? How?

My Python code currently looks like this:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test3_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
        if element.find('Introduction') is not None:
            introduction = element.find('Introduction').text
            intro_tree = directory+introduction
            with open(intro_tree, 'r') as f: #note this with statement eliminates need for f.close()
                intro_text = f.read()
                intro_body = ET.SubElement(element,'Introduction_Body')
                intro_body.text = '<![CDATA[' + intro_text + ']]>'

#tree.write('new_' + data_file)  #same result but leaves out the xml header

f = open('new_' + data_file, 'w')
f.write('<?xml version="1.0" encoding="UTF-8" standalone="yes">' + ET.tostring(root))
f.close()

Thanks, Johnny

1 Answer 1

2

I would recommend you switch to lxml. It is well-documented and (almost) completely compatible with python's own xml. You might only have to minimally change your code. lxml supports CDATA very handily:

> from lxml import etree
> elmnt = etree.Element('root')
> elmnt.text = etree.CDATA('abcd')
> etree.dump(elmnt)

<root><![CDATA[abcd]]></root>

That aside, you should definitely use whatever library you use not only for parsing xml, but also for writing it! lxml will do the declaration for you:

> print(etree.tostring(elmnt, encoding="utf-8"))

<?xml version='1.0' encoding='utf-8'?>
<root><![CDATA[abcd]]></root>
Sign up to request clarification or add additional context in comments.

2 Comments

I'm on a windows machine. I looked at the lxml link provided, but I guess I'm a bit lost on how to install it (I'm not a Python expert nor an IT guy). Is it as simple as a .zip file or .exe file I can download and execute? What's the link?
I've been playing around with lxml. Can't say it has been free of frustrations, but I upvoted the answer because I came back to this post and saw the etree.CDATA command that helped me get past some of the escape character issues with < and >. But there are still other lxml issues to resolve, which I've posted as a separate topic. Thanks for the answer though!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.