I am trying to use a Python script to generate an HTML document with text from a data table using the xml.etree.ElementTree module. I would like to format some of the cells to include HTML tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the module is escaping these tags as plain character data. The output then shows the tags as text rather than processing them as tags. Here is a trivial example:
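import xml.etree.ElementTree as ET
root = ET.Element('html')
# extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'
# encoding='unicode' makes tostring() return a str rather than bytes (Python 3)
tree = ET.tostring(root, encoding='unicode')
# the with-block closes the file even if the write fails
with open('test.html', 'w') as out:
    out.write(tree)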
When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.
The source of the HTML document shows the problem. It appears that the module replaces the "less than" and "greater than" symbols in the tag with the HTML entities for those symbols:
<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>
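To confirm that the escaping happens when the element is serialized, here is a minimal self-contained sketch. It assumes Python 3, where encoding='unicode' makes tostring() return a str; the variable name serialized is only for illustration.

```python
import xml.etree.ElementTree as ET

# A lone <td> element is enough to reproduce the behaviour.
td = ET.Element('td')
td.text = 'This is the first line <br /> and the second'

# Anything assigned to .text is character data, so the serializer
# escapes '<' and '>' instead of emitting them as markup.
serialized = ET.tostring(td, encoding='unicode')
print(serialized)
```

The printed string contains &lt;br /&gt; rather than a real tag, which matches what ends up in the file.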
Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.
The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:
import re
tree = ET.tostring(root, encoding='unicode')
tree = re.sub(r'&lt;', '<', tree)
tree = re.sub(r'&gt;', '>', tree)
This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?
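One direction that might avoid the substitutions, as far as I can tell from the ElementTree documentation, is to build the tags as child elements and put the text that follows a tag in that child's tail attribute. This is only a sketch under assumptions: it assumes Python 3, and the body/table/tr scaffolding and the second cell are hypothetical stand-ins for the code I removed above.

```python
import xml.etree.ElementTree as ET

root = ET.Element('html')
# Hypothetical scaffolding standing in for the removed code.
body = ET.SubElement(root, 'body')
table = ET.SubElement(body, 'table')
tr = ET.SubElement(table, 'tr')

# Line break: text before the tag goes in td.text,
# text after the tag goes in the child's .tail.
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line '
br = ET.SubElement(td, 'br')
br.tail = ' and the second'

# Superscript: the <sup> element carries its own text.
td2 = ET.SubElement(tr, 'td')
td2.text = 'x'
sup = ET.SubElement(td2, 'sup')
sup.text = '2'

# method='html' writes void elements such as <br> without a closing tag.
html = ET.tostring(root, encoding='unicode', method='html')

with open('test.html', 'w') as out:
    out.write(html)
```

With this approach the markup is part of the tree, so nothing is escaped. If the cell contents arrive as strings that already contain tags, parsing each fragment with ET.fromstring and appending the result would be another option, though that only works when the fragment is well-formed XML.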