I am trying to use a Python script to generate an HTML document with text from a data table using the xml.etree.ElementTree module. I would like to format some of the cells to include HTML tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML serializer is escaping the characters in these tags. The output then shows the tags as text rather than processing them as tags. Here is a trivial example:

import xml.etree.ElementTree as ET

root = ET.Element('html')
   #extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'

# encoding='unicode' returns a str rather than bytes (Python 3)
tree = ET.tostring(root, encoding='unicode')
with open('test.html', 'w') as out:
    out.write(tree)

When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.

The source of the HTML document shows the problem. It appears that the serializer replaces the "less than" and "greater than" symbols in the tag with the HTML entities for those symbols:

<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>

Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.

The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:

import re

tree = ET.tostring(root, encoding='unicode')
tree = re.sub(r'&lt;', '<', tree)
tree = re.sub(r'&gt;', '>', tree)

This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?

1 Answer

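Instead of putting the tag in the text, add br as a child element of td and use its tail attribute. In ElementTree, an element's text holds the content before its first child, and an element's tail holds the text that follows its closing tag. That lets you construct exactly the text you want: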

import xml.etree.ElementTree as ET


root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
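# td.tail is text that would follow </td>; None (the default) means there is none
td.tail = None
br = ET.SubElement(td, 'br')
# text after <br /> but still inside <td> goes in br.tail
br.tail = " and the second"

# encoding='unicode' returns a str rather than bytes (Python 3)
tree = ET.tostring(root, encoding='unicode')
# see the string
print(tree)
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>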

with open('test.html', 'w+') as f:
    f.write(tree)

# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
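
If many cells need line breaks, the text/tail pattern above can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name set_cell_lines is hypothetical, and the code assumes Python 3.

```python
import xml.etree.ElementTree as ET


def set_cell_lines(cell, lines):
    """Fill `cell` with `lines`, separated by <br /> elements.

    Hypothetical helper: the first line goes in cell.text, and every
    later line goes in the tail of the <br /> that precedes it.
    """
    cell.text = lines[0] if lines else None
    for line in lines[1:]:
        br = ET.SubElement(cell, 'br')
        br.tail = line


root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
set_cell_lines(td, ['This is the first line', 'and the second', 'a < b'])

# encoding='unicode' makes tostring return a str instead of bytes
result = ET.tostring(root, encoding='unicode')
print(result)
```

Because the line breaks are real elements, a literal "<" in the data (as in 'a < b') is still escaped to &lt;, which the regex workaround from the question would undo. If you prefer <br> over <br /> in the output, ET.tostring also accepts method='html'.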

As a side note, the same approach works for <sup></sup>: put the superscript content in sup.text, and put any text that follows it, still within <td>, in sup.tail:

...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"

print(ET.tostring(root, encoding='unicode'))
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>
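
If the cell values in your data table already contain the markup as strings, another option may be to parse each value into elements rather than building text and tail by hand. The sketch below is an assumption about how that could look, not something from the original answer: append_markup_cell is a hypothetical helper, and it only works when the cell value is well-formed XML (so <br /> is accepted, but a bare <br> or a literal & is not).

```python
import xml.etree.ElementTree as ET


def append_markup_cell(row, markup):
    """Append a <td> to `row` whose content is parsed from `markup`.

    Hypothetical helper. `markup` must be well-formed XML once wrapped
    in <td>...</td>; ET.fromstring raises ET.ParseError otherwise.
    """
    td = ET.fromstring('<td>' + markup + '</td>')
    row.append(td)
    return td


root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
append_markup_cell(tr, 'This is the first line <br /> and the second')
append_markup_cell(tr, 'x<sup>2</sup> + 1')

html_text = ET.tostring(root, encoding='unicode')
print(html_text)
```

The parser fills in text and tail for you, so the result is the same tree you would get from the manual approach. The trade-off is that any cell containing a stray "<" or "&" raises ET.ParseError, so you may want to catch that and fall back to plain td.text for such cells.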

1 Comment

This worked perfectly! It definitely added a bit of code to my product, but made the end result much more predictable.