Python: null character breaking xml format

Question

I have some python code that processes input files and dumps certain fields from the input to XML files. This code broke when passing a null character from input -- throwing an invalid token error:

def pretty_print_xml(elem):

    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent='    ')

This surprised me and I would like to know why it broke and what else might need to be sanitized from the input. I thought only an XML meta character could throw this error and these are already being handled by minidom.

Waitaminute. You have an element, you're serializing it as a UTF-8 string, and then... what's the purpose of getting minidom involved? — Charles Duffy
– Charles Duffy, Commented Aug 29, 2016 at 13:06
If your goal is just to pretty-print, do you have lxml handy? I can give you that off the top of my head; would need to do some research for the standard-library ElementTree. — Charles Duffy
– Charles Duffy, Commented Aug 29, 2016 at 13:07
[To be clear -- minidom is really rather awful; it was among Python's first attempts at a standard-library XML module, and is still around for backwards compatibility, but that doesn't mean it's actually worth using]. — Charles Duffy
– Charles Duffy, Commented Aug 29, 2016 at 13:08
@CharlesDuffy: Haha, I cribbed the code from here pymotw.com/2/xml/etree/ElementTree/create.html. Since doing so I have come to have doubts about it though. If you could recommend a better way, advice would be appreciated. — Schemer
– Schemer, Commented Aug 29, 2016 at 13:16
Hmm. My first recommendation would actually be coming from the lxml world. Are you OK with installing additional dependencies? — Charles Duffy
– Charles Duffy, Commented Aug 29, 2016 at 13:28

Community · Accepted Answer · 2020-06-20 09:12:55Z

NUL literals are not allowed in XML. See the XML standard, version 1.1:

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char.]
[2]       Char       ::=      [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]      RestrictedChar     ::=      [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

Note that Char is defined to allow (among other ranges) \x01 through \xD7FF -- but not \x00.

By the way -- if your goal is pretty-printing, I'd suggest using lxml.etree. If the pretty_print=True argument on serialization calls doesn't work out-of-the-box, see the relevant FAQ entry.

Collectives™ on Stack Overflow

Python: null character breaking xml format

1 Answer 1

2.2 Characters

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2.2 Characters

Comments

Your Answer

Sign up or log in

Post as a guest

Related