0

I have some python code that processes input files and dumps certain fields from the input to XML files. This code broke when passing a null character from input -- throwing an invalid token error:

def pretty_print_xml(elem):

    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent='    ')

This surprised me and I would like to know why it broke and what else might need to be sanitized from the input. I thought only an XML meta character could throw this error and these are already being handled by minidom.

8
  • Waitaminute. You have an element, you're serializing it as a UTF-8 string, and then... what's the purpose of getting minidom involved? Commented Aug 29, 2016 at 13:06
  • If your goal is just to pretty-print, do you have lxml handy? I can give you that off the top of my head; would need to do some research for the standard-library ElementTree. Commented Aug 29, 2016 at 13:07
  • [To be clear -- minidom is really rather awful; it was among Python's first attempts at a standard-library XML module, and is still around for backwards compatibility, but that doesn't mean it's actually worth using]. Commented Aug 29, 2016 at 13:08
  • @CharlesDuffy: Haha, I cribbed the code from here pymotw.com/2/xml/etree/ElementTree/create.html. Since doing so I have come to have doubts about it though. If you could recommend a better way, advice would be appreciated. Commented Aug 29, 2016 at 13:16
  • 1
    Hmm. My first recommendation would actually be coming from the lxml world. Are you OK with installing additional dependencies? Commented Aug 29, 2016 at 13:28

1 Answer 1

1

NUL literals are not allowed in XML. See the XML standard, version 1.1:

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char.]

[2]       Char       ::=      [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]      RestrictedChar     ::=      [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

Note that Char is defined to allow (among other ranges) \x01 through \xD7FF -- but not \x00.


By the way -- if your goal is pretty-printing, I'd suggest using lxml.etree. If the pretty_print=True argument on serialization calls doesn't work out-of-the-box, see the relevant FAQ entry.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.