I use Django, and in my view I need to send a request as XML containing some Unicode characters received from an HTML page via the POST method. I tried the following (note that I save that input in the fname variable):

xml = r"""my XML code with unicode {0} """.format(fname)

And

fname = u"%s".encode('utf8') % (fname)
xml = r"""my XML code with unicode {0} """.format(fname)

And

fname = fname.encode('ascii', 'ignore').decode('ascii')
xml = r"""my XML code with unicode {0} """.format(fname)

And every time I got this error:

'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
  • What type is the fname variable? str or bytes? Commented Dec 29, 2014 at 7:36
  • Looks like you are trying to convert a Unicode character into ASCII that is outside its 7-bit range. The clue is in the error message. Commented Dec 29, 2014 at 7:36
  • fname is a Persian string, "محمد حسین", and I mean Unicode :) Commented Dec 29, 2014 at 7:40

4 Answers

1

You could reproduce the error with this code:

>>> "{0}".format(u"\U0001F384"*4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

To fix this particular error, just use a Unicode format string:

>>> u"{0}".format(u"\U0001F384"*4)
u'\U0001f384\U0001f384\U0001f384\U0001f384'

You could use the xml.etree.ElementTree module to build your XML document instead of string formatting. XML is a complex format; it is easy to get it wrong. ElementTree will also serialize your Unicode string into bytes correctly, making sure that the character encoding in the XML declaration is consistent with the actual encoding that is used in the document.
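As a rough sketch of the ElementTree approach, the snippet below builds a small document and serializes it to UTF-8 bytes. The element names (request, name) and the helper build_request are hypothetical, chosen only for illustration; the question does not show the real XML layout.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET


def build_request(fname):
    """Return a UTF-8 encoded XML document with fname as element text.

    The element names used here are placeholders, not the real schema.
    """
    root = ET.Element("request")
    name = ET.SubElement(root, "name")
    # Assign the Unicode string directly; ElementTree escapes
    # XML-special characters in text when it serializes.
    name.text = fname
    # Passing an encoding makes tostring() return bytes in that encoding.
    return ET.tostring(root, encoding="utf-8")


xml_bytes = build_request(u"محمد حسین & <co>")
# Round-trip to confirm that the text survives unchanged.
parsed_name = ET.fromstring(xml_bytes).find("name").text
```

Because the value is assigned as text rather than pasted into a template, no manual escaping or encoding of fname is needed before the call.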

1
xml = r"""my XML code with unicode {0} """.format(fname)

The .format method always produces the same output string type as the input format string. In this case your format string is a byte string r"""...""", so if fname is a Unicode string, Python tries to force it into being a byte string. If fname contains characters that do not exist in the default encoding (ASCII), then bang.

Note that this differs from the old string formatting operator %, which promotes the result to a Unicode string when either the format string or any of the arguments is Unicode. That would work in this case as long as the "my XML code" part was ASCII-compatible. This is a common problem when you convert code that uses % to .format().

This should work fine:

xml = ur"""my XML code with unicode {0} """.format(fname)

However, the output will be a Unicode string, so whatever you do next needs to cope with that (for example, if you are writing it to a byte stream or file, you would probably want to .encode('utf-8') the whole thing). Alternatively, encode it in place to get a byte string:

xml = r"""my XML code with unicode {0} """.format(fname.encode('utf-8'))
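The "format as Unicode, encode once at the end" option can be sketched as follows. The template and element names are made up for illustration, and escaping of XML-special characters is a separate concern that this sketch deliberately leaves out.

```python
# -*- coding: utf-8 -*-

# Hypothetical template; the u prefix keeps the whole operation in Unicode.
template = u"<person><name>{0}</name></person>"

fname = u"محمد حسین"

# Unicode format string + Unicode argument -> Unicode result, no implicit ASCII step.
xml_text = template.format(fname)

# Encode exactly once, at the point where bytes are actually required
# (for example, just before sending the request body).
xml_bytes = xml_text.encode("utf-8")
```

Keeping everything as Unicode until the final encode means there is only one place where the encoding has to be chosen.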

Note that this line from the question:

fname = u"%s".encode('utf8') % (fname)

does not work because you are encoding the format string to bytes, not the fname argument. This is identical to saying just fname = '%s' % fname, which is effectively fname = fname.

I Solved that with this code:

fname = fname.encode('ascii', 'xmlcharrefreplace')

This smells bad. For input hello ☃, you are now generating hello &#9731; instead of the normal output hello ☃.
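To make the difference concrete, here is a small sketch comparing the two byte forms. The wrapper element e is arbitrary; the point is only that an XML parser decodes both forms to the same text, while the raw bytes differ.

```python
import xml.etree.ElementTree as ET

text = u"hello \u2603"  # U+2603 is the snowman character

# Non-ASCII characters become numeric character references.
as_refs = text.encode("ascii", "xmlcharrefreplace")
# The usual choice: keep the character and encode it as UTF-8.
as_utf8 = text.encode("utf-8")

# An XML parser sees the same text either way...
from_refs = ET.fromstring(b"<e>" + as_refs + b"</e>").text
from_utf8 = ET.fromstring(b"<e>" + as_utf8 + b"</e>").text
# ...but anything that reads the bytes without XML parsing sees "&#9731;".
```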

If both ☃ and &#9731; look the same to you in the output, then you are probably doing something like this:

xml = '<element>{0}</element>'.format(some_text)

which is broken for XML-special characters like & and <. When you are generating XML you should take care to escape special characters (&, <, >, " and ', as &amp;, &lt; and so on); otherwise at best your output will break for these characters, and at worst, when some_text includes user input, you have an XML-injection vulnerability which may break the logic of your system in a security-sensitive way.
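If string templates have to stay, one possible way to escape element text is the standard library's xml.sax.saxutils.escape, which handles &, < and > by default. The helper name element below is hypothetical, and this is only a minimal sketch for element content, not for attribute values.

```python
from xml.sax.saxutils import escape


def element(tag, text):
    """Wrap text in <tag>...</tag>, escaping &, < and > in the text.

    The tag itself is assumed to be a trusted, valid XML name.
    """
    return u"<{0}>{1}</{0}>".format(tag, escape(text))


some_text = u"Tom & Jerry <3"
xml = element(u"element", some_text)

# Quotes only matter inside attribute values; escape() accepts extra
# replacements for that case.
quoted = escape(u'say "hi"', {u'"': u"&quot;", u"'": u"&apos;"})
```

A serializer such as ElementTree still does this more reliably, since it escapes according to context automatically.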

As J F Sebastian said (+1), it's a good idea to use existing known-good XML serialisation libraries like etree instead of trying to roll your own.

3 Comments

Upvoted for pointing out that the OP inserts unescaped fname into the XML. To make it clear: it is an advantage that .format() refuses to implicitly promote the result to Unicode, compared to %.
Thank you for your answer. Can you explain how I can escape these special characters?
text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&apos;')
0

You could do something like:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

but it's an ugly hack that only works in Python 2.7.

or

fname.encode('GB18030').decode('utf-8')

It will bypass the error, but the result may still look messy. If you're posting to an HTML file, then make the charset UTF-8.

2 Comments

Don't change the default encoding. It may break libraries that do not expect it. And it is not necessary in this case.
Hence the ugliness; it's why I added the "or".
0

I solved that with this code:

fname = fname.encode('ascii', 'xmlcharrefreplace')
xml = r"""my XML code with unicode {0} """.format(fname)

Thank you for your help.

Update: you can also remove or replace special characters like >, & and < with this (thanks to @bobince for noticing this):

fname = fname.replace("<", "")
fname = fname.replace(">", "")
fname = fname.replace("&", "")
