Using Unicode characters in XML with Python: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

Question

I use Django, and in my view I need to send a request as XML containing some Unicode characters received from an HTML page via the POST method. I tried the following (note that I store the input in the fname variable):

xml = r"""my XML code with unicode {0} """.format(fname)

And

fname = u"%s".encode('utf8') % (fname)
xml = r"""my XML code with unicode {0} """.format(fname)

And

fname = fname.encode('ascii', 'ignore').decode('ascii')
xml = r"""my XML code with unicode {0} """.format(fname)

And every time I got this error:

'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

Looks like you are trying to convert a Unicode character into ASCII that is outside its 7-bit range. The clue is in the error message.
– cdarke, Commented Dec 29, 2014 at 7:36
fname is a Persian string "محمد حسین", and I mean Unicode :)
– user4014811, Commented Dec 29, 2014 at 7:40

jfs · Accepted Answer · 2014-12-29 09:37:23Z

1

You could reproduce the error with this code:

>>> "{0}".format(u"\U0001F384"*4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

To fix this particular error, just use a Unicode format string:

>>> u"{0}".format(u"\U0001F384"*4)
u'\U0001f384\U0001f384\U0001f384\U0001f384'

You could use the xml.etree.ElementTree module to build your XML document instead of string formatting. XML is a complex format; it is easy to get it wrong. ElementTree will also serialize your Unicode string into bytes correctly, making sure that the character encoding in the XML declaration is consistent with the actual encoding used in the document.
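
A minimal sketch of that approach, written for Python 3 (the question uses Python 2, where the same ElementTree calls exist). The element names `request` and `fname` and the helper `build_request` are hypothetical, chosen only for illustration:

```python
import io
import xml.etree.ElementTree as ET


def build_request(fname):
    """Build a small XML document containing fname; return UTF-8 bytes."""
    root = ET.Element("request")          # hypothetical root element
    child = ET.SubElement(root, "fname")  # hypothetical child element
    child.text = fname                    # &, < and > are escaped on output

    # Write to an in-memory buffer, asking explicitly for an XML declaration
    # so that the declared encoding matches the bytes produced.
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
    return buf.getvalue()


payload = build_request(u"محمد حسین & <co>")
print(payload.decode("utf-8"))
```

The caller never touches escaping or encoding by hand; the only decision left is which encoding to request when serializing.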

answered Dec 29, 2014 at 9:37

jfs


bobince · Accepted Answer · 2014-12-30 15:08:22Z

1

xml = r"""my XML code with unicode {0} """.format(fname)

The .format method always produces the same output string type as the input format string. In this case your format string is a byte string r"""...""", so if fname is a Unicode string, Python tries to force it into being a byte string. If fname contains characters that do not exist in the default encoding (ASCII), then bang.

Note that this differs from the old string formatting operator %, which promotes the result to a Unicode string when either the format string or any of the arguments is Unicode. That would work in this case as long as the rest of the format string (my XML code) was ASCII-compatible. This is a common problem when you convert code that uses % to .format().

This should work fine:

xml = ur"""my XML code with unicode {0} """.format(fname)

However, the output will be a Unicode string, so whatever you do next needs to cope with that (for example, if you are writing it to a byte stream or file, you would probably want to .encode('utf-8') the whole thing). Alternatively, encode the argument in place to get a byte string:

xml = r"""my XML code with unicode {0} """.format(fname.encode('utf-8'))
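
The first option above (format as Unicode text, then encode once at the output boundary) might look like the following sketch. It is written in Python 3 syntax; the `<name>` element is a placeholder, and the value is assumed to contain no XML-special characters:

```python
fname = u"محمد حسین"  # sample value taken from the question

# Keep everything as Unicode text while building the document...
xml_text = u"<name>{0}</name>".format(fname)

# ...and encode exactly once, at the point where bytes are needed
# (socket, file, HTTP request body).
xml_bytes = xml_text.encode("utf-8")

print(type(xml_text).__name__, type(xml_bytes).__name__)
```

Encoding in one place keeps the rest of the code free of mixed text/bytes handling.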

Note that this attempt from the question:

fname = u"%s".encode('utf8') % (fname)

does not work because you are encoding the format string to bytes, not the fname argument. This is identical to saying just fname = '%s' % fname, which is effectively fname = fname.

I solved that with this code:

fname = fname.encode('ascii', 'xmlcharrefreplace')

This smells bad. For input hello ☃, you are now generating hello &#9731; instead of the normal output hello ☃.
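
To see what the `xmlcharrefreplace` error handler does, here is a small sketch (Python 3 syntax; the `<e>` wrapper element is only for illustration). It also shows why the two forms are indistinguishable once an XML parser has read them:

```python
import xml.etree.ElementTree as ET

text = u"hello \u2603"  # "hello ☃"

# Non-ASCII characters become numeric character references.
as_ascii = text.encode("ascii", "xmlcharrefreplace")
print(as_ascii)  # b'hello &#9731;'

# An XML parser decodes the reference back to the original character,
# so both spellings yield the same text content.
from_ref = ET.fromstring(b"<e>" + as_ascii + b"</e>").text
from_raw = ET.fromstring(u"<e>{0}</e>".format(text).encode("utf-8")).text
print(from_ref == from_raw == text)  # True
```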

If both &#9731; and ☃ look the same to you in the output, then you are probably doing something like this:

xml = '<element>{0}</element>'.format(some_text)

which is broken for XML-special characters like & and <. When you are generating XML you should take care to escape special characters (&<>"', to &amp;, &lt; etc.); otherwise, at best your output will break for these characters, and at worst, when some_text includes user input, you have an XML-injection vulnerability which may break the logic of your system in a security-sensitive way.

As J F Sebastian said (+1), it's a good idea to use existing known-good XML serialisation libraries like etree instead of trying to roll your own.

answered Dec 30, 2014 at 15:08

bobince


3 Comments

jfs Over a year ago

Upvoted for pointing out that the OP inserts unescaped fname into XML. To make it clear: it is an advantage that .format() refuses to promote the result to Unicode implicitly, compared to %.

user4014811 Over a year ago

Thank you for your answer. Can you explain how I can escape these special characters?

bobince Over a year ago

text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;')
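
The standard library offers a helper for the same job: `xml.sax.saxutils.escape` handles `&`, `<` and `>`, and accepts a dictionary of extra replacements for the quote characters. A short sketch (the `escape_xml` wrapper name is hypothetical):

```python
from xml.sax.saxutils import escape


def escape_xml(text):
    """Escape the five XML-special characters in text."""
    # escape() replaces & first, so the entities it adds are not re-escaped.
    return escape(text, {'"': "&quot;", "'": "&#39;"})


print(escape_xml(u"Tom & Jerry <\"best\" 'friends'>"))
# Tom &amp; Jerry &lt;&quot;best&quot; &#39;friends&#39;&gt;
```

Whichever form is used, the ampersand has to be replaced before anything else; otherwise the `&` in an already-inserted `&lt;` would be escaped a second time.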

Michael · Accepted Answer · 2014-12-29 08:10:30Z

0

You could do something like:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

but it's an ugly hack that only works in Python 2.7.

or

fname.encode('GB18030').decode('utf-8')

It will bypass the error, but the result may still look messy. If you're posting to an HTML page, set its charset to UTF-8.

edited Dec 29, 2014 at 8:10

answered Dec 29, 2014 at 8:01

Michael


2 Comments

jfs Over a year ago

Don't change the default encoding. It may break libraries that do not expect it, and it is not necessary in this case.

Michael Over a year ago

Hence the ugliness; it's why I added the "or".

user4014811 · Accepted Answer · 2014-12-31 06:00:19Z

0

I solved that with this code:

fname = fname.encode('ascii', 'xmlcharrefreplace')
xml = r"""my XML code with unicode {0} """.format(fname)

Thank you for your help.

Update: You can also remove or replace special characters like >, & and < with this (thanks to @bobince for pointing this out):

fname = fname.replace("<", "")
fname = fname.replace(">", "")
fname = fname.replace("&", "")

edited Dec 31, 2014 at 6:00

answered Dec 29, 2014 at 7:49

user4014811
