Python UTF8 string confusion

Question

Been banging my head on this for a while and I've read a bunch of articles and the issue isn't any clearer. I have a bunch of strings stored in my database, imagine the following:

x = '\xd0\xa4'
y = '\x92'

At the Python shell I get the following:

print x
Ф
print y
?

Which is exactly what I want to see. However then there is the following:

print unicode(x, 'utf8')
Ф

But not this:

unicode(y, 'utf8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: unexpected code byte

My feeling is that our strings are getting mangled because Django tries to convert them to unicode, but I'm just guessing at this point. Any insights or workarounds appreciated.

UPDATE: When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding.

y is not a valid UTF-8 encoded string. Why do you expect Python to be able to decode this?
– Thanatos, Commented Jul 10, 2010 at 21:56
Also, I'm assuming that x = '\xd0\xa4' (there's an extra slash)
– Thanatos, Commented Jul 10, 2010 at 21:57
@Thanatos. I know that. But how can it print without specifying an encoding? Can the encoding be inferred?
– dnolen, Commented Jul 10, 2010 at 22:17
btw, x = '\xd0\xa4' might do something completely different than loading a string from the db.
– user3850, Commented Jul 10, 2010 at 23:20
next time don't try to be smart and reduce your problem to something unrelated, thanks.
– user3850, Commented Jul 12, 2010 at 7:46

John Machin · Accepted Answer · 2010-07-11 22:53:06Z

7

Looks like you have a typo; should be x = '\xd0\xa4'. It helps very much if you copy and paste what you actually ran and what appeared in the output.

"\x92" is not a valid UTF-8 string. This explains the exception that you got.

More of a puzzle is why print y produced ?. What are you calling "the Python console"?? It appears to be operating in "replace" mode and substituting "?" ... are you sure that it's a plain "?" and not a white "?" inside a black diamond? Why do you say that "?" is exactly what you expect to see?

UPDATE: You now say """When I look at the database at the row that contains the '\x92' value, I see this character as ’. An apostrophe. I'm viewing the contents of the database using a Unicode UTF-8 encoding."""

That's not an apostrophe. It seems that that piece of data has been encoded using one of the cp125X (aka windows-125X) encodings. Illustrating using cp1252 (the usual suspect):

IDLE 2.6.4      
>>> import unicodedata
>>> uc = '\x92'.decode('cp1252')
>>> print repr(uc)
u'\u2019'
>>> print uc
’
>>> unicodedata.name(uc)
'RIGHT SINGLE QUOTATION MARK'
>>>

Instead of "viewing the contents of the database using a Unicode UTF-8 encoding" (whatever that means), try writing a small snippet of Python code to extract the offending string and then do print repr(bad_string). Show us the code that you ran, plus the output of the repr(). Also tell us which version of Python, what platform (Windows or unix-based), and what version of what database software. And the part of the CREATE TABLE statement relevant to the column in question.
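A minimal sketch of that kind of inspection, assuming Python 3 (where the byte strings in this thread are `bytes` objects). The helper name `try_decodings`, the candidate encoding list, and the `bad_string` stand-in value are hypothetical and chosen only for illustration:

```python
import unicodedata

def try_decodings(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return {encoding: decoded text, or None if the bytes are invalid}."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None  # these bytes are not valid in this encoding
    return results

# Stand-in for the value fetched from the database (assumption).
bad_string = b"\x92"
print(repr(bad_string))

for enc, text in try_decodings(bad_string).items():
    if text is None:
        print(enc, "-> cannot decode")
    else:
        # unicodedata.name() needs a default for unnamed control characters
        names = [unicodedata.name(ch, "<unnamed>") for ch in text]
        print(enc, "->", ascii(text), names)
```

A successful decode only shows that the bytes are legal in that encoding, not that it is the encoding the data was written in; latin-1 in particular accepts every byte value.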

Also please read this and this.

edited Jul 11, 2010 at 22:53

answered Jul 10, 2010 at 22:29

John Machin


2 Comments

user3850 Over a year ago

didn't i tell you this would happen? :)

John Machin Over a year ago

@hop: No, you said you suspected that there was a different underlying problem. And that was like saying that you suspected that the sun rises in the east -- an OP rarely asks the question they should have asked.

user3850 · Accepted Answer · 2010-07-10 22:28:15Z

5

\x92 is not a valid utf-8 encoded character.

You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings. When you then print them, they are simply dumped to the terminal "as is" and the terminal itself interprets the bytes according to its encoding setting.

There is a third parameter to unicode() that tells python what to do in case of encoding (decoding) errors:

>>> unicode('\x92', 'utf8', 'replace')
u'\ufffd'
>>> print _
�
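For comparison, a sketch of the same idea assuming Python 3 rather than the Python 2 session above. There `bytes.decode` takes the error handler as its `errors` argument; the variable names are illustrative only:

```python
raw = b"\x92"

# 'strict' (the default) raises on invalid input.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = raw.decode("utf-8", errors="replace")
print("replace:", ascii(replaced))

# 'ignore' silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")
print("ignore:", ascii(ignored))
```

Both 'replace' and 'ignore' lose information, so they hide the symptom rather than recover the original character.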

answered Jul 10, 2010 at 22:28

user3850

8 Comments

John Machin Over a year ago

@hop: "You don't notice that because you use simple (non-unicode) strings for x and y until you try to decode them into unicode strings." -- so you're saying that the simple non-unicode string "\xd0\xa4" has been magically transmogrified into the unicode character U+0424 CYRILLIC CAPITAL LETTER EF without any decoding happening??

user3850 Over a year ago

@John: no, i don't say that at all. there is nothing magic about the terminal decoding a valid utf-8 sequence into a unicode character to display. it's just not python that does any decoding.

Thanatos Over a year ago

@John: The terminal decodes that "\xd0\xa4" to the U+0424 because your terminal is configured for UTF-8, which is typically the default nowadays. If it was set to something else, this would not work.

John Machin Over a year ago

@hop: The essence of the problem is that in this case "the terminal" decodes the byte string in a fashion inconsistent with unicode(y, 'utf8').

John Machin Over a year ago

@Thanatos: I'm well aware that utf8 is typically the default (for *x terminals). My point was that @hop's original text appeared to be saying that the terminal wasn't doing any decoding at all.


user180247 · Accepted Answer · 2010-07-10 22:34:27Z

4

I thought any unicode character other than the ASCII subset had a multi-byte representation in UTF-8. Your y makes sense as a single-byte-per-char string, but not as a UTF-8 string. Because the single byte is outside the 0x00 to 0x7F ASCII range, the codec will expect an extra byte or more for the conversion to a "real" unicode character.
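That claim can be checked with a short sketch, assuming Python 3. The sample characters are an ASCII letter plus the two characters discussed in this thread, and the names `samples` and `encoded` are illustrative:

```python
# ASCII letter, CYRILLIC CAPITAL LETTER EF, RIGHT SINGLE QUOTATION MARK
samples = ["A", "\u0424", "\u2019"]

# Map each character to its UTF-8 byte sequence.
encoded = {ch: ch.encode("utf-8") for ch in samples}

for ch, data in encoded.items():
    print(ascii(ch), len(data), data)
```

Only the ASCII character takes a single byte; the other two take two and three bytes respectively.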

I'm not as familiar with Python as I once was, though, and I'm not confident about this answer.

EDIT: hop's is the better answer IMO.

edited Jul 10, 2010 at 22:34

answered Jul 10, 2010 at 22:07

user180247


Thanatos · Accepted Answer · 2010-07-10 23:58:16Z

I see now where you're confused. Let's look at this:

x = '\xd0\xa4'
y = '\x92'

If I print x, I get Ф. This is because my terminal is using UTF-8 as its character encoding. Thus, when it gets D0 A4, it attempts to decode it as UTF-8, and gets a "Ф". If I change my terminal to use, say, ISO-8859-1 ("latin1"), and I say print x, my terminal will attempt to decode D0 A4 using ISO-8859-1, and since D0 A4 is also a valid ISO-8859-1 string, it will decode, but this time, to "Ð¤".
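The two interpretations can be reproduced inside Python instead of by reconfiguring the terminal. This is a sketch assuming Python 3, with `decode` standing in for what the terminal does with the bytes it receives:

```python
raw = b"\xd0\xa4"

as_utf8 = raw.decode("utf-8")      # what a UTF-8 terminal would display
as_latin1 = raw.decode("latin-1")  # what an ISO-8859-1 terminal would display

print(ascii(as_utf8), len(as_utf8))
print(ascii(as_latin1), len(as_latin1))
```

The same two bytes become one character under UTF-8 and two characters under ISO-8859-1.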

Now, for print y. This isn't a UTF-8 string, so my terminal can't decode this. It shows me this error, in my case, by printing "�". I'm wondering if you see "�" or "?" - you should probably see the former, but it depends on what your terminal does in the face of bad output.

Your terminal's encoding should match whatever $LANG says, and your program should output data in whatever encoding $LANG specifies. Nowadays, $LANG is typically ???.UTF-8, where the ??? varies. (Mine is en_US.UTF-8)
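To see what Python itself believes about the environment, something like the following may help. It is a sketch assuming Python 3 and a POSIX-style environment where `LANG` may or may not be set; the variable names are illustrative:

```python
import locale
import os
import sys

lang = os.environ.get("LANG")                   # e.g. en_US.UTF-8; may be unset
stdout_encoding = sys.stdout.encoding           # encoding used when printing text
preferred = locale.getpreferredencoding(False)  # the locale's default encoding

print("LANG:", lang)
print("stdout encoding:", stdout_encoding)
print("preferred encoding:", preferred)
```

None of these values tell you what the terminal emulator is actually configured to decode; they only show what the program will emit.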

Now, when you say unicode(y, 'utf8'), Python attempts to decode this as UTF-8, and appropriately throws an exception.

I'm using Gnome Terminal, and can change my character encoding by going to Terminal → Set Character Encoding

AndiDog · Accepted Answer · 2010-07-10 22:36:42Z

1

0x92 (hex) = 10 010010 (binary)

In UTF-8, a byte whose top bits are 10 is a continuation byte, which can never be the first byte of a character. The payload 010010 would fit in one byte, so its "header" would have to be 0 (--> 00010010) instead of 10, and characters may not be represented with more bytes than needed. Either way, "\x92" is not a valid UTF-8 encoded string.
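The bit patterns can be inspected directly. The sketch below assumes Python 3, and `utf8_byte_kind` is a hypothetical helper that classifies a byte by its leading bits only. It does not reject lead bytes that are illegal for other reasons, such as 0xC0, 0xC1, or 0xF5 and above:

```python
def utf8_byte_kind(byte):
    """Classify one byte value (0-255) by its leading UTF-8 bit pattern."""
    if byte < 0x80:
        return "single-byte (ASCII)"      # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"             # 10xxxxxx
    if byte < 0xE0:
        return "lead of 2-byte sequence"  # 110xxxxx
    if byte < 0xF0:
        return "lead of 3-byte sequence"  # 1110xxxx
    if byte < 0xF8:
        return "lead of 4-byte sequence"  # 11110xxx
    return "invalid"                      # 11111xxx never appears in UTF-8

# 0x92 is the problem byte; 0xD0 0xA4 is the valid two-byte sequence from x.
for b in (0x92, 0xD0, 0xA4):
    print(format(b, "08b"), utf8_byte_kind(b))
```

0x92 is classified as a continuation byte with no lead byte before it, while 0xD0 followed by 0xA4 forms a complete lead-plus-continuation pair.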

I guess your database uses some one-byte-per-character encoding (such as latin-1). If you're coding the database queries yourself, you must ensure that the connection encoding is correct or that strings are decoded correctly. With Django models, everything should work automatically.

answered Jul 10, 2010 at 22:36

AndiDog
