7

I'm getting back from a library what looks to be an incorrect unicode string:

>>> title
u'Sopet\xc3\xb3n'

Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?

The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?

I'm stumped on (a), as there's nothing wrong, encoding-wise, with that original string (i.e., both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there).

It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:

>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón

But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?

2 Answers

11

a) Try the conversion below; if it raises an exception, the string was not UTF-8 bytes stored as code points.

b)

>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
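
For anyone wanting (a) and (b) in one place, here is a minimal sketch. The helper name fix_double_encoded is hypothetical (it is not part of any library), and it assumes the bytes were mis-decoded as Latin-1. It is written so that it should run on Python 3 as well as Python 2.

```python
def fix_double_encoded(text):
    """Return `text` with UTF-8 bytes stored as code points repaired.

    Hypothetical helper: if the round trip fails, the input is assumed
    to be fine already and is returned unchanged.
    """
    try:
        # Code points 0-255 map one-to-one onto Latin-1 bytes, so this
        # recovers the original byte sequence, which is then decoded as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a code point above 255 (cannot be Latin-1) or the bytes
        # are not valid UTF-8: leave the string alone.
        return text


repaired = fix_double_encoded(u'Sopet\xc3\xb3n')   # u'Sopet\xf3n'
untouched = fix_double_encoded(u'Sopet\xf3n')      # already correct, returned as-is
```

This is only a heuristic: a string that legitimately contains a sequence like u'\xc3\xb3' (Ã³) would be converted too.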

3 Comments

Note: 1) there is no general way to recognize UTF-8; this will recognize it because the UTF-8 decoder checks that all the multi-byte sequences it's given are valid, and raises an exception if any are not; 2) the encode-to-Latin-1 trick works because your code points are all less than 256, and Unicode's code points 0-255 correspond exactly to Latin-1's representation.
I'm not sure I completely understand your comment. Perhaps a specific counterexample would help. So far as I understand, the ".encode('latin-1')" is a no-op except that the result is a str rather than a unicode. Is there a string for which that will not be the case? I agree that there won't be a general way to detect UTF-8 inside a unicode string, as the UTF-8 encoded bytes will have a valid (if incorrect) interpretation inside a unicode string. For my purposes, I'm really only interested in latin-1 (for now), so this is sufficient.
@Watts: u'\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac means greek'.encode('latin1')
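
To spell out the point of the exchange above, here is a small sketch of both cases. The variable names are only illustrative. It should behave the same on Python 2 and Python 3.

```python
# Case 1: every code point is below 256, so encoding to Latin-1 keeps the
# numeric values and only changes the type (text -> bytes).
title = u'Sopet\xc3\xb3n'
same_values = list(bytearray(title.encode('latin-1'))) == [ord(c) for c in title]

# Case 2: the counterexample from the comment. Greek letters are above
# code point 255, so Latin-1 cannot represent them and encoding fails.
greek = u'\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac means greek'
try:
    greek.encode('latin-1')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

So the encode step is a value-preserving conversion only while every character is in the 0-255 range; outside it, the trick raises instead of converting.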
8

You should use:

>>> title.encode('raw_unicode_escape')

Python2:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))

Python3:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
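
A short sketch of how this compares with the 'latin-1' approach, using the same sample string. The variable names are illustrative only. As far as I can tell, the two codecs agree while every code point is below 256 and differ only above that: 'latin-1' raises, whereas 'raw_unicode_escape' writes a literal \uXXXX escape into the bytes.

```python
mojibake = u'\xd0\xbf\xd1\x80\xd0\xb8'

# Below code point 256 both codecs yield the same bytes, so the results match.
via_raw = mojibake.encode('raw_unicode_escape').decode('utf-8')
via_latin1 = mojibake.encode('latin-1').decode('utf-8')

# Above 255, raw_unicode_escape does not fail; it emits the six ASCII
# characters of a backslash escape, which then survive the UTF-8 decode
# as plain text rather than as the original letter.
escaped = u'\u03b5'.encode('raw_unicode_escape')
round_trip = escaped.decode('utf-8')
```

In other words, with 'raw_unicode_escape' a string containing characters above 255 comes back with backslash escapes in it instead of raising an error, which may or may not be what you want.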

1 Comment

You saved my day. I had a unicode object with UTF-8 bytes inside, and had to decode it back to 'normal' unicode. This solved it for me: my_str.encode('raw_unicode_escape').decode('utf-8'). I think this is a more general solution than the accepted answer, because it decodes strings not just in the 'latin-1' range. Thanks! :)
