1

I am using an API in Python v2.7 to obtain a string, the content of which is unknown. The content can be in English, German or French. The variable name assigned to the returned string is 'category'. An example of a returned value for the variable category is:-

"temp\\u00eate de poussi\\u00e8res"

I have tried category.decode('utf-8') to decode the string into, in the above case, French, but unfortunately it still returns the same value, with an additional unicode 'u' at the beginning when I print the result of category.decode('utf-8').

u'"temp\\u00eate de poussi\\u00e8res"'

I also tried category.encode('utf-8') but it returns the same value (minus the 'u' that precedes the string):-

'"temp\\u00eate de poussi\\u00e8res"'

Any suggestions?

2 Answers

2

I think you have literal backslashes in your string, not Unicode characters.

That is, \u00ea is the Unicode escape sequence for ê, but \\u00ea is actually a backslash (escaped), the letter u, two zeros and two more letters.

Similarly for the quotation marks: your first and last characters are literal double quotes (").
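The difference can be checked by counting characters. This is only a minimal sketch; the variable names are illustrative and the string is the example value from the question:

```python
# One character: the parser turns the escape \u00ea into the letter e-circumflex.
escaped = u'\u00ea'

# Six characters: backslash, 'u', '0', '0', 'e', 'a'.
# The doubled backslash in the source stands for one literal backslash.
literal = u'\\u00ea'

print(len(escaped))  # 1
print(len(literal))  # 6

# The value from the question begins and ends with a real double quote.
category = u'"temp\\u00eate de poussi\\u00e8res"'
print(category[0] == u'"' and category[-1] == u'"')  # True
```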

You can convert those backslash-plus-code-point sequences into their equivalent characters with:

x = '"temp\\u00eate de poussi\\u00e8res"'
d = x.decode("unicode_escape")
print d

The output is:

"tempête de poussières"

Note that to see the proper international characters you have to use print. If instead you just write d in the interactive Python shell you get:

 u'"temp\xeate de poussi\xe8res"'

where \xea is equivalent to \u00ea, that is, the escape sequence for ê.
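If in doubt, the equivalence of the two escape forms is easy to confirm. A small sketch with illustrative names:

```python
short_form = u'\xea'    # two-hex-digit escape
long_form = u'\u00ea'   # four-hex-digit escape

# Both literals denote the same single code point, U+00EA.
print(short_form == long_form)  # True
print(hex(ord(short_form)))     # 0xea
```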

Removing the quotes, if required, is left as an exercise to the reader ;-).
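For completeness, one possible way to drop the quotes is sketched below. It assumes the value always arrives wrapped in double quotes and contains only ASCII bytes plus escapes; unicode_escape treats any other byte as Latin-1, so raw UTF-8 bytes in the input would come out wrong. Note also that strip removes every leading and trailing quote, not just one pair.

```python
# Byte string holding literal backslashes, as in the answer's example.
raw = b'"temp\\u00eate de poussi\\u00e8res"'

# Turn the backslash sequences into real characters.
decoded = raw.decode('unicode_escape')

# Remove the surrounding double quotes.
unquoted = decoded.strip(u'"')

print(len(decoded) - len(unquoted))  # 2
```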


7 Comments

Thanks @rodrigo. Can you explain further what you mean at the end? I made the changes as you suggested but I get the below error returned. This is returned as a response to a print command:- UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 5: ordinal not in range(128)
@thefragileomen: Can you specify which sentence you want me to explain? The one about the quotes? As for your new error, the ascii codec can only encode ASCII characters, and ê is not an ASCII character. Why a print implies an ASCII encoding is another matter; usually this happens because you are redirecting the output of your program, and Python assumes that all files are ASCII unless told otherwise. Please see this answer for all the details.
Your print wants to convert the string to ASCII, probably because you haven't set it up to use a sane (ideally Unicode-compatible) system encoding. Lots of these issues go away if you simply switch to Python 3 and a properly Unicode-compatible locale.
Thanks @rodrigo. Could you explain more about the new error? Sorry, I thought it was related and not a separate matter.
Unfortunately @tripleee, I am unable to move to Python 3 due to restrictions in the environment where I am deploying this code (restrictions that are out of my control).
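The UnicodeEncodeError discussed in these comments can be reproduced and sidestepped by encoding explicitly. The following is a sketch only, with illustrative names; in Python 2.7, printing the resulting byte string avoids the implicit ASCII encoding, and setting the PYTHONIOENCODING environment variable to utf-8 is another option worth trying.

```python
text = u'temp\u00eate'

# The ascii codec cannot represent U+00EA, which is what print
# runs into when the output encoding falls back to ASCII.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 yields bytes that can be written anywhere.
utf8_bytes = text.encode('utf-8')

print(ascii_ok)         # False
print(len(utf8_bytes))  # 8 (the accented letter takes two bytes)
```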
1

It looks like the API uses JSON. You can decode it with the json module:

>>> import json
>>> json.loads('"temp\\u00eate de poussi\\u00e8res"')
u'temp\xeate de poussi\xe8res'
>>> print(json.loads('"temp\\u00eate de poussi\\u00e8res"'))
tempête de poussières
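If it is not certain that the API always returns a JSON string, the call can be wrapped defensively. The helper below is hypothetical (decode_category is not part of the API or of the json module) and simply falls back to the raw value when parsing fails:

```python
import json


def decode_category(raw):
    """Hypothetical helper: parse raw as JSON, else return it unchanged."""
    try:
        return json.loads(raw)
    except ValueError:
        # Raised when raw is not valid JSON (JSONDecodeError subclasses it).
        return raw


result = decode_category('"temp\\u00eate de poussi\\u00e8res"')
fallback = decode_category('plain text')

print(len(result))  # 21
print(fallback)     # plain text
```

Be aware that json.loads returns whatever type the JSON encodes, so an input such as 123 would come back as a number rather than a string.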
