4

Well, character encoding and decoding sometimes frustrates me a lot.

So we know u'\u4f60\u597d' is the utf-8 encoding of 你好,

>>> print hellolist
[u'\u4f60\u597d']
>>> print hellolist[0]
你好

What I really want to print or write to a file is [u'你好'], but I get [u'\u4f60\u597d'] every time. How can I do this?


4 Answers

5

You are misunderstanding.

u'' in python is not utf-8, it is simply Unicode (except on Windows in Python <= 3.2, where it is utf-16 instead).

utf-8 is an encoding of Unicode, which is necessarily a sequence of bytes.

Additionally, u'你' and u'\u4f60' are exactly the same thing. It's simply that in Python2 the repr of high characters uses escapes instead of raw values.
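To make the point concrete, here is a small sketch written in Python 3 syntax (an assumption, since the question uses Python 2; in Python 3 every string literal is Unicode). It checks that the two spellings give the same value and uses the built-in ascii() to reproduce the escaped form that Python 2's repr shows. The variable names are only for illustration.

```python
# Two spellings of the same one-character string.
literal = '你'
escaped = '\u4f60'

same = (literal == escaped)      # the escape is only another way to type the character
code_point = hex(ord(literal))   # the character's Unicode code point

# ascii() escapes non-ASCII characters, much like repr() does in Python 2.
escaped_repr = ascii(literal)

print(same, code_point, escaped_repr)
```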

Since Python2 is heading for EOL very soon now, you should start to think seriously about switching to Python3. It is a lot easier to keep track of all this in Python3, since there's only one string type and it's much clearer when you need to .encode and .decode.
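As a rough illustration of how this looks in Python 3 (assuming a Python 3 interpreter; the names text, data and list_repr are made up for the example), text is a str, encoded data is bytes, and the repr of a list keeps the characters themselves:

```python
text = '\u4f60\u597d'              # str: a sequence of Unicode code points
data = text.encode('utf-8')        # bytes: the UTF-8 encoding of that text

round_trip = data.decode('utf-8')  # decoding gives back the original text

# In Python 3 the repr of a str keeps printable non-ASCII characters,
# so a list prints the way the question asks for (without the u prefix).
list_repr = repr([text])
print(list_repr)
```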


2 Comments

Saying that u'' is utf-16 on certain platforms and Python versions is irrelevant - it uses some encoding internally, but which one is an implementation detail (and you got the detail slightly wrong anyway: how unicode characters are represented internally used to depend on how the interpreter was compiled; it now depends on the characters in the string). But +1 for u'你' and u'\u4f60' being the same thing, which is the important point here - these are different ways of printing the same object and the two spellings will be treated identically by Python in all situations.
@Ivc please do note that when you print the list, u'\u4f60' is converted to the text u'\\u4f60' before being printed or written to a file. This is the issue the OP is talking about, and it happens because lists use repr() internally.
4

When you print a list (or write it to a file), Python calls str() on the list, and the list's str() in turn calls repr() on each of its elements. repr() returns the escaped unicode representation that you are seeing.

Example of repr -

>>> h = u'\u4f60\u597d'
>>> print h
你好
>>> print repr(h)
u'\u4f60\u597d'
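The same behaviour can be seen without any unicode at all. The sketch below uses a hypothetical class, Word, whose str() and repr() outputs differ, to show which one a list picks for its elements:

```python
class Word:
    """Hypothetical helper with distinct str() and repr() output."""

    def __str__(self):
        return 'str-form'

    def __repr__(self):
        return 'repr-form'


w = Word()
direct = str(w)      # converting the object itself uses __str__
in_list = str([w])   # converting a list uses repr() on each element

print(direct)
print(in_list)
```

The list never consults __str__ of its elements, which is why the escaped form shows up as soon as the string is inside a list.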

You would need to manually take the elements of the list and print them for them to print correctly.

Example -

>>> h1 = [h,u'\u4f77\u587f']
>>> print u'[' + u','.join([u"'" + unicode(i) + u"'" for i in h1]) + u']'

For lists containing sublists or tuples that may have unicode characters, you would need a recursive function. Example -

>>> h1 = [h,(u'\u4f77\u587f',)]
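>>> def listprinter(l):
...     # Recursively format lists/tuples, showing text as-is instead of escaped.
...     if isinstance(l, list):
...         return u'[' + u','.join([listprinter(i) for i in l]) + u']'
...     elif isinstance(l, tuple):
...         return u'(' + u','.join([listprinter(i) for i in l]) + u')'
...     elif isinstance(l, (str, unicode)):
...         return u"'" + unicode(l) + u"'"
...     else:
...         # Fall back to repr() for numbers, None, etc.
...         return unicode(repr(l))
... 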
>>> print listprinter(h1)

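To save them to a file, use the same list comprehension or recursive function, and open the file with an explicit encoding so that the non-ASCII characters can be written. Example -

import io
with io.open('<filename>', 'w', encoding='utf-8') as f:
    f.write(listprinter(h1))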
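For comparison, here is a minimal sketch of the same idea in Python 3, where the built-in open() accepts an encoding argument. The temporary file and the names hellolist, line and raw are assumptions made for the example; a real script would use its own output path.

```python
import os
import tempfile

hellolist = ['\u4f60\u597d']

# Build the bracketed text by hand, one quoted item per element.
line = '[' + ','.join("'" + s + "'" for s in hellolist) + ']'

# A temporary file stands in for the real output path.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# Passing encoding= makes the text-to-bytes step explicit.
with open(path, 'w', encoding='utf-8') as f:
    f.write(line)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, 'rb') as f:
    raw = f.read()

os.remove(path)
print(raw)
```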


0
with open("some_file.txt","wb") as f:
    f.write(hellolist[0].encode("utf8"))

I think this will resolve your issue.

Most text editors use UTF-8 encoding :)

While the other answers are correct, none of them actually resolved your issue.

>>> u'\u4f60\u597d'.encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'

If you want the brackets:

>>> u"[u'\u4f60\u597d']".encode("utf8")


0

One thing is the Unicode character itself:

hellolist = u'\u4f60'

and another is how you represent it.

You can represent it in many ways, depending on where it is going to be displayed.

Web: UTF-8
Database: maybe UTF-16 or UTF-8
Web in Japan: EUC-JP or Shift JIS

For example, 本:
http://unicode.org/cgi-bin/GetUnihanData.pl?codepoint=672c
http://www.fileformat.info/info/unicode/char/672c/index.htm
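A short sketch of that idea, assuming Python 3 and its standard codec names (utf-8, utf-16-be, euc_jp, shift_jis): the same character, 本, is turned into a different byte sequence by each encoding, and each one decodes back to the same character.

```python
char = '\u672c'   # the character 本

# Codec names below are Python's standard codec aliases.
encodings = ['utf-8', 'utf-16-be', 'euc_jp', 'shift_jis']

# One character, several byte representations.
encoded = {name: char.encode(name) for name in encodings}

for name in encodings:
    # Show each byte sequence in hex; the character is the same in every case.
    print(name, encoded[name].hex())
```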
