Chinese encoding in Python

Question

When I output some Chinese characters in Python (pandas), they show up as below:

\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5\xe6\x98\xaf\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x95\x85\xe9\x9a\x9c\xe7\x81\xaf\xef\xbc\x8c\xe6\xa3\x80\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe6\x8f\x92\xe5\xa4\xb4\xe6\x98\xaf\xe5\x90\xa6\xe6\x8e\xa5\xe8\x99\x9a\xef\xbc\x8c\xe7\x84\xb6\xe5\x90\x8e\xe6\x9f\xa5\xe4\xb8\x80\xe4\xb8\x8b\xe6\xb2\xb9\xe6\xb3\xb5\xe5\x86\x85\xe7\xae\xa1\xe9\x81\x93\xe5\x8e\x8b\xe5\x8a\x9b\xe6\x98\xaf\xe5\x90\xa6\xe7\xac\xa6\xe5\x90\x88\xe6\xad\xa3\xe5\xb8\xb8\xe5\x80\xbc\xe3\x80\x82

What is the encoding format? As far as I know, it is not Unicode. Thanks!

Try putting # -*- coding: utf-8 -*- at the top of your Python source file to force Python into UTF-8.
– Ben, Commented Jul 13, 2018 at 22:24
@Ben A coding directive only affects how the interpreter decodes the script itself; it has no effect on what the script does to external data that it reads or writes.
– PM 2Ring, Commented Jul 13, 2018 at 22:25
That looks like UTF-8 encoded Chinese to me, although I don't read Chinese. 这种情况是油泵故障灯，检查一下油泵插头是否接虚，然后查一下油泵内管道压力是否符合正常值。
– PM 2Ring, Commented Jul 13, 2018 at 22:28
Surely those online tools want to know what the encoding is as well?
– Jongware, Commented Jul 14, 2018 at 0:01

MilkyWay90 · Accepted Answer · 2018-07-14 15:05:09Z

1

The output you are receiving is a bytes object. To turn it into text, call output.decode('utf-8').

For example:

output = b'\xe8\xbf\x99\xe7...'
unicode_output = output.decode('utf-8')
print(unicode_output)

would then print the non-Latin characters (I cannot include them here because they count as spam).

Another way to do this in one line would be: print(b'\xe8\xbf\x99\xe7...'.decode('utf-8')).

However, if that doesn't work, it is probably because your output isn't a bytes object but a string that contains the escape sequences. In that case, there is another solution.

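output = '\xe8\xbf\x99\xe7...'
# latin-1 maps each code point 0-255 back to the same byte value, so this
# recovers the original bytes without running exec on the data
print(output.encode('latin-1').decode('utf-8'))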

That should be able to fix it. Hope you got something useful out of this. Have a good day!
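
To make the bytes-versus-string distinction above concrete, here is a minimal sketch of a helper that covers three cases: a real bytes object, a str whose characters are really byte values, and a str holding literal backslash escapes. The name to_text is hypothetical and not part of any library, and the sketch assumes the underlying data is UTF-8.

```python
def to_text(value):
    """Best-effort conversion to readable text (hypothetical helper)."""
    # Case 1: real bytes, so a plain decode is enough
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode("utf-8")
    try:
        fixed = value
        # Case 2: the str holds literal backslash escapes such as r"\xe8";
        # turn each escape into the single character it names
        if "\\x" in fixed:
            fixed = fixed.encode("latin-1").decode("unicode_escape")
        # Case 3: every character stands for one byte; latin-1 maps
        # code points 0-255 back to the same byte values
        return fixed.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not recoverable this way; assume it was already proper text
        return value
```

Calling to_text(b'\xe8\xbf\x99'), to_text('\xe8\xbf\x99') and to_text(r'\xe8\xbf\x99') should all give the same single character, while text that is already decoded is returned unchanged. Unlike exec, this never runs the data as code.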

Victor Sergienko · Accepted Answer · 2018-07-13 23:51:43Z

0

This is a bytes object containing valid UTF-8 Chinese text (as far as I can trust Google Translate).

If it's a string literal from your code, add # -*- coding: utf-8 -*- as the first line of your Python file.

If it's external data, here's how to convert it to text (str type): bytes_text.decode("utf-8")
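
For the external-data case, a small sketch using only the standard library: a temporary file stands in for whatever source the data really comes from (an assumption, since the question does not say). It shows that opening in text mode with an explicit encoding gives str directly, while binary mode gives bytes that still need decoding.

```python
import os
import tempfile

# First few bytes from the question, used as sample data
utf8_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'

# Write the bytes to a throwaway file that stands in for external data
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(utf8_bytes)
    path = tmp.name

try:
    # Text mode with an explicit encoding returns str
    with open(path, encoding="utf-8") as f:
        text_from_file = f.read()
    # Binary mode returns bytes, which still need .decode()
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, type(text_from_file).__name__)  # bytes str
print(raw.decode("utf-8") == text_from_file)              # True
```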

rigsby · Accepted Answer · 2018-07-14 00:07:49Z

0

raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85 . . .'

With raw_bytes a <class 'bytes'> object containing your bytes (shown as hexadecimal escapes), you can call decode on raw_bytes to get a <class 'str'> representation of your characters.

string_text = raw_bytes.decode("utf-8")
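
As a quick check of what decode changes, this sketch uses the first twelve bytes from the question (the rest is left out here) and compares the two objects. Each of these Chinese characters takes three bytes in UTF-8, which is why the lengths differ.

```python
raw_bytes = b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x83\x85\xe5\x86\xb5'
string_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, len(raw_bytes))      # bytes 12
print(type(string_text).__name__, len(string_text))  # str 4

# The decoded text matches the start of the sentence quoted in the comments
print(string_text == "这种情况")                      # True

# encode() is the inverse and gives back the original bytes
print(string_text.encode("utf-8") == raw_bytes)      # True
```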
