Utf-8 decoding with Python

Question

I have a CSV file with some data, and one row contains text that was written to the file after being encoded as UTF-8.

This is the text:

"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"

I'm trying to recover the original characters from this text with the decode function, but I can't get it to work.

Does anyone know the correct way to do this?

abybaddi009 · Accepted Answer · 2018-02-21 10:48:47Z

4

Assuming that the line in your file is exactly like this:

b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'

And reading the line from the file gives the output:

>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"
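The doubled backslashes are only how the interpreter displays the string; in the string itself each one is a single backslash character. A small check, using a shortened stand-in for the line (the variable contents here are an illustration, not the full row):

```python
# Shortened stand-in for the line read from the file: a str, not bytes.
line = r"b'\xe7\x94\xb3'"

print(repr(line))   # repr() doubles each backslash for display
print(line)         # the actual characters, with single backslashes
print(len(line))    # counts b, the two quotes, and three 4-character escapes
```

So the task is to turn this text, which merely looks like a bytes literal, into a real bytes object.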

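You can use the eval() function to turn that text into a bytes object, and then decode it:

with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # when you reach the desired line
        # eval() runs the text as Python code, so only use it on trusted files
        b = eval(line).decode('utf-8')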

Output:

>>> print(b)
申迪西路255弄660号和665号 中国上海浦东新区 201205
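Because eval() executes whatever expression the line contains, a more cautious variant for files you do not fully trust is ast.literal_eval, which only accepts Python literals. A minimal sketch, assuming each cell holds exactly the text of a bytes literal; the helper name decode_bytes_literal and the sample row are hypothetical:

```python
import ast
import csv
import io


def decode_bytes_literal(text, encoding="utf-8"):
    """Turn the text of a bytes literal (e.g. the 15 characters b'\\xe7\\x94\\xb3') into a str."""
    # literal_eval parses literals only; it never calls functions or runs code.
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal, got %s" % type(value).__name__)
    return value.decode(encoding)


# Stand-in for the CSV file; io.StringIO keeps the sketch self-contained.
cell = r"b'\xe7\x94\xb3\xe8\xbf\xaa'"
sample = io.StringIO('1,"' + cell + '"\n')

# csv.reader removes the CSV quoting but leaves the backslashes untouched.
for row in csv.reader(sample):
    print(row[0], decode_bytes_literal(row[1]))
```

Going through the csv module also takes care of the outer double quotes around the cell, which would otherwise have to be stripped by hand.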


3 Comments

Madmartigan Over a year ago

The file contents are: b'\xe7\x94\xb3\xe8\...' and when I read the line and print it I get <class 'str'> b'\xe7\x94\xb3\xe8'

abybaddi009 Over a year ago

Can you show what the actual file looks like? Maybe in an editor like Notepad++?

Edwin van Mierlo Over a year ago

@Madmartigan that is exactly the case this answer handles by using eval(). Did you try it?

Narendra · Accepted Answer · 2018-02-21 10:30:40Z

0

Try this:

a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8'))  # your decoded output

Since you say you are reading from a file, you can try passing the encoding when you open it:

import codecs

# Open the file so that its bytes are decoded as UTF-8 while reading
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
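Note that the encoding argument only controls how the file's bytes become a str. If the file holds the text of a bytes literal (plain ASCII characters such as a backslash followed by x, e, 7), each line still comes back as that text. A minimal sketch for Python 3 using the built-in open(); the temporary file and its contents are stand-ins for the real file, and parsing with ast.literal_eval is an assumption added for illustration, not part of the approach above:

```python
import ast
import os
import tempfile

# Write a throwaway file whose single line is the *text* of a bytes literal.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".csv", delete=False
) as tmp:
    tmp.write(r"b'\xe4\xb8\xad\xe5\x9b\xbd'" + "\n")
    path = tmp.name

try:
    with open(path, "r", encoding="utf-8") as f:
        line = f.readline().strip()
finally:
    os.remove(path)

# The file held plain ASCII, so the encoding argument changes nothing here:
# the line is still a str that starts with the two characters b and '.
print(type(line).__name__, line[:2])

# Parsing the literal is what recovers the bytes, which can then be decoded.
decoded = ast.literal_eval(line).decode("utf-8")
print(decoded)
```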

edited Feb 21, 2018 at 10:30

answered Feb 21, 2018 at 10:00


3 Comments

Madmartigan Over a year ago

I know that works. My problem is that I cannot find a way to prepare the string. When I read the row I get "b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\..." but I need b'\xe7\x94\xb3\xe8\xbf\xaa\xe8...'

Narendra Over a year ago

@Madmartigan OK, in that case I have modified my answer. Please try it now.

viraptor Over a year ago

@Narendra OP is asking about python-3. It's enough to use open(path, 'r', encoding='utf-8'). You don't have to use the codecs module.
