Python encoding/decoding problems

Question

How do I decode a string such as "weren\xe2\x80\x99t" back to its normal form?

The word is actually weren't, not "weren\xe2\x80\x99t". For example:

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

â€œThings
“Things
Things

But I actually want to get "Things.

or:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

werenâ€™t
weren’t
werent

But I actually want to get weren't.

How should I do this?

You'll need to provide your desired translation dictionary (e.g. from fancy quotes to plain ASCII ones) and use the .translate method of Unicode strings to apply it. I don't think there is a standard "asciify it down" translation dictionary around...
– Alex Martelli, Commented Jan 17, 2015 at 5:35

Brana · Accepted Answer · 2015-01-18 01:06:27Z

I mapped the most common strange characters, so this is a fairly complete answer based on Oliver W.'s answer.

This function is by no means ideal, but it is a good place to start. More character definitions are listed here:

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

...

def unicodetoascii(text):
    # Python 2 only: text is a UTF-8 encoded byte string (str).
    uni2ascii = {
            ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
            ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
            ord('\xc3\xa9'.decode('utf-8')): ord('e'),
            ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

            ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
            ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),

            ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
            ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

            ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
            ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
            ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
            ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
            ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),

                            }
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")
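
The function above targets Python 2, where a plain string literal holds bytes. A rough Python 3 equivalent might look like the sketch below. It assumes the input arrives as UTF-8 encoded `bytes`; the name `unicode_to_ascii` and the shortened map are illustrative and not part of the original answer.

```python
# Sketch of a Python 3 port; the function name and the shortened map are
# illustrative, not part of the original answer.
UNI2ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00e9": "e",   # e with acute accent
})


def unicode_to_ascii(data: bytes) -> str:
    """Decode UTF-8 bytes and replace the mapped characters with ASCII ones."""
    text = data.decode("utf-8").translate(UNI2ASCII)
    # Raises UnicodeEncodeError if a character outside the map survives,
    # which mirrors the strict .encode('ascii') in the Python 2 version.
    return text.encode("ascii").decode("ascii")


print(unicode_to_ascii(b"weren\xe2\x80\x99t"))
print(unicode_to_ascii(b"\xe2\x80\x9cThings"))
```

Keeping the final encode strict makes unmapped characters fail loudly instead of silently disappearing, so gaps in the map are easy to spot.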

Wim Feijen · Accepted Answer · 2021-03-26 10:58:59Z


In Python 3 I would do it like this:

string = "\xe2\x80\x9cThings"
bytes_string = bytes(string, encoding="raw_unicode_escape")
happy_result = bytes_string.decode("utf-8", "strict")
print(happy_result)

No translation maps needed, just code :)
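
The snippet above works because, in Python 3, each `\xNN` escape in a `str` literal is a code point rather than a byte, and `raw_unicode_escape` writes code points below 256 back out as single bytes. The sketch below walks through the intermediate values; the variable names `raw`, `repaired` and `plain` are illustrative. Note that the repaired text still contains a curly quote, so getting a plain ASCII `"` takes one more translation step.

```python
string = "\xe2\x80\x9cThings"

# Each \xNN escape in a str literal is one code point, not one byte.
print([hex(ord(ch)) for ch in string[:3]])

# raw_unicode_escape writes code points below 256 as single bytes,
# which recovers the original UTF-8 byte sequence.
raw = string.encode("raw_unicode_escape")
print(raw)

# latin-1 maps code points 0-255 to the same byte values, so for this
# input it yields the same bytes.
assert raw == string.encode("latin-1")

# Decoding those bytes as UTF-8 gives the intended text (curly quote intact).
repaired = raw.decode("utf-8")
print(repaired == "\u201cThings")

# Mapping the curly quotes to ASCII is a separate step.
plain = repaired.translate({0x201C: '"', 0x201D: '"'})
print(plain)
```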


3 Comments

AKMalkadi Over a year ago

I was looking for this answer!

Sudipta Roy Over a year ago

Is there such a solution for Python 2.7.5?

Wim Feijen Over a year ago

Hi @SudiptaRoy, are you able to upgrade to Python 3.x? If so, I would strongly recommend that. I do not have Python 2.7.5 available, but I would guess that the following code works. No guarantees, but fingers crossed! string = u"\xe2\x80\x9cThings"; bytes_string = string.encode("raw_unicode_escape"); happy_result = bytes_string.decode("utf-8"); print(happy_result)

Oliver W. · Accepted Answer · 2015-01-17 12:58:13Z


You should provide a translation map that maps Unicode code points to other Unicode code points (the latter should be within the ASCII range if you want to re-encode to ASCII):

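yourstring = "weren\xe2\x80\x99t"
uni2ascii = {ord('\xe2\x80\x99'.decode('utf-8')): ord("'")}
# translate() returns a new string, so keep the result instead of discarding it
result = yourstring.decode('utf-8').translate(uni2ascii).encode('ascii')
print(result)  # prints: weren't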


1 Comment

Brana Over a year ago

I know that I can do this. But is there a ready-made map that can do this automatically?
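
The Python standard library does not appear to ship a complete map of this kind, but `unicodedata.normalize("NFKD", ...)` covers part of the job: it decomposes accented letters and compatibility characters such as superscript signs. Curly quotes and dashes have no such decomposition, so a small manual map is still needed for them. The following is a Python 3 sketch under those assumptions; `PUNCT` and `asciify` are hypothetical names, not an existing API.

```python
import unicodedata

# Hypothetical hand-written map for punctuation that NFKD leaves unchanged.
PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def asciify(text: str) -> str:
    """Best-effort ASCII version of text; unmapped characters are dropped."""
    text = text.translate(PUNCT)
    # NFKD splits e.g. U+00E9 into 'e' plus a combining accent and turns
    # U+207A (superscript plus) into '+'.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' discards whatever is still non-ASCII, such as combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("weren\u2019t"))
print(asciify("caf\u00e9 x\u207a"))
```

Because the last step uses `'ignore'`, characters outside both the map and the NFKD decompositions vanish silently, so this trades the strictness of the translate-only approach for convenience.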
