How to fix broken utf-8 encoding in Python?

Question

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I found a site that can do this: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

I tried to do the same in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

But this does not give the result I expect, even though the original string is UTF-8.

Note: the text is Vietnamese.

How can I resolve this? Is it a Windows Unicode issue or something else? How can I detect the encoding here?

Looks like it was encoded as utf-8 but interpreted as latin-1.
– ch3ka, Commented Oct 21, 2014 at 17:08

>>> "Niệm Bồ Tát (Thiền sư Nhất Hạnh)".encode('utf-8').decode('latin-1')
'Niá»\x87m Bá»\x93 TÃ¡t (Thiá»\x81n sÆ° Nháº¥t Háº¡nh)'
pretty close...
– ch3ka, Commented Oct 21, 2014 at 17:10

@BillLetson
>>> "Niệm Bồ Tát (Thiền sư Nhất Hạnh)".encode('utf-8').decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 23: character maps to <undefined>
It is not cp1252.
– giaosudau, Commented Oct 22, 2014 at 14:20

@sepdau yes, it is. Use .decode('cp1252', errors='ignore') and you will get the mangled string exactly. Whatever program mangled your string in the first place ignored errors, which is why you can't get the ề character back even with the accepted answer modified to use cp1252.
– Bill Letson, Commented Oct 22, 2014 at 14:48
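
The comments above can be checked with a short standard-library sketch. It is only an illustration of the suspected mangling, assuming the text was stored as UTF-8 and then decoded with a single-byte codec; the variable names are made up for this example.

```python
import unicodedata

# Normalize to NFC so the byte positions match the traceback in the comments.
original = unicodedata.normalize("NFC", "Niệm Bồ Tát (Thiền sư Nhất Hạnh)")
utf8_bytes = original.encode("utf-8")

# latin-1 maps every byte to a code point, so this step never fails
# and can always be reversed.
as_latin1 = utf8_bytes.decode("latin-1")

# cp1252 leaves a few bytes (0x81 among them) undefined, so a strict decode fails.
bad_position = None
try:
    utf8_bytes.decode("cp1252")
except UnicodeDecodeError as exc:
    bad_position = exc.start

# With errors="ignore" the undefined byte is silently dropped, which is
# why one character cannot be recovered afterwards.
as_cp1252 = utf8_bytes.decode("cp1252", errors="ignore")
dropped = len(utf8_bytes) - len(as_cp1252)

print(repr(as_latin1))
print(as_cp1252)
print(bad_position, dropped)
```

The string printed for the cp1252 case should look like the one in the question, with the ề reduced to two characters because its third byte was discarded.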

Dmytro Rostopira · Accepted Answer · 2016-10-06 19:42:29Z

36

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be easily installed using pip install ftfy

answered Oct 6, 2016 at 19:42

Dmytro Rostopira


6 Comments

rodrigorf Over a year ago

It worked for me to fix encoding problems in html using lxml libs. Amazingly worked in the first try. Thanks

Dmytro Rostopira Over a year ago

@rodrigorf thanks should go to lib creator =) Star repo

Malik A. Rumi Over a year ago

@DimaRostopira: +1 for ftfy. It is amazing. Thank you for mentioning it. I would never have found it otherwise. And yes, I did send a thank you to the creator already. ;-)

Nick K9 Over a year ago

This library is astounding. It Just Works™, and far better and faster than my hacks would have. Kudos to the lib creators, and for you, Dima, for telling us about it!

retatu Over a year ago

Thank you. I lost about a whole day looking for some answer like yours.


Jonathan Ballet · Accepted Answer · 2022-11-01 19:11:49Z

19

I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works (Python 3.x):

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
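
If the mangling codec is not known in advance, the same round trip can be wrapped in a small helper that tries a few candidates. This is a minimal sketch, not a general solution: fix_mojibake is a hypothetical name, the candidate list is an assumption, and text that happens to be valid under both interpretations could be changed by mistake.

```python
def fix_mojibake(text, codecs=("cp1252", "latin-1")):
    """Hypothetical helper: undo UTF-8 text that was decoded with a single-byte codec."""
    for codec in codecs:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec does not explain the text; try the next one
    return text  # nothing matched, so leave the input unchanged


original = "09. Bát Nhã Tâm Kinh"
# Rebuild the broken string from the answer instead of pasting it.
mangled = original.encode("utf-8").decode("latin-1")
fixed = fix_mojibake(mangled)
print(fixed)
```

Strings that are already correct fail the strict UTF-8 decode and are returned untouched, which makes the helper reasonably safe to apply to mixed data.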

edited Nov 1, 2022 at 19:11

answered Oct 21, 2014 at 17:27

Jonathan Ballet


7 Comments

Bill Letson Over a year ago

The encoding used to mangle this data was most likely cp1252, so using that instead of latin-1 will allow you to recover more (though not all) of the characters. With latin-1: Ni�m B� Tát (Thi�n sư Nhất Hạnh). With cp1252: Niệm Bồ Tát (Thi�n sư Nhất Hạnh).

giaosudau Over a year ago

I know how to do it in Python 3, but how about Python 2?

giaosudau Over a year ago

@multani it does not work when I decode:
>>> mystr = 'Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh)'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2021' in position 4: ordinal not in range(256)

Philippe Over a year ago

Could you maybe update your answer so that it works in python3?

Rafael Baldasso Audibert Over a year ago

For Python 3, you only need the latter part: s = mystr.encode('latin1').decode('utf8')
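
For the longer string from the question, the comments suggest cp1252 with dropped bytes rather than latin-1. The sketch below follows that assumption and shows a lossy recovery in Python 3: the damaged sequence becomes U+FFFD instead of raising. The mangled value is rebuilt in code so the example does not depend on how the broken text was pasted.

```python
import unicodedata

original = unicodedata.normalize("NFC", "Niệm Bồ Tát (Thiền sư Nhất Hạnh)")
# Assumed mangling: UTF-8 bytes decoded as cp1252, undefined bytes dropped.
mangled = original.encode("utf-8").decode("cp1252", errors="ignore")

# Every character came from cp1252, so re-encoding with it cannot fail.
raw = mangled.encode("cp1252")

# A strict decode fails because one byte of the ề sequence is missing.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" keeps everything that is still intact.
recovered = raw.decode("utf-8", errors="replace")
print(recovered)
```

The lost character has to be restored by hand or from another copy of the data, since the byte that identified it is gone.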


user7075574 · Accepted Answer · 2019-10-15 02:34:46Z

4

Try:

mystr.encode('ascii', 'ignore').decode('utf-8')

This encodes the string as ASCII, ignoring errors, and then decodes the result as UTF-8. It drops every non-ASCII character, so the accented letters are removed rather than repaired, but it's one approach.
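
To see what this approach does to the example from the question, here is a small comparison with the latin-1 round trip. It is only a sketch: the broken string is rebuilt in code, assuming the latin-1 mangling discussed in the other answers.

```python
import unicodedata

original = unicodedata.normalize("NFC", "09. Bát Nhã Tâm Kinh")
mangled = original.encode("utf-8").decode("latin-1")

# ASCII with errors ignored: every non-ASCII character is discarded.
stripped = mangled.encode("ascii", "ignore").decode("utf-8")

# Reversing the mangling instead restores the accented letters.
repaired = mangled.encode("latin-1").decode("utf-8")

print(stripped)  # 09. Bt Nh Tm Kinh
print(repaired)
```

The accented vowels disappear entirely in the first result, so this is only useful when losing those letters is acceptable.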

answered Oct 15, 2019 at 2:34

user7075574


boludoz · Accepted Answer · 2023-05-22 15:03:20Z

0

The correct method in Python 3.9.6 is as follows. This reproduces the mangling:

"string".encode('utf-8').decode('latin-1')

And this reverses it:

"string".encode('latin1').decode('utf8')

So, you can use:

'09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')

and the output is:

>>> '09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')
'09. Bát Nhã Tâm Kinh'

edited May 22, 2023 at 15:03

answered Aug 17, 2022 at 21:46

boludoz
