16

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

I started by trying this in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

but this does not give the result I expect: the original text is UTF-8, yet the string is still displayed incorrectly.

Note: the text is Vietnamese.

How can I resolve this? Is that Windows Unicode or something? How can I detect the encoding used here?

5
  • Looks like it was encoded as UTF-8 but interpreted as latin-1. Commented Oct 21, 2014 at 17:08
  • >>> "Niệm Bồ Tát (Thiền sư Nhất Hạnh)".encode('utf-8').decode('latin-1') gives 'Niá»\x87m Bá»\x93 TÃ¡t (Thiá»\x81n sÆ° Nháº¥t Háº¡nh)', pretty close... Commented Oct 21, 2014 at 17:10
  • @ch3ka, it's actually cp1252, a superset of latin-1. Commented Oct 21, 2014 at 18:09
  • @BillLetson "Niệm Bồ Tát (Thiền sư Nhất Hạnh)".encode('utf-8').decode('cp1252') fails with: Traceback (most recent call last): File "<stdin>", line 1, in <module>; File "/usr/lib/python3.4/encodings/cp1252.py", line 15, in decode: return codecs.charmap_decode(input,errors,decoding_table); UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 23: character maps to <undefined>. It is not cp1252. Commented Oct 22, 2014 at 14:20
  • @sepdau Yes, it is. Use .decode('cp1252', errors='ignore') and you will get the mangled string exactly. Whatever program mangled your string in the first place ignored errors, which is why you can't get the ề character back even with the accepted answer modified to use cp1252. Commented Oct 22, 2014 at 14:48
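
The comments above can be checked with a short Python 3 sketch. It assumes, as the commenters suggest, that the UTF-8 bytes were misread as cp1252 by a program that silently dropped undecodable bytes. The variable names are only illustrative, and the text is normalized to NFC so the byte positions match those quoted in the traceback.

```python
import unicodedata

# Precomposed (NFC) form, so that "ề" is the single code point U+1EC1.
original = unicodedata.normalize("NFC", "Niệm Bồ Tát (Thiền sư Nhất Hạnh)")
utf8_bytes = original.encode("utf-8")

# Strict cp1252 decoding fails: byte 0x81 (the last byte of "ề") has no
# mapping in Python's cp1252 codec.
try:
    utf8_bytes.decode("cp1252")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# Ignoring errors drops that byte, which yields the mangled string and
# explains why "ề" cannot be recovered afterwards.
mangled = utf8_bytes.decode("cp1252", errors="ignore")

print(strict_error)
print(mangled)
```

Only one byte is lost here, so the mangled string is exactly one character shorter than the UTF-8 byte sequence.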

4 Answers

36

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be installed with pip install ftfy.


6 Comments

It fixed my encoding problems in HTML using the lxml libs. Amazingly, it worked on the first try. Thanks.
@rodrigorf thanks should go to lib creator =) Star repo
@DimaRostopira: +1 for ftfy. It is amazing. Thank you for mentioning it. I would never have found it otherwise. And yes, I did send a thank you to the creator already. ;-)
This library is astounding. It Just Works™, and far better and faster than my hacks would have. Kudos to the lib creators, and for you, Dima, for telling us about it!
Thank you. I lost about a whole day looking for some answer like yours.
19

I'm not sure what you can do with this kind of data, but for the example in your original post, this works (Python 3.x):

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh

7 Comments

The encoding used to mangle this data was most likely cp1252, so using that instead of latin-1 will allow you to recover more (though not all) of the characters. Ni�m B� Tát (Thi�n sư Nhất Hạnh) vs Niệm Bồ Tát (Thi�n sư Nhất Hạnh)
I know how to do it in Python 3, but what about Python 2?
@multani It does not work when I decode: >>> mystr = 'Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh)' >>> s = mystr.decode('utf8').encode('latin1').decode('utf8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2021' in position 4: ordinal not in range(256)
Could you maybe update your answer so that it works in python3?
For Python 3, you only need the latter part: s = mystr.encode('latin1').decode('utf8')
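
Putting the comments above together, here is a minimal Python 3 sketch of the round trip using cp1252 instead of latin-1. The helper name repair_mojibake and its wrong_encoding parameter are hypothetical, not part of any library, and the sketch assumes the text really is UTF-8 that was mis-decoded with that single-byte encoding.

```python
def repair_mojibake(text, wrong_encoding="cp1252"):
    """Undo UTF-8 text that was mistakenly decoded as `wrong_encoding`."""
    # Step 1: turn the mis-decoded characters back into the original bytes.
    raw = text.encode(wrong_encoding)
    # Step 2: decode those bytes as UTF-8. Bytes lost during the original
    # mangling cannot be restored, so they show up as U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(repair_mojibake("09. BÃ¡t NhÃ£ TÃ¢m Kinh"))  # 09. Bát Nhã Tâm Kinh
print(repair_mojibake("Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh)"))
```

The second call recovers everything except the ề in Thiền, whose last byte was already dropped when the string was mangled. Passing wrong_encoding="latin1" raises UnicodeEncodeError on that string, because characters such as ‡ do not exist in latin-1.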
4

Try:

str.encode('ascii', 'ignore').decode('utf-8')

You're encoding the string as ASCII while ignoring errors, then decoding it as UTF-8. This drops every non-ASCII character instead of repairing it, so the accented letters are lost, but it's one approach.
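
To make that trade-off concrete, here is a small sketch applying this approach to the mangled string from the question. The variable names are illustrative only.

```python
mangled = "09. BÃ¡t NhÃ£ TÃ¢m Kinh"

# Every non-ASCII character is discarded; nothing is actually repaired.
stripped = mangled.encode("ascii", "ignore").decode("utf-8")

print(stripped)  # 09. Bt Nh Tm Kinh
```

The accented vowels disappear entirely, so this only suits cases where losing those letters is acceptable.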


0

The correct method in python 3.9.6 is:

"string".encode('utf-8').decode('latin-1')

"string".encode('latin1').decode('utf8')

So, you can use:

'09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')

and the output is:

>>> '09. BÃ¡t NhÃ£ TÃ¢m Kinh'.encode('latin1').decode('utf8')
'09. Bát Nhã Tâm Kinh'
