
I have the following function in Python, which takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):

def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new

It is supposed to "filter" all strings in a list that I read from a file using this:

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())

lines = [filt(l) for l in lines]

But I get this:

filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert 
  both arguments to Unicode - interpreting them as being unequal 
  new = new + dic.get(l, l)

and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?

  • Which version of Python? There are major differences in how UTF-8 is handled between versions. (Feb 17, 2017)
  • 2.7.12 (Ubuntu's version) (Feb 17, 2017)

2 Answers


You're mixing and matching Unicode strs and regular (byte) strs.

Use the io module to open and decode your text file to Unicode as it's read:

import io

with io.open("to-filter.txt", "r", encoding="utf-8") as f:

This assumes your to-filter.txt file is UTF-8 encoded.

You can also shrink the file-reading loop down to a list with just:

import io

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines is now a list of Unicode strings.
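A minimal, self-contained sketch of this approach (it writes a small UTF-8 sample file first so it can run anywhere; the file name and contents are just for illustration):

```python
import io

# Write a small UTF-8 sample file so the sketch is self-contained
# (the contents are just illustrative).
with io.open("to-filter.txt", "w", encoding="utf-8") as f:
    f.write(u"alçapão\ncoração\n")

# io.open decodes the bytes to Unicode as they are read,
# on Python 2 as well as Python 3.
with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
```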

Optional

It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalents. The easy way to do this is:

import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

What this does is:

  1. Decomposes each character into its component parts. For example, ã can be expressed as a single Unicode char (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
  2. Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are dropped.
  3. Decodes back to a Unicode str for convenience.
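Applied to the sample word from the question, the three steps above give:

```python
import unicodedata

def filt(word):
    # NFKD decomposition splits accented letters into base letter +
    # combining mark; encoding to ASCII with errors='ignore' then
    # drops the combining marks.
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

print(filt(u"alçapão"))  # alcapao
```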

TL;DR

Your code is now:

import io
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines = [filt(l) for l in lines]

Python 3.x

Although not strictly necessary (io.open is the built-in open on Python 3), you can drop the io. prefix and just use open().




The root of your problem is that you're not reading Unicode strings from the file, you're reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as suggested in the other answer. The second is to decode each byte string as you read it:

lines = []
with open("to-filter.txt", "r") as f:
    for line in f:
        lines.append(line.decode('utf-8').strip())

The third way is to use Python 3, which always reads text files into Unicode strings.
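For illustration, here is the decode step applied to a single hard-coded line (the byte string below is the UTF-8 encoding of "alçapão", standing in for a line read from the file):

```python
# A UTF-8 byte string, as Python 2's plain open() would yield it.
raw = b'al\xc3\xa7ap\xc3\xa3o\n'

# Decoding turns it into a Unicode string; strip() removes the newline.
line = raw.decode('utf-8').strip()
print(line)  # alçapão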

Finally, there's no need to write your own code to turn accented characters into plain ASCII; there's a third-party package, unidecode, that does it for you:

from unidecode import unidecode
print(unidecode(line))

Comment:

The unidecode module converts a string like '\u0646\u0638\u0627\u0631\u062a' into one like 'jhdy pskhgwy nyzhy mrwz mrdm nyst', instead of 'و مذهب شیعه امیرالمونین علیه السلام اصل اسلام است و آمریکا قصد براندازی آن و اختلاف بین مسلمین را دارد'.
