
I have the following function in Python, which takes a string as an argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):

def filt(word):
    dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
    new = ''
    for l in word:
        new = new + dic.get(l, l)
    return new

It is supposed to "filter" all strings in a list that I read from a file using this:

lines = []
with open("to-filter.txt","r") as f:
    for line in f:
        lines.append(line.strip())

lines = [filt(l) for l in lines]

But I get this:

filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert 
  both arguments to Unicode - interpreting them as being unequal 
  new = new + dic.get(l, l)

and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?

  • Which version of Python? There are major differences in how UTF-8 is handled between versions. (Feb 17, 2017)
  • 2.7.12 (Ubuntu's version) (Feb 17, 2017)

2 Answers


You're mixing and matching Unicode strs and regular (byte) strs.

Use the io module to open and decode your text file to Unicode as it's read:

import io

with io.open("to-filter.txt", "r", encoding="utf-8") as f:

This assumes your to-filter.txt file is UTF-8 encoded.

You can also shrink the file-reading loop down to a list with just:

import io

with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines is now a list of Unicode strings.
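A minimal, self-contained sketch of this approach (it writes a small UTF-8 sample file first so it can run anywhere; the file name and contents are just for illustration):

```python
import io

# Write a small UTF-8 sample file so the sketch is self-contained
# (the contents are just illustrative).
with io.open("to-filter.txt", "w", encoding="utf-8") as f:
    f.write(u"alçapão\ncoração\n")

# io.open decodes the bytes to Unicode as they are read,
# on Python 2 as well as Python 3.
with io.open("to-filter.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
```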

Optional

It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalents. The easy way to do this is:

import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

What this does is:

  1. Decomposes each character into its component parts. For example, ã can be expressed as a single Unicode char (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
  2. Encodes the component parts to ASCII. Non-ASCII parts (those with code points greater than U+007F) are dropped.
  3. Decodes back to a Unicode str for convenience.
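Applied to the sample word from the question, the three steps above give:

```python
import unicodedata

def filt(word):
    # NFKD decomposition splits accented letters into base letter +
    # combining mark; encoding to ASCII with errors='ignore' then
    # drops the combining marks.
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

print(filt(u"alçapão"))  # alcapao
```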

TL;DR

Your code is now:

import io
import unicodedata
def filt(word):
    return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')

with io.open("to-filter.txt","r", encoding="utf-8") as f:
    lines = f.read().splitlines()

lines = [filt(l) for l in lines]

Python 3.x

Although not strictly necessary (io.open is the built-in open on Python 3), you can drop the io. prefix and just use open().




The root of your problem is that you're not reading Unicode strings from the file, you're reading byte strings. There are three ways to fix this. The first is to open the file with the io module, as suggested in the other answer. The second is to decode each byte string as you read it:

lines = []
with open("to-filter.txt", "r") as f:
    for line in f:
        lines.append(line.decode('utf-8').strip())

The third way is to use Python 3, which always reads text files into Unicode strings.
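For illustration, here is the decode step applied to a single hard-coded line (the byte string below is the UTF-8 encoding of "alçapão", standing in for a line read from the file):

```python
# A UTF-8 byte string, as Python 2's plain open() would yield it.
raw = b'al\xc3\xa7ap\xc3\xa3o\n'

# Decoding turns it into a Unicode string; strip() removes the newline.
line = raw.decode('utf-8').strip()
print(line)  # alçapão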

Finally, there's no need to write your own code to turn accented characters into plain ASCII; there's a third-party package, unidecode, that does it for you:

from unidecode import unidecode
print(unidecode(line))

Comment:

The unidecode module converts a string like '\u0646\u0638\u0627\u0631\u062a' into one like 'jhdy pskhgwy nyzhy mrwz mrdm nyst', instead of 'و مذهب شیعه امیرالمونین علیه السلام اصل اسلام است و آمریکا قصد براندازی آن و اختلاف بین مسلمین را دارد'.
