
I have files whose lines contain Unicode escape sequences, such as D\u00f3nde est\u00e1s. I would like to check whether each word contains only characters that belong to the currently set locale.

The code below only partly works. The string is correctly transformed to Dónde estás and wordmatch matches each word, but the match ignores the locale setting. For example, with the locale set to en_US it still matches both words, even though they contain the characters ó and á.

Using re.LOCALE instead of re.UNICODE does not work either: with that flag, neither word matches the wordmatch regular expression any more.

import re
import locale

locale.setlocale(locale.LC_ALL,'en_US')
wordmatch=re.compile(r'^\w*$',re.UNICODE)

line="D\u00f3nde est\u00e1s"
line=line.decode('unicode_escape')

for word in line.split():
    if wordmatch.match(word):
        print "Matched "+word
    else:
        print "No match "+word

1 Answer


Changing the locale does not necessarily change the encoding, and the encoding for en_US is not necessarily ASCII. On my system, for example, it is ISO-8859-1, an encoding in which ó and á are valid characters. This could explain why re.LOCALE doesn't complain about these characters.
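As a small illustration of that point, here is a sketch written for Python 3 (unlike the Python 2 snippets in this thread). It shows the same word being encodable in ISO-8859-1 but not in ASCII. The helper name `can_encode` is made up for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

word = "D\u00f3nde"  # "Dónde"
print(can_encode(word, "iso-8859-1"))  # ó is part of ISO-8859-1 (Latin-1)
print(can_encode(word, "ascii"))       # ó is outside ASCII
```

So whether a character counts as acceptable depends on the encoding you test against, not on the locale name itself.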

To work with encodings, I would rather use the encode method than regular expressions:

import locale

line="D\u00f3nde est\u00e1s"
line=line.decode('unicode_escape')

# get current encoding, or set to "ascii" if you want to be more restrictive
pref_encoding = locale.getpreferredencoding()

for word in line.split():
    try:
        # encoding fails if the word has a character outside pref_encoding
        w = word.encode(pref_encoding)
    except UnicodeEncodeError as e:
        print "This word contains unacceptable characters: ", word
        break
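For Python 3, the same idea might look roughly like the sketch below. It assumes the input line holds literal backslash escapes and otherwise only ASCII characters, so it decodes them by going through bytes, since Python 3 strings have no decode method. The function name `unencodable_words` and the extra word "amigo" are invented for illustration. Unlike the loop above, it collects every offending word instead of stopping at the first one.

```python
import locale

def unencodable_words(line, encoding=None):
    """Return the words in line that cannot be encoded with encoding.

    Falls back to the locale's preferred encoding when none is given.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    bad = []
    for word in line.split():
        try:
            word.encode(encoding)
        except UnicodeEncodeError:
            bad.append(word)
    return bad

# Raw text as it would appear in the file, with literal \uXXXX escapes.
raw = r"D\u00f3nde est\u00e1s amigo"
line = raw.encode("ascii").decode("unicode_escape")

print(unencodable_words(line, "ascii"))
print(unencodable_words(line, "iso-8859-1"))
```

Passing the encoding explicitly keeps the check reproducible across machines, because the preferred encoding of a locale such as en_US can differ from one system to another.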