Python unicode encoding for regular expression

Question

I have files whose lines contain Unicode escape sequences, such as D\u00f3nde est\u00e1s. I would like to check whether each word contains only characters that belong to the currently set locale.

This code does not completely work. The string seems to be correctly transformed to Dónde estás and wordmatch matches each word, but it does not consider the locale setting. E.g. if I set the locale to en_US it still matches both words even though they contain ó and á characters.

Using re.LOCALE instead of re.UNICODE does not seem to work either: with it, neither word matches the wordmatch regular expression any more.

import re
import locale

locale.setlocale(locale.LC_ALL,'en_US')
wordmatch=re.compile(r'^\w*$',re.UNICODE)

line="D\u00f3nde est\u00e1s"
line=line.decode('unicode_escape')

for word in line.split():
    if wordmatch.match(word):
        print "Matched "+word
    else:
        print "No match "+word

Cilyan · Accepted Answer · 2014-02-13 21:34:55Z

Changing the locale does not directly change the encoding, and the encoding for en_US is not necessarily ASCII. On my system, for example, it is ISO-8859-1, an encoding in which ó and á are valid characters. This could explain why re.LOCALE doesn't complain about these characters.
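To make the point concrete, here is a minimal sketch, written for Python 3 (an assumption; the snippets in this thread are Python 2). It prints the encoding associated with the current locale, which varies from system to system, and then shows that the same word encodes cleanly to ISO-8859-1 but not to ASCII. The variable names are illustrative only.

```python
import locale

# The encoding associated with the current locale; the value depends on the system.
print(locale.getpreferredencoding())

word = "D\u00f3nde"  # Dónde

# ISO-8859-1 has a code for ó, so encoding succeeds.
latin1_bytes = word.encode("iso-8859-1")

# ASCII has no code for ó, so the same call raises UnicodeEncodeError.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(latin1_bytes, ascii_ok)
```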

To check whether a word fits an encoding, I would use the encode method rather than regular expressions:

import locale

line="D\u00f3nde est\u00e1s"
line=line.decode('unicode_escape')

# get current encoding, or set to "ascii" if you want to be more restrictive
pref_encoding = locale.getpreferredencoding()

for word in line.split():
    try:
        w = word.encode(pref_encoding)
    except UnicodeEncodeError as e:
        print "This word contains unacceptable characters: ", word
        break
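For readers on Python 3, the same idea could be sketched roughly as follows. This is an adaptation under stated assumptions, not the code from the answer: the helper name unencodable_words is hypothetical, the sample line has an extra plain word added for contrast, and the helper collects every offending word instead of stopping at the first one.

```python
import locale


def unencodable_words(line, encoding):
    """Return the words in line that cannot be encoded with encoding."""
    bad = []
    for word in line.split():
        try:
            word.encode(encoding)
        except UnicodeEncodeError:
            bad.append(word)
    return bad


# Text as it would be read from the file: the backslashes are literal characters.
raw = r"D\u00f3nde est\u00e1s amigo"

# Python 3 str has no decode(); go through ASCII bytes to interpret the escapes.
line = raw.encode("ascii").decode("unicode_escape")

print(unencodable_words(line, "ascii"))
print(unencodable_words(line, "iso-8859-1"))

# Result depends on the system's locale settings.
print(unencodable_words(line, locale.getpreferredencoding()))
```

Passing "ascii" is the restrictive choice mentioned in the comment above; passing the preferred encoding ties the check to the system's locale settings.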
