I have files with lines with unicode encodings like D\u00f3nde est\u00e1s.
I would like to check each word if it contains only characters from the set locale.
This code does not completely work. The string seems to be correctly transformed to Dónde estás and wordmatch matches each word, but it does not consider the locale setting. E.g. if I set the locale to en_US it still matches both words even though they contain ó and á characters.
Using re.LOCALE instead of re.UNICODE also does not seem to work, and both words no longer match the wordmatch regular expression.
import re
import locale
locale.setlocale(locale.LC_ALL,'en_ES')
wordmatch=re.compile(r'^\w*$',re.UNICODE)
line="D\u00f3nde est\u00e1s"
line=line.decode('unicode_escape')
for word in line.split():
if wordmatch.match(word):
print "Matched "+word
else:
print "No match "+word