Python regex with unicode characters bug?

Question

Long story short:

>>> import re
>>> re.compile(r"\w*").match(u"Français")
<_sre.SRE_Match object at 0x1004246b0>
>>> re.compile(r"^\w*$").match(u"Français")
>>> re.compile(r"^\w*$").match(u"Franais")
<_sre.SRE_Match object at 0x100424780>
>>>

Why doesn't the regex with ^ and $ match the string containing Unicode characters? As far as I understand, ^ stands for the beginning of the string (or line) and $ for the end of it.

kennytm · Accepted Answer · 2010-08-31 08:36:54Z

5

You need to specify the UNICODE flag (re.U); otherwise \w is equivalent to [a-zA-Z0-9_], which does not include the character 'ç'.

>>> re.compile(r"^\w*$", re.U).match(u"Fran\xe7ais")
<_sre.SRE_Match object at 0x101474168>
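
For readers on Python 3, the same effect can be reproduced with a small sketch. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used here to stand in for the flagless Python 2 behaviour described above. The variable names are illustrative only.

```python
import re

text = "Fran\xe7ais"  # the string "Français", with 'ç' written as an escape

# re.ASCII restricts \w to [a-zA-Z0-9_], mimicking Python 2 without re.U.
# The anchored pattern cannot get past 'ç', so there is no match.
ascii_anchored = re.compile(r"^\w*$", re.ASCII).match(text)

# re.UNICODE (already the default for str patterns in Python 3)
# lets \w match 'ç', so the whole string matches.
unicode_anchored = re.compile(r"^\w*$", re.UNICODE).match(text)

print(ascii_anchored is None)        # True
print(unicode_anchored is not None)  # True
```

On Python 2, as in the question, the roles are reversed: the restricted behaviour is the default and re.U has to be passed explicitly.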


3 Comments

ak. Over a year ago

Why does this work then: >>> re.compile(r"\w*").match(u"Français")?

kennytm Over a year ago

@ak: Are you sure the match returns Français rather than just Fran? Note that without the $, the regex is not required to match through to the end of the string.

Turtle Over a year ago

\w* will succeed against absolutely any string, because * matches 0 or more times, so an empty match is allowed.
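
The two comments above can be checked by inspecting what the unanchored pattern actually captured. This is a sketch under the same assumption as before: Python 3, with re.ASCII standing in for the Python 2 default. The names partial and empty are illustrative.

```python
import re

text = "Fran\xe7ais"  # "Français"

# Unanchored, ASCII-only \w*: the match succeeds but stops before 'ç'.
partial = re.compile(r"\w*", re.ASCII).match(text)
print(partial.group())  # Fran
print(partial.end())    # 4

# \w* also succeeds on a string with no word characters at all,
# because zero repetitions is a valid (empty) match.
empty = re.compile(r"\w*").match("!!!")
print(repr(empty.group()))  # ''
```

So a match object being returned only says that some prefix, possibly an empty one, matched. Calling group() shows how far the match actually went.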
