Is it possible to get Unicode-aware regex matching in Python 2?

This code works in Python 3:

In [1]: import re

In [2]: re.split(r'\W+', 'Les Misérables')
Out[2]: ['Les', 'Misérables']

But it does not work in Python 2:

In [1]: import re

In [2]: re.split(r'\W+', u'Les Misérables')
Out[2]: [u'Les', u'Mis', u'rables']

This does not work either (tested on Linux with es_ES.UTF-8 locale):

In [1]: import locale

In [2]: locale.setlocale(locale.LC_ALL, 'es_ES.UTF-8')
Out[2]: 'es_ES.UTF-8'

In [3]: import re

In [4]: re.split(ur'\W+', u'Les Misérables', re.U|re.L)
Out[4]: [u'Les', u'Mis', u'rables']

Is there any way to get regex working with Unicode in Python 2?

Note: The question is about getting Unicode-aware matches. I know I can rewrite the above regex to separate words using only ASCII classes.

1 Answer

Your mistake is passing the flags in the wrong position: the third positional parameter of re.split is maxsplit, so the flags must be the fourth.

>>> import re
>>> re.split(r'(?u)\W+', u'Les Misérables')
[u'Les', u'Mis\xe9rables']
>>> re.split(ur'\W+', u'Les Misérables', 0, re.U)
[u'Les', u'Mis\xe9rables']
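
To see why the misplaced flags failed silently instead of raising an error, here is a minimal sketch. It assumes the documented signature re.split(pattern, string, maxsplit=0, flags=0); the names misplaced and parts are only illustrative.

```python
import re

# re.U and re.L are integer-valued flags. Placed in the third positional
# slot of re.split, their combined value is read as maxsplit, not as flags.
misplaced = int(re.U | re.L)  # 32 | 4

# maxsplit simply caps the number of splits; the rest of the string is
# returned unsplit as the last element.
parts = re.split(r'\W+', 'one two three four', maxsplit=2)
print(misplaced, parts)
```

So the call in the question asked for at most 36 splits with no flags set, which is why \W still treated "é" as a non-word character.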

To avoid these issues I'd recommend using inline flags (like (?u) above).
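
As a rough illustration, the inline flag and the flags keyword give the same result. This sketch is written to run on both Python 2.7 and Python 3, so it uses a u'' literal with an escape instead of ur''; the variable names are illustrative, and the flags keyword of re.split assumes Python 2.7 or later.

```python
import re

text = u'Les Mis\u00e9rables'  # \u00e9 is "é"

# Inline flag: the Unicode setting travels with the pattern itself,
# so argument order in the call cannot get it wrong.
inline = re.split(r'(?u)\W+', text)

# Keyword argument: explicit, and cannot be mistaken for maxsplit.
keyword = re.split(r'\W+', text, flags=re.UNICODE)

print(inline == keyword)
```

On Python 3 the flag is redundant for str patterns, since they are Unicode-aware by default.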

2 Comments

Right, I was mispositioning the Unicode flag. Didn't know the (?u) expression. Seems useful.
You could also use the flags=re.U keyword argument.
