Python 2 vs Python 3 Regex matching behavior

Question

Python 3

>>> import re
>>> P = re.compile(r'[\s\t]+')
>>> re.sub(P, ' ', '\xa0 haha')
' haha'

Python 2

>>> import re
>>> P = re.compile(r'[\s\t]+')
>>> re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'

I want the Python 3 behavior, but in Python 2 code. Why does the pattern fail to match whitespace-like code points such as \xa0 in Python 2, yet match them in Python 3?

Martijn Pieters · Accepted Answer · 2015-01-22 12:37:43Z


Use the re.UNICODE flag:

>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'

Without the flag, \s matches only ASCII whitespace; \xa0 (NO-BREAK SPACE) is not part of the ASCII standard (it is a Latin-1 codepoint).
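For code that has to run under both Python 2 and Python 3, one option is to compile the pattern with the flag once and wrap it in a small helper. This is only a sketch: normalize_whitespace and _WHITESPACE are hypothetical names chosen for illustration, not part of the re module.

```python
import re

# re.UNICODE is needed on Python 2 and is redundant but accepted on Python 3,
# where it is already the default for str patterns.
# (_WHITESPACE and normalize_whitespace are illustrative names.)
_WHITESPACE = re.compile(u'\\s+', flags=re.UNICODE)


def normalize_whitespace(text):
    """Collapse each run of Unicode whitespace in `text` to one ASCII space."""
    return _WHITESPACE.sub(u' ', text)


cleaned = normalize_whitespace(u'\xa0 haha')
print(repr(cleaned))
```

The pattern is written as u'\\s+' rather than as a raw literal because the ur'' prefix is not valid syntax on Python 3.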

The re.UNICODE flag is the default for str patterns in Python 3; use re.ASCII there if you want the Python 2 (bytestring) behaviour.
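To see the opposite direction on Python 3, the following sketch compares the default matching with re.ASCII on the input from the question. The variable names are illustrative only.

```python
import re

text = '\xa0 haha'

# Default on Python 3 for str patterns: \s is Unicode-aware and matches U+00A0.
unicode_ws = re.compile(r'\s+')
# re.ASCII restricts \s to the ASCII whitespace characters [ \t\n\r\f\v].
ascii_ws = re.compile(r'\s+', flags=re.ASCII)

unicode_result = unicode_ws.sub(' ', text)
ascii_result = ascii_ws.sub(' ', text)

print(repr(unicode_result))  # ' haha'
print(repr(ascii_result))    # '\xa0 haha'
```

With re.ASCII only the plain space is matched (and replaced by itself), so the \xa0 survives, which mirrors the Python 2 result shown in the question.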

Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following will match the exact same input:

P = re.compile(r'\s+', flags=re.UNICODE)
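As a quick sanity check of that equivalence, this sketch runs both patterns over a few sample strings (the samples are made up for illustration) and compares the results:

```python
import re

combined = re.compile(r'[\s\t]+', flags=re.UNICODE)
simple = re.compile(r'\s+', flags=re.UNICODE)

# Illustrative inputs: a no-break space, a tab, mixed whitespace, none at all.
samples = [u'\xa0 haha', u'a\tb', u'a \t\n b', u'no-whitespace']

# Pair up the output of both patterns for every sample.
results = [(combined.sub(u' ', s), simple.sub(u' ', s)) for s in samples]
all_equal = all(a == b for a, b in results)

# \s on its own already matches a tab character.
tab_matches = simple.match(u'\t') is not None

print(all_equal, tab_matches)
```

A handful of samples is not a proof, but it illustrates that adding \t to the class does not change what is matched.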

edited Jan 22, 2015 at 12:37

answered Jan 22, 2015 at 12:08

