Python 3

>>> import re
>>> P = re.compile(r'[\s\t]+')
>>> re.sub(P, ' ', '\xa0 haha')
' haha'

Python 2

>>> import re
>>> P = re.compile(r'[\s\t]+')
>>> re.sub(P, u' ', u'\xa0 haha')
u'\xa0 haha'

I want the Python 3 behavior in Python 2 code. Why does the pattern fail to match whitespace-like code points such as \xa0 in Python 2, while matching them correctly in Python 3?

1 Answer

Use the re.UNICODE flag:

>>> import re
>>> P = re.compile(r'[\s\t]+', flags=re.UNICODE)
>>> re.sub(P, u' ', u'\xa0 haha')
u' haha'

Without the flag, only ASCII whitespace is matched; \xa0 is not part of the ASCII standard (it is a Latin-1 codepoint).

In Python 3, Unicode matching (re.UNICODE) is the default for str patterns; use re.ASCII if you want the Python 2 (bytestring) behaviour.
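
As a rough illustration of the point above, the sketch below runs under Python 3 and compares the default matching with re.ASCII. The variable names (text, unicode_result, ascii_result) are made up for this example.

```python
import re

text = '\xa0 haha'  # non-breaking space, ASCII space, then a word

# Python 3 default for str patterns: \s is Unicode-aware,
# so the non-breaking space is collapsed along with the ASCII space.
unicode_result = re.sub(r'\s+', ' ', text)

# With re.ASCII, \s only covers ASCII whitespace, so \xa0 is left
# untouched. This mirrors the Python 2 result shown in the question.
ascii_result = re.sub(r'\s+', ' ', text, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```

The first call should collapse both characters into a single space, while the second should keep the \xa0 in place.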

Note that there is no point in including \t in the character class; \t is already part of the \s class, so the following will match the exact same input:

P = re.compile(r'\s+', flags=re.UNICODE)
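
To check the equivalence claimed above, a small sketch can run both patterns over a few inputs and compare the results. The sample strings and the names WITH_TAB, WITHOUT_TAB, samples and results are assumptions chosen for illustration; the u'' prefixes are there so the same code should also run under Python 2.

```python
import re

# The original character class and the simplified pattern.
WITH_TAB = re.compile(r'[\s\t]+', flags=re.UNICODE)
WITHOUT_TAB = re.compile(r'\s+', flags=re.UNICODE)

# Hypothetical inputs mixing tabs, newlines and a non-breaking space.
samples = [u'\xa0 haha', u'a\tb', u'a \t\n b', u'no-whitespace']

# Pair up the output of each pattern for every sample.
results = [(WITH_TAB.sub(u' ', s), WITHOUT_TAB.sub(u' ', s)) for s in samples]

for with_tab, without_tab in results:
    print(repr(with_tab), repr(without_tab), with_tab == without_tab)
```

Every pair should compare equal, since \t is already a member of \s.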