I can use the ur'something' and the re.U flag in Python2 to compile a regex pattern, e.g.:
$ python2
Python 2.7.13 (default, Dec 18 2016, 07:03:39)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(ur'(«)', re.U)
>>> s = u'«abc «def«'
>>> re.sub(pattern, r' \1 ', s)
u' \xab abc  \xab def \xab '
>>> print re.sub(pattern, r' \1 ', s)
 « abc  « def « 
In Python3, I can avoid the u'something' and even the re.U flag:
$ python3
Python 3.5.2 (default, Oct 11 2016, 04:59:56)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(r'(«)')
>>> s = u'«abc «def«'
>>> print( re.sub(pattern, r' \1 ', s))
 « abc  « def « 
But the goal is to write the regex such that it supports both Python2 and Python3. And doing ur'something' in Python3 would result in a syntax error:
>>> pattern = re.compile(ur'(«)', re.U)
File "<stdin>", line 1
pattern = re.compile(ur'(«)', re.U)
^
SyntaxError: invalid syntax
Since it's a syntax error, even checking versions before declaring the pattern wouldn't work in Python3:
>>> import sys
>>> _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
File "<stdin>", line 1
_pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
^
SyntaxError: invalid syntax
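To make the point about parse-time failure concrete, here is a minimal sketch that keeps the guarded expression in a string and asks the interpreter whether it compiles. The helper name `parses` is hypothetical, introduced only for this illustration; the expected results in the comments assume Python 3.

```python
# -*- coding: utf-8 -*-
import sys


def parses(source):
    """Hypothetical helper: True if `source` compiles as an expression here."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False


# The version check from above, held in a string so this file itself still parses.
guarded = "r'(«)' if sys.version_info[0] == 3 else ur'(«)'"

print(parses("r'(«)'"))  # True on Python 3
print(parses(guarded))   # False on Python 3: the ur'' branch is rejected before it can run
```

The whole expression is rejected at compile time, so the `sys.version_info` check never gets a chance to pick a branch.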
How can I write a Unicode regex that supports both Python2 and Python3?
In this simple case, r' ' could easily be replaced by u' ' by dropping the raw string. But there are complicated regexes that more or less require the r' ' for sanity's sake, e.g.
re.sub(re.compile(r'([^\.])(\.)([\]\)}>"\'»]*)\s*$', re.U), r'\1 \2\3 ', s)
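To show what dropping the raw prefix costs, the sketch below writes that same pattern as a u' ' literal with every regex backslash doubled. The sample string `s` is made up for illustration, and the equality check in the comment assumes Python 3, where both literals are text.

```python
# -*- coding: utf-8 -*-
import re

# Raw form, exactly as in the example above.
raw_pattern = r'([^\.])(\.)([\]\)}>"\'»]*)\s*$'

# Same pattern as a u'' literal: each backslash meant for the regex engine
# must be doubled, and the single quote needs its own escape on top of that.
escaped_pattern = u'([^\\.])(\\.)([\\]\\)}>"\\\'»]*)\\s*$'

s = u'Hello world.»'  # made-up sample input
result = re.sub(re.compile(escaped_pattern, re.U), r'\1 \2\3 ', s)

print(raw_pattern == escaped_pattern)  # True on Python 3
print(result)
```

The u' ' form parses on Python 2 and on Python 3.3+, but it is noticeably harder to read and to edit safely, which is the sanity problem mentioned above.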
So the solution should keep the raw string r' ' usage unless there is another way around it. Note that using from __future__ import unicode_literals is undesirable, since it would cause tons of other problems, especially in other parts of the code base that I work with; see http://python-future.org/unicode_literals.html
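One direction that might fit these constraints, offered as a sketch rather than a confirmed fix: keep the r' ' literal and pass it through a small helper that decodes it only when it is a byte string. The name `to_text` is hypothetical, and the approach assumes the source file declares a UTF-8 encoding so that Python 2 stores the non-ASCII characters as UTF-8 bytes.

```python
# -*- coding: utf-8 -*-
import re


def to_text(pattern, encoding='utf-8'):
    """Hypothetical helper: return `pattern` as text on Python 2 and 3.

    On Python 2 a plain r'' literal is a byte string (bytes is an alias of
    str there), so it is decoded. On Python 3 it is already text and is
    returned unchanged.
    """
    if isinstance(pattern, bytes):
        return pattern.decode(encoding)
    return pattern


pattern = re.compile(to_text(r'(«)'), re.U)
s = u'«abc «def«'
result = re.sub(pattern, r' \1 ', s)
print(result)
```

Decoding leaves the backslashes untouched, so the regex escapes inside the raw string survive as written. The cost is one wrapper call per pattern, which may or may not be acceptable in a code base with many patterns.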
The specific reason the code base discourages the unicode_literals import but keeps the r' ' notation is that the code is filled with raw-string patterns, and changing each one of them would be extremely painful.
u'(«)' should work fine ...
re.escape can replace the raw string usage? Something like re.compile(re.escape(u'([^\.])(\.)([\]\)}>"\'»]*)\s*$'), re.U)?
re.escape can really help you. It's a pity they don't support the ur prefix, but I guess they wanted to limit the number of prefixes, and after all, Python 2 doesn't have to be supported for eternity.
Why don't you use the future unicode_literals on a module basis, e.g. only for those files that actually contain a lot of complex regexes? For the rest, it seems you'll have to double the backslashes...
nltk.tokenize.__init__ has, like, 6 string literals (all regex patterns), three of which are already unicode. Leaves you three strings to test. (unicode_literals doesn't affect anything that is imported.)
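On the re.escape suggestion, a quick check may help. re.escape escapes the regex metacharacters themselves, so the result matches the pattern text literally instead of behaving as a pattern. The sample strings below are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

plain = re.compile(u'(«)', re.U)
escaped = re.compile(re.escape(u'(«)'), re.U)

# The unescaped pattern has one capturing group and matches a bare guillemet.
print(plain.groups, bool(plain.search(u'«abc')))

# The escaped pattern has no groups: it only matches the literal text "(«)".
print(escaped.groups,
      bool(escaped.search(u'«abc')),
      bool(escaped.search(u'x(«)y')))
```

So re.escape is useful for embedding literal text in a pattern, but it does not appear to be a substitute for the raw prefix.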