Writing unicode regex for both Python2 and Python3

Question

I can use the ur'something' and the re.U flag in Python2 to compile a regex pattern, e.g.:

$ python2
Python 2.7.13 (default, Dec 18 2016, 07:03:39) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(ur'(«)', re.U)
>>> s = u'«abc «def«'
>>> re.sub(pattern, r' \1 ', s)
u' \xab abc  \xab def \xab '
>>> print re.sub(pattern, r' \1 ', s)
 « abc  « def «

In Python3, I can avoid the u'something' and even the re.U flag:

$ python3
Python 3.5.2 (default, Oct 11 2016, 04:59:56) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = re.compile(r'(«)')
>>> s = u'«abc «def«'
>>> print( re.sub(pattern, r' \1 ', s))
 « abc  « def «

But the goal is to write the regex such that it supports both Python2 and Python3. And doing ur'something' in Python3 would result in a syntax error:

>>> pattern = re.compile(ur'(«)', re.U)
  File "<stdin>", line 1
    pattern = re.compile(ur'(«)', re.U)
                               ^
SyntaxError: invalid syntax

Since it's a syntax error, even checking versions before declaring the pattern wouldn't work in Python3:

>>> import sys
>>> _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
  File "<stdin>", line 1
    _pattern = r'(«)' if sys.version_info[0] == 3 else ur'(«)'
                                                             ^
SyntaxError: invalid syntax
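
A minimal sketch of why the version check cannot help: the parser rejects the whole statement before any branch is evaluated. The built-in compile() makes this visible without crashing the script. The names SNIPPET and parses are hypothetical, introduced only for illustration.

```python
import sys

# Hypothetical snippet: source text using the Python 2-only ur'' prefix.
SNIPPET = "pattern = ur'(x)'"


def parses(source):
    """Return True if `source` compiles under the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# ur'' is valid syntax on Python 2 only, so this reports False on Python 3.
print(sys.version_info[0], parses(SNIPPET))
```

Because the failure happens at parse time, any fix has to keep the ur'' literal out of the source that Python 3 reads.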

How can I write a unicode regex that supports both Python2 and Python3?

In this case, r' ' could easily be replaced by u' ' by dropping the raw string.

But there are complicated regexes that more or less require r' ' for sanity's sake, e.g.

re.sub(re.compile(r'([^\.])(\.)([\]\)}>"\'»]*)\s*$', re.U), r'\1 \2\3 ', s)

So the solution should keep the raw string r' ' usage unless there are other ways to get around it. Note that using unicode_literals from __future__ is undesirable, since it would cause tons of other problems, especially in other parts of the code base that I work with; see http://python-future.org/unicode_literals.html

The specific reason the code base discourages the unicode_literals import but uses the r' ' notation is that it is filled with raw strings, and changing each one of them would be extremely painful, e.g.

Maybe I'm missing something, but for this case, it doesn't seem like you actually need a raw string... IOW, u'(«)' should work fine ...
– mgilson, Commented Apr 12, 2017 at 3:12
Could re.escape replace the raw string usage? Something like re.compile(re.escape(u'([^\.])(\.)([\]\)}>"\'»]*)\s*$'), re.U)?
– alvas, Commented Apr 12, 2017 at 3:19
I don't think re.escape can really help you. It's a pity they don't support the ur prefix, but I guess they wanted to limit the number of prefixes, and after all, Python 2 doesn't have to be supported for eternity. – Why don't you use the future unicode_literals on a module basis, e.g. only for those files that actually contain a lot of complex regexes? For the rest, it seems you'll have to double the backslashes...
– lenz, Commented Apr 14, 2017 at 21:48
nltk.tokenize.__init__ has, like, 6 string literals (all regex patterns), three of which are already unicode. Leaves you three strings to test. (unicode_literals doesn't affect anything that is imported.)
– lenz, Commented Apr 15, 2017 at 8:48
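
The per-module suggestion in the comments above could look roughly like this. It is only a sketch, assuming a hypothetical module that holds nothing but the regex-heavy code; PATTERN and pad_guillemets are illustrative names, not part of any existing code base.

```python
# -*- coding: utf-8 -*-
# Hypothetical module containing only the regex-heavy code.
from __future__ import unicode_literals

import re

# With unicode_literals, r'...' is a raw *unicode* literal on Python 2 too,
# so no ur'' prefix is needed and the same line also parses on Python 3.
PATTERN = re.compile(r'(«)', re.U)


def pad_guillemets(text):
    """Surround every « with single spaces."""
    return PATTERN.sub(r' \1 ', text)


print(pad_guillemets('«abc «def«'))
```

The import only changes how literals in this one file are read; modules that import it are unaffected.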

cco · Accepted Answer · 2017-04-12 03:21:53Z

1

Do you really need raw strings? For your example, a unicode string is needed, but not a raw string. Raw strings are a convenience, but not required - just double any \ you would use in the raw string and use plain unicode.
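
As a rough illustration of the doubling approach, here is the question's longer pattern written as a plain unicode literal. The name DOUBLED is hypothetical, and the source file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
import re

# A doubled backslash in a plain literal spells the same text as a raw string.
assert u'(\\.)\\s*$' == r'(\.)\s*$'

# The question's pattern with every regex backslash doubled (no r prefix).
DOUBLED = re.compile(u'([^\\.])(\\.)([\\]\\)}>"\'»]*)\\s*$', re.U)

# The replacement template needs its backslashes doubled as well.
print(DOUBLED.sub(u'\\1 \\2\\3 ', u'The end.»'))
```

The u'' prefix is accepted by Python 2 and by Python 3.3 and later, so this form needs no version check.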

Python 2 allows concatenating a raw string with a unicode string (resulting in a unicode string), so you could use r'([^\.])(\.)([\]\)}>"\'' u'»' r']*)\s*$'
In Python 3, they will all be unicode, so that will work too.
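
Put together, the concatenation approach might look like the sketch below. PATTERN and pad_final_period are hypothetical names, and the source file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
import re

# Adjacent string literals are joined at compile time. Only the piece that
# holds the non-ASCII character carries the u prefix; the backslash-heavy
# pieces stay raw, so nothing has to be doubled.
PATTERN = re.compile(
    r'([^\.])(\.)([\]\)}>"\'' u'»' r']*)\s*$',
    re.U,
)


def pad_final_period(text):
    """Put spaces around a sentence-final period and any closing marks."""
    return PATTERN.sub(r'\1 \2\3 ', text)


print(pad_final_period(u'He said «hello».»'))
```

This keeps the regex readable while avoiding both the ur'' prefix and the unicode_literals import.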

edited Apr 12, 2017 at 3:21

answered Apr 12, 2017 at 3:14


Patrick Ng · Accepted Answer · 2022-06-23 10:52:16Z

0

I had the same problem, and I ended up doing something like this using the dangerous eval() function. I know it's not pretty, but it allows my code to work in both Python 2 and Python 3.

import re
import sys

# eval() keeps the ur'' literal hidden from the Python 3 parser.
if sys.version_info.major == 2:
    pattern = eval("re.compile(ur'(\u00ab)', re.U)")
else:
    pattern = re.compile(r'(«)', re.U)

answered Jun 23, 2022 at 10:52
