Posix classes in regex module Python

Question

I installed the module regex (not re!) for Python 3.4.3 solely to be able to use POSIX classes such as [:graph:]. However, these don't seem to work.

import regex

sentence = "I like math, I divided ÷ the power ³ by ¾"

sentence = regex.sub(r"[^[:graph:]\s]", "", sentence)

print(sentence)

Output: I like math, I divided ÷ the power ³ by ¾

Expected output: I like math, I divided the power by

It does work in PCRE though. So what am I missing here?

Try sentence = regex.sub(r"(?V1)[^[:graph:]\s]","",sentence).
– Wiktor Stribiżew, Commented Aug 14, 2015 at 19:20
If anything, I think it's PCRE that's doing it wrong. [:graph:] is supposed to match any visible character, but PCRE is only counting ASCII characters. The regex library treats the POSIX character classes as fully Unicode-aware, except for a few that seem to be limited to the original POSIX definitions. (Search for "POSIX character classes" at the link you provided.)
– Alan Moore, Commented Aug 14, 2015 at 19:41
@WashingtonGuedes I don't think that would work anyway, because I want to target all characters that are neither graph nor \s.
– Bram Vanroy, Commented Aug 14, 2015 at 19:42
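
To illustrate the point in Alan Moore's comment: under a Unicode-aware reading of [:graph:], the characters ÷, ³ and ¾ all count as visible, so a negated class leaves them alone, whereas an ASCII-only reading excludes them. The sketch below uses only the standard library. The helper names is_unicode_graph and is_ascii_graph are hypothetical, and the Unicode check is a rough approximation (not whitespace, not a control, surrogate or unassigned code point), not the regex module's actual implementation.

```python
import unicodedata

# Assumption: approximate "graph" as "not whitespace and not a control,
# surrogate or unassigned code point". This is only an illustration.
NON_GRAPH_CATEGORIES = {"Cc", "Cs", "Cn"}


def is_unicode_graph(ch):
    """Rough Unicode-aware test for a visible (graphic) character."""
    if ch.isspace():
        return False
    return unicodedata.category(ch) not in NON_GRAPH_CATEGORIES


def is_ascii_graph(ch):
    """ASCII-only reading of [:graph:]: the range \\x21-\\x7E."""
    return "\x21" <= ch <= "\x7e"


for ch in "÷³¾":
    # Each of these is graphic under the Unicode reading but not the ASCII one.
    print(ch, is_unicode_graph(ch), is_ascii_graph(ch))
```

Under these assumptions each of the three characters is graphic in the Unicode sense and non-graphic in the ASCII sense, which would explain why the two engines disagree.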

vks · Accepted Answer · 2015-08-25 05:01:42Z

1

Try sentence = regex.sub(r"[^[:graph:]\s]", "", sentence, flags=regex.VERSION1)

You need to add the flag regex.VERSION1.

edited Aug 25, 2015 at 5:01

answered Aug 25, 2015 at 3:42

vks


Joseph Stover · Accepted Answer · 2015-08-26 16:59:29Z

1

Not sure about the regex module, but you can get the result with

import re

sentence = "I like math, I divided ÷ the power ³ by ¾"

sentence = re.sub(r"[^\x21-\x7E\s]", "", sentence)

print(sentence)

There is a nice table at http://www.regular-expressions.info/posixbrackets.html that shows how to convert the POSIX classes to their ASCII equivalents, which the re module understands.
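
A minimal sketch of what the ASCII range does in practice, using only the standard re module. The function name strip_non_ascii_graph and the second test string are made up for illustration. Note that the range \x21-\x7E covers only ASCII, so accented letters are removed along with the symbols.

```python
import re

# ASCII-only stand-in for [^[:graph:]\s]: anything outside \x21-\x7E
# that is not whitespace gets removed.
ASCII_NON_GRAPH = re.compile(r"[^\x21-\x7E\s]")


def strip_non_ascii_graph(text):
    """Remove every character that is neither ASCII-graphic nor whitespace."""
    return ASCII_NON_GRAPH.sub("", text)


# repr() makes the leftover whitespace visible.
print(repr(strip_non_ascii_graph("I like math, I divided ÷ the power ³ by ¾")))
print(repr(strip_non_ascii_graph("café à la carte")))
```

The spaces that surrounded each removed character are kept, so the result contains doubled spaces and a trailing space. Collapse them separately if that matters.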

edited Aug 26, 2015 at 16:59

answered Aug 14, 2015 at 19:45

Joseph Stover


4 Comments

Bram Vanroy Over a year ago

I'm late, but this doesn't work as expected. It also deletes special characters such as é and à which I don't want.

Bram Vanroy Over a year ago

Just to let you know, that doesn't solve the problem either when the character is surrounded by spaces: regex101.com/r/sM0yO2/1. I'm guessing that the ASCII range doesn't include special letter characters.

Joseph Stover Over a year ago

@BramVanroy: I've been playing around with this some more. Can you explain how [^\x21-\x7E\s] is behaving differently from [^[:graph:]\s] for you? For me, both are removing é and à. When I moved the \s outside the square brackets, it stopped deleting them, but only because I was typing them at the end of the string I was testing. regex101.com/r/sM0yO2/2 and regex101.com/r/sM0yO2/3

Bram Vanroy Over a year ago

In R, with the Unicode flag, they behave differently though. At least, that's how I tested it. If I get the time, I'll post a test case for you.
