I installed the regex module (not re!) for Python 3.4.3 solely to be able to use POSIX classes such as [:graph:]. However, they don't seem to work.

import regex

sentence = "I like math, I divided ÷ the power ³ by ¾"

sentence = regex.sub(r"[^[:graph:]\s]", "", sentence)

print(sentence)

Output: I like math, I divided ÷ the power ³ by ¾

Expected output: I like math, I divided the power by

It does work in PCRE though. So what am I missing here?

  • Try sentence = regex.sub(r"(?V1)[^[:graph:]\s]","",sentence). Commented Aug 14, 2015 at 19:20
  • @stribizhev Same output unfortunately. Commented Aug 14, 2015 at 19:28
  • @WashingtonGuedes Doesn't work either. Commented Aug 14, 2015 at 19:39
  • If anything, I think it's PCRE that's doing it wrong. [:graph:] is supposed to match any visible character, but PCRE is only counting ASCII characters. The regex library treats the POSIX character classes as fully Unicode-aware, except a few that seem to be limited to the original POSIX definitions. (Search for "POSIX character classes" at the link you provided.) Commented Aug 14, 2015 at 19:41
  • @WashingtonGuedes I don't think that would work anyway, because I want to target all characters that are neither graph nor \s. Commented Aug 14, 2015 at 19:42
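
To illustrate the point made in the comments about Unicode-aware versus ASCII-only matching, here is a minimal sketch that uses only the standard library (not the regex module itself). It looks up the Unicode general category of each symbol from the question and checks whether it falls inside \x21-\x7E, the ASCII range that POSIX originally defines for [:graph:]. The variable names are just for illustration.

```python
import unicodedata

symbols = "÷³¾"

# Unicode general category of each symbol: "Sm" is a math symbol,
# "No" is an "other" number. Both are visible (graphic) characters.
categories = {ch: unicodedata.category(ch) for ch in symbols}

# True only for characters inside the ASCII graph range \x21-\x7E.
in_ascii_graph = {ch: "\x21" <= ch <= "\x7e" for ch in symbols}

print(categories)
print(in_ascii_graph)
```

All three symbols are visible characters outside the ASCII range, which would explain why a Unicode-aware [:graph:] keeps them while an ASCII-only one does not.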

2 Answers

1

Try sentence = regex.sub(r"[^[:graph:]\s]", "", sentence, flags=regex.VERSION1)

You need to add the regex.VERSION1 flag.

1

Not sure about the regex module, but you can get the result with

import re

sentence = "I like math, I divided ÷ the power ³ by ¾"

sentence = re.sub(r"[^\x21-\x7E\s]", "", sentence)

print(sentence)

There is a nice chart at http://www.regular-expressions.info/posixbrackets.html that shows how to convert the POSIX classes to ASCII ranges, which the re module understands.

4 Comments

I'm late, but this doesn't work as expected. It also deletes special characters such as é and à which I don't want.
Just to let you know, that doesn't solve the problem either when the character is surrounded by spaces: regex101.com/r/sM0yO2/1. I'm guessing that the ASCII range doesn't include special letter characters.
@BramVanroy: I've been playing around with this some more. Can you explain how [^\x21-\x7E\s] is behaving differently from [^[:graph:]\s] for you? For me, both are removing é and à. When I moved the \s outside the square brackets, it stopped deleting them only because I was typing them at the end of the string I was testing. regex101.com/r/sM0yO2/2 and regex101.com/r/sM0yO2/3
In R, with the Unicode flag, they behave differently though. At least, that's how I tested it. If I find the time, I'll post a test case for you.
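
One possible workaround for the é and à problem raised in these comments, offered only as a sketch and not as what either library does internally: keep the ASCII-range pattern from the answer, but let a replacement function spare any character that Unicode classifies as a letter. The function name strip_symbols_keep_letters and the constant NON_ASCII_GRAPH are made up for this example.

```python
import re
import unicodedata

# ASCII-only stand-in for [^[:graph:]\s], taken from the answer above.
NON_ASCII_GRAPH = re.compile(r"[^\x21-\x7E\s]")

def strip_symbols_keep_letters(text):
    """Remove non-ASCII characters unless Unicode classifies them as letters."""
    def keep_if_letter(match):
        ch = match.group()
        # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo).
        return ch if unicodedata.category(ch).startswith("L") else ""
    return NON_ASCII_GRAPH.sub(keep_if_letter, text)

print(strip_symbols_keep_letters("café à la carte ÷ ³ ¾"))
```

This assumes accented letters are stored as single precomposed code points. Combining accents (category Mn) would still be stripped, so normalizing the input with unicodedata.normalize("NFC", text) first may be needed.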
