5

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically I'm looking to create a testing module and would like a nice simple way for users to define a character set to pull from. I could potentially concatenate a list of the various charsets associated with string, but that strikes me as a very unclean solution. Is there any way to get the charset that the regex represents?

Example:

def foo(regex_set):
    re.something(re.compile(regex_set))

foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz

The compile is of course optional, but in my mind that's what this function would look like.

15
  • Is the regex guaranteed to match one code-point or do you want the minimal alphabet that covers all symbols in the language specified by the regex? Commented Jul 8, 2013 at 19:33
  • I'm pretty sure you can't do that, at least not cleanly. If it's just one character you could brute-force it, but that's gross. Why not just use string.ascii_lowercase, etc.? Commented Jul 8, 2013 at 19:34
  • You'd need to create your own parser, and you'd probably only want to support a subset of regex syntax. I assume [a-z](?<![a-hj-z]) isn't something you'd want to support. (That's an obfuscated way of saying [i], in case you don't recognize the syntax.) Commented Jul 8, 2013 at 19:34
  • 2
    Then just create your own syntax: az would mean "a to z". aa would mean "just a". That's not hard to do in any language. Commented Jul 8, 2013 at 19:37
  • 2
    @SlaterTyranus Have a list of letters, each with a check box next to it. Simple, prevalent, well documented functionality. Commented Jul 8, 2013 at 19:52

4 Answers

9

Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:

import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz

If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:

import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz

Note: neither module can invert all regex patterns.


And there are differences between the two modules:

import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))

yields

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here is another difference; this time the Pyparsing-based module enumerates a larger set of matches:

x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100


5 Comments

Ooh, looks extremely promising. let me check this out for a little bit.
What does invert(".") give? Just out of curiosity.
@JoranBeasley: I've added the result for both modules.
thanks .... that basically highlights some of the issues with the approach he wants to take...
@JoranBeasley - try it for yourself and see: utilitymill.com/utility/Regex_inverter/13
2

A regex is not needed here. If you want to have users select a character set, let them just pick characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficient. If you want something that is more compact, or just looks cooler, you could do something like one of these:

[Image: one way of displaying the letter selection (green = selected)]
[Image: another way of displaying the letter selection (no x = selected)]
[Image: yet another way of displaying the letter selection (black background = selected)]

Of course, if you actually use this, what you come up with will undoubtedly look better than these (And they will also actually have all the letters in them, not just "A").

If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.

4 Comments

Woah, I thought you were joking. Upvote for proof of concept, but I don't believe in GUIs.
I was, actually, but then I realized that that is actually a good solution, too.
Certainly great for some, hence the upvote, but you're speaking to someone who uses dwm.
I don't really believe in GUIs either, actually. Some people seem to like them, though.
1

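If it's just simple ranges, you could parse them manually:

def range_parse(rng):
    # Split "a-z" into its two endpoint characters.
    start, end = rng.split("-")
    # Build every character from start to end, inclusive.
    return "".join(chr(i) for i in range(ord(start), ord(end) + 1))

print(range_parse("a-z") + range_parse("A-Z"))

But it's gross...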
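
If several ranges and single characters need to go in one string, the same idea can be stretched a little. This is only a rough sketch: the name expand_ranges and the "a-zA-Z0-9_" style of spec are my own assumptions, and it deliberately ignores everything else regex character classes support (negation, escapes, shorthand classes).

```python
def expand_ranges(spec):
    """Expand a spec such as 'a-zA-Z0-9_' into the characters it names.

    Hypothetical helper: handles only 'x-y' ranges and literal characters.
    """
    chars = []
    i = 0
    while i < len(spec):
        # A range is three characters: start, '-', end.
        if i + 2 < len(spec) and spec[i + 1] == "-":
            start, end = ord(spec[i]), ord(spec[i + 2])
            if start > end:
                raise ValueError("bad range: " + spec[i:i + 3])
            chars.extend(chr(c) for c in range(start, end + 1))
            i += 3
        else:
            # Anything else (including a stray '-') is taken literally.
            chars.append(spec[i])
            i += 1
    # Drop duplicates while keeping first-seen order.
    return "".join(dict.fromkeys(chars))


print(expand_ranges("a-cX-Z_"))
```

A leading or trailing "-" is treated as a literal hyphen, which roughly mirrors how regex character classes behave.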

1 Comment

Wasn't thinking of this as being just simple ranges.
0

Another solution I thought of to simplify the problem:

Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify it doesn't contain anything matching [\[\]], you can prepend [ and append ] to the string, and use it like a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", for instance).
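
A minimal sketch of that idea, using only the standard library. The function name charset_from_class_body and the default alphabet of string.printable are my own choices for illustration, not anything from the question:

```python
import re
import string


def charset_from_class_body(body, alphabet=string.printable):
    """Return the characters of `alphabet` matched by the class [body].

    Hypothetical helper; `body` is what the user typed between the brackets.
    """
    # Reject brackets so the user cannot close the class early or open another.
    if re.search(r"[\[\]]", body):
        raise ValueError("brackets are not allowed in the character set")
    # Wrap the input in our own brackets and compile it as a character class.
    pattern = re.compile("[" + body + "]")
    # Keep each candidate character that the class matches on its own.
    return "".join(ch for ch in alphabet if pattern.fullmatch(ch))


print(charset_from_class_body("a-z", string.ascii_lowercase))
```

The result can only ever contain characters from the alphabet you pass in, and it comes back in that alphabet's order. An empty body or a trailing backslash still makes re.compile raise re.error, so you would probably want to catch that as well.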
