4

What is a regular expression that will match any valid Python integer literal in a string? It should support all the extra stuff like o and l, but not match a float, or a variable with a number in it. I am using Python's re, so any syntax supported by that is OK.

EDIT: Here's my motivation (as apparently that's quite important). I am trying to fix http://code.google.com/p/sympy/issues/detail?id=3182. What I want to do is create a hook for IPython that automatically converts int/int (like 1/2) to Rational(int, int) (like Rational(1, 2)). The reason is that otherwise it is impossible to make 1/2 be registered as a rational number, because it's Python type __div__ Python type. In SymPy, this can be quite annoying because things like x**(1/2) will create x**0 (or x**0.5 with __future__ division or Python 3), when what you want is x**Rational(1, 2), an exact quantity.

My solution is to add a hook to IPython that automatically wraps all integer literals in the input with Integer (SymPy's custom integer class that gives Rational on division). This will let me add an option to isympy that will let SymPy act more like a traditional computer algebra system in this respect, for those who want it. I hope this explains why I need it to match any and all literals inside an arbitrary Python expression, which is why it needs to not match float literals and variables with numbers in their names.

Also, since everyone's so interested in what I tried, here it is: not much before I gave up (regular expressions are hard). I played with (?!\.) to make it not catch the first part of float literals, but this didn't seem to work (I'd be curious if someone can tell me why; an example is re.sub(r"(\d*(?!\.))", r"S\(\1\)", "12.1")).

EDIT 2: Since I plan to use this in conjunction with re.sub, you might as well wrap the whole thing in parentheses in your answers so I can use \1 :)

5
  • Everything you need to know is in the Python Docs Commented Jul 31, 2012 at 5:10
  • I did do my research. I googled for it, and even tried it myself. It got me nowhere. I didn't include that in the question because I didn't feel it was relevant. Commented Jul 31, 2012 at 5:10
  • And considering that none of the answers so far do what I want, I'd say it's not a trivial problem. Commented Jul 31, 2012 at 5:11
  • 3
    @asmeurer usually best to post your wrong/incomplete solution (in the question) than nothing purely for this reason. Also, mentioning why you want to do something along with the rest of the question can be handy, because there may be other solutions you didn't expect that are better than the one asked for. Commented Jul 31, 2012 at 5:21
  • I agree with @JoshSmeaton. Sorry if I was a little rude. If you edit your question, I can reverse my downvote. Commented Jul 31, 2012 at 5:28

6 Answers

5

The definition of the integer literal is (in 3.x, slightly different in 2.x):

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"

So, something like this:

0[oO][0-7]+|0[xX][\da-fA-F]+|0[bB][01]+|[1-9]\d*|0+
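
A quick way to sanity-check a pattern built from this grammar is to run it over a few candidate strings with re.fullmatch. The following is only a sketch, assuming Python 3 (underscores in literals, allowed since 3.6, are not handled). The names INT_LITERAL and is_int_literal are made up for illustration. The prefixed forms are listed first because Python's alternation takes the first branch that matches, so a bare "0" branch placed earlier would cut "0x1F" short when searching.

```python
import re

# Python 3 integer literals, following the grammar above.
# Prefixed forms come first so "0x1F" is not matched as just "0".
INT_LITERAL = re.compile(
    r"0[oO][0-7]+|0[xX][0-9a-fA-F]+|0[bB][01]+|[1-9][0-9]*|0+"
)

def is_int_literal(text):
    """Return True if the whole string is a single integer literal."""
    return INT_LITERAL.fullmatch(text) is not None

for candidate in ["0", "42", "0x1F", "0o17", "0b101", "007", "12.1", "x1"]:
    print(candidate, is_int_literal(candidate))
```

Note that fullmatch only validates an isolated string; finding literals inside a larger expression still needs some guard against floats and identifiers.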

Based on saying you want to support "l", I guess you actually want the 2.x definition:

longinteger    ::=  integer ("l" | "L")
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"
octinteger     ::=  "0" ("o" | "O") octdigit+ | "0" octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
nonzerodigit   ::=  "1"..."9"
octdigit       ::=  "0"..."7"
bindigit       ::=  "0" | "1"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"

which can be written

(?:0[xX][\da-fA-F]+|0[bB][01]+|0[oO]?[0-7]+|[1-9]\d*|0)[lL]?

3 Comments

I actually will want both. Thanks!
This still matches the first part of float literals and the number part of variables that contain numbers.
I haven't written it yet, but it looks like the decimal example from the Python docs is almost exactly what I want.
4

I'm not convinced using an re is the way to go. Python has tokenize, ast, symbol and parser modules that can be used to parse/process/manipulate/re-write Python code...

>>> s = "33.2 + 6 * 0xFF - 0744"
>>> from StringIO import StringIO
>>> import tokenize
>>> t = list(tokenize.generate_tokens(StringIO(s).readline))
>>> t
[(2, '33.2', (1, 0), (1, 4), '33.2 + 6 * 0xFF - 0744'), (51, '+', (1, 5), (1, 6), '33.2 + 6 * 0xFF - 0744'), (2, '6', (1, 7), (1, 8), '33.2 + 6 * 0xFF - 0744'), (51, '*', (1, 9), (1, 10), '33.2 + 6 * 0xFF - 0744'), (2, '0xFF', (1, 11), (1, 15), '33.2 + 6 * 0xFF - 0744'), (51, '-', (1, 16), (1, 17), '33.2 + 6 * 0xFF - 0744'), (2, '0744', (1, 18), (1, 22), '33.2 + 6 * 0xFF - 0744'), (0, '', (2, 0), (2, 0), '')]
>>> nums = [eval(i[1]) for i in t if i[0] == tokenize.NUMBER]
>>> nums
[33.2, 6, 255, 484]
>>> print map(type, nums)
[<type 'float'>, <type 'int'>, <type 'int'>, <type 'int'>]

There's an example at http://docs.python.org/library/tokenize.html that re-writes floats as decimal.Decimal
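
Along the lines of that docs example, here is a minimal sketch of wrapping integer literals instead of floats. It assumes Python 3; the helper names is_int_literal and wrap_integers are hypothetical, and fractions.Fraction stands in for SymPy's Integer so the sketch runs without SymPy installed.

```python
import io
import tokenize
from fractions import Fraction  # stand-in for SymPy's Integer in this sketch

def is_int_literal(text):
    """True if a NUMBER token is an integer (not a float or imaginary)."""
    try:
        int(text, 0)  # base 0 follows Python's own integer literal rules
    except ValueError:
        return False
    return True

def wrap_integers(source, wrapper="Integer"):
    """Return source with every integer literal rewritten as wrapper(literal)."""
    result = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok_type, tok_text, _, _, _ in tokens:
        if tok_type == tokenize.NUMBER and is_int_literal(tok_text):
            result.extend([
                (tokenize.NAME, wrapper),
                (tokenize.OP, "("),
                (tokenize.NUMBER, tok_text),
                (tokenize.OP, ")"),
            ])
        else:
            result.append((tok_type, tok_text))
    # With 2-tuples, untokenize only guarantees equivalent code, not spacing.
    return tokenize.untokenize(result)

rewritten = wrap_integers("1/2 + 0x10", wrapper="Fraction")
print(rewritten)
print(eval(rewritten, {"Fraction": Fraction}))
```

Because the tokenizer already knows about floats, names and string literals, none of the lookaround tricks needed by a regex are required here; the cost is that the spacing of the rewritten source is not preserved.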

3 Comments

That is a good point. I wonder if there is a significant speed difference in doing it this way.
@asmeurer Thanks for accepted answer - how did it work out? (any link to see update?)
see github.com/sympy/sympy/pull/1470. Ironically, the hard part was getting IPython to do this automatically. It turns out their API needs updating.
4

The syntax is described at http://docs.python.org/reference/lexical_analysis.html#integers. Here's one way to express it as a regex:

(0[xX][0-9a-fA-F]+|0[bB][01]+|0[oO]?[0-7]+|[1-9][0-9]*|0)[lL]?

Disclaimer: this does not support negative integers, because in Python, the - in something like -31 isn't actually part of the integer literal, but rather, it's a separate operator.
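
The point about the minus sign can be checked with the ast module. This is a small sketch assuming Python 3.8 or later, where number literals parse to ast.Constant nodes:

```python
import ast

# Parse "-31" as an expression and look at the top-level node.
node = ast.parse("-31", mode="eval").body

print(type(node).__name__)     # UnaryOp
print(type(node.op).__name__)  # USub
print(node.operand.value)      # 31
```

So the literal itself is 31, and the minus is a unary operator applied to it.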

4 Comments

Missing the format for e.g. 0755 as an octal literal; also requires the [lL] on the end right now.
It's OK if the - is separate. It will still work out fine for what I am doing.
Hmmm interesting point about the -. Now that I think about it, it makes sense that it would be a separate operator.
@Dougal: In other words, I was missing both instances of ?. Dunno how that happened. Thanks for pointing it out; fixed now.
2

If you really want to match both "dialects", you'll get some ambiguities, for example with octals (the o is required in Python 3). But the following should work:

r = r"""(?xi) # Verbose, case-insensitive regex
(?<!\.)       # Assert no dot before the number
\b            # Start of number
(?:           # Match one of the following:
 0x[0-9a-f]+| # Hexadecimal number
 0o?[0-7]+|   # Octal number
 0b[01]+|     # Binary number
 0+|          # Zero
 [1-9]\d*     # Other decimal number
)             # End of alternation
L?            # Optional Long integer
\b            # End of number
(?!\.)        # Assert no dot after the number"""
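
As a rough usage sketch, the same pattern (written more compactly) can be compiled and passed to re.sub. The names INT_RE and wrap are made up for illustration, and Integer is just text in the output here, not an imported class.

```python
import re

# Same pattern as above, condensed; (?xi) = verbose + case-insensitive.
INT_RE = re.compile(r"""(?xi)
(?<!\.) \b
(?: 0x[0-9a-f]+ | 0o?[0-7]+ | 0b[01]+ | 0+ | [1-9]\d* )
L?
\b (?!\.)
""")

def wrap(expr):
    # \g<0> refers to the whole match, so no capturing group is needed.
    return INT_RE.sub(r"Integer(\g<0>)", expr)

print(wrap("x1**(1/2) + 12.1 - 0xFF"))
# x1**(Integer(1)/Integer(2)) + 12.1 - Integer(0xFF)
```

Two gaps remain with this approach: the exponent of a float such as 1e-5 is still matched (the 5 gets wrapped), and digits inside string literals are rewritten too.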

10 Comments

Yes, I know that I'll have to use different ones for different Pythons, but that's not a big deal as I care only about the running Python version, so a simple sys.version_info will do it for me.
Shouldn't it be a raw string?
Also, unless I parenthesized it incorrectly for \1, it doesn't seem to work correctly for floats (it just matches both ints before and after the .)
You're right. I had misconstructed the lookaround assertions (it's too early in the morning). Now it should finally work. Sorry.
Also, you don't need any parentheses - \g<0> refers to the entire match in the replacement string (\0 would be read as an octal escape).
1

Would something like this suffice?

r = r"""
(?<![\w.])               #Start of string or non-alpha non-decimal point
(?:                      #Group so both lookarounds apply to every alternative
    0[X][0-9A-F]+L?|     #Hexadecimal
    0[O][0-7]+L?|        #Octal
    0[B][01]+L?|         #Binary
    [1-9]\d*L?           #Decimal/Long Decimal, will not match 0____
)
(?![\w.])                #End of string or non-alpha non-decimal point
"""

(with flag re.VERBOSE | re.IGNORECASE)

2 Comments

Instead of (?:^|[^\w\.]), you should use (?<![\w.]). Same with (?:$|[^\w\.]): use (?![\w.]). Otherwise the characters before/after the number will become part of the match.
Also, octals only go up to the digit 7. And you can make your regex more legible using the re.I flag.
0

This gets fairly close:

re.match(r'^(0[xob])?\d+[Ll]?$', '0o123l')

4 Comments

ugh, after looking at some of the answers, mine will provide a lot of false positives, and completely skips hex literals.
Wow a downvote for an incomplete answer, even after I mention the limitations? Figure lack of an upvote should be enough.
In my experience, you gotta just delete your wrong answers, or they will be downvoted into oblivion (though honestly at 10.3k I wouldn't be worrying too much about my reputation if I were you)
@asmeurer yeah you're right - and I'm not worried too much about reputation as much as education I guess.
