Regular expression to match a Python integer literal

Question

What is a regular expression that will match any valid Python integer literal in a string? It should support all the extra stuff like o and l, but not match a float, or a variable with a number in it. I am using Python's re, so any syntax supported by that is OK.

EDIT: Here's my motivation (as apparently that's quite important). I am trying to fix http://code.google.com/p/sympy/issues/detail?id=3182. What I want to do is create a hook for IPython that automatically converts int/int (like 1/2) to Rational(int, int) (like Rational(1, 2)). The reason is that otherwise it is impossible to make 1/2 be registered as a rational number, because it's Python type __div__ Python type, so the division is evaluated before SymPy ever sees it. In SymPy, this can be quite annoying because things like x**(1/2) will create x**0 (or x**0.5 with __future__ division or Python 3), when what you want is x**Rational(1, 2), an exact quantity.

My solution is to add a hook to IPython that automatically wraps all integer literals in the input with Integer (SymPy's custom integer class that gives Rational on division). This will let me add an option to isympy that will let SymPy act more like a traditional computer algebra system in this respect, for those who want it. I hope this explains why I need it to match any and all literals inside an arbitrary Python expression, which is why it needs to not match float literals and variables with numbers in their names.

Also, since everyone's so interested in what I tried, here it is: not much before I gave up (regular expressions are hard). I played with (?!\.) to make it not catch the first part of float literals, but this didn't seem to work (I'd be curious if someone can tell me why, an example is re.sub(r"(\d*(?!\.))", r"S$\1$", "12.1")).

EDIT 2: Since I plan to use this in conjunction with re.sub, you might as well wrap the whole thing in parentheses in your answers so I can use \1 :)

I did do my research. I googled for it, and even tried it myself. It got me nowhere. I didn't include that in the question because I didn't feel it was relevant.
– asmeurer, Commented Jul 31, 2012 at 5:10
And considering that none of the answers so far do what I want, I'd say it's not a trivial problem.
– asmeurer, Commented Jul 31, 2012 at 5:11
@asmeurer it's usually better to post your wrong/incomplete solution (in the question) than nothing, purely for this reason. Also, mentioning why you want to do something along with the rest of the question can be handy, because there may be other solutions you didn't expect that are better than the one asked for.
– Josh Smeaton, Commented Jul 31, 2012 at 5:21
I agree with @JoshSmeaton. Sorry if I was a little rude. If you edit your question, I can reverse my downvote.
– Joel Cornett, Commented Jul 31, 2012 at 5:28

Danica · Answer · 2012-07-31 05:02:00Z

5

The definition of the integer literal is (in 3.x, slightly different in 2.x):

integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"+
nonzerodigit   ::=  "1"..."9"
digit          ::=  "0"..."9"
octinteger     ::=  "0" ("o" | "O") octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
octdigit       ::=  "0"..."7"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"
bindigit       ::=  "0" | "1"

So, something like this:

0[oO][0-7]+|0[xX][\da-fA-F]+|0[bB][01]+|[1-9]\d*|0+
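As a rough check against the 3.x grammar, the sketch below tests a handful of strings with re.fullmatch. The names INT_3X and is_int_literal and the sample lists are illustrative, not from the original answer. The alternatives are ordered with the prefixed forms first so that an unanchored search would not stop at the leading 0, and the underscore digit separators added in later Python 3 releases are deliberately ignored.

```python
import re

# Illustrative pattern for the 3.x grammar above; prefixed forms come first.
INT_3X = re.compile(r"0[oO][0-7]+|0[xX][0-9a-fA-F]+|0[bB][01]+|[1-9][0-9]*|0+")

def is_int_literal(text):
    """Return True if the whole string is a 3.x integer literal (no underscores)."""
    return INT_3X.fullmatch(text) is not None

valid = ["0", "000", "7", "1234", "0o17", "0XfF", "0b101"]
invalid = ["0755", "1.5", "x1", "0x", "12L", ""]

# Both lists should come out empty if the pattern agrees with the grammar.
print([s for s in valid if not is_int_literal(s)])   # strings wrongly rejected
print([s for s in invalid if is_int_literal(s)])     # strings wrongly accepted
```

Note that fullmatch only answers "is this whole string a literal?"; finding literals inside a larger expression still needs the boundary checks discussed in the other answers.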

Since you say you want to support "l", I guess you actually want the 2.x definition:

longinteger    ::=  integer ("l" | "L")
integer        ::=  decimalinteger | octinteger | hexinteger | bininteger
decimalinteger ::=  nonzerodigit digit* | "0"
octinteger     ::=  "0" ("o" | "O") octdigit+ | "0" octdigit+
hexinteger     ::=  "0" ("x" | "X") hexdigit+
bininteger     ::=  "0" ("b" | "B") bindigit+
nonzerodigit   ::=  "1"..."9"
octdigit       ::=  "0"..."7"
bindigit       ::=  "0" | "1"
hexdigit       ::=  digit | "a"..."f" | "A"..."F"

which can be written

(?:0[xX][\da-fA-F]+|0[bB][01]+|0[oO]?[0-7]+|[1-9]\d*|0)[lL]?

edited Jul 31, 2012 at 5:02

answered Jul 31, 2012 at 4:55


3 Comments

asmeurer Over a year ago

I actually will want both. Thanks!

asmeurer Over a year ago

This still matches the first part of float literals and the number part of variables that contain numbers.

asmeurer Over a year ago

I haven't written it yet, but it looks like the decimal example from the Python docs is almost exactly what I want.

Jon Clements · Accepted Answer · 2012-07-31 07:58:50Z

4

I'm not convinced using an re is the way to go. Python has tokenize, ast, symbol and parser modules that can be used to parse/process/manipulate/re-write Python code...

>>> s = "33.2 + 6 * 0xFF - 0744"
>>> from StringIO import StringIO
>>> import tokenize
>>> t = list(tokenize.generate_tokens(StringIO(s).readline))
>>> t
[(2, '33.2', (1, 0), (1, 4), '33.2 + 6 * 0xFF - 0744'), (51, '+', (1, 5), (1, 6), '33.2 + 6 * 0xFF - 0744'), (2, '6', (1, 7), (1, 8), '33.2 + 6 * 0xFF - 0744'), (51, '*', (1, 9), (1, 10), '33.2 + 6 * 0xFF - 0744'), (2, '0xFF', (1, 11), (1, 15), '33.2 + 6 * 0xFF - 0744'), (51, '-', (1, 16), (1, 17), '33.2 + 6 * 0xFF - 0744'), (2, '0744', (1, 18), (1, 22), '33.2 + 6 * 0xFF - 0744'), (0, '', (2, 0), (2, 0), '')]
>>> nums = [eval(i[1]) for i in t if i[0] == tokenize.NUMBER]
>>> nums
[33.2, 6, 255, 484]
>>> print map(type, nums)
[<type 'float'>, <type 'int'>, <type 'int'>, <type 'int'>]

There's an example at http://docs.python.org/library/tokenize.html that re-writes floats as decimal.Decimal
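A minimal sketch of the same idea for integers, loosely modelled on that documentation example and written for Python 3 (io.StringIO rather than the Python 2 StringIO used in the session above). The helper names wrap_integers and is_integer_token are hypothetical, and Integer is simply the wrapper name mentioned in the question. The whitespace of the rewritten source is whatever tokenize.untokenize produces, so it may differ from the input.

```python
import io
import tokenize

def is_integer_token(text):
    """Guess whether a NUMBER token is an integer (not float or imaginary)."""
    lowered = text.lower()
    if lowered.startswith(("0x", "0o", "0b")):
        return True  # prefixed literals are always integers (hex may contain 'e')
    return not any(ch in lowered for ch in ".ej")

def wrap_integers(source, wrapper="Integer"):
    """Rewrite each integer literal in `source` as wrapper(literal)."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NUMBER and is_integer_token(tok.string):
            result.extend([
                (tokenize.NAME, wrapper),
                (tokenize.OP, "("),
                (tokenize.NUMBER, tok.string),
                (tokenize.OP, ")"),
            ])
        else:
            result.append((tok.type, tok.string))
    return tokenize.untokenize(result)

print(wrap_integers("x1**(1/2) + 3.5"))
```

Because the tokenizer already separates names, floats, and strings from integers, this avoids the false matches a bare regex produces on inputs like x1 or 12.1.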

answered Jul 31, 2012 at 7:58


3 Comments

asmeurer Over a year ago

That is a good point. I wonder if there is a significant speed difference in doing it this way.

Jon Clements Over a year ago

@asmeurer Thanks for accepted answer - how did it work out? (any link to see update?)

asmeurer Over a year ago

see github.com/sympy/sympy/pull/1470. Ironically, the hard part was getting IPython to do this automatically. It turns out their API needs updating.

ruakh · Answer · 2012-07-31 12:18:17Z

4

The syntax is described at http://docs.python.org/reference/lexical_analysis.html#integers. Here's one way to express it as a regex:

(0[xX][0-9a-fA-F]+|0[bB][01]+|0[oO]?[0-7]+|[1-9][0-9]*|0)[lL]?

Disclaimer: this does not support negative integers, because in Python, the - in something like -31 isn't actually part of the integer literal, but rather, it's a separate operator.
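One way to see this for yourself is to parse the expression with the standard ast module. This is only an illustration of the point about the minus sign, not part of the regex:

```python
import ast

tree = ast.parse("-31", mode="eval")
node = tree.body

# The minus sign is a unary operator applied to the literal 31.
print(type(node).__name__)             # UnaryOp
print(type(node.op).__name__)          # USub
print(ast.literal_eval(node.operand))  # 31
```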

edited Jul 31, 2012 at 12:18

answered Jul 31, 2012 at 4:53


4 Comments

Danica Over a year ago

Missing the format for e.g. 0755 as an octal literal; also requires the [lL] on the end right now.

asmeurer Over a year ago

It's OK if the - is separate. It will still work out fine for what I am doing.

Joel Cornett Over a year ago

Hmmm interesting point about the -. Now that I think about it, it makes sense that it would be a separate operator.

ruakh Over a year ago

@Dougal: In other words, I was missing both instances of ?. Dunno how that happened. Thanks for pointing it out; fixed now.

Tim Pietzcker · Answer · 2012-07-31 07:00:22Z

2

If you really want to match both "dialects", you'll get some ambiguities, for example with octals (the o is required in Python 3). But the following should work:

r = r"""(?xi) # Verbose, case-insensitive regex
(?<!\.)       # Assert no dot before the number
\b            # Start of number
(?:           # Match one of the following:
 0x[0-9a-f]+| # Hexadecimal number
 0o?[0-7]+|   # Octal number
 0b[01]+|     # Binary number
 0+|          # Zero
 [1-9]\d*     # Other decimal number
)             # End of alternation
L?            # Optional Long integer
\b            # End of number
(?!\.)        # Assert no dot after the number"""
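A hedged usage sketch with re.sub, using a compact copy of a pattern in this style. The names INT_LITERAL and wrap are illustrative, and Integer is the wrapper name from the question. It works on plain expressions only; digits inside string literals or comments would also be rewritten.

```python
import re

# Compact copy of a verbose, case-insensitive pattern with dot lookarounds.
INT_LITERAL = re.compile(r"""(?xi)
    (?<!\.) \b                                         # no dot before, start of number
    (?: 0x[0-9a-f]+ | 0o?[0-7]+ | 0b[01]+ | 0+ | [1-9]\d* )
    L?                                                 # optional long suffix (2.x)
    \b (?!\.)                                          # end of number, no dot after
""")

def wrap(expr):
    # \g<0> refers to the whole match, so no capturing group is needed.
    return INT_LITERAL.sub(r"Integer(\g<0>)", expr)

print(wrap("x1**(1/2) + 12.1 - 0xFF"))
# x1**(Integer(1)/Integer(2)) + 12.1 - Integer(0xFF)
```

The \b assertions keep the digits in x1 and 1e5 from matching, and the two dot lookarounds reject both halves of 12.1.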

edited Jul 31, 2012 at 7:00

answered Jul 31, 2012 at 6:19


10 Comments

asmeurer Over a year ago

Yes, I know that I'll have to use different ones for different Pythons, but that's not a big deal as I care only about the running Python version, so a simple sys.version_info will do it for me.

asmeurer Over a year ago

Shouldn't it be a raw string?

asmeurer Over a year ago

Also, unless I parenthesized it incorrectly for \1, it doesn't seem to work correctly for floats (it just matches both ints before and after the .)

Tim Pietzcker Over a year ago

You're right. I had misconstructed the lookaround assertions (it's too early in the morning). Now it should finally work. Sorry.

Tim Pietzcker Over a year ago

Also, you don't need any parentheses - \g<0> refers to the entire match (in Python's re.sub, \0 is read as an octal escape, not a backreference).

Joel Cornett · Answer · 2012-07-31 07:23:32Z

1

Would something like this suffice?

r = r"""
(?<![\w.])               #Start of string or non-alpha non-decimal point
(?:                      #Group the alternatives so both lookarounds apply to each
    0[X][0-9A-F]+L?|     #Hexadecimal
    0[O][0-7]+L?|        #Octal
    0[B][01]+L?|         #Binary
    [1-9]\d*L?           #Decimal/Long Decimal, will not match 0____
)
(?![\w.])                #End of string or non-alpha non-decimal point
"""

(with flag re.VERBOSE | re.IGNORECASE)

edited Jul 31, 2012 at 7:23

answered Jul 31, 2012 at 5:27


2 Comments

Tim Pietzcker Over a year ago

Instead of (?:^|[^\w\.]), you should use (?<![\w.]). Same with (?:$|[^\w\.]): use (?![\w.]). Otherwise the characters before/after the number will become part of the match.

Tim Pietzcker Over a year ago

Also, octals only go up to the digit 7. And you can make your regex more legible using the re.I flag.

Josh Smeaton · Answer · 2012-07-31 05:00:41Z

0

This gets fairly close:

re.match(r'^(0[xob])?\d+[Ll]?$', '0o123l')

answered Jul 31, 2012 at 5:00


4 Comments

Josh Smeaton Over a year ago

ugh, after looking at some of the answers, mine will provide a lot of false positives, and completely skips hex literals.

Josh Smeaton Over a year ago

Wow a downvote for an incomplete answer, even after I mention the limitations? Figure lack of an upvote should be enough.

asmeurer Over a year ago

In my experience, you gotta just delete your wrong answers, or they will be downvoted into oblivion (though honestly at 10.3k I wouldn't be worrying too much about my reputation if I were you)

Josh Smeaton Over a year ago

@asmeurer yeah you're right - and I'm not worried too much about reputation as much as education I guess.
