4

In Python Regular Expressions,

re.compile("x"*50000)

gives me OverflowError: regular expression code size limit exceeded

but following one does not get any error, but it hits 100% CPU, and took 1 minute in my PC

>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>

Is that normal?

Should I assume, ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?

Tested on Python 2.6, Win32

UPDATE 1:

It Looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduce to .*?

So, how about this one?

re.compile(".*?x"*50000)

It does compile, and if that one also can reduce to ".*?x", it should match to string "abcx" or "x" alone, but it does not match.

So, Am I missing something?

UPDATE 2:

My Point is not to know max limit of regex source strings, I like to know some reasons/concepts of "x"*50000 caught by overflow handler, but not on ".*?x"*50000.

It does not make sense for me, thats why.

It is something missing on overflow checking or Its just fine or its really overflowing something?

Any Hints/Opinions will be appreciated.

6
  • No ".*?x"5000 does not reduce down to ".?x" - it reduces down to a regex with 5000 x's in, with ".*?" before each x. This is why it does not match "abcx" or "x" - it will only match a string with 5000 x's. Commented Jan 4, 2010 at 9:34
  • 1
    It reminds me, that I hit once PHP's maximum length for regular expressions, when I tried to convert the ABNF for SVG paths into a regexp. So, IMHO, the answer would be good to know. Commented Jan 4, 2010 at 9:35
  • @Dave Kirby, Thanks, but its 50000 (50k). My original questions is re.compile("x"*50000) does not get compiled, but re.compile(".*?x"*50000) got compile. Commented Jan 4, 2010 at 9:37
  • Why do you need to know the maximum regex? Write the regex you need. x{5000} or x{500000} or whatever. When you reach the limit, there you are. What's the point of knowing where the limit is? You don't need to know until the (unlikely) even that you write a REAL regex thats too long. Not these degenerate cases that aren't sensible regexs in the first place. Commented Jan 4, 2010 at 10:58
  • @S.Lott, You're right, knowing the limit is no points at all, but, don't you think it can be vulnerable for Buffer Overflow when "x"*50000 got caught by overflow handler, but ".*?x"*50000 does not, even ".*?x"*100000 does not. Commented Jan 4, 2010 at 11:03

2 Answers 2

6

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).

EDIT: Ok, I was wrong. It's not that smart. The reason why "x"*50000 fails, but ".*?x"*50000 doesn't is that there is a limit on size of one "code item". "x"*50000 will generate one long item and ".*?x"*50000 will generate many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks, but how about re.compile(".*?x"*50000)?
I don't really know about internals about the Python regex engine, so I'm not sure. The regex should match 50000 x's with any characters between them. I don't know what optimalizations does it do, but it's likely that it does something special about the regex. FYI, all of the regexes work on Linux.
Thanks for info about linux version, I just checked _sre.c -> _compile function, there is no specific code for windows, so it may be because of some size different like wchar_t and/or your python is compiled with Unicode=UCS4
"x"*50000 will generate one long item and ".*?x"*50000 will generate many small items, Thank you, that make sense.
1

you want to match 50000 "x"s , correct??? if so, an alternative without regex

if "x"*50000 in mystring:
    print "found"

if you want to match 50000 "x"s using regex, you can use range

>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>

on my system it will take in length of 65535 max

>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
    groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>

I don't know if there are tweaks in Python we can use to increase that limit though.

4 Comments

Thanks for the update for the code, but {65535} is the repetation limit, thats a bit different with mine. "x"*50000 and "x{50000} is different in my understanding.
"x"*50000 produces 50000 x's.. in a regex, if you put x{50000}, you are telling the regex engine to search for 50000 x's as well... Or am i missing something ? you should clearly state what you want to do in your question with examples...
@ghostdog74, you are right, I can use x{50000} if I want to match 50000 x. I am just wondering re.compile is safe or not and testing for vulnerablity. thats all.
well, as you can see, string length exceeding 65535 (for my system) crashes the regex engine and produce error. I would suggest checking your string according to your application specs before passing it to re.compile.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.