After long debugging I found why my application using python regexps is slow. Here is something I find surprising:
import datetime
import re
pattern = re.compile('(.*)sol(.*)')
lst = ["ciao mandi "*10000 + "sol " + "ciao mandi "*10000,
"ciao mandi "*1000 + "sal " + "ciao mandi "*1000]
for s in lst:
print "string len", len(s)
start = datetime.datetime.now()
re.findall(pattern,s)
print "time spent", datetime.datetime.now() - start
print
The output on my machine is:
string len 220004
time spent 0:00:00.002844
string len 22004
time spent 0:00:05.339580
The first test string is 220K long, matches, and the matching is quite fast. The second test string is 20K long, does not match and it takes 5 seconds to compute!
I knew this report http://swtch.com/~rsc/regexp/regexp1.html which says that regexp implementation in python, perl, ruby is somewhat non optimal... Is this the reason? I didn't thought it could happen with such a simple expression.
added My original task is to split a string trying different regex in turn. Something like:
for regex in ['(.*)sol(.*)', '\emph{([^{}])*)}(.*)', .... ]:
lst = re.findall(regex, text)
if lst:
assert len(lst) == 1
assert len(lst[0]) == 2
return lst[0]
This is to explain why I cannot use split. I solved my issue by replacing (.*)sol(.*) with (.*?)sol(.*) as suggested by Martijn.
Probably I should use match instead of findall... but I don't think this would have solved the issue since the regexp is going to match the entire input and hence findall should stop at first match.
Anyway my question was more about how easy is to fall in this problem for a regexp newbie... I think it is not so simple to understand that (.*?)sol(.*) is the solution (and for example (.*?)sol(.*?) is not).
.*are too permissive and cause a catastrophic backtracking. What are you exactly trying to do?(.*). Searching for(.*)solhas exactly the same time profile. In fact(.*)solis actually worse if the string containssoland you search withfindall(), since it triggers a failed backtracking search on the substring that followssol. (The original RE will consume the entire string on success).