I have a program where I need to compile several thousand large regexes, all of which will be used many times. The problem is that it takes too long (113 secs, according to cProfile) to re.compile() them. (BTW, actually searching with all of these regexes takes < 1.3 secs once they are compiled.)
If I don't precompile, it just postpones the problem to when I actually search, since re.search(expr, text) implicitly compiles expr. Actually, it's worse: re only keeps a limited internal cache of compiled patterns, so with several thousand regexes it ends up recompiling the list every time I use them.
I tried using multiprocessing, but that actually slows things down. Here's a small test to demonstrate:
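## rgxparallel.py ##
import re
import multiprocessing as mp

def serial_compile(strings):
    # Baseline: compile every pattern in the current process.
    return [re.compile(s) for s in strings]

def parallel_compile(strings):
    # Compile in worker processes; results are pickled back to the parent.
    print("Using {} processors.".format(mp.cpu_count()))
    pool = mp.Pool()
    result = pool.map(re.compile, strings)
    pool.close()
    return result

# Test input: 100,000 trivial patterns ("0", "1", ...).
l = map(str, xrange(100000))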
And my test script:
#!/bin/sh
python -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
python -m timeit -n 1 -s "import rgxparallel as r" "r.parallel_compile(r.l)"
# Output:
# 1 loops, best of 3: 6.49 sec per loop
# Using 4 processors.
# Using 4 processors.
# Using 4 processors.
# 1 loops, best of 3: 9.81 sec per loop
My guess is that the parallel version does the following:
- In parallel, compiling and pickling the regexes, ~2 secs
- In serial, un-pickling, and therefore recompiling them all, ~6.5 secs
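The un-pickling guess can be checked by looking at how a compiled pattern is reduced for pickling. This is a minimal sketch, assuming CPython's standard `re` module, which registers its reducer in the `copyreg` dispatch table (`copy_reg` on Python 2); `pickle_recipe` is a helper name made up for illustration.

```python
import pickle
import re

try:
    import copyreg                # Python 3
except ImportError:
    import copy_reg as copyreg    # Python 2

def pickle_recipe(compiled):
    # Hypothetical helper: return the (callable, args) pair that pickle
    # stores for a compiled pattern, looked up in the dispatch table.
    reducer = copyreg.dispatch_table[type(compiled)]
    return reducer(compiled)

pattern = re.compile(r"\d+-[a-z]+", re.IGNORECASE)

# Only the pattern source and the flags are stored, not the compiled program.
rebuild, args = pickle_recipe(pattern)

# Loading the pickle calls rebuild(*args), i.e. it compiles from the source again.
restored = pickle.loads(pickle.dumps(pattern))
```

If that holds, each regex returned by a worker has to be rebuilt from its source string in the parent process, which would explain why the parallel version cannot beat the serial one.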
Together with the overhead for starting and stopping the processes, multiprocessing on 4 processors is more than 25% slower than serial.
I also tried divvying up the list of regexes into 4 sub-lists, and pool.map-ing the sublists, rather than the individual expressions. This gave a small performance boost, but I still couldn't get better than ~25% slower than serial.
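The sub-list variant looked roughly like the sketch below. The names `split_into_chunks`, `compile_chunk` and `parallel_compile_chunked` are illustrative rather than the exact code I ran, and the default of 4 workers is an assumption matching my test machine.

```python
import re
import multiprocessing as mp

def split_into_chunks(items, n_chunks):
    # Split items into at most n_chunks contiguous sub-lists of similar size.
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def compile_chunk(strings):
    # Runs in a worker: compile one whole sub-list per task.
    return [re.compile(s) for s in strings]

def parallel_compile_chunked(strings, n_workers=4):
    # One task per sub-list instead of one task per expression.
    chunks = split_into_chunks(strings, n_workers)
    pool = mp.Pool(n_workers)
    try:
        nested = pool.map(compile_chunk, chunks)
    finally:
        pool.close()
        pool.join()
    # Flatten the per-chunk results back into a single list.
    return [rx for chunk in nested for rx in chunk]
```

This cuts the number of tasks sent to the pool, but the compiled regexes still have to be pickled in the workers and rebuilt in the parent. On platforms that spawn rather than fork worker processes, `parallel_compile_chunked` should be called from under an `if __name__ == "__main__":` guard.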
Is there any way to compile faster than serial?
EDIT: Corrected the running time of the regex compilation.
I also tried using threading, but due to the GIL, only one processor was used. It was slightly better than multiprocessing (130 secs vs. 136 secs), but still slower than serial (113 secs).
EDIT 2: I realized that some regexes were likely to be duplicated, so I added a dict for caching them. This shaved off ~30 secs. I'm still interested in parallelizing, though. The target machine has 8 processors, which, if compilation scaled well across them, would reduce compilation time to ~15 secs.
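For reference, a minimal sketch of that kind of dict cache, assuming the duplicated regexes are exact string duplicates; `compile_unique` is an illustrative name, not the function from my program.

```python
import re

def compile_unique(strings):
    # Compile each distinct pattern string once and reuse the compiled
    # object for every later duplicate, preserving the input order.
    cache = {}
    compiled = []
    for s in strings:
        if s not in cache:
            cache[s] = re.compile(s)
        compiled.append(cache[s])
    return compiled

patterns = ["a+b", "c*d", "a+b"]
regexes = compile_unique(patterns)
```

Here the first and third entries of `regexes` are the same compiled object, so the duplicate costs a dict lookup instead of a second compilation.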