Much of the performance overhead of regular expressions comes from compiling them, so you should move the compilation out of the loop.
This should give you a considerable improvement:
pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    # remove all non-alphanumeric characters
    s = pattern1.sub(' ', titles[k])
    # remove extra white space
    s = pattern2.sub(' ', s).strip()
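For example, on a sample title (the example string is mine, not from the original data):

>>> s = pattern1.sub(' ', "Hello, world!!  foo-bar")
>>> pattern2.sub(' ', s).strip()
'Hello world foo bar'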
Some tests, using a wordlist.txt file as input:
import re

def noncompiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = re.sub('[^0-9a-zA-Z]+', ' ', titles[k])
        # remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()
def compiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        # remove extra white space
        s = pattern2.sub('', s)
In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop
In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop
To remove the "bad words" from your excludes list, you should, as @zsquare suggested, create a joined regex; this will most likely be the fastest approach you can get.
def with_excludes():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        # remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])
        # remove extra white space
        s = pattern2.sub('', s)
        # remove bad words
        s = excludes_regex.sub('', s)
In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop
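One caveat, as an aside (this is my addition, not part of the timed code above): a plain '|'.join will also match the excludes inside longer words, and would break on excludes containing regex metacharacters. Escaping the words and adding word boundaries avoids both:

# assumption: we only want whole-word matches, e.g. "ass" should not
# also be stripped out of "class" or "passage"
excludes_regex = re.compile(r'\b(?:' + '|'.join(map(re.escape, excludes)) + r')\b')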
You can take this approach one step further by just compiling a master regex:
def master():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    nonalpha = '[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))
    for k in range(len(titles)):
        # remove non-alphanumerics, whitespace and bad words in one pass
        s = master_regex.sub('', titles[k])
In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
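Note that because the master regex substitutes with '', separators are stripped entirely rather than collapsed to spaces, so the output differs from the two-pattern version above (the example input is mine):

>>> master_regex.sub('', "my title!! with shit in it\n")
'mytitlewithinit'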
You can gain some more speed by avoiding the explicit Python-level loop and using a list comprehension instead:
result = [master_regex.sub('', item) for item in titles]
In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
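For reference, a sketch of the list_comp function being timed (assuming the same setup as master):

def list_comp():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    master_regex = re.compile('|'.join(['[^0-9a-zA-Z]+', r'\s+', '|'.join(excludes)]))
    # same work as master(), but without the explicit for-loop
    return [master_regex.sub('', item) for item in titles]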
Note: the data generation step by itself accounts for a sizable share of the total time:
def baseline():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop
A final note on your original code: it appends the result of titles_alpha[k].replace(...) to titles_excl once for each item in excludes. Meaning titles_excl will end up with 50000*150 items at the end, rather than just 50000. I suggest testing your code with smaller input - say, 10 titles and 3 exclusions - to confirm that it works as desired, before you run it on the big data.
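A minimal sketch of the fix (the names titles_alpha, titles_excl and excludes come from your code; the exact replacement logic is assumed):

titles_excl = []
for title in titles_alpha:
    s = title
    for bad in excludes:
        # strip each excluded word in turn
        s = s.replace(bad, '')
    # append once per title, not once per exclude
    titles_excl.append(s)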