
I have a list of ~50,000 strings (titles) and a list of ~150 words to remove from these titles wherever they are found. My code so far is below. The final output should be the list of 50,000 strings with all instances of the 150 words removed. I would like to know the most efficient (performance-wise) way of doing this. My code seems to be running, albeit very slowly.

import re

excludes = GetExcludes()
titles = GetTitles()
titles_alpha = []
titles_excl = []
for k in range(len(titles)):
    #remove all non-alphanumeric characters
    s = re.sub('[^0-9a-zA-Z]+', ' ', titles[k])

    #remove extra white space
    s = re.sub(r'\s+', ' ', s).strip()

    #lowercase
    s = s.lower()

    titles_alpha.append(s)

    #remove any excluded words
    for i in range(len(excludes)):
        titles_excl.append(titles_alpha[k].replace(excludes[i], ''))

print titles_excl
    This doesn't look right. You're appending titles_alpha[k].replace(...) to titles_excl once for each item in excludes. Meaning titles_excl will end up with 50000*150 items at the end, rather than just 50000. I suggest testing your code with smaller input - say, 10 titles and 3 exclusions - to confirm that it works as desired, before you run it on the big data. Commented Nov 23, 2015 at 13:24
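
In other words, the fix is to build one cleaned string per title, removing every excluded word from it, and then append it once. A minimal corrected sketch of that loop, using the names from the question:

for k in range(len(titles_alpha)):
    s = titles_alpha[k]
    #strip every excluded word from this one title
    for word in excludes:
        s = s.replace(word, '')
    #append exactly once per title, so len(titles_excl) == len(titles)
    titles_excl.append(s)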

2 Answers


A lot of the performance overhead of regular expressions comes from compiling them. You should move the compilation of the regular expressions out of the loop.

This should give you a considerable improvement:

pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')
for k in range(len(titles)):
    #remove all non-alphanumeric characters
    s = pattern1.sub(' ', titles[k])

    #remove extra white space
    s = pattern2.sub(' ', s).strip()
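
For example, applied to a single messy title (the sample string here is made up):

import re

pattern1 = re.compile('[^0-9a-zA-Z]+')
pattern2 = re.compile(r'\s+')

s = pattern1.sub(' ', 'Some *Title* -- 2015!!')  # 'Some Title 2015 '
s = pattern2.sub(' ', s).strip().lower()         # 'some title 2015'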

Some tests with wordlist.txt from here:

import re
def noncompiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    #append some junk characters so there is something to clean up
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = re.sub('[^0-9a-zA-Z]+', ' ', titles[k])

        #remove extra white space
        s = re.sub(r'\s+', ' ', s).strip()

def compiled():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = pattern1.sub(' ', titles[k])

        #remove extra white space
        s = pattern2.sub(' ', s).strip()



In [2]: %timeit noncompiled()
1 loops, best of 3: 292 ms per loop

In [3]: %timeit compiled()
10 loops, best of 3: 176 ms per loop

To remove the "bad words" in your excludes list, you should, as @zsquare suggested, create a single joined regex, which is most likely the fastest you can get.

def with_excludes():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    pattern1 = re.compile('[^0-9a-zA-Z]+')
    pattern2 = re.compile(r'\s+')
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    excludes_regex = re.compile('|'.join(excludes))
    for k in range(len(titles)):
        #remove all non-alphanumeric characters
        s = pattern1.sub('', titles[k])

        #remove extra white space
        s = pattern2.sub('', s)
        #remove bad words
        s = excludes_regex.sub('', s)

In [2]: %timeit with_excludes()
1 loops, best of 3: 251 ms per loop

You can take this approach one step further by compiling a single master regex:

def master():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    nonalpha = '[^0-9a-zA-Z]+'
    whitespace = r'\s+'
    badwords = '|'.join(excludes)
    master_regex = re.compile('|'.join([nonalpha, whitespace, badwords]))

    for k in range(len(titles)):
        #remove non-alphanumeric characters, whitespace and bad words in one pass
        s = master_regex.sub('', titles[k])

In [2]: %timeit master()
10 loops, best of 3: 148 ms per loop
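
One caveat with the master regex: because every match is replaced with '', the whitespace between words disappears from the output as well, unlike in the two-step version that substitutes ' ' and strips. A quick check, with a made-up title:

import re

excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
master_regex = re.compile('|'.join(['[^0-9a-zA-Z]+', r'\s+', '|'.join(excludes)]))

print(master_regex.sub('', 'I love my boo!!'))  # prints 'Imy'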

You can gain some more speed by avoiding the explicit loop in Python and using a list comprehension instead:

result = [master_regex.sub('', item) for item in titles]


In [4]: %timeit list_comp()
10 loops, best of 3: 139 ms per loop
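
For reference, list_comp() is not shown in full above; presumably it is master() with the final for loop replaced by the comprehension. A rough sketch:

import re

def list_comp():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]
    excludes = ["shit", "poo", "ass", "love", "boo", "ch"]
    master_regex = re.compile('|'.join(['[^0-9a-zA-Z]+', r'\s+', '|'.join(excludes)]))
    #one pass over all titles, no explicit index loop
    return [master_regex.sub('', item) for item in titles]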

Note: the data generation step alone, which is included in all of the timings above:

def baseline():
    with open("wordlist.txt", 'r') as f:
        titles = f.readlines()
    titles = ["".join([title, nonalpha]) for title in titles for nonalpha in "!@#$%"]

In [2]: %timeit baseline()
10 loops, best of 3: 24.8 ms per loop



One way to do this would be to dynamically create a regex from your excluded words and remove every match from each title.

Something like:

excludes_regex = re.compile('|'.join(excludes))
titles_excl = []
for title in titles:
    titles_excl.append(excludes_regex.sub('', title))
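
One caveat worth noting: if any excluded word contains a regex metacharacter such as . or +, joining the raw strings will produce a wrong pattern. Escaping each word first is safer, and word boundaries can be added if substring matches are unwanted. A sketch, reusing the excludes list from the other answer:

import re

excludes = ["shit", "poo", "ass", "love", "boo", "ch"]

#escape each word so any regex metacharacters are treated literally
excludes_regex = re.compile('|'.join(re.escape(word) for word in excludes))

#optional: \b word boundaries keep e.g. 'ass' from matching inside 'class'
word_regex = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(w) for w in excludes))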

2 Comments

You should flip the arguments of sub. It should be repl,string. docs.python.org/2/library/re.html#re.RegexObject.sub
Woops! Fixed! Thanks!
