
I've been trying to speed up my code below, which looks up an index, gets a string from the list "name", and counts the number of exact whole-word matches it has in two columns of data.

This process has been very slow. I read about replacing for loops when using NumPy arrays, but I'm not sure how to create a vectorized version when regex matching is involved.

import re
import numpy as np

x = np.empty([38000, 8000])
y = np.empty([38000, 8000])
for i in range(38000):
    for j in range(8000):
        # count whole-word occurrences of name[index[j]] in columns 1 and 2
        x[i, j] = len(re.findall(r'\b' + name[index[j]] + r'\b', data[i][1]))
        y[i, j] = len(re.findall(r'\b' + name[index[j]] + r'\b', data[i][2]))

Any insight is greatly appreciated,

  • what do name, index and data look like? Commented Apr 24, 2015 at 4:17
  • Unless name itself contains regexs, it looks like you could first filter the possible candidate cells with simple string matching and then run regexs against the candidate cells... Commented Apr 24, 2015 at 4:28
  • index = [0, 123, 454, ...] is 1-by-8000 and holds the indices of name that are deemed interesting; name = ['dog', 'cat', ...] is a large list of strings (1-by-50000), of which we only want the indexed values. Commented Apr 24, 2015 at 4:53
  • str.count or np.char.count are faster than re.findall if you don't need the \b separation. Commented Apr 24, 2015 at 15:30
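To illustrate the str.count / np.char.count suggestion above, here is a small sketch of the trade-off: plain substring counting is faster but, without the \b anchors, it also matches inside longer words.

```python
import re
import numpy as np

s = 'the cat sat in the catalog'
# str.count matches raw substrings, so 'cat' also hits 'catalog':
print(s.count('cat'))                  # 2
# the \b-delimited regex counts whole words only:
print(len(re.findall(r'\bcat\b', s)))  # 1
# np.char.count applies str.count across a whole array at once:
arr = np.array([s, 'cat cat dog'])
print(np.char.count(arr, 'cat'))       # [2 2]
```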

2 Answers


Vectorizing won't help you much here, but avoiding repeated work will:

patterns = [re.compile(r'\b' + name[idx] + r'\b') for idx in index]
for i, row in enumerate(data):
    for j, patt in enumerate(patterns):
        x[i, j] = len(patt.findall(row[1]))
        y[i, j] = len(patt.findall(row[2]))
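For completeness, a self-contained version of this pre-compiled loop on toy data (the name, index, and data values here are made up; the question's real lists are much larger):

```python
import re
import numpy as np

name = ['dog', 'cat', 'bird']          # made-up stand-ins
index = [0, 1, 2]
data = [(0, 'the dog saw a cat', 'dog dog'),
        (1, 'a bird and a dog', 'cat')]

# compile each pattern once instead of once per (i, j) pair
patterns = [re.compile(r'\b' + name[idx] + r'\b') for idx in index]
x = np.empty([len(data), len(patterns)])
y = np.empty([len(data), len(patterns)])
for i, row in enumerate(data):
    for j, patt in enumerate(patterns):
        x[i, j] = len(patt.findall(row[1]))
        y[i, j] = len(patt.findall(row[2]))
print(x)  # [[1. 1. 0.]
          #  [1. 0. 1.]]
```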

5 Comments

and if len(name) < len(index), compile the patterns before indexing.
You can compile the patterns at the start.
It's my understanding that the re module keeps a cache of compiled patterns, so pre-compiling might not help much.
I get a 40% speed-up with `[len(pat.findall(i)) for i in x1]` compared to `[len(re.findall(r'\b' + 'name' + r'\b', i)) for i in x1]`.
Interesting, maybe the large number of patterns overflows the re pattern cache? I'll update my answer.
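The cache-overflow guess in the comment above can be sketched as follows: CPython's re module caches compiled patterns internally (512 entries via the private re._MAXCACHE, an implementation detail), so with 8000 distinct patterns the cache is evicted constantly and re.findall recompiles on each call, which is why explicit re.compile wins here.

```python
import re

words = ['w%d' % i for i in range(5)]   # tiny stand-in for `name`
# compile once up front, bypassing re's internal pattern cache
patterns = [re.compile(r'\b%s\b' % w) for w in words]
text = 'w0 w1 w1 w4'
counts = [len(p.findall(text)) for p in patterns]
print(counts)  # [1, 2, 0, 0, 1]
```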

Vectorizing a function...

First define a function, then vectorize it:

import re
import numpy as np

def count_words(word, sentence):
    # count whole-word occurrences of `word` in `sentence`
    return len(re.findall(r'\b%s\b' % word, sentence))

vcount_words = np.vectorize(count_words)

Then apply it (here names is an 8000-element array and data is a 38000×2 matrix):

vcount_words(names, data[:,:1])

A smaller example so it fits here (5×3):

names = ['aaa', 'bbb', 'ccc']
data = np.array([['aaa aaa aaa bbb dd', 'ee ff ccc ee ee dd bbb ee'],
                 ['aaa ccc dd aaa ff ff ee', 'dd ccc ee ccc dd ee ff'],
                 ['ee aaa ff ccc ff ee aaa dd bbb', 'aaa'],
                 ['ff ee ccc ccc', 'dd'],
                 ['ccc ee aaa dd', 'ccc bbb ee aaa bbb ff ee']])
x = vcount_words(names, data[:,:1])
# returns >>>
array([[3, 1, 0],
       [2, 0, 1],
       [2, 1, 1],
       [0, 0, 2],
       [1, 0, 1]])

Adjust accordingly for your data. This could be sped up by not recompiling the regex in the function (pre-compile the patterns and index into them). I would also investigate numba whenever you are looping over NumPy arrays with for loops.
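The pre-compile suggestion could look like this (a sketch, not the answer's code: count_words_compiled and the index-based lookup are hypothetical names introduced here):

```python
import re
import numpy as np

names = ['aaa', 'bbb', 'ccc']
# compile each pattern once, up front
patterns = [re.compile(r'\b%s\b' % n) for n in names]

def count_words_compiled(pat_idx, sentence):
    # look up a pre-compiled pattern by index instead of rebuilding it
    return len(patterns[pat_idx].findall(sentence))

vcount = np.vectorize(count_words_compiled)

data = np.array([['aaa aaa bbb', 'ccc'],
                 ['bbb bbb', 'aaa ccc']])
# broadcast pattern indices (3,) against the first column (2, 1)
result = vcount(np.arange(len(patterns)), data[:, :1])
print(result)
# [[2 1 0]
#  [0 2 0]]
```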

But this demonstrates the vectorize-a-function approach; you've already accepted an answer, and it's late.

2 Comments

The vectorize function does not speed up the code - it just wraps it in a way that facilitates broadcasting and other array tricks.
There is a np.char module that applies string operations to arrays of strings. But it doesn't handle the fancier search patterns that re does.
