
I've been trying to speed up my code below, which looks up an index, gets a string from the list "name", and counts the number of exact whole-word matches it has in two columns of data.

This process has been very slow. I read about replacing for loops when using NumPy arrays, but I'm not sure how to create a vectorized version when regex matching is involved.

import re
import numpy as np

x = np.empty([38000, 8000])
y = np.empty([38000, 8000])
for i in range(38000):
    for j in range(8000):
        # count whole-word occurrences of name[index[j]] in columns 1 and 2
        x[i, j] = len(re.findall(r'\b' + name[index[j]] + r'\b', data[i][1]))
        y[i, j] = len(re.findall(r'\b' + name[index[j]] + r'\b', data[i][2]))

Any insight is greatly appreciated,

  • what do name, index and data look like? Commented Apr 24, 2015 at 4:17
  • Unless name itself contains regexs, it looks like you could first filter the possible candidate cells with simple string matching and then run regexs against the candidate cells... Commented Apr 24, 2015 at 4:28
  • index = [0, 123, 454, ...] is 1-by-8000 and holds the indices of name that are deemed interesting; name = ['dog', 'cat', ...] is a large list of strings (1-by-50000), of which we only want the indexed values. Commented Apr 24, 2015 at 4:53
  • str.count or np.char.count are faster than re.findall if you don't need the \b separation. Commented Apr 24, 2015 at 15:30
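To illustrate the str.count / np.char.count suggestion above, here is a small sketch of the trade-off: plain substring counting is faster but, without the \b anchors, it also matches inside longer words.

```python
import re
import numpy as np

s = 'the cat sat in the catalog'
# str.count matches raw substrings, so 'cat' also hits 'catalog':
print(s.count('cat'))                  # 2
# the \b-delimited regex counts whole words only:
print(len(re.findall(r'\bcat\b', s)))  # 1
# np.char.count applies str.count across a whole array at once:
arr = np.array([s, 'cat cat dog'])
print(np.char.count(arr, 'cat'))       # [2 2]
```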

2 Answers


Vectorizing won't help you much here, but avoiding repeated work will:

patterns = [re.compile(r'\b' + name[idx] + r'\b') for idx in index]
for i, row in enumerate(data):
    for j, patt in enumerate(patterns):
        x[i, j] = len(patt.findall(row[1]))
        y[i, j] = len(patt.findall(row[2]))
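For completeness, a self-contained version of this pre-compiled loop on toy data (the name, index, and data values here are made up; the question's real lists are much larger):

```python
import re
import numpy as np

name = ['dog', 'cat', 'bird']          # made-up stand-ins
index = [0, 1, 2]
data = [(0, 'the dog saw a cat', 'dog dog'),
        (1, 'a bird and a dog', 'cat')]

# compile each pattern once instead of once per (i, j) pair
patterns = [re.compile(r'\b' + name[idx] + r'\b') for idx in index]
x = np.empty([len(data), len(patterns)])
y = np.empty([len(data), len(patterns)])
for i, row in enumerate(data):
    for j, patt in enumerate(patterns):
        x[i, j] = len(patt.findall(row[1]))
        y[i, j] = len(patt.findall(row[2]))
print(x)  # [[1. 1. 0.]
          #  [1. 0. 1.]]
```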

5 Comments

and if len(name) < len(index), compile the patterns before indexing.
You can compile the patterns at the start.
It's my understanding that the re module keeps a cache of compiled patterns, so pre-compiling might not help much.
I get a 40% speed-up with `[len(pat.findall(i)) for i in x1]` compared to `[len(re.findall(r'\b' + 'name' + r'\b', i)) for i in x1]`.
Interesting, maybe the large number of patterns overflows the re pattern cache? I'll update my answer.
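The cache-overflow guess in the comment above can be sketched as follows: CPython's re module caches compiled patterns internally (512 entries via the private re._MAXCACHE, an implementation detail), so with 8000 distinct patterns the cache is evicted constantly and re.findall recompiles on each call, which is why explicit re.compile wins here.

```python
import re

words = ['w%d' % i for i in range(5)]   # tiny stand-in for `name`
# compile once up front, bypassing re's internal pattern cache
patterns = [re.compile(r'\b%s\b' % w) for w in words]
text = 'w0 w1 w1 w4'
counts = [len(p.findall(text)) for p in patterns]
print(counts)  # [1, 2, 0, 0, 1]
```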

Vectorizing a function...

First define a function, then vectorize it:

import re
import numpy as np

def count_words(word, sentence):
    # count whole-word occurrences of `word` in `sentence`
    return len(re.findall(r'\b%s\b' % word, sentence))

vcount_words = np.vectorize(count_words)

Then apply it (here names is an 8000-element array and data is a 38000×2 matrix):

vcount_words(names, data[:,:1])

A smaller example so it fits here (5×3):

names = ['aaa', 'bbb', 'ccc']
data = np.array([['aaa aaa aaa bbb dd', 'ee ff ccc ee ee dd bbb ee'],
                 ['aaa ccc dd aaa ff ff ee', 'dd ccc ee ccc dd ee ff'],
                 ['ee aaa ff ccc ff ee aaa dd bbb', 'aaa'],
                 ['ff ee ccc ccc', 'dd'],
                 ['ccc ee aaa dd', 'ccc bbb ee aaa bbb ff ee']])
x = vcount_words(names, data[:,:1])
# returns >>>
array([[3, 1, 0],
       [2, 0, 1],
       [2, 1, 1],
       [0, 0, 2],
       [1, 0, 1]])

Adjust accordingly for your data. This could be sped up by not recompiling the regex in the function (pre-compile the patterns and index into them). I would also investigate numba whenever you are looping over NumPy arrays with for loops.
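The pre-compile suggestion could look like this (a sketch, not the answer's code: count_words_compiled and the index-based lookup are hypothetical names introduced here):

```python
import re
import numpy as np

names = ['aaa', 'bbb', 'ccc']
# compile each pattern once, up front
patterns = [re.compile(r'\b%s\b' % n) for n in names]

def count_words_compiled(pat_idx, sentence):
    # look up a pre-compiled pattern by index instead of rebuilding it
    return len(patterns[pat_idx].findall(sentence))

vcount = np.vectorize(count_words_compiled)

data = np.array([['aaa aaa bbb', 'ccc'],
                 ['bbb bbb', 'aaa ccc']])
# broadcast pattern indices (3,) against the first column (2, 1)
result = vcount(np.arange(len(patterns)), data[:, :1])
print(result)
# [[2 1 0]
#  [0 2 0]]
```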

But this demonstrates the vectorize-a-function approach; you've already accepted an answer, and it's late.

2 Comments

The vectorize function does not speed up the code - it just wraps it in a way that facilitates broadcasting and other array tricks.
There is a np.char module that applies string operations to arrays of strings. But it doesn't handle the fancier search patterns that re does.
