Improving for loop speed for numpy.ndarray

Question

I'm trying to calculate the mutual information for unigrams in a dataset, and I need to speed up a loop over a numpy ndarray. The code below uses an existing matrix 'C' with 6018 rows and 27721 columns to compute the PMI matrix. Any ideas on how to improve the speed of the for loop (it currently takes almost 4 hours to run)? I read in another post about using Cython, but are there any alternatives? Thanks in advance for your help.

# MAKE MUTUAL INFO MATRIX, PMI
print "Creating mutual information matrix"
N = C.sum()
invN = 1.0/N  # replaced divide by N with multiply by invN in formula below; 1.0 forces float division if C holds integers
PMI = np.zeros((C.shape))
row, col = C.shape
for r in xrange(row):  # u
    for c in xrange(r):  # w
        if C[r,c]!=0:  # if they co-occur
            numerator = C[r,c]*invN  # getting number of reviews where u and w co-occur and multiply by invN (numerator)
            denominator = (sum(C[:,c])*invN) * (sum(C[r])*invN)
            pmi = log10(numerator*(1/denominator))
            PMI[r,c] = pmi
            PMI[c,r] = pmi

grc · Accepted Answer · 2015-01-26 01:44:47Z

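You should see a large speedup if you drop the loops and use NumPy's vectorised operations instead. The loop also recomputes the row and column sums with Python's built-in sum for every nonzero entry, which is likely where much of the time goes.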

I haven't tried it, but something like this should work:

numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:,None] * invN)
pmi = np.log10(numerator * (1 / denominator))

Note that numerator, denominator, and pmi will each be arrays of values rather than scalars.
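As a rough sketch of how the shapes work out, here is the same calculation on a small made-up matrix (the toy values and the names col_totals and row_totals are only for illustration, and np.errstate is just one way to silence the log-of-zero warning):

```python
import numpy as np

# Toy co-occurrence counts (made-up data standing in for the real C).
C = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 3.0]])

N = C.sum()
invN = 1.0 / N

col_totals = np.sum(C, axis=0)           # shape (3,): one total per column
row_totals = np.sum(C, axis=1)[:, None]  # shape (2, 1): one total per row

numerator = C * invN                                     # shape (2, 3)
denominator = (col_totals * invN) * (row_totals * invN)  # broadcasts to (2, 3)

# Zero counts give log10(0) = -inf, so suppress that warning for now.
with np.errstate(divide="ignore"):
    pmi = np.log10(numerator / denominator)

print(pmi.shape)
```

Every array here has the same shape as C, so the whole matrix is computed in a handful of array operations instead of one Python iteration per entry.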

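Also, you will have to deal with the C == 0 case somehow, since log10(0) is -inf. One option is to compute only the nonzero entries; note that boolean indexing like this returns a flat 1-D array of those entries rather than a matrix shaped like C: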

pmi = np.log10(numerator[numerator != 0] * (1 / denominator[numerator != 0]))
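To keep the result in the shape of C, one possibility is to write the masked values back into a preallocated matrix. This sketch assumes that zero counts should map to a PMI of 0, as they do in the question's loop where PMI starts out as np.zeros; the toy matrix is made up.

```python
import numpy as np

# Toy co-occurrence counts (made-up data standing in for the real C).
C = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 3.0]])

invN = 1.0 / C.sum()
numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:, None] * invN)

# Boolean mask of the entries where the two items co-occur.
mask = C != 0

# Start from zeros, then fill only the masked positions.
# Entries where C == 0 keep the value 0 (an assumption, see above).
PMI = np.zeros(C.shape)
PMI[mask] = np.log10(numerator[mask] / denominator[mask])

print(PMI)
```

Because the logarithm is only evaluated where mask is True, no -inf values or warnings are produced.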

As Blckknght pointed out in the comments, you could leave out some of the invN multiplications:

denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:,None] * invN
pmi = np.log10(C * (1 / denominator))
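If you want to convince yourself that the simplified version matches the loop, a quick check on a small matrix might look like the sketch below. It is written for Python 3, the test matrix is made up, and the reference loop visits every entry of C rather than mirroring the lower triangle as the question's loop does, so treat it as a sanity check of the formula only.

```python
import numpy as np

# Small deterministic test matrix with some zeros and no empty row or column.
C = (np.arange(54).reshape(6, 9) % 4).astype(float)

N = C.sum()
invN = 1.0 / N

# Reference: explicit loop using the question's formula on every entry.
expected = np.zeros(C.shape)
rows, cols = C.shape
for r in range(rows):
    for c in range(cols):
        if C[r, c] != 0:
            num = C[r, c] * invN
            den = (C[:, c].sum() * invN) * (C[r].sum() * invN)
            expected[r, c] = np.log10(num / den)

# Vectorised: only one invN factor survives in the denominator.
denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:, None] * invN
mask = C != 0
PMI = np.zeros(C.shape)
PMI[mask] = np.log10(C[mask] / denominator[mask])

print(np.allclose(PMI, expected))
```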

edited Jan 26, 2015 at 1:44

answered Jan 26, 2015 at 1:33

grc


1 Comment

Blckknght Over a year ago

This is the approach I was going to suggest too. To make the denominator array come out with the right dimensions, I think you need to change the axis of one of the sums, perhaps with a slice like [:,None]. Also, a bunch of the invN multiplications can probably be left out, since they mostly cancel in the final division (I think there's one factor left in the denominator).
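To illustrate the point about dimensions, here is a small sketch with a made-up 2x3 matrix. The names col_totals, row_totals, and broadcast_failed are only for this example.

```python
import numpy as np

# Made-up 2x3 matrix: [[1, 2, 3], [4, 5, 6]].
C = np.arange(1.0, 7.0).reshape(2, 3)

col_totals = np.sum(C, axis=0)  # shape (3,)
row_totals = np.sum(C, axis=1)  # shape (2,)

# Shapes (3,) and (2,) cannot be broadcast against each other.
try:
    col_totals * row_totals
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Turning the row totals into a column of shape (2, 1) lets NumPy
# broadcast the product to C's shape (2, 3).
outer = col_totals * row_totals[:, None]

print(broadcast_failed, outer.shape)
```

On a square matrix the un-reshaped product would not fail. It would silently multiply the two vectors element by element, which is a good reason to check the shape of the denominator.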
