I'm trying to calculate the pointwise mutual information (PMI) for unigrams in a dataset, and I need to speed up a loop over NumPy ndarrays. The code below uses an already created matrix C with 6018 rows and 27721 columns to compute the PMI matrix. Any ideas on how to speed up the for loop? It currently takes almost 4 hours to run. I read in another post about using Cython, but are there any alternatives? Thanks in advance for your help.

import numpy as np
from math import log10

# MAKE MUTUAL INFO MATRIX, PMI
print("Creating mutual information matrix")
N = C.sum()
invN = 1.0 / N  # replaced divide by N with multiply by invN in formula below
PMI = np.zeros(C.shape)
row, col = C.shape
for r in range(row):  # u
    for c in range(r):  # w
        if C[r, c] != 0:  # if they co-occur
            # number of reviews where u and w co-occur, scaled by invN
            numerator = C[r, c] * invN
            denominator = (sum(C[:, c]) * invN) * (sum(C[r]) * invN)
            pmi = log10(numerator * (1 / denominator))
            PMI[r, c] = pmi
            PMI[c, r] = pmi

1 Answer

This should run much faster if you drop the loops and take advantage of NumPy's vectorisation instead.

I haven't tried it, but something like this should work:

numerator = C * invN
denominator = (np.sum(C, axis=0) * invN) * (np.sum(C, axis=1)[:,None] * invN)
pmi = np.log10(numerator * (1 / denominator))

Note that numerator, denominator, and pmi will each be arrays of values rather than scalars.
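To see why the [:,None] matters, here is a small sketch of the broadcasting step on a made-up 2x3 matrix (the values and the names col_sums, row_sums and outer are only for illustration, not from the question):

```python
import numpy as np

# Toy 2x3 count matrix standing in for C (made-up values).
C = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

col_sums = np.sum(C, axis=0)           # shape (3,): one total per column
row_sums = np.sum(C, axis=1)[:, None]  # shape (2, 1): one total per row, as a column

# Broadcasting (2, 1) against (3,) gives a (2, 3) array whose [r, c]
# entry is row_sums[r] * col_sums[c], the same product the loop forms
# for each (r, c) pair before scaling by invN.
outer = row_sums * col_sums
```

Without the [:,None], the two sums would have shapes (2,) and (3,), and the multiplication would fail with a broadcasting error for a non-square C.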

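Also, you will have to deal with the C == 0 case somehow, since log10(0) is -inf. Boolean indexing returns a flattened 1-D array, so write the results back through the same mask to keep the matrix shape:

pmi = np.zeros(C.shape)
pmi[C != 0] = np.log10(numerator[C != 0] / denominator[C != 0])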

As Blckknght pointed out in the comments, you could leave out some of the invN multiplications:

denominator = np.sum(C, axis=0) * np.sum(C, axis=1)[:,None] * invN
pmi = np.log10(C * (1 / denominator))
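Putting the pieces together, a self-contained version might look like the sketch below. It assumes C holds non-negative counts, and the function name pmi_matrix and the toy matrix are hypothetical, chosen only for this example:

```python
import numpy as np

def pmi_matrix(C):
    """Return log10 PMI for every cell of C, leaving 0 where C is 0.

    Hypothetical helper; assumes C contains non-negative counts.
    """
    C = np.asarray(C, dtype=float)
    N = C.sum()
    row_sums = C.sum(axis=1)[:, None]  # shape (rows, 1)
    col_sums = C.sum(axis=0)[None, :]  # shape (1, cols)
    # Co-occurrence count expected if rows and columns were independent.
    expected = row_sums * col_sums / N
    pmi = np.zeros_like(C)
    mask = C != 0                      # only take the log where pairs co-occur
    pmi[mask] = np.log10(C[mask] / expected[mask])
    return pmi

# Toy symmetric co-occurrence matrix (made-up values).
C = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
PMI = pmi_matrix(C)
```

Unlike the original loop, this fills every cell, including the diagonal and any columns beyond the square part. For a 6018 x 27721 matrix each full-size float64 temporary takes roughly 1.3 GB, so memory may become the limit before speed does.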

1 Comment

This is the approach I was going to suggest too. I think that to make the denominator array come out with the right dimensions you need to change the axis of one of the sums, perhaps with a slice like [:,None]. Also, a bunch of the invN multiplications can probably be left out, since they mostly cancel in the final division (I think there's one factor left in the denominator).
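The cancellation can be checked by writing out the formula from the question's loop. Here C_{rc} is an entry of C and N = C.sum(), so invN = 1/N (the subscript notation is introduced only for this sketch):

```latex
\mathrm{PMI}_{rc}
  = \log_{10} \frac{C_{rc} \cdot \frac{1}{N}}
                   {\left(\sum_i C_{ic} \cdot \frac{1}{N}\right)
                    \left(\sum_j C_{rj} \cdot \frac{1}{N}\right)}
  = \log_{10} \frac{C_{rc}}
                   {\left(\sum_i C_{ic}\right)\left(\sum_j C_{rj}\right) \cdot \frac{1}{N}}
  = \log_{10} \frac{N \, C_{rc}}
                   {\left(\sum_i C_{ic}\right)\left(\sum_j C_{rj}\right)}
```

One factor of 1/N in the numerator cancels one of the two in the denominator, which leaves a single invN in the denominator.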
