
I am trying to optimize some Python code to speed up some matrix operations. My code is similar to the following (my real dataset is also similar to 'gps'):

import numpy as np

def innerProd(mat1, mat2):
    # sum of all entries of mat1 . mat2 . mat1
    return float(np.sum(np.dot(np.dot(mat1, mat2), mat1)))

gps = [np.random.rand(50, 50) for i in xrange(1000)]
ips = np.zeros((len(gps), len(gps)), dtype='float32')

# ips is symmetric, so only the lower triangle is computed and then mirrored
for i in xrange(len(gps)):
    for j in xrange(0, i + 1):
        ips[i, j] = innerProd(gps[i], gps[j])
        ips[j, i] = ips[i, j]
    print "Inner product matrix: %3.0f %% done (%d of %d)" % \
          (((i + 1)**2.) / (len(gps)**2.) * 100, i, len(gps))
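For scale, each call to innerProd does two 50x50 matrix products and a sum. A rough per-call timing sketch (using the standard library's timeit on two random 50x50 matrices; the exact figure depends on the machine and on the BLAS that NumPy links against) would be:

import timeit
import numpy as np

A = np.random.rand(50, 50)
B = np.random.rand(50, 50)

# time the body of innerProd on a single pair of matrices
t = timeit.timeit(lambda: float(np.sum(np.dot(np.dot(A, B), A))), number=10000)
print("%.2f microseconds per call" % (t / 10000 * 1e6))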

What I would like to understand is why the program runs fast during the first iterations and then slows down as it iterates further. I know the question might be a bit naive, but I really want a clearer idea of what is happening before I attempt anything else. I have already implemented my function in Fortran (keeping any for loops within the Fortran realm) and used f2py to build a dynamic library so I can call the function from Python. This is the new Python code:

import numpy as np
import myfortranInnProd as fip

gps = [np.random.rand(50, 50) for i in xrange(1000)]

# the double loop now lives entirely inside the Fortran routine wrapped with f2py
ips = fip.innerProd(gps)

Unfortunately, I found out (surprisingly) that my Fortran/Python version runs 1.5 to 2 times slower than the first version (it is important to mention that I used MATMUL() in the Fortran implementation). I have been googling around for a while, and I believe this slowdown has something to do with memory bandwidth, memory allocation, or caching, given the large datasets, but I am not very sure what is really happening behind the scenes or how I could improve the performance. I have run the code on both a small Intel Atom with 2 GB of RAM and a 4-core Intel Xeon with 8 GB (with a correspondingly scaled dataset, of course), and the slowdown behaviour is the same.
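A minimal profiling sketch (using the standard library's cProfile and pstats, with the matrix count reduced so it finishes quickly) to see where the time actually goes might look like this:

import cProfile
import pstats
import numpy as np

def innerProd(mat1, mat2):
    return float(np.sum(np.dot(np.dot(mat1, mat2), mat1)))

def build_ips(gps):
    # same double loop as above, wrapped in a function so it can be profiled
    ips = np.zeros((len(gps), len(gps)), dtype='float32')
    for i in range(len(gps)):
        for j in range(i + 1):
            ips[i, j] = innerProd(gps[i], gps[j])
            ips[j, i] = ips[i, j]
    return ips

gps = [np.random.rand(50, 50) for _ in range(100)]  # 100 matrices instead of 1000

cProfile.run('build_ips(gps)', 'ips.prof')
pstats.Stats('ips.prof').sort_stats('cumulative').print_stats(10)

As one of the comments below points out, essentially all of the time shows up under np.dot.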

I just need to understand why this slowdown happens. Would it do any good if I implemented the function in C, or tried to make it run on a GPU? Any other ideas on how to improve it? Thanks in advance.

  • I ran this under the profiler and found that most of the time (307 seconds) was spent in np.dot; you can profile any file with python -m profile filename.py. Commented May 1, 2011 at 2:30
  • The inner loop takes the same time if you divide it by (i+1), so it behaves as it should. Commented May 1, 2011 at 4:05

2 Answers


At the risk of stating the obvious, the number of executions of the inner loop will grow each time you complete an execution of the outer loop. When i is 0, the inner loop will only be executed once, but when i is 100, it will be executed 101 times. Could this explain your observations, or do you mean that each execution of the inner loop itself is getting slower over time?
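To make the growth concrete, a small counting sketch (same loop bounds as in the question, just counting iterations instead of computing inner products) shows how the work per outer pass, and in total, adds up:

N = 1000                              # number of matrices in the question

per_pass = [i + 1 for i in range(N)]  # inner-loop executions on outer pass i
total = sum(per_pass)                 # N * (N + 1) / 2

print(per_pass[:5])    # [1, 2, 3, 4, 5]   -> the first passes are very cheap
print(per_pass[-1])    # 1000              -> the last pass computes 1000 inner products
print(total)           # 500500 inner products in total, each at roughly constant cost

By the time i reaches the last rows, each outer pass does a thousand times more work than the first one, even though every individual inner product costs about the same.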


1 Comment

You are right, it was quite obvious :D. I should have figured it out myself. Anyway, any ideas on how to improve its performance? Do you think a C implementation could do the job? As I understand it, NumPy is already implemented at a low level, so I don't know whether it would make any difference. What do you think? I believe I will try to implement it using PyCUDA.

The number of executions of the inner for loop depends on the value of i, the index of the outer for loop. Since you display your progress message each time the inner loop finishes, it gets printed less and less often as i grows. (Note, however, that the reported percentage increases regularly.)
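A quick way to check this on a smaller dataset (a sketch along the lines of the comments under the question: time each outer pass and divide by the number of inner products it performs) would be:

import time
import numpy as np

def innerProd(mat1, mat2):
    return float(np.sum(np.dot(np.dot(mat1, mat2), mat1)))

gps = [np.random.rand(50, 50) for _ in range(200)]  # 200 matrices to keep the run short
ips = np.zeros((len(gps), len(gps)), dtype='float32')

for i in range(len(gps)):
    t0 = time.time()
    for j in range(i + 1):
        ips[i, j] = innerProd(gps[i], gps[j])
        ips[j, i] = ips[i, j]
    elapsed = time.time() - t0
    if (i + 1) % 50 == 0:
        # elapsed grows with i, but elapsed / (i + 1) stays roughly constant
        print("pass %4d: %.4f s total, %.6f s per inner product" % (i, elapsed, elapsed / (i + 1)))

The total time per pass grows roughly linearly with i, while the per-product figure stays flat, which matches the explanation above.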

