
I have the following function in pure python:

import numpy as np

def subtractPython(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount, yAxisCount, xAxisCount)
    results = np.zeros(shape)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results

I tried to cythonize it this way:

import numpy as np
cimport numpy as np

DTYPE = np.int_   # note: np.int was removed in NumPy 1.24; np.int_ matches np.int_t
ctypedef np.int_t DTYPE_t

def subtractPython(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef int xAxisCount = a.shape[0]
    cdef int yAxisCount = a.shape[1]

    cdef np.ndarray[DTYPE_t, ndim=3] results = np.zeros([xAxisCount, yAxisCount, xAxisCount], dtype=DTYPE)

    cdef int lenB = len(b)

    cdef np.ndarray[DTYPE_t, ndim=2] subtracted
    for index in range(lenB):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results

However, I'm not seeing any speedup. Is there something I'm missing, or can this process not be sped up?

EDIT -> I've realized that I'm not actually cythonizing the subtraction algorithm in the above code. I've managed to cythonize it, but it has the exact same runtime as a - b[:, None], so I guess this is the maximum speed of this operation.

This is basically a - b[:, None] (it has the same runtime):

%%cython

import numpy as np
cimport numpy as np


DTYPE = np.int_   # note: np.int was removed in NumPy 1.24; np.int_ matches np.int_t
ctypedef np.int_t DTYPE_t

cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
@cython.wraparound(False)  # turn off negative index wrapping for entire function
def subtract(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef np.ndarray[DTYPE_t, ndim=3] result = np.zeros([b.shape[0], a.shape[0], a.shape[1]], dtype=DTYPE)

    cdef int lenB = b.shape[0]
    cdef int lenA = a.shape[0]
    cdef int lenColB = b.shape[1]

    cdef int rowA, rowB, column

    for rowB in range(lenB):
        for rowA in range(lenA):
            for column in range(lenColB):
                result[rowB, rowA, column] = a[rowA, column] - b[rowB, column]
    return result
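For reference, the triple loop above computes the same array as a broadcast subtraction; a quick pure-NumPy check of the equivalence, with small made-up input shapes:

```python
import numpy as np

# Small example inputs (shapes chosen arbitrarily for illustration)
a = np.arange(6).reshape(2, 3)        # shape (2, 3)
b = np.arange(6, 18).reshape(4, 3)    # shape (4, 3)

# The explicit triple loop, as in the Cython version above
result = np.zeros((b.shape[0], a.shape[0], a.shape[1]), dtype=a.dtype)
for rowB in range(b.shape[0]):
    for rowA in range(a.shape[0]):
        for column in range(a.shape[1]):
            result[rowB, rowA, column] = a[rowA, column] - b[rowB, column]

# Broadcasting produces the identical (lenB, lenA, lenColB) array
assert np.array_equal(result, a - b[:, None])
```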
  • Is it necessary to cythonize the function? If you are using NumPy you should vectorize your code. I think that results = a[:,:,np.newaxis]-b gets the same output. Commented Feb 25, 2018 at 15:09
  • It does get the same output, but it's too slow. Commented Feb 25, 2018 at 15:10
  • You might need a cdef in the loop. Add one for your subtracted variable. Commented Feb 25, 2018 at 15:10
  • There are a few small things you can do (e.g. set a type for index - it's possible Cython can't guess it). However, this might just be how long the calculation takes. Commented Feb 25, 2018 at 15:19
  • See this bit in the Cython docs: "Gotcha: This efficient indexing only affects certain index operations, namely those with exactly ndim number of typed integer indices. So if v for instance isn’t typed, then the lookup f[v, w] isn’t optimized. On the other hand this means that you can continue using Python objects for sophisticated dynamic slicing etc. just as when the array is not typed." Commented Feb 25, 2018 at 17:22

1 Answer


When trying to optimize a function, you should always know its bottleneck - otherwise you will spend a lot of time running in the wrong direction.

Let's use your Python function as the baseline (actually I use results = np.zeros(shape, dtype=a.dtype); otherwise your method returns floats, which is probably a bug):

>>> import numpy as np
>>> a=np.random.randint(1,1000,(300,300), dtype=np.int_)
>>> b=np.random.randint(1,1000,(300,300), dtype=np.int_)
>>> %timeit subtractPython(a,b)
274 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The first question we should ask ourselves is: Is this task memory-bound or CPU-bound? Obviously, it is memory-bound - a subtraction costs nothing compared to the required memory reads and writes.

This means that, above all, we have to optimize the memory layout in order to reduce cache misses. As a rule of thumb, our memory accesses should touch one consecutive memory address after another.
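One way to see which axis is the "fast" one is to look at the strides of a C-ordered array (a small sketch, not part of the original answer):

```python
import numpy as np

x = np.zeros((300, 300, 300))
# In C-order (row-major), the last axis has the smallest stride:
# moving along it steps through adjacent elements in memory,
# while moving along the first axis jumps 300*300 elements at a time.
print(x.strides)  # (720000, 2400, 8) for 8-byte floats
```

Writing x[:, :, index] therefore touches elements 2400 bytes apart, while x[index, :, :] fills one contiguous block.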

Is this the case? No: the array results is in C-order, i.e. row-major order, and thus the access

results[:, :, index] = subtracted

isn't consecutive. On the other hand,

results[index, :, :] = subtracted

would be a consecutive access. Let's change the way information is stored in results:

def subtract1(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount,  xAxisCount, yAxisCount) #<=== Change order
    results = np.zeros(shape, dtype=a.dtype)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[index, :, :] = subtracted   #<===== consecutive access
    return results

The timings are now:

>>> %timeit subtract1(a,b)
35.8 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

There are also two more small improvements: we don't have to initialize results with zeros, and we can save some Python overhead - but these give us only about 5%:

def subtract2(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]

    shape = (xAxisCount,  xAxisCount, yAxisCount) 
    results = np.empty(shape, dtype=a.dtype)        #<=== no need for zeros
    for index in range(len(b)):
        results[index, :, :] = (a-b[index])   #<===== less python overhead
    return results

>>> %timeit subtract2(a,b)
34.5 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now this is about a factor of 8 faster than the original version.

You could use Cython to try to speed this up even further - but the task is probably still memory-bound, so don't expect to get it significantly faster - after all, Cython cannot make the memory work faster. However, without proper profiling it is hard to tell how much improvement is possible - I would not be surprised if someone came up with a faster version.
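For completeness, the reordered loop gives exactly the same result as a plain broadcast subtraction, which is another memory-friendly way to write it (a sketch, not part of the original answer):

```python
import numpy as np

def subtract_broadcast(a, b):
    # Equivalent to subtract2: result[index] == a - b[index] for every row of b
    return a[None, :, :] - b[:, None, :]

a = np.random.randint(1, 1000, (300, 300))
b = np.random.randint(1, 1000, (300, 300))

# Reference: the explicit loop, as in subtract2 above
ref = np.empty((b.shape[0], a.shape[0], a.shape[1]), dtype=a.dtype)
for index in range(len(b)):
    ref[index, :, :] = a - b[index]

assert np.array_equal(subtract_broadcast(a, b), ref)
```

Both fill the output in consecutive order, so their runtimes should be in the same ballpark.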


1 Comment

wow, I had no idea that simply rearranging an array would lead to such an improvement!
