
I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).

Here is a deliberately simple example of an implementation in Cython:

%%cython -f -c=-O2 -I./ 

import numpy as np
cimport numpy as np

cimport cython


cdef class Test:
    
    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices
    
    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices
    
    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use the external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef Py_ssize_t n = self.indices.shape[0]

        for idx in range(n):
            # Indirect access: fetch the target coordinates first.
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation over the whole array. """
        cdef Py_ssize_t ix, iy
        cdef Py_ssize_t nx = self.a.shape[0]
        cdef Py_ssize_t ny = self.a.shape[1]

        for ix in range(nx):
            for iy in range(ny):
                # Direct access: the compiler knows the traversal order.
                self.b[ix, iy] = ix * iy * self.a[ix, iy]

with this on the Python side:

import itertools
import numpy as np

N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=np.intp)  # np.intp matches Py_ssize_t on all platforms
test = Test(a, b, indices)
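
As a quick sanity check (an added snippet, not part of the original setup), note that both methods write into b in place through the shared buffer, so their outputs can be compared directly:

test.run1()
b_run1 = b.copy()               # b is updated in place via the memoryview

b[:] = 0
test.run2()
assert np.allclose(b, b_run1)   # both methods produce the same result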

and the results:

%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Why does the Test.run1() method run much slower than the Test.run2() method?

What are the options for keeping a level of performance similar to Test.run2() while still using an external list, array, or any other kind of structure of indices?

2 Answers


Because run1 is significantly more complicated...

  1. run1 has to read from two separate regions of memory (the index array and the data arrays), which almost certainly makes the CPU cache less efficient.
  2. It is fairly trivial for the compiler to work out exactly what order run2 accesses the array elements in. In contrast, run1 could access them in any order. Knowing the order in advance likely allows for significant optimizations.

Your current performance is probably as good as it gets.
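
A quick way to check the access-order point (an added experiment, not part of the answer): shuffle the rows of indices, which keeps the same work but randomizes the memory access pattern, and re-time run1:

rng = np.random.default_rng(0)
shuffled = indices.copy()
rng.shuffle(shuffled)              # same (i, j) pairs, random visit order

test_shuffled = Test(a, b, shuffled)
%timeit test_shuffled.run1()       # expect this to be slower than the ordered run1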




In addition to the good answer from @DavidW, note that run2 is SIMD-friendly, as opposed to run1. This means a compiler can easily generate SIMD instructions for run2 that read multiple packed items from memory, multiply multiple items at once using packed instructions, and write the packed results back to memory. If the array is small enough to fit in the CPU caches, which is the case here, the SIMD computation can be very fast. Indeed, nearly all modern x86-64 processors support the 256-bit wide AVX/AVX2 instruction sets, which can operate on 8 32-bit integers or 4 double-precision floating-point numbers at once. Additionally, such a loop is easy to unroll and pipelines well on modern processors, and hardware prefetchers are optimized for this kind of access pattern.
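
For intuition, run2 computes the equivalent of the following vectorized NumPy expression (an added sketch, not from the answer); this regular, contiguous pattern is exactly what SIMD units handle well:

ix = np.arange(a.shape[0])[:, None]   # column vector of row indices
iy = np.arange(a.shape[1])[None, :]   # row vector of column indices
b2 = ix * iy * a                      # elementwise over contiguous memory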

Meanwhile, run1 does indirect memory accesses. A compiler can hardly assume the accessed locations are contiguous, so it cannot generate packed loads/stores (contiguity is very unlikely in most code, and it is up to developers to write that kind of optimization explicitly). The indirection requires multiple load instructions, which saturate the load ports and make the overall computation at least twice as slow. AVX2 has a gather instruction that can theoretically help in such a case. That being said, the instruction is currently not implemented efficiently on current Intel/AMD processors (it basically performs scalar loads internally, saturating the load ports). Still, it could make run1 run about as fast as run2 if the latter is not vectorized (otherwise run2 should sharply outperform run1 even with gather instructions). Unfortunately, compilers still have a hard time making use of such instructions.
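
If the indirection cannot be removed, one possibility (an added suggestion, not from the answer) is to precompute flat offsets once, so that each element needs a single gather instead of two index loads plus a 2-D address computation:

# One-time precomputation: convert (ix, iy) pairs to flat offsets.
flat = np.ravel_multi_index((indices[:, 0], indices[:, 1]), a.shape)

# NumPy fancy indexing then performs the gather/scatter in optimized C code.
b.ravel()[flat] = indices[:, 0] * indices[:, 1] * a.ravel()[flat]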

In fact, given the code and the timings, run2 should be even faster if SIMD instructions were used. I think they are not, most likely because the -O2 optimization level is set in your code and compilers like GCC do not auto-vectorize at -O2 (except in very recent versions, AFAIK). Please consider using -O3. Please also consider enabling -mavx and -mavx2 if possible (this assumes the target processors are not too old), as it should make the code faster.
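
Concretely, with the %%cython cell magic used in the question, that would look something like this (a sketch; the flags assume a GCC/Clang toolchain and an AVX2-capable CPU):

%%cython -f -c=-O3 -c=-mavx -c=-mavx2 -I./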

