
Consider the following IPython perf test, in which we create a pair of 10,000-element 32-bit vectors and add them, first using integer arithmetic and then using float arithmetic:

from numpy.random import randint
from numpy import int32, float32

a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)
%timeit a+b  # int32 addition, gives 20.6µs per loop

a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)
%timeit a+b  # float32 addition, gives 3.91µs per loop

Why is the floating point version about 5x faster?

If you do the same test with float64, it takes twice as long as float32, which is what you'd expect if the hardware is being fully utilized. However, the timing for the integer case appears to be constant from int8 to int64. This, together with the 5x slowdown, makes me suspect that it is completely failing to use SSE.
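
For reference, the same sweep can be scripted with the timeit module instead of the %timeit magic. This is just a sketch of the test described above; absolute numbers will vary by machine:

import timeit
from numpy.random import randint
from numpy import int8, int16, int32, int64, float32, float64

for dtype in (int8, int16, int32, int64, float32, float64):
    a = randint(255, size=10000).astype(dtype)
    b = randint(255, size=10000).astype(dtype)
    t = timeit.timeit(lambda: a + b, number=10000) / 10000
    print(dtype.__name__, "%.2f µs" % (t * 1e6))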

For int32, I observe similar 20µs values when a+b is replaced by a & 0xff or a >> 2, suggesting that the problem is not limited to addition.
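
In script form, the bitwise variants use the same harness (again a sketch, not the exact session above):

import timeit
from numpy.random import randint
from numpy import int32

a = randint(255, size=10000).astype(int32)
for label, op in [("a & 0xff", lambda: a & 0xff), ("a >> 2", lambda: a >> 2)]:
    t = timeit.timeit(op, number=10000) / 10000
    print(label, "%.2f µs" % (t * 1e6))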

I'm using numpy 1.9.1, though unfortunately I can't remember whether I compiled it locally or downloaded a binary. Either way, this performance observation was pretty shocking to me. How is it possible that the version I have is so hopeless at integer arithmetic?

Edit: I've also tested on a similar but separate PC running numpy 1.8, which I'm fairly sure came straight from a PythonXY binary. I got the same results.

Question: Do other people see similar results? If not, what can I do to be like them?

Update: I have created a new issue on numpy's GitHub repo.

  • You can gather information on how your numpy was compiled using np.show_config() and import numpy.distutils.system_info as sysinfo; sysinfo.show_all(). See this SO question. (A runnable version of these diagnostics follows these comments.)
  • @unutbu - I've just run sysinfo.show_all() - see output on pastebin - it's mostly a list of things marked "NOT AVAILABLE".
  • Your observations are spot on: as of this writing, only boolean and floating point operations use SIMD. This is done with explicit intrinsics calls, so it should have little dependence on the compilation setup. As with any other open source project, contributions are always welcome...
  • @Jaime - are you talking about this file: numpy/numpy/core/src/umath/simd.inc.src? I don't claim to have any idea of what it's doing, but I'm surprised that there aren't opportunities for the compiler to auto-vectorise loops once all the broadcasting/stepping logic has been dealt with by the programmer. Is there an existing issue/milestone that I can show interest in? If not, would you consider creating one? I can't see myself hacking at the core of numpy any time soon.
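
For convenience, the diagnostics mentioned in the first comment, runnable as-is:

import numpy as np
np.show_config()  # which BLAS/LAPACK libraries numpy was linked against

import numpy.distutils.system_info as sysinfo
sysinfo.show_all()  # every library the build machinery probed for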

2 Answers


The not-yet-released numpy 1.10 will also vectorize integer operations, if the compiler supports it. It was added in this change: https://github.com/numpy/numpy/pull/5144

E.g., your test case with the current git head, compiled with gcc 4.8, runs at the same speed for int and float, and the code produced looks decent:

  0.04 │27b:   movdqu (%rdx,%rax,1),%xmm0
 25.33 │       add    $0x1,%r10
       │       movdqu (%r8,%rax,1),%xmm1
       │       paddd  %xmm1,%xmm0
 23.17 │       movups %xmm0,(%rcx,%rax,1)
 34.72 │       add    $0x10,%rax
 16.05 │       cmp    %r10,%rsi
       │     ↑ ja     27b

Additional speedups can be achieved by using AVX2 if the CPU supports it (e.g. Intel Haswell), though currently that needs to be done by compiling with OPT="-O3 -mavx2"; there is no runtime detection for this in numpy yet.


3 Comments

Brilliant news! Is this true for all kinds of views/broadcasting requirements, or is it limited to the simple example I gave above?
All integer ufuncs should be covered; if something was missed, please file a bug. Note that only contiguous operations (stride equal to element size) can be vectorized (see the sketch after these comments). Also, the default 64-bit integer often does not profit much from vectorizing.
Surely in principle you can vectorise in some other cases, e.g. when the SIMD register size is an integer multiple or factor of elementsize*stride. I imagine it's a fairly common usage case to have an n×2 array or an nd array with each axis a power of 2 in length. (And I guess you could go further and find the lowest common multiple of the two sizes, etc., but I imagine that at some point you are only causing code bloat and getting diminishing returns.)
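
To see the contiguity requirement from the comments above in practice, here is a sketch comparing a contiguous add with a strided one over the same number of elements (timings are illustrative, not from the answer):

import timeit
from numpy.random import randint
from numpy import int32

a = randint(255, size=20000).astype(int32)
b = randint(255, size=20000).astype(int32)

# Both additions process 10,000 int32 elements, but only the contiguous
# view (stride equal to element size) can go through the SIMD loop.
contiguous = timeit.timeit(lambda: a[:10000] + b[:10000], number=10000) / 10000
strided = timeit.timeit(lambda: a[::2] + b[::2], number=10000) / 10000
print("contiguous: %.2f µs, strided: %.2f µs" % (contiguous * 1e6, strided * 1e6))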

On a modern CPU there are a lot of factors that influence performance. Whether the data is integer or floating point is only one of them.

Factors such as whether the data is in the cache or has to be fetched from RAM (or even worse from swap) will have a big impact.
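
As a rough illustration (a sketch, not taken from the answer), timing the same add across working-set sizes that fit in L1 cache, fit in the outer caches, and spill to RAM shows how the per-element cost changes:

import timeit
from numpy.random import randint
from numpy import float32

# Per-element cost of a+b as the working set outgrows the caches
for n in (10**3, 10**5, 10**7):
    a = randint(255, size=n).astype(float32)
    b = randint(255, size=n).astype(float32)
    t = timeit.timeit(lambda: a + b, number=100) / 100
    print("n=%d: %.2f ns/element" % (n, t / n * 1e9))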

The compiler that was used to compile numpy will also have a big influence; how good is it at using SIMD instructions like SSE? Those can speed up array operations significantly.

The results for my system (Intel Core2 Quad Q9300):

In [1]: from numpy.random import randint

In [2]: from numpy import int32, float32, float64

In [3]: a, b = randint(255,size=10000).astype(int32), randint(255,size=10000).astype(int32)

In [4]: %timeit a+b
100000 loops, best of 3: 12.9 µs per loop

In [5]: a, b = randint(255,size=10000).astype(float32), randint(255,size=10000).astype(float32)

In [6]: %timeit a+b
100000 loops, best of 3: 8.25 µs per loop

In [7]: a, b = randint(255,size=10000).astype(float64), randint(255,size=10000).astype(float64)

In [8]: %timeit a+b
100000 loops, best of 3: 13.9 µs per loop

So on this machine, there is no factor of five difference between int32 and float32. And neither is there a factor of two between float32 and float64.

From processor utilization I can see that the timeit loops use only one of the four available cores. This seems to confirm that these simple operations don't use BLAS routines, since this numpy was built with a parallel OpenBLAS.

The way numpy was compiled will also have a significant influence. Based on the answers to this question, I could see using objdump that my numpy uses SSE2 instructions and the xmm registers.

In [9]: from numpy import show_config

In [10]: show_config()
atlas_threads_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_lapack_info:
  NOT AVAILABLE
blas_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
mkl_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    include_dirs = ['/usr/local/include']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    libraries = ['alapack', 'ptf77blas', 'ptcblas', 'atlas']
openblas_info:
    library_dirs = ['/usr/local/lib']
    language = f77
    libraries = ['openblasp', 'openblasp']
blas_mkl_info:
  NOT AVAILABLE

If you want to see the effect of the BLAS that you use, run the following program with numpy compiled with different BLAS libraries.

from __future__ import print_function
import numpy
import sys
import timeit

try:
    # On numpy <= 1.9, _dotblas is only built when numpy is linked against
    # an optimized BLAS, so this import doubles as a quick check.
    import numpy.core._dotblas
    print('FAST BLAS')
except ImportError:
    print('slow blas')

print("version:", numpy.__version__)
print("maxint:", sys.maxsize)
print()

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

# Time a 1000x1000 matrix product, which is dispatched to the BLAS library
t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print("dot:", t.timeit(count)/count, "sec")

On my machine I get:

FAST BLAS
version: 1.9.1
maxint: 9223372036854775807

dot: 0.06626860399264842 sec

Based on the results from this test I switched from ATLAS to OpenBLAS, because it was significantly faster on my machine.

2 Comments

The statement about cache is presumably not relevant here, given that both sets of data have identical cache footprints: 10,000 32-bit elements stored contiguously. Also, %timeit should take care of warming up the cache before running the main loop, so both cases start from an equal footing.
Also, the lack of multicore utilization is not that surprising: there is presumably a smallish sweet spot where the data fits in cache but isn't so small as to make multicore overkill; if the data doesn't fit in cache then there is no benefit in going multicore.
