performance of NumPy with different BLAS implementations

Question

I'm running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear systems (i.e. a call to numpy.linalg.solve()). I came up with this small benchmark:

import numpy as np
import time

# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)

t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)
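
A single wall-clock measurement like the one above can be noisy (cold caches, other processes, CPU frequency scaling). As a minimal sketch, assuming Python 3.3 or later, the helper below repeats a call and keeps the shortest time. The name best_of is hypothetical and is not part of NumPy or the original benchmark.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    timer = time.perf_counter  # monotonic, high-resolution timer (Python 3.3+)
    best = float("inf")
    for _ in range(repeats):
        start = timer()
        func()
        best = min(best, timer() - start)
    return best
```

With the matrices from the benchmark, this would be called as best_of(lambda: np.linalg.solve(a, b)). Taking the minimum rather than the mean is a design choice: it estimates what the machine can do when nothing else interferes.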

I've been running this on:

My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
An Amazon EC2 c3.xlarge instance, with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"

Bottom line:

On the Mac it runs in ~4.5 seconds
On the EC2 instance it runs in ~19.5 seconds

I have tried it also on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware config.)

Can anyone explain why the performance on Mac (with the Accelerate Framework) is > 4x better? More details about the NumPy / BLAS setup in each are provided below.

Laptop setup

numpy.show_config() gives me:

atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

EC2 instance setup

On Ubuntu 14.04, I installed OpenBLAS with

sudo apt-get install libopenblas-base libopenblas-dev

When installing NumPy, I created a site.cfg with the following contents:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

numpy.show_config() gives me:

atlas_threads_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Haswell has 2x the raw compute of Ivy Bridge per cycle per core (due to the inclusion of FMA). I wonder if your OpenBLAS was built without AVX support enabled? That would give another 2x.
– Stephen Canon, Commented Oct 23, 2014 at 13:06
Sounds like it might be related to this. Can you check whether your EC2 instance is actually multithreading BLAS operations?
– ali_m, Commented Dec 19, 2014 at 20:56
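
The per-cycle argument in the comments above can be made concrete with a rough flop count. The sketch below assumes that the solve is done by LU factorization (about (2/3)n^3 flops) followed by two triangular solves per right-hand side (about 2n^2 flops per right-hand side), and converts the timings reported in the question into achieved GFLOP/s. The function names are hypothetical, and the peak figure depends on the flops-per-cycle value passed in, which is an assumption rather than something measured here.

```python
def solve_flops(n, nrhs):
    """Approximate flop count of an LU-based solve of an n x n system with nrhs right-hand sides."""
    # LU factorization costs ~(2/3) n^3; forward plus backward substitution
    # costs ~2 n^2 per right-hand side.
    return (2.0 / 3.0) * n**3 + 2.0 * n**2 * nrhs


def achieved_gflops(n, nrhs, seconds):
    """Achieved rate in GFLOP/s for a solve that took `seconds`."""
    return solve_flops(n, nrhs) / seconds / 1e9


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s; flops_per_cycle is an assumed hardware figure."""
    return cores * ghz * flops_per_cycle


mac = achieved_gflops(5000, 5000, 4.5)   # timing reported for the MacBook Pro
ec2 = achieved_gflops(5000, 5000, 19.5)  # timing reported for the EC2 instance
print("Mac: %.0f GFLOP/s, EC2: %.0f GFLOP/s" % (mac, ec2))
```

This gives roughly 74 GFLOP/s on the Mac and 17 GFLOP/s on EC2. Assuming 16 double-precision flops per cycle per core on Haswell, peak_gflops(4, 2.0, 16) is 128 GFLOP/s for the laptop at its nominal clock, so the Mac figure is plausible only if several cores are in use. The EC2 figure is in the range that a single Ivy Bridge core could deliver, which is compatible with a single-threaded BLAS but does not prove it.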

Elmar Peise · Accepted Answer · 2015-01-07 01:14:27Z


The reason for this behavior could be that Accelerate uses multithreading, while the others don't.

Most BLAS implementations read the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they use only 1 thread unless explicitly told otherwise. Accelerate's man page, however, makes it sound as if threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS to 1.

To determine if this is really what's happening, try

export VECLIB_MAXIMUM_THREADS=1

before calling the Accelerate version, and

export OMP_NUM_THREADS=4

for the other versions.

Independent of whether this is really the reason, it's a good idea to always set these variables when you use BLAS to be sure you control what is going on.
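
The experiment above can also be scripted. The sketch below assumes Python 3 and that the BLAS library reads these variables when it is loaded, which is why each measurement runs in a fresh interpreter instead of changing os.environ after NumPy has been imported. The names blas_thread_env, time_solve and BENCH are hypothetical helpers, not part of NumPy or any BLAS library.

```python
import os
import subprocess
import sys

# Benchmark from the question, run in a child process; prints the elapsed seconds.
BENCH = """
import time
import numpy as np
n = {n}
a = np.random.randn(n, n)
b = np.random.randn(n, n)
t0 = time.time()
np.linalg.solve(a, b)
print(time.time() - t0)
"""


def blas_thread_env(n_threads):
    """Return a copy of the current environment with both thread variables set."""
    env = dict(os.environ)
    # OMP_NUM_THREADS is the variable named above for the non-Accelerate builds;
    # VECLIB_MAXIMUM_THREADS is the one named above for Accelerate.
    env["OMP_NUM_THREADS"] = str(n_threads)
    env["VECLIB_MAXIMUM_THREADS"] = str(n_threads)
    return env


def time_solve(n_threads, n=5000):
    """Run the benchmark in a fresh interpreter and return the elapsed seconds."""
    out = subprocess.check_output(
        [sys.executable, "-c", BENCH.format(n=n)],
        env=blas_thread_env(n_threads),
    )
    return float(out.decode().strip())
```

Calling time_solve(1) and time_solve(4) on each machine would show whether the thread count changes the runtime. If the two numbers are nearly identical, the linked BLAS is probably ignoring the variable or was built without threading.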



2 Comments

Elmar Peise Over a year ago

When NumPy is linked to Accelerate, VECLIB_MAXIMUM_THREADS does affect the performance of numpy.linalg.norm. scipy.linalg.norm, on the other hand, is consistently slower and is not affected by the variable, which leads me to believe that it is not linked to Accelerate but instead uses reference LAPACK.

denis Over a year ago

Thanks Elmar. FWIW, scipy.linalg.norm does: if ord in (None, 2) and (a.ndim == 1): nrm2 = get_blas_funcs('nrm2'), whereas norm in numpy.linalg.linalg says "# Immediately handle some default, simple, fast, and common cases". Altogether, too complex.
