I'm running an algorithm that is implemented in Python and uses NumPy. The most computationally expensive part of the algorithm involves solving a set of linear systems (i.e. a call to numpy.linalg.solve()). I came up with this small benchmark:

import numpy as np
import time

# Create two large random matrices
a = np.random.randn(5000, 5000)
b = np.random.randn(5000, 5000)

t1 = time.time()
# That's the expensive call:
np.linalg.solve(a, b)
print(time.time() - t1)

I've been running this on:

  1. My laptop, a late 2013 MacBook Pro 15" with 4 cores at 2GHz (sysctl -n machdep.cpu.brand_string gives me Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz)
  2. An Amazon EC2 c3.xlarge instance, with 4 vCPUs. Amazon advertises them as "High Frequency Intel Xeon E5-2680 v2 (Ivy Bridge) Processors"

Bottom line:

  • On the Mac it runs in ~4.5 seconds
  • On the EC2 instance it runs in ~19.5 seconds
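To put the two timings on a common scale, they can be converted into an achieved floating-point rate. The sketch below assumes the textbook operation count for an LU-based solve with n right-hand sides (about 2/3·n³ for the factorization plus 2·n³ for the triangular solves); the helper name gflops is hypothetical, and the real count inside LAPACK may differ slightly. Under that assumption the Mac reaches roughly 74 GFLOP/s and the EC2 instance roughly 17 GFLOP/s.

```python
def gflops(n, seconds):
    """Approximate achieved GFLOP/s for solving an n x n system with n right-hand sides.

    Assumes the textbook LU estimate: (2/3) n^3 flops for the factorization
    plus 2 n^3 flops for the forward and back substitutions.
    """
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 3
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Timings reported above for the 5000 x 5000 benchmark.
    for label, seconds in (("Mac", 4.5), ("EC2", 19.5)):
        print(label, round(gflops(5000, seconds), 1), "GFLOP/s")
```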

I have tried it also on other OpenBLAS / Intel MKL based setups, and the runtime is always comparable to what I get on the EC2 instance (modulo the hardware config.)

Can anyone explain why the performance on Mac (with the Accelerate Framework) is > 4x better? More details about the NumPy / BLAS setup in each are provided below.

Laptop setup

numpy.show_config() gives me:

atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

EC2 instance setup:

On Ubuntu 14.04, I installed OpenBLAS with

sudo apt-get install libopenblas-base libopenblas-dev

When installing NumPy, I created a site.cfg with the following contents:

[default]
library_dirs= /usr/lib/openblas-base

[atlas]
atlas_libs = openblas

numpy.show_config() gives me:

atlas_threads_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'openblas']
    library_dirs = ['/usr/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = f77
    include_dirs = ['/usr/include/atlas']
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
  • Haswell has 2x the raw compute of Ivy Bridge per cycle per core (due to inclusion of FMA). I wonder if your OpenBLAS was built without AVX support enabled? That would give another 2x. Commented Oct 23, 2014 at 13:06
  • Sounds like it might be related to this. Can you check whether your EC2 instance is actually multithreading BLAS operations? Commented Dec 19, 2014 at 20:56

1 Answer

The reason for this behavior could be that Accelerate uses multithreading, while the others don't.

Most BLAS implementations follow the environment variable OMP_NUM_THREADS to determine how many threads to use. I believe they only use 1 thread unless explicitly told otherwise. Accelerate's man page, however, makes it sound as though threading is turned on by default; it can be turned off by setting the environment variable VECLIB_MAXIMUM_THREADS to 1.

To determine if this is really what's happening, try

export VECLIB_MAXIMUM_THREADS=1

before calling the Accelerate version, and

export OMP_NUM_THREADS=4

for the other versions.
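The same comparison can be scripted, so that both settings are tried in one run. This is only a sketch: the helper name time_solve and the matrix size 2000 are illustrative choices, and it assumes that the BLAS library NumPy is linked against honors one of the two variables named above. Each measurement runs in a fresh interpreter, because the variables are usually read when the library is first loaded.

```python
import os
import subprocess
import sys

# Benchmark body, run in a child interpreter so that the BLAS library
# sees the thread settings when it is loaded.
BENCH = """
import sys, time
import numpy as np
n = int(sys.argv[1])
a = np.random.randn(n, n)
b = np.random.randn(n, n)
t1 = time.time()
np.linalg.solve(a, b)
print(time.time() - t1)
"""


def time_solve(n, threads):
    """Return the solve time in seconds with both thread variables set to `threads`."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)         # followed by most BLAS builds
    env["VECLIB_MAXIMUM_THREADS"] = str(threads)  # followed by Accelerate
    out = subprocess.check_output(
        [sys.executable, "-c", BENCH, str(n)], env=env
    )
    return float(out.decode().strip())


if __name__ == "__main__":
    for threads in (1, 4):
        print(threads, "thread(s):", time_solve(2000, threads), "s")
```

If the two timings are nearly identical on a given machine, the library is probably ignoring these variables or was built without threading.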

Independent of whether this is really the reason, it's a good idea to always set these variables when you use BLAS to be sure you control what is going on.

2 Comments

When linked to Accelerate, VECLIB_MAXIMUM_THREADS does affect numpy.linalg.norm's performance. scipy.linalg.norm, on the other hand, is consistently slower and not affected by the variable, which leads me to believe that it's not linked to Accelerate but instead uses reference LAPACK.
Thanks Elmar. Fwiw, scipy.linalg.norm does if ord in (None, 2) and (a.ndim == 1): nrm2 = get_blas_funcs('nrm2'); norm in numpy.linalg.linalg says "# Immediately handle some default, simple, fast, and common cases". Altogether, too complex.
