
Performing convolution along the Z vector of a 3D numpy array, then other operations on the results, but it is slow as implemented now. Is the for loop what is slowing me down here, or is it the convolution? I tried reshaping to a 1D vector and performing the convolution in one pass (as I did in Matlab), without the for loop, but it doesn't improve performance. My Matlab version is about 50% faster than anything I can come up with in Python. Relevant section of code:

convolved = np.zeros((y_lines, x_lines, z_depth))
for i in range(0, y_lines):
    for j in range(0, x_lines):
        convolved[i, j, :] = fftconvolve(data[i, j, :], Gauss)  # 80% of time here
        result[i, j, :] = other_calculations(convolved[i, j, :])  # 20% of time here

Is there a better way to do this than a for loop? I've heard of Cython, but I have limited experience in Python as of now and would aim for the simplest solution.

  • What is Gauss? Some kind of 1-D gaussian kernel? If so, what size relative to z_depth? Commented Aug 15, 2015 at 21:11
  • Gaussian kernel generated once, before the loop. Data is 1d vector (z_depth) typically around 1535 elements long, with 1D gaussian kernel of length typically 79. I cleaned up a bunch of the overhead in fftconvolve, basically just goes directly to irfftn(rfftn(in1, fshape) * rfftn(in2, fshape), fshape)[fslice].copy() Commented Aug 15, 2015 at 21:27

2 Answers


The fftconvolve function you are using is presumably from SciPy. If so, be aware that it takes N-dimensional arrays. So a faster way to do your convolution would be to generate the 3d kernel that corresponds to doing nothing in the x and y dimensions and doing a 1d gaussian convolution in z.

Some code and timing results are below. On my machine and with some toy data, this led to a 10× speedup, as you can see:

import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage.filters import gaussian_filter

# build a normalized 1-D gaussian kernel using scipy's filtering functions
kernel_base = np.ones(5)
kernel_1d = gaussian_filter(kernel_base, sigma=1, mode='constant')
kernel_1d = kernel_1d / np.sum(kernel_1d)

# make the 3d kernel that does gaussian convolution along the z axis only
kernel_3d = np.zeros(shape=(1, 1, 5))
kernel_3d[0, 0, :] = kernel_1d

# generate random data
data = np.random.random(size=(50, 50, 50))

# define a function for loop based convolution for easy timeit invocation
def convolve_with_loops(data):
    nx, ny, nz = data.shape
    convolved = np.zeros((nx, ny, nz))
    for i in range(nx):
        for j in range(ny):
            convolved[i, j, :] = fftconvolve(data[i, j, :], kernel_1d, mode='same')
    return convolved

# compute the convolution two diff. ways: with loops (first) or as a 3d convolution (2nd)
convolved = convolve_with_loops(data)
convolved_2 = fftconvolve(data, kernel_3d, mode='same')

# raise an error unless the two computations return equivalent results
assert np.all(np.isclose(convolved, convolved_2))

# time the two routes of the computation
%timeit convolved = convolve_with_loops(data)
%timeit convolved_2 = fftconvolve(data, kernel_3d, mode='same')

timeit results:

10 loops, best of 3: 198 ms per loop
100 loops, best of 3: 18.1 ms per loop

6 Comments

Try generating data of length 64 and see if this makes it faster. FFT is usually significantly more efficient on powers of two.
Implemented a 3D version; it is slower than what I had: 5851.7 ms (new 3D version) vs. 4093.4 ms (old version).
cxrodgers: fftconvolve uses _next_regular(target) to find the optimal FFT size (here 1620 for a 1535-element vector, padded with zeros).
What is the exact shape of everyone's data? The 3-D fftconvolve stays faster for me up to a 256×256×256 array, although by slightly less than 2x instead of the 10x for the code I posted above.
Data is (minimum size) 200x200x1535, gaussian is 1x79. As I mentioned, I removed all the overhead from fftconvolve (if statements etc.) so it goes directly to the convolution; that shaved about 30% off the time of the unmodified fftconvolve.

I think you have already found the source code of the fftconvolve function. Normally, for real inputs, it uses the numpy.fft.rfftn and .irfftn functions, which compute N-dimensional transforms. Since the goal is to do multiple 1-D transformations, you can basically rewrite fftconvolve like this (simplified):

import numpy as np
from scipy.signal.signaltools import _next_regular

def fftconvolve_1d(in1, in2):
    # length of the full linear convolution along the last axis
    outlen = in1.shape[-1] + in2.shape[-1] - 1
    n = _next_regular(outlen)  # pad up to an FFT-friendly size

    # rfft/irfft transform only the last axis, so every z-vector
    # is convolved in one batched call
    tr1 = np.fft.rfft(in1, n)
    tr2 = np.fft.rfft(in2, n)
    out = np.fft.irfft(tr1 * tr2, n)

    return out[..., :outlen].copy()

And calculate the desired result:

result = fftconvolve_1d(data, Gauss)

This works because numpy.fft.rfft and .irfft (note the missing "n" at the end of the names) transform over a single axis of the input array (the last axis by default). On my system this is roughly 40% faster than the OP's code.
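As a quick sanity check, here is a sketch of the same idea verified against a per-vector loop. It uses scipy.fft.next_fast_len, the public successor to the private _next_regular helper; the data sizes are illustrative, not the OP's:

```python
import numpy as np
from scipy.fft import next_fast_len  # public successor to the private _next_regular
from scipy.signal import fftconvolve

def fftconvolve_1d(in1, in2):
    # length of the full linear convolution along the last axis
    outlen = in1.shape[-1] + in2.shape[-1] - 1
    n = next_fast_len(outlen)  # pad to an FFT-friendly size

    # rfft/irfft act on the last axis only, so all z-vectors
    # are transformed in one batched call
    tr1 = np.fft.rfft(in1, n)
    tr2 = np.fft.rfft(in2, n)
    out = np.fft.irfft(tr1 * tr2, n)
    return out[..., :outlen].copy()

rng = np.random.default_rng(0)
data = rng.random((4, 4, 64))  # toy stack of 1-D signals along z
kernel = rng.random(9)

batched = fftconvolve_1d(data, kernel)

# reference: loop over (x, y) and convolve each z-vector separately
looped = np.empty_like(batched)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        looped[i, j, :] = fftconvolve(data[i, j, :], kernel)

assert np.allclose(batched, looped)
```

The batched version does two forward transforms and one inverse transform in total, instead of three per (x, y) pair.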


Further speedup can be achieved by using a different FFT back-end.

For one, the functions in scipy.fftpack appear to be somewhat faster than their Numpy equivalents. However, the output format of the Scipy variants is pretty awkward (see docs) and this makes it hard to do the multiplication.
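To make the awkwardness concrete, here is a small illustration (not from the answer) of the packed layout scipy.fftpack uses for real transforms of even-length input:

```python
import numpy as np
from scipy import fftpack

x = np.array([1.0, 2.0, 3.0, 4.0])

# numpy returns complex coefficients: [X0, X1, X2]
print(np.fft.rfft(x))   # [10.+0.j  -2.+2.j  -2.+0.j]

# scipy.fftpack packs the same information into one real array:
# [X0, Re(X1), Im(X1), X2] for even-length input
print(fftpack.rfft(x))  # [10. -2.  2. -2.]
```

Multiplying two spectra element-wise in this packed format is not a plain complex multiply, which is why the code below sticks with the numpy-style interfaces.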

Another possible backend is FFTW through the pyFFTW wrapper. Downsides are that transforms are preceded by a slow "planning phase" and that inputs have to be 16-byte aligned to achieve best performance. This is explained pretty well in the pyFFTW tutorial. The resulting code could be for example:

import numpy as np
from scipy.signal.signaltools import _next_regular
import pyfftw
pyfftw.interfaces.cache.enable()  # cache the FFTW "plans" between calls
pyfftw.interfaces.cache.set_keepalive_time(1.0)

def fftwconvolve_1d(in1, in2):
    outlen = in1.shape[-1] + in2.shape[-1] - 1
    n = _next_regular(outlen)

    tr1 = pyfftw.interfaces.numpy_fft.rfft(in1, n)
    tr2 = pyfftw.interfaces.numpy_fft.rfft(in2, n)

    # write the product into a 16-byte-aligned buffer so the
    # inverse transform can run at full speed
    sh = np.broadcast(tr1, tr2).shape
    dt = np.common_type(tr1, tr2)
    pr = pyfftw.n_byte_align_empty(sh, 16, dt)
    np.multiply(tr1, tr2, out=pr)
    out = pyfftw.interfaces.numpy_fft.irfft(pr, n)

    return out[..., :outlen].copy()

With aligned inputs and cached "planning" I saw a speedup of almost 3x over the code in the OP. Memory alignment can be easily checked by looking at the memory address in the ctypes.data attribute of a Numpy array.
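For instance, a minimal helper (illustrative, not part of the answer) to check that a buffer starts on a 16-byte boundary:

```python
import numpy as np

def is_aligned(arr, boundary=16):
    # ctypes.data is the integer address of the array's first element;
    # the buffer is aligned iff that address is divisible by the boundary
    return arr.ctypes.data % boundary == 0

a = np.zeros(1024)
print(is_aligned(a))  # True on most platforms; pyFFTW can guarantee it
```

If this returns False for your arrays, allocating them with pyFFTW's aligned-allocation helpers avoids the slow unaligned path.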

1 Comment

Replacing rfftn with rfft improved performance by about 30%. The pyFFTW method doesn't help, however: pyFFTW 6.3 s vs. numpy rfft 4.6 s; pyFFTW 86.1 s vs. numpy rfft 62.4 s.
