
I have been playing around with numba and trying to implement a simple element-wise matrix multiplication. When using 'vectorize' I get the same result as the numpy multiplication, but when I'm using 'cuda.jit' they are not the same. Many of the elements are zeros. I'm providing a minimal working example below. Any help with the problem will be appreciated. I'm using numba 0.35.0 and Python 2.7.

from __future__ import division
from __future__ import print_function

import numpy as np

from numba import vectorize, cuda, jit

M = 80
N = 40
P = 40

# Set the number of threads in a block
threadsperblock = 32

# Calculate the number of thread blocks in the grid
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock

@vectorize(['float32(float32,float32)'], target='cuda')
def VectorMult3d(a, b):
   return a*b

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])')
def mult_gpu_3d(a, b, c):
  [x, y, z] = cuda.grid(3)
  if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
    c[x, y, z] = a[x, y, z] * b[x, y, z]

if __name__ == '__main__':
  A = np.random.normal(size=(M, N, P)).astype(np.float32)
  B = np.random.normal(size=(M, N, P)).astype(np.float32)

  numpy_C = A*B

  A_gpu = cuda.to_device(A)
  B_gpu = cuda.to_device(B)
  C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu)

  mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu)

  cudajit_C = C_gpu.copy_to_host()

  print('------- using cuda.jit -------')
  print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C)))
  print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P))
  print('------- using cuda.jit -------\n')


  vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu)
  vectorize_C = vectorize_C_gpu.copy_to_host()

  print('------- using vectorize -------')
  print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C)))
  print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P))
  print('------- using vectorize -------\n')

  import numba; print("numba version: "+numba.__version__)

1 Answer

Here is how you could debug this.

Consider a smaller and simplified example with:

  • reduced array sizes, e.g. (2, 3, 1) (so you could actually print the values and be able to read them)
  • simple and deterministic contents, e.g. "all ones" (to compare across runs)
  • additional kernel arguments for debugging

from __future__ import (division, print_function)

import numpy as np
from numba import cuda

M = 2
N = 3
P = 1

threadsperblock = 1
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock


@cuda.jit
def mult_gpu_3d(a, b, c, grid_ran, grid_multed):
    grid = cuda.grid(3)
    x, y, z = grid

    grid_ran[x] = 1

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]):
        grid_multed[x] = 1
        c[grid] = a[grid] * b[grid]


if __name__ == '__main__':
    A = np.ones((M, N, P), np.int32)
    B = np.ones((M, N, P), np.int32)

    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu = cuda.to_device(np.zeros_like(A))

    # Tells whether the thread at index i has run
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32))

    # Tells whether the thread at index i has performed the multiplication
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32))

    mult_gpu_3d[blockspergrid, threadsperblock](
        A_gpu, B_gpu, C_gpu, grid_ran, grid_multed)

    print("grid_ran.shape    : ", grid_ran.shape)
    print("grid_multed.shape : ", grid_multed.shape)
    print("C_gpu.shape       : ", C_gpu.shape)

    print("grid_ran          : ", grid_ran.copy_to_host())
    print("grid_multed       : ", grid_multed.copy_to_host())

    C = C_gpu.copy_to_host()
    print("C transpose flat  : ", C.T.flatten())
    print("C                 : \n", C)

Output:

grid_ran.shape    :  (6,)
grid_multed.shape :  (6,)
C_gpu.shape       :  (2, 3, 1)
grid_ran          :  [1 1 1 1 1 1]
grid_multed       :  [1 1 0 0 0 0]
C transpose flat  :  [1 1 0 0 0 0]
C                 : 
 [[[1]
  [0]
  [0]]

 [[1]
  [0]
  [0]]]

You can see that the device grid shape does not correspond to the shape of the arrays: the grid is flat (M*N*P threads), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, totaling 6 values in my example), while the first dimension of the arrays only spans 0..M-1 (0..1, totaling 2 values in my example). This mismatch typically leads to out-of-bounds accesses, but you have protected your kernel with a conditional which cuts off the offending threads:

if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:

This guard stops threads with an x index above M-1 (1 in my example) from writing anything (well, sort of [1]); that is why most elements are never written and you get many zeros in the resulting array.
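
To make the mismatch concrete, here is a tiny host-side sketch (plain Python, no CUDA) that emulates what cuda.grid(3) effectively returns for this 1D launch, consistent with the output above: only the first component varies, so every thread sees (x, 0, 0).

from __future__ import print_function

# Host-side emulation of the 1D launch above (M, N, P = 2, 3, 1; one thread per block).
# With a 1D grid, cuda.grid(3) yields (x, 0, 0) with x = 0 .. M*N*P - 1.
M, N, P = 2, 3, 1

for x in range(M * N * P):
    y, z = 0, 0  # y and z never advance in a 1D launch configuration
    writes = x < M and y < N and z < P
    print(x, y, z, "writes" if writes else "guarded out")

# Only x = 0 and x = 1 pass the guard, so exactly two elements
# (C[0, 0, 0] and C[1, 0, 0]) are written, matching grid_multed = [1 1 0 0 0 0].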

Possible solutions:

  • In general, you could use a multidimensional kernel grid configuration, i.e. a 3D tuple for blockspergrid (and threadsperblock) instead of a scalar [2]; see the first sketch after this list.
  • In particular, as elementwise multiplication is a map operation and does not depend on array shapes, you could flatten all 3 arrays to 1D arrays, run your kernel as is on a 1D grid, then reshape the result back [3], [4]; see the second sketch after this list.
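
Minimal sketches of both options follow. They reuse the shapes from the question; the kernel name mult_gpu_1d and the particular block sizes are just illustrative choices, not anything prescribed by numba.

# Option 1: a 3D launch configuration, so cuda.grid(3) lines up with the array shape.
import numpy as np
from numba import cuda

M, N, P = 80, 40, 40

threadsperblock = (8, 8, 4)  # any per-block shape whose product is <= 1024 works
blockspergrid = ((M + threadsperblock[0] - 1) // threadsperblock[0],
                 (N + threadsperblock[1] - 1) // threadsperblock[1],
                 (P + threadsperblock[2] - 1) // threadsperblock[2])

@cuda.jit
def mult_gpu_3d(a, b, c):
    x, y, z = cuda.grid(3)
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]:
        c[x, y, z] = a[x, y, z] * b[x, y, z]

A = np.random.normal(size=(M, N, P)).astype(np.float32)
B = np.random.normal(size=(M, N, P)).astype(np.float32)
C_gpu = cuda.device_array((M, N, P), dtype=np.float32)

mult_gpu_3d[blockspergrid, threadsperblock](cuda.to_device(A), cuda.to_device(B), C_gpu)
print(np.allclose(A * B, C_gpu.copy_to_host()))  # expect True

And the flattening variant, which keeps a scalar launch configuration (it reuses A and B from the sketch above):

# Option 2: flatten to 1D, run an element-wise kernel on a 1D grid, reshape back.
@cuda.jit
def mult_gpu_1d(a, b, c):
    i = cuda.grid(1)
    if i < c.shape[0]:
        c[i] = a[i] * b[i]

A_flat = cuda.to_device(A.ravel())
B_flat = cuda.to_device(B.ravel())
C_flat = cuda.device_array(M * N * P, dtype=np.float32)

tpb = 32
bpg = (M * N * P + tpb - 1) // tpb
mult_gpu_1d[bpg, tpb](A_flat, B_flat, C_flat)

C = C_flat.copy_to_host().reshape(M, N, P)
print(np.allclose(A * B, C))  # expect True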

References:


1 Comment

Thanks a lot. Your explanations are quite clear. I took the advice to use a multidimensional kernel grid configuration, something like: threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M / threadsperblock[0])), similarly for blockspergrid_y and blockspergrid_z, then blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z), and finally calling mult_gpu_3d with blockspergrid and threadsperblock. The references you provided are also great!! Thanks again.
