
I have a struct that is defined for both the host and the device. On the host I initialize an array of this struct with values.

MyStruct *h_s = (MyStruct *) malloc(objsize * sizeof(MyStruct));
h_s[0] = ...

MyStruct *d_s;
cudaMalloc( &d_s, objsize * sizeof(MyStruct) );
cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice );
init<<< gridSize, blockSize >>> ( d_s );

In my kernel code I have about 7 functions that should use this array. Some of them are __global__ and some are plain __device__ functions. For simplicity and efficiency I want to use a shared memory array.

__shared__ MyStruct *d_s;

__global__ void init(MyStruct *theStructArray){
   // How do I allocate memory for d_s?
   // How do I copy theStructArray to d_s?
}

So the question is: how can I allocate memory for the shared array and set its values from the function parameter?

Edit: I am trying to port the smallpt code to CUDA.

struct Sphere {
    double rad;       // radius
    Vec p, e, c;      // position, emission, color
    Refl_t refl;      // reflection type (DIFFuse, SPECular, REFRactive)

    Sphere(){
        rad = 16.5;
        p = (Vec(27,16.5,47) + Vec(73,16.5,78))*0.5;
        e = Vec();
        c = Vec(0.75, 0.75, 0.75);
        refl = DIFF;
    }

    Sphere(double rad_, Vec p_, Vec e_, Vec c_, Refl_t refl_):
        rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}

    __device__ double intersect(const Ray &r) const { // returns distance, 0 if nohit
        Vec op = p-r.o; // Solve t^2*d.d + 2*t*(o-p).d + (o-p).(o-p)-R^2 = 0
        double t, eps=1e-4, b=op.dot(r.d), det=b*b-op.dot(op)+rad*rad;
        if (det<0) return 0; else det=sqrt(det);
        return (t=b-det)>eps ? t : ((t=b+det)>eps ? t : 0);
    } 
};
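
A note on the port: the constructors above are host-only, so as far as I can tell a Sphere can be initialized on the host and copied to the device, but not constructed in device code. If device-side construction were needed, the constructors would need qualifiers, roughly like this (assuming Vec's constructor and its + and * operators are device-callable as well):

    // both constructors usable from host and device code
    __host__ __device__ Sphere(){
        rad = 16.5;
        p = (Vec(27,16.5,47) + Vec(73,16.5,78))*0.5;
        e = Vec();
        c = Vec(0.75, 0.75, 0.75);
        refl = DIFF;
    }

    __host__ __device__ Sphere(double rad_, Vec p_, Vec e_, Vec c_, Refl_t refl_):
        rad(rad_), p(p_), e(e_), c(c_), refl(refl_) {}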
  • You should show the actual definition of MyStruct. Does it have pointers to other data in it? Commented May 11, 2015 at 14:12
  • It isn't very obvious what you are trying to ask here, but I suspect you have completely misunderstood what shared memory actually is and how it is used. __shared__ memory (whether statically or dynamically allocated) has block scope. If you want to load something into shared memory, each block in each kernel you run must read the source memory and load the source contents into its shared memory. The host can't copy to shared memory. Commented May 11, 2015 at 14:33
  • @talonmies OK, this comment already helped me a lot! I have updated the question and the example code. I hope it is now more obvious what I am trying to ask. Commented May 11, 2015 at 15:00

1 Answer


If you understand the scope and size limitations of shared memory, then the question appears to be:

  1. how to dynamically reserve memory for the shared memory array
  2. how to use the dynamic shared memory within the kernel

Your kernel becomes something like this:

__global__ void init(MyStruct *theStructArray){

    // dynamic shared memory; its size is set by the third
    // argument of the kernel launch configuration
    extern __shared__ MyStruct d_s[];

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // load to shared memory array
    // assumes MyStruct has correct copy assignment semantics
    d_s[threadIdx.x] = theStructArray[tid];

    __syncthreads();

    // Each thread has now loaded one value to the block
    // scoped shared array
}

[disclaimer: code written in browser, never compiled or tested, and note the caveat in comments about copy assignment]
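
Since some of the consumers in the question are plain __device__ functions: once the data is in the shared array, it can be passed to them as an ordinary pointer. A minimal sketch, assuming the Sphere and Ray types from the question; the helper nearestHit and the kernel render are hypothetical names, not part of smallpt:

// Hypothetical __device__ helper: it takes an ordinary pointer, so it
// works equally on a shared memory array or a global memory array.
__device__ double nearestHit(const Sphere *spheres, int n, const Ray &r)
{
    double best = 1e20;                       // "no hit" sentinel
    for (int i = 0; i < n; ++i) {
        double t = spheres[i].intersect(r);   // 0 means no hit
        if (t > 0 && t < best) best = t;
    }
    return best;
}

__global__ void render(const Sphere *theStructArray, int numSpheres,
                       const Ray *rays, double *out)
{
    extern __shared__ Sphere s_spheres[];     // sized at kernel launch

    // cooperative load, assuming blockDim.x >= numSpheres
    if (threadIdx.x < numSpheres)
        s_spheres[threadIdx.x] = theStructArray[threadIdx.x];
    __syncthreads();

    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    out[tid] = nearestHit(s_spheres, numSpheres, rays[tid]);
}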

The calling host code needs to add an additional argument to the kernel call to reserve memory for the shared array:

MyStruct *h_s = (MyStruct *) malloc(objsize * sizeof(MyStruct));
h_s[0] = ...

MyStruct *d_s;
cudaMalloc( &d_s, objsize * sizeof(MyStruct) );
cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice );
init<<< gridSize, blockSize, blockSize * sizeof(MyStruct) >>> ( d_s );

Note the third argument in the <<< >>> stanza of the kernel call. It specifies the number of bytes of shared memory reserved per block. There are hardware-dictated limits on the size of the shared memory allocations you can make, and shared memory usage might have an additional effect on performance beyond those hardware limits.
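
If in doubt, the per-block limit can be queried at runtime. A minimal sketch, assuming a single device and the blockSize variable from the host code above; the fallback logic is only illustrative:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);            // properties of device 0

size_t shmBytes = blockSize * sizeof(MyStruct);
if (shmBytes > prop.sharedMemPerBlock) {
    // requesting more dynamic shared memory than the device offers
    // will make the launch fail: shrink blockSize, or keep the
    // array in global memory instead
}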

Shared memory is a very well documented feature of CUDA; I would recommend Mark Harris's blog and this Stack Overflow question as good starting points on the mechanics of shared memory in CUDA.
