The way I've achieved this is to specify a base BField class with pure virtual interface functions marked __host__ __device__, and override these in a number of derived classes (here, DipoleB). On the host, when an instance of the derived class is created, a mirror image of the instance is also created on the device, and a pointer to the on-device instance is stored on the host. The on-device instance is destroyed when the host instance is destroyed. The interface functions (here, getBFieldAtS(double, double) and getGradBAtS(double, double)) are called on the device by a __global__ kernel which is run over ~3.5 million particles; the calling functions are shown below.
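
To make the structure concrete, here's a stripped-down sketch of the pattern. Only the class names and the two getters match my real code; the create/delete kernel names, the L parameter, and the field math are placeholders:

class BField
{
public:
    __host__ __device__ virtual ~BField() {}
    __host__ __device__ virtual double getBFieldAtS(double s, double t) const = 0;
    __host__ __device__ virtual double getGradBAtS(double s, double t) const = 0;
};

class DipoleB : public BField
{
    double L_; //placeholder field parameter

public:
    __host__ __device__ DipoleB(double L) : L_{ L } {}
    __host__ __device__ double getBFieldAtS(double, double) const override { return 0.0; /*field model goes here*/ }
    __host__ __device__ double getGradBAtS(double, double) const override { return 0.0; /*gradient goes here*/ }
};

//the derived instance is new'd on the device so its vtable refers to device code;
//the host only ever holds the BField** it got from cudaMalloc
__global__ void createDipoleBOnGPU(BField** bfield, double L)
{
    if (blockIdx.x * blockDim.x + threadIdx.x == 0)
        *bfield = new DipoleB(L);
}

__global__ void deleteBFieldOnGPU(BField** bfield)
{
    if (blockIdx.x * blockDim.x + threadIdx.x == 0)
        delete *bfield;
}

//host-side mirror management, run in the host class's constructor/destructor:
//    cudaMalloc(&bfield_d_, sizeof(BField*));
//    createDipoleBOnGPU<<<1, 1>>>(bfield_d_, L_);
//    cudaDeviceSynchronize();
//    ...
//    deleteBFieldOnGPU<<<1, 1>>>(bfield_d_);
//    cudaFree(bfield_d_);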

So, a few questions:

  • Am I achieving my goals in the most efficient way possible?

  • Are there any performance issues incurred by the fact that I'm creating one instance of a derived class on the GPU and calling the interface function ~3.5 million * number of iterations times? That is, what are the implications of this many calls to a single member function?

  • This produces the expected physical results (that is, calls to the interface functions are producing correct values, because the particles behave appropriately); however, when running through cuda-memcheck, I get a whole host of issues. I'm thinking this is because of how BField is set up: calling the (virtual) interface functions accesses memory that lies outside the footprint of a base BField instance, something like:

      [BField instance memory footprint][-------(x impl of virt fcn here)----DipoleB Instance footprint-------]

    and cuda-memcheck doesn't think this should be valid. Does this sound feasible? Do I understand what is going on correctly?

  • Are there any performance costs incurred by device-side dynamic allocation? Is there even another way to do this?

  • Also, I suppose: how's the readability, the number of inline comments, and general code style? I'm kinda new at this coding thing.

Thanks all.


Calling Functions:

__device__ double accel1dCUDA(const double vs_RK, const double t_RK, const double* args, BField** bfield, EField** efield) //made to pass into 1D Fourth Order Runge Kutta code
{//args array: [s_0, mu, q, m, simtime]
    double F_lor, F_mir, stmp;
    stmp = args[0] + vs_RK * t_RK; //ps_0 + vs_RK * t_RK
    
    //Mirror force
    F_mir = -args[1] * (*bfield)->getGradBAtS(stmp, t_RK + args[4]); //-mu * gradB(pos, runge-kutta time + simtime)
    
    //Lorentz force - simply qE - v x B is taken care of by mu - results in kg.m/s^2 - to convert to Re equivalent - divide by Re
    F_lor = args[2] * (*efield)->getEFieldAtS(stmp, t_RK + args[4]); //q * EFieldatS
    
    return (F_lor + F_mir) / args[3];
}//returns an acceleration in the parallel direction to the B Field

__device__ double foRungeKuttaCUDA(const double y_0, const double h, const double* funcArg, BField** bfield, EField** efield)
{
    // dy / dt = f(t, y), y(t_0) = y_0
    // funcArgs are whatever you need to pass to the equation
    // args array: [s_0, mu, q, m, simtime]
    double k1, k2, k3, k4;
    double y{ y_0 };
    double t_RK{ 0.0 };

    k1 = accel1dCUDA(y, t_RK, funcArg, bfield, efield); //k1 = f(t_n, y_n), returns units of dy / dt
    
    t_RK = h / 2;
    y = y_0 + k1 * t_RK;
    k2 = accel1dCUDA(y, t_RK, funcArg, bfield, efield); //k2 = f(t_n + h/2, y_n + h/2 * k1)

    y = y_0 + k2 * t_RK;
    k3 = accel1dCUDA(y, t_RK, funcArg, bfield, efield); //k3 = f(t_n + h/2, y_n + h/2 * k2)

    t_RK = h;
    y = y_0 + k3 * t_RK;
    k4 = accel1dCUDA(y, t_RK, funcArg, bfield, efield); //k4 = f(t_n + h, y_n + h k3)

    return (k1 + 2 * k2 + 2 * k3 + k4) * h / 6; //returns delta y, not dy / dt, not total y
}

__global__ void computeKernel(double** currData_d, BField** bfield, EField** efield,
    const double simtime, const double dt, const double mass, const double charge, const double simmin, const double simmax)
{
    unsigned int thdInd{ blockIdx.x * blockDim.x + threadIdx.x }; //one thread per particle

    double* v_d{ currData_d[0] };
    const double* mu_d{ currData_d[1] };
    double* s_d{ currData_d[2] };
    const double* t_incident_d{ currData_d[3] };
    double* t_escape_d{ currData_d[4] };

    if (t_escape_d[thdInd] >= 0.0) //particle has escaped, t_escape is >= 0 iff it has both entered and is outside the sim boundaries
        return;
    else if (t_incident_d[thdInd] > simtime) //particle hasn't "entered the sim" yet
        return;
    else if (s_d[thdInd] < simmin * 0.999) //particle is out of sim to the bottom and t_escape not set yet
    {
        t_escape_d[thdInd] = simtime;
        return;
    }
    else if (s_d[thdInd] > simmax * 1.001) //particle is out of sim to the top and t_escape not set yet
    {
        t_escape_d[thdInd] = simtime;
        return;
    }

    //args array: [ps_0, mu, q, m, simtime]
    const double args[]{ s_d[thdInd], mu_d[thdInd], charge, mass, simtime };
    
    v_d[thdInd] += foRungeKuttaCUDA(v_d[thdInd], dt, args, bfield, efield) / 2;
    s_d[thdInd] += v_d[thdInd] * dt;
}
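
For completeness, here's roughly how computeKernel gets launched each iteration. The particle count, block size, and the wrapper itself are illustrative, not my exact driver code:

//illustrative launch only - NUM_PARTS and BLOCKSIZE are placeholders; since
//computeKernel has no bounds check, the count must be a multiple of the block size
void iterateOnce(double** currData_d, BField** bfield_d, EField** efield_d,
    double simtime, double dt, double mass, double charge, double simmin, double simmax)
{
    constexpr unsigned int NUM_PARTS{ 3456000 }; //~3.5 million particles
    constexpr unsigned int BLOCKSIZE{ 256 };

    computeKernel<<<NUM_PARTS / BLOCKSIZE, BLOCKSIZE>>>(currData_d, bfield_d, efield_d,
        simtime, dt, mass, charge, simmin, simmax);
    cudaDeviceSynchronize(); //or check cudaGetLastError() and handle failures
}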
