When using GPU Coder, how should I try to minimize how often data is transferred between the CPU and GPU?
MathWorks Support Team
on 2 Nov 2020
Edited: MathWorks Support Team
on 31 Jan 2025
I am using GPU Coder and am concerned about CPU/GPU data transfer affecting performance. Suppose I have two MATLAB functions with the 'coder.gpu.kernelfun' pragma at the top of each, and I do something with the data between calling them:
A = half(data);
B = kernelfun1(A); % output is B
% do something with B here
C = kernelfun2(B); % input is B
Does the data remain on the GPU the whole time as a half-precision float, or does it get copied to the CPU during the "do something with B" part?
Accepted Answer
MathWorks Support Team
on 25 Jan 2025
Edited: MathWorks Support Team
on 31 Jan 2025
GPU Coder tries to minimize copies between the CPU and GPU. Whether a CPU/GPU copy is generated depends purely on how the data is accessed.
To access the relevant documentation, execute the following command in the MATLAB R2020b command window:
>> web(fullfile(docroot, 'gpucoder/ug/gpu-memory-allocation-and-minimization.html'))
If you generate code for kernelfun1 and kernelfun2 separately (i.e., you call 'codegen' twice) and then combine the outputs of the generated MEX functions in MATLAB, as in kernelfun1(b) .* kernelfun2(c), a transfer to the CPU occurs to perform the multiplication whenever kernelfun1 or kernelfun2 returns a 'half' data type. This is a current limitation of MATLAB: 'gpuArray' does not support the half data type. However, if you perform the multiplication in a wrapper function, e.g.:
function a = kernelfun3(b,c)
coder.gpu.kernelfun;
a = kernelfun1(b) .* kernelfun2(c);
end
and call 'codegen' only on kernelfun3, then GPU Coder generates code that performs the multiplication on the GPU.
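As a minimal sketch of generating code for only the wrapper function above (kernelfun3), the commands below build a single GPU MEX function. The input sizes (1024-by-1 half-precision column vectors) are assumptions made for illustration; substitute the types and sizes that kernelfun1 and kernelfun2 actually expect.

```matlab
% Example inputs. The 1024-by-1 size is an assumption, not taken from the question.
b = half(ones(1024, 1));
c = half(ones(1024, 1));

% Create a GPU Coder configuration object that targets a MEX function.
cfg = coder.gpuConfig('mex');

% Call codegen once, on the wrapper only, so that kernelfun1, kernelfun2,
% and the element-wise multiplication are all compiled into one entry point.
codegen -config cfg -args {b, c} kernelfun3

% By default the generated MEX function is named kernelfun3_mex.
a = kernelfun3_mex(b, c);
```

With this layout, the half-precision intermediates never cross the MEX boundary, so they do not need to be represented as MATLAB variables between the two calls.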
The limitation above does not apply if the data type returned by kernelfun1 and kernelfun2 is 'single' or another data type supported by gpuArray. In that case, the following multiplication is performed on the GPU:
kernelfun1(b) .* kernelfun2(c)
GPU Coder tries to fuse kernels as much as possible, so with the example above, you may find that the generated code contains a single GPU kernel instead of three separate ones for kernelfun1, kernelfun2, and kernelfun3. The effectiveness of this optimization depends on program structure and dataflow, and we have observed cases in which it does not happen. We recommend running code generation on your design and examining the generated code to see whether the coder performed this optimization. If it did not, you can try altering your design to achieve the desired result.
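One rough way to examine the generated code is to count the host/device copies and kernel launches in the generated CUDA source. The sketch below assumes the code was generated for an entry point named kernelfun3 with the default MEX output folder (codegen/mex/kernelfun3) and that the main source file is kernelfun3.cu; adjust the path if your setup differs. The helper name countHits is hypothetical.

```matlab
% Hypothetical helper: count occurrences of a pattern in a block of text.
countHits = @(txt, pat) numel(strfind(txt, pat));

% Assumed location of the generated CUDA source (default codegen output folder).
cuFile = fullfile('codegen', 'mex', 'kernelfun3', 'kernelfun3.cu');

if isfile(cuFile)
    src = fileread(cuFile);
    % cudaMemcpy calls indicate copies between host and device memory.
    fprintf('cudaMemcpy calls: %d\n', countHits(src, 'cudaMemcpy'));
    % The <<<...>>> syntax marks CUDA kernel launches.
    fprintf('Kernel launches:  %d\n', countHits(src, '<<<'));
else
    fprintf('Generated file not found: %s\n', cuFile);
end
```

These counts are only a coarse indicator. Reading the code around each copy, or generating a code generation report with the -report option of codegen, gives a clearer picture of where the transfers occur.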
Please use the below link to search for the required information in the current release: