When using GPU Coder, how should I try to minimize how often data is transferred between CPU and GPU?

Question

MathWorks Support Team, 2 November 2020

https://kr.mathworks.com/matlabcentral/answers/734315-when-using-gpu-coder-how-should-i-try-to-minimize-how-often-data-is-transferred-between-cpu-and-gpu

Edited: MathWorks Support Team, 31 January 2025

I am using GPU Coder and am concerned about CPU/GPU data transfer affecting performance. Suppose I have two MATLAB functions with the 'coder.gpu.kernelfun' pragma at the top of each, and I do something with the data between calling them:

A = half(data);
B = kernelfun1(A); % output is B
% do something with B here
C = kernelfun2(B); % input is B

Does the data remain on the GPU the whole time as a half-precision float, or does it get copied to the CPU during the "do something with B" part?

Answer 1

MathWorks Support Team, 25 January 2025

https://kr.mathworks.com/matlabcentral/answers/734315-when-using-gpu-coder-how-should-i-try-to-minimize-how-often-data-is-transferred-between-cpu-and-gpu#answer_613145

Edited: MathWorks Support Team, 31 January 2025

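GPU Coder tries to minimize copies between the CPU and the GPU. Whether a copy is inserted depends purely on the data access pattern: a copy is needed only when data written on one device is then read on the other.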

To access the relevant documentation, execute the following command in the MATLAB R2020b command window:

>> web(fullfile(docroot, 'gpucoder/ug/gpu-memory-allocation-and-minimization.html'))

If you generate code for kernelfun1 and kernelfun2 separately (that is, you call 'codegen' twice) and then combine the generated MEX functions from MATLAB, for example kernelfun1(b) .* kernelfun2(c), then any output of type 'half' is transferred to the CPU and the multiplication is performed there. This is a current limitation of MATLAB: 'gpuArray' does not support the half data type. However, if you perform the multiplication in a wrapper function, e.g.:

function a = kernelfun3(b,c)
    coder.gpu.kernelfun;
    a = kernelfun1(b) .* kernelfun2(c);
end

and only call 'codegen' on kernelfun3, then GPU Coder will generate code such that the multiplication is performed on the GPU.
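As a minimal sketch of that workflow, the script below generates a single MEX function from the wrapper. The input size and the variable names b and c are assumptions chosen for illustration, and creating 'half' values requires a product that provides the half type; substitute inputs that match your own design.

```matlab
% Hypothetical example inputs (size chosen for illustration only).
b = half(ones(1024, 1));
c = half(ones(1024, 1));

% GPU code generation configuration for a MEX target.
cfg = coder.gpuConfig('mex');

% Generate code for the wrapper only; kernelfun1 and kernelfun2 are
% compiled as part of kernelfun3 rather than as separate MEX functions.
codegen -config cfg -args {b, c} kernelfun3

% Call the generated MEX function (default name is <function>_mex).
a = kernelfun3_mex(b, c);
```

Because kernelfun1 and kernelfun2 are called inside the generated code, their intermediate results do not have to pass through MATLAB between the calls.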

The limitation above does not apply if the data type returned by kernelfun1 and kernelfun2 is 'single' or another data type supported by gpuArray. In that case, the following multiplication will be performed on the GPU:

kernelfun1(b) .* kernelfun2(c)

GPU Coder tries to fuse kernels as much as possible, so for the example above the generated code may contain a single GPU kernel instead of three separate ones for kernelfun1, kernelfun2, and kernelfun3. How well this optimization works depends on program structure and data flow, and we have seen cases where it is not applied. We recommend generating code for your design and examining the generated code to check whether the kernels were fused. If they were not, try restructuring your design to achieve the desired result.
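One rough way to examine the generated code is to scan the generated CUDA sources for kernel declarations and memory copies. The sketch below assumes that code for kernelfun3 was already generated as a MEX target in the default output folder (codegen/mex/kernelfun3). The counts are only indicators: a kernel may appear both as a prototype and as a definition, so reading the .cu files or the code generation report directly is more reliable.

```matlab
% Assumed location: default codegen output folder for a MEX target.
srcDir = fullfile('codegen', 'mex', 'kernelfun3');
files = dir(fullfile(srcDir, '*.cu'));

nGlobal = 0;   % occurrences of the __global__ qualifier
nMemcpy = 0;   % calls to cudaMemcpy (host/device transfers)

for k = 1:numel(files)
    txt = fileread(fullfile(files(k).folder, files(k).name));
    nGlobal = nGlobal + numel(regexp(txt, '__global__'));
    nMemcpy = nMemcpy + numel(regexp(txt, 'cudaMemcpy\s*\('));
end

fprintf('__global__ occurrences: %d\n', nGlobal);
fprintf('cudaMemcpy calls:       %d\n', nMemcpy);
```

Comparing these numbers before and after a design change gives a quick hint as to whether kernels were fused or extra transfers were introduced.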

Please use the link below to search for the required information in the current release:

https://www.mathworks.com/help/
