CUDA: Optimizing Data Transfers

Optimize Data Transfers

The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050, for example) than the peak bandwidth between host memory and device memory (8 GB/s on PCIe x16 Gen2). This disparity means that your implementation of data transfers between the host and GPU devices can make or break your overall application performance.
Let’s start with a few general guidelines for host-device data transfers.

  1. Minimize the amount of data transferred between host and device when possible, even if that means running kernels on the GPU that get little or no speed-up compared to running them on the host CPU.
  2. Higher bandwidth is possible between the host and the device when using page-locked (or “pinned”) memory.
  3. Batching many small transfers into one larger transfer performs much better because it eliminates most of the per-transfer overhead.
  4. Data transfers between the host and device can sometimes be overlapped with kernel execution and other data transfers.

We investigate the first three guidelines above in this post, and we dedicate the next post to overlapping data transfers. First I want to talk about how to measure time spent in data transfers without modifying the source code.

Measuring Data Transfer Times with nvprof

To measure the time spent in each data transfer, we could record a CUDA event before and after each transfer and use cudaEventElapsedTime(), as we described in a previous post. However, we can get the elapsed transfer time without instrumenting the source code with CUDA events by using nvprof.
nvprof is the recommended way to measure these transfer times.
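For reference, the event-based approach mentioned above looks roughly like the following sketch (h_a, d_a, and bytes are placeholder names for a host pointer, a device pointer, and the transfer size; error checking is omitted):

cudaEvent_t startEvent, stopEvent;
cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent, 0);                         // record an event before the transfer
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // the transfer being timed
cudaEventRecord(stopEvent, 0);                          // record an event after the transfer
cudaEventSynchronize(stopEvent);                        // wait until the transfer has finished

float ms;
cudaEventElapsedTime(&ms, startEvent, stopEvent);       // elapsed time in milliseconds

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);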

As a usage example, suppose we have a source file profile.cu.
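The exact contents do not matter to nvprof; any program that performs host-device copies will do. A minimal, hypothetical profile.cu that performs one host-to-device and one device-to-host copy of a 1M-element float array (the size is arbitrary) might look like this:

#include <stdlib.h>
#include <string.h>

int main()
{
    const int N = 1 << 20;                    // 1M floats, about 4 MB
    const size_t bytes = N * sizeof(float);

    float *h_a = (float*)malloc(bytes);       // pageable host array
    float *d_a;
    cudaMalloc((void**)&d_a, bytes);          // device array

    memset(h_a, 0, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host to device
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);   // device to host

    cudaFree(d_a);
    free(h_a);
    return 0;
}

Compile it with nvcc and run it under nvprof: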

$ nvcc profile.cu
$ nvprof ./a.out

nvprof produces output like the following:

$ nvprof ./a.out 
======== NVPROF is profiling a.out...
======== Command: a.out
======== Profiling result:
Time(%) Time Calls Avg Min Max Name
50.08 718.11us 1 718.11us 718.11us 718.11us [CUDA memcpy DtoH]
49.92 715.94us 1 715.94us 715.94us 715.94us [CUDA memcpy HtoD]

Minimizing Data Transfers

If a transfer can be avoided, avoid it. In short, use the PCIe bus as sparingly as possible.

Pinned Host Memory

In tests on a P106 and a GTX 1060, using pinned memory did not yield a noticeable improvement in transfer bandwidth.
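For completeness, here is a hedged sketch of how a pinned host buffer is allocated with cudaMallocHost() alongside an ordinary pageable buffer, so the two kinds of transfer can be compared (sizes and pointer names are illustrative):

#include <stdlib.h>

int main()
{
    const size_t bytes = (1 << 20) * sizeof(float);

    float *h_pageable = (float*)malloc(bytes);        // ordinary pageable host memory
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);         // page-locked (pinned) host memory

    float *d_a;
    cudaMalloc((void**)&d_a, bytes);

    // A transfer from pinned memory can be performed by DMA directly; a transfer
    // from pageable memory is first staged through a pinned buffer by the driver.
    cudaMemcpy(d_a, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_a);
    cudaFreeHost(h_pinned);                           // pinned memory is freed with cudaFreeHost
    free(h_pageable);
    return 0;
}

Keep in mind that pinned memory should not be over-allocated, since it reduces the physical memory available to the operating system and other programs.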

Batching Small Transfers

Due to the overhead associated with each transfer, it is preferable to batch many small transfers together into a single transfer. This is easy to do by using a temporary array, preferably pinned, and packing it with the data to be transferred.
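As an illustration, a hedged sketch of packing three small host arrays into one pinned staging buffer and issuing a single transfer (array names and sizes are made up) could look like this:

#include <string.h>

int main()
{
    // Three small host arrays destined for the device (sizes are illustrative).
    float a[256], b[256], c[256];
    const size_t chunk = sizeof(a);
    memset(a, 0, sizeof(a)); memset(b, 0, sizeof(b)); memset(c, 0, sizeof(c));

    // Pack them into one pinned staging buffer...
    float *h_staging;
    cudaMallocHost((void**)&h_staging, 3 * chunk);
    memcpy(h_staging,       a, chunk);
    memcpy(h_staging + 256, b, chunk);
    memcpy(h_staging + 512, c, chunk);

    // ...and issue one large transfer instead of three small ones.
    float *d_staging;
    cudaMalloc((void**)&d_staging, 3 * chunk);
    cudaMemcpy(d_staging, h_staging, 3 * chunk, cudaMemcpyHostToDevice);

    cudaFree(d_staging);
    cudaFreeHost(h_staging);
    return 0;
}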

For two-dimensional array transfers, you can use cudaMemcpy2D().

cudaMemcpy2D(dest, dest_pitch, src, src_pitch, w, h, cudaMemcpyHostToDevice)

The arguments here are a pointer to the first destination element and the pitch of the destination array, a pointer to the first source element and pitch of the source array, the width and height of the submatrix to transfer, and the memcpy kind. There is also a cudaMemcpy3D() function for transfers of rank three array sections.
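As a hedged, self-contained example (all dimensions are illustrative), the snippet below copies the top-left 128 x 64 submatrix of a 512 x 512 host matrix into a device allocation created with cudaMallocPitch(). Note that cudaMemcpy2D() expects the width in bytes and the height in rows:

#include <stdlib.h>

int main()
{
    // Host matrix of 512 x 512 floats stored densely in row-major order;
    // we transfer its top-left 128-column by 64-row submatrix.
    const int src_cols = 512, src_rows = 512;
    const int sub_cols = 128, sub_rows = 64;

    const size_t src_pitch = src_cols * sizeof(float);    // host rows are densely packed
    float *src = (float*)calloc(src_cols * src_rows, sizeof(float));

    // cudaMallocPitch pads each device row for alignment and reports the pitch in bytes.
    float *dest;
    size_t dest_pitch;
    cudaMallocPitch((void**)&dest, &dest_pitch, sub_cols * sizeof(float), sub_rows);

    cudaMemcpy2D(dest, dest_pitch, src, src_pitch,
                 sub_cols * sizeof(float), sub_rows,      // width in bytes, height in rows
                 cudaMemcpyHostToDevice);

    cudaFree(dest);
    free(src);
    return 0;
}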

Original author: Mark Harris
Link to the original article


CUDA