CUDA-Grid-Stride Loop

The point is how to update the thread ID; this covers the same ground as the note 【CUDA-更新线程ID】 (CUDA - updating thread IDs).

Grid-stride loop

A grid-stride loop looks like this:

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

When there are enough threads to cover all the data elements, each thread processes exactly one element. The thread IDs never need updating; a single parallel pass finishes the job. As the original article puts it: “Common CUDA guidance is to launch one thread per data element, which means to parallelize the above SAXPY loop we write a kernel that assumes we have enough threads to more than cover the array size:”

__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

A single grid covers all the data; a kernel written this way is called a monolithic kernel. The launch below processes 1M elements in one pass:

// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(1<<20, 2.0, x, y);
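
Note that 4096 blocks of 256 threads is exactly 2^12 × 2^8 = 2^20 = 1,048,576 threads, one per element. More generally, the block count for a monolithic kernel is derived from the problem size with a rounding-up division; a minimal sketch of that standard idiom (not from the original article):

int n = 1 << 20;
int blockSize = 256;
int numBlocks = (n + blockSize - 1) / blockSize;  // round up so every element gets a thread
saxpy<<<numBlocks, blockSize>>>(n, 2.0f, x, y);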

But when the data set is larger than the number of available threads, each thread can no longer do just one piece of work: once all the threads finish one batch, they update their IDs and move on to the next batch. This pattern is called a grid-stride loop, as in the kernel below:

__global__ void saxpy(int n, float a, float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

Naturally, the way to update the ID is to add the grid size, i.e. the total number of threads in the grid, which is blockDim.x * gridDim.x. This value can be thought of as the ID update stride. Suppose there are 1280 threads in total; then thread 0 will process elements 0, 1280, 2560, and so on. The benefit of this scheme is that adjacent threads always process adjacent data, which is the most efficient access pattern. As the original article says, “we ensure that all addressing within warps is unit-stride, so we get maximum memory coalescing, just as in the monolithic version.”
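
To make the assignment concrete, here is a tiny illustrative kernel (my own sketch, scaled down, not from the original article) that prints which elements each thread touches. With 8 threads and a stride of 8, thread 0 gets elements 0 and 8, thread 1 gets 1 and 9, and so on, so within each pass neighboring threads read neighboring elements:

#include <cstdio>

__global__ void show_assignment(int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;   // total number of threads in the grid
    for (int i = tid; i < n; i += stride)
        printf("thread %d handles element %d\n", tid, i);
}

// show_assignment<<<2, 4>>>(16);  // 8 threads, stride 8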

To summarize the advantages of the grid-stride loop:

1) Scalability and thread reuse.

It can handle data of any size by simply processing it batch after batch; with only a limited number of threads available, this guarantees that every element is processed correctly. It also lets you limit the number of blocks and tune that number to improve performance.
“By using a loop, you can support any problem size, even if it exceeds the largest grid size your CUDA device supports. Moreover, you can limit the number of blocks you use to tune performance. For example, it’s often useful to launch a number of blocks that is a multiple of the number of multiprocessors on the device, to balance utilization. As an example, we might launch the loop version of the kernel like this:”

int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, devId);
// Perform SAXPY on 1M elements
saxpy<<<32*numSMs, 256>>>(1 << 20, 2.0, x, y);

When you limit the number of blocks in your grid, threads are reused for multiple computations. Thread reuse amortizes thread creation and destruction cost along with any other processing the kernel might do before or after the loop (such as thread-private or shared data initialization).
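
A sketch of what that amortization looks like (the shared-memory setup here is my own illustration, not from the original article): any work done before the loop is paid once per thread or block, then reused for every element that thread processes.

__global__ void saxpy_amortized(int n, float a, float *x, float *y) {
    __shared__ float coeff;            // per-block setup, executed once...
    if (threadIdx.x == 0) coeff = a;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        y[i] = coeff * x[i] + y[i];    // ...then reused across all loop iterations
}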

2) Debugging

It also makes debugging easier. Using just a single thread turns the whole computation into serial processing, so you can locate bugs with print statements and then fix them.
By using a loop instead of a monolithic kernel, you can easily switch to serial processing by launching one block with one thread.

saxpy<<<1,1>>>(1<<20, 2.0, x, y);

This makes it easier to emulate a serial host implementation to validate results, and it can make printf debugging easier by serializing the print order. Serializing the computation also allows you to eliminate numerical variations caused by changes in the order of operations from run to run, helping you to verify that your numerics are correct before tuning the parallel version.
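
A minimal sketch of that validation workflow (the setup values and tolerance are my assumptions, not from the original article): run the grid-stride kernel with <<<1,1>>> and compare against a plain serial host loop.

#include <cstdio>
#include <cstdlib>
#include <cmath>

int main(void) {
    int n = 1 << 20;
    float *x, *y, *y_ref;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    y_ref = (float *)malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
        y_ref[i] = 2.0f * x[i] + 2.0f;   // serial host reference
    }

    saxpy<<<1, 1>>>(n, 2.0f, x, y);      // one block, one thread: fully serialized
    cudaDeviceSynchronize();

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (fabsf(y[i] - y_ref[i]) > 1e-6f) errors++;
    printf("%d mismatches\n", errors);

    cudaFree(x); cudaFree(y); free(y_ref);
    return 0;
}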

3) Portability and readability

The grid-stride loop code is more like the original sequential loop code than the monolithic kernel code, making it clearer for other users. In fact we can pretty easily write a version of the kernel that compiles and runs either as a parallel CUDA kernel on the GPU or as a sequential loop on the CPU. The Hemi library provides a grid_stride_range() helper that makes this trivial using C++11 range-based for loops.

HEMI_LAUNCHABLE
void saxpy(int n, float a, float *x, float *y) {
    for (auto i : hemi::grid_stride_range(0, n)) {
        y[i] = a * x[i] + y[i];
    }
}

We can launch the kernel using this code, which generates a kernel launch when compiled for CUDA, or a function call when compiled for the CPU:

hemi::cudaLaunch(saxpy, 1<<20, 2.0, x, y);
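
If you prefer not to depend on Hemi, the same single-source idea can be hand-rolled (a sketch of mine, not the Hemi implementation): compile-time dispatch on __CUDA_ARCH__ collapses the start index and stride to 0 and 1 on the host, so the identical loop body runs as a sequential loop on the CPU.

__host__ __device__ void saxpy_body(int n, float a, const float *x, float *y) {
#if defined(__CUDA_ARCH__)
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   // grid-stride on the GPU
    int stride = blockDim.x * gridDim.x;
#else
    int start  = 0;                                       // plain loop on the CPU
    int stride = 1;
#endif
    for (int i = start; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    saxpy_body(n, a, x, y);
}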
Grid-stride loops are a great way to make your CUDA kernels flexible, scalable, debuggable, and even portable.

Original content from Mark Harris
Link to the original article


CUDA