CUDA-project review

This post records the CUDA topics that were, or were not, used in the project.

__constant__ float d_arr[10] reserves space for 10 floats in constant memory.
cudaMemcpyToSymbol(d_arr, h_arr, sizeof(h_arr)) copies data from the host into that device-side space.
__device__ float d_arr[10][5] reserves space in global memory.
cudaMemcpyFromSymbol(h_arr, d_arr, sizeof(d_arr)) copies data from the device back into the host array.
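The two symbol copies can be tried together in a small round trip. This is only a sketch: the names d_coef, d_out, scale and symbol_roundtrip_ok are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float d_coef[10];  // constant memory: read-only inside kernels
__device__ float d_out[10];     // statically allocated global memory

// Each thread doubles one coefficient and writes it to global memory.
__global__ void scale() {
    int i = threadIdx.x;
    if (i < 10) d_out[i] = 2.0f * d_coef[i];
}

// Round trip: host -> constant memory -> kernel -> global memory -> host.
bool symbol_roundtrip_ok() {
    float h_in[10], h_out[10];
    for (int i = 0; i < 10; ++i) h_in[i] = static_cast<float>(i);

    cudaMemcpyToSymbol(d_coef, h_in, sizeof(h_in));
    scale<<<1, 10>>>();
    // Blocking copy on the default stream, so it runs after the kernel finishes.
    cudaMemcpyFromSymbol(h_out, d_out, sizeof(h_out));

    for (int i = 0; i < 10; ++i)
        if (h_out[i] != 2.0f * h_in[i]) return false;
    return true;
}

int main() {
    printf("%s\n", symbol_roundtrip_ok() ? "ok" : "mismatch");
    return 0;
}
```

Note that the symbol itself is passed to both calls, not its address and not a string with its name.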

An array declared inside a kernel is private to each thread (typically placed in local memory), and its lifetime is that of the thread:

__global__ void func(){
    float tmp[7];
    ...
}

Prefer registers to local memory: when the data is small and its size is fixed, rewrite float a[3] as float a0, a1, a2.

Use a grid-stride loop. How idd is computed depends on the actual problem:

__global__ void func(){
    int tid = ...;
    int stride = ...;
    for (int idd = tid; idd<N; idd+=stride){
        // idd is the index of the element this thread handles in this iteration
    }
}

With a grid-stride loop in place, the kernel can be launched as <<<1,1>>> with print statements added at suitable points, which makes debugging easier.
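A minimal sketch of the pattern for a 1D array, assuming the usual 1D choices for tid and stride (the real project may compute them differently). The kernel add_one, the helper run_add_one and the DEBUG_PRINT switch are hypothetical names. The same kernel gives the same result with a normal launch and with <<<1,1>>>.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: correct for any launch configuration, including <<<1, 1>>>.
__global__ void add_one(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index (1D assumption)
    int stride = gridDim.x * blockDim.x;              // total number of threads in the grid
    for (int idd = tid; idd < n; idd += stride) {
        data[idd] += 1.0f;
#ifdef DEBUG_PRINT
        // Most useful with <<<1, 1>>>, where the output appears in element order.
        printf("thread %d handles element %d\n", tid, idd);
#endif
    }
}

// Runs the kernel with the given configuration; true if every element became 1.
bool run_add_one(int blocks, int threads) {
    const int n = 1000;
    float h[n];
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    add_one<<<blocks, threads>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < n; ++i)
        if (h[i] != 1.0f) return false;
    return true;
}

int main() {
    bool parallel_ok = run_add_one(4, 128);  // normal launch
    bool serial_ok = run_add_one(1, 1);      // debug launch: one thread does all the work
    printf("parallel: %d, serial: %d\n", parallel_ok, serial_ok);
    return (parallel_ok && serial_ok) ? 0 : 1;
}
```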

In the implementation, wrap every CUDA call in checkCudaError(). This function has to be written yourself.
After each kernel launch, add checkCudaError(cudaGetLastError());
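One possible way to write such a helper is sketched below. Implementing checkCudaError as a macro over a helper function, so that the file and line get reported, is my own choice here and not something the project prescribes; checkCudaErrorImpl and the noop kernel are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the error with file/line and abort if a CUDA call failed.
inline void checkCudaErrorImpl(cudaError_t err, const char* file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error at %s:%d: %s\n", file, line, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
#define checkCudaError(call) checkCudaErrorImpl((call), __FILE__, __LINE__)

__global__ void noop() {}

int main() {
    float* d = nullptr;
    checkCudaError(cudaMalloc(&d, 16 * sizeof(float)));

    noop<<<1, 1>>>();
    checkCudaError(cudaGetLastError());       // catches launch errors (bad configuration, ...)
    checkCudaError(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    checkCudaError(cudaFree(d));
    return 0;
}
```

A kernel launch returns nothing, which is why the separate cudaGetLastError() call is needed right after it.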
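Look through the examples for code that fits the current problem.
When implementing a function, first work it out on paper and define what each variable means; write the code last.
In a larger implementation, make each block of code do one thing. Avoid using variables from other blocks so that each block stays as independent as possible.
Pinned memory vs. pageable memory.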
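For comparison, a small sketch that pushes the same data through a pageable buffer (plain malloc) and a pinned buffer (cudaMallocHost). The helper names roundtrip_ok and pinned_and_pageable_ok are hypothetical, and the sketch only checks correctness; to see the bandwidth difference the copies would have to be timed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Copies n floats host -> device -> host through h_buf and checks the values.
bool roundtrip_ok(float* h_buf, size_t n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h_buf[i] = static_cast<float>(i);
    cudaMemcpy(d, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    for (size_t i = 0; i < n; ++i) h_buf[i] = 0.0f;
    cudaMemcpy(h_buf, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (size_t i = 0; i < n; ++i)
        if (h_buf[i] != static_cast<float>(i)) return false;
    return true;
}

bool pinned_and_pageable_ok() {
    const size_t n = 1 << 20;

    // Pageable: ordinary host memory, the driver stages it before the transfer.
    float* h_pageable = static_cast<float*>(malloc(n * sizeof(float)));
    // Pinned: page-locked host memory that the GPU can read or write via DMA directly.
    float* h_pinned = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&h_pinned), n * sizeof(float));

    bool ok = roundtrip_ok(h_pageable, n) && roundtrip_ok(h_pinned, n);

    free(h_pageable);
    cudaFreeHost(h_pinned);  // pinned memory has its own free call
    return ok;
}

int main() {
    printf("%s\n", pinned_and_pageable_ok() ? "ok" : "mismatch");
    return 0;
}
```

Pinned memory is also what cudaMemcpyAsync needs in order to really overlap with other work, but it is a scarce resource and should not be over-allocated.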
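deviceQuery: a lightweight way to inspect the device.
Cooperative Groups.
When the data to be processed are 2D or 3D points, one trick for keeping PCIe traffic to a minimum is that the thread index can naturally serve as the point's coordinates: (idx, idy) <=> (x, y).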
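A sketch of that trick on a 2D grid: no coordinate array is uploaded at all, each thread derives (x, y) from its own index. The kernel radius2 and the quantity it computes (x*x + y*y) are purely illustrative assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread is one grid point; its coordinates come from the thread index,
// so no coordinate data has to cross PCIe.
__global__ void radius2(float* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // idx <=> x
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // idy <=> y
    if (x < width && y < height)
        out[y * width + x] = static_cast<float>(x * x + y * y);
}

bool coords_ok(int width, int height) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, width * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    radius2<<<grid, block>>>(d_out, width, height);

    std::vector<float> h_out(width * height);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[y * width + x] != static_cast<float>(x * x + y * y)) return false;
    return true;
}

int main() {
    printf("%s\n", coords_ok(100, 60) ? "ok" : "mismatch");
    return 0;
}
```

Only the result travels back to the host; the input side of the transfer disappears entirely.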
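Because dynamic allocation inside device code is limited (in-kernel malloc exists, but it draws from a limited device heap), the CPU version of an algorithm should already use stack memory, with arrays sized generously enough.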
cudaMallocPitch();
cudaMemset2D();
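A sketch of how the two pitched-memory calls fit together, assuming a small float image. The kernel add_row_index and the helper pitched_ok are hypothetical; cudaMemcpy2D is added to bring the result back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Rows are pitch bytes apart, so the row pointer is computed with byte arithmetic.
__global__ void add_row_index(float* base, size_t pitch, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = reinterpret_cast<float*>(reinterpret_cast<char*>(base) + y * pitch);
        row[x] += static_cast<float>(y);
    }
}

bool pitched_ok(int width, int height) {
    float* d = nullptr;
    size_t pitch = 0;  // row stride in bytes, chosen by the runtime (>= width * sizeof(float))
    cudaMallocPitch(reinterpret_cast<void**>(&d), &pitch, width * sizeof(float), height);
    cudaMemset2D(d, pitch, 0, width * sizeof(float), height);  // zero only the used part of each row

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    add_row_index<<<grid, block>>>(d, pitch, width, height);

    // The host array is tightly packed, so its pitch is width * sizeof(float).
    std::vector<float> h(width * height);
    cudaMemcpy2D(h.data(), width * sizeof(float), d, pitch,
                 width * sizeof(float), height, cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h[y * width + x] != static_cast<float>(y)) return false;
    return true;
}

int main() {
    printf("%s\n", pitched_ok(100, 7) ? "ok" : "mismatch");
    return 0;
}
```

The width arguments of all three calls are in bytes, not in elements.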
std::bitset<16> foo;
Converting between degrees and radians: 1° = π/180 rad, 1 rad = (180/π)°.
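The conversion as a pair of small helpers usable on both host and device. The names deg2rad, rad2deg and to_radians are hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// 1 degree = pi/180 rad
__host__ __device__ inline float deg2rad(float deg) {
    const float pi = 3.14159265358979f;
    return deg * (pi / 180.0f);
}

// 1 rad = 180/pi degrees
__host__ __device__ inline float rad2deg(float rad) {
    const float pi = 3.14159265358979f;
    return rad * (180.0f / pi);
}

// Converts an array of angles in place, one thread per element.
__global__ void to_radians(float* angles, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) angles[i] = deg2rad(angles[i]);
}

int main() {
    float h[3] = {0.0f, 90.0f, 180.0f};
    float* d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    to_radians<<<1, 3>>>(d, 3);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    for (int i = 0; i < 3; ++i) printf("%f rad = %f deg\n", h[i], rad2deg(h[i]));
    return 0;
}
```

The trigonometric functions such as sinf and cosf take radians, so angles given in degrees have to go through deg2rad first.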
choose device
multiple GPUs
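A sketch covering both notes: list the devices, pick one, then loop over all of them. Choosing the device with the most multiprocessors is just one possible criterion, assumed here for illustration; pick_device, launch_on_all_devices and the hello kernel are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello(int dev) {
    printf("hello from device %d\n", dev);
}

// Selects the device with the most multiprocessors; returns -1 if there is none.
int pick_device() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return -1;

    int best = 0, best_sms = -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, %d SMs\n", dev, prop.name, prop.multiProcessorCount);
        if (prop.multiProcessorCount > best_sms) {
            best_sms = prop.multiProcessorCount;
            best = dev;
        }
    }
    cudaSetDevice(best);  // later CUDA calls from this host thread target this device
    return best;
}

// Multiple GPUs: switch the current device, launch, then wait for each one.
void launch_on_all_devices() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        hello<<<1, 1>>>(dev);  // asynchronous, so the devices work concurrently
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
}

int main() {
    int dev = pick_device();
    if (dev < 0) {
        printf("no CUDA device found\n");
        return 1;
    }
    launch_on_all_devices();
    return 0;
}
```

Memory allocated with cudaMalloc belongs to the device that was current at the time, so each GPU needs its own buffers.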
Use events to time code, or write your own timer class.
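Timing with events might look like the sketch below. The kernel busy and the helper time_kernel_ms are hypothetical; a custom timer class would essentially wrap the same four event calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Some work to measure: a few multiply-adds per element.
__global__ void busy(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = static_cast<float>(i);
        for (int k = 0; k < 100; ++k) v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Returns the kernel's execution time in milliseconds, measured on the GPU.
float time_kernel_ms(int n) {
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);             // enqueued before the kernel
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);              // enqueued after the kernel
    cudaEventSynchronize(stop);         // wait until the stop event has actually happened

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main() {
    printf("kernel took %f ms\n", time_kernel_ms(1 << 20));
    return 0;
}
```

Because kernel launches are asynchronous, a plain CPU timer around the launch would mostly measure launch overhead unless a synchronization is added.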
Unified Memory.
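A minimal Unified Memory sketch: one pointer from cudaMallocManaged is used on both host and device, with no explicit cudaMemcpy. The kernel double_all and the helper managed_ok are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void double_all(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

bool managed_ok(int n) {
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // visible to both CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = static_cast<float>(i);  // written on the host

    double_all<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // required before the host touches the data again

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (data[i] != 2.0f * static_cast<float>(i)) ok = false;

    cudaFree(data);
    return ok;
}

int main() {
    printf("%s\n", managed_ok(1000) ? "ok" : "mismatch");
    return 0;
}
```

This is convenient for prototyping; the data still has to migrate between host and device behind the scenes, so it is not automatically faster than explicit copies.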
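Loop unrolling, to reduce the number of operations.
Watch out for the parts of the CPU code that cannot be parallelized, for example: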
if (tid < N){
    bb[tid+1] = count + bb[tid];
}
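The statement above can only execute serially, because each bb[tid+1] depends on the bb[tid] written by the previous step.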
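One way to keep such a step on the GPU without a data race is to let a single thread walk the array in order. This is a sketch under assumptions: count is taken to be a per-step array count[i] (the original snippet only shows a scalar), bb is assumed to hold n + 1 elements, and running_sum and running_sum_ok are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// bb[i + 1] needs the finished bb[i], so one thread processes the steps in order.
// Launching one thread per i would let threads read bb[i] before it is written.
__global__ void running_sum(const int* count, int* bb, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < n; ++i) bb[i + 1] = count[i] + bb[i];
    }
}

bool running_sum_ok() {
    const int n = 5;
    int h_count[n] = {3, 1, 4, 1, 5};
    int h_bb[n + 1] = {0};  // bb[0] = 0 is the starting value

    // Serial CPU reference.
    int ref[n + 1] = {0};
    for (int i = 0; i < n; ++i) ref[i + 1] = h_count[i] + ref[i];

    int *d_count = nullptr, *d_bb = nullptr;
    cudaMalloc(&d_count, sizeof(h_count));
    cudaMalloc(&d_bb, sizeof(h_bb));
    cudaMemcpy(d_count, h_count, sizeof(h_count), cudaMemcpyHostToDevice);
    cudaMemcpy(d_bb, h_bb, sizeof(h_bb), cudaMemcpyHostToDevice);

    running_sum<<<1, 1>>>(d_count, d_bb, n);

    cudaMemcpy(h_bb, d_bb, sizeof(h_bb), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    cudaFree(d_bb);

    for (int i = 0; i <= n; ++i)
        if (h_bb[i] != ref[i]) return false;
    return true;
}

int main() {
    printf("%s\n", running_sum_ok() ? "ok" : "mismatch");
    return 0;
}
```

A single GPU thread is slow, so this only pays off when the serial part is short and it saves copying the data back to the host.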
