CUDA-优化优先级

高优先级：

为最大化开发者的效率，使用程序分析工具来找到程序最耗时的部分，找到效率瓶颈。
最大化地利用CUDA，首先想办法把原程序中的串行代码并行化。
使用程序使用的有效带宽最为测量性能和优化效果的指标。
- 理论带宽
  
  理论带宽可以从硬件的商品指标计算得到。比如NVIDIA Tesla V100 使用 HBM2 (double data rate) RAM 时钟频率是 877 MHz。存储器位宽为 4096-bit-wide。
  
  通过上述指标可以计算这个显卡的理论带宽：
  
  ( 0.877 × 10^9 × ( 4096 / 8 ) × 2 ) ÷ 10^9 = 898 ⁢ GB/s ⁡
  
  (0.877 × 10^9)表示把时钟频率转化成Hz。 (4096 / 8) × 2)将位宽单位转化成字节，后乘以2，由于RAM是double data rate。最后除以 10^9 将最终单位转化为GB/s。
- 实际带宽
  
  实际带宽通过程序的实际执行，通过下面的公式得到：
  
  实际带宽 = ( ( Br + Bw ) ÷ 10^9 ) ÷ time
  
  结果的单位是GB/s。Br表示每个kernel读取的字节数，Bw表示每个kernel写入的字节数。
  
  比如，一个程序要计算一个2048*2048的矩阵拷贝，整个过程的带宽：
  
  实际带宽 = ( ( 2048^2 × 4 × 2 ) ÷ 10^9 ) ÷ time
  
  其中乘以4 表示矩阵每个元素的类型是float（4字节），乘以2是因为由读写两个过程。最后除以 10^9 将最终单位转化为GB/s。

尽可能不使用PCIe，步进行Device和Host间的数据传输。数据传输很可能抵消掉并行带来的性能提升。

中间数据应在Device内存中创建，销毁，由设备操作。此外，由于与每个传输相关联的开销，将许多小的传输批处理为一个较大的传输要比分别进行每个传输好得多。

此外，当使用pinned memory时，Device和Host间的带宽更高。
尽可能确保Global memory的访问时，地址是连续的。记住，连续的threads访问连续的地址，效率是最高的。
尽量少用Global memory，尽量多的使用Shared memory。

内存指令（Memory instructions）包括读取或写入shared，local或Global内存的任何指令。当访问未缓存的local或Global内存时，内存延迟有数百个时钟周期。

下边这个例子，的赋值运算符，有很高的吞吐量，但是从Global的读操作，会有上百个时钟周期的延迟。
1
2
3
__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];
如果在等待Global内存访问完成的同时，可以发出足够的独立算术指令，则线程调度程序（thread scheduler）可以隐藏大部分全局内存延迟。但是，最好尽可能避免访问全局内存。这种操作称为Overlap

总之，能不用Global memory就尽量不使用。
在一个warp中，避免出现分支，就是说，避免Divergence。

中优先级

使用Shared内存以避免从Global内存进行冗余传输。见使用Shared memory对矩阵相乘进行的优化。
为每个线程保持足够的寄存器占用率。CUDA有个工具来计算资源占用率：CUDA Occupancy Calculator
对于kernel的配置，每个block中的线程数应该是32 的倍数，CUDA中32是个特别的数字，一个warp由32 个线程，Shared memory被划分成32个banks。
在loop中，对于循环计数器，由于循环计数器的值通常都是正的，因此可能会尝试将其声明为无符号的。但是，为了获得更好的性能，应该将它们声明为signed。
当速度超过精度时，使用快速的数学库。

CUDA支持两种数学库，两种数学库通过名字区分：__functionName()和functionName()。
- __functionName()运算时，直接映射到硬件层。快，但是精度低。
- functionName()慢，但是精度高。
尽可能的使用更快，更专的数学库，而不是更慢，更通用的数学库。这里

低优先级

Use zero-copy operations on integrated GPUs for CUDA Toolkit version 2.2 and later.
使用移位运算来避免昂贵的出发和取模运算。

Integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible: If n is a power of 2, ( i / n ) is equivalent to ( i ≫ log2 n ) and ( i % n ) is equivalent to ( i & n - 1 ).
避免将双精度数自动转换为浮点数。

The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for:
- Functions operating on char or short whose operands generally need to be converted to an int
- Double-precision floating-point constants (defined without any type suffix) used as input to single-precision floating-point computations
  
  The latter case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f.
  
  For single-precision code, use of the float type and the single-precision math functions are highly recommended.
  
  It should also be noted that the CUDA math library’s complementary error function, erfcf(), is particularly fast with full single-precision accuracy.
让编译器很容易使用分支预测代替（in lieu of）循环或控制语句。

Sometimes, the compiler may 循环展开 unroll loops or optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. The programmer can also control loop unrolling using #pragma unroll.