Implement Performance Metrics in CUDA
“Before we jump into these performance measurement techniques, we need to discuss how to synchronize execution between the host and device.”
Why? Because some operations execute synchronously and others asynchronously. Only by knowing the difference can we measure performance correctly.
When the host reaches a cudaMemcpy(), execution becomes synchronous: the CPU blocks until the transfer, and all previously issued work it depends on, has completed before continuing. If there is no cudaMemcpy(), cudaDeviceSynchronize() can be used to achieve the same synchronization.
Timing with a CPU Timer
```c
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);

t1 = myCPUTimer();
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaDeviceSynchronize();
t2 = myCPUTimer();

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
```
“we use the explicit synchronization barrier cudaDeviceSynchronize() to block CPU execution until all previously issued commands on the device have completed. Without this barrier, this code would measure the kernel launch time and not the kernel execution time.”

I have made this mistake before. The CPU drives execution: when it reaches the kernel call, the CPU launches the kernel, which then runs on the GPU, and the CPU immediately proceeds to the next statement. Without cudaDeviceSynchronize(), the CPU would record t2 right away, so t2 - t1 would measure the time to launch the kernel rather than the time to execute it. In other words, the CPU and GPU run asynchronously; only cudaDeviceSynchronize() tells the CPU to wait until the GPU has finished the kernel before recording t2.
Timing using CUDA Events
A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline. For this reason, CUDA offers a relatively light-weight alternative to CPU timers via the CUDA event API. The CUDA event API includes calls to create and destroy events, record events, and compute the elapsed time in milliseconds between two recorded events.
A CUDA stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped, a property that can be used to hide data transfers between the host and the device.

By default, stream 0 is used: up to now, all operations on the GPU have occurred in the default stream, or stream 0 (also called the “Null Stream”).
Here is an example:

```c
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
```
This timer records the execution time of the kernel.
CUDA events are of type cudaEvent_t and are created and destroyed with cudaEventCreate() and cudaEventDestroy(). In the above code, cudaEventRecord() places the start and stop events into the default stream, stream 0. The device will record a time stamp for the event when it reaches that event in the stream. The function cudaEventSynchronize() blocks CPU execution until the specified event is recorded. The cudaEventElapsedTime() function returns in the first argument the number of milliseconds elapsed between the recording of start and stop. This value has a resolution of approximately one half microsecond.
Memory Bandwidth
We need to distinguish between the theoretical (peak) bandwidth and the effective (measured) bandwidth.

Theoretical Bandwidth
Theoretical bandwidth can be calculated using hardware specifications available in the product literature. For example, the NVIDIA Tesla M2050 GPU uses DDR (double data rate) RAM with a memory clock rate of 1,546 MHz and a 384-bit wide memory interface. Using these data items, the peak theoretical memory bandwidth of this GPU can be computed using the following:

BW_Theoretical = 1546 * 10^6 * (384 / 8) * 2 / 10^9 = 148 GB/s
Explanation: In this calculation, we convert the memory clock rate to Hz, multiply it by the interface width (divided by 8, to convert bits to bytes), and multiply by 2 due to the double data rate. Finally, we divide by 10^9 to convert the result to GB/s.
Effective Bandwidth
We calculate effective bandwidth by timing specific program activities and by knowing how our program accesses data. We use the following equation.
BW_Effective = (R_B + W_B) / (t * 10^9)
Here, BW_Effective is the effective bandwidth in units of GB/s, R_B is the number of bytes read per kernel, W_B is the number of bytes written per kernel, and t is the elapsed time given in seconds.
Example:
```c
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 20 * (1 << 20);
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  cudaEventRecord(start);
  saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y);
  cudaEventRecord(stop);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  cudaEventSynchronize(stop);
  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);

  printf("Effective Bandwidth (GB/s): %f\n", N*4*3/milliseconds/1e6);
}
```
In the bandwidth calculation, N*4 is the number of bytes transferred per array read or write, and the factor of three represents the reading of x and the reading and writing of y. The elapsed time is stored in the variable milliseconds to make units clear. Note that in addition to adding the functionality needed for the bandwidth calculation, we have also changed the array size and the thread-block size.
CUDA events use the GPU timer and therefore avoid the problems associated with host-device synchronization.
Original author: Mark Harris
Original article link
CUDA