In this tutorial, we'll explore advanced CUDA programming concepts that help you identify and resolve performance bottlenecks. After all, the main reason to write custom kernels is to get more performance out of the same hardware.

1. Atomic operations

Atomic operations perform read-modify-write operations on shared or global memory in a thread-safe manner: the read, modify, and write steps execute as a single indivisible operation, so no other thread can interleave an access to that memory location. Atomic operations are crucial for implementing parallel algorithms that involve concurrent updates to shared data structures.

Here's an example of a CUDA kernel that calculates a histogram using atomic operations:

#include <iostream>
#include <cuda_runtime.h>

__global__ void histogramKernel(const int *data, int *hist, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < N) {
        // Each input value selects a bin; atomicAdd makes the
        // concurrent increments from different threads safe.
        int bin = data[idx];
        atomicAdd(&hist[bin], 1);
    }
}

This kernel builds the histogram by atomically incrementing the appropriate bin in the histogram array with atomicAdd().
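For completeness, the kernel could be driven from the host roughly as follows. This is a sketch: the bin count NUM_BINS, the input size, and the synthetic input data are assumptions for illustration, and the kernel does no bounds checking, so every input value must lie in [0, NUM_BINS). Note that the histogram buffer must be zeroed before the kernel runs.

```cuda
#include <iostream>
#include <cuda_runtime.h>

// Kernel from above, repeated so this sketch is self-contained.
__global__ void histogramKernel(const int *data, int *hist, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        atomicAdd(&hist[data[idx]], 1);
    }
}

int main() {
    const int N = 1 << 20;     // example input size (assumption)
    const int NUM_BINS = 256;  // assumed bin count; inputs must be in [0, NUM_BINS)

    // Host input: synthetic values guaranteed to fall in range.
    int *h_data = new int[N];
    for (int i = 0; i < N; ++i) h_data[i] = i % NUM_BINS;
    int h_hist[NUM_BINS];

    // Device allocations; the histogram must start at zero.
    int *d_data, *d_hist;
    cudaMalloc(&d_data, N * sizeof(int));
    cudaMalloc(&d_hist, NUM_BINS * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(int));

    // Launch enough blocks to cover all N elements.
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    histogramKernel<<<blocks, threads>>>(d_data, d_hist, N);

    cudaMemcpy(h_hist, d_hist, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "bin 0 count: " << h_hist[0] << std::endl;

    cudaFree(d_data);
    cudaFree(d_hist);
    delete[] h_data;
    return 0;
}
```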

There are a couple of things to be aware of when using atomic operations. As mentioned before, shared memory is much faster than global memory, so atomic operations on shared memory tend to complete faster than atomic operations on global memory. While atomics are unavoidable in some algorithms, it is important to minimize their use where possible, especially on global memory.
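One common way to apply this is privatization: each block accumulates into its own shared-memory copy of the histogram, then merges that copy into global memory once at the end. Global-memory atomics drop from one per input element to one per bin per block. The sketch below assumes a fixed bin count of 256 (so the private histogram fits comfortably in shared memory) and, as before, that all input values lie in range.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256  // assumed bin count; must fit in shared memory

__global__ void histogramSharedKernel(const int *data, int *hist, int N) {
    __shared__ int localHist[NUM_BINS];

    // Cooperatively zero the block-private histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    // Grid-stride loop: all contention happens on fast shared-memory atomics.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < N;
         idx += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[idx]], 1);
    __syncthreads();

    // Merge the private histogram into the global one: only
    // NUM_BINS global atomics per block, regardless of N.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&hist[i], localHist[i]);
}
```

The grid-stride loop lets you launch a modest, fixed number of blocks, which also bounds the total number of global atomics in the merge step.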

Also beware of serialization. If two threads perform an atomic operation on the same memory address at the same time, those operations are serialized. The order in which they complete is undefined, which is usually fine, but the serialization itself can be quite costly.

2. Streams

CUDA streams allow you to concurrently execute multiple kernels and data transfers on the same GPU. Streams can help improve GPU utilization and reduce overall execution time by overlapping computation and data transfers.

The following example demonstrates how to use CUDA streams to overlap data transfers and kernel execution: