In the previous tutorial, we wrote our first CUDA kernel and executed it on an NVIDIA GPU. In this tutorial, we'll dive deeper into CUDA programming concepts, including shared memory, thread synchronization, and memory access patterns.

1. Prerequisites

This tutorial assumes you have completed the previous tutorial (writing and running a first CUDA kernel) and that you have access to a CUDA-capable NVIDIA GPU with the CUDA Toolkit installed.

2. Shared Memory

Shared memory is high-speed, on-chip memory that can be accessed by all threads within a block. It enables efficient communication and data sharing between threads. However, shared memory is limited in size (typically tens of kilobytes per block) and must be managed explicitly by the programmer.

To see how shared memory is managed in practice, consider the following tiled matrix multiplication kernel, which stages sub-blocks of the input matrices in shared memory:

#include <iostream>
#include <cuda_runtime.h>

#define TILE_WIDTH 16

__global__ void matrixMulKernelShared(const float *A, const float *B, float *C, int N) {
    // Per-block tiles of A and B staged in fast on-chip shared memory.
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Each thread computes one element C[row][col].
    int row = by * blockDim.y + ty;
    int col = bx * blockDim.x + tx;

    float value = 0.0f;

    // Walk across the tiles of A and B that contribute to this output element.
    for (int t = 0; t < (N + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
        // Cooperatively load one tile of A and one tile of B,
        // zero-padding elements that fall outside the matrices.
        if (row < N && t * TILE_WIDTH + tx < N)
            ds_A[ty][tx] = A[row * N + t * TILE_WIDTH + tx];
        else
            ds_A[ty][tx] = 0.0f;

        if (t * TILE_WIDTH + ty < N && col < N)
            ds_B[ty][tx] = B[(t * TILE_WIDTH + ty) * N + col];
        else
            ds_B[ty][tx] = 0.0f;

        // Wait until every thread in the block has finished loading.
        __syncthreads();

        // Accumulate the partial dot product using the tiles in shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            value += ds_A[ty][k] * ds_B[k][tx];

        // Ensure all threads are done reading the tiles before they are
        // overwritten in the next iteration.
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = value;
}

This kernel divides the input matrices into smaller tiles and stages each tile in shared memory, so every element fetched from global memory is reused TILE_WIDTH times instead of being re-read. The __syncthreads() function synchronizes threads within a block: the first call ensures that all threads have finished loading data into shared memory before any thread starts computing, and the second ensures that all threads have finished computing before the tiles are overwritten in the next iteration.
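For the tile indexing above to line up, the kernel must be launched with a block of TILE_WIDTH x TILE_WIDTH threads. As a minimal host-side sketch (the function and variable names here are illustrative, and error checking is omitted for brevity), the launch could look like this:

// Minimal launch sketch. Assumes d_A, d_B and d_C already point to
// N*N floats in device memory (e.g. allocated with cudaMalloc and
// filled with cudaMemcpy).
void launchMatrixMul(const float *d_A, const float *d_B, float *d_C, int N) {
    // Block dimensions must equal TILE_WIDTH for the shared-memory
    // indexing in matrixMulKernelShared to be correct.
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((N + TILE_WIDTH - 1) / TILE_WIDTH,
              (N + TILE_WIDTH - 1) / TILE_WIDTH);

    matrixMulKernelShared<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();  // wait for the kernel to finish
}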

3. Coalesced Memory Access

Coalesced memory access is an access pattern in which consecutive threads in a warp access consecutive memory locations. When this happens, the hardware can combine the individual accesses into a small number of wide memory transactions, which greatly improves memory bandwidth utilisation.
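To make the contrast concrete before the full example below, here is a minimal sketch (the kernel names are ours, chosen for illustration) of a coalesced copy next to a strided one:

// Coalesced: thread i touches element i, so one warp reads a
// contiguous span of memory that maps to few transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread jumps by `stride` elements, so one warp's
// accesses are scattered across many cache lines and the hardware
// must issue far more memory transactions for the same useful data.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}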

Here's a complete vector addition kernel that demonstrates coalesced memory access: