In this tutorial, we'll explore how to leverage CUDA libraries to simplify and accelerate your GPU tasks. We will discuss several popular libraries, such as Thrust, cuBLAS, cuDNN, and cuFFT, and demonstrate their usage through examples. Additionally, we will touch upon interoperability with other languages and libraries.

1. Prerequisites

To follow along, you will need a CUDA-capable NVIDIA GPU, a recent CUDA Toolkit installation (which bundles Thrust, cuBLAS, and cuFFT; cuDNN is distributed separately), and a working knowledge of C++ and basic CUDA concepts such as kernels and device memory.

2. Thrust: High-Level CUDA C++ Template Library

Thrust is a high-level, parallel algorithms library that resembles the C++ Standard Template Library (STL). Thrust provides a rich collection of data parallel primitives and containers, enabling developers to write high-performance CUDA code with less effort.

The following program demonstrates a parallel reduction (an algorithm that combines the elements of an array into a single value, e.g. the sum of an array) with Thrust:

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    const int N = 1024;

    // Initialize host data
    thrust::host_vector<int> h_data(N, 1);

    // Copy data from host to device
    thrust::device_vector<int> d_data = h_data;

    // Perform parallel reduction on the device
    int sum = thrust::reduce(d_data.begin(), d_data.end());

    // Print the result
    std::cout << "Sum: " << sum << std::endl;

    return 0;
}
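
Reduction is only one of Thrust's data-parallel primitives. As a further sketch (assuming the same CUDA Toolkit setup), the short program below uses thrust::sequence to fill a device vector and thrust::transform with the built-in thrust::multiplies functor to square each element in place:

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    // Fill a device vector with 0, 1, 2, ..., 7
    thrust::device_vector<int> d_data(8);
    thrust::sequence(d_data.begin(), d_data.end());

    // Square each element in place: d_data[i] = d_data[i] * d_data[i]
    thrust::transform(d_data.begin(), d_data.end(), d_data.begin(),
                      d_data.begin(), thrust::multiplies<int>());

    // Print the result: 0 1 4 9 16 25 36 49
    for (int i = 0; i < 8; ++i)
        std::cout << d_data[i] << " ";
    std::cout << std::endl;

    return 0;
}

Since Thrust is a header-only library that ships with the CUDA Toolkit, both programs compile with nvcc alone, e.g. nvcc reduce.cu -o reduce (the file name is illustrative).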

3. cuBLAS: CUDA Basic Linear Algebra Subroutines

cuBLAS is NVIDIA's GPU-accelerated implementation of the BLAS (Basic Linear Algebra Subprograms) API. It provides a wide range of linear algebra routines, such as vector and matrix operations, that are highly optimized for NVIDIA GPUs.

The following program performs a single-precision matrix multiplication (SGEMM) using cuBLAS:

#include <iostream>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1024;
    size_t size = N * N * sizeof(float);

    // Initialize host data (A and B filled with 1.0f as a placeholder)
    std::vector<float> h_A(N * N, 1.0f);
    std::vector<float> h_B(N * N, 1.0f);
    std::vector<float> h_C(N * N);

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice);

    // Create cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Perform C = alpha * A * B + beta * C (cuBLAS assumes column-major storage)
    const float alpha = 1.0f;
    const float beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);

    // Copy the result from device to host
    cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost);

    // Clean up
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}
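
A few notes on this sketch: cuBLAS follows the original Fortran BLAS convention and interprets matrices as column-major, so for non-square or transposed inputs the op flags and leading dimensions (the N arguments following each matrix pointer) must be chosen accordingly. Link against the library when compiling, e.g. nvcc gemm.cu -lcublas -o gemm (the file name is illustrative). Production code should also check the cublasStatus_t and cudaError_t return values, which this example omits for brevity.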