In this tutorial, we'll explore how to leverage CUDA libraries to simplify and accelerate your GPU tasks. We will discuss several popular libraries, such as Thrust, cuBLAS, cuDNN, and cuFFT, and demonstrate their usage through examples. Additionally, we will touch upon interoperability with other languages and libraries.
Thrust is a high-level, parallel algorithms library that resembles the C++ Standard Template Library (STL). Thrust provides a rich collection of data parallel primitives and containers, enabling developers to write high-performance CUDA code with less effort.
The following program demonstrates a parallel reduction (an algorithm that combines the elements of an array into a single value, e.g., the sum of an array) with Thrust:
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
int main() {
    const int N = 1024;

    // Initialize host data: N elements, all set to 1
    thrust::host_vector<int> h_data(N, 1);

    // Copy data from host to device
    thrust::device_vector<int> d_data = h_data;

    // Perform parallel reduction (sum) on the device
    int sum = thrust::reduce(d_data.begin(), d_data.end());

    // Print the result
    std::cout << "Sum: " << sum << std::endl;

    return 0;
}
cuBLAS is a GPU-accelerated version of the BLAS (Basic Linear Algebra Subprograms) library. It provides a wide range of linear algebra functions, such as vector and matrix operations, that are highly optimized for NVIDIA GPUs.
The following example performs matrix multiplication using cuBLAS:
#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>
int main() {
    const int N = 1024;
    size_t size = N * N * sizeof(float);

    // Initialize host data
    // ...

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    // Copy data from host to device
    // ...

    // Create cuBLAS handle
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Perform C = alpha * A * B + beta * C
    // Note: cuBLAS assumes column-major storage; here each matrix is
    // N x N with leading dimension N
    const float alpha = 1.0f;
    const float beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);

    // Copy the result from device to host
    // ...

    // Clean up
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return 0;
}