CUDA Day 1
03 Jan 2026

One of my goals for 2026 is to be at least familiar enough with CUDA and GPU programming to expand my mental model of compute beyond current CPU architectures and memory hierarchies. While CPU programming is heavily focused on optimizing cache and memory locality (especially across NUMA regions) and managing a small number of threads efficiently, GPU programming requires a fundamentally different mindset. Instead of a few threads each doing a lot of work, it is inherently massively parallel, organizing thousands of threads into blocks and warps, and it requires understanding a distinct GPU memory hierarchy (global, shared, local, and constant memory).
Core Terminology
Before diving into code, it’s important to understand the fundamental concepts of CUDA:
- Host - The CPU and its memory (system RAM)
- Device - The GPU and its memory (VRAM)
- Kernel - A function that runs on the GPU, marked with __global__
- Thread - The smallest unit of execution. Each thread runs the kernel function independently
- Block - A logical group of threads that can cooperate and share memory. Threads in a block can synchronize and access shared memory
- Grid - A collection of blocks. All blocks in a grid execute the same kernel
- Warp - A group of 32 threads that execute together in lockstep. This is the fundamental execution unit on NVIDIA GPUs, and is defined in hardware. All threads in a warp execute the same instruction simultaneously (SIMT - Single Instruction, Multiple Thread). Warp divergence occurs when threads in a warp take different execution paths (e.g., different branches of an if statement), which can significantly hurt performance (see the sketch just after this list)
- Thread ID - threadIdx.x/y/z identifies a thread within its block
- Block ID - blockIdx.x/y/z identifies a block within the grid
- Block Dimension - blockDim.x/y/z gives the size of each dimension of the thread block
- Grid Dimension - gridDim.x/y/z gives the size of each dimension of the grid
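To make warp divergence a little more concrete, here is a minimal sketch (the kernel names and the specific branch are just illustrative, not from any library). When neighboring threads in the same warp take different sides of a branch, the hardware runs both paths one after the other with some lanes masked off; branching on a condition that is uniform across a warp avoids that cost.

#include <cuda_runtime.h>

// Illustrative only: both kernels compute the same thing, but the first
// branches on a per-thread condition, so odd and even lanes of each warp
// diverge and the two paths execute serially.
__global__ void divergentKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) {   // alternates within a warp -> divergence
            out[i] = i * 2;
        } else {
            out[i] = i * 3;
        }
    }
}

// Branching on a warp-uniform condition (all 32 lanes agree) does not diverge.
__global__ void uniformKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((threadIdx.x / warpSize) % 2 == 0) {  // same value for a whole warp
            out[i] = i * 2;
        } else {
            out[i] = i * 3;
        }
    }
}

Both kernels produce the same results; the difference only shows up in how the warp scheduler has to execute them.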
The hierarchy is: Grid → Blocks → Threads, with warps as the automatic hardware grouping of 32 consecutive threads within a block.
Visualizing this takes a while, but it is well worth the effort to internalize.
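One way I found to internalize it is to have every thread print where it sits in the hierarchy. This is just a throwaway sketch (the kernel name is made up); it launches a grid of 2 blocks with 64 threads each, so each block contains two warps:

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread reports its block, its position within the block, which warp of
// the block it belongs to, and its global index across the whole grid.
__global__ void whereAmI() {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int warpInBlock = threadIdx.x / warpSize;   // warpSize is 32 on NVIDIA GPUs
    printf("block %d / %d, thread %d / %d, warp %d, global %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x, warpInBlock, globalIdx);
}

int main() {
    whereAmI<<<2, 64>>>();      // grid of 2 blocks, 64 threads per block
    cudaDeviceSynchronize();    // kernel launches are asynchronous; wait for the output
    return 0;
}

The output makes the Grid → Block → Warp → Thread nesting visible directly.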
Writing a Kernel
A kernel is a function that runs on a thread, and is what the GPU can parallelize efficiently. The basic structure of a CUDA program involves:
- Host code - runs on the CPU
- Device code - runs on the GPU (kernels)
- Memory management - allocating and copying data between host and device
The biggest thing to understand is that you don’t write code that the GPU runs on its own - the host has to launch kernels and issue memory transfers, and the device executes the work it is handed. In that sense the model is inherently distributed, so many of the same concepts apply as when coordinating and optimizing code across separate devices.
Here’s a simple “Hello World” equivalent - a kernel that adds two arrays:
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void addArrays(int *a, int *b, int *c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    const int n = 1024;
    int *h_a, *h_b, *h_c; // host arrays
    int *d_a, *d_b, *d_c; // device arrays

    // Allocate host memory
    h_a = (int*)malloc(n * sizeof(int));
    h_b = (int*)malloc(n * sizeof(int));
    h_c = (int*)malloc(n * sizeof(int));

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }

    // Allocate device memory
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMalloc(&d_c, n * sizeof(int));

    // Copy data to device
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    addArrays<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Copy result back to host
    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Verify results
    bool correct = true;
    for (int i = 0; i < n; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            correct = false;
            break;
        }
    }
    printf("Result: %s\n", correct ? "PASSED" : "FAILED");

    // Cleanup
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;
}
Key Concepts
- __global__ - indicates a kernel function that can be called from host code
- blockIdx.x and threadIdx.x - built-in CUDA variables that identify the block and the thread within that block
- blockDim.x - number of threads per block
- <<<blocksPerGrid, threadsPerBlock>>> - kernel launch configuration syntax
- Memory management - cudaMalloc for device memory, cudaMemcpy for transfers between host and device (these calls return error codes - see the sketch below)
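One thing the example glosses over: every CUDA runtime call returns a cudaError_t, and kernel launches are asynchronous, so failures can surface later than where they were caused. A minimal checking pattern looks like this (the CHECK macro name is my own choice; the underlying calls are standard CUDA runtime API):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// CHECK is a hypothetical helper name; cudaError_t, cudaSuccess, and
// cudaGetErrorString are part of the CUDA runtime API.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

int main() {
    int *d_ptr = NULL;
    CHECK(cudaMalloc(&d_ptr, 1024 * sizeof(int)));   // fails loudly instead of silently
    CHECK(cudaMemset(d_ptr, 0, 1024 * sizeof(int)));
    CHECK(cudaFree(d_ptr));
    // After a kernel launch you would typically add:
    //   CHECK(cudaGetLastError());        // catches launch-configuration errors
    //   CHECK(cudaDeviceSynchronize());   // surfaces errors raised while the kernel ran
    return 0;
}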
The most interesting part is how the work parallelizes across thousands of threads: the launch configuration spawns one thread per element (plus a few extra that the bounds check filters out), each thread computes one element of the result array, and they all run concurrently on the GPU.
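That one-thread-per-element mapping only works because the launch was sized to the array. A common generalization (not used in the example above) is a grid-stride loop, where each thread processes every (gridDim.x * blockDim.x)-th element, so a fixed launch size can cover an array of any length. A sketch, written as a drop-in variant of the addArrays kernel above:

// Grid-stride variant of addArrays: each thread starts at its global index and
// then strides by the total number of threads in the grid, so the kernel is
// correct no matter how many blocks were launched.
__global__ void addArraysStrided(const int *a, const int *b, int *c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Launch with a size chosen for the GPU rather than for n, e.g.:
//   addArraysStrided<<<64, 256>>>(d_a, d_b, d_c, n);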
Next Steps
I’m planning to explore:
- GPU memory types and usage (such as coalescing)
- Understanding warp execution and divergence
- Performance profiling with nvprof or Nsight
- More complex algorithms that benefit from GPU parallelism
- cuTile
- and more!
This is just the beginning, but I’m excited to dive deeper into the world of parallel computing!