Peter Zhao wistful thoughts

CUDA Day 1

One of my goals for 2026 is to become familiar enough with CUDA and GPU programming to expand my mental model of compute beyond CPU architectures and their memory hierarchy. While CPU programming is heavily focused on optimizing cache and memory locality (especially across NUMA regions) and managing a small number of threads efficiently, GPU programming requires a fundamentally different mindset: instead of a few threads each doing a lot of work, it is inherently massively parallel, organizing thousands of threads into blocks and warps, and it comes with its own distinct memory hierarchy (global, shared, local, and constant memory).

Core Terminology

Before diving into code, it’s important to understand the fundamental concepts of CUDA:

  • Host - The CPU and its memory (system RAM)
  • Device - The GPU and its memory (VRAM)
  • Kernel - A function that runs on the GPU, marked with __global__
  • Thread - The smallest unit of execution. Each thread runs the kernel function independently
  • Block - A logical group of threads that can cooperate and share memory. Threads in a block can synchronize and access shared memory
  • Grid - A collection of blocks. All blocks in a grid execute the same kernel
  • Warp - A group of 32 threads that execute together in lockstep. This is the fundamental execution unit on NVIDIA GPUs, and is defined in hardware. All threads in a warp execute the same instruction simultaneously (SIMT - Single Instruction, Multiple Thread). Warp divergence occurs when threads in a warp take different execution paths (e.g., different branches of an if statement), which can significantly hurt performance (see the sketch just below this list)
  • Thread ID - threadIdx.x/y/z identifies a thread within its block
  • Block ID - blockIdx.x/y/z identifies a block within the grid
  • Block Dimension - blockDim.x/y/z gives the size of each dimension of the thread block
  • Grid Dimension - gridDim.x/y/z gives the size of each dimension of the grid

The hierarchy is: Grid → Blocks → Threads, with the hardware automatically grouping each block's threads into warps of 32.

Visualizing this takes a while, but it is well worth the effort to internalize.
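
To make the warp divergence point concrete, here is a minimal sketch (the kernel names and the even/odd split are mine, just for illustration). In the first kernel, even and odd threads sit next to each other within a warp and take different branches, so the warp has to execute both paths serially. In the second, the condition is uniform across each warp of 32, so each warp takes exactly one path:

__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads in the same warp take different branches,
    // so the warp executes both paths one after the other
    if (i % 2 == 0) {
        out[i] = i * 2;
    } else {
        out[i] = i * 3;
    }
}

__global__ void uniform(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // (i / 32) % 2 is identical for all 32 threads of a warp,
    // so every warp takes a single branch - no divergence
    if ((i / 32) % 2 == 0) {
        out[i] = i * 2;
    } else {
        out[i] = i * 3;
    }
}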

Writing a kernel

A kernel is a function that the GPU executes across many threads at once - each thread runs its own instance of the kernel, which is what makes it so efficient to parallelize. The basic structure of a CUDA program involves:

  1. Host code - runs on the CPU
  2. Device code - runs on the GPU (kernels)
  3. Memory management - allocating and copying data between host and device

The biggest thing to understand is that GPU code doesn't run on its own - the host has to allocate device memory, copy data over, and tell the GPU which kernels to execute. In that sense a CUDA program is inherently distributed, and many of the same concepts apply as when optimizing code across different devices.

Here’s a simple “Hello World” equivalent - a kernel that adds two arrays:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addArrays(int *a, int *b, int *c, int n) {
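    // Compute this thread's global index from its block and thread coordinates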
    int index = blockIdx.x * blockDim.x + threadIdx.x;
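    // Guard: the grid may contain more threads than array elements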
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}

int main() {
    const int n = 1024;
    int *h_a, *h_b, *h_c;  // host arrays
    int *d_a, *d_b, *d_c;  // device arrays
    
    // Allocate host memory
    h_a = (int*)malloc(n * sizeof(int));
    h_b = (int*)malloc(n * sizeof(int));
    h_c = (int*)malloc(n * sizeof(int));
    
    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    
    // Allocate device memory
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMalloc(&d_c, n * sizeof(int));
    
    // Copy data to device
    cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(int), cudaMemcpyHostToDevice);
    
    // Launch kernel
    int threadsPerBlock = 256;
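    // Round up (ceiling division) so every element gets a thread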
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    addArrays<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
    
    // Copy result back to host
    cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    
    // Verify results
    bool correct = true;
    for (int i = 0; i < n; i++) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            correct = false;
            break;
        }
    }
    printf("Result: %s\n", correct ? "PASSED" : "FAILED");
    
    // Cleanup
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    
    return 0;
}

Key Concepts

  • __global__ - marks a kernel function: it runs on the device but is launched from host code
  • blockIdx.x and threadIdx.x - built-in CUDA variables that identify the block within the grid and the thread within its block
  • blockDim.x - number of threads per block
  • <<<blocksPerGrid, threadsPerBlock>>> - kernel launch configuration syntax
  • Memory management - cudaMalloc for device memory, cudaMemcpy for transfers between host and device; like most CUDA runtime calls, both return error codes (see the sketch below)
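
One thing the example above glosses over is error handling: nearly every CUDA runtime call returns a cudaError_t, and a failed kernel launch is reported asynchronously rather than crashing on the spot. A minimal checking pattern might look like this (the CHECK macro is my own convenience wrapper, not part of CUDA):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Report and abort on any failed CUDA runtime call
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
//   CHECK(cudaMalloc(&d_a, n * sizeof(int)));
//   addArrays<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
//   CHECK(cudaGetLastError());       // catches invalid launch configurations
//   CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution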

The most interesting part is how naturally the kernel parallelizes across thousands of threads: each thread computes one element of the result array, and they all run concurrently on the GPU.
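
A natural question: what if the array has more elements than the total number of threads launched? A common answer is the grid-stride loop, where each thread processes multiple elements spaced one full grid apart. Here is a sketch (the kernel name is mine):

__global__ void addArraysStride(int *a, int *b, int *c, int n) {
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    // Each thread starts at its global index and hops a full grid
    // per iteration, so any grid size covers any n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

With this shape, the launch configuration can be tuned independently of the data size - addArraysStride<<<64, 256>>>(d_a, d_b, d_c, n) works for any n.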

Next Steps

I’m planning to explore:

  • GPU memory types and access patterns (such as coalescing)
  • Understanding warp execution and divergence
  • Performance profiling with nvprof or Nsight
  • More complex algorithms that benefit from GPU parallelism
  • cuTile
  • and more!

This is just the beginning, but I’m excited to dive deeper into the world of parallel computing!

Hello World

Welcome to my blog! It’s 2026 so I’m kicking this back up again. I’ll be writing up my thoughts and explorations in my journey to learn more about all sorts of interesting topics.

Here’s a code snippet to celebrate! It’s written in Python 3, and if you want to see what it does, just copy it into your favorite Python interpreter!

import math

def generate_pattern(size):
    pattern = [[' ' for _ in range(size)] for _ in range(size)]
    center = size // 2
    chars = [' ', '.', ':', '*', '#']

    for y in range(size):
        for x in range(size):
            dx = x - center
            dy = y - center
            r = math.hypot(dx, dy)
            angle = math.atan2(dy, dx)

            spiral = math.sin(r * 0.7 + angle * 5)
            value = int((spiral + 1) / 2 * (len(chars) - 1))
            pattern[y][x] = chars[value]

    return '\n'.join(''.join(row) for row in pattern)

print(generate_pattern(67))

Enjoy, and welcome!