Understanding the GPU
Utilizing the GPU
A GPU (Graphics Processing Unit) contains thousands of smaller, efficient cores designed to handle multiple tasks simultaneously—ideal for rendering graphics. Each core computes a small portion of the image to be displayed on the screen, enabling a high level of parallelism.
But how can we use GPUs for non-graphics tasks? Programmers originally wrote small programs called shaders to determine things like pixel position and color. These shaders run in parallel across many data points on the GPU.
Eventually, developers began repurposing shaders to perform general-purpose parallel computations. This led to technologies like CUDA (for NVIDIA GPUs), where NVIDIA provided an API to allow developers to access GPU power for broader, non-graphics purposes like machine learning, scientific computing, and data analysis.
CUDA
In CUDA, the Streaming Multiprocessor (SM) is the fundamental unit of computation. Each SM contains several Streaming Processors (SPs), also known as CUDA cores. These cores execute instructions for individual threads.
Threads are organized into blocks, which typically contain 128, 256, or 512 threads (up to a maximum of 1024 threads per block). Each block is further divided into warps—groups of 32 threads—which are the smallest unit of execution. A warp is scheduled on an SM, and all threads in a warp execute the same instruction simultaneously, following the SIMD (Single Instruction, Multiple Data) model.
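To make this hierarchy concrete, here is a minimal sketch of a complete CUDA program (the kernel name vecAdd, the block size of 256, and the problem size are illustrative choices, not requirements). A block of 256 threads is carved by the hardware into 8 warps of 32 threads each.

#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles one element: a general-purpose, non-graphics task.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;               // one million elements
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);        // managed memory keeps the host code short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;                                 // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);         // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}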
SIMD
Warps in CUDA follow the SIMT (Single Instruction, Multiple Threads) execution model. This means that all 32 threads in a warp begin execution at the same program counter and attempt to execute instructions in lockstep—that is, they all execute the same instruction at the same time, as long as possible.
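A consequence of lockstep execution is branch divergence: when threads within the same warp take different branches, the hardware runs each path in turn with the non-participating lanes masked off, and the warp reconverges afterwards. A minimal sketch (the kernel name is hypothetical):

// Even and odd threads of each warp take different branches.
__global__ void divergent(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        out[i] = i * 2;   // executed first, with the odd lanes masked off
    } else {
        out[i] = i * 3;   // executed second, with the even lanes masked off
    }
    // After the branch the warp reconverges and continues in lockstep.
}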
When a block (e.g., containing 50 threads) is assigned to a Streaming Multiprocessor (SM), it is split into warps of 32 threads each. Each warp is then scheduled onto the SM's execution units. The SM tries to execute the same instruction across all 32 threads in the warp during each clock cycle.
If the block contains a number of threads that isn't a multiple of 32, the last warp will contain fewer active threads. For example, with 50 threads in a block, the first warp will have 32 active threads, and the second warp will contain only 18 active threads—the remaining 14 threads in that warp are considered inactive and do not perform any work.
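One way to observe this from inside a kernel is to count the active lanes of each warp with the __activemask() and __popc() intrinsics. A minimal sketch (the kernel name warpInfo is illustrative, and in-kernel printf is used only for demonstration):

#include <cstdio>

__global__ void warpInfo() {
    int lane = threadIdx.x % 32;          // position within the warp
    int warp = threadIdx.x / 32;          // warp index within the block
    unsigned mask = __activemask();       // one bit per lane that is currently active
    if (lane == 0) {
        printf("warp %d: %d active lanes\n", warp, __popc(mask));
    }
}

// Launched as warpInfo<<<1, 50>>>(), warp 0 reports 32 active lanes and
// warp 1 reports 18; the remaining 14 lanes of warp 1 perform no work.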
Active Threads
Active threads are those currently loaded and resident on a Streaming Multiprocessor (SM), meaning they are ready and eligible for execution. In CUDA, threads are organized into warps of 32 threads each. A warp is considered resident if its context—such as registers and potentially shared memory—is fully allocated within the SM.
Resident warps may either be actively executing instructions or temporarily stalled (for example, waiting on memory access). Regardless, they remain loaded on the SM and can be quickly scheduled by the GPU as soon as resources are available, enabling high throughput through fast context switching.
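The limits on how many warps can be resident at once are hardware properties that can be queried at runtime through the CUDA runtime API; a small sketch (device 0 is assumed):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("SMs:                          %d\n", prop.multiProcessorCount);
    printf("Max resident threads per SM:  %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident warps per SM:    %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    printf("32-bit registers per SM:      %d\n", prop.regsPerMultiprocessor);
    return 0;
}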
Context switching
A Streaming Multiprocessor can have multiple warps in flight concurrently. This hides memory latency and improves throughput: while some warps wait on memory operations, the SM switches to other warps that are ready to execute.
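The runtime can also estimate, for a specific kernel, how many blocks (and therefore warps) fit on one SM at the same time, based on that kernel's register and shared-memory usage. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor, with a deliberately trivial kernel:

#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel so the occupancy query has something to inspect.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;
    // How many 256-thread blocks of this kernel can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0);
    printf("Resident blocks per SM: %d (%d warps)\n",
           blocksPerSM, blocksPerSM * blockSize / 32);
    return 0;
}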
The register file is shared across all warps resident on an SM, and each warp gets its own slice of it for thread-private data. On recent architectures a single thread can use at most 255 registers; the exact limits, and the size of the register file itself, depend on the GPU architecture (such as Volta, Turing, or Ampere).
Because every resident warp keeps its own slice of the register file, switching between warps does not require saving and restoring state to memory: the scheduler simply starts issuing instructions from a different warp whose state already sits next to the compute cores. This is why context switching on a GPU is so fast. Global memory (the GPU's RAM) is only involved when the data itself lives there, such as large data structures that exceed register capacity.
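Register usage per thread is fixed at compile time, and it directly limits how many warps can be resident; it can be inspected and constrained. A hypothetical sketch (the kernel name and the numbers are illustrative):

// __launch_bounds__ tells the compiler the largest block size this kernel will be
// launched with (256) and a desired minimum number of resident blocks per SM (4),
// letting it trade registers per thread for more resident warps.
__global__ void __launch_bounds__(256, 4) heavyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * data[i] + 1.0f;
    }
}

// Compiling with "nvcc -Xptxas -v kernel.cu" prints the number of registers used
// per thread; "--maxrregcount=N" caps register usage for the whole file.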