GPU Architecture

A graphics processing unit (GPU) is a massively parallel processor designed to apply the same computation to enormous quantities of data simultaneously. Where a central processing unit devotes most of its silicon to control logic and cache so that it can execute a single thread of instructions as quickly as possible, a GPU devotes most of its silicon to arithmetic units and runs thousands of threads at once. This fundamental difference in design philosophy makes the GPU exceptionally effective for workloads with abundant data parallelism, from rendering three-dimensional scenes to training neural networks and performing scientific simulation.

Understanding GPU architecture requires setting aside many intuitions formed from sequential processor design. The GPU does not try to make any single thread fast; instead it keeps a vast pool of threads in flight and switches among them to hide the latency of memory and arithmetic. Performance comes not from low latency on one task but from high throughput across many tasks. This article examines the execution model that makes this possible, the structure of the processing cores, the memory hierarchy that feeds them, the graphics and compute pipelines that GPUs serve, the programming models that expose the hardware, and the architectural trade-offs that distinguish GPUs from conventional processors.

Throughput-Oriented Design Philosophy

The defining goal of GPU architecture is throughput: completing the greatest total amount of work per unit time, even at the expense of the time taken by any individual operation. A central processing unit minimizes the latency of a single instruction stream through deep pipelines, out-of-order execution, branch prediction, and large caches. A GPU instead maximizes aggregate work by replicating simple arithmetic units thousands of times and keeping them busy with a deep pool of independent threads.

Latency Hiding Through Parallelism

When a thread issues a memory request that takes hundreds of cycles to satisfy, a latency-oriented processor stalls or speculates. A GPU instead sets that thread's group aside and immediately issues from another group of threads that is ready to run. With enough resident threads, the long latency of any one operation is largely overlapped with useful work from others, so the arithmetic units rarely sit idle.

Massive thread oversubscription: Many more threads are resident than can execute in a single cycle, providing a reservoir of ready work.
Zero-overhead context switching: The state of every resident thread stays in a large register file, so switching among thread groups requires no save or restore and effectively costs no cycles.
Occupancy: The ratio of resident threads to the hardware maximum indicates how much independent work is available to hide latency.
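
The latency-hiding argument above can be made concrete with a back-of-the-envelope model. The Python sketch below is not GPU code; it assumes a simplified pattern in which every warp alternates a fixed burst of compute with one memory request, and the function names and cycle counts are hypothetical values chosen for illustration.

```python
import math


def warps_needed(memory_latency_cycles: int, compute_cycles_per_request: int) -> int:
    """Warps required so that some warp is always ready to issue.

    Assumed model: each warp computes for `compute_cycles_per_request`
    cycles, then waits `memory_latency_cycles` for its load. While one
    warp waits, the other warps must supply enough compute to cover it.
    """
    return 1 + math.ceil(memory_latency_cycles / compute_cycles_per_request)


def utilization(resident_warps: int,
                memory_latency_cycles: int,
                compute_cycles_per_request: int) -> float:
    """Fraction of cycles in which the arithmetic units have work (capped at 1)."""
    period = compute_cycles_per_request + memory_latency_cycles
    return min(1.0, resident_warps * compute_cycles_per_request / period)


if __name__ == "__main__":
    # Hypothetical figures: 400-cycle memory latency, 20 compute cycles per load.
    print(warps_needed(400, 20))            # 21
    print(round(utilization(4, 400, 20), 3))  # 0.19
```

Under these assumed figures the model asks for 21 resident warps, and with only 4 the arithmetic units would be busy roughly 19 percent of the time. Real schedulers and memory systems are more complicated, so the numbers indicate the trend rather than predict measured performance.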

Silicon Allocation

The contrast in design philosophy is visible directly in the floor plan of the chip. A GPU spends comparatively little area on control and caching and a great deal on parallel arithmetic, which is why it delivers far higher peak arithmetic throughput than a CPU of comparable transistor count.

Arithmetic density: A large share of the die is devoted to arithmetic logic, organized into wide arrays of execution lanes.
Reduced control: A single instruction-fetch and decode unit is amortized across many lanes that execute in lockstep.
Smaller caches per thread: Caches exist chiefly to conserve bandwidth rather than to reduce per-thread latency.

The SIMT Execution Model

GPUs execute under a model usually called single-instruction, multiple-thread (SIMT). A group of threads executes a common instruction stream in lockstep, with each thread operating on its own data and its own registers. The model resembles classical single-instruction, multiple-data execution, but it presents the abstraction of independent scalar threads to the programmer, which generally makes code easier to write than explicit vector programming.

Warps and Wavefronts

The hardware groups threads into fixed-size bundles that are scheduled and executed together. One vendor calls this bundle a warp and fixes its size at thirty-two threads; another calls it a wavefront, typically of thirty-two or sixty-four threads depending on the architecture. In the classical design, every thread in the bundle shares one program counter and executes the same instruction in the same cycle, differing only in the data each thread touches.

Lockstep issue: All threads in a warp advance together through the instruction stream.
Per-thread registers: Each thread holds private register state, so the same instruction produces different results across the warp.
Scheduling unit: The warp, not the individual thread, is the fundamental unit the hardware schedules onto execution resources.
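
As a small illustration of how threads fall into warps, the Python sketch below maps a thread's index within its block to a warp number and a lane within that warp. It assumes a warp size of thirty-two and a simple linear assignment of consecutive threads to warps; the function names are hypothetical.

```python
WARP_SIZE = 32  # assumed bundle width; some architectures use 64


def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Return (warp index, lane index) for a thread within its block."""
    return divmod(thread_idx, WARP_SIZE)


def warps_in_block(threads_per_block: int) -> int:
    """Number of warps a block occupies; a partial warp still counts as one."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division


if __name__ == "__main__":
    print(warp_and_lane(70))    # (2, 6)
    print(warps_in_block(100))  # 4
```

A block of 100 threads therefore occupies four warps, and the last warp has only four useful lanes, which is one reason block sizes are usually chosen as multiples of the warp size.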

Branch Divergence

Because a warp shares a single program counter, conditional branches whose outcome differs among the threads of a warp create a problem. The hardware handles such divergence by executing each taken path in turn while masking off the threads that did not take that path, then reconverging. Divergence reduces efficiency because some lanes are idle during each branch path, and minimizing it is a central concern of GPU programming.

Predication and masking: Inactive threads are disabled by a per-lane execution mask rather than by separate control flow.
Serialized paths: Divergent branches execute their paths sequentially, so a warp split across two paths of similar length runs that region at roughly half efficiency.
Reconvergence: The hardware tracks a reconvergence point at which the divided threads resume lockstep execution.
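
The masking scheme described above can be imitated on the host. The following Python sketch is a simplified model, not hardware behavior in detail: it assumes a warp of thirty-two lanes and a single if/else, runs each path that at least one lane takes under an execution mask, and counts the serialized passes. All names are hypothetical.

```python
WARP_SIZE = 32  # assumed warp width


def run_divergent_branch(values, predicate, then_fn, else_fn):
    """Execute an if/else across one warp using per-lane execution masks.

    Returns (results, passes): the per-lane results and the number of
    serialized passes the warp needed (1 if uniform, 2 if divergent).
    """
    assert len(values) == WARP_SIZE
    taken = [predicate(v) for v in values]  # per-lane branch outcome
    results = list(values)
    passes = 0
    for mask, fn in ((taken, then_fn), ([not t for t in taken], else_fn)):
        if not any(mask):
            continue  # no lane took this path, so it is skipped entirely
        passes += 1
        for lane in range(WARP_SIZE):
            if mask[lane]:  # masked-off lanes sit idle during this pass
                results[lane] = fn(values[lane])
    return results, passes  # all lanes resume lockstep execution here


def lane_efficiency(passes: int) -> float:
    """Active lane-slots over total lane-slots, assuming equal-length paths."""
    return 1.0 / passes


if __name__ == "__main__":
    data = list(range(WARP_SIZE))
    _, divergent = run_divergent_branch(
        data, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 100)
    _, uniform = run_divergent_branch(
        data, lambda v: v >= 0, lambda v: v * 2, lambda v: v + 100)
    print(divergent, lane_efficiency(divergent))  # 2 0.5
    print(uniform, lane_efficiency(uniform))      # 1 1.0
```

Branching on the parity of the lane index splits every warp, while a condition that all lanes agree on costs nothing extra, which is why data layouts that keep a warp's threads on the same path are preferred.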

Streaming Multiprocessors

The GPU is built from an array of largely independent processing blocks, called streaming multiprocessors by one vendor and compute units by another. Each such block contains the execution lanes, register file, scheduling logic, and local memory needed to run many warps concurrently. A high-end GPU contains dozens of these blocks, and the largest designs more than a hundred; the architecture scales across product tiers chiefly by changing how many it includes.

Internal Organization

Within a streaming multiprocessor, the execution lanes are partitioned among one or more warp schedulers, each of which selects a ready warp every cycle and issues its next instruction to the lanes. Specialized units handle operations that the general lanes do not, and a large register file supplies operands.

Execution lanes: Arrays of arithmetic units perform integer and floating-point operations, one element per lane per instruction.
Warp schedulers: Multiple schedulers issue from independent warps each cycle to keep the lanes saturated.
Special function units: Dedicated hardware evaluates reciprocal, reciprocal square root, and transcendental functions such as sine, cosine, exponential, and logarithm.
Matrix engines: Modern designs add units, such as tensor cores, that compute small matrix multiply-accumulate operations for machine learning.
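
The operation a matrix engine performs can be written down in a few lines. The Python sketch below computes D = A x B + C on small tiles in plain arithmetic; the tile shapes and the function name are assumptions for illustration, and real units typically work on fixed tile sizes with reduced-precision inputs, which this sketch does not model.

```python
def matrix_multiply_accumulate(a, b, c):
    """Return D = A x B + C for small row-major matrices given as lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "A columns must match B rows"
    assert len(c) == rows and all(len(row) == cols for row in c)
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(matrix_multiply_accumulate(a, b, c))  # [[20, 23], [44, 51]]
```

A hardware matrix engine performs the whole tile operation as a single instruction issued by a warp, rather than as the many scalar multiply-adds shown here.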

The Register File and Occupancy

Each streaming multiprocessor contains an unusually large register file, because the registers of every resident thread must be held simultaneously to allow zero-overhead switching. The register file is therefore a primary resource that limits how many warps can reside at once. A kernel that demands many registers per thread reduces the number of resident warps and lowers occupancy, which in turn weakens the architecture's ability to hide latency.

Static partitioning: Registers are divided among resident threads when a block is launched and held for its lifetime.
Resource trade-off: Higher per-thread register use yields lower occupancy, so the programmer and compiler must balance per-thread performance against latency hiding.
Shared memory pressure: On-chip shared memory is likewise partitioned among resident blocks and constrains occupancy in the same way.
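
The way registers and shared memory bound occupancy can be sketched as a small calculator. The Python code below is a simplified model: the per-multiprocessor limits in `SMLimits` are hypothetical round numbers, not the specification of any particular GPU, and real hardware also rounds allocations to a granularity that is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-multiprocessor limits, chosen only for illustration."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536
    warp_size: int = 32


def occupancy(threads_per_block: int,
              regs_per_thread: int,
              smem_per_block: int,
              sm: SMLimits = SMLimits()) -> float:
    """Resident warps divided by the maximum, under the assumed limits."""
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * sm.warp_size
    # Each resource independently caps how many blocks can be resident.
    limits = [
        sm.max_blocks,
        sm.max_warps // warps_per_block,
        sm.registers // regs_per_block,
    ]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    resident_blocks = min(limits)  # the scarcest resource decides
    return resident_blocks * warps_per_block / sm.max_warps


if __name__ == "__main__":
    print(occupancy(256, 32, 0))      # 1.0
    print(occupancy(256, 64, 0))      # 0.5
    print(occupancy(256, 32, 32768))  # 0.25
```

Under these assumed limits, doubling register use from 32 to 64 per thread halves occupancy, and requesting 32 KiB of shared memory per block cuts it to a quarter even though registers are no longer the constraint.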

Memory Hierarchy

The GPU memory hierarchy is engineered for bandwidth rather than latency. Its broad memory interface, on-chip scratchpad, and coalescing logic exist to move data to the arithmetic units in bulk. Because the architecture hides latency through parallelism, the hierarchy concentrates on maximizing how many bytes per second can be delivered.

Levels of Memory

From fastest and smallest to slowest and largest, the hierarchy proceeds from per-thread registers through programmer-managed shared memory and hardware caches to the large but slower device memory. Each level trades capacity against speed in the familiar way, but the GPU exposes a level that conventional CPUs typically lack: an explicitly managed on-chip scratchpad.

Registers: Private to each thread and fastest, holding the working set of active computation.
Shared memory: An on-chip scratchpad, shared by the threads of a block and managed explicitly by the programmer for data reuse and exchange.
L1 and L2 caches: Hardware caches that reduce traffic to device memory; the L2 is shared across the whole chip.
Device memory: Large off-chip memory, built from high-bandwidth technologies such as GDDR or stacked high-bandwidth memory, offering enormous bandwidth at relatively high latency.

Memory Coalescing

One of the most important determinants of device-memory performance is whether the threads of a warp access adjacent addresses. When they do, the hardware coalesces their individual requests into a small number of wide transactions, using the memory bus efficiently. When the accesses are scattered, the same data requires many separate transactions and effective bandwidth collapses.

Coalesced access: Consecutive threads reading consecutive addresses combine into one or a few wide transactions.
Scattered access: Irregular addressing fragments a warp's request into many transactions, wasting bandwidth.
Bank conflicts: Shared memory is divided into banks, and simultaneous accesses by different threads to different addresses in the same bank are serialized.
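
Both effects can be counted for a given access pattern. The Python sketch below models one warp's request: it assumes 128-byte device-memory transactions and thirty-two shared-memory banks that are four bytes wide, figures that are common but should be treated as assumptions here, and it assumes that lanes reading the same word are served by a single broadcast.

```python
SEGMENT_BYTES = 128  # assumed size of one device-memory transaction
NUM_BANKS = 32       # assumed number of shared-memory banks
BANK_WIDTH = 4       # assumed bank width in bytes


def transactions(byte_addresses) -> int:
    """Distinct aligned segments touched by one warp's device-memory request."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


def bank_conflict_degree(byte_addresses) -> int:
    """Worst-case number of serialized shared-memory passes for one warp.

    Lanes reading the same word count once (assumed broadcast); lanes
    reading different words in the same bank must take turns.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    unit_stride = [4 * lane for lane in range(32)]     # consecutive floats
    large_stride = [128 * lane for lane in range(32)]  # one float per segment
    print(transactions(unit_stride), transactions(large_stride))  # 1 32
    print(bank_conflict_degree(unit_stride))                      # 1
    print(bank_conflict_degree([8 * lane for lane in range(32)])) # 2
    print(bank_conflict_degree(large_stride))                     # 32
```

In this model a unit-stride read needs one transaction and no serialization, while a stride of 128 bytes needs thirty-two transactions in device memory and, applied to shared memory, sends every lane to the same bank.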

The Graphics Pipeline

Although GPUs are now widely used for general computation, their architecture grew out of the demands of real-time rendering, and a substantial part of the chip remains dedicated to the graphics pipeline. This pipeline transforms a description of a three-dimensional scene into a two-dimensional image through a sequence of stages, some of which run on fixed-function hardware and some on the same programmable lanes used for compute.

Programmable Stages

The modern pipeline interleaves programmable shader stages with fixed-function blocks. Shaders are short programs that the application supplies, and they execute on the streaming multiprocessors using the same SIMT model as general computation.

Vertex shading: Transforms the positions and attributes of geometry vertices into clip space, from which fixed-function hardware maps them to screen space.
Geometry and tessellation: Optional stages that generate or refine geometry on the fly.
Fragment shading: Computes the color of each pixel-sized fragment; this is often the most arithmetically intensive stage.

Fixed-Function Hardware

Between the programmable stages, dedicated hardware performs operations that are universal to rendering and therefore worth implementing in fixed logic for efficiency. These blocks free the programmable lanes to concentrate on application-specific shading.

Rasterization: Converts geometric primitives into the fragments that cover them on screen.
Texture units: Fetch and filter image data, with hardware interpolation and caching tuned for spatial locality.
Raster operations: Perform depth testing and blending as fragments are written to the framebuffer.
Ray-tracing units: Recent architectures add fixed-function acceleration for traversing spatial data structures used in ray tracing.

The Compute Pipeline

General-purpose computation on the GPU bypasses the fixed-function graphics stages and exposes the streaming multiprocessors directly to the programmer. The application organizes its work as a grid of threads and dispatches it as a compute kernel, which the hardware distributes across the available multiprocessors. This compute path is what powers scientific computing, data analytics, and the training and inference of machine-learning models.

The Thread Hierarchy

Compute work is organized into a hierarchy that maps cleanly onto the hardware. Individual threads are grouped into blocks, blocks are arranged into a grid, and the runtime assigns blocks to streaming multiprocessors. Threads within a block can cooperate through shared memory and synchronize with one another, while threads in different blocks generally cannot.

Threads: The finest unit of execution, each with private registers and a unique index.
Blocks: Groups of threads that share on-chip memory and can synchronize; a block executes entirely on one multiprocessor.
Grid: The full collection of blocks that constitutes a single kernel launch.
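
The mapping from the hierarchy to data can be emulated sequentially. The Python sketch below is a host-side model, not a real launch API: the hypothetical `launch` function loops over the blocks of a one-dimensional grid and the threads of each block, and the kernel derives a global index from the block index, block size, and thread index, then guards against running past the end of the array.

```python
def blocks_needed(n: int, threads_per_block: int) -> int:
    """Smallest grid size whose threads cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block


def launch(kernel, grid_dim: int, block_dim: int, *args) -> None:
    """Sequentially emulate a 1-D kernel launch (real hardware runs these in parallel)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, factor, n):
    """Each thread scales at most one element, selected by its global index."""
    i = block_idx * block_dim + thread_idx  # unique index across the grid
    if i < n:  # threads past the end of the array do nothing
        data[i] *= factor


if __name__ == "__main__":
    values = list(range(1000))
    grid = blocks_needed(len(values), 256)
    launch(scale_kernel, grid, 256, values, 2, len(values))
    print(grid, values[:4], values[-1])  # 4 [0, 2, 4, 6] 1998
```

Covering 1000 elements with blocks of 256 threads takes four blocks, so 24 threads in the last block fail the bounds check and stay idle; the guard is what makes an array length that is not a multiple of the block size safe.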

Synchronization and Cooperation

Within a block, threads coordinate through barrier synchronization and shared memory, which together enable cooperative algorithms such as tiled matrix multiplication and parallel reductions. Atomic operations allow threads to update shared locations safely without explicit locks, supporting histogram construction and similar patterns.

Barriers: A block-wide barrier ensures all threads reach a point before any proceeds, making shared-memory exchange safe.
Atomic operations: Read-modify-write primitives serialize conflicting updates to a shared address.
Warp-level primitives: Shuffle operations exchange register values directly among the threads of a warp without touching memory.
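
A warp-level sum shows how shuffle operations replace memory traffic. The Python sketch below simulates the pattern on the host, assuming a warp of thirty-two lanes and a shuffle-down operation in which a lane whose source would fall outside the warp simply receives its own value; the function names are hypothetical and stand in for the hardware primitive.

```python
WARP_SIZE = 32  # assumed warp width


def shuffle_down(values, offset):
    """Each lane reads the register of the lane `offset` positions above it.

    Lanes whose source would lie outside the warp receive their own value.
    """
    n = len(values)
    return [values[lane + offset] if lane + offset < n else values[lane]
            for lane in range(n)]


def warp_reduce_sum(values):
    """Tree reduction across one warp in log2(WARP_SIZE) shuffle steps."""
    assert len(values) == WARP_SIZE
    regs = list(values)  # one register per lane
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shuffle_down(regs, offset)
        regs = [mine + theirs for mine, theirs in zip(regs, received)]
        offset //= 2
    return regs[0]  # only lane 0 is guaranteed to hold the full sum


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE))))  # 496
```

Five shuffle steps reduce thirty-two values without a barrier or any shared-memory access; a block-wide reduction would typically combine such per-warp results through shared memory and a barrier.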

Programming Models

The capabilities of GPU hardware reach applications through programming models that express data-parallel work and manage the movement of data between host and device. These models present the thread hierarchy and memory spaces to the programmer while hiding the precise details of warp scheduling and resource allocation.

CUDA

CUDA is NVIDIA's proprietary platform that extends C, C++, and other languages with constructs for launching kernels and managing GPU memory. It exposes the thread hierarchy, shared memory, and synchronization primitives directly, and it is accompanied by extensive libraries for linear algebra, signal processing, and deep learning. Its maturity and library ecosystem have made it the dominant model for high-performance and machine-learning workloads on the hardware that supports it.

Kernel launch: A special syntax specifies the grid and block dimensions for a function executed on the device.
Memory management: Explicit copies, or unified memory that the runtime migrates automatically, move data between host and device address spaces.
Libraries: Highly tuned libraries deliver much of the achievable performance without hand-written kernels.

OpenCL and Cross-Vendor Models

OpenCL is an open, royalty-free standard for parallel programming across heterogeneous processors, including GPUs from multiple vendors as well as CPUs and other accelerators. It provides portability across hardware at the cost of requiring more explicit management than CUDA. Other models, including compiler-directive approaches and graphics-oriented compute shaders, address different points on the spectrum between portability and control.

Portability: A single OpenCL program can target devices from different vendors.
Directive-based models: Compiler directives, as in OpenACC and OpenMP target offload, annotate existing code to offload loops to the GPU with minimal restructuring.
Graphics compute shaders: Graphics interfaces expose general computation through compute shaders integrated with rendering.

Throughput Versus Latency Trade-offs

The architectural choices that make a GPU powerful for data-parallel work also make it poorly suited to tasks dominated by sequential dependencies or unpredictable control flow. Appreciating where the GPU excels and where it does not is essential to deploying it effectively, and it explains why GPUs and CPUs coexist as complementary processors rather than one replacing the other.

Where Throughput Wins

The GPU approaches its peak performance only when work is abundant, regular, and independent. Problems that expose thousands of independent operations with predictable memory access map onto the hardware almost ideally.

Data parallelism: The same operation applied across large arrays keeps every lane productive.
Arithmetic intensity: A high ratio of computation to memory traffic lets the arithmetic units run near their peak.
Regular control flow: Uniform branching across a warp avoids the cost of divergence.
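
The role of arithmetic intensity is often summarized by the roofline model, in which attainable throughput is the lesser of the arithmetic peak and the product of intensity and memory bandwidth. The Python sketch below applies that formula; the peak and bandwidth figures are hypothetical round numbers, not the specification of any real device.

```python
def attainable_flops(intensity_flops_per_byte: float,
                     peak_flops: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity_flops_per_byte * bandwidth_bytes_per_s)


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity (FLOP per byte) above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


if __name__ == "__main__":
    PEAK = 10e12       # hypothetical: 10 TFLOP/s
    BANDWIDTH = 1e12   # hypothetical: 1 TB/s
    print(ridge_point(PEAK, BANDWIDTH))             # 10.0
    print(attainable_flops(0.25, PEAK, BANDWIDTH))  # 250000000000.0
    print(attainable_flops(50, PEAK, BANDWIDTH))    # 10000000000000.0
```

With these assumed figures a kernel performing 0.25 operations per byte is capped at 2.5 percent of peak by memory traffic alone, while one performing 50 operations per byte can in principle reach the arithmetic peak.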

Where Latency Matters

Tasks with long dependency chains, irregular memory access, or heavy branching fit the GPU poorly, because a single thread runs slowly and there may be too few independent threads to hide latency. Such tasks remain the province of the latency-optimized CPU.

Serial dependencies: Computations in which each step depends on the previous expose little parallelism.
Irregular access: Pointer-chasing and sparse data structures defeat memory coalescing.
Heterogeneous execution: Practical systems pair a CPU and GPU, assigning each the portion of the workload it runs best.

Summary

GPU architecture inverts the priorities of conventional processor design, trading the latency of any single operation for the throughput of many. It executes thousands of threads under the single-instruction, multiple-thread model, bundling them into warps that advance in lockstep and switching among them at no cost to hide the latency of memory and arithmetic. The chip is built from an array of streaming multiprocessors, each containing dense execution lanes, a large register file, and an explicitly managed scratchpad, and it scales across product tiers by changing how many such blocks it contains.

The memory hierarchy is tuned for bandwidth, rewarding coalesced access and on-chip data reuse while penalizing scattered or dependent access patterns. GPUs retain a fixed-function graphics pipeline alongside a flexible compute pipeline, and programming models such as CUDA and OpenCL expose the thread hierarchy and memory spaces that the hardware provides. The result is a processor that excels at regular, abundant, independent computation and complements rather than replaces the latency-optimized CPU. As workloads in graphics, simulation, and machine learning continue to grow, the throughput-oriented principles embodied in GPU architecture remain central to high-performance computing.