Electronics Guide

Memory Hierarchies for Accelerators

Memory system design is one of the most critical aspects of hardware accelerator architecture. While accelerators can provide enormous computational throughput, their performance is frequently limited by the rate at which data can be supplied to processing elements. Unlike general-purpose CPUs that rely heavily on cache hierarchies optimized for irregular access patterns, accelerators employ specialized memory architectures tailored to their specific workloads, including scratchpad memories, stream buffers, and high-bandwidth memory technologies.

The fundamental challenge in accelerator memory design is bridging the gap between massive parallel computation and limited memory bandwidth. A modern GPU, for example, may contain thousands of processing cores capable of performing trillions of operations per second, yet even with advanced technologies like HBM, memory bandwidth typically remains the limiting factor on achievable throughput. Understanding and optimizing accelerator memory hierarchies is essential for extracting maximum performance from these specialized computing devices.

Scratchpad Memories

Scratchpad memories, also known as software-managed local memories or tightly coupled memories, are a fundamental building block of accelerator architectures. Unlike hardware-managed caches where placement and eviction decisions are made automatically, scratchpad memories give software explicit control over data movement and storage. This approach eliminates cache overhead and provides deterministic access latencies, making scratchpads ideal for accelerator workloads with predictable access patterns.

Architecture and Organization

A scratchpad memory is typically implemented as a fast, single-cycle SRAM array directly connected to processing elements. Key architectural characteristics include:

  • Explicit addressing: Software specifies exact addresses for all data placements and accesses, requiring explicit data movement instructions to transfer data between scratchpad and main memory
  • Banking for parallelism: Scratchpads are often divided into multiple banks that can be accessed simultaneously, supporting parallel read and write operations from multiple processing elements
  • Configurable allocation: Some architectures allow software to partition scratchpad space dynamically among different data structures or processing stages
  • Low latency: Direct connection to processing elements provides single-cycle or few-cycle access times, compared to multi-cycle cache lookup and tag comparison
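
To make the banking point concrete, consider one common (though not universal) mapping: with 32 banks and 4-byte words, a byte address maps to bank (address / 4) mod 32. Thirty-two processing elements reading 32 consecutive words then hit 32 different banks and proceed in parallel, while the same elements reading addresses 128 bytes apart all collide in a single bank and serialize.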

Advantages Over Caches

Scratchpad memories offer several benefits for accelerator applications:

Deterministic timing: Since there are no cache misses or replacement decisions, access latency is completely predictable. This is crucial for real-time systems and simplifies performance analysis and optimization.

Energy efficiency: Scratchpads eliminate the tag storage, comparison logic, and replacement policy hardware required by caches. For accelerators where energy efficiency is paramount, this overhead reduction is significant.

No conflict misses: With explicit software management, there are no conflict misses where useful data is evicted due to address mapping collisions. Software can guarantee that required data remains resident.

Bandwidth optimization: Software can orchestrate data movement to maximize memory bandwidth utilization, overlap computation with data transfer, and minimize redundant accesses.

Programming Considerations

Effective use of scratchpad memories requires careful software design:

  • Tiling and blocking: Large data sets must be decomposed into tiles that fit within scratchpad capacity, with computation structured to maximize data reuse within each tile
  • Double buffering: Using two scratchpad regions alternately allows computation on one buffer while loading data into the other, hiding memory latency
  • Bank conflict avoidance: Access patterns must be designed to avoid multiple simultaneous requests to the same memory bank
  • Address calculation: Software must compute explicit addresses for all data accesses, adding complexity compared to cache-based systems
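
The tiling idea above is easiest to see in code. The following is a minimal CUDA sketch of staging data through GPU shared memory (the scratchpad exposed to software, as discussed in the examples below); the kernel name and tile size are illustrative, and it assumes square matrices whose dimension N is a multiple of the tile size with a grid that exactly covers the output. Double buffering would extend it by alternating between two pairs of tiles so that the next tile's loads overlap the current tile's computation.

    #define TILE 16

    // Each thread block computes one TILE x TILE tile of C = A * B, staging
    // TILE x TILE tiles of A and B through shared memory so that every element
    // loaded from global memory is reused TILE times.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // March over the shared dimension one tile at a time (blocking).
        for (int t = 0; t < N / TILE; ++t) {
            // Each thread loads one element of each tile; neighbouring threads
            // load neighbouring addresses, so the global loads coalesce.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                 // wait until both tiles are resident

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                 // don't overwrite tiles still in use
        }
        C[row * N + col] = acc;
    }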

Examples in Modern Accelerators

Scratchpad memories appear throughout modern accelerator architectures:

GPU shared memory: NVIDIA CUDA and AMD ROCm expose per-block shared memory as a software-managed scratchpad, enabling threads within a block to share data with low latency.

DSP local memory: Digital signal processors typically include tightly coupled data and program memories for deterministic real-time processing.

Neural network accelerators: Google's TPU and similar AI accelerators use large scratchpad memories to store weight matrices and activation values during inference computations.

Stream Buffers

Stream buffers are specialized memory structures designed to efficiently handle sequential or strided access patterns common in many accelerator workloads. Rather than caching recently accessed data for potential reuse, stream buffers prefetch data ahead of consumption, maintaining a steady flow of data to processing elements.

Stream Buffer Architecture

A typical stream buffer implementation includes:

  • FIFO queues: Data flows through first-in-first-out queues that buffer data between memory and processing elements
  • Address generators: Hardware units that generate addresses for sequential or strided access patterns, reducing control overhead
  • Prefetch logic: Circuitry that initiates memory requests ahead of consumption to maintain buffer occupancy
  • Flow control: Mechanisms to handle producer-consumer synchronization and prevent buffer overflow or underflow
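
As a rough software model of how such a unit behaves (the struct and field names here are illustrative, not a description of any specific product), an address generator reduces to a base address, a stride, and a count:

    #include <cstddef>

    // Minimal model of a stream-buffer address generator: each call to next()
    // emits the address the prefetch logic would request, walking the stream
    // at a fixed byte stride until the element count is exhausted.
    struct StrideAddressGenerator {
        std::size_t base;        // starting byte address of the stream
        std::size_t stride;      // byte distance between consecutive elements
        std::size_t count;       // total number of elements in the stream
        std::size_t issued = 0;  // how many addresses have been generated so far

        bool next(std::size_t& addr) {
            if (issued >= count) return false;   // stream exhausted
            addr = base + issued * stride;
            ++issued;
            return true;
        }
    };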

Access Pattern Support

Stream buffers are optimized for specific access patterns:

Sequential streams: The simplest case where consecutive addresses are accessed in order. Stream buffers excel at hiding memory latency for purely sequential accesses.

Strided access: Many algorithms access data with a constant stride, such as accessing every Nth element of an array or traversing a column of a row-major matrix. Stream buffers can be configured with stride parameters to prefetch non-contiguous data efficiently.

Indirect access: Some stream buffers support indexed or indirect access patterns where addresses are determined by a separate index stream, though this is more complex to implement efficiently.

Applications

Stream buffers are particularly effective for:

  • Image and video processing: Pixel data is typically processed in raster order, making streaming access patterns natural
  • Signal processing: Filter operations process input samples sequentially, producing output streams
  • Matrix operations: Dense matrix computations can be structured as streaming operations over rows or columns
  • Data compression: Compression and decompression algorithms often process data sequentially

Prefetch Engines

Prefetch engines are hardware units that anticipate future memory accesses and initiate data transfers before the data is actually needed. By bringing data into faster memory levels ahead of demand, prefetching can significantly reduce effective memory latency and improve accelerator throughput.

Prefetch Strategies

Different prefetch strategies suit different access patterns:

Sequential prefetching: The simplest form, where an access to block A triggers prefetches of blocks A+1, A+2, and so on. Effective for streaming workloads but wastes bandwidth on non-sequential patterns.

Stride prefetching: Detects regular strides in access patterns and prefetches accordingly. Hardware monitors recent accesses to detect stride patterns dynamically, or software can provide stride hints.

Correlation-based prefetching: More sophisticated approaches that learn correlations between memory accesses, prefetching based on observed access sequences. These require significant hardware for tracking and prediction.

Software-directed prefetching: The compiler or programmer inserts explicit prefetch instructions at appropriate points in the code. This approach leverages software knowledge of access patterns but requires additional programming effort.
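
A simple form of software-directed prefetching that needs no special instructions at all is to load the next iteration's data into a register ahead of its use, as in the following CUDA sketch (a grid-stride loop with illustrative names; the technique, not this particular kernel, is the point):

    // The element needed by the next iteration is fetched one iteration early,
    // so its load latency overlaps with the arithmetic on the current element.
    __global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                          float a, int n) {
        int stride = gridDim.x * blockDim.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float cur = in[i];                                        // first (demand) load
        for (int j = i; j < n; j += stride) {
            float nxt = (j + stride < n) ? in[j + stride] : 0.0f; // prefetch next element
            out[j] = a * cur;                                     // compute on current element
            cur = nxt;
        }
    }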

Prefetch Distance and Degree

Two key parameters govern prefetch behavior:

Prefetch distance: How far ahead of the current access point prefetches are initiated. The distance must be sufficient to hide memory latency but not so large that prefetched data is evicted before use.

Prefetch degree: How many cache lines or data blocks are prefetched on each trigger. Higher degrees increase bandwidth utilization but risk cache pollution if predictions are incorrect.
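
A rough sizing rule (a simplified model that ignores bandwidth limits and uses hypothetical numbers): the prefetch distance should be at least the memory latency divided by the time spent per loop iteration. For example, with roughly 400 ns of memory latency and about 50 ns of work per iteration, data should be requested at least 400 / 50 = 8 iterations ahead of its use.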

Challenges and Considerations

Prefetching introduces several challenges:

  • Cache pollution: Incorrect prefetches consume cache capacity, potentially evicting useful data
  • Bandwidth consumption: Prefetch traffic competes with demand traffic for memory bandwidth
  • Timeliness: Prefetches arriving too early may be evicted before use; too late provides no benefit
  • Irregular patterns: Prefetching is ineffective for random or unpredictable access patterns

Memory Coalescing

Memory coalescing is a technique used in massively parallel architectures to combine multiple individual memory requests from parallel threads into fewer, larger transactions. This optimization is crucial for achieving high memory bandwidth utilization in GPUs and similar SIMD-style accelerators.

The Coalescing Problem

In a parallel architecture where many threads execute simultaneously, each thread may generate its own memory request. If these requests are served individually, the overhead of each transaction (address transmission, command decoding, row activation) dramatically reduces effective bandwidth. Memory coalescing addresses this by merging requests that target nearby addresses.

Coalescing Requirements

For requests to be coalesced, they typically must satisfy certain conditions:

  • Address alignment: Requests should align to memory transaction boundaries, typically 32-byte or 128-byte segments
  • Spatial locality: Requests must target addresses within the same memory segment or cache line
  • Temporal coincidence: Requests must occur during the same instruction execution across parallel threads
  • Access type: Read requests coalesce with reads, write requests with writes

Coalescing in GPUs

GPU architectures provide explicit coalescing hardware:

Warp-level coalescing: In NVIDIA GPUs, when the 32 threads of a warp access consecutive 4-byte elements, the hardware issues a single coalesced 128-byte transaction. Non-coalesced accesses may generate multiple separate transactions.

Memory access patterns: Optimal coalescing occurs when thread i accesses address base + i * sizeof(element). Strided access patterns or random access patterns result in partial or no coalescing.

Performance impact: The difference between fully coalesced and uncoalesced memory access can be an order of magnitude in effective bandwidth, making coalescing-aware programming essential for GPU performance.
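
The contrast is easy to reproduce in CUDA (kernel names are illustrative). In the first kernel below, consecutive threads touch consecutive 4-byte words, so each warp's 32 loads collapse into a few wide transactions; in the second, consecutive threads are a configurable stride apart, so the same warp can generate up to 32 separate transactions:

    // Coalesced: thread i reads element i, so a warp covers one contiguous
    // 128-byte region and the hardware merges the accesses.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i reads element i * stride, so neighbouring threads fall
    // into different memory segments and little or no coalescing occurs.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }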

Programming for Coalescing

Achieving good coalescing requires attention to data layout and access patterns:

  • Structure of arrays (SoA): Rather than an array of structures where each structure contains multiple fields, use separate arrays for each field. This ensures that parallel threads accessing the same field access consecutive memory locations.
  • Array padding: Add padding to ensure array rows begin at aligned addresses, enabling coalesced access to row elements.
  • Access pattern transformation: Restructure algorithms to generate coalesced access patterns, even if this requires additional computation.
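
The structure-of-arrays point can be made concrete with a small sketch (type and kernel names are illustrative):

    // Array of structures, shown only for contrast: thread i reading aos[i].x
    // touches addresses 12 bytes apart, so a warp's accesses to the same field
    // do not coalesce well.
    struct ParticleAoS { float x, y, z; };

    // Structure of arrays: thread i reading p.x[i] touches consecutive 4-byte
    // words, so the warp's accesses to the field coalesce into wide transactions.
    struct ParticlesSoA { float *x, *y, *z; };

    __global__ void shift_x(ParticlesSoA p, float dx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += dx;   // consecutive threads, consecutive addresses
    }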

Cache Hierarchies in Accelerators

While scratchpad memories and stream buffers handle many accelerator workloads, cache hierarchies remain important for applications with less predictable access patterns. Accelerator caches differ from CPU caches in organization, policies, and optimization targets.

Texture Caches

Graphics processors include specialized texture caches optimized for 2D and 3D spatial locality:

  • 2D spatial locality: Texture caches are organized to efficiently fetch rectangular regions of image data, not just linear cache lines
  • Filtering support: Hardware interpolation requires accessing multiple nearby texels simultaneously, which texture cache designs accommodate
  • Addressing modes: Support for wrapping, clamping, and other texture addressing modes is built into the cache

Constant Caches

Many accelerators include caches optimized for broadcast access patterns:

Broadcast optimization: When many parallel threads read the same constant value, a single cache access can supply all threads, dramatically amplifying effective bandwidth.

Read-only nature: Constant caches need no coherence mechanisms since data is never modified, simplifying design.
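
In CUDA, this broadcast behavior is what the __constant__ address space is for. A brief sketch (names illustrative) of a small filter whose taps live in constant memory:

    // Filter taps in constant memory: every thread in a warp reads the same
    // coeff[k] in a given iteration, so the constant cache services the whole
    // warp with a single broadcast access.
    __constant__ float coeff[16];

    __global__ void fir(const float* in, float* out, int n, int taps) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n - taps) return;
        float acc = 0.0f;
        for (int k = 0; k < taps; ++k)
            acc += coeff[k] * in[i + k];   // coeff[k] is uniform across the warp
        out[i] = acc;
    }

    // Host side, before launch:
    //   cudaMemcpyToSymbol(coeff, host_taps, taps * sizeof(float));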

L2 Caches in Accelerators

Shared L2 caches in accelerators serve different purposes than in CPUs:

  • Write coalescing: L2 caches aggregate writes from many parallel threads before committing to main memory
  • Inter-SM sharing: In GPUs, L2 cache enables data sharing between streaming multiprocessors without main memory round-trips
  • Bandwidth amplification: Even modest L2 hit rates significantly reduce main memory traffic in bandwidth-limited workloads

Cache Bypassing

Accelerators often provide mechanisms to bypass caches when caching would be counterproductive:

Streaming data: Data that will be accessed only once should bypass caches to avoid polluting them with non-reusable data.

Large working sets: When working sets exceed cache capacity, caching provides no benefit and adds latency. Direct memory access can be more efficient.

Software control: Programming interfaces often expose cache bypass options, allowing software to make informed decisions based on access patterns.
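
CUDA, for example, exposes this kind of control through cache-hint load and store intrinsics on reasonably recent compute capabilities; the copy kernel below is only an illustration. __ldcs and __stcs mark an access as streaming data that is likely used once, asking the hardware to evict it first rather than retain it at the expense of reusable data:

    // Streaming copy: each element is read and written exactly once, so the
    // cache hints request that it not displace data with reuse potential.
    __global__ void stream_copy(const float* __restrict__ src,
                                float* __restrict__ dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) __stcs(&dst[i], __ldcs(&src[i]));
    }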

High-Bandwidth Memory

High-Bandwidth Memory (HBM) is an advanced memory technology that provides dramatically higher bandwidth than conventional DRAM, making it the preferred memory technology for high-performance accelerators. HBM achieves its performance through 3D stacking, wide interfaces, and tight integration with the processor die.

HBM Architecture

HBM stacks multiple DRAM dies vertically, connected by through-silicon vias (TSVs):

  • 3D stacking: Multiple DRAM dies (typically 4-8) are stacked vertically, with thousands of TSVs providing high-bandwidth interconnection
  • Wide interface: Each HBM stack provides a 1024-bit interface, compared to 64 bits for conventional DDR
  • 2.5D integration: HBM stacks sit on an interposer alongside the processor die, enabling short, high-bandwidth connections
  • Multiple stacks: High-end accelerators use multiple HBM stacks to further increase aggregate bandwidth

Performance Characteristics

HBM offers compelling performance advantages:

Bandwidth: A single HBM2E stack provides up to 460 GB/s bandwidth. With multiple stacks, accelerators achieve aggregate bandwidths exceeding 2 TB/s.
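
These figures follow directly from the interface width and per-pin signalling rate: a 1024-bit interface moves 128 bytes per transfer, so at HBM2E's roughly 3.6 Gb/s per pin a stack delivers about 128 bytes × 3.6 GT/s ≈ 460 GB/s, and a handful of stacks reaches the multi-terabyte-per-second aggregates quoted above.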

Energy efficiency: Short interconnects and lower voltage operation make HBM significantly more energy-efficient per bit transferred than conventional DRAM.

Latency: While HBM latency is comparable to DDR, the massive parallelism enables much higher throughput for bandwidth-limited workloads.

HBM Generations

HBM technology has evolved through multiple generations:

  • HBM: First generation, 128 GB/s per stack, up to 4 GB capacity
  • HBM2: Doubled bandwidth to 256 GB/s, increased capacity to 8 GB per stack
  • HBM2E: Further increased bandwidth to 460 GB/s with up to 16 GB per stack
  • HBM3: Latest generation, reaching roughly 800 GB/s per stack (6.4 Gb/s per pin) with capacities of 24 GB or more per stack

Design Considerations

Implementing HBM requires careful system design:

  • Thermal management: Stacked dies generate concentrated heat that must be effectively dissipated
  • Interposer design: The silicon interposer must be carefully designed for signal integrity and thermal performance
  • Cost: HBM is significantly more expensive than conventional DRAM, limiting its use to high-performance applications
  • Capacity limits: While bandwidth is excellent, total capacity per stack remains limited compared to conventional DIMM-based systems

Near-Data Processing

Near-data processing (NDP), also known as processing-in-memory (PIM) or near-memory computing, addresses the memory bandwidth bottleneck by moving computation closer to where data resides. Rather than transferring large amounts of data to a remote processor, NDP places processing elements within or adjacent to memory, dramatically reducing data movement.

Motivation and Benefits

The case for near-data processing stems from fundamental physical constraints:

Energy cost of data movement: Moving data across a chip or between chips consumes far more energy than the computation itself. For memory-intensive workloads, data movement dominates total energy consumption.

Bandwidth limitations: The off-chip bandwidth between processor and memory is limited by pin count and signaling constraints. Moving computation to memory bypasses this bottleneck.

Latency reduction: Processing data in place eliminates round-trip memory latency for intermediate results that need not leave the memory subsystem.

NDP Architectures

Several architectural approaches implement near-data processing:

Processing-in-memory (PIM): True PIM integrates processing logic directly into memory arrays, performing computation using modified memory cells. While theoretically powerful, practical PIM faces challenges in integrating logic and memory fabrication processes.

Near-memory processing: Processing elements are placed on the same die or package as memory but remain separate from the memory arrays. This approach is more practical with current technology while still providing bandwidth benefits.

Logic-in-memory (3D stacking): Logic dies are stacked with memory dies, connected by TSVs. The logic can perform computation on data with internal bandwidth far exceeding external interfaces.

Application Domains

Near-data processing is particularly suited for certain workload characteristics:

  • Data analytics: Operations like filtering, aggregation, and pattern matching process large datasets with low arithmetic intensity
  • Graph processing: Graph algorithms exhibit irregular access patterns that defeat caching; NDP can process graph data in place
  • Database operations: Scanning, selection, and simple aggregation benefit from processing data where it resides
  • Machine learning inference: Memory-bound inference workloads can benefit from in-memory matrix operations

Implementation Challenges

Near-data processing faces several practical challenges:

  • Programming models: Expressing which computations should occur near memory and managing data placement requires new programming abstractions
  • Coherence: Maintaining consistency between near-memory processing and conventional caches adds complexity
  • Flexibility: Fixed-function NDP accelerators may not match diverse workload requirements
  • Technology constraints: Memory processes are optimized for density, not logic performance; integrating capable processing elements is challenging

Memory System Design Trade-offs

Designing memory hierarchies for accelerators involves balancing multiple competing factors:

Capacity vs. Bandwidth vs. Latency

The fundamental trade-offs in memory design apply to accelerators:

  • SRAM: Fast and low-latency but expensive and power-hungry per bit; suitable for small scratchpads and caches
  • DRAM: High density and reasonable bandwidth but higher latency; forms main memory
  • HBM: Excellent bandwidth through 3D stacking but limited capacity and high cost
  • GDDR: Lower cost than HBM with good bandwidth, commonly used in consumer GPUs

Software vs. Hardware Management

A key architectural decision is how much memory management responsibility to place in hardware versus software:

Hardware-managed caches: Simplify programming but may make suboptimal decisions for accelerator workloads with specialized access patterns.

Software-managed scratchpads: Enable optimal data placement but require more programming effort and may not adapt to varying runtime conditions.

Hybrid approaches: Many accelerators combine both, using caches for general access patterns and scratchpads for performance-critical data with known access patterns.

Specialization vs. Generality

Memory architectures can be specialized for specific access patterns or designed for generality:

  • Specialized designs: Texture caches, constant caches, and stream buffers achieve excellent performance for specific patterns but may perform poorly on others
  • General designs: Conventional cache hierarchies handle diverse workloads but may not achieve peak performance on any particular pattern
  • Configurable designs: Some architectures allow software to configure memory structures for different uses, balancing specialization with flexibility

Performance Analysis and Optimization

Understanding and optimizing accelerator memory performance requires appropriate tools and methodologies:

Key Metrics

  • Memory bandwidth utilization: The fraction of theoretical peak bandwidth actually achieved
  • Arithmetic intensity: The ratio of computation to memory traffic, determining whether a workload is compute-bound or memory-bound
  • Cache hit rates: For cached accesses, the fraction served from cache versus main memory
  • Coalescing efficiency: In parallel architectures, the fraction of memory requests that successfully coalesce
  • Scratchpad utilization: Effective use of available scratchpad capacity
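
As a worked example of arithmetic intensity: a single-precision SAXPY update, y[i] = a * x[i] + y[i], performs 2 floating-point operations while reading 8 bytes and writing 4 bytes, for 2 / 12 ≈ 0.17 FLOP/byte, placing it firmly in the memory-bound regime on any modern accelerator.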

The Roofline Model

The roofline model provides a framework for understanding accelerator performance limits:

Performance is bounded either by peak computational throughput (the flat ceiling of the roof) or by memory bandwidth (the sloped portion of the roofline). A kernel's arithmetic intensity determines which bound applies: kernels with low arithmetic intensity are memory-bound, while those with high intensity are compute-bound.
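
In equation form, attainable throughput = min(peak compute throughput, memory bandwidth × arithmetic intensity). For a hypothetical accelerator with 20 TFLOP/s of peak compute and 1.5 TB/s of memory bandwidth, the SAXPY example above (≈ 0.17 FLOP/byte) is capped near 1.5 × 0.17 ≈ 0.26 TFLOP/s, and a kernel would need an intensity above 20 / 1.5 ≈ 13 FLOP/byte before it became compute-bound.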

Optimization strategies differ for each regime: memory-bound kernels benefit from memory hierarchy optimization, while compute-bound kernels benefit from computational optimizations.

Optimization Strategies

Common optimization approaches for accelerator memory systems include:

  • Tiling and blocking: Decompose problems to maximize data reuse within fast memory levels
  • Data layout optimization: Arrange data to match access patterns and enable coalescing
  • Prefetching: Use software or hardware prefetching to hide memory latency
  • Computation restructuring: Modify algorithms to improve arithmetic intensity or memory access patterns
  • Memory traffic reduction: Use compression, data deduplication, or approximate computing to reduce bandwidth requirements

Summary

Memory hierarchies for accelerators represent a specialized domain of computer architecture that differs significantly from traditional CPU memory systems. Scratchpad memories provide software-managed, deterministic access for predictable workloads. Stream buffers and prefetch engines handle sequential and strided patterns efficiently. Memory coalescing enables parallel architectures to achieve high bandwidth utilization by combining individual thread requests.

Advanced memory technologies like HBM provide the massive bandwidth that modern accelerators require, while near-data processing approaches address the fundamental inefficiency of moving data to computation. Effective accelerator memory design requires balancing capacity, bandwidth, latency, and energy while choosing appropriate trade-offs between hardware and software management.

As accelerators become increasingly important for machine learning, scientific computing, and other demanding applications, understanding memory hierarchy design is essential for both hardware architects developing new accelerator platforms and software developers optimizing applications to extract maximum performance from these specialized computing devices.
