Database Storage Engines

A storage engine is the component of a database system responsible for organizing data on persistent storage, retrieving it efficiently, and guaranteeing that committed changes survive failures. It sits beneath the query processor, which decides what data is needed, and translates those needs into reads and writes against the underlying device. The same higher-level database may run atop different storage engines, and the choice of engine determines much of the system's read performance, write performance, space efficiency, and behavior under concurrent access.

The central tension in storage engine design is the same one that runs through all of storage: the device is slow compared with memory, and the patterns of access it favors differ sharply from the patterns applications generate. A storage engine mediates this mismatch through carefully chosen data structures, a buffer of in-memory pages, a durability log, and a concurrency-control scheme that lets many transactions proceed at once. This article examines the storage hierarchy the engine must accommodate, the two dominant on-disk structures (the B-tree and the log-structured merge-tree), page and buffer management, write-ahead logging, indexing, multiversion concurrency control, and the durability guarantees that tie the whole design together.

The Storage Hierarchy and Engine Role

A storage engine operates across a hierarchy of memory and storage whose levels differ by orders of magnitude in both speed and capacity. Its job is to keep the data that matters most in the fast, small upper levels while ensuring that the authoritative copy resides safely in the large, slow lower levels. Every design decision reflects an attempt to serve requests from memory whenever possible and to write to the device in the sequential, batched patterns it handles best.

Levels of the Hierarchy

The hierarchy spans processor registers and caches, main memory, and persistent storage, with each level trading capacity for speed. The storage engine works chiefly with main memory and persistent storage, treating memory as a cache over the durable device.

Main memory: Fast and volatile, it holds the buffer pool of recently used pages and the working state of active transactions.
Persistent storage: Slower and durable, it holds the authoritative data and the log; its access characteristics shape the entire engine.
Access asymmetry: Sequential access vastly outperforms random access on most devices, and writes often behave differently from reads, both of which the engine must accommodate.

Why the Structure Matters

Because the device dominates the cost of any operation it serves, the structures the engine chooses are judged by how few and how favorable the device accesses they require. Two measures capture this and recur throughout engine design.

Read amplification: The number of device accesses required to satisfy a single logical read.
Write amplification: The amount of physical data written for each byte of logical data, which affects both speed and the endurance of flash media.
Space amplification: The ratio of physical storage consumed to the logical size of the data, reflecting overhead and obsolete versions.

B-Tree Storage

The B-tree, and in particular its variant the B+ tree, has been the dominant on-disk structure for databases for decades. It is a balanced, high-fan-out tree that keeps data sorted and supports efficient point lookups, range scans, and ordered traversal, all while guaranteeing that any item can be reached in a small, bounded number of levels. Its update-in-place discipline makes it a natural fit for read-intensive and mixed workloads.

Structure and Operation

A B+ tree stores keys in interior nodes that direct the search and stores the actual records in leaf nodes, which are typically linked together to support range scans. Each node occupies one page, and the tree's high fan-out keeps it shallow even for enormous datasets, so a lookup touches only a handful of pages.

High fan-out: Each interior node holds many keys, so even billions of records reside in a tree only a few levels deep.
Sorted leaves: Records are kept in key order, making range queries and ordered iteration efficient.
Bounded lookup: The number of page reads for a lookup equals the tree height, which grows only logarithmically with data size.

Maintaining Balance

The B-tree preserves its shallow, balanced shape automatically as data is inserted and deleted. When a node fills, it splits and pushes a separator key upward; when a node empties past a threshold, it merges with a neighbor. These operations keep every leaf at the same depth.

Node splitting: A full node divides into two, propagating a separator key to the parent and possibly cascading to the root.
Update in place: Modifications rewrite the affected page directly, which is simple but produces random writes scattered across the device.
Write characteristics: Random in-place updates and the resulting page rewrites are the chief cost of the B-tree under write-heavy workloads.

Log-Structured Merge-Trees

The log-structured merge-tree, or LSM-tree, takes the opposite approach to the B-tree. Instead of updating data in place, it buffers writes in memory and flushes them to disk in large, sorted, immutable files, later merging those files in the background. This converts the random writes of a B-tree into sequential ones, dramatically improving write throughput at some cost to read performance, which makes the LSM-tree the structure of choice for write-intensive systems.

The Write Path

Incoming writes are first recorded in a durability log and inserted into an in-memory sorted structure. When that structure grows large enough, it is flushed to disk in one sequential pass as an immutable sorted file, after which a fresh in-memory structure receives subsequent writes.

In-memory buffer: A sorted memory structure absorbs writes at memory speed and is never updated in place on disk.
Sorted runs: Flushing produces immutable, sorted files written sequentially, which suits both disks and flash.
Deletions as markers: A delete is recorded as a tombstone marker rather than an in-place removal, to be resolved during later merging.

Compaction and the Read Path

Because data accumulates across many sorted files, a read may need to consult several of them, and obsolete versions pile up. A background process called compaction merges files together, discarding superseded records and tombstones, which bounds the number of files a read must examine and reclaims space.

Compaction: Merging sorted files removes obsolete data and limits how many files a lookup must search.
Bloom filters: Compact probabilistic structures let a read skip files that certainly do not contain the sought key, taming read amplification.
Write versus read trade-off: The LSM-tree favors write throughput and space efficiency, accepting higher read and compaction cost in return.

Page and Buffer Management

Whatever on-disk structure an engine uses, it transfers data between memory and storage in fixed-size units called pages. The buffer pool, a region of memory that caches pages, is the engine's most important performance asset, because it lets the great majority of accesses complete without touching the device. Managing this pool well is central to throughput.

Pages and Records

A page is the unit of input and output and the granularity at which the engine caches and locks data. Within a page, records are arranged so that they can be located, added, and removed without rewriting the whole page unnecessarily.

Fixed-size pages: Uniform page sizes simplify buffer management and align with the device's transfer units.
Slotted pages: A directory of slots within the page locates variable-length records and allows them to move without changing their identifiers.
Page headers: Metadata in each page records free space, record count, and the information needed for recovery.

The Buffer Pool

The buffer pool caches pages in memory and decides which to keep and which to evict when space runs short. A page modified in memory is marked dirty and must be written back to the device before its buffer can be reused, subject to the ordering rules that durability imposes.

Replacement policy: Algorithms approximating least-recently-used behavior choose eviction victims, retaining hot pages in memory.
Dirty-page writeback: Modified pages are flushed to storage before eviction, coordinated with the durability log.
Pinning: A page in active use is pinned to prevent its eviction until the operation completes.

Write-Ahead Logging

To guarantee that committed transactions survive crashes while still allowing the buffer pool to defer writes, storage engines record every change in a write-ahead log before applying it to the data pages. The log is written sequentially, which is fast, and it provides the authoritative record from which recovery reconstructs the correct state. Write-ahead logging is the foundation of both durability and atomicity in most engines.

The Logging Protocol

The governing rule is that the log record describing a change must reach durable storage before the corresponding data page is allowed to. A transaction is considered committed once its commit record is durable in the log, even if the modified data pages remain only in the buffer pool.

Write-ahead rule: Log records precede the data-page writes they describe, so the log always knows at least as much as the data.
Sequential log writes: Appending to the log is sequential and therefore fast, decoupling commit latency from random data writes.
Group commit: Batching the durable flush of many transactions' commit records amortizes the cost of forcing the log to storage.

Recovery

After a crash, the engine replays the log to restore a transactionally consistent state. A widely used protocol, ARIES, proceeds in three phases: an analysis phase that reconstructs which transactions were in progress and which pages were dirty, a redo phase that replays logged changes to bring the database to the moment of the crash, and an undo phase that reverses the effects of transactions that had not committed.

Redo: Committed changes recorded in the log are reapplied to ensure no durable update is lost.
Undo: The effects of transactions that were still in flight at the crash are rolled back, preserving atomicity.
Checkpoints: Periodic checkpoints bound how much log must be replayed, keeping recovery time manageable.

Indexing

An index is an auxiliary structure that accelerates the location of records matching a query, sparing the engine from scanning all of the data. Indexes are themselves persistent structures the storage engine must maintain, and their design strongly influences both read speed and the cost of writes, since every index must be updated when the data it covers changes.

Primary and Secondary Indexes

The engine distinguishes the index that determines the physical organization of the data from additional indexes that merely point into it. This distinction governs how lookups proceed and how much work an update entails.

Clustered organization: When the table is stored in the order of its primary key, that key's index is the data itself, making primary-key lookups especially efficient.
Secondary indexes: Additional indexes map other attributes to the records they describe, enabling fast lookup on non-primary columns.
Update cost: Every secondary index covering a changed column must also be updated, so indexes speed reads at the expense of writes.

Index Structures and Techniques

Indexes are commonly built from the same balanced-tree structures used for primary storage, but other forms serve particular needs. Several techniques reduce lookup cost or index size for specific query patterns.

Tree indexes: B+ tree indexes support both point lookups and range queries on ordered keys.
Hash indexes: Hash-based indexes give fast point lookups but do not support range queries.
Covering indexes: An index that includes all columns a query needs answers the query from the index alone, avoiding a second lookup into the data.

Multiversion Concurrency Control

Databases must let many transactions read and write concurrently while preserving the illusion that each runs in isolation. Multiversion concurrency control, abbreviated MVCC, achieves this by keeping multiple versions of each record, so that readers see a consistent snapshot without blocking writers and writers do not block readers. It is the concurrency mechanism of most modern storage engines.

Versions and Snapshots

Under MVCC, an update does not overwrite the existing record but creates a new version tagged with the transaction that produced it. Each transaction reads the versions that were committed as of a consistent snapshot, so it sees a stable view of the data even as others modify it.

Version chains: Each record may have several versions, each marked with the transaction identifiers that created and superseded it.
Snapshot reads: A transaction observes only the versions visible at its snapshot, giving it a consistent and unchanging view.
Reduced blocking: Because readers consult old versions, they neither wait for nor block concurrent writers.

Reclaiming Old Versions

Retaining old versions consumes space, so the engine must eventually remove versions that no active transaction can still see. This reclamation, performed by a background process, parallels the garbage collection of log-structured systems and is essential to controlling space amplification.

Visibility horizon: A version older than the oldest active snapshot can be discarded because no transaction can observe it.
Background cleanup: A maintenance process reclaims obsolete versions to bound space growth.
Bloat control: Neglecting reclamation lets dead versions accumulate, degrading both space efficiency and scan performance.

Durability and Transactional Guarantees

The promises a storage engine makes about its behavior under concurrency and failure are summarized by the durability and consistency guarantees it provides. These guarantees, often described by the properties of atomicity, consistency, isolation, and durability, are what distinguish a database from a mere file, and the mechanisms described above exist precisely to uphold them.

The Transactional Properties

A transaction is a unit of work that the engine treats as indivisible. The four classical properties describe what the engine guarantees about each transaction in the presence of concurrency and failure.

Atomicity: A transaction either takes effect entirely or not at all, enforced by undo logging and careful commit ordering.
Consistency: Each transaction moves the database from one valid state to another, preserving declared constraints.
Isolation: Concurrent transactions do not interfere, an illusion maintained by MVCC or locking.
Durability: Once committed, a transaction's effects survive crashes, guaranteed by the write-ahead log.

Tuning the Guarantees

Real systems expose choices that trade strictness for performance, because the strongest guarantees carry the highest cost. Understanding these settings is essential to operating a database that meets its requirements without needless overhead.

Isolation levels: Weaker isolation permits certain anomalies in exchange for greater concurrency, while stronger levels forbid them at a cost.
Synchronous commit: Forcing the log to durable storage at commit maximizes safety, whereas deferring the flush trades a small risk of loss for higher throughput.
Replication: Committing to additional copies before acknowledging protects against the loss of any single node at the cost of added latency.

Summary

A database storage engine bridges the gap between the data structures an application needs and the storage device that must hold the data durably and serve it quickly. It works across a memory and storage hierarchy whose levels differ enormously in speed, striving to satisfy requests from memory and to write to the device in the sequential, batched patterns it favors. The two dominant on-disk structures embody opposite strategies: the B-tree updates data in place to give excellent read and range performance, while the log-structured merge-tree buffers and batches writes sequentially to maximize write throughput, paying with background compaction and higher read cost.

Around these structures the engine assembles a buffer pool that caches pages, a write-ahead log that guarantees durability and atomicity while allowing writes to be deferred, indexes that accelerate lookups at the cost of update work, and multiversion concurrency control that lets many transactions proceed without blocking one another. The transactional properties of atomicity, consistency, isolation, and durability summarize what the engine guarantees, and the mechanisms throughout exist to uphold them, with tunable settings that trade strictness for performance. Mastery of these components allows engineers to choose and operate storage engines whose read performance, write performance, and durability match the demands of their workloads.