Filesystem Design

A filesystem is the layer of software that imposes structure on raw storage, transforming an undifferentiated array of fixed-size blocks into the familiar abstraction of named files arranged in directories. The storage device itself offers only the ability to read or write numbered blocks; it has no notion of a file, a name, a size, or an owner. The filesystem supplies all of this, mapping the logical structure that applications expect onto the physical blocks the device provides, and doing so in a way that survives crashes, uses space efficiently, and performs well under realistic workloads.

Filesystem design sits at the intersection of data structures, concurrency, and reliability engineering. The designer must choose how to lay metadata and data across the device, how to allocate space as files grow and shrink, how to keep the structures consistent when power fails mid-update, and how to exploit memory caching without sacrificing durability. This article develops these concerns in turn: the on-disk layout and the structures that describe files, the allocation strategies that place data, the journaling techniques that protect consistency, the caching and buffering that drive performance, the recovery mechanisms that restore order after a crash, and the alternative log-structured and copy-on-write designs that rethink the problem from the ground up.

On-Disk Layout

Every filesystem begins with a plan for how to divide the storage device into regions that hold distinct kinds of information. At a minimum the device must hold the file data itself, the metadata that describes each file, the directory structures that map names to files, and bookkeeping that records which blocks are in use. The arrangement of these regions is the on-disk layout, and it is fixed when the filesystem is created.

Blocks and the Superblock

The filesystem groups the device's native sectors into logical blocks, typically of four kilobytes, which form the unit of allocation. A distinguished structure called the superblock sits at a known location and records the parameters of the entire filesystem, so that the operating system can interpret everything else after locating it.

Block size: A larger block reduces metadata overhead and favors large files, while a smaller block wastes less space on small files; the choice is a fundamental trade-off.
Superblock contents: The total size, block size, locations of key structures, and free-space counts are stored so the filesystem can be mounted.
Redundant copies: Because the superblock is indispensable, filesystems keep backup copies at known offsets to survive corruption of the primary.

Allocation Groups and Locality

Many filesystems partition the device into several self-contained regions, variously called block groups, cylinder groups, or allocation groups, each holding its own metadata and data blocks. Distributing metadata throughout the device keeps it physically near the data it describes, which reduces seek distance on rotating media and limits the damage from localized corruption.

Locality of reference: Placing a file's metadata and data in the same group shortens the access path.
Parallelism: Independent groups allow concurrent allocation without contending on a single global structure.
Damage containment: Corruption within one group need not compromise the rest of the filesystem.

Inodes and File Metadata

The central metadata structure in the classic filesystem is the index node, universally abbreviated as the inode. Each file is described by exactly one inode, which holds the file's attributes and the information needed to locate its data blocks, but notably not its name. Separating the name from the inode is what allows a single file to be referenced from multiple directory entries.

The Inode Structure

An inode records the administrative facts about a file together with pointers to where its contents reside. The name lives in the directory, not the inode, so the inode is a pure description of the file's data and properties.

Attributes: Size, ownership, permissions, and timestamps for access, modification, and metadata change.
Link count: The number of directory entries that reference the inode; the file's storage is reclaimed only when this count reaches zero.
Data location: Pointers or descriptors that record where the file's data blocks lie on the device.

Block Pointers Versus Extents

Two principal schemes describe where a file's data resides. The traditional scheme stores an array of direct block pointers supplemented by indirect blocks for larger files; the modern scheme stores extents, each of which records a starting block and a length, describing a contiguous run of data compactly.

Direct and indirect pointers: A small number of direct pointers cover small files, while single, double, and triple indirect blocks extend the reach to very large files at the cost of extra lookups.
Extents: A single extent can describe thousands of contiguous blocks, shrinking metadata and improving sequential performance for large, unfragmented files.
Trade-off: Pointer arrays handle arbitrary fragmentation gracefully, whereas extents are most efficient when allocation keeps files contiguous.

Directories

A directory is itself a file whose contents are a mapping from names to inode numbers. Looking up a path walks this structure component by component, resolving each name to an inode and then reading that inode to continue. Large directories benefit from indexed structures rather than linear lists.

Name resolution: Each path component is resolved by searching the relevant directory for the matching entry.
Hard links: Multiple directory entries may point to the same inode, which is why the inode tracks a link count.
Indexed directories: Hashing or tree structures accelerate lookup in directories holding many thousands of entries.

Space Allocation

As files are created, extended, and deleted, the filesystem must continually decide which free blocks to assign and must track which blocks remain available. Good allocation keeps the data of each file contiguous to favor sequential access, spreads unrelated files apart to leave room for growth, and avoids fragmenting free space into unusable scraps.

Tracking Free Space

The filesystem records availability either as a bitmap, with one bit per block, or as a list of free extents. Bitmaps are compact and support fast scanning for runs of free blocks, while extent-based free lists describe large contiguous regions efficiently.

Free-space bitmaps: One bit per block gives a dense, easily scanned record of availability.
Free extent trees: Recording free regions as extents suits filesystems that allocate in large contiguous chunks.
Reservation: Many designs reserve a margin of free space so that allocation quality does not collapse as the device fills.

Allocation Strategies

Beyond merely finding a free block, the allocator chooses where to place new data to balance immediate efficiency against future fragmentation. Several techniques work together toward this end.

Delayed allocation: Deferring the choice of physical blocks until data is flushed lets the allocator see the full size of a write and place it contiguously.
Block groups: Keeping a file's blocks within one allocation group preserves locality and limits seek distance.
Preallocation: Reserving blocks ahead of need keeps a growing file contiguous and reduces fragmentation.

Journaling and Consistency

A single logical operation, such as moving a file or extending it, may require several independent updates to disk: the inode, the directory, and the free-space map might all change. If power fails after some but not all of these writes complete, the filesystem is left inconsistent. Journaling solves this by recording the intended changes in a dedicated log before applying them, so that an interrupted operation can be completed or discarded cleanly.

The Write-Ahead Principle

The journal embodies the write-ahead principle: the description of a change is committed to a sequential log before the change is written to its final location. After a crash, the recovery process replays the committed entries of the log, bringing the main filesystem to a consistent state without an exhaustive scan.

Transaction grouping: The several writes that constitute one logical operation are recorded together so they take effect atomically.
Commit records: A transaction becomes durable only when its commit marker reaches the log; partial transactions are ignored during recovery.
Checkpointing: Once changes are safely written to their home locations, the corresponding journal space is reclaimed.

Levels of Journaling

Journaling all data is safe but doubles write traffic, since every byte is written first to the log and then to its home. Most filesystems therefore offer a choice of how much to protect, trading durability against performance.

Metadata journaling: Only structural metadata is logged, protecting filesystem integrity while letting data writes proceed directly.
Ordered mode: Data blocks are forced to disk before the metadata that references them is committed, preventing metadata from ever pointing at stale data.
Full data journaling: Both data and metadata pass through the journal, giving the strongest guarantee at the highest cost.

Caching and Buffering

Storage devices are far slower than memory, so filesystems keep recently and frequently accessed data and metadata in main memory. The page cache holds file contents, while separate caches hold inodes and directory-lookup results. Effective caching is responsible for the majority of a filesystem's observed performance, because most accesses are satisfied without touching the device at all.

The Page Cache

File data is cached in memory at the granularity of pages. Reads are satisfied from the cache when possible, and writes are absorbed into the cache and written back to the device later, which decouples application progress from device latency.

Read caching: Pages read once remain in memory to satisfy subsequent reads cheaply.
Readahead: Detecting sequential access, the filesystem prefetches upcoming blocks to convert random latency into sequential throughput.
Writeback: Dirty pages accumulate in the cache and are flushed in batches, enabling coalescing and better allocation.

Metadata Caches and Durability Control

Caching writes improves performance but creates a window in which acknowledged data lives only in volatile memory. Applications that require durability must be able to force data to stable storage, and the filesystem must respect those requests precisely.

Inode and directory caches: Frequently traversed metadata is cached so that path resolution avoids repeated device access.
Synchronization barriers: An explicit flush forces buffered writes to the device and returns only when they are durable.
Write barriers: Ordering primitives ensure that a device's own write cache does not reorder writes in a way that violates journaling guarantees.

Crash Recovery

However careful the design, a system can lose power at any instant, and the filesystem must return to a consistent state when it restarts. The recovery strategy follows directly from the consistency mechanism the filesystem employs: a journaling filesystem replays its log, while an unjournaled filesystem must reconstruct consistency by inspection.

Log Replay

For a journaling filesystem, recovery is fast and bounded. On mount, the recovery process scans only the journal, replaying every committed transaction and discarding any that lack a commit record. The size of the journal, not the size of the filesystem, bounds the work, so recovery completes in seconds regardless of capacity.

Idempotent replay: Replaying a committed transaction more than once yields the same result, so recovery is safe even if it is itself interrupted.
Bounded time: Only the journal is examined, making recovery time independent of total storage size.
Discarding incomplete work: Transactions without a commit marker are treated as if they never happened.

Consistency Checking

Filesystems without a journal, or those suspected of deeper corruption, rely on a full consistency check that examines every structure and repairs inconsistencies. This check is thorough but slow, and its duration grows with the amount of data, which is precisely the cost that journaling was introduced to avoid.

Full scan: Every inode, directory, and free-space record is examined for internal consistency.
Reference reconciliation: Link counts and free-space maps are recomputed and corrected against the actual structures.
Scaling cost: Because the work is proportional to occupied space, checking a large filesystem can take a very long time.

Log-Structured Filesystems

A log-structured filesystem rethinks the layout entirely by treating the whole device as one continuous, append-only log. Rather than updating data and metadata in place, it writes every change sequentially to the head of the log. The motivation is that sequential writes are far faster than scattered ones on most media, and that caching already absorbs the bulk of reads, so optimizing write placement yields the greatest benefit.

Sequential Writing

All new data and metadata, including updated inodes, are gathered into large segments and appended to the log in one sequential stream. This converts the random-write workload of a conventional filesystem into a sequential one, which suits both rotating disks and the erase-block structure of flash.

Segment writing: Changes are buffered and written in large contiguous segments rather than as scattered small writes.
Inode map: Because inodes move each time they are written, a separate map tracks the current location of every inode.
Flash affinity: Sequential, append-only writing aligns naturally with the way flash memory must be erased and rewritten.

Garbage Collection

Because nothing is overwritten in place, superseded versions of data accumulate and must eventually be reclaimed. A cleaner process reads partially obsolete segments, copies the still-live data forward into new segments, and frees the old ones, restoring contiguous free space at the cost of additional write traffic.

Live-data copying: The cleaner consolidates valid data from sparsely used segments into fresh, fully packed segments.
Cleaning overhead: The extra writes performed by the cleaner compete with application writes, the principal cost of the approach.
Segment selection: Choosing which segments to clean, favoring those with the least live data, governs efficiency.

Copy-on-Write Filesystems

Copy-on-write filesystems never modify existing data in place. When a block changes, the filesystem writes the new version to a free location and then updates the structures that point to it, propagating the change upward through a tree of references to the root. This discipline yields powerful integrity and snapshot capabilities and underlies several modern filesystems.

The Copy-on-Write Discipline

Because updates are written to new locations and the pointers are switched atomically at the root, the filesystem transitions instantaneously from one consistent state to the next. There is never a moment at which a partially updated structure is visible, which removes the need for a separate journal to protect metadata.

Atomic root switch: A single update to the root pointer publishes an entire batch of changes at once.
Always consistent: The on-disk state is consistent at every instant, so a crash simply leaves the last completed state intact.
Tree of references: Updating a block forces updates to every block that points to it, up to the root.

Snapshots and Integrity

Retaining the old blocks rather than freeing them immediately makes inexpensive snapshots natural: a snapshot is simply a preserved old root that continues to reference the data as it was. Pairing copy-on-write with checksums stored in the parent pointers lets the filesystem detect and, given redundancy, repair silent corruption.

Snapshots: A point-in-time image costs almost nothing because it shares unchanged blocks with the live filesystem.
Checksumming: Each pointer carries a checksum of the block it references, enabling end-to-end detection of corruption.
Self-healing: When a checksum reveals a bad block and a redundant copy exists, the good copy transparently replaces the bad one.

Summary

Filesystem design turns a featureless array of storage blocks into the structured world of files and directories, and every design choice balances performance, space efficiency, and resilience. The on-disk layout fixes how metadata and data are distributed; inodes describe files and locate their contents through pointer arrays or extents; and allocation strategies place data to favor contiguity and locality while tracking free space through bitmaps or extent trees. Above these structures, journaling records intended changes in a write-ahead log so that an interrupted operation can be replayed or discarded, bounding crash recovery to the size of the journal rather than the size of the device.

Caching and buffering deliver most of the observed performance by satisfying accesses from memory and coalescing writes, while explicit synchronization restores the durability that caching defers. Alternative designs reframe the problem: log-structured filesystems make all writing sequential at the cost of garbage collection, and copy-on-write filesystems never overwrite data, gaining always-consistent state, cheap snapshots, and end-to-end integrity checking. Together these techniques allow filesystems to keep vast quantities of data both accessible and safe across the inevitable failures of real hardware, and the same principles increasingly inform the storage engines built atop them.