Chapter 9.1

Storage in the AI Lifecycle: Why It Determines GPU Efficiency

Storage is not the thing that holds your data — it is the thing that decides whether a $40,000 accelerator computes or idles, and the only way to design it right is to stop treating it as one system and recognize the four mutually-hostile I/O personalities competing for the same hardware.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Whether you size the storage subsystem to a per-GPU bandwidth target (GB/s/GPU) or to a capacity target (PB) — because picking the wrong primary axis strands one of the two and you pay for it in idle GPUs or idle flash.
Which of the four I/O personalities (ingestion, checkpointing, many-small-files, KV-cache) dominates your workload, and therefore which one the storage tier is actually optimized for — they pull in opposite directions and a system tuned for one starves the others.
Where each tier physically lives — node-local NVMe scratch vs networked parallel filesystem vs object capacity vs archive — and which fabric carries its traffic, because mis-placing checkpoint writes onto the training fabric collides with the collectives it is supposed to protect.
Whether to budget for throughput, IOPS, latency, or metadata as the binding constraint of your dominant personality — the four are not interchangeable and the benchmark that looks good on a datasheet is usually the wrong one.
Whether your data is even movable — because at petabyte scale, egress economics and data gravity may have already decided where your GPUs get built before you ever specced a rack.

The accelerator is the most expensive depreciating asset in the building, and the entire reason the storage subsystem exists is to keep it from sitting idle. Storage is a GPU-efficiency problem, not a capacity problem. A single H100-class server runs $283k–$318k all-in; a GB200 NVL72 rack is on the order of $3M. When the data path cannot feed those accelerators fast enough, they stall, and a stalled GPU burns depreciation and energized megawatts to produce nothing. The metric that matters is not how many petabytes you can store but goodput: the fraction of GPU-hours that turn into useful forward progress rather than waiting on I/O or recovering from a failure. → Chapter 12.2.

This chapter sets up the rest of Part 9 by refusing the most common mistake in storage design — treating "AI storage" as a single thing to be procured. It is not. It is four distinct I/O personalities that share the same flash and the same fabric while pulling in opposite directions: high-throughput sequential ingestion, bursty large-sequential checkpointing, metadata-heavy many-small-files access, and latency-critical KV-cache offload for inference. Each one wants a different thing — throughput, write bandwidth, metadata ops, or tail latency — and a system tuned for one will starve the others.

The economic argument, stated plainly

Start from the unit economics, because they are unforgiving. A debt-financed GPU cluster breaks even near 70% utilization; below that it loses money every month. Storage is one of the few subsystems that can move utilization in either direction by tens of points without anyone in the building touching a GPU. Underfeed the accelerators with a naive data path and a vision-training job can sit below 50% GPU utilization: half of your most expensive asset is gone before you have accounted for a single failure. Feed them properly with a sharded, sequential, prefetched pipeline and the same job clears 90%. That delta is not a tuning nicety; on a 16k-GPU fleet it is the difference between a cluster that pays its debt and one that does not.

The second economic lever is failure. At scale, hardware fails constantly: a 16,384-GPU Llama-3-class run logged 419 unplanned interruptions over 54 days — a mean time to interrupt of roughly 192 minutes — and a synchronous training job has no choice but to restart from its last checkpoint when any node drops. Storage is what makes that restart cheap or catastrophic. If checkpointing is slow, you check less often, lose more work per failure, and bleed goodput; if recovery reads are slow, every restart stalls the whole cluster. The storage subsystem is therefore doing two jobs at once — feeding the GPUs during steady state and insuring the run against the inevitable interruption — and these two jobs have completely different I/O signatures. That is the seed of the four-personality framing.
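The two jobs described above can be folded into one rough number. The following is a minimal first-order sketch, not a model from this chapter: it assumes failures arrive at a steady average rate, that a failure costs half a checkpoint interval of lost work on average plus a fixed restart time, and that each checkpoint stalls the job for a fixed duration. The function name and the example stall and restart times are hypothetical; only the 192-minute MTTI comes from the text.

```python
def goodput_fraction(interval_s, ckpt_stall_s, restart_s, mtti_s):
    """First-order estimate of the share of wall-clock time spent on useful training."""
    # Steady-state cost: every interval pays one checkpoint stall.
    checkpoint_overhead = ckpt_stall_s / interval_s
    # Failure cost: on average half an interval of work is lost,
    # plus the time to reload the checkpoint and resume.
    failure_overhead = (interval_s / 2 + restart_s) / mtti_s
    return max(0.0, 1.0 - checkpoint_overhead - failure_overhead)


mtti_s = 192 * 60  # mean time to interrupt quoted in the text

# Hypothetical fast storage: 30 s stall, checkpoint every 15 min, 5 min restart.
fast = goodput_fraction(interval_s=900, ckpt_stall_s=30, restart_s=300, mtti_s=mtti_s)
# Hypothetical slow storage: 5 min stall, checkpoint hourly, 20 min restart.
slow = goodput_fraction(interval_s=3600, ckpt_stall_s=300, restart_s=1200, mtti_s=mtti_s)

print(f"fast storage: {fast:.1%}  slow storage: {slow:.1%}")
```

Under these assumed inputs the slow configuration gives up roughly a quarter of the cluster's time compared with the fast one, without any change to the GPUs themselves.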

The four I/O personalities

The reason "AI storage" cannot be procured as a single system is that the AI lifecycle generates four workloads whose I/O profiles are not just different but actively contradictory. They share infrastructure — the same parallel filesystem, the same flash, often the same fabric — but optimizing the shared substrate for one of them de-optimizes it for the others. Hold these four in mind as the lens for every chapter that follows.

1. Ingestion (training-data reads) is high-throughput, large-sequential, read-dominated, and bandwidth-bound. The data loader streams shards of training examples to the GPUs every step, and the only thing it wants is sustained read GB/s. It is forgiving of latency (you prefetch ahead of need) and indifferent to write performance. The failure mode is a CPU-side or storage-side bottleneck that cannot sustain the per-GPU read target, which directly caps GPU utilization. → Chapter 9.5.

2. Checkpointing (state writes) is bursty, large-sequential, write-dominated, and tolerant of per-operation latency but intolerant of total duration. At every checkpoint interval the entire cluster pauses (or, done right, stalls only briefly while the write overlaps with compute) to flush model weights plus optimizer state, ~14 bytes per parameter, to durable storage. The personality is the inverse of ingestion: it wants peak write bandwidth in short violent bursts, and it wants those bursts to finish fast so the GPUs get back to computing. This is why NVIDIA's reference rule is that write bandwidth should be at least half of read bandwidth: checkpoints must drain before the next interval. → Chapter 9.4.

3. Many-small-files / metadata (preprocessing, LOSF) is the personality that breaks naive systems. Datasets composed of millions of tiny files — individual images, JSON records, audio clips — turn the workload from a bandwidth problem into a metadata problem: opens, stats, lookups, and directory operations that hammer the filesystem's metadata service rather than its data path. This is the classic Lots-Of-Small-Files (LOSF) problem, and it is where centralized-metadata filesystems collapse and where distributed-metadata designs earn their keep. Throughput benchmarks tell you nothing about this personality; metadata ops/sec is the relevant number. → Chapter 9.2.

4. KV-cache offload (inference) is the newest personality and the one that broke the old mental model entirely. Long-context, reasoning, and agentic inference generate enormous key-value caches that must be retained, retrieved, and reused across requests — and the binding constraint is tail latency, measured in microseconds, not throughput measured in GB/s. KV-cache offload to NVMe or Ethernet-attached flash is a memory-hierarchy problem, and it is the reason inference is now a first-class storage workload rather than an afterthought. → Chapter 9.7.
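To give a sense of why the fourth personality's caches get large, here is a small sizing sketch using the standard per-token KV accounting for a transformer with grouped-query attention. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example chosen for illustration, not a figure from this chapter.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache held for one sequence of seq_len tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, 16-bit cache entries.
per_token = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1)
long_context = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"128k-token context: {long_context / 1e9:.1f} GB per sequence")
```

A single long-context request at this shape occupies tens of gigabytes, which is why the cache spills out of HBM into lower tiers once many requests are held for reuse.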

The four I/O personalities → what each one actually demands

Personality	I/O pattern	Binding metric	Latency sensitivity	Where it lives	Failure mode if mis-sized
Ingestion (data reads)	Large-sequential, read-heavy, sustained	Read GB/s (per-GPU)	Low — prefetched ahead of need	Parallel FS + local NVMe cache; object capacity behind	GPU utilization caps below 50%; accelerators stall waiting on data
Checkpointing (state writes)	Bursty large-sequential, write-heavy	Write GB/s (peak burst, fast drain)	Low per-op, but burst must finish fast	Local NVMe fast tier → async drain to durable FS	Long stalls per interval; you checkpoint less, lose more work per failure
Many-small-files / metadata	Random small reads, open/stat-heavy	Metadata ops/sec (IOPS)	Moderate — per-file latency compounds	Distributed-metadata parallel FS; sharded formats	Metadata service saturates; throughput collapses regardless of flash speed
KV-cache offload (inference)	Small-random, read-and-write, reuse-driven	Tail latency (microseconds)	Extreme — sits in the request critical path	HBM → DRAM → NVMe → Ethernet-flash hierarchy	Time-to-first-token blows the SLO; fewer concurrent users per GPU

The shared-infrastructure paradox: these four run on the same flash and fabric but optimize for different primitives. The 'binding metric' column is the number you must benchmark; the others mislead. Figures are 2026-current; see the key-number cards in this chapter for sources.

The personalities are not options you select among. Most facilities run several at once, and the storage subsystem must serve all of them on shared hardware. The art is in placement and tiering: route each personality to the tier that fits its binding metric, and isolate the ones that would otherwise collide. The most consequential collision in the building is checkpoint-write incast slamming into ingestion reads (and, worse, into the training collectives if checkpoints are mis-placed onto the back-end fabric). The fix is architectural: a local-NVMe fast tier that absorbs the checkpoint burst and drains it asynchronously, on a fabric that is not the one carrying the all-reduce. This is a co-design problem spanning storage and network. → fabric isolation in Chapter 8.5; checkpoint mechanics in Chapter 9.4.

Deep dive: why the same 14 bytes/parameter is both a tiny number and a tyrant

Checkpoint sizing is governed by one rule of thumb: a mixed-precision training checkpoint is roughly 14 bytes per parameter — weights plus optimizer state (momentum and variance for Adam-class optimizers), in FP32 master copies. That makes the absolute sizes surprisingly modest. GPT-3 175B checkpoints at ~2.45 TB; a 100B model at ~1.4 TB; a full 1-trillion-parameter model at ~13.8 TB. These are not large numbers by capacity standards — a single dense-flash server holds far more.
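The rule of thumb is easy to check by hand. The sketch below simply multiplies parameter count by 14 bytes; the per-component split in the comment (2-byte training weights, 4-byte FP32 master weights, 4-byte momentum, 4-byte variance) is one common accounting and is an assumption here, since the text gives only the total.

```python
# Assumed split of the 14 bytes (the text states only the total):
#   2  low-precision training weights
#   4  FP32 master weights
#   4  Adam first moment (momentum)
#   4  Adam second moment (variance)
BYTES_PER_PARAM = 2 + 4 + 4 + 4


def checkpoint_bytes(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of one full training checkpoint in bytes."""
    return n_params * bytes_per_param


for name, n_params in [("100B", 100_000_000_000), ("GPT-3 175B", 175_000_000_000)]:
    print(f"{name}: {checkpoint_bytes(n_params) / 1e12:.2f} TB")
```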

The tyranny is not the size, it is the cadence under failure. The optimal checkpoint interval scales with how often the cluster fails, and at scale it fails constantly: a 405B run on 16k H100s sees a mean-time-to-interrupt around 192 minutes, which pushes you toward roughly a hundred-plus checkpoints per day — on the order of one every several minutes. Scale to 100k accelerators and the interrupt rate forces a checkpoint roughly every 1.5 minutes. Now the math bites: if you must write ~14 TB every few minutes and finish fast enough that the stall stays under ~10% of training time, the required bandwidth — not capacity — is what sizes the storage tier. The good news is that even frontier runs sustain this with well under 1 TB/s of global checkpoint bandwidth, and async drains are observed at 50–200 GB/s. The lesson is the through-line of this whole part: storage at AI scale is almost always a bandwidth-and-latency problem, not a capacity problem. The full Young/Daly optimal-interval derivation lives in Chapter 9.4.
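A rough sketch of the two calculations in this passage follows. The interval uses Young's first-order approximation (interval = sqrt(2 × checkpoint write time × mean time to interrupt)); the full treatment is deferred to Chapter 9.4, so treat this as a preview under simplifying assumptions. The 30-second write time and the 5-minute interval in the example are hypothetical inputs; the 192-minute MTTI, ~14 TB checkpoint, and ~10% stall budget come from the text.

```python
import math


def young_interval_s(write_time_s, mtti_s):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2 * write_time_s * mtti_s)


def required_write_gb_per_s(checkpoint_bytes, interval_s, max_stall_fraction):
    """Write bandwidth needed so a synchronous checkpoint fits the stall budget."""
    stall_budget_s = interval_s * max_stall_fraction
    return checkpoint_bytes / stall_budget_s / 1e9


# Interval for a 192-minute MTTI, assuming each checkpoint takes 30 s to write.
interval = young_interval_s(write_time_s=30, mtti_s=192 * 60)
print(f"interval: {interval / 60:.1f} min, about {86400 / interval:.0f} checkpoints/day")

# Bandwidth to write ~14 TB every 5 minutes with at most 10% of time stalled.
bandwidth = required_write_gb_per_s(14e12, interval_s=300, max_stall_fraction=0.10)
print(f"required write bandwidth: {bandwidth:.0f} GB/s")
```

The bandwidth figure assumes the GPUs wait for the whole write. An asynchronous drain from a local fast tier relaxes it, because only the local copy sits in the stall budget.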

Where storage lives: the tier hierarchy

Storage in an AI facility is not a single pool but a hierarchy, and every tier exists because the tier above it is too expensive or too small to hold everything. The decision at each boundary is the same shape: how much do you pay in cost-per-terabyte to gain bandwidth and drop latency? Walking the hierarchy from fastest to cheapest:

Node-local NVMe scratch: the fastest tier and the only one with no network in the path, because it sits physically inside the GPU server. It is the natural home for the checkpoint fast tier (absorb the burst locally, drain asynchronously) and for hot data-loader caching. It is scratch: not durable, not shared, wiped on reschedule. PCIe 5.0 today, with PCIe 6.0 and E1.S/E3.S form factors arriving to push per-drive bandwidth higher. → Chapter 9.3.
Networked parallel/distributed filesystem (the primary hot tier) — the all-flash, NVMe-native, RDMA-attached shared namespace that feeds the whole cluster: WEKA, VAST, DDN EXAScaler/Lustre, IBM Storage Scale/GPFS. This is where ingestion reads and durable checkpoints land, and where the throughput-vs-metadata tradeoff is fought. → Chapter 9.2.
Object capacity tier (the data lake) — S3-compatible, increasingly QLC-flash-backed rather than HDD, holding the full corpus, dataset versions, and cold checkpoints. Cheaper per TB, lower bandwidth, the backbone for model distribution and the inference cold tier. → Chapter 9.6.
Archive — the cheapest, coldest tier for retention, lineage, and compliance copies, where retrieval latency is measured in minutes-to-hours and nobody cares because nothing in the training loop touches it.

The cross-cutting distinction that matters more than the tier names is scratch vs durable vs archive. Scratch (local NVMe, ephemeral) is fast and cheap-per-IOP but loses data on failure — perfect for caches and checkpoint staging, fatal for anything you cannot reconstruct. Durable (the parallel FS, replicated or erasure-coded) survives failures and is where your insurance copies live. Archive is durable-and-cheap-and-slow. Mis-classify data across these — e.g. treat the local-NVMe checkpoint stage as durable and skip the async drain — and a node failure takes your checkpoint with it, defeating the entire point of checkpointing.
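The scratch-then-durable pattern can be sketched in a few lines. This is an illustrative toy, not a production checkpointer: `save_checkpoint`, the file naming, and the use of a background thread with a plain file copy are all assumptions made for the example, and two temporary directories stand in for the local NVMe tier and the parallel filesystem.

```python
import os
import shutil
import tempfile
import threading
from pathlib import Path


def save_checkpoint(payload: bytes, step: int, scratch_dir: Path, durable_dir: Path):
    """Write to fast scratch, then drain to the durable tier in the background."""
    name = f"step-{step:08d}.ckpt"
    scratch_path = scratch_dir / name
    tmp_path = scratch_dir / (name + ".tmp")
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes have left the page cache
    os.replace(tmp_path, scratch_path)  # atomic rename: no partial file is visible

    def drain():
        partial = durable_dir / (name + ".partial")
        shutil.copyfile(scratch_path, partial)
        os.replace(partial, durable_dir / name)  # publish only a complete copy

    thread = threading.Thread(target=drain, name=f"drain-{step}")
    thread.start()  # training would resume here while the drain proceeds
    return thread


# Demo with temporary directories standing in for the two tiers.
with tempfile.TemporaryDirectory() as scratch, tempfile.TemporaryDirectory() as durable:
    drain_thread = save_checkpoint(b"\x00" * 4096, 1, Path(scratch), Path(durable))
    drain_thread.join()  # the checkpoint is only insured once the drain completes
    drained_size = (Path(durable) / "step-00000001.ckpt").stat().st_size

print(drained_size)
```

The point of the sketch is the ordering: the scratch copy is not safe to treat as the insurance copy until the drain has finished, so a checkpoint should be counted as valid only after the durable rename.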

Throughput, IOPS, latency, and metadata are NOT interchangeable

The most common storage-procurement error is benchmarking the wrong primitive. These four numbers measure different things and a system can be excellent at one and terrible at another. Throughput (GB/s) is what ingestion and checkpointing care about — large-sequential bandwidth. IOPS (operations/sec) is what random small-file and KV workloads care about — and a system rated for huge sequential throughput can choke on small random I/O. Latency (microseconds, especially tail/p99) is what KV-cache offload lives and dies by — and it is invisible on a throughput chart. Metadata ops/sec is the LOSF killer — and it appears on no bandwidth datasheet at all. Demand the number that matches your dominant personality. A vendor quoting 10 TB/s of sequential read has told you nothing about whether their metadata service survives a 50-million-file dataset, and MLPerf Storage exists precisely because datasheet peaks diverge from mixed-pipeline reality. → benchmarking in Chapter 9.8.
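The gap between throughput and IOPS is simple arithmetic: bandwidth equals operations per second times the size of each operation. The sketch below, with hypothetical helper names and I/O sizes chosen for illustration, shows how the same GB/s target implies very different operation rates depending on I/O size.

```python
def throughput_gb_per_s(iops, io_size_bytes):
    """Bandwidth delivered by a given operation rate at a given I/O size."""
    return iops * io_size_bytes / 1e9


def iops_needed(target_gb_per_s, io_size_bytes):
    """Operation rate required to reach a bandwidth target at a given I/O size."""
    return target_gb_per_s * 1e9 / io_size_bytes


KIB, MIB = 1024, 1024 * 1024

# The same 4 GB/s target, served with large sequential vs small random I/O.
large_io = iops_needed(4, 1 * MIB)
small_io = iops_needed(4, 4 * KIB)
print(f"1 MiB I/O: {large_io:,.0f} ops/s   4 KiB I/O: {small_io:,.0f} ops/s")

# One million 4 KiB operations per second is only about 4 GB/s.
print(f"{throughput_gb_per_s(1_000_000, 4 * KIB):.2f} GB/s")
```

A system that comfortably serves a few thousand large reads per second may be nowhere near the roughly one million small operations per second the second case needs, which is why a sequential-throughput figure says little about small-file behaviour.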

~4 GB/s/GPU

NVIDIA reference read target for vision training; ~1 GB/s/GPU practical floor; 4–10 GB/s/GPU for real multimodal/checkpoint-heavy runs

2025 · NVIDIA DGX SuperPOD storage RA; DDN AI400X2-Turbo baseline

250 / 124 GB/s

per scalable unit read / write, B300 SuperPOD 'Enhanced' tier (8-SU: 2,000 / 992 GB/s); write ≥ ½ read rule

2025 · NVIDIA DGX B300 SuperPOD Storage Architecture

~14 bytes/param

checkpoint size incl. optimizer state — 100B ~1.4 TB; GPT-3 175B ~2.45 TB; 1T params ~13.8 TB; frontier runs sustain on < 1 TB/s global checkpoint BW

2025 · VAST Data; NVIDIA guidance

~192 min

MTTI for a 405B run on 16k H100 (419 interrupts in 54 days); forces frequent checkpoints (~1 every several min); 100k accelerators push toward ~1 per 1.5 min

2024 · Meta (Revisiting Reliability, arXiv 2410.21680)

40+ GB/s

GPUDirect Storage direct-to-GPU; VAST/GDS sustains >90% of 200 Gbps line rate; 2–3x utilization lift in I/O-bound configs

2025 · NVIDIA GPUDirect Storage Design Guide; VAST

< 50% vs > 90%

GPU utilization: naive CSV/small-file loading vs sharded WebDataset/TFRecord + DALI; shard sweet spot ~100 MB–1 GB

2025 · AWS / NVIDIA DALI data-loader guidance

$0.05–0.09/GB

cloud egress (first-tier to at-scale); ~240 TB LAION-5B ~$15,800 to move; data gravity makes PB-scale datasets economically immovable

2025 · Pure Storage; hyperscaler egress schedules

~$36B / ~24% CAGR

AI/high-performance storage market 2025, growing ~24–25% CAGR; vendor consolidation underway

2025 · Market research (DataM, SNS Insider et al.)

How the personalities cascade into the build

Just as the workload archetype cascades into the whole facility in Chapter 1.1, the dominant I/O personality cascades into the whole storage build — and the cascade is causal, not decorative. The personality sets the binding metric; the binding metric sets the media and filesystem; those set the fabric and placement; and that sets the cost structure and the failure blast radius.

Ingestion-dominant (large LLM/vision pre-training) pushes you toward sustained read bandwidth: an all-flash parallel FS sized to the per-GPU read target, fronted by local-NVMe caching, with object capacity behind it. The fabric carries large-sequential reads that prefetch ahead of need, so latency is forgiving but bandwidth is everything.

Checkpoint-dominant (frontier-scale synchronous training) pushes you toward write bandwidth and fast drains: a local-NVMe fast tier to absorb the burst, async/tiered checkpointing to overlap it with compute, and — critically — isolation of checkpoint incast from the training collectives so the insurance system does not sabotage the thing it insures.

Many-small-files-dominant (multimodal preprocessing, classic vision datasets) pushes you toward distributed metadata and sharded data formats. The fix is often upstream of storage entirely: repackage millions of tiny files into a few large shards (WebDataset, TFRecord, Parquet, MDS) so the metadata problem evaporates and the workload becomes a clean sequential-read personality again. This is the cheapest big win in the data path. → Chapter 9.5.

KV-cache-dominant (long-context / agentic inference) pushes you toward a new memory hierarchy entirely — HBM, DRAM, NVMe, and Ethernet-attached flash — managed for microsecond tail latency and cache reuse rather than bulk bandwidth. This personality barely existed in the training-only mental model and now reshapes inference fleet economics. → Chapter 9.7.
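The repackaging step described for the many-small-files case can be sketched with the standard library alone. WebDataset shards are ordinary tar files, so this example uses `tarfile`; the function name, the shard naming scheme, and the tiny demo sizes are hypothetical, and real pipelines would normally use the format's own writer and keep each sample's files adjacent in the archive.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_into_shards(files, out_dir, target_shard_bytes=256 * 1024 * 1024):
    """Pack many small files into a few tar shards of roughly the target size."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shards, tar, size = [], None, 0
    for path in sorted(Path(p) for p in files):
        if tar is None or size >= target_shard_bytes:
            if tar is not None:
                tar.close()
            shard_path = out_dir / f"shard-{len(shards):06d}.tar"
            tar = tarfile.open(str(shard_path), "w")
            shards.append(shard_path)
            size = 0
        # arcname keeps only the file name; assumes names are unique across inputs.
        tar.add(str(path), arcname=path.name)
        size += path.stat().st_size
    if tar is not None:
        tar.close()
    return shards


# Demo: ten 100-byte files packed with a 300-byte shard target.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    for i in range(10):
        (src / f"sample-{i:03d}.json").write_bytes(b"x" * 100)
    shards = pack_into_shards(src.iterdir(), Path(tmp) / "shards", target_shard_bytes=300)
    shard_count = len(shards)
    member_count = 0
    for shard in shards:
        with tarfile.open(str(shard)) as t:
            member_count += len(t.getnames())

print(shard_count, member_count)
```

After packing, the loader opens a handful of large files and reads them sequentially, so the metadata service sees a few opens per shard instead of one per sample.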

Deep dive: the CPU-bypass shift and why storage is leaving the CPU's hands

For decades the data path ran through the CPU: data moved from storage into host DRAM, the CPU copied and staged it, then it crossed into GPU memory. At AI scale that path is a bottleneck — the CPU becomes a tollbooth on a highway it was never sized for, and the bounce through host DRAM wastes bandwidth and adds latency. The 2024–2026 shift is to take the CPU out of the data path entirely.

GPUDirect Storage (GDS) establishes a direct DMA path from NVMe (local or NVMe-oF) into GPU memory, bypassing the host-DRAM bounce — delivering 40+ GB/s direct-to-GPU and 2–3x utilization lifts in I/O-bound configurations. The 2026 frontier pushes further: GPU-initiated I/O (NVIDIA's SCADA) puts the storage control path on the GPU so the CPU is out of the loop entirely, and DPU-offloaded storage (BlueField-4 STX/CMX at 800 Gb/s) moves storage, networking, and security off the host onto a dedicated processor — with WEKA, VAST, and DDN building platforms on it for H2 2026. The consequence for the data-center designer is that the storage data path and the network fabric are converging into one co-designed system, and the CPU:GPU ratio assumptions inherited from the host-centric era are being renegotiated. The mechanics live in Chapter 9.3; the fabric placement decision in Chapter 8.5.

Data gravity: the decision that may already be made

The last decision in this chapter is the one that often outranks all the others, because it can be irreversible before the storage tier is even specced: is your data movable? At petabyte scale, the answer is usually no, not for physics reasons but for economic ones. Cloud egress runs ~$0.05–0.09/GB; moving a ~240 TB dataset like LAION-5B costs ~$15,800 every time, and a single full copy of a 50 PB corpus costs roughly $2.5M–$4.5M at the same rates. That cost, plus sovereign and data-residency rules, means the dataset frequently cannot be relocated to wherever the cheapest GPUs are. The dataset has gravity: it pulls compute toward itself.
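The egress arithmetic is a single multiplication, sketched below with the per-GB range quoted in the text. The helper name and the 1 PB example are illustrative assumptions; actual egress pricing is tiered and provider-specific.

```python
def egress_cost_usd(dataset_bytes, usd_per_gb):
    """Cost of moving a dataset out of a cloud once, at a flat per-GB rate."""
    return dataset_bytes / 1e9 * usd_per_gb


TB, PB = 1e12, 1e15
LOW_RATE, HIGH_RATE = 0.05, 0.09  # USD per GB, the range quoted in the text

for label, size in [("240 TB", 240 * TB), ("1 PB", 1 * PB)]:
    low = egress_cost_usd(size, LOW_RATE)
    high = egress_cost_usd(size, HIGH_RATE)
    print(f"{label}: ${low:,.0f} to ${high:,.0f} per full copy")
```

The cost scales linearly with size and is paid again on every move, which is what turns a one-off placement choice into a standing constraint.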

This inverts the naive build logic. Instead of "site the cluster on cheap power, then move the data to it," data gravity forces move-compute-to-data: the storage placement becomes a primary data-center siting input rather than an afterthought, and it can constrain the power-first siting hierarchy of Chapter 1.1 before the interconnection queue is ever consulted. The fork — replicate the dataset to multiple sites and eat the storage cost, or anchor compute at the data and eat the power-and-siting cost — is one of the highest-leverage strategic decisions in the whole build. It is treated in full, with the multi-site and egress economics, in Chapter 9.8.

The master storage fork: bandwidth-sized vs capacity-sized

Decide whether your storage tier is bandwidth-sized (the primary spec is GB/s, driven by the per-GPU feed and checkpoint-drain requirement) or capacity-sized (the primary spec is PB, driven by corpus size and checkpoint retention). Almost every AI hot tier is bandwidth-sized — and operators who size it by capacity buy a huge array that cannot keep the GPUs fed, stranding accelerators. Almost every AI capacity tier is capacity-sized — and operators who size it by bandwidth overpay for flash speed that the cold corpus never uses. The error in either direction strands the asset you under-prioritized: idle GPUs on one side, idle flash on the other. Name which axis governs each tier before you read a single datasheet, because the datasheet will try to sell you the other one.
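The two sizing axes can be written as two separate calculations, which makes it harder to confuse them. This is a simplified sketch: the function names, the 1,024-GPU cluster, the 2 PB corpus, and the retention count are hypothetical inputs, while the ~4 GB/s/GPU read target, the write-at-least-half-of-read rule, and the ~2.45 TB checkpoint size come from the figures quoted in this chapter.

```python
def size_hot_tier(n_gpus, read_gb_per_s_per_gpu, write_to_read_ratio=0.5):
    """Bandwidth-sized tier: the primary spec is aggregate GB/s."""
    read = n_gpus * read_gb_per_s_per_gpu
    return {"read_gb_per_s": read, "write_gb_per_s": read * write_to_read_ratio}


def size_capacity_tier(corpus_tb, checkpoint_tb, checkpoints_retained):
    """Capacity-sized tier: the primary spec is usable TB."""
    return corpus_tb + checkpoint_tb * checkpoints_retained


# Hypothetical cluster: 1,024 GPUs at the ~4 GB/s/GPU vision-training target.
hot = size_hot_tier(n_gpus=1024, read_gb_per_s_per_gpu=4)
# Hypothetical retention: a 2 PB corpus plus 50 retained 2.45 TB checkpoints.
cold_tb = size_capacity_tier(corpus_tb=2000, checkpoint_tb=2.45, checkpoints_retained=50)

print(hot)
print(f"capacity tier: {cold_tb:,.1f} TB")
```

Note that neither function takes the other tier's primary input: the hot tier is specified without reference to corpus size, and the capacity tier without reference to GPU count.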

Each thread of this chapter opens into its own full treatment: the parallel/distributed filesystems that serve the hot tier in Chapter 9.2; the NVMe media hierarchy and CPU-bypass data path (GPUDirect Storage, SCADA, BlueField DPUs) in Chapter 9.3; the checkpointing math and the Young/Daly optimal interval — the canonical home for that derivation — in Chapter 9.4; the data-loader and preprocessing path in Chapter 9.5; the object/capacity tier in Chapter 9.6; the inference KV-cache memory hierarchy in Chapter 9.7; and sizing, data gravity, and resilience in Chapter 9.8. The goodput framing that makes storage a GPU-efficiency problem is developed in Chapter 12.2; the workload archetype that sets the dominant personality is the subject of Chapter 1.1 and Chapter 1.3; the fabric isolation that keeps checkpoint incast off the training collectives is engineered in Chapter 8.5; and the 2026 storage roadmap forward-pointer lands in Chapter 16.2.