Chapter 9.3
NVMe Tiers, GPUDirect Storage & the CPU-Bypass Data Path
The bottleneck that strands a $40k GPU is rarely the file system's aggregate bandwidth — it is the per-request tax of bouncing every byte through a host CPU and a bounce buffer, and the engineering answer in 2026 is to delete the CPU from the data path entirely.
What you'll decide here
- Where each tier of the media hierarchy lives — node-local NVMe scratch vs networked all-flash vs an Ethernet-attached context tier — and therefore which I/O personality (ingest, checkpoint, KV-cache) each path is sized for.
- How far you bypass the CPU: legacy buffered POSIX, GPUDirect Storage (data path off the CPU), or 2026-class GPU/DPU-initiated I/O (SCADA, BlueField-4) that takes the control path off the CPU too.
- TLC vs QLC per tier — endurance and write-bandwidth headroom for scratch and checkpoint vs $/TB density for the read-mostly capacity and context tiers.
- The PCIe generation and SSD form factor (E1.S vs E3.S) you commit the storage node to — which fixes per-drive bandwidth, drive count, and whether the storage shelf needs liquid cooling.
- Where the storage fabric sits — a dedicated storage rail, converged onto the back-end compute fabric, or on the front-end/management network — and the NVMe-oF transport (RoCE vs TCP) that rides it.
Chapter 9.2 chose the file system. This chapter chooses the data path underneath it, the physical and software route a byte travels from flash to a GPU's HBM, and that path, not the file system's headline aggregate, determines whether the accelerators stay fed. The reason is structural. A parallel file system can advertise multiple terabytes per second across a cluster and still starve an individual GPU, because the classic I/O path routes every read through the host CPU: the NIC DMAs data into a kernel bounce buffer in system DRAM, the CPU stages a copy from there across the PCIe root complex into GPU memory, and only then can the GPU use it. Each hop adds latency and consumes CPU cycles and DRAM bandwidth, and at the small-request, high-concurrency access patterns that dominate AI it leaves the PCIe bus underutilized while the GPU waits. The file system was never the problem. The bounce buffer was.
So the engineering of this chapter is a sequence of bypasses, each one deleting a piece of the host from the path and each one a real fork. First, the media hierarchy: how many tiers of flash, where they sit, and which I/O personality each absorbs. Then GPUDirect Storage (GDS), which takes the data path off the CPU. Then the 2026 frontier, SCADA and BlueField-4, which take the control path off the CPU too, letting the GPU or a DPU initiate I/O with the host processor entirely out of the loop. Underneath all of it sit the silicon decisions (TLC vs QLC, PCIe 5.0 vs 6.0, E1.S vs E3.S) and the fabric question of where the storage traffic rides. Each fork carries a downstream cost in goodput, watts, or stranded capacity.
The media hierarchy: four tiers, four personalities
AI storage is not one medium but a stack, and the stack exists because the four I/O personalities of Chapter 9.1 — sequential ingest, bursty checkpoint writes, metadata-heavy small-file access, and latency-critical KV-cache retrieval — pull in incompatible directions. No single tier serves all four economically. The hierarchy, from hottest to coldest:
Node-local NVMe scratch is the closest persistent medium to the GPU: one to a handful of NVMe drives inside the GPU server itself, on the host PCIe complex, contributing no network traffic at all. It absorbs the staleness-tolerant, node-private workloads — the local checkpoint tier (write to local flash in seconds, drain to the global store asynchronously), data-loader shuffle buffers, and spill space. Its bandwidth is the per-drive ceiling times the drive count, and its great virtue is that it never touches the storage fabric. The cost of skipping it: every checkpoint and every shuffle becomes networked traffic that collides with training collectives. → Chapter 9.4 treats the local-NVMe checkpoint tier in depth.
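The write-local, drain-asynchronously pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production checkpointing library: the function name `save_then_drain`, the directory names, and the `.partial` suffix are all hypothetical, and a real implementation would also fsync, retry, and bound the number of in-flight drains.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def save_then_drain(payload: bytes, name: str,
                    local_dir: Path, global_dir: Path) -> threading.Thread:
    """Write a checkpoint to local scratch, then copy it to the global store
    on a background thread. Returns the thread so callers can join it."""
    local_dir.mkdir(parents=True, exist_ok=True)
    global_dir.mkdir(parents=True, exist_ok=True)

    local_path = local_dir / name
    local_path.write_bytes(payload)  # fast step: the caller continues after this

    def drain() -> None:
        # Copy under a temporary name and rename at the end, so readers of
        # the global store never see a half-written checkpoint.
        partial = global_dir / (name + ".partial")
        shutil.copyfile(local_path, partial)
        partial.replace(global_dir / name)

    worker = threading.Thread(target=drain, name=f"drain-{name}")
    worker.start()
    return worker


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        scratch = Path(root) / "local_nvme"    # stand-in for node-local NVMe
        shared = Path(root) / "global_store"   # stand-in for the networked tier
        t = save_then_drain(b"\x00" * 1024, "step_000100.ckpt", scratch, shared)
        t.join()  # a trainer would keep computing instead of joining here
        print((shared / "step_000100.ckpt").stat().st_size)  # prints 1024
```

The design point is that only the local write sits on the training critical path; the networked copy happens off it and can be rate-limited so it does not compete with collectives.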
Networked all-flash (the hot tier) is the parallel/distributed file system itself (WEKA, VAST, DDN/Lustre, IBM Storage Scale), built on TLC NVMe and reached over RDMA. This is the tier sized to the per-GPU read target and the checkpoint write budget; it is where 9.2's file-system choice lives.
QLC capacity flash (the warm tier) trades endurance and write speed for density: read-mostly datasets, model registries, and the all-flash data lake that has displaced HDD for active data.
The Ethernet-attached context tier is the 2026 newcomer: petabyte-scale flash addressed as inference KV-cache memory (NVIDIA CMX/ICMSP-class), a tier that did not exist in the training-only storage stack and that Chapter 9.7 treats as a memory-hierarchy problem rather than a file-system one.
Beneath these four sits the cold object/archive tier (QLC or HDD, erasure-coded), sized for durability and $/TB rather than for any hot-path personality; it appears in the table below for completeness.
| Tier | Primary personality | Medium | Reach / placement | Sized for |
|---|---|---|---|---|
| Node-local NVMe scratch | Local checkpoint drain, shuffle/spill | TLC NVMe (high-endurance) | Inside the GPU server (off-fabric) | Seconds-to-local write; async drain to global |
| Networked all-flash (hot) | Sequential ingest + checkpoint writes | TLC NVMe over RDMA | Dedicated storage rail or converged | Per-GPU read target; write ≥ ½ read |
| QLC capacity (warm) | Read-mostly datasets, model registry | QLC NVMe + data reduction | Networked; same namespace or object | $/TB density; all-flash data lake |
| Ethernet-attached context | Inference KV-cache offload | QLC/TLC flash as 'memory' | Ethernet-attached, shared per GPU pod | KV reuse; tens-of-µs retrieval |
| Object / archive (cold) | Durable corpus, cold checkpoints | QLC or HDD; erasure-coded | Object store; on-prem or cloud | Durability and $/TB; egress economics |
The discipline is to map personality to tier and resist the temptation to collapse them. Putting checkpoint writes on the QLC capacity tier burns its limited endurance and saturates its weak write path; serving KV-cache from the cold object tier blows the latency budget. The hierarchy is the answer to a sizing question, not a procurement convenience — and the placement column is where it intersects the fabric decision at the end of this chapter.
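The personality-to-tier mapping can be written down as a lookup that a capacity-planning script checks placements against. The sketch below is one reading of the table above, not a definitive policy: the personality keys, the `Tier` enum, and `check_placement` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Tier(Enum):
    LOCAL_SCRATCH = "node-local NVMe scratch"
    HOT = "networked all-flash (hot)"
    WARM_QLC = "QLC capacity (warm)"
    CONTEXT = "Ethernet-attached context"
    COLD = "object / archive (cold)"


# Personality -> tiers that the table treats as an acceptable home.
# Keys are hypothetical labels for the workloads named in the text.
ALLOWED = {
    "local_checkpoint": {Tier.LOCAL_SCRATCH},
    "shuffle_spill": {Tier.LOCAL_SCRATCH},
    "sequential_ingest": {Tier.HOT, Tier.WARM_QLC},
    "checkpoint_write": {Tier.LOCAL_SCRATCH, Tier.HOT},
    "kv_cache": {Tier.CONTEXT},
    "cold_checkpoint": {Tier.COLD},
}


def check_placement(personality: str, tier: Tier) -> bool:
    """Return True if the tier is an acceptable home for the personality."""
    if personality not in ALLOWED:
        raise ValueError(f"unknown personality: {personality!r}")
    return tier in ALLOWED[personality]


if __name__ == "__main__":
    print(check_placement("checkpoint_write", Tier.HOT))       # True
    print(check_placement("checkpoint_write", Tier.WARM_QLC))  # False
    print(check_placement("kv_cache", Tier.COLD))              # False
```

The two `False` cases are the mis-placements the paragraph above warns about: checkpoint writes on QLC, and KV-cache served from the cold tier.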
GPUDirect Storage: taking the data path off the CPU
GPUDirect Storage is the first and most established bypass. It establishes a direct DMA path between storage and GPU memory, eliminating the bounce buffer in host DRAM: the NIC (or local NVMe controller) DMAs data straight into HBM across the PCIe root complex, skipping the CPU-mediated copy entirely. The host CPU still initiates the transfer — it owns the control path, issuing the I/O and managing the file system — but it is no longer in the data path. The payoff is threefold: lower latency (one fewer copy), reclaimed CPU cycles and DRAM bandwidth, and far higher effective bandwidth on the small, concurrent requests that defeat the buffered path. NVIDIA's own benchmarks show direct storage-to-GPU rates above 40 GB/s, with VAST and other platforms sustaining over 90% of line rate over GDS, and I/O-bound configurations seeing a 2–3x lift in GPU utilization once the copy is removed.
GDS is not free to adopt, and the consequences of getting it wrong are concrete. It requires an end-to-end supported stack — a GDS-aware file-system client, a supported NIC, a clean PCIe topology (the GPU and the NIC/NVMe should sit under the same PCIe switch or root complex so the DMA does not traverse the CPU's inter-socket link), and applications that issue I/O through the cuFile API rather than ordinary POSIX read(). Miss any of these and the stack silently falls back to the buffered path — you pay for GDS-capable hardware and get bounce-buffer performance. The fork here is whether your file system and data loaders are GDS-native: a WEKA or VAST or GPFS deployment with a GDS client and a DALI/cuFile loader gets the bypass; a generic NFS mount with a PyTorch loader doing buffered reads does not, no matter what NICs you bought. → loader integration in Chapter 9.5.
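The PCIe-topology requirement is the easiest of these to check mechanically. The sketch below classifies GPU-to-NIC link labels of the kind printed by `nvidia-smi topo -m` (PIX, PXB, PHB, NODE, SYS). The label set follows that tool's legend; the three verdict strings, the function names, and the example device names are assumptions made for illustration, and the matrix is supplied by hand rather than parsed from real tool output.

```python
from typing import Dict, List, Tuple

# Link labels as they appear in the legend of `nvidia-smi topo -m`.
SAME_SWITCH = {"PIX", "PXB"}        # path stays on PCIe bridges, below the host bridge
VIA_HOST_BRIDGE = {"PHB", "NODE"}   # path crosses one or more PCIe host bridges
CROSS_SOCKET = {"SYS"}              # path crosses the inter-socket interconnect


def classify_gds_path(link: str) -> str:
    """Map a topology label to a rough verdict for a GPU<->NIC DMA path."""
    label = link.strip().upper()
    if label in SAME_SWITCH:
        return "direct"
    if label in VIA_HOST_BRIDGE:
        return "via-host-bridge"
    if label in CROSS_SOCKET:
        return "cross-socket"
    raise ValueError(f"unrecognized topology label: {link!r}")


def risky_pairs(matrix: Dict[Tuple[str, str], str]) -> List[Tuple[str, str]]:
    """Return the (gpu, nic) pairs whose path is anything other than direct."""
    return sorted(pair for pair, link in matrix.items()
                  if classify_gds_path(link) != "direct")


if __name__ == "__main__":
    # Hypothetical hand-entered matrix; device names are placeholders.
    topo = {
        ("GPU0", "NIC0"): "PIX",
        ("GPU1", "NIC0"): "SYS",
        ("GPU2", "NIC1"): "PHB",
    }
    print(risky_pairs(topo))  # [('GPU1', 'NIC0'), ('GPU2', 'NIC1')]
```

A clean topology report does not prove the GDS path is active. It only rules out one cause of degraded transfers, so it complements validating the loader and the file-system client rather than replacing it.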
The 2026 frontier: SCADA and BlueField-4 take the control path too
GDS removed the host from the data path but left it owning the control path — the CPU still decides what to read and when. The 2026 generation removes the control path as well, and it does so along two distinct routes that are easy to conflate but architecturally different.
SCADA (GPU-initiated I/O) pushes the storage control path onto the GPU. Where GDS kept the CPU as the initiator, SCADA lets the GPU itself issue and control storage I/O operations directly — the host processor is fully out of both the data and control loops. The motivating problem is inference: a GPU sustaining more than a thousand parallel threads cannot wait on a CPU to orchestrate millions of tiny reads against a multi-petabyte context dataset. Wiwynn's SCADA-class reference server pairs this with extreme media density — 96 liquid-cooled E3.S drives for roughly 2.9 PB in a single server on PCIe 6.0 — precisely because the GPU-initiated path can finally drive that many drives without a CPU bottleneck in the way.
BlueField-4 (DPU-offloaded storage) takes the opposite route: it moves the storage, networking, and security control plane onto a DPU sitting between the network and the GPU, so the host CPU is bypassed by offload rather than by GPU initiation. NVIDIA's BlueField-4 STX architecture, announced at GTC 2026, is built around a storage-optimized BlueField-4 DPU (cited at 800 Gb/s and roughly 6x the compute of BlueField-3) plus a ConnectX-9 SuperNIC, routing data through a dedicated accelerated-storage layer via RDMA over Spectrum-X Ethernet. WEKA, VAST, and DDN are building platforms on it for H2 2026 availability. The same silicon underpins the CMX context-memory tier for inference KV-cache. → the inference memory hierarchy in Chapter 9.7; DPU security offload in Chapter 10.6.
| Path | Data path on CPU? | Control path on CPU? | Initiator | Availability |
|---|---|---|---|---|
| Buffered POSIX (legacy) | Yes (bounce buffer) | Yes | Host CPU | Universal |
| GPUDirect Storage (GDS) | No (direct DMA to HBM) | Yes | Host CPU | Mature (CUDA/cuFile stack) |
| SCADA (GPU-initiated) | No | No | GPU | Emerging 2026 |
| BlueField-4 / STX (DPU-offload) | No | No (on DPU) | DPU | H2 2026 (WEKA/VAST/DDN) |
The silicon: TLC vs QLC, PCIe generation, and form factor
Underneath the data-path software sit three hardware forks that fix the per-node economics, and each maps cleanly onto a tier.
TLC vs QLC is an endurance-and-write-bandwidth versus density tradeoff. TLC (3 bits/cell) has higher write endurance (DWPD) and better sustained write performance — the right medium for the scratch and hot/checkpoint tiers that absorb bursty, write-heavy traffic. QLC (4 bits/cell) packs ~33% more capacity per die at lower $/TB but with materially lower endurance and weaker write throughput, which is exactly tolerable for the read-mostly capacity, data-lake, and context tiers. The mistake is using QLC where writes land hard: a checkpoint stream or a shuffle-spill workload on QLC burns endurance and chokes on the write cliff. Pair the medium to the personality — TLC where it is written, QLC where it is read.
PCIe 5.0 vs 6.0 sets the per-drive ceiling. A PCIe 5.0 x4 enterprise NVMe drive tops out near 14 GB/s sequential read; PCIe 6.0 doubles the lane rate, and the first mass-production Gen6 enterprise drives (Micron's 9650, in volume in 2026) reach roughly 28 GB/s read / 14 GB/s write at around 5.5M IOPS. The consequence is drive count: a fixed per-node read-bandwidth target needs half as many Gen6 drives as Gen5, which reshapes the storage shelf, and the density of GPU-initiated servers packing 96 drives is only sane on Gen6.
E1.S vs E3.S is the form-factor fork that rides along: E1.S (the short EDSFF "ruler" format) is the dense, hot-swap, often liquid-cooling-friendly format favored inside compute and the densest flash servers; E3.S is the higher-power, higher-capacity format common in dedicated storage nodes. The 9650 ships in both, with E1.S the liquid-cooling target, a reminder that at Gen6 power and density the storage shelf inherits a cooling decision of its own.
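The drive-count consequence is simple ceiling arithmetic. The sketch below uses the per-drive read figures quoted above (about 14 GB/s for Gen5 x4, about 28 GB/s for Gen6); the 200 GB/s per-node target is a hypothetical number chosen only to show the calculation, and the model ignores controller, PCIe-lane, and data-protection overheads.

```python
import math

GEN5_READ_GBPS = 14.0  # approx. per-drive sequential read, PCIe 5.0 x4 (from the text)
GEN6_READ_GBPS = 28.0  # approx. per-drive sequential read, PCIe 6.0 (from the text)


def drives_needed(target_gbps: float, per_drive_gbps: float) -> int:
    """Smallest number of drives whose combined ceiling meets the target."""
    if target_gbps <= 0 or per_drive_gbps <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(target_gbps / per_drive_gbps)


if __name__ == "__main__":
    target = 200.0  # hypothetical per-node read target, GB/s
    print(drives_needed(target, GEN5_READ_GBPS))  # 15
    print(drives_needed(target, GEN6_READ_GBPS))  # 8
```

Because of the ceiling, the Gen6 count is not always exactly half the Gen5 count (here 8 versus 15), but the ratio approaches one half as the target grows.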
Storage-fabric placement: where the I/O traffic rides
The last fork is topological: every byte from the networked tiers has to cross a fabric, and which fabric is a co-design decision with the back-end compute network of Part 8. Three placements, three sets of consequences.
A dedicated storage rail — separate switches and NICs for storage — gives the cleanest isolation: checkpoint-write incast and dataset-read bursts never collide with the all-reduce collectives that the training job lives or dies on. It is the conservative, highest-goodput choice, and it is what NVIDIA's reference architectures lean toward (storage on the front-end/storage network, deliberately not on the back-end compute fabric, precisely to keep rebuild and checkpoint traffic off the InfiniBand collectives). The cost is extra NICs, switch ports, and optics — capex and a few watts you spend to buy isolation.
Converging storage onto the back-end compute fabric reuses the non-blocking InfiniBand/RoCE network you already paid for, saving the dedicated rail. It is tempting at small scale and brutal at large: a checkpoint write from 16k GPUs is a synchronized incast event, and dropping it onto the same fabric as the collectives turns a checkpoint into a training stall. If you converge, you must isolate the traffic classes (separate queues, congestion control tuned for incast) or you trade capex for goodput, and goodput is the more expensive currency. → isolating checkpoint incast as a fabric problem in Chapter 8.5 and congestion control in Chapter 8.6.
The front-end/management network is the right home for cold, slow, control-plane storage traffic (image pulls, logs, low-rate object access), never for the hot path.
| Placement | Isolation from collectives | Capex | Best for | Failure mode if wrong |
|---|---|---|---|---|
| Dedicated storage rail | Full — separate switches/NICs | Highest (extra ports/optics) | Training clusters; checkpoint-heavy runs | None for isolation; you overspend if traffic is light |
| Converged onto back-end fabric | Partial — needs traffic-class isolation | Lowest (reuses compute fabric) | Smaller clusters; budget-constrained | Checkpoint incast stalls collectives → goodput loss |
| Front-end / management network | Full (different network) | Low | Cold/control-plane storage only | Hot-path traffic chokes the slow front-end link |
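To put rough numbers on the incast concern, the sketch below applies the "write at least half of read" sizing rule from the tier table and estimates how long a synchronous checkpoint burst occupies the storage path. All inputs are hypothetical placeholders (1,024 GPUs, 2 GB/s read per GPU, an 8,192 GB checkpoint), the function names are illustrative, and the model assumes the burst achieves the full sized write bandwidth, which congestion on a converged fabric would reduce.

```python
def required_write_gbps(per_gpu_read_gbps: float, gpu_count: int) -> float:
    """Aggregate write bandwidth under the rule: write >= 1/2 of read."""
    if per_gpu_read_gbps <= 0 or gpu_count <= 0:
        raise ValueError("inputs must be positive")
    return 0.5 * per_gpu_read_gbps * gpu_count


def checkpoint_burst_seconds(checkpoint_gb: float, write_gbps: float) -> float:
    """Time the fabric carries the checkpoint burst at the given write rate."""
    if checkpoint_gb <= 0 or write_gbps <= 0:
        raise ValueError("inputs must be positive")
    return checkpoint_gb / write_gbps


if __name__ == "__main__":
    write_bw = required_write_gbps(per_gpu_read_gbps=2.0, gpu_count=1024)
    print(write_bw)                                    # 1024.0 (GB/s)
    print(checkpoint_burst_seconds(8192.0, write_bw))  # 8.0 (seconds)
```

On a dedicated rail those seconds are spent off the compute fabric; on a converged fabric they are seconds in which the burst competes with collectives unless the traffic classes are isolated.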
On the rail, the transport itself is the final sub-fork: NVMe-oF over RDMA (RoCE or InfiniBand) delivers tens-of-microseconds latency and is what makes networked flash feel local; NVMe/TCP trades that latency for ubiquity and operational simplicity (no lossless-Ethernet PFC/ECN tuning, runs on any IP network). For the hot tier feeding GPUs, RDMA is effectively mandatory — TCP's latency forfeits the per-request advantage the whole bypass stack exists to capture. Between the two RDMA flavors, the InfiniBand-vs-RoCE decision is the same one Part 8 makes for the compute fabric, with the same congestion-control caveat: RoCE has closed most of the gap and Spectrum-X makes converged Ethernet viable for GDS and KV-offload at scale, but untuned RoCE under checkpoint incast is where tail latency goes to die. → the IB-vs-RoCE transport decision in Chapter 8.5.
Deep dive: walking a single training read through the buffered path vs GDS vs SCADA
Trace one read of one data shard and the bypasses become concrete. Buffered POSIX: the data loader calls read(); the kernel issues an NVMe-oF request; the host's NIC DMAs the returned data into a kernel bounce buffer in system DRAM; the CPU copies it from the bounce buffer into a user buffer, still in DRAM; the framework then copies it again into pinned memory and DMAs it across the PCIe root complex to GPU HBM. Two or three copies, a context switch, and CPU and DRAM bandwidth consumed on every shard: fine for one stream, fatal for ten thousand concurrent small reads.
GPUDirect Storage: the loader calls cuFile instead; the GDS stack programs a DMA directly from the storage target (or local NVMe) into GPU HBM across the PCIe complex. The bounce buffer is gone, the CPU copy is gone, the DRAM round-trip is gone. The CPU still issued the request — it owns the control path — but it is no longer touched by the bytes. This is the 2–3x utilization lift in I/O-bound configs, and it is also why PCIe topology matters: if the NIC and GPU sit on different root complexes, the DMA crosses the CPU's inter-socket link and you lose part of the win.
SCADA / DPU-offload: now even the request is gone from the CPU. The GPU itself (SCADA) or a BlueField-4 DPU (STX) initiates and manages the I/O; the host processor is out of both paths. For an inference engine running a thousand parallel threads against a 2.9 PB context store, this is the difference between feeding the GPU and starving it on CPU orchestration overhead: the host simply cannot dispatch millions of sub-4 KB reads fast enough, so you stop asking it to. The progression is monotone: the buffered path removes nothing, GDS removes the CPU from the data path, and SCADA/DPU-offload removes it from the control path as well. Each rung buys goodput; each rung costs portability.
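Why the per-request tax matters most for small reads can be seen with a one-line model: delivered throughput for a single request stream is the request size divided by fixed overhead plus wire time. This is a toy sketch, not a measurement. The overhead values and the 25 GB/s link rate below are hypothetical placeholders chosen only to show the shape of the curve, and the model ignores queue depth and concurrency.

```python
def effective_throughput(request_bytes: int, overhead_s: float,
                         link_bytes_per_s: float) -> float:
    """Bytes per second delivered by one serial stream of requests."""
    if request_bytes <= 0 or link_bytes_per_s <= 0 or overhead_s < 0:
        raise ValueError("size and rate must be positive, overhead non-negative")
    transfer_s = request_bytes / link_bytes_per_s
    return request_bytes / (overhead_s + transfer_s)


# Hypothetical per-request software overheads, for illustration only.
OVERHEAD_S = {
    "buffered": 50e-6,
    "gds": 20e-6,
    "gpu_or_dpu_initiated": 5e-6,
}
LINK_BYTES_PER_S = 25e9  # hypothetical 25 GB/s path into the GPU

if __name__ == "__main__":
    for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        for path, overhead in OVERHEAD_S.items():
            rate = effective_throughput(size, overhead, LINK_BYTES_PER_S)
            print(f"{size:>10} B  {path:<22} {rate / 1e9:7.3f} GB/s")
```

Under these assumed numbers, 4 KiB requests are almost entirely overhead-bound on every path, while 64 MiB requests approach the link rate regardless of path. Real systems hide part of the overhead with deep queues, which is exactly the dispatch work the host struggles to sustain at high concurrency.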
Deep dive: why QLC won the capacity tier and where it must not go
The all-flash-everywhere trend of 2026 rests on QLC plus data reduction beating HDD on total cost for active data — not on QLC matching TLC. QLC's economics come from packing 4 bits per cell (33% more capacity per die than TLC's 3), and layered with dedup, compression, and similarity reduction (VAST-style), it makes a petabyte-scale all-flash data lake cost-competitive while delivering flash read latency the corpus actually benefits from during training ingest. That is the right home for QLC: read-mostly datasets, model registries, the warm capacity tier, and the read-dominated context tier.
The trap is writes. QLC's endurance (drive-writes-per-day) is a fraction of TLC's, and its sustained write throughput collapses once the SLC cache fills. Land a checkpoint stream — bursty, large, write-heavy, repeated every few minutes at scale — on QLC and you both burn through its rated endurance and hit the write cliff exactly when the synchronous checkpoint needs full bandwidth, stalling the training job. The rule is mechanical: TLC where it is written hard (scratch, hot/checkpoint), QLC where it is read (capacity, lake, context). The checkpoint write-bandwidth target — NVIDIA's guidance that write should be at least half of read — is a TLC-tier sizing constraint, not a QLC one. → checkpoint bandwidth sizing in Chapter 9.4 and the capacity tier in Chapter 9.6.
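The endurance half of this rule can be checked with back-of-envelope arithmetic: the drive-writes-per-day a checkpoint stream demands is the bytes written per day divided by the tier's capacity. The sketch below is a simplified model with hypothetical inputs (a 10 TB checkpoint every 30 minutes onto a 1,000 TB tier). It ignores write amplification and data-protection overhead, both of which raise the real figure, and rated DWPD values must come from the drive datasheet rather than from this example.

```python
MINUTES_PER_DAY = 24 * 60


def demanded_dwpd(checkpoint_tb: float, interval_min: float,
                  tier_capacity_tb: float) -> float:
    """Drive-writes-per-day implied by a periodic checkpoint stream."""
    if checkpoint_tb <= 0 or interval_min <= 0 or tier_capacity_tb <= 0:
        raise ValueError("inputs must be positive")
    writes_per_day = MINUTES_PER_DAY / interval_min
    return checkpoint_tb * writes_per_day / tier_capacity_tb


def fits_endurance(demanded: float, rated_dwpd: float) -> bool:
    """True if the demanded write rate is within the drive's rated DWPD."""
    return demanded <= rated_dwpd


if __name__ == "__main__":
    need = demanded_dwpd(checkpoint_tb=10.0, interval_min=30.0,
                         tier_capacity_tb=1000.0)
    print(round(need, 2))                        # 0.48
    # Rated values below are placeholders, not datasheet figures.
    print(fits_endurance(need, rated_dwpd=1.0))  # True
    print(fits_endurance(need, rated_dwpd=0.3))  # False
```

The same stream that fits comfortably within a higher-endurance rating exceeds a lower one, and that is before the sustained-write collapse described above is taken into account.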
Anti-patterns
The recurring mis-builds in this layer all come from optimizing the file system's headline number while ignoring the path:
- Buying GDS-capable hardware and running buffered I/O. A supported NIC and file system mean nothing if the data loader issues ordinary read() and silently falls back to the bounce buffer. You pay for the bypass and get bounce-buffer latency. Validate the path, not the bill of materials.
- QLC under the checkpoint stream. Putting write-heavy, bursty checkpoint traffic on the cheap capacity tier, burning endurance and hitting the write cliff at the worst moment. TLC absorbs writes; QLC serves reads.
- Converging storage onto the collective fabric without traffic isolation. Saving a dedicated rail by dropping 16k-GPU checkpoint incast onto the same InfiniBand that carries all-reduce — converting a checkpoint into a training stall. The capex you saved is dwarfed by the goodput you lost.
- Treating the storage shelf as a thermal afterthought at Gen6. Speccing a 96-drive Gen6 flash server into an air-cooled storage row and stranding the density behind a cooling wall the GPUs already crossed years ago.