Chapter 9.3

NVMe Tiers, GPUDirect Storage & the CPU-Bypass Data Path

The bottleneck that strands a $40k GPU is rarely the file system's aggregate bandwidth — it is the per-request tax of bouncing every byte through a host CPU and a bounce buffer, and the engineering answer in 2026 is to delete the CPU from the data path entirely.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Where each tier of the media hierarchy lives — node-local NVMe scratch vs networked all-flash vs an Ethernet-attached context tier — and therefore which I/O personality (ingest, checkpoint, KV-cache) each path is sized for.
How far you bypass the CPU: legacy buffered POSIX, GPUDirect Storage (data path off the CPU), or 2026-class GPU/DPU-initiated I/O (SCADA, BlueField-4) that takes the control path off the CPU too.
TLC vs QLC per tier — endurance and write-bandwidth headroom for scratch and checkpoint vs $/TB density for the read-mostly capacity and context tiers.
The PCIe generation and SSD form factor (E1.S vs E3.S) you commit the storage node to — which fixes per-drive bandwidth, drive count, and whether the storage shelf needs liquid cooling.
Where the storage fabric sits — a dedicated storage rail, converged onto the back-end compute fabric, or on the front-end/management network — and the NVMe-oF transport (RoCE vs TCP) that rides it.

Chapter 9.2 chose the file system. This chapter chooses the data path underneath it, the physical and software route a byte travels from flash to a GPU's HBM, and that path, not the file system's headline aggregate, is what determines whether the accelerators stay fed. The reason is structural. A parallel file system can advertise multiple terabytes per second across a cluster and still starve an individual GPU, because the classic I/O path routes every read through the host CPU: the NIC DMAs data into a kernel bounce buffer in system DRAM, the CPU copies it across the PCIe root complex into GPU memory, and only then does the GPU see it. Each hop adds latency and consumes CPU cycles and DRAM bandwidth, and at the small-request, high-concurrency access patterns that dominate AI it leaves the PCIe bus underutilized while the GPU sits idle. The file system was never the problem. The bounce buffer was.

So the engineering of this chapter is a sequence of bypasses, each one deleting a piece of the host from the path. First, the media hierarchy: how many tiers of flash, where they sit, and which I/O personality each absorbs. Then GPUDirect Storage (GDS), which takes the data path off the CPU. Then the 2026 frontier, SCADA and BlueField-4, which take the control path off the CPU too, letting the GPU or a DPU initiate I/O with the host processor entirely out of the loop. Underneath all of it sit the silicon decisions (TLC vs QLC, PCIe 5.0 vs 6.0, E1.S vs E3.S) and the fabric question of where the storage traffic rides. Each is a real fork, and each carries a downstream cost in goodput, watts, or stranded capacity.

The media hierarchy: four tiers, four personalities

AI storage is not one medium but a stack, and the stack exists because the four I/O personalities of Chapter 9.1 — sequential ingest, bursty checkpoint writes, metadata-heavy small-file access, and latency-critical KV-cache retrieval — pull in incompatible directions. No single tier serves all four economically. The hierarchy, from hottest to coldest:

Node-local NVMe scratch is the closest persistent medium to the GPU: one to a handful of NVMe drives inside the GPU server itself, on the host PCIe complex, contributing no network traffic at all. It absorbs the staleness-tolerant, node-private workloads — the local checkpoint tier (write to local flash in seconds, drain to the global store asynchronously), data-loader shuffle buffers, and spill space. Its bandwidth is the per-drive ceiling times the drive count, and its great virtue is that it never touches the storage fabric. The cost of skipping it: every checkpoint and every shuffle becomes networked traffic that collides with training collectives. → Chapter 9.4 treats the local-NVMe checkpoint tier in depth.

Networked all-flash (the hot tier) is the parallel/distributed file system itself — WEKA, VAST, DDN/Lustre, IBM Storage Scale — built on TLC NVMe and reached over RDMA. This is the tier sized to the per-GPU read target and the checkpoint write budget; it is where 9.2's file-system choice lives. QLC capacity flash (the warm tier) trades endurance and write speed for density: read-mostly datasets, model registries, and the all-flash data lake that has displaced HDD for active data. The Ethernet-attached context tier is the 2026 newcomer — petabyte-scale flash addressed as inference KV-cache memory (NVIDIA CMX/ICMSP-class), a tier that did not exist in the training-only storage stack and that Chapter 9.7 treats as a memory-hierarchy problem rather than a file-system one.

The media hierarchy → personality, medium, and placement

Tier	Primary personality	Medium	Reach / placement	Sized for
Node-local NVMe scratch	Local checkpoint drain, shuffle/spill	TLC NVMe (high-endurance)	Inside the GPU server (off-fabric)	Seconds-to-local write; async drain to global
Networked all-flash (hot)	Sequential ingest + checkpoint writes	TLC NVMe over RDMA	Dedicated storage rail or converged	Per-GPU read target; write ≥ ½ read
QLC capacity (warm)	Read-mostly datasets, model registry	QLC NVMe + data reduction	Networked; same namespace or object	$/TB density; all-flash data lake
Ethernet-attached context	Inference KV-cache offload	QLC/TLC flash as 'memory'	Ethernet-attached, shared per GPU pod	KV reuse; tens-of-µs retrieval
Object / archive (cold)	Durable corpus, cold checkpoints	QLC or HDD; erasure-coded	Object store; on-prem or cloud	Durability and $/TB; egress economics

Per-tier sizing follows the I/O personality it absorbs; bandwidth figures are 2026-class reference points (see the key numbers later in this chapter). 'Off-fabric' means traffic stays inside the GPU node and never touches the storage network.

The discipline is to map personality to tier and resist the temptation to collapse them. Putting checkpoint writes on the QLC capacity tier burns its limited endurance and saturates its weak write path; serving KV-cache from the cold object tier blows the latency budget. The hierarchy is the answer to a sizing question, not a procurement convenience — and the placement column is where it intersects the fabric decision at the end of this chapter.
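The mapping in the table above can be written down as data so that a placement review becomes mechanical. The following is a minimal sketch, not a vendor tool: the tier and personality names are shorthand for the table rows, and `check_placement` is a hypothetical helper introduced only for illustration.

```python
# Personality -> tiers the table above sizes for it.
# All names are illustrative shorthand for the table rows.
ALLOWED_TIERS = {
    "local_checkpoint": {"node_local_nvme"},
    "shuffle_spill": {"node_local_nvme"},
    "sequential_ingest": {"networked_all_flash"},
    "checkpoint_write": {"networked_all_flash"},
    "read_mostly_dataset": {"qlc_capacity"},
    "model_registry": {"qlc_capacity"},
    "kv_cache_offload": {"ethernet_context"},
    "cold_checkpoint": {"object_archive"},
    "durable_corpus": {"object_archive"},
}


def check_placement(personality: str, tier: str) -> bool:
    """Return True if the tier is one the table sizes for this personality."""
    if personality not in ALLOWED_TIERS:
        raise KeyError(f"unknown personality: {personality}")
    return tier in ALLOWED_TIERS[personality]


if __name__ == "__main__":
    # The two mis-placements called out in the text both fail the check.
    print(check_placement("checkpoint_write", "qlc_capacity"))
    print(check_placement("kv_cache_offload", "object_archive"))
    print(check_placement("sequential_ingest", "networked_all_flash"))
```

The mapping is deliberately strict. A real deployment may allow a personality on more than one tier, in which case the sets grow, but the check stays the same.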

GPUDirect Storage: taking the data path off the CPU

GPUDirect Storage is the first and most established bypass. It establishes a direct DMA path between storage and GPU memory, eliminating the bounce buffer in host DRAM: the NIC (or local NVMe controller) DMAs data straight into HBM across the PCIe root complex, skipping the CPU-mediated copy entirely. The host CPU still initiates the transfer — it owns the control path, issuing the I/O and managing the file system — but it is no longer in the data path. The payoff is threefold: lower latency (one fewer copy), reclaimed CPU cycles and DRAM bandwidth, and far higher effective bandwidth on the small, concurrent requests that defeat the buffered path. NVIDIA's own benchmarks show direct storage-to-GPU rates above 40 GB/s, with VAST and other platforms sustaining over 90% of line rate over GDS, and I/O-bound configurations seeing a 2–3x lift in GPU utilization once the copy is removed.

GDS is not free to adopt, and the consequences of getting it wrong are concrete. It requires an end-to-end supported stack — a GDS-aware file-system client, a supported NIC, a clean PCIe topology (the GPU and the NIC/NVMe should sit under the same PCIe switch or root complex so the DMA does not traverse the CPU's inter-socket link), and applications that issue I/O through the cuFile API rather than ordinary POSIX read(). Miss any of these and the stack silently falls back to the buffered path — you pay for GDS-capable hardware and get bounce-buffer performance. The fork here is whether your file system and data loaders are GDS-native: a WEKA or VAST or GPFS deployment with a GDS client and a DALI/cuFile loader gets the bypass; a generic NFS mount with a PyTorch loader doing buffered reads does not, no matter what NICs you bought. → loader integration in Chapter 9.5.
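Because the fallback is silent, it can help to write the prerequisites down as an explicit checklist before any benchmark is run. The sketch below is a planning aid under stated assumptions: the class, field, and function names are hypothetical, it queries no driver or cuFile state, and it treats all four prerequisites as hard requirements exactly as the paragraph above does.

```python
from dataclasses import dataclass


@dataclass
class GdsStack:
    """The four prerequisites named in the text; field names are illustrative."""
    gds_aware_fs_client: bool
    supported_nic: bool
    gpu_and_nic_share_pcie_switch: bool
    loader_uses_cufile: bool


def missing_prerequisites(stack: GdsStack) -> list[str]:
    """List every prerequisite that is not met, in declaration order."""
    return [name for name, ok in vars(stack).items() if not ok]


def expected_path(stack: GdsStack) -> str:
    """Any missing prerequisite is treated as a fall back to the buffered path."""
    return "gds" if not missing_prerequisites(stack) else "buffered-fallback"


if __name__ == "__main__":
    # Generic NFS mount with a buffered loader on otherwise capable hardware.
    nfs_buffered = GdsStack(
        gds_aware_fs_client=False,
        supported_nic=True,
        gpu_and_nic_share_pcie_switch=True,
        loader_uses_cufile=False,
    )
    print(expected_path(nfs_buffered), missing_prerequisites(nfs_buffered))
```

A checklist like this only records intent. Whether the direct path actually engages still has to be confirmed by measurement on the deployed stack.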

Why aggregate bandwidth lies and per-request latency tells the truth

A file system rated at 4 TB/s can still starve a single GPU. AI access patterns are not one giant sequential stream — they are thousands of concurrent small reads (samples, shards, KV blocks), and at small request sizes the per-request overhead dominates. The buffered path adds a kernel copy and a context switch to every one of those requests; multiply by thousands of threads and the PCIe bus sits idle while the CPU thrashes. NVIDIA's own framing for SCADA is exactly this: a GPU running 1,000+ parallel inference threads cannot be fed by SSDs whose response time for sub-4 KB requests leaves the PCIe link underutilized. The lesson for sizing: validate the path with a small-request, high-queue-depth benchmark (and MLPerf Storage's metadata-heavy mixes), not just a streaming fio that flatters the datasheet. → benchmarking and acceptance in Chapter 9.8.
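The effect of per-request overhead can be seen in a simple latency-bound model. This is a sketch under stated assumptions, not a measurement: every request pays a fixed overhead plus its wire time, and the link rate and overhead values in the example are illustrative placeholders rather than figures from any datasheet.

```python
def effective_bandwidth(request_bytes: float,
                        per_request_overhead_s: float,
                        link_bytes_per_s: float,
                        in_flight: int = 1) -> float:
    """Delivered bytes/s when each request pays a fixed overhead plus wire time.

    With `in_flight` requests outstanding, throughput is
    in_flight * size / (overhead + size / link), capped at the link rate.
    """
    service_time = per_request_overhead_s + request_bytes / link_bytes_per_s
    return min(link_bytes_per_s, in_flight * request_bytes / service_time)


if __name__ == "__main__":
    LINK = 50e9              # assumed link rate in bytes/s (placeholder)
    SLOW, FAST = 100e-6, 10e-6   # assumed per-request overheads (placeholders)
    for size in (4 * 1024, 64 * 1024, 1024 * 1024):
        slow = effective_bandwidth(size, SLOW, LINK, in_flight=64)
        fast = effective_bandwidth(size, FAST, LINK, in_flight=64)
        print(f"{size:>8} B  slow={slow / 1e9:6.2f} GB/s  fast={fast / 1e9:6.2f} GB/s")
```

In this model the gap between the two overhead settings is largest at the smallest request sizes and closes as requests grow. That is why a streaming benchmark cannot stand in for a small-request one.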

The 2026 frontier: SCADA and BlueField-4 take the control path too

GDS removed the host from the data path but left it owning the control path — the CPU still decides what to read and when. The 2026 generation removes the control path as well, and it does so along two distinct routes that are easy to conflate but architecturally different.

SCADA (GPU-initiated I/O) pushes the storage control path onto the GPU. Where GDS kept the CPU as the initiator, SCADA lets the GPU itself issue and control storage I/O operations directly — the host processor is fully out of both the data and control loops. The motivating problem is inference: a GPU sustaining more than a thousand parallel threads cannot wait on a CPU to orchestrate millions of tiny reads against a multi-petabyte context dataset. Wiwynn's SCADA-class reference server pairs this with extreme media density — 96 liquid-cooled E3.S drives for roughly 2.9 PB in a single server on PCIe 6.0 — precisely because the GPU-initiated path can finally drive that many drives without a CPU bottleneck in the way.

BlueField-4 (DPU-offloaded storage) takes the opposite route: it moves the storage, networking, and security control plane onto a DPU sitting between the network and the GPU, so the host CPU is bypassed by offload rather than by GPU initiation. NVIDIA's BlueField-4 STX architecture, announced at GTC 2026, is built around a storage-optimized BlueField-4 DPU (cited at 800 Gb/s and roughly 6x the compute of BlueField-3) plus a ConnectX-9 SuperNIC, routing data through a dedicated accelerated-storage layer via RDMA over Spectrum-X Ethernet. WEKA, VAST, and DDN are building platforms on it for H2 2026 availability. The same silicon underpins the CMX context-memory tier for inference KV-cache. → the inference memory hierarchy in Chapter 9.7; DPU security offload in Chapter 10.6.

The CPU-bypass ladder → what each rung removes

Path	Data path on CPU?	Control path on CPU?	Initiator	Availability
Buffered POSIX (legacy)	Yes (bounce buffer)	Yes	Host CPU	Universal
GPUDirect Storage (GDS)	No (direct DMA to HBM)	Yes	Host CPU	Mature (CUDA/cuFile stack)
SCADA (GPU-initiated)	No	No	GPU	Emerging 2026
BlueField-4 / STX (DPU-offload)	No	No (on DPU)	DPU	H2 2026 (WEKA/VAST/DDN)

Each rung deletes more of the host from the path; later rungs require newer silicon and a narrower supported-stack window. 'Control path' = who decides/initiates I/O; 'data path' = where the bytes flow.
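The ladder can also be expressed as a small data structure, which makes the "lowest rung that does the job" choice explicit. This is an illustrative sketch: the `Rung` type and `lowest_rung` helper are hypothetical names, and the rows simply restate the table above.

```python
from typing import NamedTuple


class Rung(NamedTuple):
    name: str
    data_path_on_cpu: bool
    control_path_on_cpu: bool
    initiator: str


# Rows restate the table, ordered from the bottom rung to the top.
LADDER = [
    Rung("Buffered POSIX", True, True, "host CPU"),
    Rung("GPUDirect Storage", False, True, "host CPU"),
    Rung("SCADA", False, False, "GPU"),
    Rung("BlueField-4 / STX", False, False, "DPU"),
]


def lowest_rung(need_data_bypass: bool, need_control_bypass: bool) -> Rung:
    """First rung, bottom up, that removes what the workload needs removed."""
    for rung in LADDER:
        if need_data_bypass and rung.data_path_on_cpu:
            continue
        if need_control_bypass and rung.control_path_on_cpu:
            continue
        return rung
    raise ValueError("no rung satisfies the requirement")


if __name__ == "__main__":
    print(lowest_rung(True, False).name)
    print(lowest_rung(True, True).name)
```

Searching bottom up reflects the caption's point that later rungs need newer silicon and a narrower supported stack. The helper returns the first match only, so choosing between the two top rungs remains an architectural decision.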

The lock-in tax of climbing the bypass ladder

Each rung up the ladder narrows your supplier set and deepens vendor coupling. GDS is broadly supported across file systems and NICs. SCADA and BlueField-4 are, as of 2026, a far more concentrated ecosystem — a specific DPU/SuperNIC generation, a specific transport (RDMA over Spectrum-X), and file-system platforms that have explicitly ported to it. Adopting the top rung can mean committing the storage and the network and the security plane to one vendor's silicon roadmap. The decision is genuine: GPU/DPU-initiated I/O is the right answer for petabyte-context inference and the only way to drive 96-drive flash servers without a CPU wall — but price the portability you are trading away, and keep the GDS path as the fallback your file system still speaks. → fabric and DPU-isolation tradeoffs in Chapter 8.5.

The silicon: TLC vs QLC, PCIe generation, and form factor

Underneath the data-path software sit three hardware forks that fix the per-node economics, and each maps cleanly onto a tier.

TLC vs QLC is an endurance-and-write-bandwidth versus density tradeoff. TLC (3 bits/cell) has higher write endurance (DWPD) and better sustained write performance — the right medium for the scratch and hot/checkpoint tiers that absorb bursty, write-heavy traffic. QLC (4 bits/cell) packs ~33% more capacity per die at lower $/TB but with materially lower endurance and weaker write throughput, which is exactly tolerable for the read-mostly capacity, data-lake, and context tiers. The mistake is using QLC where writes land hard: a checkpoint stream or a shuffle-spill workload on QLC burns endurance and chokes on the write cliff. Pair the medium to the personality — TLC where it is written, QLC where it is read.
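The endurance side of this tradeoff reduces to simple arithmetic: the drive-writes-per-day a workload imposes is the bytes written per day divided by the capacity absorbing them. The sketch below assumes a hypothetical checkpoint size, interval, and tier capacity, and the function name is illustrative. Compare the result against the rated DWPD on the drive's own datasheet.

```python
def required_dwpd(checkpoint_tb: float,
                  interval_min: float,
                  tier_capacity_tb: float) -> float:
    """Drive-writes-per-day that a periodic checkpoint stream imposes on a tier.

    Ignores write amplification and erasure-coding overhead, both of which
    push the real figure higher.
    """
    checkpoints_per_day = 24 * 60 / interval_min
    return checkpoint_tb * checkpoints_per_day / tier_capacity_tb


if __name__ == "__main__":
    # Hypothetical workload: 10 TB checkpoint every 10 minutes onto a 1 PB tier.
    print(f"{required_dwpd(10, 10, 1000):.2f} DWPD")
```

If the computed figure is near or above a medium's rated DWPD, that medium is the wrong home for the stream regardless of its $/TB.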

PCIe 5.0 vs 6.0 sets the per-drive ceiling. A PCIe 5.0 x4 enterprise NVMe drive tops out near 14 GB/s sequential read; PCIe 6.0 doubles the lane rate, and the first mass-production Gen6 enterprise drives (Micron's 9650, in volume in 2026) reach roughly 28 GB/s read / 14 GB/s write at around 5.5M IOPS. The consequence is drive count: a fixed per-node read-bandwidth target needs half as many Gen6 drives as Gen5, which reshapes the storage shelf, and the density of GPU-initiated servers packing 96 drives is only sane on Gen6.

E1.S vs E3.S is the form-factor fork that rides along: E1.S (the ruler) is the dense, hot-swap, often liquid-cooled-friendly format favored inside compute and the densest flash servers; E3.S is the higher-power, higher-capacity format common in dedicated storage nodes. The 9650 ships in both, with E1.S the liquid-cooling target, a reminder that at Gen6 power and density the storage shelf inherits a cooling decision of its own.
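The drive-count consequence of the PCIe generation is a one-line calculation. In this sketch the per-drive ceilings are the sequential-read figures quoted above, while the 224 GB/s per-node target is a hypothetical value chosen only to make the arithmetic clean. Real drives rarely sustain their datasheet peak, so treat the result as a lower bound.

```python
import math

GEN5_READ_GB_S = 14.0  # PCIe 5.0 x4 sequential-read ceiling quoted in the text
GEN6_READ_GB_S = 28.0  # Micron 9650 sequential-read figure quoted in the text


def drives_needed(target_gb_s: float, per_drive_gb_s: float) -> int:
    """Minimum drive count to reach a read-bandwidth target at datasheet peak."""
    return math.ceil(target_gb_s / per_drive_gb_s)


if __name__ == "__main__":
    target = 224.0  # hypothetical per-node read target in GB/s
    print("Gen5 drives:", drives_needed(target, GEN5_READ_GB_S))
    print("Gen6 drives:", drives_needed(target, GEN6_READ_GB_S))
```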

Key numbers

Figure	What it measures	As of	Source
40+ GB/s	Direct storage-to-GPU over GPUDirect Storage; VAST/GDS sustains >90% of line rate; 2–3x GPU-utilization lift in I/O-bound configs	2025	NVIDIA GPUDirect Storage; Introl; VAST
~28 / 14 GB/s	PCIe 6.0 x4 enterprise SSD sequential read/write (Micron 9650, first Gen6 in mass production), ~5.5M IOPS; E1.S & E3.S	2026	Micron; Tom's Hardware
~2.9 PB	Per SCADA-class flash server: 96 liquid-cooled E3.S drives on PCIe 6.0 (GPU-initiated I/O)	2026	Wiwynn / NVIDIA SCADA; Tom's Hardware
800 Gb/s	BlueField-4 DPU throughput (STX storage architecture), ~6x BlueField-3 compute; WEKA/VAST/DDN platforms	H2 2026	NVIDIA Newsroom (GTC 2026)
250 / 124 GB/s	Per scalable unit, DGX B300 SuperPOD 'Enhanced' storage tier (read/write); 8-SU = 2,000 / 992 GB/s	2025	NVIDIA DGX SuperPOD (B300) RA
tens of µs	NVMe-oF read latency over RDMA (RoCE/IB) vs milliseconds for legacy NFS/iSCSI	2025	Introl; NVMe-oF spec
~4 GB/s	NVIDIA reference read-bandwidth target per GPU (vision); practical floor ~1 GB/s/GPU; multimodal/checkpoint-heavy 4–10 GB/s/GPU	2025	NVIDIA SuperPOD storage guidance
~$36B	AI storage market (2025), ~24% CAGR; QLC + data reduction displacing HDD across hot and capacity tiers	2025	Industry surveys (per domain research)

Storage-fabric placement: where the I/O traffic rides

The last fork is topological: every byte from the networked tiers has to cross a fabric, and which fabric is a co-design decision with the back-end compute network of Part 8. Three placements, three sets of consequences.

A dedicated storage rail — separate switches and NICs for storage — gives the cleanest isolation: checkpoint-write incast and dataset-read bursts never collide with the all-reduce collectives that the training job lives or dies on. It is the conservative, highest-goodput choice, and it is what NVIDIA's reference architectures lean toward (storage on the front-end/storage network, deliberately not on the back-end compute fabric, precisely to keep rebuild and checkpoint traffic off the InfiniBand collectives). The cost is extra NICs, switch ports, and optics — capex and a few watts you spend to buy isolation.

Converging storage onto the back-end compute fabric reuses the non-blocking InfiniBand/RoCE network you already paid for, saving the dedicated rail. It is tempting at small scale and brutal at large: a checkpoint write from 16k GPUs is a synchronized incast event, and dropping it onto the same fabric as the collectives turns a checkpoint into a training stall. If you converge, you must isolate the traffic classes (separate queues, congestion control tuned for incast) or you trade capex for goodput, and goodput is the more expensive currency. → isolating checkpoint incast as a fabric problem in Chapter 8.5 and congestion control in Chapter 8.6.

The front-end/management network is the right home for cold, slow, control-plane storage traffic (image pulls, logs, low-rate object access), never for the hot path.

Storage-fabric placement → isolation, cost, and risk

Placement	Isolation from collectives	Capex	Best for	Failure mode if wrong
Dedicated storage rail	Full — separate switches/NICs	Highest (extra ports/optics)	Training clusters; checkpoint-heavy runs	None for isolation; you overspend if traffic is light
Converged onto back-end fabric	Partial — needs traffic-class isolation	Lowest (reuses compute fabric)	Smaller clusters; budget-constrained	Checkpoint incast stalls collectives → goodput loss
Front-end / management network	Full (different network)	Low	Cold/control-plane storage only	Hot-path traffic chokes the slow front-end link

Placement is a co-design decision with the Part 8 back-end fabric. The dedicated rail is the reference default for training clusters where checkpoint incast must not touch collectives.
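The scale of a synchronized checkpoint burst is easy to estimate before choosing a placement. The sketch below is back-of-the-envelope arithmetic only: the per-GPU shard size, flush window, fabric capacity, and the share reserved for collectives are all hypothetical inputs, and both function names are illustrative.

```python
def incast_burst_gb_s(n_gpus: int, shard_gb_per_gpu: float, window_s: float) -> float:
    """Aggregate write rate if every GPU flushes its shard within the same window."""
    return n_gpus * shard_gb_per_gpu / window_s


def fits_with_headroom(burst_gb_s: float,
                       fabric_gb_s: float,
                       collective_share: float = 0.5) -> bool:
    """True if the burst fits in the fabric fraction not reserved for collectives."""
    return burst_gb_s <= fabric_gb_s * (1.0 - collective_share)


if __name__ == "__main__":
    # Hypothetical: 16,384 GPUs, 2 GB per GPU, flushed within 32 seconds.
    burst = incast_burst_gb_s(16_384, 2.0, 32.0)
    print(f"burst: {burst:.0f} GB/s")
    print("fits on a 1,500 GB/s converged fabric:", fits_with_headroom(burst, 1500.0))
```

A burst that does not fit in the unreserved share is the case where a converged fabric needs either strict traffic-class isolation or a dedicated rail.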

On the rail, the transport itself is the final sub-fork: NVMe-oF over RDMA (RoCE or InfiniBand) delivers tens-of-microseconds latency and is what makes networked flash feel local; NVMe/TCP trades that latency for ubiquity and operational simplicity (no lossless-Ethernet PFC/ECN tuning, runs on any IP network). For the hot tier feeding GPUs, RDMA is effectively mandatory — TCP's latency forfeits the per-request advantage the whole bypass stack exists to capture. Between the two RDMA flavors, the InfiniBand-vs-RoCE decision is the same one Part 8 makes for the compute fabric, with the same congestion-control caveat: RoCE has closed most of the gap and Spectrum-X makes converged Ethernet viable for GDS and KV-offload at scale, but untuned RoCE under checkpoint incast is where tail latency goes to die. → the IB-vs-RoCE transport decision in Chapter 8.5.

Deep dive: walking a single training read through the buffered path vs GDS vs SCADA

Trace one read of one data shard and the bypasses become concrete. Buffered POSIX: the data loader calls read(); the kernel issues an NVMe-oF request; the host's NIC DMAs the returned data into a kernel bounce buffer in system DRAM; the CPU copies it from the bounce buffer into a user buffer; the framework then copies it again into pinned memory and DMAs it across the PCIe root complex to GPU HBM. Two or three copies, a context switch, and CPU and DRAM bandwidth consumed on every shard: fine for one stream, fatal for ten thousand concurrent small reads.

GPUDirect Storage: the loader calls cuFile instead; the GDS stack programs a DMA directly from the storage target (or local NVMe) into GPU HBM across the PCIe complex. The bounce buffer is gone, the CPU copy is gone, the DRAM round-trip is gone. The CPU still issued the request — it owns the control path — but it is no longer touched by the bytes. This is the 2–3x utilization lift in I/O-bound configs, and it is also why PCIe topology matters: if the NIC and GPU sit on different root complexes, the DMA crosses the CPU's inter-socket link and you lose part of the win.

SCADA / DPU-offload: now even the request is gone from the CPU. The GPU itself (SCADA) or a BlueField-4 DPU (STX) initiates and manages the I/O; the host processor is out of both paths. For an inference engine running a thousand parallel threads against a 2.9 PB context store, this is the difference between feeding the GPU and starving it on CPU orchestration overhead — the host simply cannot dispatch millions of sub-4 KB reads fast enough, so you stop asking it to. The progression is monotone: buffered removes nothing, GDS removes the data path, SCADA/DPU removes the control path. Each rung buys goodput; each rung costs portability.

Deep dive: why QLC won the capacity tier and where it must not go

The all-flash-everywhere trend of 2026 rests on QLC plus data reduction beating HDD on total cost for active data — not on QLC matching TLC. QLC's economics come from packing 4 bits per cell (33% more capacity per die than TLC's 3), and layered with dedup, compression, and similarity reduction (VAST-style), it makes a petabyte-scale all-flash data lake cost-competitive while delivering flash read latency the corpus actually benefits from during training ingest. That is the right home for QLC: read-mostly datasets, model registries, the warm capacity tier, and the read-dominated context tier.

The trap is writes. QLC's endurance (drive-writes-per-day) is a fraction of TLC's, and its sustained write throughput collapses once the SLC cache fills. Land a checkpoint stream — bursty, large, write-heavy, repeated every few minutes at scale — on QLC and you both burn through its rated endurance and hit the write cliff exactly when the synchronous checkpoint needs full bandwidth, stalling the training job. The rule is mechanical: TLC where it is written hard (scratch, hot/checkpoint), QLC where it is read (capacity, lake, context). The checkpoint write-bandwidth target — NVIDIA's guidance that write should be at least half of read — is a TLC-tier sizing constraint, not a QLC one. → checkpoint bandwidth sizing in Chapter 9.4 and the capacity tier in Chapter 9.6.
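The write-at-least-half-of-read rule turns into a TLC-tier sizing target with two multiplications. This is a minimal sketch: the default of 4 GB/s per GPU is the reference read target quoted in the key numbers, the 64-GPU cluster is a hypothetical example, and `hot_tier_targets` is an illustrative name.

```python
def hot_tier_targets(n_gpus: int, read_gb_s_per_gpu: float = 4.0) -> tuple[float, float]:
    """Return (aggregate read target, write floor) in GB/s for the hot TLC tier.

    The write floor applies the guidance that write should be at least half of read.
    """
    read = n_gpus * read_gb_s_per_gpu
    write_floor = read / 2.0
    return read, write_floor


if __name__ == "__main__":
    read, write = hot_tier_targets(64)  # hypothetical 64-GPU cluster
    print(f"read target: {read:.0f} GB/s, write floor: {write:.0f} GB/s")
```

Both figures are requirements on the tier that takes the checkpoint writes. A QLC capacity tier behind it is sized on $/TB and read bandwidth instead.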

Anti-patterns

The recurring mis-builds in this layer all come from optimizing the file system's headline number while ignoring the path:

Buying GDS-capable hardware and running buffered I/O. A supported NIC and file system mean nothing if the data loader issues ordinary read() and silently falls back to the bounce buffer. You pay for the bypass and get bounce-buffer latency. Validate the path, not the bill of materials.
QLC under the checkpoint stream. Putting write-heavy, bursty checkpoint traffic on the cheap capacity tier — burning endurance and hitting the write cliff at the worst moment. TLC absorbs writes; QLC serves reads.
Converging storage onto the collective fabric without traffic isolation. Saving a dedicated rail by dropping 16k-GPU checkpoint incast onto the same InfiniBand that carries all-reduce — converting a checkpoint into a training stall. The capex you saved is dwarfed by the goodput you lost.
Treating the storage shelf as a thermal afterthought at Gen6. Speccing a 96-drive Gen6 flash server into an air-cooled storage row and stranding the density behind a cooling wall the GPUs already crossed years ago.

This chapter sits between the file-system choice in Chapter 9.2 and its consumers. The I/O personalities it serves are defined in Chapter 9.1; the local-NVMe checkpoint tier and write-bandwidth math live in Chapter 9.4; the data-loader / cuFile integration that decides whether GDS actually engages is in Chapter 9.5; the QLC capacity/object tier in Chapter 9.6; and the Ethernet-attached context tier and DPU-offloaded KV-cache in Chapter 9.7. The fabric-placement and NVMe-oF transport decisions are co-designed with the back-end network in Chapter 8.5 and the congestion control in Chapter 8.6; the Gen6-flash cooling consequence with the DLC discussion in Chapter 5.4; DPU offload and health telemetry in Chapter 10.6; sizing, benchmarking (MLPerf Storage), and acceptance in Chapter 9.8; and the forward roadmap for GPU/DPU-initiated I/O and deeper CXL tiering in Chapter 16.2.