Chapter 9.7

Inference & KV-Cache Storage: The New Memory Hierarchy

Inference turned the KV-cache into a first-class storage problem: the bytes a request must keep resident now spill far past HBM, and where you let them land — DRAM, CXL-expanded DRAM, NVMe, or Ethernet-attached flash — sets your tokens-per-second, your cost-per-token, and how many users one GPU can serve.

What you'll decide here

Whether your inference fleet treats the KV-cache as ephemeral HBM state (recompute on miss) or as a managed, multi-tier asset that is offloaded, persisted, and reused across requests and sessions.
Where the offload tier physically lives — node-local DRAM, CXL-expanded/pooled DRAM, local NVMe, or a network-attached Ethernet-flash KV tier — and the latency/bandwidth/cost envelope each choice imposes on time-to-first-token.
The rule for when CXL-expanded DRAM beats NVMe KV-offload (and when it does not): a capacity-and-reuse-rate decision, not a vendor preference.
Whether to adopt a KV-transfer and reuse stack (Dynamo/KVBM + NIXL, LMCache, vLLM PagedAttention, Mooncake-class pools) or recompute, and how that choice couples to prefill/decode disaggregation and the back-end fabric.
How model/weight serving and cold-start are tiered alongside the KV-cache — because the same memory hierarchy that holds context also gates how fast a model loads onto a freed GPU.

For a decade the storage conversation in AI was about training — feed the GPUs fast enough (ingestion), and don't lose the run (checkpointing). Inference was assumed to be storage-light: load the weights once, serve from HBM, done. That assumption is dead. The workload that now consumes roughly two-thirds of AI compute (Chapter 1.3) created a new storage tier with no precedent in the training stack: the KV-cache — the per-token attention state every autoregressive request must keep resident for as long as it is generating. Reasoning models that emit tens of thousands of decode tokens, agents that carry long histories, and RAG pipelines with multi-thousand-token system prompts have turned that state from a rounding error into the dominant consumer of the most expensive memory in the building.

This chapter treats the KV-cache as what it has become: a memory-hierarchy problem with a storage answer. We walk the hierarchy from HBM down through local DRAM, CXL-expanded DRAM, NVMe, Ethernet-attached flash, and object storage; we make the present-tense case for CXL as a real tier (and the rule for when it beats NVMe); and we cover the KV offload-and-reuse machinery — Dynamo/KVBM, NIXL, LMCache, prefix caching — plus the model-weight-serving and cold-start problem that rides the same hierarchy. Every tier you add buys capacity and reuse at the price of latency on the miss path, and the wrong placement shows up directly as a blown time-to-first-token SLO.

Why inference created a new tier

Start with the arithmetic. The KV-cache stores two vectors (a key and a value) per token, per layer, per key/value head; under grouped-query attention there are fewer KV heads than query heads, which is already baked into the figures below. The bytes scale linearly with context length and batch size, and they must sit in the same HBM as the weights. The numbers are large even for mid-size models: at 16-bit precision, a single Llama 3 70B request at 128K context needs roughly 42 GB of KV-cache, more than half of an 80 GB H100, for one user. Llama 3.1 405B burns about 516 KB per token of context; at long context and modest batch the cache routinely exceeds the model itself. Past ~128K tokens the KV-cache dominates HBM; at 1M tokens it can eat 70–90% of GPU VRAM (DigitalApplied / Spheron KV-cache analyses, 2026).
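The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming 16-bit KV elements and the attention shapes from the public model configs (80 layers, 8 KV heads, head dimension 128 for Llama 3 70B; 126 layers, 8 KV heads, head dimension 128 for Llama 3.1 405B). `ModelShape`, `kv_bytes_per_token`, and `kv_bytes` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Attention shape taken from a model's public config (assumed values)."""
    name: str
    layers: int
    kv_heads: int   # KV heads under grouped-query attention, not query heads
    head_dim: int


def kv_bytes_per_token(m: ModelShape, bytes_per_elem: int = 2) -> int:
    # Two tensors (key and value) per layer, per KV head, per head dimension.
    return 2 * m.layers * m.kv_heads * m.head_dim * bytes_per_elem


def kv_bytes(m: ModelShape, context_tokens: int, batch: int = 1,
             bytes_per_elem: int = 2) -> int:
    # Linear in both context length and batch size.
    return kv_bytes_per_token(m, bytes_per_elem) * context_tokens * batch


LLAMA3_70B = ModelShape("Llama 3 70B", layers=80, kv_heads=8, head_dim=128)
LLAMA31_405B = ModelShape("Llama 3.1 405B", layers=126, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for m in (LLAMA3_70B, LLAMA31_405B):
        print(f"{m.name}: {kv_bytes_per_token(m) / 1e3:.0f} KB per token")
    total = kv_bytes(LLAMA3_70B, context_tokens=128 * 1024)
    print(f"Llama 3 70B at 128K context: {total / 1e9:.1f} GB")
```

Under these assumptions the per-token figure for the 405B model is 516,096 bytes and the 70B request at 128K tokens comes to about 42.9 GB, in line with the rounded numbers quoted above. Passing `bytes_per_elem=1` models FP8 KV quantization and halves both.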

That collides with a hard ceiling. HBM is the scarcest, most expensive memory in the system and the binding supply constraint on the whole industry (entire 2026 HBM production sold out; Chapter 16.2). You cannot buy your way out by stacking more HBM — there is none to buy, and it would cost more than the GPU. So the cache must spill, and the only question is where. Architectural compression (GQA, MLA, FP8/FP4 KV quantization) buys a 4–40x reduction and is the first line of defence — but it changes the model, not the infrastructure, and it does not eliminate the spill for long-context, high-concurrency serving. Once you accept the spill, you are no longer doing memory management; you are doing tiered storage, on the request's critical path, with a latency budget measured in milliseconds.

The hierarchy: HBM → DRAM → CXL → NVMe → Ethernet-flash → object

The new memory hierarchy is a ladder of declining cost and rising latency. Each rung holds colder, larger, cheaper KV state than the one above it. The engineering job is to keep the hottest, most-reused blocks high and let everything else fall — and to size the bandwidth between rungs so that a promotion on a hit does not blow the budget the hit was supposed to save.

NVIDIA's BlueField-4 Context Memory Storage Platform (CMX, formerly ICMSP) formalized this ladder into named tiers — G1 (GPU HBM), G2 (host CPU DRAM), G3 (node-local NVMe), a new G3.5 (an Ethernet-attached flash tier optimized specifically for KV-cache), and G4 (external durable NVMe) — and it is the clearest sign that the industry now treats KV-cache placement as a storage-architecture decision, not an inference-engine implementation detail (NVIDIA / Blocks & Files, 2026). The table below is the decision surface.

The KV-cache / inference memory hierarchy — the placement decision

Tier	Medium	Access latency	Bandwidth (per device)	Capacity / cost	Role in KV serving
G1	GPU HBM (HBM3E / HBM4)	~100 ns	~5–8 TB/s per GPU (HBM3E); ~2 TB/s per stack (HBM4)	Tiny / extreme $/GB	Live working set: active decode KV + weights
G2	Host CPU DRAM (DDR5)	~80–140 ns local	~40–50 GB/s per channel	Modest / high $/GB	First spill tier; warm reusable blocks
CXL	CXL-expanded / pooled DRAM	~250–600 ns (load/store)	~tens of GB/s per link	Large, memory-semantic	Byte-addressable capacity tier; pooled reuse
G3	Node-local NVMe (TLC/QLC)	~10–100 µs	~6–14 GB/s (PCIe 5.0/6.0)	Large / low $/GB	Cost-optimized prefix/session cache
G3.5	Ethernet-attached flash (CMX)	~100 µs class	320 GB/s read class, system aggregate (WEKA/STX)	Very large / shared	Networked KV tier across the fleet
G4 / cold	External NVMe / object store	ms+	Throughput-bound	Effectively unbounded / cheapest	Durable context, persisted sessions, model store

Latency and bandwidth are 2026 order-of-magnitude reference points (HBM3E aggregate per GPU, HBM4 per stack; CXL.mem load latency; NVMe/Ethernet-flash streaming). Sources: SK hynix/Micron HBM specs, CXL Consortium, NVIDIA CMX, Samsung CMM-D white paper. Use as relative tiers, not procurement specs.

The hierarchy is a set of cliffs, not a smooth gradient. Two matter most. The jump from DRAM/CXL (nanoseconds, byte-addressable, load/store) to NVMe (microseconds, block-addressable, DMA) is a latency step of two to three orders of magnitude, up to roughly 1,000x from host DRAM to a slow NVMe read; that is the line between "memory" and "storage," and it is where the programming model changes. The jump from node-local to network-attached (G3 to G3.5/G4) trades single-node capacity for fleet-wide reuse at the price of a fabric round-trip, which is why this tier only pays off when the same context is reused across many GPUs, not within one. Place a hot block one cliff too low and the latency of a hit approaches that of a miss; place a cold block one cliff too high and you have evicted a block someone else needed to make room for one nobody will reuse.
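One way to read the cliffs is as fetch time per block: a fixed access latency plus a size-dependent streaming term. The sketch below uses a single representative point picked from each of the table's ranges; these are assumptions for illustration, and `TIERS` and `fetch_seconds` are hypothetical names.

```python
# name -> (access latency in seconds, streaming bandwidth in bytes/second)
# Representative points chosen from the table's ranges; assumptions, not specs.
TIERS = {
    "G2 host DRAM": (100e-9, 45e9),
    "CXL DRAM": (400e-9, 30e9),
    "G3 local NVMe": (50e-6, 10e9),
}


def fetch_seconds(tier: str, nbytes: int) -> float:
    """Time to bring `nbytes` up from `tier`: first-byte latency plus streaming."""
    latency_s, bandwidth = TIERS[tier]
    return latency_s + nbytes / bandwidth


if __name__ == "__main__":
    for nbytes in (4 * 1024, 2 * 1024 * 1024):
        print(f"block of {nbytes} bytes")
        for tier in TIERS:
            print(f"  {tier:14s} {fetch_seconds(tier, nbytes) * 1e6:8.2f} us")
```

With these numbers the cliff is steepest for small accesses, where first-byte latency dominates; for multi-megabyte transfers the streaming term takes over and the gap between tiers narrows to the ratio of their bandwidths. That is one reason block size and batching of fetches matter as much as the tier itself.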

CXL DRAM as a present-tense tier

CXL stopped being a roadmap slide. By 2026 it is a deployable tier with two distinct uses that get conflated and should not be. Memory expansion adds byte-addressable DRAM to a single host beyond its DIMM-channel limit, over the PCIe physical layer with cache-coherent load/store semantics — the OS and the inference engine see more memory, full stop. Memory pooling disaggregates DRAM into a shared resource that multiple hosts (or GPUs, via the host) can attach, enabling KV blocks to be reused across servers without recomputation. Commercial CXL pools reached 100 TiB scale in 2025 with larger 2026 deployments, and Samsung's CMM-D-in-a-CXL-switch work is explicitly pitched at KV-cache offload (Samsung CMM-D white paper, June 2026; Introl, 2025).

The decisive property is that CXL keeps the cache memory-semantic. NVMe offload forces a block-mode DMA round-trip and a copy; CXL-expanded DRAM is a load away. Published results bear this out: CXL memory pools delivered ~3.8x speedup over 200G RDMA and >5x over SSD-based KV caching for inference, and offloading KV to CXL cut GPU memory usage by up to 87% while still meeting latency targets (ITME / CXL-KV research, arXiv 2026). The cost is real estate and complexity: CXL load latency (~250–600 ns) is several times host-local DRAM, the controllers and switches are a new BOM line, and pooling needs a memory-fabric topology that most halls were not built for (Chapter 8.5 for the fabric framing).

The rule: when CXL-expanded DRAM beats NVMe KV-offload

Both are spill tiers; they win in different regimes. Choose CXL-expanded DRAM when the working set is too big for host DRAM but the reuse is latency-sensitive: multi-turn chat, agentic loops, and RAG where the same warm KV blocks are touched again within milliseconds, and NVMe round-trips of tens of microseconds each, repeated across the many blocks of a long context, would breach TTFT. CXL keeps those blocks a load instruction away and lets you raise batch size and concurrency without recompute. Choose NVMe (G3/G4) when the state is large, cold, and reused on a human timescale or not at all: persisted sessions, long-tail prefixes, and durable context where capacity-per-dollar dominates and a ~100 µs fetch is invisible against a multi-second human pause. The pivot is reuse rate, not reuse existence: if a block is re-touched so quickly that NVMe fetch latency shows up in the request's latency budget, CXL pays; if slower, NVMe's 5–10x lower $/GB wins. Most real fleets run both, with CXL as the warm capacity tier and NVMe/Ethernet-flash as the cold one, and let the KV manager decide placement per block.
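The rule can be written down as a small placement function. This is a sketch only: the one-second boundary between "warm" and "human-timescale" reuse is an illustrative assumption, not a figure from the text, and `choose_spill_tier` is a hypothetical name.

```python
def choose_spill_tier(working_set_bytes: float,
                      host_dram_bytes: float,
                      reuse_interval_s: float,
                      warm_reuse_threshold_s: float = 1.0) -> str:
    """Pick a spill tier from capacity and expected reuse interval.

    warm_reuse_threshold_s is an assumed cut-off separating latency-sensitive
    reuse (milliseconds) from human-timescale reuse (seconds or longer).
    """
    if working_set_bytes <= host_dram_bytes:
        return "host DRAM"            # no need for a deeper tier yet
    if reuse_interval_s <= warm_reuse_threshold_s:
        return "CXL-expanded DRAM"    # warm, latency-sensitive reuse
    return "NVMe"                     # cold or human-timescale reuse


if __name__ == "__main__":
    tib = 2 ** 40
    print(choose_spill_tier(2 * tib, 1 * tib, reuse_interval_s=0.005))
    print(choose_spill_tier(2 * tib, 1 * tib, reuse_interval_s=30.0))
```

In a real fleet the threshold would be derived from the TTFT budget and measured fetch times rather than fixed, and the decision would be made per block by the KV manager rather than per fleet.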

KV-cache offload, reuse, and the transfer fabric

A hierarchy is inert without a manager that moves blocks across it and decides what to keep. In 2026 that machinery converged on a recognizable stack. PagedAttention (vLLM) made the KV-cache a paged, non-contiguous structure — the precondition for everything else, because you cannot offload or share a cache you cannot address in blocks. On top of it, prefix caching reuses the KV of a shared prompt prefix across requests: for RAG and agent workloads with long, repeated system prompts, the prefix's blocks stay resident instead of being recomputed per request, and vendors report serving on the order of 10x more users on the same GPU once prefix reuse and offload are combined (Spheron, 2026).
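Prefix reuse over a paged cache can be illustrated with chained block hashes: each full block of tokens gets a key that also depends on everything before it, so two requests share a key only when they share the entire prefix up to that block. The following is a toy sketch of that idea, not vLLM's implementation; `BLOCK_TOKENS`, `block_keys`, and `PrefixCache` are hypothetical names, and the block size of 16 is an assumption.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_TOKENS = 16  # tokens per KV block (assumed for illustration)


def block_keys(tokens: Sequence[int]) -> List[str]:
    """One key per full block; each key commits to the whole prefix before it."""
    keys: List[str] = []
    parent = b""
    full = len(tokens) - len(tokens) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        chunk = tokens[start:start + BLOCK_TOKENS]
        h = hashlib.sha256(parent + b"|" + ",".join(map(str, chunk)).encode())
        parent = h.digest()
        keys.append(h.hexdigest())
    return keys


class PrefixCache:
    def __init__(self) -> None:
        # key -> KV block handle; an opaque placeholder object in this sketch
        self._blocks: Dict[str, object] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        for key in block_keys(tokens):
            self._blocks.setdefault(key, object())

    def match(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        hit = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break
            hit += BLOCK_TOKENS
        return hit


if __name__ == "__main__":
    system_prompt = list(range(100))
    cache = PrefixCache()
    cache.insert(system_prompt + [1000, 1001])
    print(cache.match(system_prompt + [2000] * 20))
```

Only whole blocks are shareable in this scheme, so a 100-token shared prompt yields 96 reusable tokens at a block size of 16; the remaining tokens are recomputed with the request's own suffix.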

The transport layer is where the storage and networking disciplines fuse. NVIDIA Dynamo introduced the KV Block Manager (KVBM), which decouples KV memory management from the inference engine and orchestrates movement across G1–G4; it uses NIXL (the NVIDIA Inference Xfer Library) as the unified transport across NVLink, RDMA NICs, and GPUDirect Storage, and integrates LMCache for reuse and eviction. The reported wins are large: Dynamo 1.0 cited ~7x throughput on reasoning/MoE workloads, and NIXL+GDS demonstrated ~10x faster prefill for large-context scenarios by streaming KV from storage instead of recomputing it (NVIDIA GTC 2026; Spheron/LMCache, 2026). Open alternatives (LMCache standalone, Mooncake-class disaggregated KV pools, llm-d) chase the same pattern. The architectural consequence is that KV reuse pulls you toward prefill/decode disaggregation: separate the compute-bound prefill pool from the bandwidth-bound decode pool, generate KV once in prefill, and transfer it to decode over the fabric. That makes the KV-transfer path a primary fabric-sizing input, not an afterthought (Chapter 8.5; Chapter 9.3 for the GPUDirect Storage data path).

Deep dive: recompute vs offload — the decision the KV manager makes thousands of times a second

On every cache miss the serving stack faces the same micro-decision: recompute the missing KV (run prefill again, spend GPU FLOPs) or fetch it from a lower tier (spend transfer latency and bandwidth). The crossover is a function of three numbers: the prefill cost of the prompt (FLOPs, which grow with prefix length), the fetch latency of the tier holding the block, and the bandwidth available to move it. For a short prompt on a lightly loaded GPU, recompute is cheaper than a fabric round-trip: the block is small and the GPU has spare prefill capacity at low concurrency. For a long shared prefix (a 4,000-token RAG system prompt reused across thousands of requests), recompute is catastrophically wasteful and fetching the cached KV, even from G3.5 over the network, wins by a wide margin, which is exactly the regime where NIXL+GDS reported ~10x prefill speedups.

The trap is treating this as a static config. The right answer flips with concurrency: as the decode pool fills and prefill capacity becomes the bottleneck, the recompute option gets more expensive (it competes with paying users for FLOPs) and offload gets relatively cheaper, so the crossover point moves toward fetch. A KV manager that decides once at deploy time leaves goodput on the table at both ends; the ones that earn their keep (KVBM, LMCache) make the recompute-vs-fetch call per block, per tier, against live queue depth. This is the inference analogue of the checkpoint-interval optimization in Chapter 9.4 — a cost-balanced decision, made continuously instead of once.
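The crossover can be sketched as a comparison of two estimated times. Everything numeric here is an assumption chosen for illustration: roughly 2 FLOPs per parameter per token for a forward pass (ignoring the attention term), 50% utilization of a 1e15 FLOP/s GPU, a 25 GB/s per-GPU share of the fetch tier, and a 5 ms fixed software-plus-fabric overhead per fetch. The function names are hypothetical and do not correspond to KVBM or LMCache APIs.

```python
def prefill_seconds(params: float, prompt_tokens: int, gpu_flops: float,
                    utilization: float = 0.5) -> float:
    # ~2 FLOPs per parameter per token; the quadratic attention term is ignored.
    return 2.0 * params * prompt_tokens / (gpu_flops * utilization)


def fetch_seconds(kv_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float,
                  fixed_overhead_s: float = 0.0) -> float:
    return fixed_overhead_s + latency_s + kv_bytes / bandwidth_bytes_per_s


def should_fetch(prompt_tokens: int, *, params: float,
                 kv_bytes_per_token: int, gpu_flops: float,
                 latency_s: float, bandwidth_bytes_per_s: float,
                 prefill_queue_wait_s: float = 0.0,
                 fixed_overhead_s: float = 0.0) -> bool:
    """True when fetching cached KV is estimated to beat recomputing it."""
    recompute = prefill_queue_wait_s + prefill_seconds(
        params, prompt_tokens, gpu_flops)
    fetch = fetch_seconds(kv_bytes_per_token * prompt_tokens, latency_s,
                          bandwidth_bytes_per_s, fixed_overhead_s)
    return fetch < recompute


# Illustrative inputs only: a 70B-parameter model with 327,680 KV bytes/token.
ASSUMED = dict(params=70e9, kv_bytes_per_token=327_680, gpu_flops=1e15,
               latency_s=100e-6, bandwidth_bytes_per_s=25e9,
               fixed_overhead_s=5e-3)

if __name__ == "__main__":
    print(should_fetch(16, **ASSUMED))                              # idle prefill pool
    print(should_fetch(16, prefill_queue_wait_s=0.010, **ASSUMED))  # queued prefill
    print(should_fetch(4000, **ASSUMED))                            # long shared prefix
```

Under these assumptions a 16-token prompt is cheaper to recompute while the prefill pool is idle, flips to fetch once 10 ms of prefill queueing is added, and a 4,000-token prefix favors fetch by more than an order of magnitude. The queue-wait term is what makes the answer depend on live load rather than on a deploy-time setting.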

~42 GB

KV-cache for one Llama 3 70B request at 128K context — over half an 80 GB H100, for a single user

2026 · DigitalApplied / Spheron KV-cache analyses

~516 KB

KV-cache per token of context, Llama 3.1 405B (327 KB for Qwen-2.5 72B)

2026 · mbrenndoerfer / DEV KV-cache calculators

70–90%

share of GPU VRAM the KV-cache consumes at 1M-token context

2026 · Spheron / DigitalApplied KV optimization guides

4–40x

long-context cost reduction from KV compression (GQA/MLA + FP8/FP4); MLA+FP8 cut 135 GB → 8 GB (~17x)

2026 · Spheron / DeltaKV / KV optimization research

~10x

more users served per GPU with prefix caching + KV offload combined

2026 · Spheron KV-cache optimization guide

>5x

CXL KV-pool speedup vs SSD-based caching; ~3.8x vs 200G RDMA; up to 87% less GPU memory used

2026 · ITME / CXL-KV research (arXiv); Samsung CMM-D

~7x / ~10x

Dynamo 1.0 reasoning-throughput gain; NIXL+GPUDirect Storage prefill speedup for long context

2026 · NVIDIA GTC 2026; LMCache / Spheron

320 GB/s

read-throughput class cited for an Ethernet-attached KV (G3.5) tier; 4–10x more context tokens/s

2026 · NVIDIA CMX; WEKA/STX (Blocks & Files)

Model and weight serving: the cold-start tax on the same hierarchy

The KV-cache is not the only thing riding this hierarchy. Multi-model and autoscaled inference fleets must also load weights onto GPUs that were just freed — and the same tiers govern how fast that happens. A 70B model in FP8 is ~70 GB; loading it from a slow object store across the network is a multi-minute stall during which a freshly-scaled GPU earns nothing. This is the cold-start tax, and it is the mirror image of the KV problem: where KV is about keeping per-request state warm, weight serving is about keeping per-model state reachable fast enough to scale.

The placement logic is the same ladder applied to weights. The hot path keeps the working set of frequently-served models on node-local NVMe (G3) or a fast network-flash tier so a scale-up event is a fast block read, not a cold object pull; the cold path keeps the long tail of rarely-served models and durable artifacts in object storage (Chapter 9.6, which treats object as the inference cold tier and model-distribution backbone). The consequence is concrete: under-provision the weight-serving tier and your autoscaler's response time is gated by storage, not by GPU availability — you scale slower than your traffic spikes, breach SLOs during the ramp, and over-provision idle GPUs to compensate. The KV hierarchy and the weight hierarchy are the same physical media, contending for the same bandwidth, and they must be sized together.
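The cold-start tax is simple division, which makes it easy to check a tier against an autoscaling target. This sketch assumes the load is bandwidth-bound and ignores deserialization and GPU copy time; the 0.5 GB/s object-store rate is an assumed example of a slow pull, the 10 GB/s NVMe rate is taken from the range in the hierarchy table, and `load_seconds` is a hypothetical name.

```python
def load_seconds(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on weight-load time when the source tier is the bottleneck."""
    return model_bytes / bandwidth_bytes_per_s


MODEL_BYTES = 70e9  # a 70B model in FP8, about one byte per parameter

# Assumed effective read rates per freshly scaled GPU (illustrative only).
SOURCES = {
    "object store (slow pull)": 0.5e9,
    "node-local NVMe (G3)": 10e9,
}

if __name__ == "__main__":
    for name, bandwidth in SOURCES.items():
        print(f"{name}: {load_seconds(MODEL_BYTES, bandwidth):.0f} s")
```

At the assumed rates the same model takes minutes from the slow object path and seconds from local NVMe, which is the gap the autoscaler inherits as response time.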

Multi-tier KV management and the placement policy

With the tiers, the transport, and the workloads in place, the remaining decision is policy: what gets promoted, what gets evicted, and where each block lands. This is where most of the realized goodput is won or lost, because the hardware ladder only sets the ceiling — the policy determines how close you get to it.

Three policy axes matter. Eviction: LRU is the default, but reuse-aware policies that weight by prefix-sharing frequency keep high-fanout system prompts resident far longer than a naive recency scheme would, which is the whole point of prefix caching. Placement: the manager must route each block to the tier whose latency matches its expected reuse interval — the CXL-vs-NVMe rule above, applied per block rather than per fleet. Coherence and sharing: a pooled or networked tier (CXL pool, G3.5 Ethernet-flash) lets many GPUs reuse one copy of a context, which is enormously efficient for shared prefixes but introduces a consistency and lifetime-management problem that node-local caches never had. Get the policy wrong and the symptoms are unambiguous: thrashing between tiers (blocks promoted and evicted before reuse), cache pollution (cold blocks crowding out warm ones), or a TTFT tail that tracks the slowest tier instead of the fastest hit. The fleet-level sizing of these tiers — how much DRAM, CXL, NVMe, and network-flash per GPU, and the bandwidth between them — is the storage:compute co-design problem taken up in Chapter 9.8.
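The difference between recency-only and reuse-aware eviction shows up in a few lines. The toy cache below evicts the block with the fewest recorded uses, breaking ties by least-recent touch; it is a sketch of the idea rather than the policy of any named KV manager, and `ReuseAwareCache` is a hypothetical name.

```python
import itertools
from typing import Dict, Optional


class ReuseAwareCache:
    """Toy block cache: evict the least-reused block, oldest first on ties."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self._uses: Dict[str, int] = {}   # key -> number of touches so far
        self._last: Dict[str, int] = {}   # key -> logical time of last touch
        self._clock = itertools.count()

    def touch(self, key: str) -> Optional[str]:
        """Record a use of `key`; return the key evicted to make room, if any."""
        evicted = None
        if key not in self._uses and len(self._uses) >= self.capacity:
            evicted = min(self._uses,
                          key=lambda k: (self._uses[k], self._last[k]))
            del self._uses[evicted]
            del self._last[evicted]
        self._uses[key] = self._uses.get(key, 0) + 1
        self._last[key] = next(self._clock)
        return evicted

    def resident(self) -> set:
        return set(self._uses)


if __name__ == "__main__":
    cache = ReuseAwareCache(capacity_blocks=2)
    for _ in range(3):
        cache.touch("system-prompt")   # high-fanout shared prefix
    cache.touch("session-a")
    print(cache.touch("session-b"))    # evicts the one-off block
    print(cache.resident())
```

A plain LRU would evict the shared system prompt in this sequence because it was touched least recently. A pure use-count policy has its own failure mode (once-popular blocks that never leave), so practical policies usually decay or window the counts.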

The cross-session KV-cache is a confidentiality boundary, not just a performance tier

Reuse is the goal, but a KV-cache shared across requests, sessions, and tenants is, by construction, retained inference state that may carry user content. A pooled or networked KV tier that serves a prefix block from one tenant's prompt to another's request is a data leak, and a persisted session cache on durable NVMe is now an at-rest copy of user input subject to whatever retention and privacy regime governs it. Enabling cross-tenant prefix sharing or session persistence is a security decision, not just a performance one: it must be scoped to a trust boundary, and in multi-tenant serving it interacts directly with GPU isolation (Chapter 10.3) and confidential-computing guarantees (Chapter 11.5). Do not let an eviction-policy tuning exercise quietly turn into a cross-tenant data path.
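One common way to scope reuse to a trust boundary is to fold the boundary's identifier into the cache key, so identical prompts from different tenants can never resolve to the same block. The sketch below shows only that key derivation; `scoped_block_key` and the trust-domain strings are hypothetical, and key scoping on its own does not provide isolation, encryption at rest, or retention control.

```python
import hashlib
from typing import Sequence


def scoped_block_key(trust_domain: str, prefix_tokens: Sequence[int]) -> str:
    """Cache key that is only shareable inside one trust domain."""
    domain = trust_domain.encode()
    h = hashlib.sha256()
    # Length-prefix the domain so domain and token bytes cannot be confused.
    h.update(len(domain).to_bytes(4, "big"))
    h.update(domain)
    h.update(",".join(map(str, prefix_tokens)).encode())
    return h.hexdigest()


if __name__ == "__main__":
    prompt = [101, 2023, 2003, 1037, 4937]
    same = scoped_block_key("tenant-a", prompt) == scoped_block_key("tenant-a", prompt)
    cross = scoped_block_key("tenant-a", prompt) == scoped_block_key("tenant-b", prompt)
    print(same, cross)
```

Choosing what counts as a trust domain (tenant, project, user, or a deliberately shared public-prompt domain) is the policy decision; the key derivation just enforces whatever was chosen.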

Deep dive: why disaggregated inference makes the KV transfer a fabric-design problem

Prefill and decode have opposite hardware appetites. Prefill is compute-bound: it processes the whole prompt in parallel and saturates GPU FLOPs. Decode is memory-bandwidth-bound: it generates one token at a time and is limited by HBM bandwidth, not FLOPs. Running both on the same GPU means each phase under-utilizes the resource the other needs. Disaggregated inference splits them into separate pools, a prefill pool that builds the KV-cache and a decode pool that consumes it, so each can be sized and scaled independently. NVIDIA's GB200 NVL72 + Dynamo work is built around exactly this split, transferring KV over NVLink between the pools.

The consequence lands in the fabric. Disaggregation means the KV-cache produced in prefill must move to wherever decode runs, on every request, on the critical path. The transfer is large (gigabytes for long context) and latency-sensitive (it sits in front of the first decode token), so the link between prefill and decode pools becomes a primary fabric-sizing constraint — it must carry KV at HBM-adjacent rates or the disaggregation that was supposed to raise utilization instead injects TTFT latency. This is why NIXL spans NVLink, RDMA, and GPUDirect Storage with one API: the KV transfer may ride any of them depending on where the tiers sit, and the fabric must be co-designed to carry it. Topology and oversubscription for this traffic are engineered in Chapter 8.5; the CPU-bypass data path that makes storage-to-GPU KV streaming viable is Chapter 9.3.
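The fabric-sizing input falls out of two divisions: the bandwidth one request needs to land its KV inside a transfer budget, and the sustained bandwidth the request rate implies. This is a rough sketch with assumed inputs: 327,680 KV bytes per token (the 16-bit figure consistent with ~42 GB at 128K context for Llama 3 70B), a 32K-token prompt, a 50 ms transfer budget, and 20 requests per second. The function names are hypothetical.

```python
def kv_transfer_bytes(kv_bytes_per_token: int, context_tokens: int) -> int:
    """KV produced by prefill that must reach the decode pool."""
    return kv_bytes_per_token * context_tokens


def required_link_bandwidth(kv_bytes: int, transfer_budget_s: float) -> float:
    """Bytes/second one request needs to deliver its KV within the budget."""
    return kv_bytes / transfer_budget_s


def sustained_bandwidth(kv_bytes: int, requests_per_s: float) -> float:
    """Average bytes/second the prefill-to-decode path carries at this rate."""
    return kv_bytes * requests_per_s


if __name__ == "__main__":
    kv = kv_transfer_bytes(327_680, 32 * 1024)
    print(f"KV per request: {kv / 1e9:.1f} GB")
    print(f"burst: {required_link_bandwidth(kv, 0.050) / 1e9:.0f} GB/s")
    print(f"sustained: {sustained_bandwidth(kv, 20) / 1e9:.0f} GB/s")
```

With these inputs a single request moves about 10.7 GB and needs on the order of 200 GB/s to land in 50 ms, which is why the text describes the requirement as HBM-adjacent. Layer-by-layer streaming that overlaps transfer with prefill can relax the burst figure, but the sustained figure still has to fit the fabric.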

Anti-patterns

The recurring mistakes all come from importing a training-era storage mental model into an inference fleet, or from treating the KV-cache as an implementation detail instead of an architecture:

Recompute-everything by default. Disabling or under-sizing KV offload because "storage is slow" — then paying full prefill FLOPs on every shared-prefix request. For RAG and agent workloads this is the single largest source of wasted inference compute, and it scales with how good your traffic is (more repeated prompts = more waste).
One tier too low for the reuse rate. Parking warm, millisecond-reused KV on NVMe to save DRAM, and discovering that the ~100 µs fetch blew the TTFT SLO. The fix is the CXL-vs-NVMe rule — match tier latency to reuse interval — not buying more GPUs.
Sizing the weight-serving tier as an afterthought. Provisioning HBM and fabric for steady state, then watching the autoscaler stall on cold model pulls from object storage during every traffic spike. Cold-start is a storage-bandwidth problem; size G3 for it or scale slower than your traffic.
Shared KV across an untrusted boundary. Turning on cross-request prefix sharing or session persistence for the throughput win without scoping it to a trust boundary, which converts a performance tier into a multi-tenant data path. See Chapter 11.5.

This chapter is the inference half of the storage story; the training half — ingestion and checkpointing — opens Part 9 in Chapter 9.1 (the four I/O personalities, of which the KV-cache is one) and Chapter 9.4 (checkpointing). The CPU-bypass data path that makes storage-to-GPU KV streaming viable — GPUDirect Storage, NVMe-oF, BlueField DPU offload — is engineered in Chapter 9.3; object storage as the inference cold tier and model-distribution backbone is Chapter 9.6; fleet-level storage:compute sizing and the network co-design that isolates KV-transfer from collectives is Chapter 9.8. The inference workload that created this tier is Chapter 1.3; the fabric that carries disaggregated KV transfer is Chapter 8.5; multi-tenant isolation and the confidentiality boundary on a shared KV-cache are Chapter 10.3 and Chapter 11.5; and the deeper-CXL-tiering trajectory is in the consolidated roadmap, Chapter 16.2.