Chapter 1.3
Inference Data Centers: Bursty, Distributed, Always-On
An inference data center is not a smaller training cluster — it is a different machine optimized for a different objective: many independent requests served against a latency SLO, always-on, close to users, with goodput-per-dollar and tokens-per-watt as the scoreboard rather than the speed of a single synchronous job.
What you'll decide here
- Which inference sub-mode dominates your facility — interactive/online, batch/offline, or agentic/long-context — because each sets a different SLO, a different prefill:decode mix, and a different fleet-sizing math.
- Whether to disaggregate prefill from decode (and pay the KV-cache transfer tax) or co-locate them — the single architectural fork that most shapes accelerator mix, fabric, and goodput in a 2026 serving stack.
- How far to oversubscribe power and fabric: inference's uncorrelated per-request peaks open ~21% power headroom and 2:1-3:1 fabric oversubscription that training cannot touch — money you either capture or strand.
- Where to site the fleet against a latency budget, and how many regions: proximity-to-users, not cheap stranded power, is the inference siting driver, and it trades directly against energy cost.
- GPU vs inference-ASIC selection scored on tokens-per-watt and tokens-per-dollar at your real batch sizes and context lengths — not peak FLOPS — because the deflation of token prices punishes a wrong silicon bet fast.
For most operators, inference is the business. Training builds the asset; inference is what earns against it, request by request, token by token, every hour of every day. By 2026 the compute has followed the money: inference is roughly two-thirds of all AI compute — up from about half in 2025 and a third in 2023 — and at large operators it is 80-90% of the served draw (Deloitte TMT Predictions 2026). Yet the reflex in the industry is still to design the inference fleet as if it were a training cluster with the dial turned down. That reflex is expensive. An inference data center optimizes a fundamentally different objective function, and nearly every subsystem decision flips sign because of it.
This chapter is the inference half of the master fork introduced in Chapter 1.1: training-shaped facilities optimize one tightly-coupled job; inference-shaped facilities optimize many independent requests against a latency SLO. We work the consequences of that one difference all the way down — the inference taxonomy and the economics shift, the reasoning-driven demand multiplier that is reshaping the decode-heavy future, latency-driven siting, the prefill/decode disaggregation that has become the defining 2026 serving pattern, the reliability and uptime posture, and the GPU-vs-ASIC selection that is finally a real fork. The serving engineering — batching, scheduling, the disaggregation tax in detail — is owned by Chapter 10.11; here we cover what those choices mean for the building.
The inference taxonomy: three sub-modes, three design bases
"Inference" is not one workload. It is at least three, and they pull the facility in different directions. The mistake that recurs is treating a fleet as homogeneous when its dominant sub-mode silently dictates the SLO, the prefill:decode ratio, the memory hierarchy, and the fleet size.
Interactive / online inference is a human (or an interactive agent) waiting on the output in real time. It is governed by two latency metrics — time-to-first-token (TTFT), set by the prefill of the prompt, and time-per-output-token (TPOT), set by decode — and it is bursty and always-on: traffic can swing from 30% to 90% of capacity in minutes on a diurnal-plus-spike pattern. The design basis is high availability, proximity to users, and enough headroom to absorb the peak without breaching the SLO.
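How the two metrics combine into what the user actually waits for can be sketched as below. This is a minimal illustration assuming a constant TPOT across the whole response; the TTFT, TPOT, and token counts are hypothetical values, not figures from this chapter.

```python
def response_time_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: the first token arrives after TTFT,
    then each remaining token costs one TPOT (assumed constant)."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * tpot_s


# Hypothetical operating point: 0.4 s TTFT, 25 ms per output token.
chat = response_time_s(ttft_s=0.4, tpot_s=0.025, output_tokens=300)
reasoning = response_time_s(ttft_s=0.4, tpot_s=0.025, output_tokens=20_000)
print(f"chat: {chat:.1f} s, long reasoning trace: {reasoning:.1f} s")
```

At these assumed values a 300-token reply completes in about 8 seconds and a 20,000-token trace in about 500 seconds, which is why TTFT dominates the feel of short replies and TPOT dominates long ones.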
Batch / offline inference — embeddings generation, document and corpus processing, synthetic-data creation, evaluation sweeps, nightly re-scoring — has no user waiting. It is throughput-bound, not latency-bound, so it tolerates queuing, interruption, and aggressive oversubscription, and it is the natural consumer of spot capacity, off-peak power, and curtailable interconnections. It is the cheapest inference to host and the most flexible to schedule, which makes it the load you shift to soak up the headroom the interactive fleet leaves on the table.
Agentic / long-context inference is the fast-growing third mode and the one that breaks naive sizing. An agent issues many model calls per user action, carries long and growing context (tool outputs, retrieved documents, prior turns reaching toward 1M+ tokens), and interleaves reasoning with tool use. It inflates the prefill share (huge prompts), explodes the KV-cache footprint (long context held live across many concurrent sessions), and shifts the GPU:CPU ratio toward more host work for orchestration and tool calls. A fleet sized on single-turn chat assumptions is undersized for agents on both memory and prefill compute.
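The memory side of that undersizing can be made concrete with the usual KV-cache sizing arithmetic. The sketch below assumes a standard dense key/value cache (multi-head or grouped-query attention, no compression or sliding window); the model shape is hypothetical and chosen only to show the scaling.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # The factor 2 is one key vector plus one value vector,
    # stored per layer and per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(context_tokens: int, sessions: int, **shape: int) -> float:
    """Total KV-cache held live for `sessions` concurrent contexts, in GiB."""
    return kv_bytes_per_token(**shape) * context_tokens * sessions / 2**30


# Hypothetical model shape: 80 layers, 8 KV heads, head_dim 128, 16-bit cache.
shape = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**shape)               # bytes per token
chat = kv_cache_gib(4_096, sessions=1, **shape)       # single-turn chat context
agent = kv_cache_gib(1_048_576, sessions=1, **shape)  # 1M-token agent context
print(per_token, chat, agent)
```

For this assumed shape the cache costs 320 KiB per token, so one 4K-token chat session holds 1.25 GiB while one 1M-token agent session holds 320 GiB, before multiplying by concurrent sessions.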
The economics shift: inference is the revenue workload
The structural fact of 2026 is that inference is now the larger and faster-growing share of AI infrastructure, and it grows differently from training. McKinsey's base case has inference capacity rising from ~20.9 GW to ~93.3 GW by 2030 — a ~35% CAGR — against training's ~23.1 GW to ~62.2 GW at ~22%. The crossover has happened: inference is both the bigger pool and the steeper curve. The consequence for siting and procurement is direct. Training capacity concentrates in a handful of gigawatt campuses chasing cheap firm power; inference capacity distributes toward demand, into more, smaller, latency-sited halls. As the trade press puts it, training built the campuses; inference will choose the markets.
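A quick consistency check on those growth figures: the text does not state the base year, so the sketch below solves for the horizon that the quoted start value, end value, and CAGR jointly imply rather than assuming one.

```python
import math


def implied_years(start_gw: float, end_gw: float, cagr: float) -> float:
    """Years needed to grow from start to end at a constant annual rate."""
    return math.log(end_gw / start_gw) / math.log(1 + cagr)


inference_years = implied_years(20.9, 93.3, 0.35)
training_years = implied_years(23.1, 62.2, 0.22)
print(f"inference: {inference_years:.2f} y, training: {training_years:.2f} y")
```

Both pairs resolve to roughly five years, so the two quoted growth rates are mutually consistent with a common base year about five years before 2030.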
The other structural fact is deflation. The market price of a million tokens has fallen roughly 10x per year for a given capability tier (LLMflation; Introl/a16z synthesis). This is a Jevons-paradox dynamic: unit cost collapses while aggregate spend rises because demand more than compensates. For an inference operator this is the dominant business risk: a fleet scoped to today's $/Mtoken can be underwater in a year if its tokens-per-dollar does not improve on the same curve. That is why the inference design basis fixates on efficiency per token (tokens-per-watt and tokens-per-dollar at real batch sizes) rather than peak throughput. The full unit-economics build-up and the deflation risk are scored in Chapter 1.8.
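The deflation risk reduces to simple arithmetic. The sketch below is a toy model, assuming price and serving cost each fall by a constant factor per year; the starting price and cost are hypothetical, and only the 10x/year price decline comes from the text.

```python
def gross_margin(price: float, cost: float) -> float:
    return 1 - cost / price


def margin_after(years: float, price0: float, cost0: float,
                 price_drop_per_year: float = 10.0,
                 cost_drop_per_year: float = 1.0) -> float:
    """Gross margin per Mtoken after `years`, with constant yearly
    deflation factors for market price and for serving cost."""
    price = price0 / price_drop_per_year ** years
    cost = cost0 / cost_drop_per_year ** years
    return gross_margin(price, cost)


# Hypothetical start: $10 per Mtoken price, $4 per Mtoken serving cost.
today = margin_after(0, 10.0, 4.0)
frozen_fleet = margin_after(1, 10.0, 4.0, cost_drop_per_year=1.0)
keeps_pace = margin_after(1, 10.0, 4.0, cost_drop_per_year=10.0)
print(today, frozen_fleet, keeps_pace)
```

Under these assumptions a fleet whose cost per token does not move goes from a 60% margin to selling each token at four times its price within a year, while a fleet that improves tokens-per-dollar on the same curve holds its margin.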
| Sub-mode | What waits | Prefill:decode tilt | KV-cache pressure | Fabric / power oversubscription | Redundancy | Siting driver |
|---|---|---|---|---|---|---|
| Interactive / online | A user, on TTFT + TPOT (sub-second to seconds) | Balanced, decode-heavy for reasoning | High — many live sessions held concurrently | Fabric 2:1-3:1; power up to ~21% headroom | 2N / Tier-IV-class + N+1 cooling; geo-redundant | Sub-50 ms proximity to users; latency-first |
| Batch / offline | Nothing — throughput-bound | Prefill-heavy (large corpora, short outputs) | Low-moderate; reuse/prefix caching helps | Heavily oversubscribed; cost-optimized | N — queue-and-retry, interruption-tolerant | Cheapest curtailable power; off-peak |
| Agentic / long-context | A user or pipeline, across many chained calls | Prefill-heavy + long decode; tool-call CPU work | Very high — 1M+ token context held live | Fabric moderate; power lumpy, harder to cap | High availability; session-state durability matters | Proximity + KV-storage tier near compute |
The table is a cascade, like the one in Chapter 1.1: the left two columns are what you choose; everything to the right is consequence. Choose an interactive-dominant fleet and you have implicitly chosen geo-distribution, 2N-class uptime, a high KV-cache budget, and a willingness to oversubscribe power but not availability. Choose batch-dominant and you have chosen the opposite building — consolidated, curtailable, interruption-tolerant. The agentic column is the one most operators under-provision today, because it looks like interactive chat until the context lengths and call counts reveal a memory-and-prefill problem the single-turn sizing never anticipated.
Reasoning and test-time compute: the demand multiplier
The largest force reshaping inference demand since 2025 is the rise of reasoning models that spend compute at inference time to improve answers — the post-o1, post-R1 paradigm. A reasoning model emits a long internal chain of thought before its visible answer, turning a request that once decoded a few hundred tokens into one that may decode tens of thousands. This is test-time compute, and it has three structural consequences for the fleet that compound:
- The prefill:decode ratio shifts toward decode. Long chains of thought are autoregressive generation, which is memory-bandwidth-bound, not compute-bound. Decode now dominates the token budget for reasoning traffic, which changes which silicon and which memory hierarchy you want — and tilts the disaggregation math below.
- KV-cache pressure inflates. Every token of context and every token generated must keep its keys and values resident for attention. Long-decode reasoning, multiplied across many concurrent sessions, makes the KV-cache a first-class capacity constraint — often the binding one — not an afterthought. This is the demand driver behind the new inference-memory tier in Chapter 9.7.
- Effective demand per request multiplies. If the average request decodes 10-50x more tokens, a fleet sized on the old token-per-request assumption is undersized by roughly the same factor at constant request volume. Reasoning is, in effect, a demand multiplier hiding inside a flat user count.
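The third consequence can be sketched as a first-order fleet-sizing calculation. This is a deliberately simplified model that counts decode tokens only and ignores prefill, batching effects, and SLO headroom; the request rate, per-accelerator throughput, and utilization target are hypothetical.

```python
import math


def accelerators_needed(requests_per_s: float,
                        decode_tokens_per_request: float,
                        tokens_per_s_per_accelerator: float,
                        target_utilization: float = 0.7) -> int:
    """Decode-only sizing: token demand divided by usable per-device throughput."""
    demand = requests_per_s * decode_tokens_per_request
    usable = tokens_per_s_per_accelerator * target_utilization
    return math.ceil(demand / usable)


# Same user traffic, hypothetical device throughput of 2,500 tokens/s.
single_turn = accelerators_needed(200, 400, 2_500)
reasoning = accelerators_needed(200, 400 * 20, 2_500)  # 20x tokens per request
print(single_turn, reasoning)
```

With the request rate held flat, a 20x longer decode takes this assumed fleet from 46 accelerators to 915, which is the multiplier hiding inside an unchanged user count.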
The strategic read: treat the decode-heavy future as the base case, not a tail risk. It argues for memory-bandwidth-rich silicon, a deep KV-cache hierarchy, and fleet headroom for a per-request token budget that keeps climbing. The serving-engineering levers that absorb this — continuous batching, chunked prefill, speculative decoding — live in Chapter 10.11.
Latency-driven siting and regional distribution
Training siting chases the cheapest firm megawatt and the coldest free-cooling climate, indifferent to where users are. Inference inverts the hierarchy: the binding constraint is the latency budget back to the user, and that turns siting into a coverage problem. Interactive inference lives or dies on the 30/50/100 ms perceptibility thresholds (treated in full for the edge in Chapter 1.5); a centralized fleet two time-zones from its users cannot meet a sub-50 ms TTFT no matter how fast the silicon, because the speed of light in fiber is not negotiable — roughly 0.5 ms per 100 km one-way (about 1 ms round-trip), before any switching or queuing.
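The propagation floor can be turned into a coverage radius. The sketch below uses the 0.5 ms per 100 km one-way figure from the text; the split of the latency budget between network and serving is a hypothetical assumption, and distances are fiber-route kilometers, which run longer than straight-line distance.

```python
FIBER_MS_PER_100KM_ONE_WAY = 0.5  # propagation figure quoted in the text


def propagation_rtt_ms(fiber_km: float) -> float:
    """Round-trip propagation delay only; no switching or queuing."""
    return 2 * FIBER_MS_PER_100KM_ONE_WAY * fiber_km / 100


def max_radius_km(network_rtt_budget_ms: float) -> float:
    """Largest fiber distance whose round trip fits the network budget."""
    return network_rtt_budget_ms / (2 * FIBER_MS_PER_100KM_ONE_WAY) * 100


# Hypothetical split: 10 ms of a 50 ms TTFT budget is spent on the network.
print(propagation_rtt_ms(3_000))  # round trip to a site 3,000 fiber-km away
print(max_radius_km(10))          # reach available inside a 10 ms budget
```

A site 3,000 fiber-km away spends 30 ms on propagation alone, and a 10 ms network allowance limits reach to about 1,000 fiber-km, before any switching or queuing is counted.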
The consequence is a different geographic footprint. Where training consolidates into a few gigawatt campuses, inference distributes into more, smaller, latency-sited regions, typically Tier-1 and Tier-2 metros within fiber reach of population centers. Every kilometer closer to users buys proximity, and you pay for it in energy cost (metro power runs 2-4x stranded-rural power) and in the operational overhead of running many sites instead of one. Site too centrally and you breach the SLO in distant regions; site too widely and you fragment your fleet below the scale where batching efficiency and utilization hold up. The fiber-and-latency screen that governs this is Chapter 3.6; the market-cluster playbook is Chapter 3.13; the reordered siting hierarchy is Chapter 3.1.
Disaggregated inference: prefill vs decode, and KV-cache as a resource
The defining serving architecture of 2026 is the recognition that the two phases of a request want different hardware. Prefill — processing the entire prompt to produce the first token and the initial KV-cache — is compute-bound: it saturates the accelerator's matrix engines and scales with prompt length. Decode — generating each subsequent token autoregressively — is memory-bandwidth-bound: each step reads the whole model and the growing KV-cache to emit one token, leaving the matrix engines mostly idle. Run both on the same GPU and they interfere: a long prefill stalls the decode stream of every other request sharing the device, and you cannot tune the hardware for either because it is doing both.
Disaggregated serving splits them onto separate pools — a prefill pool sized for compute, a decode pool sized for memory bandwidth and capacity — connected by a fast fabric that ships the KV-cache from prefill to decode. By early 2026 this is supported across every major open-source engine (vLLM, SGLang, TensorRT-LLM) and orchestrated by frameworks like NVIDIA Dynamo and llm-d, with NIXL the standard KV-transfer transport over RDMA/NVLink. The payoff is higher goodput per dollar and the ability to mix silicon — expensive compute-dense parts for prefill, cheaper memory-rich parts for decode. The cost is the disaggregation tax: every request now pays a KV-cache transfer across the fabric, which only nets out when the pools are large enough and the fabric fast enough that the transfer is cheaper than the interference it eliminates.
The facility consequence is that the KV-cache becomes a first-class, tiered resource — held in HBM where it is hottest, spilled to local NVMe and Ethernet-attached flash as it cools, reused across requests that share a prefix. That memory hierarchy is its own subsystem now, engineered in Chapter 9.7. For the building, disaggregation means the inference hall is no longer a uniform sea of identical nodes; it is a heterogeneous fleet whose pool ratios you tune to your prefill:decode mix — which, per the reasoning shift above, is moving.
Deep dive: when disaggregation pays — and when co-location wins
Disaggregation is not free goodput; it is a trade you only win at scale. The KV-cache that prefill produces can be large — gigabytes for a long-context request — and shipping it to a decode worker consumes fabric bandwidth and adds latency to TTFT. The trade pays when three conditions hold together: (1) request volume is high enough that prefill and decode pools each stay busy independently (a small fleet cannot keep two specialized pools utilized); (2) the prefill:decode mix is imbalanced enough that co-location wastes one resource (uniform short chat is the weakest case; long-prompt or long-decode reasoning is the strongest); and (3) the interconnect is fast enough — NVLink within a rack, high-bandwidth RDMA across racks — that the transfer is cheaper than the head-of-line blocking it removes.
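Condition (3) is easy to put numbers on. The sketch below estimates the time to ship one request's KV-cache from a prefill worker to a decode worker; the cache size, link bandwidths, and the 80% achievable-efficiency factor are all hypothetical assumptions, and real transfers can overlap with compute in ways this ignores.

```python
def kv_transfer_ms(kv_bytes: float, link_gbytes_per_s: float,
                   efficiency: float = 0.8) -> float:
    """Time to move a KV-cache over one link at an assumed fraction of line rate."""
    return kv_bytes / (link_gbytes_per_s * 1e9 * efficiency) * 1e3


kv_bytes = 4e9  # hypothetical long-context request: 4 GB of KV-cache

# 400 Gb/s RDMA NIC is 50 GB/s; 900 GB/s stands in for a scale-up link.
across_racks = kv_transfer_ms(kv_bytes, link_gbytes_per_s=50)
within_rack = kv_transfer_ms(kv_bytes, link_gbytes_per_s=900)
print(f"RDMA: {across_racks:.0f} ms, scale-up link: {within_rack:.1f} ms")
```

At these assumed rates the same cache adds about 100 ms to TTFT across racks and under 6 ms inside a scale-up domain, so the transfer only nets out if the head-of-line blocking it removes is larger than that.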
The consequence of getting it wrong cuts both ways. Disaggregate too small and you strand capacity in under-utilized pools and pay the transfer tax for no benefit — co-location would have been simpler and cheaper. Co-locate at scale and you cap your goodput on phase interference and forfeit the ability to run cheaper memory-rich silicon for decode. The 2026 default for large interactive fleets is to disaggregate; for small or batch-dominant fleets, co-location with continuous batching is often the better-scoped choice. The engine-level mechanics — chunked prefill as a middle path, KV-aware routing, the transfer-tax accounting — are in Chapter 10.11.
Oversubscription: the headroom training cannot touch
Two of inference's defining properties — loose coupling and uncorrelated per-request load — open headroom that a synchronous training job forbids, and capturing it is one of the clearest efficiency levers in the fleet.
Fabric. Most inference requests fit inside a single node or a small scale-up domain, so the back-end fabric does not carry the all-reduce traffic that forces training to 1:1 non-blocking. Inference runs comfortably at 2:1-3:1 oversubscription, and a 2:1 "optimized" fabric cuts back-end network cost by roughly 31% (SemiAnalysis; contested, single-source). Building a non-blocking fabric for an inference business spends that 31% on bisection bandwidth that never carries traffic — the exact anti-pattern named in Chapter 1.1. The money saved is better spent on geo-distribution and uptime. Topology and oversubscription are engineered in Chapter 8.5.
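The ratio itself is just downlink bandwidth over uplink bandwidth at a leaf switch. A minimal sketch, with hypothetical port counts and speeds:

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Leaf-switch oversubscription: host-facing bandwidth / fabric-facing bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Hypothetical leaf with 400G ports throughout.
non_blocking = oversubscription_ratio(32, 400, 32, 400)  # training-style 1:1
two_to_one = oversubscription_ratio(32, 400, 16, 400)
three_to_one = oversubscription_ratio(48, 400, 16, 400)
print(non_blocking, two_to_one, three_to_one)
```

Moving from 1:1 to 2:1 in this example halves the uplinks per leaf, and with them the spine ports and optics behind those uplinks, which is where the back-end savings come from.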
Power. Training's synchronous steps make thousands of GPUs draw their peak in lockstep, leaving only ~3% power-oversubscription headroom — the facility must provision near the synchronized peak. Inference's per-request peaks are uncorrelated, so the aggregate load is far smoother, opening ~21% headroom: you can provision more served capacity behind the same megawatts and reclaim power from low-priority work with simple priority-based capping (Uptime Institute Journal; arXiv power-profile studies). The consequence is real density-per-megawatt that training cannot match — but it must be engineered with capping and ride-through, because reasoning and agentic traffic make the per-request peaks lumpier than single-turn chat. The grid-facing transient behavior this implies is in Chapter 12.2's reliability frame.
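The mechanism behind that difference can be illustrated with a toy simulation. This is not a reproduction of the ~3% and ~21% figures: it assumes each server is either idle or at peak, uses a hypothetical idle draw of 30% of peak, and compares a fleet that switches state in lockstep with one where every server switches independently.

```python
import random


def peak_fraction(n_servers: int, steps: int, p_busy: float,
                  correlated: bool, idle_frac: float = 0.3,
                  seed: int = 0) -> float:
    """Highest aggregate draw observed, as a fraction of summed nameplate peak."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(steps):
        if correlated:
            # Synchronous job: every server is busy or idle together.
            busy = n_servers if rng.random() < p_busy else 0
        else:
            # Independent requests: each server flips its own coin.
            busy = sum(rng.random() < p_busy for _ in range(n_servers))
        draw = busy + idle_frac * (n_servers - busy)
        worst = max(worst, draw / n_servers)
    return worst


synchronized = peak_fraction(500, 500, p_busy=0.6, correlated=True)
independent = peak_fraction(500, 500, p_busy=0.6, correlated=False)
print(f"synchronized peak: {synchronized:.2f}, independent peak: {independent:.2f}")
```

In this toy model the synchronized fleet reaches its full summed peak, while the worst draw of the independent fleet stays well below it; the gap is the headroom that capping and ride-through let an operator provision against.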
Reliability and uptime: the posture flips
This is where inference and training are most starkly opposed, and where copying the training posture costs the most. A synchronous training job already tolerates failure by checkpoint-and-resume — a node dies, the job restarts from the last checkpoint, and 2N facility power buys nines the workload does not value (the over-provisioned-redundancy anti-pattern of Chapter 1.1). An always-on inference business is the inverse: every minute of outage is lost revenue and a breached SLA, and there is no checkpoint to resume from — the request is simply dropped and the user sees an error.
So the posture flips. Inference justifies 2N / Tier-IV-class facility power with N+1 cooling on standby — the ~26 min/yr unavailability of a Tier-IV target (99.995%) is a business requirement, not gold-plating. But the deeper move, and the one that distinguishes a well-scoped inference fleet, is to buy resilience at the fleet level rather than only the facility level: geo-redundancy. Because inference is stateless per-request and distributed for latency anyway, a failed site can fail over to another region, so the marginal availability of any single hall matters less than the redundancy of the whole footprint. The right design often spends on a second region before it spends on the last increment of single-site uptime — N+1 across sites can beat 2N within one. The reliability rethink that reframes facility availability against fleet goodput is Chapter 12.2.
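The arithmetic behind both claims is short. The sketch below converts an availability target into minutes of downtime per year and computes the availability of a footprint that survives as long as any one region is up. The second function assumes site failures are independent, failover is instantaneous, and any surviving region can absorb the full load, none of which holds perfectly in practice; the 99.9% per-site figure is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_min_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


def any_region_up(site_availability: float, regions: int) -> float:
    """Availability when service survives as long as one region is up,
    assuming independent failures and perfect failover."""
    return 1 - (1 - site_availability) ** regions


tier_iv = downtime_min_per_year(0.99995)  # single Tier-IV-class site
two_modest_sites = any_region_up(0.999, regions=2)
print(f"{tier_iv:.1f} min/yr, two 99.9% regions: {two_modest_sites:.6f}")
```

Under those idealized assumptions, two 99.9% regions reach about 99.9999% together, ahead of a single 99.995% hall, which is the sense in which N+1 across sites can beat 2N within one.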
GPU vs inference ASIC: tokens-per-watt, tokens-per-dollar
Training has effectively one merchant answer — the latest flagship GPU — because the workload rewards raw FLOPS, the largest scale-up domain, and the most mature collective-comms software. Inference is the first archetype where the silicon choice is a genuine, contestable fork, because the scoring metric changes: not peak FLOPS but tokens-per-watt and tokens-per-dollar at your real batch sizes, context lengths, and SLOs. On that scoreboard, purpose-built inference ASICs become competitive, and several are now in volume.
Google's Ironwood (TPU v7) is explicitly "the first TPU for the age of inference": 192 GiB of HBM3E at 7.4 TB/s per chip, 4,614 FP8 TFLOPS, scaling to 9,216-chip pods delivering 42.5 FP8 ExaFLOPS — a memory-bandwidth-rich design aimed squarely at the decode-heavy future. AWS's Inferentia/Trainium line and the hyperscaler XPUs (Maia, MTIA) target the same tokens-per-dollar objective for captive fleets. These are surveyed in Chapter 7.4; the selection-and-TCO methodology is Chapter 7.11.
An ASIC can win decisively on tokens-per-watt for a stable, high-volume model family — but it carries software-ecosystem and lock-in risk (the CUDA moat is real, and a custom part with a thin software stack can leave throughput on the table), and it is a bet that the model architecture you optimize for stays put. Against an inference market deflating ~10x/year and reasoning reshaping the prefill:decode mix, a part optimized for last year's workload can be the wrong part this year. The rational posture for most operators is to score the fork at their own batch sizes and context lengths — peak-FLOPS datasheets are nearly useless here — and to keep the silicon decision as reversible as the procurement decision, because both move faster than a facility's depreciation clock.
Deep dive: why peak FLOPS is the wrong number for inference silicon
A training buyer can almost get away with comparing peak FLOPS, because training keeps the matrix engines busy. An inference buyer cannot, because the workload's two phases stress different parts of the chip and the datasheet headline reflects neither at the operating point that matters. Decode — the dominant phase for reasoning and chat — is memory-bandwidth-bound: it emits one token per pass over the model weights plus the KV-cache, so it is gated by HBM bandwidth and capacity, and a chip with twice the FLOPS but the same bandwidth decodes no faster. This is precisely why an inference-era part like Ironwood leads on HBM (192 GiB at 7.4 TB/s) rather than on FLOPS, and why memory, not compute, is the figure of merit for the decode pool.
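The claim that doubling FLOPS does not speed up decode follows from a simple roofline bound. The sketch below uses the Ironwood bandwidth and FP8 throughput quoted in the text, but the model (70B parameters at one byte each), the per-sequence KV-cache size, the batch size, and the rule of thumb of about 2 FLOPs per parameter per token are assumptions for illustration; it ignores attention FLOPs, overlap, and kernel efficiency.

```python
def decode_step_ms(weight_bytes: float, kv_bytes_per_seq: float,
                   flops_per_token: float, batch: int,
                   hbm_bytes_per_s: float, peak_flops: float) -> float:
    """Roofline lower bound for one decode step over a batch of sequences."""
    # Each step streams the weights once plus every sequence's KV-cache.
    memory_s = (weight_bytes + batch * kv_bytes_per_seq) / hbm_bytes_per_s
    compute_s = batch * flops_per_token / peak_flops
    return max(memory_s, compute_s) * 1e3


# Hypothetical workload: 70B params at FP8, 1 GB of KV per sequence, batch 8.
workload = dict(weight_bytes=70e9, kv_bytes_per_seq=1e9,
                flops_per_token=2 * 70e9, batch=8)

base = decode_step_ms(**workload, hbm_bytes_per_s=7.4e12, peak_flops=4.614e15)
more_flops = decode_step_ms(**workload, hbm_bytes_per_s=7.4e12, peak_flops=2 * 4.614e15)
more_bandwidth = decode_step_ms(**workload, hbm_bytes_per_s=2 * 7.4e12, peak_flops=4.614e15)
print(f"{base:.2f} ms, {more_flops:.2f} ms, {more_bandwidth:.2f} ms")
```

At this operating point the step is bound by memory traffic, so doubling peak FLOPS leaves the step time unchanged while doubling HBM bandwidth halves it.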
Prefill is compute-bound and does reward FLOPS — but only at the prompt lengths and batch sizes you actually run, and the achievable fraction of peak (the realized MFU) varies enormously with the software stack and the kernel quality for your model shape. The consequence: the only defensible silicon comparison is a measured tokens-per-watt and tokens-per-dollar at your SLO, prefill:decode mix, and context distribution. Two parts with identical datasheets can differ by 2x in realized goodput once batching, quantization (FP8/FP4), and KV-cache behavior are accounted for. The benchmarking discipline and the cross-vendor MFU gaps are in Chapter 7.11; precision and quantization in Chapter 7.4.
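A measured comparison of that kind reduces to two ratios per candidate. The sketch below is one way to tabulate them; the two parts and all of their numbers are hypothetical, standing in for goodput and wall power you would measure at your own SLO, and for an all-in hourly cost you would compute yourself.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    tokens_per_s: float   # measured goodput at the SLO, not a datasheet figure
    watts: float          # measured wall power at that operating point
    usd_per_hour: float   # all-in hourly cost (hypothetical)

    @property
    def tokens_per_s_per_watt(self) -> float:
        return self.tokens_per_s / self.watts

    @property
    def tokens_per_usd(self) -> float:
        return self.tokens_per_s * 3600 / self.usd_per_hour


candidates = [
    Measurement("part A", tokens_per_s=12_000, watts=1_000, usd_per_hour=4.0),
    Measurement("part B", tokens_per_s=9_000, watts=600, usd_per_hour=2.5),
]

best_per_watt = max(candidates, key=lambda m: m.tokens_per_s_per_watt)
best_per_usd = max(candidates, key=lambda m: m.tokens_per_usd)
for m in candidates:
    print(m.name, m.tokens_per_s_per_watt, m.tokens_per_usd)
```

In this made-up comparison part A has the higher raw throughput, yet part B wins on both tokens-per-watt and tokens-per-dollar, which is the outcome a peak-throughput ranking would hide.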
Anti-patterns
The recurring inference mis-scopes all trace to one root cause: designing the inference fleet as a de-rated training cluster instead of from its own objective function. Four are worth naming:
- Training fabric for an inference business. A 1:1 non-blocking back-end for a workload whose requests fit inside a node wastes ~31% of back-end cost on bisection bandwidth that never carries traffic. Oversubscribe 2:1-3:1 and spend the savings on geo-distribution and uptime.
- Sizing the whole fleet to single-turn chat. Reasoning and agentic traffic multiply the per-request token budget and explode KV-cache pressure; a fleet sized on old token-per-request assumptions is undersized on memory and decode by roughly that same multiple. Size to the decode-heavy base case.
- Centralizing inference for power economics. Consolidating an interactive fleet onto one cheap-power gigawatt campus breaches the latency SLO for distant users — the speed of light does not negotiate. Distribute toward demand and pay the energy premium knowingly.
- Buying peak FLOPS instead of tokens-per-watt. Selecting inference silicon on datasheet FLOPS ignores that decode is memory-bound; the chip that wins on paper can lose 2x on realized goodput at your batch sizes and context lengths.