
Chapter 1.3

Inference Data Centers: Bursty, Distributed, Always-On

An inference data center is not a smaller training cluster — it is a different machine optimized for a different objective: many independent requests served against a latency SLO, always-on, close to users, with goodput-per-dollar and tokens-per-watt as the scoreboard rather than the speed of a single synchronous job.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Which inference sub-mode dominates your facility — interactive/online, batch/offline, or agentic/long-context — because each sets a different SLO, a different prefill:decode mix, and a different fleet-sizing math.
Whether to disaggregate prefill from decode (and pay the KV-cache transfer tax) or co-locate them — the single architectural fork that most shapes accelerator mix, fabric, and goodput in a 2026 serving stack.
How far to oversubscribe power and fabric: inference's uncorrelated per-request peaks open ~21% power headroom and 2:1-3:1 fabric oversubscription that training cannot touch — money you either capture or strand.
Where to site the fleet against a latency budget, and how many regions: proximity-to-users, not cheap stranded power, is the inference siting driver, and it trades directly against energy cost.
GPU vs inference-ASIC selection scored on tokens-per-watt and tokens-per-dollar at your real batch sizes and context lengths — not peak FLOPS — because the deflation of token prices punishes a wrong silicon bet fast.

For most operators, inference is the business. Training builds the asset; inference is what earns against it, request by request, token by token, every hour of every day. By 2026 the compute has followed the money: inference is roughly two-thirds of all AI compute — up from about half in 2025 and a third in 2023 — and at large operators it is 80-90% of the served draw (Deloitte TMT Predictions 2026). Yet the reflex in the industry is still to design the inference fleet as if it were a training cluster with the dial turned down. That reflex is expensive. An inference data center optimizes a fundamentally different objective function, and nearly every subsystem decision flips sign because of it.

This chapter is the inference half of the master fork introduced in Chapter 1.1: training-shaped facilities optimize one tightly-coupled job; inference-shaped facilities optimize many independent requests against a latency SLO. We work the consequences of that one difference all the way down — the inference taxonomy and the economics shift, the reasoning-driven demand multiplier that is reshaping the decode-heavy future, latency-driven siting, the prefill/decode disaggregation that has become the defining 2026 serving pattern, the reliability and uptime posture, and the GPU-vs-ASIC selection that is finally a real fork. The serving engineering — batching, scheduling, the disaggregation tax in detail — is owned by Chapter 10.11; here we cover what those choices mean for the building.

The inference taxonomy: three sub-modes, three design bases

"Inference" is not one workload. It is at least three, and they pull the facility in different directions. The mistake that recurs is treating a fleet as homogeneous when its dominant sub-mode silently dictates the SLO, the prefill:decode ratio, the memory hierarchy, and the fleet size.

Interactive / online inference is a human (or an interactive agent) waiting on the output in real time. It is governed by two latency metrics — time-to-first-token (TTFT), set by the prefill of the prompt, and time-per-output-token (TPOT), set by decode — and it is bursty and always-on: traffic can swing from 30% to 90% of capacity in minutes on a diurnal-plus-spike pattern. The design basis is high availability, proximity to users, and enough headroom to absorb the peak without breaching the SLO.

Batch / offline inference — embeddings generation, document and corpus processing, synthetic-data creation, evaluation sweeps, nightly re-scoring — has no user waiting. It is throughput-bound, not latency-bound, so it tolerates queuing, interruption, and aggressive oversubscription, and it is the natural consumer of spot capacity, off-peak power, and curtailable interconnections. It is the cheapest inference to host and the most flexible to schedule, which makes it the load you shift to soak up the headroom the interactive fleet leaves on the table.

Agentic / long-context inference is the fast-growing third mode and the one that breaks naive sizing. An agent issues many model calls per user action, carries long and growing context (tool outputs, retrieved documents, prior turns reaching toward 1M+ tokens), and interleaves reasoning with tool use. It inflates the prefill share (huge prompts), explodes the KV-cache footprint (long context held live across many concurrent sessions), and shifts the GPU:CPU ratio toward more host work for orchestration and tool calls. A fleet sized on single-turn chat assumptions is undersized for agents on both memory and prefill compute.
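
To see why long context breaks single-turn sizing, a rough KV-cache estimate helps. The sketch below uses the usual per-token accounting (one key and one value vector per layer per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical chosen for illustration, not a figure from this chapter.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV-cache held live for `tokens` tokens of one session.

    The factor 2 covers one key vector plus one value vector
    per layer per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical model shape (assumption for illustration only).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

GIB = 2**30
chat_gib = kv_cache_bytes(4_000, **SHAPE) / GIB        # single-turn chat
agent_gib = kv_cache_bytes(1_000_000, **SHAPE) / GIB   # 1M-token agent session

print(f"chat session:  {chat_gib:.1f} GiB")
print(f"agent session: {agent_gib:.1f} GiB")
```

Under these assumed numbers a single 1M-token session holds about 305 GiB of KV-cache against roughly 1.2 GiB for a 4,000-token chat turn, which is why agent traffic turns memory into the sizing constraint well before request counts change.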

The inference fork: what is actually waiting on the output?

Before you size anything, answer one question for each sub-mode in your mix: is a user waiting, and on what budget? If yes and the budget is sub-second TTFT, you have bought a latency-first building — geo-distributed, high-availability, headroom-rich, and willing to pay 2-4x more for power to sit near users. If no one is waiting, you have bought a throughput-first building — consolidate it on the cheapest curtailable power you can find and oversubscribe everything. Most fleets are a weighted blend, but the dominant sub-mode sets the design basis, exactly as the dominant archetype does in Chapter 1.1. Sizing the whole fleet to the strictest SLO when only 20% of traffic needs it is the inference equivalent of building a training cluster's fabric for a chat workload — over-provisioned nines the revenue does not value.

The economics shift: inference is the revenue workload

The structural fact of 2026 is that inference is now the larger and faster-growing share of AI infrastructure, and it grows differently from training. McKinsey's base case has inference capacity rising from ~20.9 GW to ~93.3 GW by 2030 — a ~35% CAGR — against training's ~23.1 GW to ~62.2 GW at ~22%. The crossover has happened: inference is both the bigger pool and the steeper curve. The consequence for siting and procurement is direct. Training capacity concentrates in a handful of gigawatt campuses chasing cheap firm power; inference capacity distributes toward demand, into more, smaller, latency-sited halls. As the trade press puts it, training built the campuses; inference will choose the markets.
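
The growth rates above can be checked directly from the endpoint capacities. The sketch below assumes a five-year horizon to 2030; that horizon is an assumption, chosen because it reproduces the quoted ~35% and ~22% rates, and the function names are illustrative.

```python
import math

YEARS = 5  # assumption: five-year horizon ending in 2030


def cagr(start_gw, end_gw, years):
    """Compound annual growth rate between two capacity figures."""
    return (end_gw / start_gw) ** (1 / years) - 1


inference_rate = cagr(20.9, 93.3, YEARS)
training_rate = cagr(23.1, 62.2, YEARS)

# Years after the baseline at which the two exponential curves meet:
# 20.9 * (1 + ri)^t == 23.1 * (1 + rt)^t, solved for t.
crossover_years = math.log(23.1 / 20.9) / math.log(
    (1 + inference_rate) / (1 + training_rate)
)

print(f"inference CAGR: {inference_rate:.1%}")
print(f"training CAGR:  {training_rate:.1%}")
print(f"curves meet after {crossover_years:.2f} years")
```

On these figures the two pools draw level roughly one year after the baseline, and inference is the larger pool from then on.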

The other structural fact is deflation. The market price of a million tokens has fallen on the order of ~10x per year for a given capability tier (LLMflation; Introl/a16z synthesis) — a Jevons-paradox dynamic where unit cost collapses while aggregate spend rises because demand more than compensates. For an inference operator this is the dominant business risk: a fleet scoped to today's $/Mtoken can be underwater in a year if its tokens-per-dollar does not improve on the same curve. That is why the inference design basis fixates on efficiency per token — tokens-per-watt and tokens-per-dollar at real batch sizes — rather than peak throughput. The full unit-economics build-up and the deflation risk are scored in Chapter 1.8.
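
A quick way to feel the deflation risk is to ask how long a margin survives. The sketch below assumes smooth exponential decay for both market price and operator cost, which is a modelling simplification; the $2.50 and $1.90 inputs are the market-average and self-hosted figures from this chapter's key numbers, and the function name is hypothetical.

```python
import math


def months_until_underwater(price, cost, price_drop_per_year=10.0,
                            cost_drop_per_year=1.0):
    """Months until the market price per Mtoken falls to the operator's cost.

    A drop of 10.0 means "10x cheaper per year"; 1.0 means flat.
    Returns math.inf if cost falls at least as fast as price.
    """
    if cost >= price:
        return 0.0
    relative_decay = math.log(price_drop_per_year) - math.log(cost_drop_per_year)
    if relative_decay <= 0:
        return math.inf
    return 12 * math.log(price / cost) / relative_decay


flat_cost = months_until_underwater(2.50, 1.90)                            # cost never improves
improving = months_until_underwater(2.50, 1.90, cost_drop_per_year=5.0)    # cost falls 5x/yr
keeping_up = months_until_underwater(2.50, 1.90, cost_drop_per_year=10.0)  # cost matches price

print(f"flat cost: {flat_cost:.1f} months; 5x/yr cost drop: {improving:.1f} months")
```

With a flat cost base the margin in this toy model is gone in about a month and a half; it only persists indefinitely when cost per token falls as fast as price.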

Inference sub-mode → requirements cascade

Sub-mode	What waits	Prefill:decode tilt	KV-cache pressure	Fabric / power oversubscription	Redundancy	Siting driver
Interactive / online	A user, on TTFT + TPOT (sub-second to seconds)	Balanced, decode-heavy for reasoning	High — many live sessions held concurrently	Fabric 2:1-3:1; power up to ~21% headroom	2N / Tier-IV-class + N+1 cooling; geo-redundant	Sub-50 ms proximity to users; latency-first
Batch / offline	Nothing — throughput-bound	Prefill-heavy (large corpora, short outputs)	Low-moderate; reuse/prefix caching helps	Heavily oversubscribed; cost-optimized	N — queue-and-retry, interruption-tolerant	Cheapest curtailable power; off-peak
Agentic / long-context	A user or pipeline, across many chained calls	Prefill-heavy + long decode; tool-call CPU work	Very high — 1M+ token context held live	Fabric moderate; power lumpy, harder to cap	High availability; session-state durability matters	Proximity + KV-storage tier near compute

How the dominant inference sub-mode propagates into the rest of the fleet. Latency thresholds and oversubscription figures are 2026-current; see the key numbers below for sources and vintages.

The table is a cascade, like the one in Chapter 1.1: the left two columns are what you choose; everything to the right is consequence. Choose an interactive-dominant fleet and you have implicitly chosen geo-distribution, 2N-class uptime, a high KV-cache budget, and a willingness to oversubscribe power but not availability. Choose batch-dominant and you have chosen the opposite building — consolidated, curtailable, interruption-tolerant. The agentic column is the one most operators under-provision today, because it looks like interactive chat until the context lengths and call counts reveal a memory-and-prefill problem the single-turn sizing never anticipated.

Reasoning and test-time compute: the demand multiplier

The largest force reshaping inference demand since 2025 is the rise of reasoning models that spend compute at inference time to improve answers — the post-o1, post-R1 paradigm. A reasoning model emits a long internal chain of thought before its visible answer, turning a request that once decoded a few hundred tokens into one that may decode tens of thousands. This is test-time compute, and it has three structural consequences for the fleet that compound:

The prefill:decode ratio shifts toward decode. Long chains of thought are autoregressive generation, which is memory-bandwidth-bound, not compute-bound. Decode now dominates the token budget for reasoning traffic, which changes which silicon and which memory hierarchy you want — and tilts the disaggregation math below.
KV-cache pressure inflates. Every token of context and every token generated must keep its keys and values resident for attention. Long-decode reasoning, multiplied across many concurrent sessions, makes the KV-cache a first-class capacity constraint — often the binding one — not an afterthought. This is the demand driver behind the new inference-memory tier in Chapter 9.7.
Effective demand per request multiplies. If the average request decodes 10-50x more tokens, a fleet sized on the old token-per-request assumption is undersized by roughly the same factor at constant request volume. Reasoning is, in effect, a demand multiplier hiding inside a flat user count.
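
The third consequence is plain arithmetic, sketched below. The request rate, token counts, and per-accelerator decode throughput are hypothetical inputs, and the model ignores batching effects and prefill entirely.

```python
import math


def accelerators_needed(requests_per_s, decode_tokens_per_request,
                        tokens_per_s_per_accelerator):
    """Decode accelerators needed to sustain a steady request rate."""
    demand_tokens_per_s = requests_per_s * decode_tokens_per_request
    return math.ceil(demand_tokens_per_s / tokens_per_s_per_accelerator)


# Same user traffic, before and after a 20x growth in decoded tokens
# (20x sits inside the 10-50x range discussed above).
single_turn = accelerators_needed(100, 300, 2_000)    # 15
reasoning = accelerators_needed(100, 6_000, 2_000)    # 300

print(single_turn, reasoning, reasoning / single_turn)
```

The request count never moved, yet the fleet requirement grew by the same factor as the per-request token budget.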

The strategic read: treat the decode-heavy future as the base case, not a tail risk. It argues for memory-bandwidth-rich silicon, a deep KV-cache hierarchy, and fleet headroom for a per-request token budget that keeps climbing. The serving-engineering levers that absorb this — continuous batching, chunked prefill, speculative decoding — live in Chapter 10.11.

~2/3

inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80-90% of draw at large operators

the money is in serving — plan capacity around the workload that bills customers

2026 · Deloitte TMT Predictions 2026

20.9 → 93.3 GW

AI inference capacity to 2030 (~35% CAGR) vs training 23.1 → 62.2 GW (~22%)

inference is the faster-growing market — bet your build on serving capacity, not training

2026 · McKinsey, 'The next big shifts in AI workloads'

>$50B

market for inference-optimized chips in 2026; most inference stays in data centers, not at the edge

the chip market you're buying into — and a sign the edge won't absorb serving load

2026 · Deloitte TMT Predictions 2026

~21% vs ~3%

power-oversubscription headroom: inference (uncorrelated per-request peaks) vs training (synchronous peaks)

inference safely sells more compute per megawatt — capacity training can't unlock

2026 · Uptime Institute Journal; arXiv power-profile studies

2:1-3:1

inference fabric oversubscription (vs 1:1 non-blocking for training); 2:1 cuts back-end cost ~31% (contested — single-source)

right-sizing the inference fabric strips roughly 31% off the back-end network capex you would otherwise overbuild

2025 · SemiAnalysis AI Neocloud Playbook; Juniper

192 GiB / 7.4 TB/s

HBM3E per Ironwood TPU v7 (inference-era ASIC); 9,216-chip pods, 42.5 FP8 ExaFLOPS, 4,614 FP8 TFLOPS/chip

purpose-built inference ASICs are a real alternative to GPUs — and a vendor-lock-in fork

2025 · Google Cloud; SemiAnalysis

~$1.90 vs ~$2.50/M tok

self-hosted vs market-avg inference cost per million tokens; ~10x/yr token-price deflation (LLMflation)

your price falls ~10x a year — today's margin evaporates unless cost falls faster

2025 · Introl / NVIDIA synthesis; a16z

Tier IV ~26 min/yr

inference uptime target (99.995%) vs training's checkpoint-tolerant N/N+1 posture

serving a paying SLA demands costlier redundancy than training ever needed

2025 · Uptime Institute (Tier classes)

Latency-driven siting and regional distribution

Training siting chases the cheapest firm megawatt and the coldest free-cooling climate, indifferent to where users are. Inference inverts the hierarchy: the binding constraint is the latency budget back to the user, and that turns siting into a coverage problem. Interactive inference lives or dies on the 30/50/100 ms perceptibility thresholds (treated in full for the edge in Chapter 1.5); a centralized fleet two time-zones from its users cannot meet a sub-50 ms TTFT no matter how fast the silicon, because the speed of light in fiber is not negotiable — roughly 0.5 ms per 100 km one-way (about 1 ms round-trip), before any switching or queuing.
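
The fiber figure above turns a latency budget into a maximum route length. The sketch below is a simple budget subtraction; the prefill time and the switching-and-queuing overhead are hypothetical inputs, and real fiber routes run longer than straight-line distance.

```python
FIBER_MS_PER_100KM_ONE_WAY = 0.5  # propagation figure quoted in the text


def fiber_rtt_ms(route_km):
    """Round-trip propagation delay over a fiber route, in milliseconds."""
    return 2 * FIBER_MS_PER_100KM_ONE_WAY * route_km / 100


def max_route_km(ttft_budget_ms, prefill_ms, overhead_ms):
    """Longest fiber route that still fits the TTFT budget (0 if none fits)."""
    remaining_ms = ttft_budget_ms - prefill_ms - overhead_ms
    return max(0.0, remaining_ms / (2 * FIBER_MS_PER_100KM_ONE_WAY) * 100)


# Hypothetical budget: 50 ms TTFT, 30 ms prefill, 10 ms switching and queuing.
reach_km = max_route_km(ttft_budget_ms=50, prefill_ms=30, overhead_ms=10)
print(reach_km, fiber_rtt_ms(reach_km))
```

With those inputs only 10 ms is left for propagation, which caps the route at about 1,000 km; a faster prefill widens the radius, but no silicon removes the propagation term.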

The consequence is a different geographic footprint. Where training consolidates into a few gigawatt campuses, inference distributes into more, smaller, latency-sited regions, typically Tier-1 and Tier-2 metros within fiber reach of population centers. Every kilometer closer to users buys proximity, and you pay for it in energy cost (metro power runs 2-4x stranded-rural power) and in the operational overhead of running many sites instead of one. Site too centrally and you breach the SLO in distant regions; site too widely and you fragment your fleet below the scale where batching efficiency and utilization hold up. The fiber-and-latency screen that governs this is Chapter 3.6; the market-cluster playbook is Chapter 3.13; the reordered siting hierarchy is Chapter 3.1.

Disaggregated inference: prefill vs decode, and KV-cache as a resource

The defining serving architecture of 2026 is the recognition that the two phases of a request want different hardware. Prefill — processing the entire prompt to produce the first token and the initial KV-cache — is compute-bound: it saturates the accelerator's matrix engines and scales with prompt length. Decode — generating each subsequent token autoregressively — is memory-bandwidth-bound: each step reads the whole model and the growing KV-cache to emit one token, leaving the matrix engines mostly idle. Run both on the same GPU and they interfere: a long prefill stalls the decode stream of every other request sharing the device, and you cannot tune the hardware for either because it is doing both.

Disaggregated serving splits them onto separate pools — a prefill pool sized for compute, a decode pool sized for memory bandwidth and capacity — connected by a fast fabric that ships the KV-cache from prefill to decode. By early 2026 this is supported across every major open-source engine (vLLM, SGLang, TensorRT-LLM) and orchestrated by frameworks like NVIDIA Dynamo and llm-d, with NIXL the standard KV-transfer transport over RDMA/NVLink. The payoff is higher goodput per dollar and the ability to mix silicon — expensive compute-dense parts for prefill, cheaper memory-rich parts for decode. The cost is the disaggregation tax: every request now pays a KV-cache transfer across the fabric, which only nets out when the pools are large enough and the fabric fast enough that the transfer is cheaper than the interference it eliminates.

The facility consequence is that the KV-cache becomes a first-class, tiered resource — held in HBM where it is hottest, spilled to local NVMe and Ethernet-attached flash as it cools, reused across requests that share a prefix. That memory hierarchy is its own subsystem now, engineered in Chapter 9.7. For the building, disaggregation means the inference hall is no longer a uniform sea of identical nodes; it is a heterogeneous fleet whose pool ratios you tune to your prefill:decode mix — which, per the reasoning shift above, is moving.

Deep dive: when disaggregation pays — and when co-location wins

Disaggregation is not free goodput; it is a trade you only win at scale. The KV-cache that prefill produces can be large — gigabytes for a long-context request — and shipping it to a decode worker consumes fabric bandwidth and adds latency to TTFT. The trade pays when three conditions hold together: (1) request volume is high enough that prefill and decode pools each stay busy independently (a small fleet cannot keep two specialized pools utilized); (2) the prefill:decode mix is imbalanced enough that co-location wastes one resource (uniform short chat is the weakest case; long-prompt or long-decode reasoning is the strongest); and (3) the interconnect is fast enough — NVLink within a rack, high-bandwidth RDMA across racks — that the transfer is cheaper than the head-of-line blocking it removes.
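
The three conditions can be folded into a rough screening check. The sketch below is one illustrative way to do that, not a method from any serving engine: the utilization threshold, link bandwidth, and stall estimate are all hypothetical inputs, and condition (2) enters only indirectly, through how large the avoided stall is for your mix.

```python
def kv_transfer_ms(kv_bytes, link_gbytes_per_s):
    """Time to ship one request's KV-cache across the fabric, in ms."""
    return kv_bytes / (link_gbytes_per_s * 1e9) * 1e3


def disaggregation_pays(kv_bytes, link_gbytes_per_s, avoided_stall_ms,
                        prefill_util, decode_util, min_util=0.6):
    """Screen for disaggregation: both pools stay busy, and the KV transfer
    costs less time than the head-of-line blocking it removes."""
    pools_busy = min(prefill_util, decode_util) >= min_util
    transfer_cheaper = kv_transfer_ms(kv_bytes, link_gbytes_per_s) < avoided_stall_ms
    return pools_busy and transfer_cheaper


# Hypothetical: 2 GB of KV-cache over a 50 GB/s link, 120 ms of stall avoided.
large_fleet = disaggregation_pays(2e9, 50, 120, prefill_util=0.8, decode_util=0.7)
small_fleet = disaggregation_pays(2e9, 50, 120, prefill_util=0.3, decode_util=0.4)
slow_fabric = disaggregation_pays(2e9, 10, 120, prefill_util=0.8, decode_util=0.7)

print(large_fleet, small_fleet, slow_fabric)
```

The same request wins on the large fleet and loses on the small one (idle pools) and on the slow fabric (a 200 ms transfer against 120 ms of avoided stall), matching the both-ways failure described next.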

The consequence of getting it wrong cuts both ways. Disaggregate too small and you strand capacity in under-utilized pools and pay the transfer tax for no benefit — co-location would have been simpler and cheaper. Co-locate at scale and you cap your goodput on phase interference and forfeit the ability to run cheaper memory-rich silicon for decode. The 2026 default for large interactive fleets is to disaggregate; for small or batch-dominant fleets, co-location with continuous batching is often the better-scoped choice. The engine-level mechanics — chunked prefill as a middle path, KV-aware routing, the transfer-tax accounting — are in Chapter 10.11.

Oversubscription: the headroom training cannot touch

Two of inference's defining properties — loose coupling and uncorrelated per-request load — open headroom that a synchronous training job forbids, and capturing it is one of the clearest efficiency levers in the fleet.

Fabric. Most inference requests fit inside a single node or a small scale-up domain, so the back-end fabric does not carry the all-reduce traffic that forces training to 1:1 non-blocking. Inference runs comfortably at 2:1-3:1 oversubscription, and a 2:1 "optimized" fabric cuts back-end network cost by roughly 31% (SemiAnalysis; contested, single-source). Building a non-blocking fabric for an inference business spends that 31% on bisection bandwidth that never carries traffic — the exact anti-pattern named in Chapter 1.1. The money saved is better spent on geo-distribution and uptime. Topology and oversubscription are engineered in Chapter 8.5.
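
For readers less familiar with the ratio, oversubscription at a leaf switch is host-facing bandwidth divided by fabric-facing bandwidth. The sketch below uses hypothetical port counts and 400G links purely for illustration.

```python
def oversubscription_ratio(downlink_count, uplink_count,
                           downlink_gbps=400, uplink_gbps=400):
    """Host-facing bandwidth divided by fabric-facing bandwidth at a leaf."""
    return (downlink_count * downlink_gbps) / (uplink_count * uplink_gbps)


training_leaf = oversubscription_ratio(32, 32)    # 1.0, non-blocking
inference_leaf = oversubscription_ratio(32, 16)   # 2.0
dense_leaf = oversubscription_ratio(48, 16)       # 3.0

print(training_leaf, inference_leaf, dense_leaf)
```

Moving from 1:1 to 2:1 halves the uplinks per leaf and shrinks the spine tier behind them, which is where the back-end saving quoted above would come from.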

Power. Training's synchronous steps make thousands of GPUs draw their peak in lockstep, leaving only ~3% power-oversubscription headroom — the facility must provision near the synchronized peak. Inference's per-request peaks are uncorrelated, so the aggregate load is far smoother, opening ~21% headroom: you can provision more served capacity behind the same megawatts and reclaim power from low-priority work with simple priority-based capping (Uptime Institute Journal; arXiv power-profile studies). The consequence is real density-per-megawatt that training cannot match — but it must be engineered with capping and ride-through, because reasoning and agentic traffic make the per-request peaks lumpier than single-turn chat. The grid-facing transient behavior this implies is in Chapter 12.2's reliability frame.
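
The statistical reason uncorrelated peaks open headroom can be shown with a toy duty-cycle simulation. This is a sketch only: the duty cycle and per-GPU draw are hypothetical, the model is not calibrated to the ~21% and ~3% figures above, and it is not the method behind the cited studies.

```python
import random


def peak_aggregate_kw(n_gpus, steps, correlated, duty=0.6,
                      peak_kw=1.0, idle_kw=0.2, seed=0):
    """Worst aggregate draw seen over `steps` samples of a toy duty-cycle model."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(steps):
        if correlated:
            # Training-like: every GPU is busy or idle in lockstep.
            busy = n_gpus if rng.random() < duty else 0
        else:
            # Inference-like: each GPU is busy independently of the others.
            busy = sum(rng.random() < duty for _ in range(n_gpus))
        worst = max(worst, busy * peak_kw + (n_gpus - busy) * idle_kw)
    return worst


sync_peak = peak_aggregate_kw(1000, 500, correlated=True)
uncorrelated_peak = peak_aggregate_kw(1000, 500, correlated=False)
headroom = sync_peak / uncorrelated_peak - 1

print(f"lockstep peak: {sync_peak:.0f} kW, uncorrelated peak: {uncorrelated_peak:.0f} kW")
print(f"extra capacity behind the same feed: {headroom:.0%}")
```

The lockstep fleet hits the sum of all individual peaks, while the uncorrelated fleet's worst observed draw stays well below it. Lumpier per-request peaks, as with reasoning and agentic traffic, narrow that gap, which is why capping still has to be engineered.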

Reliability and uptime: the posture flips

This is where inference and training are most starkly opposed, and where copying the training posture costs the most. A synchronous training job already tolerates failure by checkpoint-and-resume — a node dies, the job restarts from the last checkpoint, and 2N facility power buys nines the workload does not value (the over-provisioned-redundancy anti-pattern of Chapter 1.1). An always-on inference business is the inverse: every minute of outage is lost revenue and a breached SLA, and there is no checkpoint to resume from — the request is simply dropped and the user sees an error.

So the posture flips. Inference justifies 2N / Tier-IV-class facility power with N+1 cooling on standby — the ~26 min/yr unavailability of a Tier-IV target (99.995%) is a business requirement, not gold-plating. But the deeper move, and the one that distinguishes a well-scoped inference fleet, is to buy resilience at the fleet level rather than only the facility level: geo-redundancy. Because inference is stateless per-request and distributed for latency anyway, a failed site can fail over to another region, so the marginal availability of any single hall matters less than the redundancy of the whole footprint. The right design often spends on a second region before it spends on the last increment of single-site uptime — N+1 across sites can beat 2N within one. The reliability rethink that reframes facility availability against fleet goodput is Chapter 12.2.

Geo-redundancy is cheaper resilience than the last nine

Because online inference is already distributed for latency and is stateless per request, the fleet has a resilience lever training lacks: cross-region failover. The consequence is a budgeting insight — past a point, dollars spent on geo-redundancy buy more effective availability than dollars spent driving a single hall from N+1 toward 2N. A request that would have errored in a failed Virginia site is instead served from Ohio with a few extra milliseconds of latency. Operators who internalize this stop scoring availability per-facility and start scoring it across the footprint — which also relaxes the redundancy required at any one site and frees capital for more regions, more silicon, or deeper KV-cache tiers. The cross-region fabric that makes this real is in Chapter 3.6; the goodput-vs-availability accounting is in Chapter 12.2.
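
The budgeting insight rests on simple availability arithmetic, sketched below. It assumes site failures are independent, failover is instantaneous, and the surviving sites have spare capacity to absorb the load; all three are idealizations, and the 99.9% per-site figure is a hypothetical input.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_min_per_year(availability):
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def fleet_availability(site_availability, sites):
    """Availability when any one of `sites` independent sites can serve a request."""
    return 1 - (1 - site_availability) ** sites


tier_iv_single = downtime_min_per_year(0.99995)
two_modest_sites = downtime_min_per_year(fleet_availability(0.999, 2))

print(f"one Tier-IV-class site: {tier_iv_single:.1f} min/yr")
print(f"two 99.9% sites:        {two_modest_sites:.2f} min/yr")
```

Under those idealized assumptions, two sites at 99.9% each come out ahead of a single 99.995% hall. Correlated failures and failover time erode that advantage in practice, so the comparison is a screen, not a guarantee.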

GPU vs inference ASIC: tokens-per-watt, tokens-per-dollar

Training has effectively one merchant answer — the latest flagship GPU — because the workload rewards raw FLOPS, the largest scale-up domain, and the most mature collective-comms software. Inference is the first archetype where the silicon choice is a genuine, contestable fork, because the scoring metric changes: not peak FLOPS but tokens-per-watt and tokens-per-dollar at your real batch sizes, context lengths, and SLOs. On that scoreboard, purpose-built inference ASICs become competitive, and several are now in volume.

Google's Ironwood (TPU v7) is explicitly "the first TPU for the age of inference": 192 GiB of HBM3E at 7.4 TB/s per chip, 4,614 FP8 TFLOPS, scaling to 9,216-chip pods delivering 42.5 FP8 ExaFLOPS — a memory-bandwidth-rich design aimed squarely at the decode-heavy future. AWS's Inferentia/Trainium line and the hyperscaler XPUs (Maia, MTIA) target the same tokens-per-dollar objective for captive fleets. These are surveyed in Chapter 7.4; the selection-and-TCO methodology is Chapter 7.11.

An ASIC can win decisively on tokens-per-watt for a stable, high-volume model family — but it carries software-ecosystem and lock-in risk (the CUDA moat is real, and a custom part with a thin software stack can leave throughput on the table), and it is a bet that the model architecture you optimize for stays put. Against an inference market deflating ~10x/year and reasoning reshaping the prefill:decode mix, a part optimized for last year's workload can be the wrong part this year. The rational posture for most operators is to score the fork at their own batch sizes and context lengths — peak-FLOPS datasheets are nearly useless here — and to keep the silicon decision as reversible as the procurement decision, because both move faster than a facility's depreciation clock.

Deep dive: why peak FLOPS is the wrong number for inference silicon

A training buyer can almost get away with comparing peak FLOPS, because training keeps the matrix engines busy. An inference buyer cannot, because the workload's two phases stress different parts of the chip and the datasheet headline reflects neither at the operating point that matters. Decode — the dominant phase for reasoning and chat — is memory-bandwidth-bound: it emits one token per pass over the model weights plus the KV-cache, so it is gated by HBM bandwidth and capacity, and a chip with twice the FLOPS but the same bandwidth decodes no faster. This is precisely why an inference-era part like Ironwood leads on HBM (192 GiB at 7.4 TB/s) rather than on FLOPS, and why memory, not compute, is the figure of merit for the decode pool.
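
The bandwidth-bound argument can be written as a one-line ceiling. The sketch below is a simplified roofline-style bound, not a performance model: it assumes each decode step streams the weights once for the whole batch plus each sequence's KV-cache, and it ignores compute time, overlap, and multi-chip sharding. The 70 GB weight footprint and 10 GB per-sequence KV-cache are hypothetical; the 7.4 TB/s figure is the Ironwood bandwidth quoted above.

```python
def decode_tokens_per_s_ceiling(hbm_bytes_per_s, weight_bytes,
                                kv_bytes_per_seq, batch=1):
    """Bandwidth-bound ceiling on decode throughput for one chip.

    Each step reads the weights once (shared by the batch) plus the
    KV-cache of every sequence, and emits one token per sequence.
    Peak FLOPS does not appear anywhere in the bound.
    """
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq
    return batch * hbm_bytes_per_s / bytes_per_step


HBM_BW = 7.4e12   # bytes/s, figure quoted in the text
WEIGHTS = 70e9    # hypothetical: 70 GB of weights
KV = 10e9         # hypothetical: 10 GB of KV-cache per sequence

single = decode_tokens_per_s_ceiling(HBM_BW, WEIGHTS, KV, batch=1)
batched = decode_tokens_per_s_ceiling(HBM_BW, WEIGHTS, KV, batch=8)
double_bw = decode_tokens_per_s_ceiling(2 * HBM_BW, WEIGHTS, KV, batch=1)

print(f"{single:.1f} tok/s at batch 1, {batched:.1f} tok/s at batch 8")
```

Doubling bandwidth doubles the ceiling, while doubling FLOPS changes nothing in this bound. Batching helps by amortizing the weight read, until the KV-cache term dominates.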

Prefill is compute-bound and does reward FLOPS — but only at the prompt lengths and batch sizes you actually run, and the achievable fraction of peak (the realized MFU) varies enormously with the software stack and the kernel quality for your model shape. The consequence: the only defensible silicon comparison is a measured tokens-per-watt and tokens-per-dollar at your SLO, prefill:decode mix, and context distribution. Two parts with identical datasheets can differ by 2x in realized goodput once batching, quantization (FP8/FP4), and KV-cache behavior are accounted for. The benchmarking discipline and the cross-vendor MFU gaps are in Chapter 7.11; precision and quantization in Chapter 7.4.
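
A measured comparison of this kind reduces to two ratios per candidate. The sketch below shows the bookkeeping only; the two parts and all their numbers are invented for illustration, and the throughput is assumed to have been measured at your own SLO, prefill:decode mix, and context distribution.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    name: str
    tokens_per_s: float    # measured at the target SLO, not a datasheet value
    watts: float           # measured wall power under that load
    cost_per_hour: float   # amortized capex plus opex, in dollars

    @property
    def tokens_per_s_per_watt(self):
        return self.tokens_per_s / self.watts

    @property
    def tokens_per_dollar(self):
        return self.tokens_per_s * 3600 / self.cost_per_hour


# Hypothetical candidates (all numbers invented for illustration).
part_a = Measurement("part-A", tokens_per_s=10_000, watts=1_000, cost_per_hour=4.0)
part_b = Measurement("part-B", tokens_per_s=6_000, watts=500, cost_per_hour=3.0)

best_per_watt = max([part_a, part_b], key=lambda m: m.tokens_per_s_per_watt)
best_per_dollar = max([part_a, part_b], key=lambda m: m.tokens_per_dollar)

print(best_per_watt.name, best_per_dollar.name)
```

In this invented case the two metrics pick different winners, which is a reminder that a power-bound site and a capital-bound operator can rationally choose different silicon from the same measurements.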

Anti-patterns

The recurring inference mis-scopes all trace to one root cause: designing the inference fleet as a de-rated training cluster instead of from its own objective function. Four are worth naming:

Training fabric for an inference business. A 1:1 non-blocking back-end for a workload whose requests fit inside a node wastes ~31% of back-end cost on bisection bandwidth that never carries traffic. Oversubscribe 2:1-3:1 and spend the savings on geo-distribution and uptime.
Sizing the whole fleet to single-turn chat. Reasoning and agentic traffic multiply the per-request token budget and explode KV-cache pressure; a fleet sized on old token-per-request assumptions is undersized on memory and decode by the same factor. Size to the decode-heavy base case.
Centralizing inference for power economics. Consolidating an interactive fleet onto one cheap-power gigawatt campus breaches the latency SLO for distant users — the speed of light does not negotiate. Distribute toward demand and pay the energy premium knowingly.
Buying peak FLOPS instead of tokens-per-watt. Selecting inference silicon on datasheet FLOPS ignores that decode is memory-bound; the chip that wins on paper can lose 2x on realized goodput at your batch sizes and context lengths.

Inference sits inside the archetype framework of Chapter 1.1 and opposite the training treatment of Chapter 1.2; the hybrid middle (RL as inference-heavy training) is Chapter 1.4 and the latency-bound extreme is Chapter 1.5. The serving engineering this chapter defers — batching, chunked prefill, disaggregation tax, goodput-optimal scheduling — is owned by Chapter 10.11; the KV-cache memory hierarchy by Chapter 9.7. Latency-driven siting connects to Chapter 3.1, Chapter 3.6, and Chapter 3.13; fabric oversubscription to Chapter 8.5; silicon selection to Chapter 7.4 and Chapter 7.11; the reliability flip to Chapter 12.2; and the unit economics and deflation risk to Chapter 1.8.