Chapter 1.7

The Requirements-and-Consequences Matrix

Once you have named the workload archetype, the rest of the facility is no longer a menu of options — it is a forced sequence of subsystem commitments, and this chapter is the lookup table that turns one requirement into a defensible, signed design basis.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

The cooling modality each hall is plumbed for — air, rear-door, or direct-to-chip liquid — which the density target sets before steel is cut and which a retrofit cannot cheaply undo.
The back-end fabric blocking ratio and the GPU:CPU / GPU:memory / GPU:storage ratios, which coupling and read-bandwidth demand fix per archetype — over-provision and you strand capex, under-provision and you starve the accelerators.
The storage tier and its throughput floor (checkpoint write bandwidth, data-loader read bandwidth, KV-cache capacity), mapped to the archetype's tolerance for a stalled GPU.
The redundancy tier — N, N+1, or 2N/Tier-IV — mapped to whether the workload survives a node loss via checkpoint-and-resume or loses revenue on every outage.
Whether the site is scored power-first or latency-first, and the per-archetype reference design-basis sheet you freeze and sign before ordering long-lead equipment.

Chapter 1.1 established the master variable — the workload archetype — and walked the cascade qualitatively. This chapter is the engineering instrument that operationalizes it: a requirements-and-consequences matrix that takes one archetype and returns a concrete, numbered design basis for every subsystem that follows. Where 1.1 said "pre-training implies liquid cooling," this chapter says which inlet temperature, which flow rate, which floor-loading class, which blocking ratio, which storage throughput floor, and which redundancy tier — and names the downstream cost of each cell you fill in wrong.

The altitude here is lower than in 1.1. We move through four mappings in the order an engineer actually commits them — density to the cooling cliff, fabric and the GPU:CPU/memory/storage ratios, storage and redundancy against interruption tolerance, and siting as power-first versus latency-first — and close on the reference design-basis sheets that capture all four per archetype. Read this chapter with Chapter 1.1 open: this is the table its cascade was promising.

Mapping 1 — density to the cooling cliff

The first irreversible commitment is the cooling modality, and it is set entirely by one number: peak rack density. This is physics, not preference. Air saturates as a heat-removal medium at ~41 kW/rack under realistic containment; rear-door heat exchangers (RDHx) and air-assisted liquid push the ceiling to ~50–100 kW without facility water at the rack; past ~100 kW the only answer is direct-to-chip liquid (DLC). A GB200 NVL72 draws ~120–132 kW (at the top of that range, ~115 kW removed by liquid and ~17 kW by residual air), which lands it firmly past the cliff. The next generation does not soften this: Rubin VR200-class racks are ~190–230 kW, and Rubin Ultra Kyber is on a ~600 kW / 800 VDC path. The density target therefore does not influence the cooling plant; it determines it.

The consequence of mis-reading the cliff is a discontinuity, not a slope. A hall scoped for 40 kW air-cooled racks cannot absorb a 132 kW DLC rack by tuning airflow — there is no containment scheme, no warmer supply air, no economizer setting that closes a ~90 kW gap. You are over the cliff, and crossing it in a retrofit costs ~$5–10M/MW while still stranding capacity, because the slab cannot bear ~3,000 lb wet racks, the plenum was never sized for liquid distribution, and facility water was never provisioned. The map below is the lookup; the engineering lives in Chapter 5.1 (the density wall) through Chapter 5.4 (DLC).

Density-to-cooling-cliff map

Rack density band	Cooling modality	Facility water at rack?	Floor / structural basis	Typical PUE band	Archetypes that land here
Up to ~41 kW	Air (containment, CRAH/in-row)	No	Standard raised floor / slab	1.4–1.6 (legacy air)	Edge; modest-density batch inference
~41–100 kW	Rear-door HX / air-assisted liquid	No (door-level only)	Reinforced rows; brownfield-friendly	1.2–1.4	Online & batch inference; retrofit bridge
~100–200 kW	Direct-to-chip liquid (single-phase)	Yes — CDU + warm-water loop	Reinforced slab for ~3,000–5,000 lb wet racks	1.05–1.15	Pre-training; RL trainer; dense inference
~200–600 kW+	DLC + 800 VDC; busbar-integrated liquid	Yes — high-flow, tight delta-T	Purpose-built; pipe-rack & knockout headroom	1.05–1.10	Frontier pre-training (Rubin / Kyber class)

Air ceiling per ASHRAE TC 9.9 / SemiAnalysis; rack-density figures are 2026-current NVIDIA-class reference points (see keynumbers for sources and vintages). PUE bands are design, not annualized.
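The map above reduces to a threshold lookup. The following is a minimal Python sketch, assuming the band edges from the table (41, 100, 200 kW) and treating a density that sits exactly on an edge as belonging to the lower band. That tie-break and the name `cooling_modality` are illustrative assumptions, not part of the source table.

```python
from bisect import bisect_left

# Upper edges (kW/rack) of the density bands in the map above.
# The last band is open-ended ("600 kW+"), so values past it stay in the top band.
BAND_UPPER_KW = [41, 100, 200, 600]
MODALITIES = [
    "air",
    "rear-door HX / air-assisted liquid",
    "direct-to-chip liquid (single-phase)",
    "DLC + 800 VDC, busbar-integrated liquid",
]


def cooling_modality(peak_rack_kw: float) -> str:
    """Return the cooling modality band for a peak rack density in kW."""
    if peak_rack_kw <= 0:
        raise ValueError("rack density must be positive")
    # bisect_left places a value equal to an edge in the lower band
    # (assumed tie-break; the source table gives approximate edges only).
    idx = bisect_left(BAND_UPPER_KW, peak_rack_kw)
    return MODALITIES[min(idx, len(MODALITIES) - 1)]


# A hall scoped for 40 kW racks versus a 132 kW GB200 NVL72-class rack.
print(cooling_modality(40))
print(cooling_modality(132))
```

The lookup is deliberately a step function: there is no interpolation between bands, which mirrors the point that the cliff is a discontinuity rather than a slope.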

The DLC envelope is unforgiving — design to it, not around it

Choosing DLC is not the end of the decision; it commits you to a tight operating window. The GB200 NVL72 spec wants coolant inlet at ~20–25 °C, ~80 L/min of flow per rack, and a controlled delta-T across the cold plates, and deviation can throttle the GPUs by up to ~50%. That turns a sloppy facility-water loop or an undersized CDU into a direct, measurable loss of goodput: you bought the densest accelerators on the market and then clocked them down because the plant could not hold setpoint. The general flow rule of thumb is ~1.2–2.0 L/min per kW of liquid-cooled load, with the exact point in that band set by the design delta-T; size the CDU, the warm-water loop, and the heat-rejection plant to the worst-case branch at full load, not the nameplate average. → Chapter 5.6 (CDUs & the secondary loop), Chapter 5.12 (setpoint stability).
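Flow, heat load, and delta-T are tied together by the heat balance Q = ρ · V · cp · ΔT, so any quoted flow figure implies a delta-T. The sketch below assumes plain water properties (density ~0.997 kg/L, cp ~4186 J/kg·K); real loops often run glycol mixes with a lower heat capacity, so treat the constants and function names as illustrative assumptions.

```python
WATER_DENSITY_KG_PER_L = 0.997  # assumption: plain water near 25 °C
WATER_CP_J_PER_KG_K = 4186.0    # assumption: plain water; glycol mixes are lower


def coolant_delta_t(heat_kw: float, flow_l_per_min: float) -> float:
    """Temperature rise (K) of the coolant carrying heat_kw at the given flow."""
    if heat_kw < 0 or flow_l_per_min <= 0:
        raise ValueError("heat must be >= 0 and flow must be > 0")
    mass_flow_kg_s = flow_l_per_min / 60.0 * WATER_DENSITY_KG_PER_L
    return heat_kw * 1000.0 / (mass_flow_kg_s * WATER_CP_J_PER_KG_K)


def required_flow_l_per_min(heat_kw: float, delta_t_k: float) -> float:
    """Flow (L/min) needed to carry heat_kw at a target delta-T."""
    if heat_kw < 0 or delta_t_k <= 0:
        raise ValueError("heat must be >= 0 and delta-T must be > 0")
    mass_flow_kg_s = heat_kw * 1000.0 / (WATER_CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_PER_L * 60.0


# ~115 kW of liquid-removed heat at ~80 L/min per rack (figures from the text).
print(round(coolant_delta_t(115, 80), 1), "K rise at 80 L/min")
# Delta-T implied by the 1.2 and 2.0 L/min-per-kW ends of the rule of thumb.
print(round(coolant_delta_t(1, 1.2), 1), "K at 1.2 L/min per kW")
print(round(coolant_delta_t(1, 2.0), 1), "K at 2.0 L/min per kW")
```

Under these water-only assumptions the per-rack figure and the per-kW rule of thumb imply different delta-T values, so it is worth confirming which delta-T and which coolant a quoted flow number assumes before sizing a CDU against it.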

Mapping 2 — fabric sizing and the system ratios

The second mapping is the network, and it has two parts: the blocking ratio of the back-end (scale-out) fabric, set by coupling, and the system composition ratios — GPU:CPU, GPU:memory, GPU:storage — set by the archetype's host-side and data-path demands. Both are decisions where the wrong answer wastes money in opposite directions: over-build and you pay for bandwidth that never carries traffic; under-build and you starve the accelerators you spent the most on.

Coupling sets the blocking ratio. A synchronous pre-training job spends a large fraction of every step in collectives (all-reduce, all-gather, reduce-scatter), so the back-end must be 1:1 non-blocking, typically an 8-rail-optimized fat-tree. Oversubscribe it and the all-reduce stalls, dragging model FLOPs utilization down across the whole job. Loosely coupled inference fits inside a node or a small scale-up domain, so 2:1–3:1 oversubscription is fine; 2:1 is reported to cut back-end cost by ~31% (a contested, single-source figure), and Meta has run 7:1 on a 24k-H100 cluster. Sizing a non-blocking fabric for an inference business is the cleanest example of a self-inflicted anti-pattern: bisection bandwidth the requests never use. → Chapter 8.5 (topology & oversubscription), Chapter 8.4 (protocols).
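The blocking ratio at a leaf switch is the ratio of GPU-facing (downlink) bandwidth to spine-facing (uplink) bandwidth. A minimal sketch of that arithmetic follows; the 400 Gb/s port speed, the port counts, and the function names are hypothetical values chosen for illustration.

```python
import math
from fractions import Fraction


def oversubscription(downlinks: int, uplinks: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> Fraction:
    """Leaf oversubscription as downlink bandwidth over uplink bandwidth."""
    if downlinks <= 0 or uplinks <= 0:
        raise ValueError("port counts must be positive")
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def uplinks_needed(downlinks: int, target_ratio,
                   down_gbps: int = 400, up_gbps: int = 400) -> int:
    """Fewest uplinks that keep oversubscription at or below target_ratio."""
    demand = Fraction(downlinks * down_gbps, up_gbps)
    return math.ceil(demand / Fraction(target_ratio))


# Hypothetical 64-port leaf: split evenly for training, skewed for inference.
print(oversubscription(32, 32))   # 1 means non-blocking (1:1)
print(oversubscription(48, 16))   # 3 means 3:1 oversubscribed
print(uplinks_needed(48, 2))      # uplinks required to hold 2:1
```

Every uplink removed also removes a spine port and an optic, which is where the cost difference between a non-blocking and an oversubscribed fabric comes from.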

The composition ratios are archetype-specific and shifting. Training historically ran ~8 GPU:1 CPU; agentic inference — with host-side sandbox execution, retrieval, tool calls, and RL rollouts — is pulling that toward ~4–8 GPU:1 CPU and lower, which changes the host BOM and the node power budget. GPU:memory is set by per-GPU HBM (H100 80 GB → B200 192 GB → B300 288 GB → Rubin Ultra ~1 TB) plus host RAM, and inference is increasingly KV-cache-bound rather than weight-bound. GPU:storage is a bandwidth ratio, not a capacity one: it is fixed by checkpoint write speed for training and data-loader read speed for both — which is exactly where Mapping 3 begins.

Fabric and system-ratio sizing by archetype

Archetype	Back-end blocking ratio	GPU:CPU (host)	Dominant memory pressure	Storage demand profile
Pre-training	1:1 non-blocking, 8-rail fat-tree	~8:1 (compute-dense host)	HBM for activations; host RAM for staging	Burst checkpoint writes; high sustained read for data loader
Post-training / RL	Disaggregated: tight trainer, tolerant rollout pool	Mixed — more CPU on rollout side	KV-cache on rollouts; HBM on trainer	Rollout reads + trainer checkpoints; staleness-tolerant
Online inference	2:1–3:1 oversubscribed	~4–8:1, falling (agentic host work)	KV-cache capacity & bandwidth	Model-weight load; KV-cache tiering to NVMe/CXL
Batch inference	Heavily oversubscribed; cost-optimized	Flexible	Throughput over latency; large batches	Throughput reads; no low-latency requirement
Edge inference	Minimal (single node / WAN backhaul)	Constrained by appliance	Single-model resident; small KV	Local model store; periodic sync

Blocking ratios and GPU:CPU norms per SemiAnalysis AI Neocloud Playbook and TrendForce; storage bandwidth bands are practitioner design floors. Figures are 2026-current reference points, not vendor minimums.

Key numbers

Figure	What it describes	Vintage	Source
~41 kW	Practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 100–200 kW+	2025	ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy
120–132 kW	Per GB200 NVL72 rack (~115 kW liquid + ~17 kW air); GB300 ~142 kW; Rubin Ultra Kyber ~600 kW	2026	NVIDIA OCP / SemiAnalysis roadmap
20–25 °C / ~80 L/min	GB200 NVL72 DLC inlet & flow; deviation can throttle GPUs up to ~50%	2025	NVIDIA OCP / Introl
1:1 vs 2:1–3:1	Training non-blocking vs inference oversubscribed; 2:1 cuts back-end cost ~31% (contested, single-source); Meta ran 7:1 on 24k H100	2025	SemiAnalysis AI Neocloud Playbook / Meta
~8:1 → 4–8:1	GPU:CPU ratio shifting from training-era norm toward agentic-inference host demand	2026	TrendForce Insights; Introl
~$5–10M/MW	Full AI liquid retrofit cost crossing the cooling cliff; still strands capacity	2026	Introl / Vera Rubin deployment analysis
Tier III 99.982% / Tier IV 99.995%	~1.6 hr/yr vs ~26 min/yr downtime; Tier IV ~20–40% capital premium	2025	Uptime Institute
~90% / ~96%	Goodput (effective training time): industry avg vs best-in-class; reliability overhead 6–21% of TCO	2025	SemiAnalysis ClusterMAX / CoreWeave

Mapping 3 — storage and redundancy against interruption tolerance

Storage and redundancy are two consequences of the same input — the archetype's tolerance for an interrupted GPU — and they are most defensible when designed together. The question storage answers is: when does a GPU stall waiting on data, and what does that stall cost? The question redundancy answers is: when a node or a power feed fails, does the workload restart cheaply or lose money?

Storage is sized by the throughput that keeps GPUs fed, not by capacity alone. For training, the two binding flows are checkpoint write bandwidth — because a synchronous job pauses all GPUs to write a checkpoint, and slow writes are pure goodput loss — and data-loader read bandwidth, because a starved loader idles the whole pipeline. A high-bandwidth parallel file system feeding GPUDirect Storage (CPU-bypass) is the training default; this is the link that turns a storage decision into a GPU-efficiency decision (Chapter 9.1, Chapter 9.3, Chapter 9.4). For online inference, the new pressure is the KV-cache: reasoning models emit long decode sequences, inflating per-request cache, so the hierarchy now tiers KV state across HBM, host memory, and NVMe/CXL (Chapter 9.7). Batch and edge are the relaxed cases — throughput reads with no low-latency floor.

Redundancy is set by interruption tolerance, and over-building it is a recognizable waste. A synchronous training job already restarts from a checkpoint when any node fails — at best-in-class operators, MTBF is ~7 days per 512 GPUs, and Meta's Llama 3 405B run logged ~one interruption every three hours on 16,384 H100s — so the rational posture is N or N+1 plus disciplined checkpointing, not 2N. Spending on Tier-IV facility power to prevent a restart the job already tolerates buys nines the workload does not value; that capital returns more as goodput — faster checkpoint storage, hot spares, more GPUs. An always-on inference business inverts this: an outage is lost revenue and a breached SLA, so 2N / Tier-IV-class power with N+1 cooling on standby is justified. → Chapter 12.1 (redundancy topologies), Chapter 12.2 (goodput vs availability), Chapter 12.4 (goodput SLAs).

Storage and redundancy mapped to interruption tolerance

Archetype	Interruption tolerance	Binding storage flow	Storage tier	Redundancy posture
Pre-training	High — checkpoint-and-resume	Checkpoint write + loader read bandwidth	Parallel FS + NVMe; GPUDirect Storage	N or N+1; spend on checkpointing, not 2N
Post-training / RL	High — staleness-tolerant, restartable	Rollout reads + trainer checkpoints	Tiered: fast trainer FS + rollout object store	N+1; disaggregated fault domains
Online inference	Low — outage = lost revenue + SLA breach	Weight load + KV-cache bandwidth	KV tiered HBM→host→NVMe/CXL	2N / Tier-IV + N+1 cooling on standby
Batch inference	High — queue-and-retry	Throughput reads	Object store / capacity tier	N — interruption-tolerant
Edge inference	Site-level — fleet geo-redundancy	Local model store + periodic sync	Local NVMe; minimal	Often N; resilience via fleet-of-sites

Redundancy tiers and storage profiles are design heuristics, not rules. Goodput/MTBF figures from SemiAnalysis and the Meta Llama 3 disclosure; see keynumbers and the reliability provenance entries.
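The Tier III and Tier IV availability figures quoted in this chapter convert to annual downtime with one line of arithmetic. The sketch below assumes a non-leap 8,760-hour year; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected downtime per year, in hours, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


tier_iii = annual_downtime_hours(99.982)
tier_iv = annual_downtime_hours(99.995)
print(f"Tier III: {tier_iii:.2f} h/yr")
print(f"Tier IV:  {tier_iv * 60:.0f} min/yr")
```

The gap between the two tiers is a little over an hour per year, which is the quantity to weigh against the Tier IV capital premium for a workload that already restarts from a checkpoint.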

Deep dive: why checkpoint bandwidth is a redundancy decision in disguise

It is tempting to file checkpoint storage under "storage" and redundancy under "electrical," and to size them in separate workstreams. For training, that separation hides the real trade. A synchronous job's resilience strategy is checkpoint-and-resume: every node failure is absorbed by reloading the last checkpoint and replaying. The cost of that strategy is two-fold — the goodput lost while all GPUs pause to write each checkpoint, and the work re-done since the last one. Both shrink as checkpoint write bandwidth rises: faster writes mean you can checkpoint more often (less re-done work) at lower per-checkpoint cost (less pause).

So the questions "how much should we spend on facility redundancy?" and "how fast must checkpoint storage be?" are the same question asked twice. At a top operator's ~7-day MTBF per 512 GPUs and Meta's observed ~one interruption per three hours at 16k-GPU scale, the dominant resilience lever is not 2N power; it is the storage and checkpointing path that makes each interruption cheap. Capital budgeted for Tier-IV redundancy on a checkpointable cluster almost always returns more as checkpoint bandwidth, hot spares, and autonomous fault recovery. The inference case flips: there is no checkpoint to resume to, so the spend belongs in 2N power and N+1 cooling. Interruption tolerance, read once, sets both columns. → Chapter 9.4, Chapter 12.2.
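The pause-versus-rework trade can be put in numbers with a first-order model. The sketch below is not from the source: it uses Young's classic approximation for the checkpoint interval, sqrt(2 · C · M), where C is the checkpoint write time and M the cluster MTBF, and it assumes failures are independent so that MTBF scales inversely with GPU count. Restart time is ignored, and the 300 s and 30 s write times are hypothetical.

```python
import math


def cluster_mtbf_hours(ref_mtbf_hours: float, ref_gpus: int, gpus: int) -> float:
    """Scale a reference MTBF to a larger cluster (assumes independent failures)."""
    return ref_mtbf_hours * ref_gpus / gpus


def young_interval_hours(write_hours: float, mtbf_hours: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2 * write_hours * mtbf_hours)


def overhead_fraction(write_hours: float, interval_hours: float,
                      mtbf_hours: float) -> float:
    """Lost time fraction: checkpoint pauses plus expected re-done work."""
    pause = write_hours / interval_hours
    rework = interval_hours / (2 * mtbf_hours)  # half an interval lost per failure
    return pause + rework


# ~7-day MTBF per 512 GPUs (from the text), scaled to 16,384 GPUs: 5.25 hours.
mtbf = cluster_mtbf_hours(7 * 24, 512, 16384)

for write_seconds in (300, 30):  # hypothetical slow and fast checkpoint writes
    c = write_seconds / 3600
    tau = young_interval_hours(c, mtbf)
    print(write_seconds, "s write:",
          f"interval {tau * 60:.0f} min,",
          f"overhead {overhead_fraction(c, tau, mtbf):.1%}")
```

In this model the overhead at the chosen interval is sqrt(2C/M), so a tenfold faster checkpoint write cuts lost time by roughly a factor of sqrt(10). The approximation holds only while the interval is short relative to the MTBF.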

Mapping 4 — siting: power-first vs latency-first

The fourth mapping is the least reversible of all — you cannot move a slab — which is why it must be derived from the workload, never chosen first and rationalized after. Latency sensitivity is the discriminator. Pre-training and batch inference are indifferent to user proximity, so they are scored power-first: chase the cheapest firm (or curtailable) megawatts and the coldest free-cooling climate, accept that the site may be hours from any metro, and treat the grid-interconnection queue slot as the scarcest asset in the project. Online and edge inference are scored latency-first: chase sub-50 ms reach to users and accept power that can cost 2–4x more, distributing capacity for proximity rather than concentrating it for cost.

The 2026 context sharpens this fork. The binding constraint is power, not chips — US large-load interconnection waits run ~3–7+ years in the densest hubs — so a power-first archetype that mis-sites near expensive, constrained metro power burns both money and a queue slot it cannot recover. A latency-first archetype sited in a cheap-power exurb, conversely, may meet its energy budget and miss its SLO, which is the more expensive miss because it loses the revenue the building exists to earn. Water availability is a hard siting gate for any liquid-cooled hall regardless of archetype (Chapter 3.7). The reordered hierarchy and the speed-to-power race are engineered in Chapter 3.1 and Chapter 3.2; the fiber/latency screen in Chapter 3.6.

Power-first and latency-first are different buildings, not a dial

It is tempting to treat siting as a dial you can set anywhere between "cheap power" and "close to users." For a single facility it is closer to a switch. A power-first site optimizes for the lowest $/MWh firm supply and the largest interconnection it can secure, and tolerates distance; a latency-first site optimizes for the network distance to a user population and tolerates expensive power. The two rarely co-locate — cheap stranded power is, almost by definition, far from dense demand. The strategic move when a portfolio needs both is not to compromise one site, but to split the archetypes across sites: a power-first training/batch campus feeding a constellation of latency-first inference points-of-presence. That is the same build-core-rent-edge logic the procurement chapter reaches from the other direction. → Chapter 1.5 (edge), Chapter 1.6 (procurement).

The reference design-basis sheet, per archetype

The four mappings converge into a single artifact: a reference design-basis sheet per archetype that freezes the inherited assumptions before any long-lead equipment is ordered. This is the deliverable 1.1 promised under "design-basis document," now filled in. Each sheet pins one row per subsystem — density tier, cooling modality, fabric blocking ratio, system ratios, storage tier and throughput floor, redundancy topology, voltage class, and siting class — plus a reversible-vs-irreversible register recording which assumptions are hedged and which are committed. The table below is the skeleton; a real sheet attaches the numbers (the ramp curve, the MVA sizing, the CDU capacity) and the signatures.

Reference design-basis sheets (skeleton) by archetype

Subsystem	Pre-training	Online inference	Batch inference	Edge inference
Density tier	100–600 kW (DLC)	30–100 kW (air/RDHx/DLC)	30–60 kW (flexible)	A few kW to ~30 kW
Cooling modality	DLC, warm-water loop	Air→RDHx→DLC by density	Host hall's existing (air often fine)	Air / sealed modular
Fabric blocking	1:1 non-blocking	2:1–3:1 oversubscribed	Heavily oversubscribed	Minimal / WAN backhaul
System ratios (GPU:CPU)	~8:1 (compute-dense host)	~4–8:1, falling (agentic host work)	Flexible	Constrained by appliance
Storage tier	Parallel FS + GPUDirect	KV-tiered HBM→NVMe/CXL	Object / capacity tier	Local NVMe
Redundancy	N / N+1 (checkpoint)	2N / Tier-IV + N+1 cooling	N (queue-and-retry)	N + fleet geo-redundancy
Voltage class	415/480 VAC → 800 VDC path	415/480 VAC	415/480 VAC	Local LV / appliance
Siting class	Power-first (cheap, cold, big queue)	Latency-first (sub-50 ms)	Cheapest / curtailable power	Proximity over cost

A starting template, not a prescription — every figure is sized to the specific ramp curve and generation. Voltage class follows density (800 VDC paths emerge above ~200 kW/rack). Cross-check each cell against the mapping tables above.

Sign the irreversible rows; pencil the reversible ones

Not every row on the design-basis sheet costs the same to change later, and the sheet should say so explicitly. The irreversible rows — siting class, voltage path, floor-loading basis, and the cooling-modality commitment (a hall plumbed for liquid or not) — must be signed and over-built for the ramp, because retrofitting them mid-life is the $5–10M/MW one-way door. The reversible rows — the specific accelerator generation within a power/cooling envelope, the oversubscription ratio on a fabric you sized non-blocking, the exact GPU:CPU ratio, the scheduler — can be penciled and re-decided cheaply. The discipline that separates a robust scope from a fragile one is reserving the headroom you cannot retrofit (floor, water, electrical, pipe-rack) while keeping the IT fit-out matched to the current generation. → reversibility framing in Chapter 1.1; economics that score the sheet in Chapter 1.8.
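One way to make the sign-versus-pencil distinction enforceable is to carry it as a field on each row of the sheet. The sketch below is a hypothetical data structure, not a format the chapter prescribes: the class name, the field names, and the signatory label are all illustrative, and the row values are taken from the pre-training column of the skeleton.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SheetRow:
    subsystem: str
    value: str
    irreversible: bool                # True = one-way door, must be signed
    signed_by: Optional[str] = None   # None = still in pencil


def blocking_rows(sheet: List[SheetRow]) -> List[str]:
    """Irreversible rows that are still unsigned; these block long-lead orders."""
    return [r.subsystem for r in sheet if r.irreversible and r.signed_by is None]


# Hypothetical pre-training sheet, partway through sign-off.
pretraining_sheet = [
    SheetRow("Siting class", "Power-first", True, signed_by="owner-rep"),
    SheetRow("Cooling modality", "DLC, warm-water loop", True, signed_by="owner-rep"),
    SheetRow("Voltage class", "415/480 VAC -> 800 VDC path", True),
    SheetRow("Fabric blocking", "1:1 non-blocking", False),
    SheetRow("GPU:CPU ratio", "~8:1", False),
]

print(blocking_rows(pretraining_sheet))
```

Reversible rows never appear in the blocking list, which matches the intent that they can be re-decided without reopening the sheet.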

Deep dive: reading the matrix backwards to audit an existing facility

The matrix is written forward — archetype in, design basis out — but its most useful diagnostic mode is backward. Given a facility that already exists (a hall you are evaluating to lease, retrofit, or acquire), read its observable subsystems back up the cascade and infer the archetype it was actually built for, then compare that to the workload you intend to run.

A hall with 40 kW air-cooled racks, an oversubscribed Ethernet fabric, 2N power, and a metro location is an inference building — try to run synchronous pre-training in it and you will hit the cooling cliff, starve the all-reduce, and pay for redundancy the training job does not value. A campus with 132 kW DLC racks, a non-blocking InfiniBand fabric, N+1 power, and a remote cheap-power site is a training building — run latency-sensitive inference from it and you will miss every proximity SLO while paying for bisection bandwidth the requests never use. The mismatches the backward read exposes are exactly the three anti-patterns 1.1 named: training fabric for an inference business, retrofitting past the air-cooling cliff, and over-provisioned redundancy for checkpointable jobs. The matrix is therefore both a scoping tool and a due-diligence checklist — the same table, run in two directions. → Chapter 5.10 (retrofit limits), Chapter 1.6 (procurement diligence).
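The backward read can be sketched as a match count against the reference sheets. The encoding below is a deliberately coarse assumption: each archetype is reduced to four labels, online inference is pinned to air cooling even though the skeleton lets it range up to DLC, and the names `REFERENCE`, `infer_archetype`, and `mismatches` are hypothetical.

```python
# Coarse, illustrative encoding of three reference sheets (simplified labels).
REFERENCE = {
    "pre-training": {
        "cooling": "dlc", "fabric": "non-blocking",
        "power": "n-class", "siting": "power-first",
    },
    "online inference": {
        "cooling": "air", "fabric": "oversubscribed",
        "power": "2n", "siting": "latency-first",
    },
    "batch inference": {
        "cooling": "air", "fabric": "oversubscribed",
        "power": "n-class", "siting": "power-first",
    },
}


def infer_archetype(observed: dict) -> str:
    """Archetype whose reference sheet matches the most observed subsystems."""
    def score(name: str) -> int:
        return sum(observed.get(k) == v for k, v in REFERENCE[name].items())
    return max(REFERENCE, key=score)


def mismatches(observed: dict, intended: str) -> list:
    """Subsystems where the facility differs from the intended archetype."""
    return sorted(k for k, v in REFERENCE[intended].items() if observed.get(k) != v)


# The metro hall described above: air, oversubscribed, 2N, close to users.
metro_hall = {"cooling": "air", "fabric": "oversubscribed",
              "power": "2n", "siting": "latency-first"}

print(infer_archetype(metro_hall))
print(mismatches(metro_hall, "pre-training"))
```

A real diligence pass would weight the rows by reversibility rather than count them equally, since a siting or cooling mismatch costs far more to fix than a fabric one.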

This chapter is the lookup table for the cascade introduced in Chapter 1.1 and deepened per archetype in Chapter 1.2 (training), Chapter 1.3 (inference), Chapter 1.4 (post-training/RL), and Chapter 1.5 (edge); the procurement fork that pairs with siting is in Chapter 1.6, and the economics that score every design-basis sheet live in Chapter 1.8. The cooling cliff is engineered in Chapter 5.1 through Chapter 5.4, with CDUs in Chapter 5.6 and retrofit paths in Chapter 5.10; the fabric blocking decision in Chapter 8.4 and Chapter 8.5; the storage flows in Chapter 9.1, Chapter 9.3, Chapter 9.4, and Chapter 9.7; the redundancy rethink in Chapter 12.1, Chapter 12.2, and Chapter 12.4; and the siting hierarchy in Chapter 3.1, Chapter 3.2, Chapter 3.6, and Chapter 3.7.