Chapter 1.7
The Requirements-and-Consequences Matrix
Once you have named the workload archetype, the rest of the facility is no longer a menu of options — it is a forced sequence of subsystem commitments, and this chapter is the lookup table that turns one requirement into a defensible, signed design basis.
What you'll decide here
- The cooling modality each hall is plumbed for — air, rear-door, or direct-to-chip liquid — which the density target sets before steel is cut and which a retrofit cannot cheaply undo.
- The back-end fabric blocking ratio and the GPU:CPU / GPU:memory / GPU:storage ratios, which coupling and read-bandwidth demand fix per archetype — over-provision and you strand capex, under-provision and you starve the accelerators.
- The storage tier and its throughput floor (checkpoint write bandwidth, data-loader read bandwidth, KV-cache capacity), mapped to the archetype's tolerance for a stalled GPU.
- The redundancy tier — N, N+1, or 2N/Tier-IV — mapped to whether the workload survives a node loss via checkpoint-and-resume or loses revenue on every outage.
- Whether the site is scored power-first or latency-first, and the per-archetype reference design-basis sheet you freeze and sign before ordering long-lead equipment.
Chapter 1.1 established the master variable — the workload archetype — and walked the cascade qualitatively. This chapter is the engineering instrument that operationalizes it: a requirements-and-consequences matrix that takes one archetype and returns a concrete, numbered design basis for every subsystem that follows. Where 1.1 said "pre-training implies liquid cooling," this chapter says which inlet temperature, which flow rate, which floor-loading class, which blocking ratio, which storage throughput floor, and which redundancy tier — and names the downstream cost of each cell you fill in wrong.
The altitude here is lower than in 1.1. We move through four mappings in the order an engineer actually commits them — density to the cooling cliff, fabric and the GPU:CPU/memory/storage ratios, storage and redundancy against interruption tolerance, and siting as power-first versus latency-first — and close on the reference design-basis sheets that capture all four per archetype. Read this chapter with Chapter 1.1 open: this is the table its cascade was promising.
Mapping 1 — density to the cooling cliff
The first irreversible commitment is the cooling modality, and it is set entirely by one number: peak rack density. This is physics, not preference. Air saturates as a heat-removal medium at ~41 kW/rack under realistic containment; rear-door heat exchangers (RDHx) and air-assisted liquid push the ceiling to ~50–100 kW without facility water at the rack; past ~100 kW the only answer is direct-to-chip liquid (DLC). A GB200 NVL72 draws ~120–132 kW (at the top of that range, ~115 kW removed by liquid and ~17 kW by residual air), which lands it firmly past the cliff. The next generation does not soften this: Rubin VR200-class racks are ~190–230 kW, and Rubin Ultra Kyber is on a ~600 kW / 800 VDC path. The density target therefore does not influence the cooling plant; it determines it.
The consequence of mis-reading the cliff is a discontinuity, not a slope. A hall scoped for 40 kW air-cooled racks cannot absorb a 132 kW DLC rack by tuning airflow — there is no containment scheme, no warmer supply air, no economizer setting that closes a ~90 kW gap. You are over the cliff, and crossing it in a retrofit costs ~$5–10M/MW while still stranding capacity, because the slab cannot bear ~3,000 lb wet racks, the plenum was never sized for liquid distribution, and facility water was never provisioned. The map below is the lookup; the engineering lives in Chapter 5.1 (the density wall) through Chapter 5.4 (DLC).
| Rack density band | Cooling modality | Facility water at rack? | Floor / structural basis | Typical PUE band | Archetypes that land here |
|---|---|---|---|---|---|
| Up to ~41 kW | Air (containment, CRAH/in-row) | No | Standard raised floor / slab | 1.4–1.6 (legacy air) | Edge; modest-density batch inference |
| ~41–100 kW | Rear-door HX / air-assisted liquid | No (door-level only) | Reinforced rows; brownfield-friendly | 1.2–1.4 | Online & batch inference; retrofit bridge |
| ~100–200 kW | Direct-to-chip liquid (single-phase) | Yes — CDU + warm-water loop | Reinforced slab for ~3,000–5,000 lb wet racks | 1.05–1.15 | Pre-training; RL trainer; dense inference |
| ~200–600 kW+ | DLC + 800 VDC; busbar-integrated liquid | Yes — high-flow, tight delta-T | Purpose-built; pipe-rack & knockout headroom | 1.05–1.10 | Frontier pre-training (Rubin / Kyber class) |
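
The density map above can be read as a threshold lookup. The sketch below is a minimal illustration, not a sizing tool: the band edges are copied from the table and are approximate, the names `CoolingBand`, `cooling_for`, and `retrofit_gap_kw` are hypothetical, and treating a rack at exactly 100 kW as still inside the rear-door band is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoolingBand:
    max_kw: float         # upper edge of the band, in kW per rack
    modality: str
    facility_water: bool  # is facility water required at the rack?


# Band edges taken from the density table; all of them are approximate.
BANDS = (
    CoolingBand(41.0, "air (containment, CRAH/in-row)", False),
    CoolingBand(100.0, "rear-door HX / air-assisted liquid", False),
    CoolingBand(200.0, "direct-to-chip liquid (single-phase)", True),
    CoolingBand(float("inf"), "DLC + 800 VDC, busbar-integrated liquid", True),
)


def cooling_for(peak_rack_kw: float) -> CoolingBand:
    """Return the first band whose upper edge covers the peak rack density."""
    if peak_rack_kw <= 0:
        raise ValueError("peak rack density must be positive")
    for band in BANDS:
        if peak_rack_kw <= band.max_kw:
            return band
    raise AssertionError("unreachable: the last band is unbounded")


def retrofit_gap_kw(hall_design_kw: float, target_rack_kw: float) -> float:
    """Per-rack heat load the existing hall was never designed to remove."""
    return max(0.0, target_rack_kw - hall_design_kw)


if __name__ == "__main__":
    band = cooling_for(132)
    print(band.modality, band.facility_water)
    print(retrofit_gap_kw(40, 132), "kW gap per rack")
```

The lookup is deliberately a step function: there is no interpolation between bands, which mirrors the point that crossing a band edge changes the plant rather than tuning it.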
Mapping 2 — fabric sizing and the system ratios
The second mapping is the network, and it has two parts: the blocking ratio of the back-end (scale-out) fabric, set by coupling, and the system composition ratios — GPU:CPU, GPU:memory, GPU:storage — set by the archetype's host-side and data-path demands. Both are decisions where the wrong answer wastes money in opposite directions: over-build and you pay for bandwidth that never carries traffic; under-build and you starve the accelerators you spent the most on.
Coupling sets the blocking ratio. A synchronous pre-training job spends a large fraction of every step in collectives (all-reduce, all-gather, reduce-scatter), so the back-end must be 1:1 non-blocking, typically an 8-rail-optimized fat-tree. Oversubscribe it and the all-reduce stalls, dragging model FLOPs utilization down across the whole job. Loosely coupled inference fits inside a node or a small scale-up domain, so 2:1–3:1 oversubscription is fine and cuts back-end cost ~31% (Meta has run 7:1 on a 24k-H100 cluster). Sizing a non-blocking fabric for an inference business is the cleanest example of a self-inflicted anti-pattern: bisection bandwidth the requests never use. → Chapter 8.5 (topology & oversubscription), Chapter 8.4 (protocols).
The composition ratios are archetype-specific and shifting. Training historically ran ~8 GPU:1 CPU; agentic inference (with host-side sandbox execution, retrieval, tool calls, and RL rollouts) is pulling that toward ~4–8 GPU:1 CPU and lower, which changes the host BOM and the node power budget. GPU:memory is set by per-GPU HBM (H100 80 GB → B200 192 GB → B300 288 GB → Rubin Ultra ~1 TB) plus host RAM, and inference is increasingly KV-cache-bound rather than weight-bound. GPU:storage is a bandwidth ratio, not a capacity one: it is fixed by checkpoint write speed for training and by read speed for both training (the data loader) and inference (model-weight and KV-cache loads), which is exactly where Mapping 3 begins.
| Archetype | Back-end blocking ratio | GPU:CPU (host) | Dominant memory pressure | Storage demand profile |
|---|---|---|---|---|
| Pre-training | 1:1 non-blocking, 8-rail fat-tree | ~8:1 (compute-dense host) | HBM for activations; host RAM for staging | Burst checkpoint writes; high sustained read for data loader |
| Post-training / RL | Disaggregated: tight trainer, tolerant rollout pool | Mixed — more CPU on rollout side | KV-cache on rollouts; HBM on trainer | Rollout reads + trainer checkpoints; staleness-tolerant |
| Online inference | 2:1–3:1 oversubscribed | ~4–8:1, falling (agentic host work) | KV-cache capacity & bandwidth | Model-weight load; KV-cache tiering to NVMe/CXL |
| Batch inference | Heavily oversubscribed; cost-optimized | Flexible | Throughput over latency; large batches | Throughput reads; no low-latency requirement |
| Edge inference | Minimal (single node / WAN backhaul) | Constrained by appliance | Single-model resident; small KV | Local model store; periodic sync |
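
To make the blocking ratios in the table concrete, the sketch below computes a leaf switch's oversubscription as total downlink capacity divided by total uplink capacity, the usual definition. The port counts and the 400 Gb/s default are hypothetical examples, and `MAX_RATIO` encodes only the two archetypes for which the table gives a number.

```python
from fractions import Fraction


def oversubscription(down_ports: int, up_ports: int,
                     down_gbps: int = 400, up_gbps: int = 400) -> Fraction:
    """Downlink capacity divided by uplink capacity; 1 means non-blocking."""
    if min(down_ports, up_ports, down_gbps, up_gbps) <= 0:
        raise ValueError("port counts and speeds must be positive")
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


# Ceilings taken from the table; archetypes without a numeric ratio are omitted.
MAX_RATIO = {
    "pre-training": Fraction(1),      # 1:1 non-blocking
    "online inference": Fraction(3),  # 2:1 to 3:1 oversubscribed
}


def fits(archetype: str, ratio: Fraction) -> bool:
    """True if the leaf's ratio is within the archetype's ceiling."""
    return ratio <= MAX_RATIO[archetype]


if __name__ == "__main__":
    # Hypothetical 64-port leaf, split two ways.
    balanced = oversubscription(32, 32)  # 1:1
    skewed = oversubscription(48, 16)    # 3:1
    print(balanced, fits("pre-training", balanced))
    print(skewed, fits("pre-training", skewed), fits("online inference", skewed))
```

The check only flags under-building. A ratio of 1 passes the inference ceiling too, even though the text treats a non-blocking fabric for inference as wasted spend, so a fuller check would flag that case separately.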
Mapping 3 — storage and redundancy against interruption tolerance
Storage and redundancy are two consequences of the same input — the archetype's tolerance for an interrupted GPU — and they are most defensible when designed together. The question storage answers is: when does a GPU stall waiting on data, and what does that stall cost? The question redundancy answers is: when a node or a power feed fails, does the workload restart cheaply or lose money?
Storage is sized by the throughput that keeps GPUs fed, not by capacity alone. For training, the two binding flows are checkpoint write bandwidth — because a synchronous job pauses all GPUs to write a checkpoint, and slow writes are pure goodput loss — and data-loader read bandwidth, because a starved loader idles the whole pipeline. A high-bandwidth parallel file system feeding GPUDirect Storage (CPU-bypass) is the training default; this is the link that turns a storage decision into a GPU-efficiency decision (Chapter 9.1, Chapter 9.3, Chapter 9.4). For online inference, the new pressure is the KV-cache: reasoning models emit long decode sequences, inflating per-request cache, so the hierarchy now tiers KV state across HBM, host memory, and NVMe/CXL (Chapter 9.7). Batch and edge are the relaxed cases — throughput reads with no low-latency floor.
Redundancy is set by interruption tolerance, and over-building it is a recognizable waste. A synchronous training job already restarts from a checkpoint when any node fails — at best-in-class operators, MTBF is ~7 days per 512 GPUs, and Meta's Llama 3 405B run logged ~one interruption every three hours on 16,384 H100s — so the rational posture is N or N+1 plus disciplined checkpointing, not 2N. Spending on Tier-IV facility power to prevent a restart the job already tolerates buys nines the workload does not value; that capital returns more as goodput — faster checkpoint storage, hot spares, more GPUs. An always-on inference business inverts this: an outage is lost revenue and a breached SLA, so 2N / Tier-IV-class power with N+1 cooling on standby is justified. → Chapter 12.1 (redundancy topologies), Chapter 12.2 (goodput vs availability), Chapter 12.4 (goodput SLAs).
| Archetype | Interruption tolerance | Binding storage flow | Storage tier | Redundancy posture |
|---|---|---|---|---|
| Pre-training | High — checkpoint-and-resume | Checkpoint write + loader read bandwidth | Parallel FS + NVMe; GPUDirect Storage | N or N+1; spend on checkpointing, not 2N |
| Post-training / RL | High — staleness-tolerant, restartable | Rollout reads + trainer checkpoints | Tiered: fast trainer FS + rollout object store | N+1; disaggregated fault domains |
| Online inference | Low — outage = lost revenue + SLA breach | Weight load + KV-cache bandwidth | KV tiered HBM→host→NVMe/CXL | 2N / Tier-IV + N+1 cooling on standby |
| Batch inference | High — queue-and-retry | Throughput reads | Object store / capacity tier | N — interruption-tolerant |
| Edge inference | Site-level — fleet geo-redundancy | Local model store + periodic sync | Local NVMe; minimal | Often N; resilience via fleet-of-sites |
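
The two failure-rate figures quoted above can be put on the same scale with one line of arithmetic. The sketch below assumes failures are independent and that the interruption rate grows linearly with GPU count, which is a simplification; `cluster_mtbf_hours` is a hypothetical helper name.

```python
def cluster_mtbf_hours(ref_mtbf_hours: float, ref_gpus: int, gpus: int) -> float:
    """Scale a reference MTBF to a larger cluster.

    Assumes independent failures, so the failure rate is proportional
    to GPU count and MTBF is inversely proportional to it.
    """
    if ref_mtbf_hours <= 0 or ref_gpus <= 0 or gpus <= 0:
        raise ValueError("all inputs must be positive")
    return ref_mtbf_hours * ref_gpus / gpus


if __name__ == "__main__":
    # ~7 days per 512 GPUs, scaled to a 16,384-GPU job.
    scaled = cluster_mtbf_hours(7 * 24, 512, 16_384)
    print(f"scaled MTBF: {scaled:.2f} h")                 # 5.25 h
    print(f"interruptions per day: {24 / scaled:.1f}")
```

Under that assumption the best-in-class figure scales to an interruption roughly every five hours at 16,384 GPUs, the same order as the ~three hours quoted for the Llama 3 run. Either way a frontier-scale job is interrupted several times a day, which is why the spend belongs in the restart path.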
Deep dive: why checkpoint bandwidth is a redundancy decision in disguise
It is tempting to file checkpoint storage under "storage" and redundancy under "electrical," and to size them in separate workstreams. For training, that separation hides the real trade. A synchronous job's resilience strategy is checkpoint-and-resume: every node failure is absorbed by reloading the last checkpoint and replaying. The cost of that strategy is two-fold — the goodput lost while all GPUs pause to write each checkpoint, and the work re-done since the last one. Both shrink as checkpoint write bandwidth rises: faster writes mean you can checkpoint more often (less re-done work) at lower per-checkpoint cost (less pause).
So the questions "how much should we spend on facility redundancy?" and "how fast must checkpoint storage be?" are the same question asked twice. At a top operator's ~7-day MTBF per 512 GPUs and Meta's observed ~one interruption per three hours at 16k-GPU scale, the dominant resilience lever is not 2N power; it is the storage and checkpointing path that makes each interruption cheap. Capital budgeted for Tier-IV redundancy on a checkpointable cluster almost always returns more as checkpoint bandwidth, hot spares, and autonomous fault recovery. The inference case flips: there is no checkpoint to resume to, so the spend belongs in 2N power and N+1 cooling. Interruption tolerance, read once, sets both columns. → Chapter 9.4, Chapter 12.2.
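
One way to quantify the trade described above is the classic first-order checkpoint-interval approximation (often attributed to Young and refined by Daly). It is brought in here for illustration and is not the chapter's own method. It assumes the checkpoint pause is much shorter than the MTBF and ignores restart time. The 5 TB checkpoint size and the two write bandwidths are hypothetical inputs; the three-hour MTBF is the figure quoted above.

```python
import math


def checkpoint_pause_s(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Seconds all GPUs pause while one checkpoint is written."""
    return checkpoint_gb / write_gb_per_s


def optimal_interval_s(pause_s: float, mtbf_s: float) -> float:
    """First-order optimum: tau = sqrt(2 * pause * MTBF)."""
    return math.sqrt(2.0 * pause_s * mtbf_s)


def lost_fraction(pause_s: float, interval_s: float, mtbf_s: float) -> float:
    """Approximate goodput loss: pause overhead plus expected re-done work."""
    return pause_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf_s = 3 * 3600          # ~one interruption every three hours
    checkpoint_gb = 5_000      # hypothetical checkpoint size
    for write_gb_per_s in (100, 1_000):  # hypothetical storage tiers
        pause = checkpoint_pause_s(checkpoint_gb, write_gb_per_s)
        tau = optimal_interval_s(pause, mtbf_s)
        loss = lost_fraction(pause, tau, mtbf_s)
        print(f"{write_gb_per_s} GB/s: pause {pause:.0f} s, "
              f"interval {tau:.0f} s, loss {loss:.1%}")
```

With these assumed inputs, a tenfold increase in write bandwidth shortens the best interval and cuts the estimated loss from roughly 10% to roughly 3%. That is the sense in which checkpoint bandwidth substitutes for facility redundancy on a checkpointable job.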
Mapping 4 — siting: power-first vs latency-first
The fourth mapping is the least reversible of all — you cannot move a slab — which is why it must be derived from the workload, never chosen first and rationalized after. Latency sensitivity is the discriminator. Pre-training and batch inference are indifferent to user proximity, so they are scored power-first: chase the cheapest firm (or curtailable) megawatts and the coldest free-cooling climate, accept that the site may be hours from any metro, and treat the grid-interconnection queue slot as the scarcest asset in the project. Online and edge inference are scored latency-first: chase sub-50 ms reach to users and accept power that can cost 2–4x more, distributing capacity for proximity rather than concentrating it for cost.
The 2026 context sharpens this fork. The binding constraint is power, not chips — US large-load interconnection waits run ~3–7+ years in the densest hubs — so a power-first archetype that mis-sites near expensive, constrained metro power burns both money and a queue slot it cannot recover. A latency-first archetype sited in a cheap-power exurb, conversely, may meet its energy budget and miss its SLO, which is the more expensive miss because it loses the revenue the building exists to earn. Water availability is a hard siting gate for any liquid-cooled hall regardless of archetype (Chapter 3.7). The reordered hierarchy and the speed-to-power race are engineered in Chapter 3.1 and Chapter 3.2; the fiber/latency screen in Chapter 3.6.
The reference design-basis sheet, per archetype
The four mappings converge into a single artifact: a reference design-basis sheet per archetype that freezes the inherited assumptions before any long-lead equipment is ordered. This is the deliverable 1.1 promised under "design-basis document," now filled in. Each sheet pins one row per subsystem — density tier, cooling modality, fabric blocking ratio, system ratios, storage tier and throughput floor, redundancy topology, voltage class, and siting class — plus a reversible-vs-irreversible register recording which assumptions are hedged and which are committed. The table below is the skeleton; a real sheet attaches the numbers (the ramp curve, the MVA sizing, the CDU capacity) and the signatures.
| Subsystem | Pre-training | Online inference | Batch inference | Edge inference |
|---|---|---|---|---|
| Density tier | 100–600 kW (DLC) | 30–100 kW (air/RDHx/DLC) | 30–60 kW (flexible) | few kW–~30 kW |
| Cooling modality | DLC, warm-water loop | Air→RDHx→DLC by density | Host hall's existing (air often fine) | Air / sealed modular |
| Fabric blocking | 1:1 non-blocking | 2:1–3:1 oversubscribed | Heavily oversubscribed | Minimal / WAN backhaul |
| Storage tier | Parallel FS + GPUDirect | KV-tiered HBM→NVMe/CXL | Object / capacity tier | Local NVMe |
| Redundancy | N / N+1 (checkpoint) | 2N / Tier-IV + N+1 cooling | N (queue-and-retry) | N + fleet geo-redundancy |
| Voltage class | 415/480 VAC → 800 VDC path | 415/480 VAC | 415/480 VAC | Local LV / appliance |
| Siting class | Power-first (cheap, cold, big queue) | Latency-first (sub-50 ms) | Cheapest / curtailable power | Proximity over cost |
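
The skeleton above maps naturally onto an immutable record, which is one way to make "freeze before ordering" literal. The sketch below is an illustration only: `DesignBasis`, `SHEETS`, and `changed_rows` are hypothetical names, the cell values are copied verbatim from the table, and a real sheet would add the numbers and signatures the text describes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)  # frozen: rows cannot be reassigned after creation
class DesignBasis:
    density_tier: str
    cooling_modality: str
    fabric_blocking: str
    storage_tier: str
    redundancy: str
    voltage_class: str
    siting_class: str


# One sheet per archetype, cell values copied from the skeleton table.
SHEETS = {
    "pre-training": DesignBasis(
        "100–600 kW (DLC)", "DLC, warm-water loop", "1:1 non-blocking",
        "Parallel FS + GPUDirect", "N / N+1 (checkpoint)",
        "415/480 VAC → 800 VDC path", "Power-first (cheap, cold, big queue)"),
    "online inference": DesignBasis(
        "30–100 kW (air/RDHx/DLC)", "Air→RDHx→DLC by density",
        "2:1–3:1 oversubscribed", "KV-tiered HBM→NVMe/CXL",
        "2N / Tier-IV + N+1 cooling", "415/480 VAC",
        "Latency-first (sub-50 ms)"),
    "batch inference": DesignBasis(
        "30–60 kW (flexible)", "Host hall's existing (air often fine)",
        "Heavily oversubscribed", "Object / capacity tier",
        "N (queue-and-retry)", "415/480 VAC",
        "Cheapest / curtailable power"),
    "edge inference": DesignBasis(
        "few kW–~30 kW", "Air / sealed modular", "Minimal / WAN backhaul",
        "Local NVMe", "N + fleet geo-redundancy", "Local LV / appliance",
        "Proximity over cost"),
}


def changed_rows(a: DesignBasis, b: DesignBasis) -> dict[str, tuple[str, str]]:
    """Rows that differ between two sheets, as {row: (a_value, b_value)}."""
    return {f.name: (getattr(a, f.name), getattr(b, f.name))
            for f in fields(a)
            if getattr(a, f.name) != getattr(b, f.name)}


if __name__ == "__main__":
    diff = changed_rows(SHEETS["pre-training"], SHEETS["online inference"])
    for row, (train, serve) in diff.items():
        print(f"{row}: {train!r} vs {serve!r}")
```

Comparing two sheets row by row is a cheap way to see how much of a design basis would have to change if the archetype were revised after sign-off.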
Deep dive: reading the matrix backwards to audit an existing facility
The matrix is written forward — archetype in, design basis out — but its most useful diagnostic mode is backward. Given a facility that already exists (a hall you are evaluating to lease, retrofit, or acquire), read its observable subsystems back up the cascade and infer the archetype it was actually built for, then compare that to the workload you intend to run.
A hall with 40 kW air-cooled racks, an oversubscribed Ethernet fabric, 2N power, and a metro location is an inference building — try to run synchronous pre-training in it and you will hit the cooling cliff, starve the all-reduce, and pay for redundancy the training job does not value. A campus with 132 kW DLC racks, a non-blocking InfiniBand fabric, N+1 power, and a remote cheap-power site is a training building — run latency-sensitive inference from it and you will miss every proximity SLO while paying for bisection bandwidth the requests never use. The mismatches the backward read exposes are exactly the three anti-patterns 1.1 named: training fabric for an inference business, retrofitting past the air-cooling cliff, and over-provisioned redundancy for checkpointable jobs. The matrix is therefore both a scoping tool and a due-diligence checklist — the same table, run in two directions. → Chapter 5.10 (retrofit limits), Chapter 1.6 (procurement diligence).
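The backward read can be sketched as a handful of rule checks. The example below is a hypothetical illustration covering only the two archetypes and the mismatches named in the paragraph above; the `Facility` fields and the `audit` function are assumptions made for the sketch, not a complete due-diligence procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facility:
    rack_kw: float                  # design rack density
    liquid_cooled: bool             # DLC with facility water at the rack?
    fabric_oversubscription: float  # 1.0 means non-blocking
    power_topology: str             # "N", "N+1" or "2N"
    metro_site: bool                # close to users?


def audit(facility: Facility, intended: str) -> list[str]:
    """List the mismatches between a built facility and an intended archetype."""
    findings = []
    if intended == "pre-training":
        if not facility.liquid_cooled:
            findings.append(
                f"cooling cliff: {facility.rack_kw:g} kW air hall cannot host DLC racks")
        if facility.fabric_oversubscription > 1.0:
            findings.append("oversubscribed fabric starves the all-reduce")
        if facility.power_topology == "2N":
            findings.append("2N power buys redundancy a checkpointable job does not value")
    elif intended == "online inference":
        if not facility.metro_site:
            findings.append("remote site misses proximity SLOs")
        if facility.fabric_oversubscription <= 1.0:
            findings.append("non-blocking fabric is bisection bandwidth requests never use")
    else:
        raise ValueError(f"no rules sketched for archetype: {intended}")
    return findings


if __name__ == "__main__":
    inference_hall = Facility(40, False, 3.0, "2N", True)
    training_campus = Facility(132, True, 1.0, "N+1", False)
    print(audit(inference_hall, "pre-training"))
    print(audit(training_campus, "online inference"))
```

An empty list means the facility's observable subsystems are consistent with the intended workload under these rules. It does not mean the facility is suitable, since the rules cover only the mismatches the text calls out.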