Chapter 12.2

The AI-Cluster Reliability Rethink: Goodput vs Facility Availability

Facility availability (the data-center 'nines') measures whether the building is up. An AI cluster earns its return on goodput, the fraction of purchased GPU-hours that turn into useful work. The two numbers diverge so sharply that, for most AI factories, spending the next redundancy dollar on facility availability is the wrong call.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Whether your design-basis metric is facility availability (the Uptime/TIA 'nines') or cluster goodput — because the two pull redundancy spend in opposite directions, and only one tracks the revenue.
Where redundancy actually lives for your workload: in the facility power chain (2N/Tier-IV), in the silicon and storage (capacitance, hot spares, fast-checkpoint tiers), or in the software (elastic training, request retry) — and therefore what the next dollar buys.
How much facility availability your thermal/mechanical path can actually deliver once chilled-water inertia is gone and a CDU/pump loss throttles or trips racks in seconds, not minutes.
Whether your cluster is a grid-reliability problem in its own right — a synchronized multi-hundred-MW load swing the utility now models as a fault — and who pays to flatten it.
The point on the goodput-vs-availability curve where you stop buying facility nines and start buying goodput — the crossover that the Chapter 12.5 model quantifies for your failure environment.

For sixty years the data-center industry optimized one number: availability, the fraction of time the facility is energized and cooled, expressed as a string of nines and certified against a tier. A Tier III site promises concurrent maintainability and roughly 99.982% uptime (about 1.6 hours of downtime a year); a Tier IV site adds fault tolerance and roughly 99.995% (about 26 minutes a year). That metric was correct for the workload it was built around: enterprise applications and web services where the unit of value is a transaction, an outage is a binary up/down event, and a single rack going dark is a contained, recoverable nuisance. Redundancy (N+1, 2N, block- and distributed-redundant power, dual cooling paths) exists to push that one number toward unity.
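The downtime figures quoted for each tier follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a non-leap 8,760-hour year (the function name is illustrative, not from any standard):

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected downtime per year implied by an availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


tier_iii_hours = annual_downtime_hours(99.982)
tier_iv_minutes = annual_downtime_hours(99.995) * 60

print(f"Tier III: {tier_iii_hours:.2f} h/yr")
print(f"Tier IV:  {tier_iv_minutes:.1f} min/yr")
```

Both results round to the figures in the text (about 1.6 hours and about 26 minutes), which is all the "nines" encode: a time budget for the building, with no term for what the load does while the building is up.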

An AI factory breaks the assumption underneath the metric. A frontier training job is one tightly-coupled supercomputer running synchronously across tens of thousands of accelerators; a single failed GPU forces the entire job to restart from its last checkpoint. The facility can be at 100.000% availability — every breaker closed, every CDU pumping — and the cluster can still be throwing away a fifth of the money you spent on it, because the GPUs are idle waiting on a straggler, replaying lost steps, or stalled mid-checkpoint. Availability measures the building. It does not measure the work. The metric that measures the work is goodput, and this chapter is the argument that goodput — not the facility's nines — is the number that governs return, and therefore the number your redundancy budget should be optimizing.

Two metrics, and why they diverge

Define the terms precisely, because the whole rethink lives in the gap between them. Facility availability is a property of the physical plant: the probability that power and cooling are present at the rack, measured at the building boundary, certified by topology (Uptime Tier, TIA-942 Rated level, EN 50600 Availability Class). Goodput is a property of the workload: the fraction of provisioned accelerator-time that produces preserved, useful progress. Google's formulation decomposes it as ML Goodput = Scheduling Goodput × Runtime Goodput × Program Goodput — resources available × time spent making preserved progress × effective FLOP utilization — and everything that is not goodput is badput: accelerator init, JIT compilation, data-loading stalls, checkpoint save and restore, wasted progress replayed after a failure, and infrastructure recovery during restarts (Google Cloud, 2024–2025).
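The multiplicative decomposition can be sketched in a few lines. This is only an illustration of the product form quoted above; the function name and the three input values are placeholders chosen for the example, not measurements from any cluster.

```python
def ml_goodput(scheduling: float, runtime: float, program: float) -> float:
    """ML Goodput = Scheduling Goodput x Runtime Goodput x Program Goodput.

    Each factor is a fraction in [0, 1].
    """
    for name, value in (("scheduling", scheduling),
                        ("runtime", runtime),
                        ("program", program)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} goodput must be in [0, 1], got {value}")
    return scheduling * runtime * program


# Illustrative placeholder values only.
goodput = ml_goodput(scheduling=0.98, runtime=0.92, program=0.40)
badput = 1.0 - goodput
print(f"goodput = {goodput:.3f}, badput = {badput:.3f}")
```

Because the factors multiply, a shortfall in any one of them caps the whole product, which is why a facility-side improvement cannot compensate for a runtime or program loss.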

The two numbers diverge because most badput is invisible to the facility. When the building loses power, both availability and goodput drop — they agree. But the dominant losses in a real cluster are not facility outages. Meta's published Llama 3 405B snapshot recorded 419 unplanned interruptions over 54 days on 16,384 H100s — roughly one every three hours — of which 78% were hardware-caused and 58.7% traced to GPUs or HBM (Meta, 2024). None of those is a building outage. Every one of them is goodput lost while the facility sat at a flawless 100% availability. The relationship runs one way: facility availability is a ceiling on goodput, never a floor. You can have perfect availability and mediocre goodput; you cannot have poor availability and good goodput. Optimizing the ceiling while the workload bleeds out below it is the central mis-allocation.
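The interruption arithmetic and the ceiling relationship can be checked with a short sketch. The multiplicative model in `realized_goodput` is a simplifying assumption made for this example (goodput as facility availability times the workload's efficiency while the facility is up); it is not a formula from the cited sources.

```python
def mean_hours_between_interruptions(interruptions: int, days: float) -> float:
    """Average wall-clock hours between interruptions over a run."""
    return days * 24 / interruptions


def realized_goodput(facility_availability: float, workload_efficiency: float) -> float:
    """Assumed model: the workload can only be efficient while the facility is up."""
    return facility_availability * workload_efficiency


interval_h = mean_hours_between_interruptions(419, 54)
hardware_events = round(0.78 * 419)    # share attributed to hardware
gpu_hbm_events = round(0.587 * 419)    # share traced to GPUs or HBM

print(f"one interruption every {interval_h:.2f} h")
print(f"~{hardware_events} hardware-caused, ~{gpu_hbm_events} GPU/HBM")

# A perfect building does not lift a 90%-efficient workload above 90%.
print(realized_goodput(1.0, 0.90))
```

Under this model goodput can never exceed availability, but availability of 1.0 leaves the workload term untouched, which is the one-way relationship the paragraph describes.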

Redundancy moves into the silicon and the software

The deepest consequence of the goodput reframe is that redundancy migrates off the facility one-line diagram and into the compute stack. In the legacy model, resilience was almost entirely a facility property — you bought it as 2N power, dual cooling, and a generator yard, and the IT equipment was a passive load that the facility protected. In the AI model, the workload protects itself: synchronous training already tolerates node loss by checkpoint-and-resume, so the resilience that matters is how fast it recovers, which is a silicon, storage, and software question, not a switchgear question.

This changes the design-basis arithmetic at every layer. In the silicon: per-GPU capacitance (65 J/GPU on GB300, ~400 J/GPU on Vera Rubin) and rack BBUs ride through the millisecond-to-second transients that used to be the UPS's job, while the facility BESS is repurposed from outage ride-through to transient absorption (NVIDIA / SemiAnalysis, 2025–2026). In storage: multi-tier checkpointing collapses mean-time-to-recovery from the 15–30 minutes of a remote-filesystem restore to under two minutes by staging checkpoints in node-local and peer memory first (Google Cloud, 2025) — a goodput improvement the facility one-line cannot deliver at any redundancy tier. In software: elastic training continues on a shrunken world size instead of halting, and inference fabrics retry failed requests against healthy replicas. A dollar spent lifting facility power from N+1 to 2N prevents a class of outage the synchronous job already survives via checkpoint; the same dollar spent on a faster checkpoint tier or a hot-spare pool directly converts badput into goodput. The question is which layer resilience earns its keep in, not how many nines you buy.
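To see why recovery time is such a large lever, the sketch below combines two figures from different sources: the 419 interruptions over 54 days and the 15–30 minute versus under-two-minute MTTR range. Pairing them is an assumption made purely for illustration, as is the premise that every interruption costs one full MTTR; neither source reports this combined figure.

```python
def recovery_badput_fraction(interruptions: int, mttr_minutes: float, run_days: float) -> float:
    """Fraction of wall-clock run time spent recovering,
    assuming each interruption costs exactly one MTTR."""
    run_minutes = run_days * 24 * 60
    return interruptions * mttr_minutes / run_minutes


INTERRUPTIONS, RUN_DAYS = 419, 54  # interruption count and run length quoted in the text

slow_low = recovery_badput_fraction(INTERRUPTIONS, 15, RUN_DAYS)
slow_high = recovery_badput_fraction(INTERRUPTIONS, 30, RUN_DAYS)
fast = recovery_badput_fraction(INTERRUPTIONS, 2, RUN_DAYS)

print(f"remote restore:     {slow_low:.1%} to {slow_high:.1%} of the run")
print(f"multi-tier restore: {fast:.1%} of the run")
```

Under these assumptions the remote-filesystem restore costs roughly 8% to 16% of the run in recovery alone, and the multi-tier path about 1%. No change to the facility one-line appears anywhere in the calculation.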

Where the next redundancy dollar goes: facility availability vs goodput

Redundancy spend	Layer	What it buys	Training relevance	Inference relevance	Goodput leverage
N+1 → 2N facility power	Facility	Fewer building-level power outages; concurrent maintainability → fault tolerance	Low — job already survives outage via checkpoint-and-resume	High — an outage is lost revenue and a breached SLA	Weak for training; strong for always-on inference
Multi-tier / async checkpointing	Storage + software	MTTR from 15–30 min → <2 min; less replayed progress	Very high — directly cuts the largest badput bucket	Low — inference is stateless per request	Strongest single lever for training goodput
Hot-spare GPU pool + fast health-check/drain	Silicon + orchestration	Failed node swapped in minutes, not a fabric re-cable	High — shrinks recovery time per interruption	Moderate — keeps replica count above SLO	Strong for both, scales with failure rate
Per-GPU capacitance + rack BBU + facility BESS	Silicon + facility	Ride-through of ms–s transients; ~30% peak-grid reduction	Moderate — prevents transient-induced trips	Moderate — protects latency SLO during swings	Indirect — avoids badput from nuisance trips
N+1 CDU / pump / dual liquid loop	Thermal	Continuity of cooling once chilled-water inertia is gone	High — a cooling loss throttles/trips racks in seconds	High — same physics, no thermal buffer	Strong — the new single point of cluster-wide failure

The recurring decision in an AI factory. 'Goodput leverage' is qualitative; the quantitative crossover is the Chapter 12.5 sensitivity model. Figures are 2026-current; see the key numbers in this chapter for sources.

Read the table as a spend-allocation guide, not a feature list. For a checkpointable training cluster the rational priority order runs roughly opposite to the legacy facility instinct: checkpoint tiering and hot spares first, cooling continuity second, facility power redundancy last. For an always-on inference fleet the order largely reverts to the legacy one: facility availability and cooling continuity climb back up because an outage is unrecoverable revenue. The same building, two different workloads, two opposite redundancy budgets. Designing to goodput forces you to allocate redundancy capital where the workload loses money, which is rarely where the facility tier chart tells you to spend it.

The thermal path: where availability disappears

The most under-appreciated consequence of the density ramp is that the cooling system became the dominant single point of cluster-wide failure, and it lost its shock absorber at the same time. An air-cooled hall carried enormous thermal inertia: chilled-water volume, the air mass of the room, raised-floor plenum. A CRAH failure gave operators minutes of ride-through before inlet temperatures climbed: time to fail over, time to intervene. Direct-to-chip liquid cooling deletes that buffer. A GB200-class cold plate sits microns from a ~1,000 W die with a coolant inlet that must stay below ~25 °C and a delta-T held under ~10 °C; lose flow and junction temperature climbs in seconds, not minutes, and the GPUs either throttle (by up to 50%) or trip to protect the silicon.

This relocates the availability problem. The facility can hold power at Tier-IV nines and still take the entire cluster down through a coolant-distribution-unit fault, a pump trip, or a control-loop oscillation, because the technology-cooling loop now has no inertia to coast on. The design-basis response is to push redundancy into the loop the way the legacy world pushed it into the power chain: N+1 (or 2N) CDUs, redundant pumps with automatic ride-through, dual-fed coolant loops, and leak-detection with negative-pressure containment so a breach throttles rather than floods (Vertiv / Equinix / Chilldyne, 2026). Concurrent maintainability — the Tier-III property the industry already values — has to be re-earned in the liquid path: you must be able to pull a pump or service a heat exchanger without dropping the rack. Skimp on cooling-loop redundancy and you have built a cluster whose availability is capped by its weakest pump, no matter how many nines the power chain carries. → Chapter 12.1 sets the topology vocabulary; the DLC continuity engineering is in Chapter 5.4.
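The value of N+1 in the liquid path can be sketched with a standard k-of-n availability calculation. The unit availability of 0.99 and the four-CDU requirement are hypothetical inputs chosen for the example, and the model assumes independent failures, so it ignores the common-cause faults (shared controls, shared power feed) that real loops must also be engineered against.

```python
from math import comb


def k_of_n_availability(n_installed: int, k_required: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    a = unit_availability
    return sum(
        comb(n_installed, i) * a**i * (1 - a) ** (n_installed - i)
        for i in range(k_required, n_installed + 1)
    )


# Hypothetical loop: the racks need 4 CDUs' worth of flow.
n_only = k_of_n_availability(n_installed=4, k_required=4, unit_availability=0.99)
n_plus_1 = k_of_n_availability(n_installed=5, k_required=4, unit_availability=0.99)

print(f"N   (4 of 4): {n_only:.5f}")
print(f"N+1 (4 of 5): {n_plus_1:.5f}")
```

With these placeholder inputs a single spare unit moves the loop from roughly 96% to roughly 99.9% availability, the same shape of gain the legacy world bought in the power chain.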

Deep dive: the disappearance of chilled-water inertia, quantified

Thermal inertia is the integral of mass × specific heat × temperature headroom across everything between the chip and the heat sink. In an air hall it is large and free: a 1,000 m² raised-floor room holds tonnes of air and often thousands of litres of chilled water in the loop, buying minutes of ride-through after a cooling fault before any server crosses its inlet limit. The operator's runbook assumed those minutes — failover scripts, on-call response, even manual intervention all fit inside them.

Direct-to-chip flips every term. The coolant volume in contact with the die is litres, not tonnes; the temperature headroom is the ~10 °C the cold plate is allowed before throttling, not the 10–15 °C of ASHRAE air margin; and the heat flux is an order of magnitude higher. The time constant collapses from minutes to seconds. A pump that coasts to a stop, a CDU heat-exchanger fouling event, or a controls glitch that drops flow now races the silicon's thermal protection — and the silicon wins, throttling or tripping to survive. The engineering consequence is that cooling continuity must be designed to electrical-grade standards: pumps on UPS/BESS-backed feeds with automatic transfer, N+1 CDUs with isolation valves for concurrent service, flow and pressure telemetry feeding the same SLO-burn alerting as the compute fabric, and a commissioning test that drops a pump at full load and proves ride-through. The chilled-water buffer that quietly underwrote air-cooled availability is gone, and nothing in the facility tier chart replaces it. → Chapter 5.4; commissioning the worst-case branch in Chapter 13.3.
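The collapse from minutes to seconds can be approximated with a lumped energy balance, t = m · c · ΔT / P. The sketch below is a rough illustration only: the coolant masses and heat loads are hypothetical, the fluid is treated as plain water, and the model ignores pump coast-down, the thermal mass of metal, and any residual heat rejection.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K); assumption: coolant behaves like water


def ride_through_seconds(coolant_kg: float, headroom_k: float, heat_load_w: float,
                         specific_heat: float = WATER_SPECIFIC_HEAT) -> float:
    """Time for the heat load to consume the temperature headroom of the
    coolant mass once heat rejection stops (lumped model, no losses)."""
    return coolant_kg * specific_heat * headroom_k / heat_load_w


# Hypothetical air hall: 5,000 kg of chilled water buffering a 1 MW room.
air_hall_s = ride_through_seconds(coolant_kg=5000, headroom_k=10, heat_load_w=1_000_000)

# Hypothetical direct-to-chip rack: 20 kg of coolant in the loop, 120 kW of heat.
dlc_rack_s = ride_through_seconds(coolant_kg=20, headroom_k=10, heat_load_w=120_000)

print(f"air hall: {air_hall_s / 60:.1f} min")
print(f"DLC rack: {dlc_rack_s:.1f} s")
```

Even with generous assumptions the liquid loop's buffer is measured in single-digit seconds against minutes for the air hall, which is why the runbook steps that used to fit inside the ride-through window no longer do.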

The facility as a grid-reliability problem

The reliability rethink runs in both directions. The cluster's own reliability depends on the facility, but the facility has become a reliability problem for the grid, and that coupling now feeds back into the cluster's design-basis. AI training loads are phase-coherent and synchronized: tens of thousands of GPUs step from idle to peak and back together, every training step, producing load swings of hundreds of megawatts on sub-second timescales. Worse, the GPUs' own protection can drop the entire load at once. In a 2024 Virginia event, a single 230 kV fault tripped ~1,500 MW (1.5 GW) of data-center load off the grid within roughly 82 seconds: enough that the surviving generation had to absorb the imbalance, and enough that NERC issued a rare Level 3 Essential Actions Alert and now treats large data centers as grid actors with mandatory fault-ride-through obligations (NERC / Utility Dive, 2026).

Ride-through is no longer optional, and it is no longer purely a grid-interconnection concern — it is a goodput concern. A cluster that trips off on every grid disturbance to protect itself converts a recoverable grid event into a full cluster restart: maximum badput. The mitigation is the same transient-absorption stack that protects against the cluster's own load swings — per-GPU capacitance, rack BBUs, facility BESS, and intelligent power smoothing that has demonstrated ~30% reductions in peak grid demand on real training jobs — now also tuned to keep the cluster online through utility-side faults rather than dropping load (NVIDIA / SemiAnalysis, 2025–2026). The fork: engineer the facility to ride through grid disturbances (cost: storage and smoothing capex, plus a regulator-facing study) or accept that grid noise becomes cluster restarts (cost: goodput, plus a worsening relationship with a utility that can throttle your interconnection). → the full grid-interactive engineering — reactive support, frequency response, ride-through curves at the point of interconnection — is canonical in Chapter 4.10; the storage that backs it in Chapter 4.5.
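How far per-GPU capacitance reaches is a simple energy-over-power ratio. The sketch below is an upper bound under stated assumptions: it treats all stored energy as usable, borrows the ~1,000 W die figure quoted earlier as the GB300 load, and uses a purely hypothetical 2,000 W load for the ~400 J case because the text gives no power figure for that part.

```python
def ride_through_seconds(stored_joules: float, load_watts: float) -> float:
    """Upper bound on how long stored energy can carry a constant load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return stored_joules / load_watts


ASSUMED_GB300_WATTS = 1000.0      # assumption: reuses the ~1,000 W die figure from the text
ASSUMED_NEXT_GEN_WATTS = 2000.0   # hypothetical; not stated in the text

gb300_s = ride_through_seconds(65.0, ASSUMED_GB300_WATTS)
next_gen_s = ride_through_seconds(400.0, ASSUMED_NEXT_GEN_WATTS)

print(f"65 J at 1,000 W:  {gb300_s * 1000:.0f} ms")
print(f"400 J at 2,000 W: {next_gen_s * 1000:.0f} ms")
```

Tens to hundreds of milliseconds is enough for step-to-step load transients, but not for a multi-second utility disturbance, which is why the stack layers rack BBUs and facility BESS above the on-package capacitance.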

419 / 54 days

unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU/HBM — all at 100% facility availability

2024 · Meta (Llama 3 405B paper) / Tom's Hardware

~90% / ~96%

goodput (effective training time): industry average vs best-in-class; reliability overhead 6–21% of TCO

2025 · SemiAnalysis ClusterMAX / CoreWeave

~7 days

best-in-class MTBF per 512 GPUs on mature H100 clusters; far worse during 3–4 week burn-in

2025 · SemiAnalysis (100k H100 clusters)

99.982% / 99.995%

Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min/yr); Tier IV ~20–40% capital premium

2025 · Uptime Institute (% figures Uptime-disavowed)

15–30 min → <2 min

training MTTR cut by multi-tier checkpointing — a goodput gain no facility tier delivers

2025 · Google Cloud (multi-tier checkpointing)

~1,500 MW

data-center load lost on a single 230 kV fault (within ~82 s, Virginia); triggered NERC's rare Level 3 alert

2026 · NERC Level 3 Alert / Utility Dive

65 → ~400 J/GPU

per-GPU capacitance, GB300 → Vera Rubin (~6x); ~30% peak-grid-demand reduction demonstrated

2026 · NVIDIA / SemiAnalysis

~43.4%

large-LLM job failure rate (Alibaba Unicron); ~37% hardware-attributed, ~73% restart-recoverable

2024 · Alibaba (Unicron) via SemiAnalysis

Mapping the rethink onto the standards

None of this means the standards are wrong — it means they answer a question that is no longer the binding one. Uptime Tiers, TIA-942 Rated levels, and EN 50600 Availability Classes all certify the facility's availability, and none of them certify goodput. A Tier IV building tells a tenant the power and cooling will be present 99.995% of the time; it says nothing about whether the cluster inside it loses 10% or 20% of its bought GPU-hours to badput the facility never sees. The standards remain the right tool for the question they answer — they govern the ceiling — but they cannot be the design-basis metric for a workload whose return lives in the gap below that ceiling.
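The size of the gap below the ceiling is easier to see in GPU-hours. The sketch below assumes a 16,384-GPU fleet (the cluster size quoted earlier, used here only as a convenient scale) running for a full 8,760-hour year, and applies the badput fraction to the whole year.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def gpu_hours_lost_to_facility(gpus: int, availability_pct: float) -> float:
    """GPU-hours lost per year because the building itself is down."""
    return gpus * HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


def gpu_hours_lost_to_badput(gpus: int, badput_fraction: float) -> float:
    """GPU-hours lost per year to badput the facility never sees."""
    return gpus * HOURS_PER_YEAR * badput_fraction


GPUS = 16_384
facility_loss = gpu_hours_lost_to_facility(GPUS, 99.995)
badput_loss_10 = gpu_hours_lost_to_badput(GPUS, 0.10)
badput_loss_20 = gpu_hours_lost_to_badput(GPUS, 0.20)

print(f"Tier IV downtime: {facility_loss:,.0f} GPU-h/yr")
print(f"10% badput:       {badput_loss_10:,.0f} GPU-h/yr")
print(f"20% badput:       {badput_loss_20:,.0f} GPU-h/yr")
print(f"ratio at 10%:     {badput_loss_10 / facility_loss:,.0f}x")
```

Under these assumptions the certified downtime accounts for a few thousand GPU-hours a year while 10% badput accounts for millions, a ratio of about 2,000 to 1.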

The practical reconciliation is a two-layer design-basis: certify the facility to the minimum availability class the workload requires (high for always-on inference, deliberately modest for checkpointable training), then run a separate goodput design-basis that governs the silicon/storage/software redundancy the standards never touch. This is also where the redundancy primer in Chapter 0.5 gets extended: N, N+1, 2N and the distributed-redundant topologies are still the vocabulary, but you now apply them in two places (the facility power/cooling chain and the compute resilience stack), and the goodput model decides which application earns the spend. The standards landscape and topology selection are detailed in Chapter 12.1; the SLA that contracts goodput rather than availability is the subject of Chapter 12.4.

Availability-shaped vs goodput-shaped design basis

Axis	Availability-shaped (legacy / inference)	Goodput-shaped (training)
Primary metric	Facility nines (Tier/Rated/Class)	Effective accelerator-time (goodput %)
Outage model	Binary up/down at the building	Continuous badput leakage below a perfect ceiling
Where redundancy lives	Facility power & cooling (2N, dual path)	Silicon, storage, software (checkpoint, spares, elastic)
Power redundancy target	2N / Tier-IV-class	N or N+1 — checkpoint-and-resume tolerant
Dominant failure to engineer against	Utility outage, switchgear fault	GPU/HBM faults, stragglers, slow recovery, cooling loss
Next-dollar priority	More facility nines	Faster recovery + cooling continuity
Certified by	Uptime / TIA-942 / EN 50600	Acceptance goodput baseline + ClusterMAX-style health checks

The two design bases pull in opposite directions on nearly every axis. Most real facilities are a deliberate blend keyed to workload mix, not a pure pick.

The goodput-availability tradeoff curve

Put the two metrics on one curve and the design decision becomes a spending question with a real crossover. On the x-axis, redundancy capital; on the y-axis, realized goodput. Early dollars spent on the workload's own dominant failures — checkpoint tiering, hot spares, cooling continuity, ride-through — buy steep goodput gains, because they attack the badput that actually dominates the loss budget (GPU/HBM faults, slow recovery, transient trips). Past a point, the next dollar can only go to facility availability — lifting power from N+1 toward 2N+ — and for a checkpointable training job that dollar buys almost no goodput, because the job already survives the outage class that spend prevents. The curve flattens, and the flattening point is where you should stop buying nines.

For an always-on inference fleet the curve has a different shape: facility availability stays on the productive part of the curve much longer, because every outage is unrecoverable revenue rather than a checkpoint replay, so 2N power and dual cooling keep paying. The right redundancy budget is therefore workload-specific and quantitative: it is the point where the marginal goodput per dollar of facility availability falls below the marginal goodput per dollar of compute-stack resilience — and finding that point for your actual failure environment is exactly what the reliability model does. It takes component failure rates (Chapter 14.3) and FMEA scenarios (Appendix F) as inputs and rolls them up to cluster-level availability and goodput, then runs the sensitivity analysis that locates the crossover. → the model is built in Chapter 12.5; the failure-rate inputs in Chapter 14.3.

The crossover, stated as a rule

Spend redundancy capital where the workload loses money, in this order until the marginal goodput-per-dollar equalizes: (1) attack the dominant badput bucket — for training that is recovery time, so checkpoint tiering and hot spares come first; (2) secure cooling continuity, because the liquid path is the new cluster-wide single point of failure and it lost its inertia; (3) buy ride-through to keep grid noise from becoming restarts; (4) only then buy facility availability nines — and for a checkpointable job, stop early, because 2N power prevents an outage the checkpoint already survives. For always-on inference, move step 4 up: the outage is unrecoverable, so facility availability stays on the productive part of the curve. The exact crossover is not a heuristic — it is the output of the Chapter 12.5 sensitivity model run against your failure environment.
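The rule can be sketched as a greedy allocation that always buys the tranche with the best marginal goodput per dollar. This is a toy stand-in, not the Chapter 12.5 model: every cost and gain below is a hypothetical placeholder, and greedy selection under a budget is a heuristic rather than an optimal solver.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    # Successive (cost in $M, goodput gain in points) tranches, in purchase
    # order, so later tranches can model diminishing returns.
    tranches: list[tuple[float, float]]


def allocate(options: list[Option], budget_musd: float) -> list[str]:
    """Repeatedly buy the affordable tranche with the highest gain per dollar."""
    next_idx = {o.name: 0 for o in options}
    remaining = budget_musd
    plan: list[str] = []
    while True:
        candidates = []
        for o in options:
            i = next_idx[o.name]
            if i < len(o.tranches):
                cost, gain = o.tranches[i]
                if cost <= remaining:
                    candidates.append((gain / cost, o.name, cost))
        if not candidates:
            return plan
        _, name, cost = max(candidates)
        plan.append(name)
        remaining -= cost
        next_idx[name] += 1


# Hypothetical numbers for a checkpointable training cluster.
OPTIONS = [
    Option("checkpoint tiering", [(2.0, 4.0), (2.0, 1.0)]),
    Option("hot spares", [(3.0, 3.0), (3.0, 0.9)]),
    Option("cooling continuity", [(4.0, 2.4)]),
    Option("ride-through", [(5.0, 2.0)]),
    Option("facility 2N power", [(20.0, 0.2)]),
]

plan = allocate(OPTIONS, budget_musd=20.0)
print(plan)
```

With these placeholder inputs the budget is exhausted on the compute-stack, cooling, and ride-through tranches before the facility-power tranche is ever the best buy. Raising the gain on "facility 2N power", as an always-on inference fleet would, moves it up the purchase order.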

Deep dive: why a perfect facility still delivers ~90% goodput

Take the Meta Llama 3 405B numbers at face value: 419 interruptions in 54 days on 16,384 GPUs, 78% hardware-caused, and yet over 90% effective training time was achieved. Decompose where the other ~10% went, because it explains why facility availability cannot move the number. The losses are: wasted progress, the work done since the last checkpoint and thrown away on each interruption (mitigated by checkpoint cadence, the Young/Daly optimal interval); infrastructure recovery, the time to detect the failure, drain the bad node, reschedule, and reload state (mitigated by fast health-checks and multi-tier checkpoint restore); stragglers, where the whole synchronous job moves at the speed of its slowest rank, so one degraded 'lemon' GPU taxes thousands of healthy ones (mitigated by lemon-node detection and eviction); and steady-state MFU below the theoretical peak. Not one of those buckets is a facility-availability event. The power was on and the cooling was flowing through every single one of the 419 interruptions.
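The Young/Daly interval mentioned above has a simple first-order form, τ ≈ sqrt(2 · δ · M), where δ is the time to write one checkpoint and M is the mean time between failures. The sketch below uses the interruption rate from the text for M and an assumed 60-second checkpoint cost; the 60 seconds is a hypothetical input, and the approximation only holds when δ is much smaller than M.

```python
from math import sqrt


def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * delta * M)."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate lost fraction: checkpoint writes plus expected replayed work
    (on average half an interval is lost per failure). Restart time is excluded."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


MTBF_S = 54 * 24 * 3600 / 419   # about 3.1 hours between interruptions
CHECKPOINT_COST_S = 60.0        # hypothetical checkpoint write time

tau = young_daly_interval(CHECKPOINT_COST_S, MTBF_S)
loss = overhead_fraction(tau, CHECKPOINT_COST_S, MTBF_S)

print(f"optimal interval: {tau / 60:.1f} min")
print(f"overhead at optimum: {loss:.1%}")
```

Both terms in the overhead depend only on checkpoint cost and failure rate, so the levers are a cheaper checkpoint (smaller δ) or fewer interruptions (larger M). Facility availability does not appear in the expression.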

The consequence for capital allocation is direct. The 6–21% of TCO that goes to 'reliability overhead' in a well-run cluster is spent almost entirely inside the goodput stack — spare capacity, checkpoint storage bandwidth, health-check tooling, elastic-training engineering — not on facility nines. A best-in-class operator closing the gap from ~90% to ~96% goodput does it with faster recovery and better straggler detection, not with a higher Uptime tier. That 6-point goodput gain is worth more, on a large fleet, than the entire capital premium of going from Tier III to Tier IV. → the checkpoint math in Chapter 9.4; serving-side goodput in Chapter 10.11.

This chapter is the conceptual hinge of Part 12. The standards and topologies it reframes are detailed in Chapter 12.1, and the redundancy vocabulary it extends into two layers comes from Chapter 0.5. The thermal-continuity engineering that the liquid path now demands is in Chapter 5.4; the storage and grid-interactive behaviour behind ride-through are in Chapter 4.5 and Chapter 4.10. The goodput stack itself — checkpointing in Chapter 9.4 and serving-side goodput-optimal scheduling in Chapter 10.11 — is where the redundancy this chapter redirects actually lives. The crossover point on the tradeoff curve is quantified by the reliability model in Chapter 12.5, fed by the failure rates of Chapter 14.3; the goodput SLA that contracts the result is Chapter 12.4, and the geographic-failover layer above it is Chapter 12.3.