The Definitive Guide to AI Data Centers

Chapter 12.5

Quantitative Reliability & Availability Modeling (RBD / FTA / Monte-Carlo)

An availability number you cannot reconstruct from component failure rates, repair times, and a stated failure environment is a marketing claim, not an engineering result — and the whole point of RBD, fault trees, Markov chains, and Monte-Carlo is to turn the redundancy debate into a model whose every nine, and every percent of goodput, has a traceable parent.

GOODPUT · POWER-BOUND

What you'll decide here

  1. Which target you are actually modeling — facility availability (uptime fraction) or cluster goodput (effective-training-time fraction) — because they are different objective functions with different dominant terms, and a redundancy investment that buys nines may buy almost no goodput.
  2. Which method fits the question: closed-form RBD/k-of-n algebra for static redundancy, a Markov state-space model when repair-crew limits and degraded states matter, fault trees and minimal cut sets to find the dominant failure path, and Monte-Carlo when the failure environment is correlated, time-varying, or non-exponential.
  3. Whether your model accounts for common-cause failure (the shared CDU, the shared bus, the fleet-wide firmware push), because an idealized parallel block claims six nines that a single beta-factor erases down to about four.
  4. Which inputs you trust: the Chapter 14.3 AFRs, the Appendix F FMEA scenarios, and the repair/restore times — because a quantitative model is exactly as credible as its rates, and most of those rates are contested and scale-dependent.
  5. Where the next dollar of redundancy buys the most nines or the most goodput — the sensitivity (importance) ranking that is the actual deliverable of this chapter and the engine under the Chapter 12.2 tradeoff curve.

Chapter 0.5 gave you the vocabulary — series and parallel, N / N+1 / 2N, the Tier ladder, availability versus goodput — as a primer you could carry into a design review. This chapter turns that vocabulary into a method: the quantitative machinery that takes component failure rates and repair times and rolls them up into a defensible cluster-level number, with a traceable parent for every nine and every percent of goodput. It is the chapter where "we built it 2N, so it's reliable" stops being an assertion and becomes a calculation that someone can check, attack, and improve.

The chapter builds the method up one technique at a time. We extend the series/parallel/k-of-n algebra into real availability arithmetic; we build Markov state-space models for repairable systems where repair crews are finite and degraded operation is a real state, not a binary; we use fault trees and minimal cut sets to surface the dominant failure path that the redundancy diagram hides; we model common-cause failure with beta-factors, because that is where idealized parallel redundancy quietly fails to deliver; and we close with Monte-Carlo, which is the only honest tool once the failure environment becomes correlated, time-varying, and non-exponential — which, for an AI cluster, it always is. The through-line: a model is exactly as good as its inputs and its honesty about correlation, and the two most common ways to lie with an availability model are to assume independence and to model uptime when the thing that pays the bills is goodput.

Two targets, two objective functions

Before any algebra, name the target. A traditional IT availability model computes the fraction of time the service is up — the complement of downtime, the thing the Uptime Tiers and the SLA penalty ladder are written against. An AI training cluster cares about a different number: goodput, the fraction of wall-clock GPU-time that advances the job rather than being lost to failure, detection, restart, and recompute-from-checkpoint. The two are not the same function, and the gap between them is the entire reason Chapter 12.2 exists. A facility can be 99.99% available (the power and cooling never dropped) while the cluster runs at 85% goodput (a GPU failed every few hours and each failure cost a checkpoint-restart) — the facility model and the goodput model share almost no dominant terms.

This matters because the redundancy investment that moves one barely moves the other. Spending capital on 2N facility power lifts availability but does almost nothing for goodput, because the goodput loss in a synchronous training job is dominated by silicon failure — GPUs, HBM, optics, NICs — that 2N power does not prevent. Spending the same capital on faster checkpointing, hot spares, and lemon-node ejection lifts goodput and does almost nothing for the published availability number. A model that does not declare which target it optimizes will recommend the wrong investment with a straight face. → the rethink in Chapter 12.2; the metric definitions in Chapter 0.3.

Reliability block diagrams: the series/parallel/k-of-n algebra

The RBD is the workhorse and the right place to start, because it makes the topology's logic explicit before any number is plugged in. Each block is a component with a steady-state availability A = MTBF / (MTBF + MTTR) — uptime over uptime-plus-repair. The blocks compose by two rules. Series (every block must work): availabilities multiply, A = A₁ × A₂ × …, so a chain is always less available than its weakest link and unavailability roughly adds. Parallel (any one block suffices): the unavailabilities multiply, A = 1 − (1−A₁)(1−A₂)…, so redundancy multiplies the small failure probabilities into a much smaller one — this is where the nines come from.
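The two composition rules are small enough to write down directly. The sketch below is a minimal illustration, assuming independent blocks and constant rates; the block names and the MTBF/MTTR figures are hypothetical, not values from this guide.

```python
from math import prod

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*blocks: float) -> float:
    """Every block must work: availabilities multiply."""
    return prod(blocks)

def parallel(*blocks: float) -> float:
    """Any one block suffices: unavailabilities multiply."""
    return 1.0 - prod(1.0 - a for a in blocks)

# Hypothetical blocks with illustrative numbers only.
ups = availability(mtbf_hours=50_000, mttr_hours=8)
chiller = availability(mtbf_hours=20_000, mttr_hours=24)

single_path = series(ups, chiller)               # one power-plus-cooling chain
dual_path = parallel(single_path, single_path)   # two such chains, assumed independent

print(f"single path: {single_path:.6f}")
print(f"dual path:   {dual_path:.9f}")
```

The independence assumption inside `parallel` is the one the rest of the chapter attacks; treat the dual-path figure as an upper bound, not a prediction.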

The general case is k-of-n: the system works if at least k of n identical units work, which is the honest model for an N+1 or N+2 power or cooling plant (you need N running, you installed N+1, so it survives any one failure). For independent units each with availability p, the system availability is the binomial tail, the probability that k or more of n survive: A = Σ C(n,i) pⁱ (1−p)ⁿ⁻ⁱ for i = k…n. The practical consequence: going from N (k=n, a series-like all-must-work plant) to N+1 (k=n−1) removes by far the most downtime in absolute terms, and each further +1 removes rapidly less. That diminishing-returns curve is not intuition; it falls straight out of the binomial, and it is the arithmetic backbone of the Chapter 12.2 tradeoff curve.
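The binomial tail can be evaluated in a few lines. This is a sketch under the stated independence assumption; the plant size (four units needed) and the 99% per-unit availability are hypothetical.

```python
from math import comb

def k_of_n(k: int, n: int, p: float) -> float:
    """P(at least k of n independent units are up), each with availability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical plant: 4 units needed, each 99% available.
need, p = 4, 0.99
previous = None
for spare in range(0, 4):
    a = k_of_n(need, need + spare, p)
    gain = "" if previous is None else f"  gain over previous step = {a - previous:.2e}"
    print(f"N+{spare}: availability = {a:.8f}{gain}")
    previous = a
```

Measured as absolute availability gained, the first spare removes most of the downtime and each later spare removes far less, which is the shape of the tradeoff curve described above.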

Which method for which question

| Method | Answers | Key assumption | Breaks when | Typical AI-DC use |
| --- | --- | --- | --- | --- |
| RBD / k-of-n algebra | Steady-state availability of a fixed topology | Independent blocks, constant rates, unlimited repair | Repair is shared/limited, or failures correlate | Power & cooling plant nines; N vs N+1 vs 2N |
| Markov / state-space | Repairable systems with degraded states & finite crews | Exponential (memoryless) transitions | Wear-out, time-varying, or non-exponential repair | MTTR vs crew size; degraded-cooling ride-through |
| Fault tree / minimal cut sets | The dominant failure path and its top contributors | Boolean gates; rates assignable to basic events | Dynamic ordering / sequence-dependent failures | Finding the single shared CDU or bus that dominates |
| Monte-Carlo simulation | Availability AND goodput under a realistic environment | Only what you put in the sampling model | Inputs (rates, correlations) are themselves wrong | Cluster goodput with correlated, scale-dependent AFRs |

The four core techniques are complementary, not competing. Pick by what the question actually requires; escalate from closed-form to simulation only when the assumptions break.

Read the table as an escalation ladder, not a menu. Start with the RBD because it is cheap and forces the topology to be explicit. Escalate to Markov the moment repair is shared or the system has a meaningful degraded state — a cooling plant that limps at reduced capacity rather than failing outright. Build a fault tree when the RBD has gotten complex enough that the dominant failure path is no longer obvious by inspection. And reach for Monte-Carlo when the thing you actually care about — goodput under a correlated, scale-dependent failure environment — has no closed form at all. Each rung exists because the rung below it made an assumption that the cluster violates.

Markov state-space models: repair crews and degraded states

The RBD's hidden assumption is that repair is instantaneous-to-start and unconstrained — every failed unit gets its own repair crew the moment it fails. Real facilities have finite crews, parts depots with turnaround time, and components that fail into a degraded state rather than a clean binary down. A Markov model captures all three by representing the system as a set of discrete states (all-up, one-down-repairing, two-down, degraded-running) with transition rates between them: failure rates (λ) push the system toward worse states, repair rates (μ) pull it back. Solve for the steady-state probability of each state and you have availability as the probability mass in the acceptable states — and, critically, you can see how much of your unavailability is queueing for a repair crew rather than waiting on a part.

This is where MTTR stops being a single number and becomes a policy. A 2N plant whose second unit is down for 24 hours awaiting a part is, for that window, running N with no margin — a state the RBD never showed you. The Markov model prices that exposure: if the repair rate μ is slow relative to the failure rate λ, the probability of sitting in the vulnerable degraded state climbs, and the second-failure-during-repair term — the one that actually takes the plant down — grows with it. The decision this surfaces is concrete and expensive: does the next dollar buy another redundant unit, or a faster repair (on-site spares, a larger crew, a depot SLA)? For a plant already at N+1, shrinking MTTR often buys more availability per dollar than adding N+2, because it attacks the dominant second-failure-during-repair term directly. The model tells you which; intuition routinely gets it backwards. → repair-time and sparing inputs in Chapter 14.3.
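A small birth-death chain is enough to see the crew-size effect. The sketch below assumes identical units in active redundancy, exponential failure and repair, and a state defined only by the number of failed units; the rates are hypothetical. With one crew per unit it reduces to the independent RBD result, which is a useful sanity check.

```python
def steady_state(n, lam, mu, crews):
    """Steady-state distribution over i = number of failed units (0..n)."""
    weights = [1.0]
    for i in range(n):
        fail_rate = (n - i) * lam              # i -> i+1: any working unit fails
        repair_rate = min(i + 1, crews) * mu   # i+1 -> i: limited by crew count
        weights.append(weights[-1] * fail_rate / repair_rate)
    total = sum(weights)
    return [w / total for w in weights]

def markov_availability(n, k, lam, mu, crews):
    """Probability mass in states where at least k of n units are up."""
    pi = steady_state(n, lam, mu, crews)
    return sum(pi[: n - k + 1])

# Hypothetical 1-of-2 plant: MTBF 10,000 h per unit, MTTR 24 h.
lam, mu = 1 / 10_000, 1 / 24
cases = {
    "two crews, MTTR 24 h": markov_availability(2, 1, lam, mu, crews=2),
    "one crew,  MTTR 24 h": markov_availability(2, 1, lam, mu, crews=1),
    "one crew,  MTTR 12 h": markov_availability(2, 1, lam, 1 / 12, crews=1),
}
for label, a in cases.items():
    print(f"{label}: unavailability = {1 - a:.2e}")
```

Whether the next dollar belongs in a faster repair or another unit still depends on what each costs; the chain only prices the exposure.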

Deep dive: why exponential transitions are a convenient lie, and when it bites

Markov models assume memoryless (exponential) transitions: the probability a component fails in the next hour does not depend on how long it has already run. That is a genuine convenience — it makes the math a linear system you can solve by hand or with a small solver — and it is roughly true during the flat-bottom of the bathtub curve, the useful-life region where failures are random. It is false at both ends. During infant mortality (the first weeks after install, the burn-in window where new GPU clusters fail far more often than mature ones), and during wear-out (aging fans, pumps, capacitors, optics with creeping link margin), the failure rate is not constant and the exponential assumption understates the risk.

For an AI cluster this is not academic, because so much of the fleet is perpetually new — a capacity ramp means there is always a freshly-installed tranche in its high-failure infant-mortality window. The fix when it bites is one of two things: piecewise-Markov (different rate regimes for burn-in, useful-life, wear-out) or, more honestly, abandon Markov for Monte-Carlo and sample directly from Weibull or empirical lifetime distributions that capture the bathtub shape. The tell that you have outgrown the exponential assumption is a model that is confidently wrong about a brand-new cluster's first month — exactly the month the goodput numbers look worst in the field.
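The bathtub regimes correspond to the Weibull shape parameter: below 1 the hazard falls with age (infant mortality), at exactly 1 it is constant (the exponential case Markov assumes), above 1 it rises (wear-out). The sketch below uses hypothetical scale and shape values to show the difference, and samples lifetimes with the standard-library `random.weibullvariate`, whose first argument is the scale and second the shape.

```python
import random

def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical regimes: (scale in hours, shape). Values are illustrative only.
regimes = {
    "burn-in": (50_000.0, 0.5),
    "useful life": (50_000.0, 1.0),
    "wear-out": (50_000.0, 3.0),
}

for name, (scale, shape) in regimes.items():
    early = weibull_hazard(100.0, scale, shape)
    late = weibull_hazard(40_000.0, scale, shape)
    print(f"{name:12s} h(100 h) = {early:.2e}   h(40,000 h) = {late:.2e}")

# Lifetimes for a Monte-Carlo run, sampled from the burn-in regime.
rng = random.Random(0)
lifetimes = [rng.weibullvariate(50_000.0, 0.5) for _ in range(5)]
```

An empirical lifetime table can replace the Weibull draw without changing the rest of a simulation.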

Fault trees and minimal cut sets: finding the dominant path

An RBD is built bottom-up from components; a fault tree is built top-down from the thing you are trying to prevent — "cluster job halted," "hall loses cooling," "tenant SLA breached" — decomposed through AND/OR gates down to basic events with assignable rates. The two are duals, but the fault tree answers a question the RBD does not phrase well: what are the smallest sets of simultaneous failures that take the system down? Those are the minimal cut sets, and ranking them by probability is the single most useful output in this chapter. A first-order cut set (a single basic event that alone causes the top event) is a latent single point of failure hiding inside a topology you thought was redundant. A model that surfaces one shared, un-redundant component buried under two layers of N+1 has earned its keep.

The recurring AI-data-center example is the cooling distribution unit. The RBD shows N+1 CDUs and reports comfortable nines. The fault tree, traced to basic events, reveals that all of them share a single secondary-loop isolation valve, a single facility-water supply, or a single controls PLC — a first-order cut set that the parallel block diagram drew right over. The same pattern recurs for a shared busway feeding ostensibly-independent power paths, a single BMS that gates every chiller, and a single firmware image across every BMC. The discipline is to push the tree down to basic events — physical, replaceable, independently-failing things — and refuse to stop at the convenient block boundary, because the convenient boundary is exactly where the shared dependency hides. → the FMEA basic-event catalog in Appendix F.
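The CDU example can be traced mechanically. The sketch below expands a small AND/OR tree into minimal cut sets and ranks them; the tree, the event names, and the per-event unavailabilities are all hypothetical, and the ranking multiplies event probabilities, which assumes independent basic events.

```python
from itertools import product
from math import prod

def cut_sets(node):
    """Expand an AND/OR tree into (not yet minimal) cut sets."""
    if isinstance(node, str):                 # basic event
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":                          # any child's cut set suffices
        return set().union(*child_sets)
    if gate == "AND":                         # need one cut set from every child
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")

def minimal(sets):
    """Drop any cut set that strictly contains another."""
    return {s for s in sets if not any(other < s for other in sets)}

# Hypothetical top event "hall loses cooling": 3 CDUs, any 2 carry the load,
# plus two shared components the block diagram drew over.
tree = ("OR", [
    "controls_plc",
    "isolation_valve",
    ("AND", ["cdu_a", "cdu_b"]),
    ("AND", ["cdu_a", "cdu_c"]),
    ("AND", ["cdu_b", "cdu_c"]),
])

# Illustrative unavailabilities, not measured values.
q = {"controls_plc": 1e-4, "isolation_valve": 5e-5,
     "cdu_a": 5e-3, "cdu_b": 5e-3, "cdu_c": 5e-3}

mcs = minimal(cut_sets(tree))
ranked = sorted(mcs, key=lambda s: prod(q[e] for e in s), reverse=True)
for s in ranked:
    print(sorted(s), f"{prod(q[e] for e in s):.1e}")
```

With these inputs the two first-order cut sets outrank every CDU pair, which is the pattern the paragraph above describes.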

Common-cause failure: the beta-factor that erases your nines

Here is the term that separates a credible model from a flattering one. Parallel redundancy delivers its spectacular nines only if the redundant units fail independently. They never fully do. A shared cause — a common firmware bug pushed fleet-wide, a shared coolant chemistry problem, a single controls fault, a power transient that hits every unit at once, a maintenance error repeated on each "redundant" path — defeats redundancy by failing the units together. The standard way to model this is the beta-factor: a fraction β of a component's failures are assumed common-cause, hitting all redundant units simultaneously, while the remaining (1−β) fail independently. IEC 61508 puts β in the range of roughly 0.5% to 10%, with 10% the conservative default you inherit if you have done nothing to diversify the redundant paths.

The consequence is severe and counterintuitive. An idealized parallel pair of units each at 99.9% availability computes to about six nines (1 − 0.001²). Layer in a 5% beta-factor (assume 5% of failures are common-cause) and the pair's unavailability becomes roughly β × 0.001 = 5 × 10⁻⁵, so the system availability collapses to a little over four nines, because the common-cause term, not the independent-failure product, now dominates. At the conservative 10% default it is four nines flat. Redundancy cannot buy you below the common-cause floor. This is why diversity is worth paying for: different firmware versions across redundant controllers, A/B coolant sourcing, independent power-transient paths, staggered maintenance windows, and, above all, never pushing the same firmware to every redundant unit on the same day. A model without a beta-factor is the most common way an availability number is overstated by two nines or more.
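The arithmetic is short enough to check by hand or in code. The sketch below uses a simplified beta-factor approximation for a 1-of-2 pair, assuming a common-cause failure takes the same time to repair as an independent one; the exact number of nines depends on which variant of the model is used.

```python
from math import log10

def pair_unavailability(u: float, beta: float) -> float:
    """Simplified beta-factor model for a 1-of-2 redundant pair.

    u    : unavailability of one unit
    beta : fraction of failures that are common-cause
    """
    independent = ((1 - beta) * u) ** 2   # both units fail independently
    common = beta * u                     # one shared cause fails both at once
    return independent + common

def nines(unavailability: float) -> float:
    return -log10(unavailability)

u = 0.001  # each unit 99.9% available
for beta in (0.0, 0.005, 0.05, 0.10):
    q = pair_unavailability(u, beta)
    print(f"beta = {beta:5.3f}: unavailability = {q:.2e}  (about {nines(q):.1f} nines)")
```

Even at the low end of the IEC 61508 range the common-cause term is larger than the independent product, so adding a third unit would move the result very little.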

Monte-Carlo: simulating availability AND goodput under a real failure environment

Closed-form methods buy tractability with assumptions the cluster violates: independence, constant rates, exponential repair, a single target. Monte-Carlo abandons tractability for honesty. You build a generative model — sample each component's time-to-failure from its real lifetime distribution (Weibull for wear-out, empirical for the burn-in hump), inject correlation explicitly (a common-cause event that fails a set of units together, a grid transient that hits the whole hall), model the repair queue with its finite crews and depot delays — then run the clock thousands of times and read off the distribution of outcomes, not just a point estimate. The output is what no closed form gives you: a full distribution of annual downtime, of goodput, of worst-case windows, with confidence intervals.

For an AI cluster the killer feature is that the same simulation produces both targets at once. Layer the workload model on top of the failure model — checkpoint interval, detection time, restart-and-recompute cost, lemon-node ejection policy — and the run yields goodput directly: every sampled GPU failure triggers a detection delay, a restart, and a recompute of the work since the last checkpoint, and the fraction of GPU-time that survives that gauntlet is the goodput. This is the only method that can answer "if I halve my checkpoint interval, what happens to my goodput distribution at 100k GPUs?" — because the answer depends on a failure rate that scales with cluster size, a recompute cost that depends on interval, and a correlation structure that no algebra captures. The cost is calibration: a Monte-Carlo result is only as good as the rates and correlations you fed it, which is why this method consumes the Chapter 14.3 AFRs and the Appendix F scenarios as its raw inputs rather than inventing them.
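A stripped-down version of that goodput simulation fits in a few dozen lines. The sketch below is an illustration under loud assumptions: job-level failures arrive exponentially (a Weibull or empirical draw could replace `expovariate`), only work captured by a completed checkpoint survives a failure, and detection plus restart is a fixed delay. The 1.8-hour job MTBF is the 16k-GPU figure quoted in this chapter; the checkpoint write time and recovery time are hypothetical, so the output is not expected to reproduce the published ETTR values.

```python
import random
import statistics

def simulate_goodput(mtbf_h, ckpt_interval_h, ckpt_write_h, recover_h, horizon_h, rng):
    """One sampled run: fraction of wall-clock time that becomes checkpointed work."""
    t, useful = 0.0, 0.0
    cycle = ckpt_interval_h + ckpt_write_h
    while t < horizon_h:
        run = rng.expovariate(1.0 / mtbf_h)       # time until the next failure
        failed = t + run < horizon_h
        if not failed:
            run = horizon_h - t                   # job reaches the horizon first
        useful += (run // cycle) * ckpt_interval_h  # uncheckpointed work is lost
        t += run + (recover_h if failed else 0.0)   # detection + restart delay
    return useful / t

def goodput_distribution(ckpt_interval_h, trials=200, seed=0):
    rng = random.Random(seed)
    return [
        simulate_goodput(mtbf_h=1.8, ckpt_interval_h=ckpt_interval_h,
                         ckpt_write_h=0.5 / 60, recover_h=10 / 60,
                         horizon_h=720.0, rng=rng)
        for _ in range(trials)
    ]

g60 = goodput_distribution(ckpt_interval_h=1.0)
g5 = goodput_distribution(ckpt_interval_h=5 / 60)
for label, runs in (("60-min checkpoints", g60), ("5-min checkpoints", g5)):
    p5 = statistics.quantiles(runs, n=20)[0]      # 5th percentile of the runs
    print(f"{label}: mean = {statistics.mean(runs):.3f}, 5th percentile = {p5:.3f}")
```

Correlated events, a repair queue, and lemon-node ejection would be added as further sampled processes on the same clock; the structure of the loop does not change.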

  • 6.50 vs 2.34: failures per 1,000 node-days, Meta RSC-1 vs RSC-2; the empirical λ that drives any cluster goodput model. Source: Meta, Revisiting Reliability in Large-Scale ML Clusters (arXiv 2410.21680), 2024.
  • 1.8 hr → 14 min: projected mean time between failures for a 16,384-GPU vs 131,072-GPU synchronous job. Source: Meta (arXiv 2410.21680); SemiAnalysis, 2024.
  • 0.70 → 0.93: modeled ETTR (goodput) for a 16k-GPU run moving from 60-min to 5-min checkpoint interval. Source: Meta, Revisiting Reliability (arXiv 2410.21680), 2024.
  • 14% → 4%: 512+ GPU job failure rate after lemon-node ejection; a sensitivity result the model must reproduce. Source: Meta, Revisiting Reliability (arXiv 2410.21680), 2024.
  • 0.5%–10%: IEC 61508 beta-factor range for common-cause failure; ~10% the default if no diversity measures applied. Source: IEC 61508-6 Annex D; exida, 2025.
  • ~9% AFR: annualized GPU failure rate feeding the per-node λ in fleet roll-up models. Source: domain synthesis / Chapter 14.3 fleet data, 2026.
  • 99.982% / 99.995%: Uptime Tier III / Tier IV availability targets (~1.6 hr vs ~26 min/yr); the facility-model benchmark. Source: Uptime Institute (Tier classes; % figures Uptime-disavowed), 2025.
  • ~90% / ~96%: industry-average vs best-in-class training goodput; the validation band any goodput model must land in. Source: SemiAnalysis ClusterMAX / CoreWeave, 2025.

The roll-up: from component AFRs to cluster availability and goodput

The point of the method is the roll-up — taking the per-component failure rates from Chapter 14.3 and the failure scenarios from Appendix F and composing them, level by level, into a number for the whole cluster. The path runs: component AFRs → per-node failure rate → RBD/k-of-n for the redundant subsystems (power, cooling, network core) → fault tree to catch the shared cut sets → Monte-Carlo for the synchronous-job goodput that no algebra captures → cluster-level availability AND goodput. Each level has a method matched to its question, and the levels compose: the facility availability from the RBD becomes one input to the goodput simulation (a power event is just another correlated failure that stalls the job), and the per-node silicon AFR becomes the dominant term in the goodput run.
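The first step of the roll-up, from a per-node failure rate to a job-level MTBF, is plain arithmetic. The sketch below assumes independent node failures at a constant rate, a synchronous job that any single node failure interrupts, and 8 GPUs per node (an assumption, not stated in this chapter). With the 6.50 failures per 1,000 node-days quoted above, it lands on the same 1.8 hr and roughly 14 min figures.

```python
def job_mtbf_hours(failures_per_1000_node_days: float, gpus: int,
                   gpus_per_node: int = 8) -> float:
    """Mean time between job-interrupting failures for a synchronous job.

    Assumes independent, constant-rate node failures, any one of which
    interrupts the whole job. gpus_per_node=8 is an assumed node size.
    """
    nodes = gpus / gpus_per_node
    failures_per_day = failures_per_1000_node_days / 1000.0 * nodes
    return 24.0 / failures_per_day

for gpus in (16_384, 131_072):
    mtbf = job_mtbf_hours(6.50, gpus)
    print(f"{gpus:>7} GPUs: job MTBF = {mtbf:.2f} h ({mtbf * 60:.0f} min)")
```

Because the job-level rate is linear in node count under these assumptions, the function has to be re-evaluated at every point on a capacity ramp rather than carried over from a smaller cluster.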

The validation discipline is to check the roll-up against ground truth before trusting it forward. A goodput model that does not reproduce the field band — roughly 90% industry-average, ~96% best-in-class effective training time — is mis-calibrated, and the usual culprit is an optimistic per-node failure rate or an absent correlation term. A facility availability model that lands far above the Tier benchmark for its topology is almost always missing a beta-factor or a shared cut set. The model is not done when it produces a number; it is done when it produces a number you can defend against both the published benchmarks and your own incident history. → AFR inputs and the failure taxonomy in Chapter 14.3; the FMEA scenario catalog in Appendix F; the checkpoint math behind the goodput term in Chapter 9.4.
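Comparing a modeled availability against the Tier benchmarks is easier in downtime-per-year and nines than in raw percentages. A minimal conversion sketch, assuming an 8,760-hour year:

```python
from math import log10

HOURS_PER_YEAR = 8760.0  # non-leap year, assumed

def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

def nines(availability: float) -> float:
    """Number of nines, e.g. 0.999 -> 3.0."""
    return -log10(1.0 - availability)

for label, a in (("Tier III", 0.99982), ("Tier IV", 0.99995)):
    hours = annual_downtime_hours(a)
    print(f"{label}: {hours:.2f} h/yr ({hours * 60:.0f} min), {nines(a):.2f} nines")
```

A modeled result that sits well above the benchmark for its topology on this scale is the signal, described above, to go looking for a missing beta-factor or shared cut set.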

Sensitivity analysis: where the next dollar buys the most nines or goodput

The deliverable of this chapter is not a single availability number — it is a ranking. Once the model is built and calibrated, you compute the sensitivity (importance) of the result to each input: how much does cluster availability or goodput move per unit of investment in this component, this redundancy step, this MTTR reduction, this checkpoint interval? That ranking is the engine under the Chapter 12.2 tradeoff curve and the answer to the only question the CFO actually asked: where does the next dollar of redundancy buy the most?

The rankings routinely surprise. On the availability side, the model frequently shows that shrinking MTTR (on-site spares, a depot SLA, a larger crew) beats adding another redundant unit, because it attacks the dominant second-failure-during-repair term directly — and that diversifying against common-cause beats both, because the beta-factor is the floor nothing else can pierce. On the goodput side, the model almost always shows that checkpoint cadence and detection-to-restart time dominate facility redundancy by an order of magnitude — the Meta data makes it concrete: moving a 16k-GPU run from a 60-minute to a 5-minute checkpoint interval lifts ETTR from 0.70 to 0.93, a goodput gain no amount of 2N power could touch, and lemon-node ejection cutting large-job failures from 14% to 4% is a software policy, not a capital expense. The general lesson holds across clusters: for checkpointable training, the highest-return reliability dollars are almost never in the facility — they are in checkpointing, health-checking, and node-ejection. The sensitivity analysis is what proves that to a skeptic with a budget, and it is why this model, not intuition, should set the redundancy spend. → the tradeoff curve it feeds in Chapter 12.2; the SLA commitments it underwrites in Chapter 12.4.
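One common way to produce the availability-side ranking is Birnbaum importance: the change in system availability between a component being perfect and being failed. The sketch below applies it to a hypothetical topology (a shared bus and a chiller plant in series with a redundant UPS pair) using illustrative availabilities; neither the topology nor the numbers come from this guide. Multiplying by each component's own unavailability gives a rough "improvement potential".

```python
def system_availability(a: dict) -> float:
    """Hypothetical topology: shared bus and chiller plant in series with a UPS pair."""
    ups_pair = 1.0 - (1.0 - a["ups_a"]) * (1.0 - a["ups_b"])
    return a["shared_bus"] * ups_pair * a["chiller_plant"]

def birnbaum(system, a: dict, name: str) -> float:
    """System availability with the component perfect minus with it failed."""
    return system({**a, name: 1.0}) - system({**a, name: 0.0})

# Illustrative availabilities only.
base = {"shared_bus": 0.9999, "chiller_plant": 0.9995,
        "ups_a": 0.999, "ups_b": 0.999}

importance = {name: birnbaum(system_availability, base, name) for name in base}
potential = {name: importance[name] * (1.0 - base[name]) for name in base}

ranking = sorted(base, key=lambda n: potential[n], reverse=True)
for name in ranking:
    print(f"{name:14s} Birnbaum = {importance[name]:.6f}  "
          f"improvement potential = {potential[name]:.2e}")
```

The same finite-difference idea carries over to goodput: perturb checkpoint interval, detection time, or MTTR in the simulation and rank inputs by the change in the output per unit of spend.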

Deep dive: this model is not the design-validation digital twin (Chapter 2.7)

It is easy to conflate two things that both call themselves "models" and both simulate the facility. They are distinct tools with distinct purposes, and using one for the other's job produces confident nonsense. The availability/goodput model of this chapter is a reliability model: its inputs are failure rates, repair times, and a failure environment; its outputs are nines and goodput distributions and a sensitivity ranking; its question is "how often does this break and what does that cost?" It is statistical, it is about the long-run frequency of failure, and it lives on the time axis of years and incident counts.

The design-validation digital twin of Chapter 2.7 is a physics model: CFD for airflow and thermal, electrical transient simulation, hydraulic models of the coolant loop. Its inputs are geometry, material properties, and boundary conditions; its outputs are temperatures, pressures, and transient waveforms; its question is "does this design work under load and worst-case transients?" It lives on the time axis of milliseconds to minutes. The two feed each other — the digital twin tells you whether a degraded-cooling state is survivable, which becomes a state in the Markov model; the reliability model tells you how often you will be in that state — but they are not interchangeable. The reliability model cannot tell you whether a cold plate runs hot; the digital twin cannot tell you how many nines the plant delivers over a year. Keep the two on separate axes and let each answer its own question. → the design-validation twin in Chapter 2.7.

Anti-patterns

The same modeling errors recur, each one a way of producing a number that is precise and wrong. Four are worth naming:

  • Modeling availability for a goodput-bound workload. Building a meticulous 2N facility RBD for a synchronous training tenant whose real losses are entirely silicon-and-checkpoint. The model is internally correct and answers the wrong question — and recommends 2N power where the dollars should have gone to faster checkpointing. → Chapter 12.2.
  • Assuming independence. Reporting the parallel-product nines without a beta-factor, so the model claims six nines where the shared bus, shared CDU loop, and fleet-wide firmware push cap it at three. The most common way an availability number is overstated.
  • Stopping the fault tree at the convenient block. Drawing N+1 CDUs as independent parallel blocks without pushing to basic events, and missing the single shared valve or PLC that is a first-order cut set. The redundancy diagram drew right over the single point of failure. → Appendix F.
  • Modeling once and reusing across scale. Taking the goodput number from a 16k-GPU model and assuming it holds at 100k, when MTBF falls roughly inversely with node count and the answer has moved by an order of magnitude. The model must be re-run at each point on the ramp. → Chapter 14.3.

This chapter is the quantitative spine of Part 12. It extends the design-basis primer in Chapter 0.5 and the topology selection in Chapter 12.1 into an actual method; it consumes the component AFRs and failure taxonomy from Chapter 14.3 and the FMEA scenario catalog in Appendix F as its raw inputs; it draws the checkpoint math that drives the goodput term from Chapter 9.4. Its outputs flow forward: the availability-vs-goodput rethink in Chapter 12.2 is the qualitative argument this model makes quantitative, the SLA and goodput-contract commitments in Chapter 12.4 are underwritten by the numbers this model produces, and the DR/failover design in Chapter 12.3 consumes its dominant cut sets. It is distinct from the physics-based design-validation digital twin in Chapter 2.7, and the sensitivity ranking it produces is the engine under the redundancy tradeoff curve in Chapter 12.2.