Part 12

Reliability, Resilience & Standards

5 chapters

Resilience Standards, Redundancy Topologies & Fault-Domain Engineering

The redundancy standards the industry inherited rate a building, not a job — so the real design decision is not which Tier you certify to, but which fault domains you draw, how big you let each blast radius grow, and whether you are buying concurrent maintainability, fault tolerance, or both.

The AI-Cluster Reliability Rethink: Goodput vs Facility Availability

Facility availability, the data-center 'nines', measures whether the building is up. An AI cluster earns its return on goodput: the fraction of purchased GPU-hours that turn into useful work. The two numbers diverge so sharply that, for most AI factories, spending the next redundancy dollar on facility availability is the wrong call.

Disaster Recovery, Business Continuity & Geographic Failover

Disaster recovery for an AI factory is not a backup policy bolted onto a building. It is a per-workload decision about how much spare capacity to pre-pay for, in which geography, and how warm to keep it, all measured against a token-priced revenue clock. Because GPU capacity is power-bound and cannot be conjured on demand, the spare region is both the most expensive idle asset on the balance sheet and the most consequential one to size wrong.

SLAs, Goodput Contracts & Availability Commitments

The SLA is where the reliability rethink becomes a number on a contract — and the operator who promises facility availability for a workload that is actually paying for goodput has signed a contract that is simultaneously unenforceable by the customer and unprofitable for the provider.

Quantitative Reliability & Availability Modeling (RBD / FTA / Monte-Carlo)

An availability number you cannot reconstruct from component failure rates, repair times, and a stated failure environment is a marketing claim, not an engineering result. The point of reliability block diagrams (RBD), fault-tree analysis (FTA), Markov chains, and Monte-Carlo simulation is to turn the redundancy debate into a model in which every nine, and every percent of goodput, has a traceable parent.