Chapter 14.3
Component Failure Modes, Failure Rates & Fleet Reliability Data
A GPU fleet does not fail like a traditional data center. It fails constantly, in three distinct ways (hard, transient, silent), and the only defensible design is to measure the per-component failure rate, accept that the mean time between cluster interruptions is hours, not months, and engineer detection and recovery around the failures you cannot prevent.
What you'll decide here
- Which failure taxonomy you instrument for — and specifically whether you fund a silent-data-corruption (SDC) detection program at all, because the failures you do not look for are the ones that quietly poison a training run for days.
- What annualized failure rate (AFR) you assume per component — and whether you derive it from your own fleet telemetry or borrow Meta/SemiAnalysis published numbers, because every spare, every availability model, and every SLA inherits this single input.
- How long you burn in before accepting a node into production — the trade between schedule (every burn-in hour is deferred revenue) and infant mortality landing on the customer's training run.
- Where you draw the line between a node worth repairing and a 'lemon' worth ejecting — the ejection threshold that turns a long tail of repeat-offenders into reclaimed goodput.
- Whether your reliability data feeds the availability and goodput model downstream (Chapter 12.5) as a live, fleet-measured input or as a stale design-time assumption that drifts the moment the next silicon generation lands.
This is the canonical home for one uncomfortable fact: an AI training cluster is the least reliable large machine humans operate at scale, and it is supposed to be. A traditional enterprise data center measures uptime in nines and counts annual outages on one hand. A 16,384-GPU training cluster experiences an unplanned interruption roughly every three hours — Meta's published Llama 3 405B run logged 419 of them over 54 days — and a well-run fleet still delivers over 90% effective training time through automation, not through preventing the failures (Meta, Llama 3 Herd, 2024). At this density and scale, the per-component physics guarantees components fail; the reliability problem is measuring the rate precisely enough to size spares, model availability, and detect the failures that hide.
This chapter establishes three things every other reliability chapter in the guide depends on. First, the failure taxonomy — hard, transient, and silent — which is the canonical fault vocabulary cross-referenced from the GPU operations view in Chapter 10.7, the redundancy view in Chapter 12.1, and the IST failure-demonstration view in Chapter 13.6. Second, the empirical fleet failure-rate data — the actual published numbers from at-scale operators, with their vintages and caveats, that you plug into an availability model rather than inventing. Third, the AFR modeling and burn-in discipline that turns raw component failure rates into a spares forecast and an acceptance gate. Every AFR in this chapter feeds the cluster availability and goodput roll-up in Chapter 12.5; the consolidated FMEA catalog these modes populate lives in Appendix F.
The failure taxonomy: hard, transient, silent
Every fault in a GPU fleet falls into one of three classes, and each demands completely different detection and recovery machinery; confusing them is a common operational error. The taxonomy organizes everything downstream: what you instrument, how fast you must react, and whether the failure is even visible at all.
Hard failures are the easy ones, paradoxically, because they announce themselves. A GPU throws an uncorrectable XID and falls off the PCIe bus, an optical transceiver goes dark, a CDU trips on low flow, a power supply faults. The component is unambiguously dead or unreachable; the job crashes or the node drops out; the telemetry screams. The XID 79 'GPU has fallen off the bus' event — which affects roughly 3.2% of H100s in their first year (Last9 / Introl fleet analyses, 2025) — is the archetypal hard failure. These are expensive in lost goodput but cheap to detect: the recovery path is fail-fast, drain, restart from checkpoint, swap the FRU. Hard failures are where mature operators are already good, because the signal is loud.
Transient failures are the treacherous middle. A correctable ECC error storm on HBM, a link that flaps rather than goes down, a thermal excursion that throttles a GPU for ninety seconds, an XID that self-clears on reset. The component is not dead — it works again after a power-cycle or a few minutes — but it degraded the run while it misbehaved, and it will very likely do so again. Transients are the raw material of lemon nodes: hardware that passes every point-in-time health check yet fails repeatedly under load. The decision a transient forces is not 'is it broken' but 'is it broken often enough to eject' — and getting that threshold wrong either keeps a repeat-offender poisoning runs or ejects healthy capacity. Crucially, Meta found that single-bit ECC error trends predict hard GPU failure 48–72 hours in advance with 89–96% accuracy (Meta fleet research, 2025) — the transient is frequently the early-warning tremor before the hard quake, which is why instrumenting transients is the highest-leverage detection investment most fleets under-fund.
Silent failures are the ones that should keep you up at night, because by definition nothing screams. Silent data corruption (SDC) is a computational error (a multiply that returns the wrong product, a memory read that flips a bit undetected by ECC) that produces no fault, no XID, no log line. The hardware reports success. The math is wrong. In training, an SDC silently corrupts gradients and weights; the loss curve drifts or diverges days later, and you cannot tell whether it is a bad hyperparameter, a data bug, or a single faulty multiplier in one of a hundred thousand chips. The historical SDC rate was roughly 1 per million devices, a cosmic-ray-grade rarity nobody budgeted for. At current process nodes and scale it has risen to roughly 1 per 1,000 silicon devices, which Meta attributes to fundamental silicon manufacturing variation, not particle effects (Meta, How Meta keeps its AI hardware reliable, July 2025). On a 100,000-device fleet, 1-in-1,000 is not an edge case: it implies on the order of 100 affected devices, which makes hitting one a near-certainty on any multi-day run, and the only defense is a deliberate detection program.
| Class | Signature | Detection mechanism | Reaction timescale | Primary risk if missed |
|---|---|---|---|---|
| Hard | Uncorrectable XID, device off bus, link down, hardware fault | XID/SEL/syslog, DCGM, fabric BER alarms, CDU/PDU telemetry | Seconds — fail-fast, drain, restart from checkpoint | Lost wall-clock to last checkpoint; one node stalls the whole synchronous job |
| Transient | Correctable-ECC storm, link flap, thermal throttle, self-clearing XID | Trend analysis on correctable errors; repeat-offender counters; straggler detection | Minutes to hours — quarantine, observe, decide eject vs keep | Lemon node poisons run after run; the missed 48–72 hr early-warning before a hard failure |
| Silent (SDC) | No signature — correct-looking but wrong computation | Dedicated SDC program: periodic test sweeps + in-workload checks + anomaly detection | Days — only surfaces as drifted/diverged training or wrong inference output | Corrupted weights, wasted compute, results you cannot trust; root-cause is days of detective work |
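The SDC exposure argument above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it assumes the ~1-in-1,000 figure can be read as an independent per-device probability of being SDC-prone, which is a simplification not stated in the source, and the function names are hypothetical.

```python
# Rough SDC exposure arithmetic.
# Assumption (not from the source): each device is independently SDC-prone
# with probability SDC_RATE. Real defect populations are likely correlated
# by lot, bin, and workload, so treat the results as order-of-magnitude only.

SDC_RATE = 1 / 1000  # ~1 per 1,000 devices, the figure quoted in the text


def expected_sdc_devices(fleet_size, sdc_rate=SDC_RATE):
    """Expected number of SDC-prone devices in a fleet of this size."""
    return fleet_size * sdc_rate


def p_job_includes_sdc_device(devices_in_job, sdc_rate=SDC_RATE):
    """Probability that a job spanning this many devices includes at least one
    SDC-prone device, under the independence assumption above."""
    return 1.0 - (1.0 - sdc_rate) ** devices_in_job


if __name__ == "__main__":
    print(f"expected SDC-prone devices in 100k fleet: {expected_sdc_devices(100_000):.0f}")
    for n in (8, 1_024, 16_384):
        print(f"{n:>6} devices -> P(at least one SDC-prone) = {p_job_includes_sdc_device(n):.4f}")
```

Under these assumptions a single 8-GPU node is rarely affected, while a job spanning thousands of devices almost certainly includes at least one susceptible part, which is the case for funding detection at scale.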
Component failure modes: where the rate actually comes from
Fleet-level AFR is an aggregate that hides a strongly skewed distribution: a handful of components dominate the failure budget, and knowing which ones lets you target spares, burn-in, and detection where they pay off. The Meta Llama 3 root-cause breakdown is the most-cited public dataset, and its shape generalizes across operators even if your absolute numbers differ.
The GPU and its HBM dominate. In the Llama 3 run, faulty GPUs caused 30.1% of interruptions and HBM3 a further 17.2%: 47.3% between them, rising to nearly 59% once GPU SRAM, the GPU processor, and the smaller GPU-attributed categories are included. Roughly 78% of interruptions were attributed to confirmed or suspected hardware faults (Meta, 2024). This is not surprising once you see the physics: the accelerator package is the densest, hottest, highest-current component in the rack, and HBM stacks are the most thermally and mechanically stressed memory ever shipped at volume. HBM is also temperature-sensitive in a way that compounds the density ramp: HBM error rates roughly double per ~5 °C above ~75 °C junction (commissioning/thermal guidance, 2025), so a cooling system that runs warm does not just risk throttling, it directly inflates your memory failure rate. Every kilowatt you add per rack in the Chapter 1.2 density ramp pushes this term up unless coolant temperature holds.
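The temperature rule of thumb above translates directly into a multiplier on the HBM error rate. The following is a minimal sketch, assuming the doubling is a smooth exponential above the ~75 °C knee and that the rate is flat below it; both are modeling assumptions, and the function name is hypothetical.

```python
# HBM error-rate multiplier from the "doubles per ~5 C above ~75 C" rule of thumb.
# Assumptions (not from the source): smooth exponential growth above the knee,
# and no temperature dependence below it.


def hbm_error_multiplier(junction_c, knee_c=75.0, doubling_step_c=5.0):
    """Relative HBM error rate versus the rate at or below the knee temperature."""
    excess = max(0.0, junction_c - knee_c)
    return 2.0 ** (excess / doubling_step_c)


if __name__ == "__main__":
    for temp in (70, 75, 80, 85, 90):
        print(f"{temp} C junction -> {hbm_error_multiplier(temp):.1f}x baseline error rate")
```

Read this way, a 10 °C excursion above the knee roughly quadruples the HBM error rate, which is why coolant temperature belongs in the reliability model and not only in the thermal one.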
Network and optics are the persistent long tail. Switches and cables accounted for 8.4% of Llama 3 interruptions, and link-flaps are as damaging as hard-down links because they corrupt collectives without obviously failing. At 800G XDR and the optics densities of a rail-optimized fabric, transceiver and cable failures scale with link count — a 100k-GPU cluster has millions of optical links, and even an excellent per-link AFR multiplies into a steady drip of fabric faults. The fabric is the failure domain that grows fastest as you scale out.
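The "steady drip" of fabric faults is just link count multiplied by per-link AFR. The sketch below uses a placeholder per-link AFR of 0.5%, which is a hypothetical value chosen for illustration and not a figure from the source; substitute your own measured rate.

```python
# Fabric fault rate as a function of link count.
# HYPOTHETICAL input: the 0.5% per-link AFR below is a placeholder for
# illustration, not a published or measured number.


def fabric_faults(link_count, per_link_afr):
    """Return (expected link faults per year, expected link faults per day)."""
    per_year = link_count * per_link_afr
    return per_year, per_year / 365.0


if __name__ == "__main__":
    per_year, per_day = fabric_faults(link_count=1_000_000, per_link_afr=0.005)
    print(f"{per_year:.0f} link faults/year, about {per_day:.1f} per day")
```

Even a per-link rate that looks excellent in isolation produces daily fabric events at this link count, which is why the optics spares pool is sized to the fabric and not to the node count.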
Infrastructure failures are rarer but higher-impact. Power and cooling faults are far less frequent than GPU faults, but a single CDU trip or a PDU fault can take down an entire rack or pod at once, converting one component failure into dozens of simultaneous node losses. The Uptime data is stark: power is implicated in roughly 45% of impactful data-center outages (mostly UPS), and human error is behind 70–80% of all outages, with 85% of those traced to process failures (Uptime Institute, 2025). The lesson for AI fleets is that the GPU dominates frequency while infrastructure dominates blast radius, and your FMEA in Appendix F must weight both.
| Root cause | Share of interruptions | Class | Spares / detection implication |
|---|---|---|---|
| Faulty GPU (incl. XIDs) | 30.1% | Mostly hard, some transient | Largest single spares driver; ECC-trend prediction buys 48–72 hr warning |
| HBM3 memory | 17.2% | Hard + thermal-driven | Coolant temperature directly modulates this term; bin GPUs with HBM history |
| GPU SRAM | 4.5% | Hard/transient | Often surfaces as correctable-error storms first |
| GPU processor | 4.1% | Hard | Part of the ~59% GPU/HBM total |
| Network switch / cable | 8.4% | Hard + link-flap transient | Scales with link count; optics spares pool sized to fabric, not node, count |
| Software / other | ~12.9% | Transient | Not a spare; recovered by restart, masks some hardware root causes |
The scale law: why MTBF collapses as the cluster grows
Here is the result that reframes every reliability conversation in AI infrastructure, and the one most newcomers get wrong: mean time between failures is not a property of your hardware; it is a property of your scale. If a single 8-GPU node has a mean time to failure of ~47.7 days, then a synchronous job that depends on all GPUs staying up sees a cluster MTBF that falls roughly in inverse proportion to GPU count. The observed numbers are stark (these are empirical anchors that include software-induced failures, not a clean arithmetic chain, so they do not divide exactly from the single-node base rate):
- 8 GPUs: ~47.7 days between failures.
- 1,024 GPUs: ~7.9 hours.
- 16,384 GPUs: ~1.8 hours (consistent with Llama 3's observed ~3-hour cadence including software).
- 131,072 GPUs: ~0.23 hours — roughly one failure every 14 minutes (SemiAnalysis / Meta-derived scaling, 2024–25).
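To see how far the empirical anchors sit from a clean 1/N chain, the sketch below computes the idealized inverse-proportional MTBF from the 8-GPU base rate and prints it next to the observed values listed above. The idealized model assumes independent, identically distributed failures, which is an assumption of this sketch and not a claim from the source; the function names are hypothetical.

```python
# Idealized 1/N scaling of MTBF versus the empirical anchors quoted in the text.
# Assumption: failures are independent and identical across GPUs, so MTBF
# scales as 1/N from the 8-GPU base rate. The observed anchors do not follow
# this exactly, as the text notes.

BASE_GPUS = 8
BASE_MTTF_DAYS = 47.7
OBSERVED_MTBF_HOURS = {1_024: 7.9, 16_384: 1.8, 131_072: 0.23}  # from the list above


def ideal_mtbf_hours(gpus, base_gpus=BASE_GPUS, base_mttf_days=BASE_MTTF_DAYS):
    """MTBF in hours if it fell in exact inverse proportion to GPU count."""
    return base_mttf_days * 24.0 * base_gpus / gpus


def llama3_cadence_hours(days=54, interruptions=419):
    """Mean hours between interruptions for the Llama 3 run cited in this chapter."""
    return days * 24.0 / interruptions


if __name__ == "__main__":
    for gpus, observed in OBSERVED_MTBF_HOURS.items():
        print(f"{gpus:>7} GPUs: ideal 1/N = {ideal_mtbf_hours(gpus):.3f} h, observed = {observed} h")
    print(f"Llama 3 cadence: {llama3_cadence_hours():.2f} h between interruptions")
```

The gap between the two columns is a reminder to plug measured anchors into the availability model and to use the 1/N rule only for intuition about direction and order of magnitude.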
This is the GOODPUT thread made concrete. At 100k+ GPUs, the cluster is never fully healthy (something is always failing), so the design target shifts from 'maximize uptime' to 'maximize the fraction of GPU-hours doing useful work despite a continuous failure stream.' It is also why the checkpoint interval must shrink as you scale: a 100k-GPU run needs roughly 2-minute checkpoint intervals to hit an effective training time ratio (ETTR) of 0.9, where a 16k-GPU run is comfortable at 5 minutes (the optimal-interval math is canonical in Chapter 9.4; operational tuning in Chapter 14.4). The scale law is the bridge between component AFR and cluster availability: the same per-GPU failure rate is a non-event at 8 GPUs and an existential design constraint at 100k.
SDC detection programs: chasing the failure with no signature
Because SDC by definition leaves no log line, detecting it is an active program, not a passive alarm — and the state of the art is a layered defense, with each layer trading coverage against the GPU-hours it steals from production. Meta's published stack is the reference architecture the rest of the industry is converging on.
Fleetscanner is the offline sweep: dedicated silicon test patterns scheduled across the fleet so the entire estate is covered every 45–60 days. It is the most thorough layer (over three years it reached ~93% coverage for a major defect family, with ~23% unique coverage no other method caught) but the most expensive, because the GPU under test is not earning revenue. Ripple co-locates with live workloads, slipping millisecond-to-second test bursts into the gaps between real work, so it achieves fleet-wide coverage in days rather than weeks at near-zero opportunity cost — at lower per-pass depth. Hardware Sentinel is the newest layer and the cleverest: it watches application exceptions in kernel space and infers core-level SDC without allocating any test time at all, raising effective coverage roughly 1.74x over Fleetscanner and 1.92x over Ripple (ASPLOS 2025). The architectural lesson is that no single method suffices — you layer a deep-but-slow sweep, a fast-but-shallow in-workload probe, and a zero-cost inference layer, and the union catches what any one misses.
For training specifically, the framework-level defenses matter as much as the fleet-level ones: redundant computation on a sample of operations, gradient/activation checksums, and divergence monitors that flag when a replica's numerics drift from its peers. These catch SDC at the moment it corrupts the math rather than 45 days later in a sweep. The decision here mirrors the funding fork above — every layer you add costs GPU-hours or engineering, and the right depth is set by how catastrophic a silently-corrupted run is for your business. A frontier lab burning months of compute on one run buys all the layers; a batch-inference shop running idempotent, re-runnable jobs may rationally buy none.
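A divergence monitor of the kind described above can be sketched as a checksum comparison across data-parallel replicas. This is a minimal illustration, not any framework's API: it assumes replicas are expected to hold bit-identical values after synchronization (real systems with non-deterministic reductions would need a numeric tolerance instead), and all names are hypothetical.

```python
# Replica divergence check via checksums and majority vote.
# Assumption: after synchronization, healthy replicas hold bit-identical
# values. With non-deterministic reductions you would compare within a
# tolerance instead of hashing exact bytes.
import hashlib
import struct
from collections import Counter


def checksum(values):
    """SHA-256 over the IEEE-754 double encoding of each value."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack("<d", value))
    return digest.hexdigest()


def divergent_replicas(replica_values):
    """Return sorted replica ids whose checksum differs from the strict majority.
    If no strict majority exists, every replica is treated as suspect."""
    sums = {rid: checksum(vals) for rid, vals in replica_values.items()}
    majority_sum, count = Counter(sums.values()).most_common(1)[0]
    if count * 2 <= len(sums):
        return sorted(sums)
    return sorted(rid for rid, s in sums.items() if s != majority_sum)


if __name__ == "__main__":
    replicas = {
        "rank0": [0.25, -1.5, 3.0],
        "rank1": [0.25, -1.5, 3.0],
        "rank2": [0.25, -1.5, 3.0000001],  # one silently wrong value
    }
    print(divergent_replicas(replicas))
```

The design choice is to pay a small hashing cost on every check in exchange for catching the corruption at the step where it occurs; how often to run it is the same GPU-hours-versus-coverage trade as the fleet-level layers.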
Deep dive: from component AFR to a spares forecast and an availability number
The practical payoff of measuring failure rates is two numbers you cannot run a fleet without: how many spares to stock, and what availability to promise. Both fall straight out of AFR.
Spares. Take a 100,000-GPU fleet at a combined GPU+HBM AFR of ~9%. That is ~9,000 GPU swaps per year, or roughly 25 per day, every day, forever. You cannot RMA at that rate without an on-site spare pool — practice is to hold a spare-node pool of ~2–5% of the cluster so a failed node is swapped from inventory in minutes and the dead one enters the RMA pipeline asynchronously. The AFR also sizes the optics spares pool separately (driven by link count, not node count) and the CDU/PDU spares (driven by blast radius, not frequency). Get the AFR input wrong by a factor of two and you either strand capital in idle spares or stall runs waiting for parts. The full sparing model, RMA logistics, and repair-vs-replace-vs-harvest economics are the subject of Chapter 14.6; this chapter supplies the AFR that drives it.
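The spares arithmetic above is simple enough to keep as a living calculation fed by measured AFR. A minimal sketch follows, using the ~9% AFR and the ~2–5% spare-pool range from the text; expressing the spare pool in GPU-equivalents and the function name are assumptions of this sketch.

```python
# Spares forecast from fleet size and AFR.
# Inputs mirror the worked example in the text (100,000 GPUs, ~9% AFR,
# spare pool of ~2-5% of the cluster). The pool is expressed here in
# GPU-equivalents for simplicity, which is an assumption of this sketch.


def spares_forecast(fleet_gpus, afr, spare_low=0.02, spare_high=0.05):
    """Return expected swap volume and the spare-pool range for a fleet."""
    swaps_per_year = fleet_gpus * afr
    return {
        "swaps_per_year": swaps_per_year,
        "swaps_per_day": swaps_per_year / 365.0,
        "spare_pool_low": fleet_gpus * spare_low,
        "spare_pool_high": fleet_gpus * spare_high,
    }


if __name__ == "__main__":
    forecast = spares_forecast(fleet_gpus=100_000, afr=0.09)
    for key, value in forecast.items():
        print(f"{key}: {value:,.1f}")
    # Sensitivity: the same fleet with the AFR wrong by a factor of two.
    print(f"swaps/day at 2x AFR: {spares_forecast(100_000, 0.18)['swaps_per_day']:.1f}")
```

Re-running the function with the AFR doubled or halved shows the factor-of-two sensitivity directly, which is a quick way to test how robust a sparing plan is before committing capital.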
Availability. The same AFRs, combined with the scale law and your checkpoint interval, feed the reliability block diagram and Monte-Carlo availability model in Chapter 12.5. The critical discipline is that the AFR must be a live, fleet-measured input, not a design-time constant. The moment a new silicon generation lands — and the density ramp guarantees one every ~18–24 months — its infant-mortality and steady-state AFR are unknown, your burn-in data is the first signal, and last generation's number is actively misleading. Operators that hard-code AFR into their availability model wake up to a fleet whose real reliability has drifted out from under the promise they sold.
Burn-in: paying for infant mortality up front
Component failure rates are not constant over life — they follow the classic bathtub: a high infant-mortality phase early, a low flat useful-life phase, and a rising wear-out phase late. Burn-in is the deliberate decision to pull infant mortality forward into a controlled acceptance window so it lands on a test harness instead of a customer's training run. This is the central fork at go-live, and it is a direct schedule-versus-reliability trade.
The mechanics: 72–168 hours of sustained, thermally-stressful load — GPU-Burn, NCCL all-reduce loops, HBM stress patterns, FIO storage soak — at or near max TDP, with DCGM diagnostics (the dcgmi diag -r 4 deep level is the standard acceptance gate) bracketing the run. Vendors claim burn-in removes on the order of 98% of infant-mortality failures within the warranty period, and a fresh cluster's MTBF during its first 3–4 weeks is dramatically worse than its mature ~7-days-per-512-GPU steady state — which is precisely the infant-mortality phase you are trying to flush before acceptance. The acceptance criteria, levels, and tooling are the subject of the commissioning chapters (Chapter 13.6 covers Level 5 IST and failure-mode demonstration); here the point is the economic fork.
| Burn-in posture | Window | What it buys | What it costs | Best fit |
|---|---|---|---|---|
| Minimal / skip | <24 hr smoke test | Fastest time-to-revenue; GPUs earning almost immediately | Infant-mortality failures hit live workloads; high early interruption rate; SLA risk | Spot/batch capacity; re-runnable jobs; aggressive neocloud ramp |
| Standard soak | 72–168 hr | ~98% of infant-mortality flushed; documented acceptance; defensible SLA baseline | 3–7 days of deferred revenue per node cohort | Most production training/inference; the industry default |
| Extended / contractual | 168+ hr plus repeated dcgmi diag -r 3 / NCCL passes | Highest confidence; weeds out marginal lemons before customer ever sees them | Longest schedule hit; risks over-cycling components into early wear | Frontier training contracts; reliability-premium SLAs; ClusterMAX-grade clouds |
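The cost column in the table can be expressed in deferred GPU-hours, which is the unit the schedule-versus-reliability trade is actually paid in. The sketch below assumes 8 GPUs per node (the node size used elsewhere in this chapter); the cohort size in the usage example is hypothetical.

```python
# Deferred GPU-hours for a burn-in window.
# Assumption: 8 GPUs per node. Converting GPU-hours to revenue needs your own
# price per GPU-hour, which is deliberately left out of this sketch.


def burn_in_gpu_hours(nodes, window_hours, gpus_per_node=8):
    """GPU-hours spent in burn-in instead of production for a node cohort."""
    return nodes * gpus_per_node * window_hours


if __name__ == "__main__":
    cohort = 128  # hypothetical cohort size, for illustration only
    for label, hours in (("smoke test", 24), ("standard soak, low", 72), ("standard soak, high", 168)):
        print(f"{label}: {burn_in_gpu_hours(cohort, hours):,} GPU-hours deferred")
```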
Burn-in does not end at acceptance — it transitions into a steady-state cadence. The day-2 discipline is a weekly deep node-health pass (dcgmi diag -r 3 plus NCCL on idle GPUs) and the continuous straggler/lemon detection that watches for the repeat-offenders burn-in could not catch. A node that runs ~15% below a golden-reference benchmark is auto-flagged for quarantine, and the lemon-ejection decision — proven to cut 512+-GPU job failure rates from ~14% to ~4% and lift completion ~30% (Meta lemon-node studies, 2024) — is where transient-failure data becomes reclaimed goodput. Burn-in front-loads the cost of infant mortality; lemon ejection back-stops the transients that slip through. Both are detection programs paid for in GPU-hours, and both are justified by the same arithmetic: at scale, the goodput you reclaim dwarfs the capacity you spend finding it.
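The golden-reference check described above reduces to a threshold comparison. A minimal sketch follows, assuming each node reports a single benchmark score where higher is better and reading "~15% below" as at or below 85% of the reference; the score units, node names, and function name are hypothetical.

```python
# Flag nodes that run ~15% or more below a golden-reference benchmark.
# Assumptions: one scalar score per node, higher is better, and the ~15%
# shortfall is applied as a hard cutoff. A production detector would also
# weigh repeat-offender history before ejecting a node.


def flag_for_quarantine(node_scores, golden_reference, shortfall=0.15):
    """Return sorted node ids scoring at or below (1 - shortfall) x reference."""
    cutoff = golden_reference * (1.0 - shortfall)
    return sorted(node for node, score in node_scores.items() if score <= cutoff)


if __name__ == "__main__":
    scores = {"node-a": 990.0, "node-b": 840.0, "node-c": 700.0}
    print(flag_for_quarantine(scores, golden_reference=1000.0))
```

A single threshold is easy to reason about but blunt; the ejection decision itself still depends on how often a node trips it, which is the transient-versus-lemon judgment described earlier in the chapter.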
Deep dive: why your fleet's numbers will (and should) differ from Meta's
The Llama 3 dataset is the most-cited reliability data in the field precisely because so little else is public — but treating it as a universal constant is a mistake. It is a single snapshot, on H100s, on Meta's specific facility, cooling, firmware, and software stack, in 2024. Four things move your numbers off it. Silicon generation: Blackwell-class GB200/GB300 racks at ~120–140 kW change the thermal and current stress profile entirely, and their burn-in AFR is still being established across the fleet — early NVL72 bring-up surfaced novel NVLink copper-backplane reliability issues that simply did not exist on H100 (SemiAnalysis, 2025). Cooling discipline: because HBM error rate roughly doubles per 5 °C above ~75 °C, a fleet that holds tighter coolant temperature will measure a materially lower HBM failure share. Operational maturity: a fresh cluster in burn-in and a two-year-old fleet sit on opposite ends of the bathtub curve, so a blended fleet AFR depends on your age mix. Software stack: Llama 3's ~12.9% software share is highly stack-dependent and not portable at all.
The conclusion is operational, not academic: borrow published numbers to bootstrap your design-time model, then replace them with your own telemetry as fast as you can collect it. The DCIM and observability stack of Chapter 14.2 exists in large part to produce your AFR, not someone else's — and the availability model in Chapter 12.5 is only as good as the fleet-measured failure rate you feed it.
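One way to move from borrowed numbers to your own is to compute AFR from telemetry and blend it with the published prior until enough device-years accumulate. The sketch below is illustrative: the pseudo-count blend is one possible approach chosen for this example and is not a method described in the source, and the prior weight and sample figures are hypothetical.

```python
# Fleet-measured AFR, with an optional blend against a published prior.
# The pseudo-count blend is an assumption of this sketch: the prior is treated
# as if it came from `prior_device_years` of observation, so its influence
# fades as your own device-years accumulate.

HOURS_PER_YEAR = 8760.0


def measured_afr(failures, device_hours):
    """Annualized failure rate: failures per device-year of observed operation."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures / (device_hours / HOURS_PER_YEAR)


def blended_afr(prior_afr, prior_device_years, failures, device_hours):
    """Weighted blend of a borrowed prior AFR and fleet-measured failures."""
    observed_years = device_hours / HOURS_PER_YEAR
    return (prior_afr * prior_device_years + failures) / (prior_device_years + observed_years)


if __name__ == "__main__":
    hours = 500 * HOURS_PER_YEAR  # hypothetical: 500 device-years observed so far
    print(f"measured AFR: {measured_afr(30, hours):.3f}")
    print(f"blended AFR:  {blended_afr(0.09, 1000, 30, hours):.3f}")
```

The prior weight controls how quickly the model lets go of the borrowed number; resetting it to a low value when a new silicon generation lands keeps last generation's AFR from dominating the estimate.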