
Chapter 14.3

Component Failure Modes, Failure Rates & Fleet Reliability Data

A GPU fleet does not fail like a traditional data center. It fails constantly, in three distinct ways (hard, transient, silent), and the only defensible design is to measure the per-component failure rate, accept that the cluster's mean time between interruptions is measured in hours, not months, and engineer detection and recovery around the failures you cannot prevent.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

Which failure taxonomy you instrument for — and specifically whether you fund a silent-data-corruption (SDC) detection program at all, because the failures you do not look for are the ones that quietly poison a training run for days.
What annualized failure rate (AFR) you assume per component — and whether you derive it from your own fleet telemetry or borrow Meta/SemiAnalysis published numbers, because every spare, every availability model, and every SLA inherits this single input.
How long you burn in before accepting a node into production — the trade between schedule (every burn-in hour is deferred revenue) and infant mortality landing on the customer's training run.
Where you draw the line between a node worth repairing and a 'lemon' worth ejecting — the ejection threshold that turns a long tail of repeat-offenders into reclaimed goodput.
Whether your reliability data feeds the availability and goodput model downstream (Chapter 12.5) as a live, fleet-measured input or as a stale design-time assumption that drifts the moment the next silicon generation lands.

The bigger the cluster, the more often a job dies: at frontier scale, roughly every quarter of an hour. Budget for fast recovery (MTTR), not for preventing failure.

This is the canonical home for one uncomfortable fact: an AI training cluster is the least reliable large machine humans operate at scale, and it is supposed to be. A traditional enterprise data center measures uptime in nines and counts annual outages on one hand. A 16,384-GPU training cluster experiences an unplanned interruption roughly every three hours — Meta's published Llama 3 405B run logged 419 of them over 54 days — and a well-run fleet still delivers over 90% effective training time through automation, not through preventing the failures (Meta, Llama 3 Herd, 2024). At this density and scale, the per-component physics guarantees components fail; the reliability problem is measuring the rate precisely enough to size spares, model availability, and detect the failures that hide.

This chapter establishes three things every other reliability chapter in the guide depends on. First, the failure taxonomy — hard, transient, and silent — which is the canonical fault vocabulary cross-referenced from the GPU operations view in Chapter 10.7, the redundancy view in Chapter 12.1, and the IST failure-demonstration view in Chapter 13.6. Second, the empirical fleet failure-rate data — the actual published numbers from at-scale operators, with their vintages and caveats, that you plug into an availability model rather than inventing. Third, the AFR modeling and burn-in discipline that turns raw component failure rates into a spares forecast and an acceptance gate. Every AFR in this chapter feeds the cluster availability and goodput roll-up in Chapter 12.5; the consolidated FMEA catalog these modes populate lives in Appendix F.

The failure taxonomy: hard, transient, silent

Every fault in a GPU fleet falls into one of three classes, and each demands completely different detection and recovery machinery; confusing them is a common operational error. The taxonomy organizes everything downstream: what you instrument, how fast you must react, and whether the failure is even visible at all.

Hard failures are the easy ones, paradoxically, because they announce themselves. A GPU throws an uncorrectable XID and falls off the PCIe bus, an optical transceiver goes dark, a CDU trips on low flow, a power supply faults. The component is unambiguously dead or unreachable; the job crashes or the node drops out; the telemetry screams. The XID 79 'GPU has fallen off the bus' event — which affects roughly 3.2% of H100s in their first year (Last9 / Introl fleet analyses, 2025) — is the archetypal hard failure. These are expensive in lost goodput but cheap to detect: the recovery path is fail-fast, drain, restart from checkpoint, swap the FRU. Hard failures are where mature operators are already good, because the signal is loud.

Transient failures are the treacherous middle. A correctable ECC error storm on HBM, a link that flaps rather than goes down, a thermal excursion that throttles a GPU for ninety seconds, an XID that self-clears on reset. The component is not dead — it works again after a power-cycle or a few minutes — but it degraded the run while it misbehaved, and it will very likely do so again. Transients are the raw material of lemon nodes: hardware that passes every point-in-time health check yet fails repeatedly under load. The decision a transient forces is not 'is it broken' but 'is it broken often enough to eject' — and getting that threshold wrong either keeps a repeat-offender poisoning runs or ejects healthy capacity. Crucially, Meta found that single-bit ECC error trends predict hard GPU failure 48–72 hours in advance with 89–96% accuracy (Meta fleet research, 2025) — the transient is frequently the early-warning tremor before the hard quake, which is why instrumenting transients is the highest-leverage detection investment most fleets under-fund.
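As a rough illustration of what instrumenting transients can look like, the sketch below fits a least-squares slope to a window of per-interval correctable-ECC counts and flags a GPU whose count is climbing. It is not Meta's predictor: the function names, the sampling window, and the min_slope threshold are hypothetical and would need tuning against your own telemetry.

```python
def slope(counts):
    """Least-squares slope of counts against their sample index (0, 1, 2, ...)."""
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    num = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_rising(counts, min_slope=1.0):
    """Flag a GPU whose correctable-ECC count grows by at least min_slope
    errors per sampling interval. min_slope is a hypothetical threshold."""
    return len(counts) >= 2 and slope(counts) >= min_slope


# Hypothetical hourly correctable-ECC counts for two GPUs.
steady_gpu = [5, 5, 5, 5]
climbing_gpu = [0, 2, 4, 6]
print(is_rising(steady_gpu), is_rising(climbing_gpu))
```

A slope is the simplest possible trend signal; it is shown here only because it makes the "trend, not point-in-time value" idea concrete.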

Silent failures are the ones that should keep you up at night, because by definition nothing screams. Silent data corruption (SDC) is a computational error — a multiply that returns the wrong product, a memory read that flips a bit undetected by ECC — that produces no fault, no XID, no log line. The hardware reports success. The math is wrong. In training, an SDC silently corrupts gradients and weights; the loss curve drifts or diverges days later, and you cannot tell whether it is a bad hyperparameter, a data bug, or a single faulty multiplier in one of a hundred thousand chips. The historical SDC rate was roughly 1 per million devices — a cosmic-ray-grade rarity nobody budgeted for. At current process nodes and scale it has risen to roughly 1 per 1,000 silicon devices, which Meta attributes to fundamental silicon manufacturing variation, not particle effects (Meta, How Meta keeps its AI hardware reliable, July 2025). On a 100,000-device fleet, 1-in-1,000 is not an edge case — it is a near-certainty on any multi-day run, and the only defense is a deliberate detection program.
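The "near-certainty" claim follows from simple arithmetic. As a back-of-envelope sketch, assume each of N = 100,000 devices is independently defective with probability p = 1/1,000 (the independence assumption is ours, for illustration, and is not part of Meta's published figure):

```latex
\mathbb{E}[\text{affected devices}] = N p = 100{,}000 \times 10^{-3} = 100,
\qquad
P(\text{at least one affected device}) = 1 - (1 - p)^{N} \approx 1 - e^{-Np} = 1 - e^{-100} \approx 1 .
```

Under that assumption the fleet carries on the order of a hundred SDC-prone devices, so the question is how quickly you find them, not whether they exist.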

The three failure classes and their operational consequences

Class	Signature	Detection mechanism	Reaction timescale	Primary risk if missed
Hard	Uncorrectable XID, device off bus, link down, hardware fault	XID/SEL/syslog, DCGM, fabric BER alarms, CDU/PDU telemetry	Seconds — fail-fast, drain, restart from checkpoint	Lost wall-clock to last checkpoint; one node stalls the whole synchronous job
Transient	Correctable-ECC storm, link flap, thermal throttle, self-clearing XID	Trend analysis on correctable errors; repeat-offender counters; straggler detection	Minutes to hours — quarantine, observe, decide eject vs keep	Lemon node poisons run after run; the missed 48–72 hr early-warning before a hard failure
Silent (SDC)	No signature — correct-looking but wrong computation	Dedicated SDC program: periodic test sweeps + in-workload checks + anomaly detection	Days — only surfaces as drifted/diverged training or wrong inference output	Corrupted weights, wasted compute, results you cannot trust; root-cause is days of detective work

The taxonomy is the master fork: each class demands different detection, reacts on a different timescale, and is referenced as the canonical fault vocabulary from Chapters 10.7, 12.1, and 13.6.

The fork that defines your reliability program: do you fund SDC detection?

Hard and transient detection are table stakes — every operator does them because the signal is there to act on. SDC detection is a discretionary spend, and it is the decision that most distinguishes a frontier-grade fleet from a merely competent one. Funding it means running periodic silicon test sweeps that steal real GPU-hours from production, embedding redundant-compute or checksum checks in the training framework, and building anomaly detection on top — Meta runs on the order of 2.5 billion test seeds per month to chase a 1-in-1,000 defect rate. Not funding it means you are flying blind to a failure class that is statistically certain on any large run, and discovering it only after a diverged training curve has burned days of compute and you cannot prove why. There is no cheap middle: either you build the program or you accept that a fraction of your training output is silently untrustworthy. For anyone running multi-week frontier training, this is no longer optional.

Component failure modes: where the rate actually comes from

Fleet-level AFR is an aggregate that hides a strongly skewed distribution: a handful of components dominate the failure budget, and knowing which ones lets you target spares, burn-in, and detection where they pay off. The Meta Llama 3 root-cause breakdown is the most-cited public dataset, and its shape generalizes across operators even if your absolute numbers differ.

The GPU and its HBM dominate. In the Llama 3 run, faulty GPUs caused 30.1% of interruptions and HBM3 a further 17.2%. Those two categories alone sum to 47.3%; adding the smaller GPU-attributed categories (SRAM, processor, and others) brings the GPU/HBM-related total to nearly 59%, and roughly 78% of interruptions were confirmed hardware (Meta, 2024). This is not surprising once you see the physics: the accelerator package is the densest, hottest, highest-current component in the rack, and HBM stacks are the most thermally and mechanically stressed memory ever shipped at volume. HBM is also temperature-sensitive in a way that compounds the density ramp: HBM error rates roughly double per ~5 °C above ~75 °C junction (commissioning/thermal guidance, 2025), so a cooling system that runs warm does not just risk throttling, it directly inflates your memory failure rate. Every kilowatt you add per rack in the Chapter 1.2 density ramp pushes this term up unless coolant temperature holds.
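The doubling rule of thumb can be written as a multiplier on a baseline error rate. The symbols below are introduced here for illustration: T_j is the HBM junction temperature in °C and r_0 is the error rate at 75 °C. This is only a restatement of the quoted rule, not a fitted model.

```latex
r(T_j) \approx r_0 \cdot 2^{(T_j - 75)/5}, \qquad T_j \ge 75\,^{\circ}\mathrm{C}
```

For example, running at 85 °C instead of 75 °C gives 2^{(85-75)/5} = 2^2 = 4, that is, roughly four times the baseline HBM error rate under this rule.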

Network and optics are the persistent long tail. Switches and cables accounted for 8.4% of Llama 3 interruptions, and link-flaps are as damaging as hard-down links because they corrupt collectives without obviously failing. At 800G XDR and the optics densities of a rail-optimized fabric, transceiver and cable failures scale with link count — a 100k-GPU cluster has millions of optical links, and even an excellent per-link AFR multiplies into a steady drip of fabric faults. The fabric is the failure domain that grows fastest as you scale out.

Infrastructure failures are rarer but higher-impact. Power and cooling faults occur far less often than GPU faults, but a single CDU trip or a PDU fault can take down an entire rack or pod at once, converting one component failure into dozens of simultaneous node losses. The Uptime data is stark: power is implicated in roughly 45% of impactful data-center outages (mostly UPS), and human error is behind 70–80% of all outages, with 85% of those traced to process failures (Uptime Institute, 2025). The lesson for AI fleets is that the GPU dominates frequency while infrastructure dominates blast radius, and your FMEA in Appendix F must weight both.

Llama 3 405B interruption root-cause breakdown (16,384 H100s, 54 days)

Root cause	Share of interruptions	Class	Spares / detection implication
Faulty GPU (incl. XIDs)	30.1%	Mostly hard, some transient	Largest single spares driver; ECC-trend prediction buys 48–72 hr warning
HBM3 memory	17.2%	Hard + thermal-driven	Coolant temperature directly modulates this term; bin GPUs with HBM history
GPU SRAM	4.5%	Hard/transient	Often surfaces as correctable-error storms first
GPU processor	4.1%	Hard	Part of the ~59% GPU/HBM total
Network switch / cable	8.4%	Hard + link-flap transient	Scales with link count; optics spares pool sized to fabric, not node, count
Software / other	~12.9%	Transient	Not a spare; recovered by restart, masks some hardware root causes

The canonical public failure-mix dataset. 466 total interruptions, 419 unplanned, ~1 every 3 hours; >90% effective training time achieved with only 3 manual interventions. A single snapshot — your fleet's mix will differ, but the GPU/HBM dominance generalizes. Source: Meta, Llama 3 Herd of Models (2024).

The scale law: why MTBF collapses as the cluster grows

Here is the result that reframes every reliability conversation in AI infrastructure, and the one most newcomers get wrong: mean time between failures is not a property of your hardware; it is a property of your scale. If a single 8-GPU node has a mean time to failure of ~47.7 days, then a synchronous job that depends on all GPUs staying up sees a cluster MTBF that shrinks roughly in inverse proportion to GPU count. The observed numbers are stark (these are empirical anchors that include software-induced failures, not a clean arithmetic chain, so they do not divide exactly from the single-node base rate):

8 GPUs: ~47.7 days between failures.
1,024 GPUs: ~7.9 hours.
16,384 GPUs: ~1.8 hours (the same order as Llama 3's observed ~3-hour cadence, which includes software failures).
131,072 GPUs: ~0.23 hours — roughly one failure every 14 minutes (SemiAnalysis / Meta-derived scaling, 2024–25).
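To see how far the quoted anchors sit from a pure inverse-proportional model, the sketch below computes the idealized cluster MTBF from the single-node base rate and prints it next to the figures listed above. It assumes independent failures at a constant per-node rate, which real fleets do not strictly obey; the function and constant names are ours.

```python
# Single 8-GPU node mean time to failure, taken from the list above.
NODE_MTTF_HOURS = 47.7 * 24
GPUS_PER_NODE = 8

# Empirical anchors quoted in the list above, in hours.
QUOTED_MTBF_HOURS = {8: 47.7 * 24, 1024: 7.9, 16384: 1.8, 131072: 0.23}


def ideal_mtbf_hours(gpu_count: int) -> float:
    """Idealized MTBF of a synchronous job: any one node failing stops the
    job, so the job's failure rate is the sum of the per-node rates."""
    nodes = gpu_count / GPUS_PER_NODE
    return NODE_MTTF_HOURS / nodes


for gpus, quoted in QUOTED_MTBF_HOURS.items():
    ideal = ideal_mtbf_hours(gpus)
    print(f"{gpus:>7} GPUs: ideal {ideal:9.2f} h, quoted {quoted:9.2f} h")
```

The two columns differ, as the caveat above warns; the idealized line is useful for reasoning about the trend, while the quoted anchors are what was actually reported.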

This is the GOODPUT thread made concrete. At 100k+ GPUs, the cluster is never fully healthy — something is always failing — so the design target shifts from 'maximize uptime' to 'maximize the fraction of GPU-hours doing useful work despite a continuous failure stream.' It is also why the checkpoint interval must shrink as you scale: a 100k-GPU run needs roughly 2-minute checkpoint intervals to hit 0.9 ETTR, where a 16k-GPU run is comfortable at 5 minutes (the optimal-interval math is canonical in Chapter 9.4; operational tuning in Chapter 14.4). The scale law is the bridge between component AFR and cluster availability: the same per-GPU failure rate is a non-event at 8 GPUs and an existential design constraint at 100k.
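The link between MTBF and checkpoint interval can be sketched with Young's classic first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). This is a standard textbook formula swapped in for illustration, not necessarily the model used in Chapter 9.4, and the 0.15-minute checkpoint write cost below is a hypothetical value.

```python
import math


def young_interval_minutes(checkpoint_cost_min: float, mtbf_min: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes
    expected lost time, valid when checkpoint cost is much smaller than MTBF."""
    return math.sqrt(2.0 * checkpoint_cost_min * mtbf_min)


CHECKPOINT_COST_MIN = 0.15  # hypothetical: about 9 seconds to write a checkpoint

# MTBF anchors from the scale-law list, converted to minutes.
for label, mtbf_hours in [("16,384 GPUs", 1.8), ("131,072 GPUs", 0.23)]:
    interval = young_interval_minutes(CHECKPOINT_COST_MIN, mtbf_hours * 60)
    print(f"{label}: checkpoint about every {interval:.1f} min")
```

The square-root dependence is the point: as MTBF collapses with scale, the interval has to shrink too, and the checkpoint write itself must get cheaper for the scheme to stay efficient.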

Two failure rates that look contradictory and aren't

Meta's RSC cluster data reports node failure rates of 2.34–6.50 failures per 1,000 node-days, and separately notes that hardware failures are only 0.2% of jobs but 18.7% of GPU-runtime impact. These are not in tension — they are the same truth from two angles. Most jobs are short and never hit a failure; the rare long, large job is where failures concentrate and where each failure is most expensive because it stalls thousands of synchronized GPUs. The implication for capacity planning: reliability spend should be workload-weighted, not fleet-averaged. Hardening the small fraction of long frontier runs returns far more goodput than uniformly hardening every node — which is exactly the lemon-ejection-and-fast-checkpoint posture, not a 2N-everything posture.

1 every ~3 hr

unplanned interruption rate, Llama 3 405B (419 over 54 days, 16,384 H100s); >90% effective training time with automation

2024 · Meta (Llama 3 Herd of Models)

~59% / ~78%

share of interruptions GPU/HBM-related (including GPU 30.1% and HBM3 17.2%) / share confirmed hardware

2024 · Meta (Llama 3 paper) / DataCenterDynamics

~9% AFR

combined H100 GPU+HBM annualized failure rate (~1.34% over 54 days → ~27% cumulative over 3 yr)

2025 · SemiAnalysis / Meta-derived

~1 per 1,000

silent-data-corruption rate per silicon device (up from ~1 per million historically); large runs expect SDC every 1–2 weeks

2025 · Meta (How Meta keeps its AI hardware reliable)

45–60 days

Fleetscanner full-fleet SDC sweep cadence; Hardware Sentinel raises effective coverage ~1.7–1.9x over test-based methods

2025 · Meta Engineering / ASPLOS 2025

~3.2%

share of H100s hit by XID 79 ('GPU fell off the bus') in year 1; ECC trends predict hard failure 48–72 hr out at 89–96% accuracy

2025 · Last9 / Introl; Meta fleet research

72–168 hr

production GPU burn-in/soak window; vendor burn-in claims ~98% of infant-mortality failures removed within warranty

2025 · Together AI / Introl / domain synthesis

~7 days / 512 GPUs

best-in-class mature-cluster MTBF; new clusters fail far more during 3–4 week burn-in

2025 · SemiAnalysis (100k H100 clusters)

SDC detection programs: chasing the failure with no signature

Because SDC by definition leaves no log line, detecting it is an active program, not a passive alarm — and the state of the art is a layered defense, with each layer trading coverage against the GPU-hours it steals from production. Meta's published stack is the reference architecture the rest of the industry is converging on.

Fleetscanner is the offline sweep: dedicated silicon test patterns scheduled across the fleet so the entire estate is covered every 45–60 days. It is the most thorough layer (over three years it reached ~93% coverage for a major defect family, with ~23% unique coverage no other method caught) but the most expensive, because the GPU under test is not earning revenue. Ripple co-locates with live workloads, slipping millisecond-to-second test bursts into the gaps between real work, so it achieves fleet-wide coverage in days rather than weeks at near-zero opportunity cost — at lower per-pass depth. Hardware Sentinel is the newest layer and the cleverest: it watches application exceptions in kernel space and infers core-level SDC without allocating any test time at all, raising effective coverage roughly 1.74x over Fleetscanner and 1.92x over Ripple (ASPLOS 2025). The architectural lesson is that no single method suffices — you layer a deep-but-slow sweep, a fast-but-shallow in-workload probe, and a zero-cost inference layer, and the union catches what any one misses.

For training specifically, the framework-level defenses matter as much as the fleet-level ones: redundant computation on a sample of operations, gradient/activation checksums, and divergence monitors that flag when a replica's numerics drift from its peers. These catch SDC at the moment it corrupts the math rather than 45 days later in a sweep. The decision here mirrors the funding fork above — every layer you add costs GPU-hours or engineering, and the right depth is set by how catastrophic a silently-corrupted run is for your business. A frontier lab burning months of compute on one run buys all the layers; a batch-inference shop running idempotent, re-runnable jobs may rationally buy none.
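A divergence monitor of the kind described above can be sketched as a majority vote over replica checksums. The sketch assumes replicas are expected to be bitwise identical (for example, data-parallel replicas given the same input with deterministic kernels), which does not hold for every training setup; the function names and the replica ids are hypothetical.

```python
import hashlib
import struct
from collections import Counter


def checksum(values):
    """Hash the exact bit pattern of a sequence of floats (as float64)."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def find_divergent_replicas(replica_outputs):
    """Return ids of replicas whose checksum differs from the majority value.
    replica_outputs maps a replica id to the values that replica produced."""
    sums = {rid: checksum(vals) for rid, vals in replica_outputs.items()}
    majority, _ = Counter(sums.values()).most_common(1)[0]
    return sorted(rid for rid, digest in sums.items() if digest != majority)


# Hypothetical gradients from three replicas; r2 has one corrupted element.
outputs = {
    "r0": [1.0, 2.0, 3.0],
    "r1": [1.0, 2.0, 3.0],
    "r2": [1.0, 2.0000001, 3.0],
}
print(find_divergent_replicas(outputs))
```

Where bitwise determinism is not available, the same structure works with a tolerance-based comparison instead of a hash, at the cost of missing small corruptions.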

Deep dive: from component AFR to a spares forecast and an availability number

The practical payoff of measuring failure rates is two numbers you cannot run a fleet without: how many spares to stock, and what availability to promise. Both fall straight out of AFR.

Spares. Take a 100,000-GPU fleet at a combined GPU+HBM AFR of ~9%. That is ~9,000 GPU swaps per year, or roughly 25 per day, every day, forever. You cannot RMA at that rate without an on-site spare pool — practice is to hold a spare-node pool of ~2–5% of the cluster so a failed node is swapped from inventory in minutes and the dead one enters the RMA pipeline asynchronously. The AFR also sizes the optics spares pool separately (driven by link count, not node count) and the CDU/PDU spares (driven by blast radius, not frequency). Get the AFR input wrong by a factor of two and you either strand capital in idle spares or stall runs waiting for parts. The full sparing model, RMA logistics, and repair-vs-replace-vs-harvest economics are the subject of Chapter 14.6; this chapter supplies the AFR that drives it.
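The spares arithmetic is simple enough to keep as a small script that you re-run whenever the measured AFR changes. The sketch below assumes a linear extrapolation from an observation window to a year and expresses the spare pool in GPU-equivalents; the function names are ours, and the 2% pool fraction is just the low end of the range quoted above.

```python
def annualize(rate_over_window: float, window_days: float) -> float:
    """Linearly scale a failure fraction observed over window_days to one year."""
    return rate_over_window * 365.0 / window_days


def spares_forecast(fleet_gpus: int, afr: float, spare_pool_fraction: float) -> dict:
    """Expected swap volume and spare-pool size for a fleet at a given AFR."""
    swaps_per_year = fleet_gpus * afr
    return {
        "swaps_per_year": swaps_per_year,
        "swaps_per_day": swaps_per_year / 365.0,
        "spare_pool_gpus": fleet_gpus * spare_pool_fraction,
    }


# ~1.34% of GPUs failing over the 54-day Llama 3 run annualizes to roughly 9%.
afr = annualize(0.0134, 54)
forecast = spares_forecast(fleet_gpus=100_000, afr=0.09, spare_pool_fraction=0.02)
print(f"annualized AFR: {afr:.3f}")
print(forecast)
```

Keeping the AFR as an explicit input makes it easy to see how a factor-of-two error in the rate propagates straight into the swap volume.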

Availability. The same AFRs, combined with the scale law and your checkpoint interval, feed the reliability block diagram and Monte-Carlo availability model in Chapter 12.5. The critical discipline is that the AFR must be a live, fleet-measured input, not a design-time constant. The moment a new silicon generation lands — and the density ramp guarantees one every ~18–24 months — its infant-mortality and steady-state AFR are unknown, your burn-in data is the first signal, and last generation's number is actively misleading. Operators that hard-code AFR into their availability model wake up to a fleet whose real reliability has drifted out from under the promise they sold.

Burn-in: paying for infant mortality up front

Component failure rates are not constant over life — they follow the classic bathtub: a high infant-mortality phase early, a low flat useful-life phase, and a rising wear-out phase late. Burn-in is the deliberate decision to pull infant mortality forward into a controlled acceptance window so it lands on a test harness instead of a customer's training run. This is the central fork at go-live, and it is a direct schedule-versus-reliability trade.

The mechanics: 72–168 hours of sustained, thermally-stressful load — GPU-Burn, NCCL all-reduce loops, HBM stress patterns, FIO storage soak — at or near max TDP, with DCGM diagnostics (the dcgmi diag -r 4 deep level is the standard acceptance gate) bracketing the run. Vendors claim burn-in removes on the order of 98% of infant-mortality failures within the warranty period, and a fresh cluster's MTBF during its first 3–4 weeks is dramatically worse than its mature ~7-days-per-512-GPU steady state — which is precisely the infant-mortality phase you are trying to flush before acceptance. The acceptance criteria, levels, and tooling are the subject of the commissioning chapters (Chapter 13.6 covers Level 5 IST and failure-mode demonstration); here the point is the economic fork.

The burn-in duration fork: schedule vs. reliability

Burn-in posture	Window	What it buys	What it costs	Best fit
Minimal / skip	<24 hr smoke test	Fastest time-to-revenue; GPUs earning almost immediately	Infant-mortality failures hit live workloads; high early interruption rate; SLA risk	Spot/batch capacity; re-runnable jobs; aggressive neocloud ramp
Standard soak	72–168 hr	~98% of infant-mortality flushed; documented acceptance; defensible SLA baseline	3–7 days of deferred revenue per node cohort	Most production training/inference; the industry default
Extended / contractual	168+ hr plus repeated dcgmi diag -r 3 / NCCL passes	Highest confidence; weeds out marginal lemons before customer ever sees them	Longest schedule hit; risks over-cycling components into early wear	Frontier training contracts; reliability-premium SLAs; ClusterMAX-grade clouds

Every burn-in hour is deferred revenue on idle GPUs; every hour skipped raises the odds infant mortality lands on a production run. The right point depends on workload value and contractual reliability commitments.

Burn-in does not end at acceptance — it transitions into a steady-state cadence. The day-2 discipline is a weekly deep node-health pass (dcgmi diag -r 3 plus NCCL on idle GPUs) and the continuous straggler/lemon detection that watches for the repeat-offenders burn-in could not catch. A node that runs ~15% below a golden-reference benchmark is auto-flagged for quarantine, and the lemon-ejection decision — proven to cut 512+-GPU job failure rates from ~14% to ~4% and lift completion ~30% (Meta lemon-node studies, 2024) — is where transient-failure data becomes reclaimed goodput. Burn-in front-loads the cost of infant mortality; lemon ejection back-stops the transients that slip through. Both are detection programs paid for in GPU-hours, and both are justified by the same arithmetic: at scale, the goodput you reclaim dwarfs the capacity you spend finding it.
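The quarantine and ejection decisions can be sketched as a small classifier over per-node health records. The 15% shortfall threshold comes from the text; everything else, including the NodeHealth fields, the 30-day repeat-offender counter, and the eject_after value of 3, is a hypothetical placeholder for whatever your fleet data supports.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    node_id: str
    benchmark_score: float      # same units as the golden reference score
    transient_events_30d: int   # hypothetical repeat-offender counter


def classify(node: NodeHealth, golden_score: float,
             slow_fraction: float = 0.15, eject_after: int = 3) -> str:
    """Return 'eject', 'quarantine', or 'keep' for one node.
    eject_after is a hypothetical threshold, not a published value."""
    shortfall = 1.0 - node.benchmark_score / golden_score
    if node.transient_events_30d >= eject_after:
        return "eject"          # repeat offender: treat as a lemon
    if shortfall >= slow_fraction:
        return "quarantine"     # straggler: pull from the pool and observe
    return "keep"


fleet = [
    NodeHealth("n1", benchmark_score=84.0, transient_events_30d=0),
    NodeHealth("n2", benchmark_score=99.0, transient_events_30d=5),
    NodeHealth("n3", benchmark_score=95.0, transient_events_30d=1),
]
for node in fleet:
    print(node.node_id, classify(node, golden_score=100.0))
```

Checking the repeat-offender counter before the benchmark shortfall reflects the point made earlier: a lemon can pass every point-in-time check and still deserve ejection.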

Deep dive: why your fleet's numbers will (and should) differ from Meta's

The Llama 3 dataset is the most-cited reliability data in the field precisely because so little else is public — but treating it as a universal constant is a mistake. It is a single snapshot, on H100s, on Meta's specific facility, cooling, firmware, and software stack, in 2024. Four things move your numbers off it. Silicon generation: Blackwell-class GB200/GB300 racks at ~120–140 kW change the thermal and current stress profile entirely, and their burn-in AFR is still being established across the fleet — early NVL72 bring-up surfaced novel NVLink copper-backplane reliability issues that simply did not exist on H100 (SemiAnalysis, 2025). Cooling discipline: because HBM error rate roughly doubles per 5 °C above ~75 °C, a fleet that holds tighter coolant temperature will measure a materially lower HBM failure share. Operational maturity: a fresh cluster in burn-in and a two-year-old fleet sit on opposite ends of the bathtub curve, so a blended fleet AFR depends on your age mix. Software stack: Llama 3's ~12.9% software share is highly stack-dependent and not portable at all.

The conclusion is operational, not academic: borrow published numbers to bootstrap your design-time model, then replace them with your own telemetry as fast as you can collect it. The DCIM and observability stack of Chapter 14.2 exists in large part to produce your AFR, not someone else's — and the availability model in Chapter 12.5 is only as good as the fleet-measured failure rate you feed it.

The density-ramp resets your reliability data to zero

Every figure in this chapter is anchored to a specific silicon generation, and the DENSITY-RAMP thread means that anchor moves out from under you on an ~18–24-month cadence. When you transition from H100 to GB200/GB300 to Rubin-class racks, the AFRs, the SDC rate, the dominant failure modes, and the infant-mortality curve are all unknown again — and the higher current, higher temperature, and novel interconnects (copper NVLink backplanes, 800 VDC distribution) introduce failure modes that have no historical baseline. The operational trap is carrying last generation's AFR into the new generation's spares forecast and availability promise. The discipline: treat each new generation as a fresh burn-in population, instrument it harder than the mature fleet, and refuse to publish an availability number on borrowed reliability data until your own telemetry confirms it. Reliability data has a half-life, and the density ramp keeps shortening it.

This chapter is the canonical home for the hard/transient/silent taxonomy and the empirical failure-rate data; the modes it catalogs populate the consolidated FMEA in Appendix F. The same fault vocabulary is used operationally in Chapter 10.7 (fleet fault tolerance & autonomous recovery), in the redundancy and fault-domain engineering of Chapter 12.1, and demonstrated under load in the Level 5 IST failure-mode work of Chapter 13.6. The AFRs derived here feed the quantitative availability and goodput roll-up in Chapter 12.5. The checkpoint-interval math that turns the scale law into a recovery strategy is canonical in Chapter 9.4 and operationally tuned in Chapter 14.4. The spares and RMA logistics that consume these failure rates are in Chapter 14.6; the telemetry that measures them in Chapter 14.2; and the density ramp that keeps invalidating the data in Chapter 1.2.