
Chapter 14.6

Spares Strategy, RMA Logistics & Repair Operations

At a fleet of hundreds of thousands of accelerators, spares are not an inventory line — they are an availability instrument: the depth of the on-site pool and the speed of the hot-swap, not the OEM warranty, set how much goodput the cluster actually earns back from every failure.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

Your spares depth per failure class — set from the component's annualized failure rate and the replenishment lead time, not a flat percentage — and where the pool physically lives (on-rack hot spares vs site cage vs regional depot).
The repair-vs-replace-vs-harvest disposition for each failed unit: who can swap an FRU in minutes, what goes back to the OEM under RMA, what gets board-repaired at a depot, and what is cannibalized for parts.
Whether you operate inside the OEM warranty/RMA channel or self-spare and self-insure — the fork that decides your turnaround time, your capital tied up in inventory, and who eats the depreciation on a failed board.
The FRU granularity you design and buy to: tray-level, board-level, or module-level replaceability, because it sets both the spares bill of materials and the mean-time-to-repair on the data-hall floor.
The logistics backbone at GW scale: serialized tracking, customs and import-of-record for cross-border RMA, ESD/shock-controlled transport, and the reverse-logistics path for liquid-cooled, weight-heavy, export-controlled hardware.

By the time a cluster is in steady-state operation, the reliability problem has stopped being a question of whether hardware fails and become a question of how fast you put it back. Chapter 14.3 quantified the failure rates; Chapter 14.4 covered the training-resilience software that survives a failure; this chapter is about the physical economy that closes the loop — the spare on the shelf, the technician with the FRU, the box going back to the OEM, and the board on a repair bench. The decision that governs all of it is rarely modeled at scoping time: how deep is your spares pool, how close does it sit to the rack, and how fast can a failed unit be dispositioned and replaced? Get that wrong and a fleet with excellent component reliability still bleeds goodput, because the bottleneck moves from the failure rate to the replacement rate.

Each disposition fork has a direct downstream cost. Spare too thin and you stall jobs waiting on parts that are weeks out on an RMA; spare too deep and you have stranded millions in depreciating silicon on a shelf, on a 2-3 year economic clock (→ Chapter 1.8). Route every failure through the OEM warranty channel and your turnaround is measured in weeks; self-spare and self-repair and you carry the inventory and the engineering, but you swap in minutes. Design for tray-level FRUs and your floor MTTR is short but your spares BoM is expensive; design for board- or module-level and you save inventory dollars but your technicians need more skill and more time at the rack. This chapter names each fork and its downstream cost, globally and vendor-neutrally, current to 2026.

Sparing models: from failure rate to pool depth

A spares pool is sized, not guessed. The honest model is a forecast: for each failure class, the expected number of failures over the replenishment lead time, plus a safety buffer sized to the variance and to the cost of a stockout. The two inputs are the component's annualized failure rate (AFR) and the lead time to get a replacement onto the shelf, and for AI accelerators in 2026 both inputs are hostile. GPUs in production fleets run an effective AFR in the high single digits: roughly 9% annualized, with cumulative failure risk crossing 25% over three years, dominated by the GPU package and its HBM stacks rather than the board substrate (industry synthesis drawn from Meta's Llama 3 paper and academic resilience studies, 2024-2025). Lead times, meanwhile, are gated upstream by CoWoS packaging and HBM allocation (→ Chapter 2.3), so a replacement GPU tray is not a next-day commodity: it competes with new-build demand for the same constrained supply.

The naive approach — "hold 2% of the fleet as spares" — fails because failure rates are wildly non-uniform across the bill of materials. Optics and cables fail far more often than GPU boards; HBM-related faults dominate the GPU class; power-supply and fan FRUs fail on their own curves; CDUs, manifolds, and quick-disconnects fail on a mechanical/fluids curve that has nothing to do with silicon. A single flat percentage over-spares the reliable parts and under-spares the failure-prone ones. The correct model sizes each failure class against its own AFR and its own lead time, which is why the disposition table below is organized by component, not by a global ratio.
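The per-class sizing described above can be sketched in a few lines. This is a minimal illustration, not the chapter's own model: it assumes failures over the lead time follow a Poisson distribution with a constant AFR, and the names `expected_failures`, `pool_depth`, and `service_level` are hypothetical.

```python
import math


def expected_failures(units: int, afr: float, lead_time_days: float) -> float:
    """Mean number of failures in one replenishment lead time (constant AFR assumed)."""
    return units * afr * lead_time_days / 365.0


def pool_depth(units: int, afr: float, lead_time_days: float,
               service_level: float = 0.99) -> int:
    """Smallest spare count k with P(failures <= k) >= service_level.

    The Poisson model is an assumption of this sketch, not a claim
    made by the surrounding text.
    """
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must be strictly between 0 and 1")
    mean = expected_failures(units, afr, lead_time_days)
    if mean <= 0.0:
        return 0
    # Hard cap far into the tail so floating-point rounding cannot loop forever.
    max_k = int(mean + 40.0 * math.sqrt(mean) + 100.0)
    cumulative = 0.0
    for k in range(max_k + 1):
        # Poisson pmf computed in log space to avoid underflow for large means.
        cumulative += math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        if cumulative >= service_level:
            return k
    return max_k


# Illustrative only: 10,000 GPUs at the ~9% AFR quoted above, with a
# hypothetical 60-day replenishment lead time.
print(pool_depth(units=10_000, afr=0.09, lead_time_days=60))
```

Running the same function once per failure class, each with its own AFR and lead time, is what replaces the flat percentage. The service level is where the cost of a stockout enters: a class whose stockout stalls a scale-up domain would plausibly warrant a higher target than a bulk consumable.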

Spares depth is an availability dial, not an inventory cost

The instinct of a finance team is to minimize spares inventory because it is depreciating capital sitting idle. The instinct of an operations team is to maximize it because a stockout stalls a job. Both are right, and the resolution is to price the spare in goodput, not in dollars-on-the-shelf. The Delta resilience study quantified the lever directly: improving GPU availability from 99.5% to 99.9% cut the overprovisioning a large synchronous job needs from 20% to 5% — a 4x reduction (arXiv 2503.11901, 2025). That overprovisioning is GPUs you bought and are not using productively; the hot spare that shortens a swap is what lets you run them. A deep, close pool is not idle capital — it is the cheapest way to buy back the overprovisioning tax on every other GPU in the cluster.

The RMA lifecycle

RMA (Return Merchandise Authorization) is the formal channel by which a failed unit goes back to the OEM for warranty repair or replacement. At fleet scale it is a pipeline with measurable stages, and the goodput you lose is the integral of how long a unit spends in each: detect → triage/qualify → swap → return → repair/replace → replenish. The expensive, often-skipped stages are triage/qualify and replenish. Triage/qualify determines whether the unit actually meets the RMA criteria: an OEM will reject a return that its field-diagnostic tool says is healthy, and a rejected RMA is wasted shipping plus a still-empty slot on the spares shelf. Replenish is the silent killer: the slot on the floor is refilled from your local spare in minutes, but the spare you consumed is not back on the shelf until the RMA closes weeks later. RMA latency therefore does not gate floor MTTR; it gates how fast your pool recovers, and a slow RMA pipeline with a shallow pool is how you walk into a stockout three failures deep.
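The pool-recovery point can be made concrete with Little's law: in steady state, the number of units sitting in the RMA pipeline is the failure arrival rate times the pipeline turnaround. The sketch below assumes a constant AFR and a fixed turnaround; both function names are hypothetical.

```python
def units_in_rma_pipeline(units: int, afr: float, rma_turnaround_days: float) -> float:
    """Steady-state count of spares 'out' in the RMA loop (Little's law: L = lambda * W)."""
    failures_per_day = units * afr / 365.0
    return failures_per_day * rma_turnaround_days


def days_until_stockout(pool_depth: int, units: int, afr: float) -> float:
    """Expected days for failures to drain a pool that receives no replenishment."""
    failures_per_day = units * afr / 365.0
    return pool_depth / failures_per_day


# Illustrative numbers: 36,500 units at a 10% AFR is 10 failures a day;
# a hypothetical 30-day RMA turnaround then keeps about 300 units in flight.
print(units_in_rma_pipeline(36_500, 0.10, 30))
print(days_until_stockout(150, 36_500, 0.10))
```

A pool shallower than the in-flight count cannot be in steady state at all, which is the arithmetic behind sizing depth against the replenishment gate rather than the swap time.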

The qualify stage is more disciplined than most operators expect, and it is worth getting right because it is where RMAs are rejected. For GPU memory, the vendor publishes objective field-diagnosable thresholds: NVIDIA's row-remapping RMA policy qualifies a GPU once a bank accumulates eight remapped rows from uncorrectable errors, or on a duplicate remap of an already-remapped row, or after 512 total remappings — and on Blackwell a third remap attempt can trigger an on-package HBM channel repair against a spare channel, potentially avoiding the RMA entirely (NVIDIA GPU Memory Error Management docs, 2025). The operational lesson: instrument the row-remap failure flag in-band (NVML/nvidia-smi) and out-of-band (SMBPBI) so triage is automated and your returns are accepted on the first pass, not bounced back across a customs border.
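A minimal sketch of automated triage against the thresholds quoted above. The `RemapCounters` record and its field names are hypothetical: in practice the values would be populated from NVML/nvidia-smi or SMBPBI telemetry, and the exact counters exposed differ by GPU generation and driver, so treat this as the shape of the check rather than a vendor-accurate implementation.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above; confirm against the vendor
# documentation for the specific GPU generation before relying on them.
MAX_UNCORRECTABLE_REMAPS_PER_BANK = 8
MAX_TOTAL_REMAPS = 512


@dataclass
class RemapCounters:
    """Hypothetical per-GPU telemetry record (field names are illustrative)."""
    serial: str
    worst_bank_uncorrectable_remaps: int
    total_remaps: int
    duplicate_remap_seen: bool
    remap_failure_flag: bool


def rma_reasons(c: RemapCounters) -> list[str]:
    """Return the list of criteria this GPU meets; an empty list means do not ship it."""
    reasons = []
    if c.remap_failure_flag:
        reasons.append("row-remap failure flag set")
    if c.worst_bank_uncorrectable_remaps >= MAX_UNCORRECTABLE_REMAPS_PER_BANK:
        reasons.append("bank reached 8 uncorrectable remapped rows")
    if c.duplicate_remap_seen:
        reasons.append("duplicate remap of an already-remapped row")
    if c.total_remaps >= MAX_TOTAL_REMAPS:
        reasons.append("512 total remappings reached")
    return reasons


healthy = RemapCounters("SN-EXAMPLE-1", 2, 5, False, False)
worn = RemapCounters("SN-EXAMPLE-2", 8, 40, False, True)
print(rma_reasons(healthy))  # no criteria met
print(rma_reasons(worn))     # two criteria met
```

Attaching the returned reasons to the RMA request gives the OEM the same evidence its own field-diagnostic tool would look for, which is the point of qualifying before the box ships.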

Repair vs replace vs harvest: the disposition fork

Every failed unit gets one of four dispositions, and choosing wrong is either slow or wasteful:

Hot-swap from local spare. The default for anything field-replaceable: the floor MTTR is minutes and the failed unit is dealt with offline.
RMA to OEM. Sends the unit back under warranty: zero repair cost to you, but weeks of pipeline latency and a spare consumed in the interim.
Depot board-repair. Sends the unit to a specialist bench that reworks at component level (reflow a failed VRM, replace a fan module, re-seat or replace an optic). Viable for boards out of warranty or where the failure is a cheap discrete part, but it requires a repair partner and a logistics leg.
Harvest/cannibalize. Strips a write-off unit for its still-good FRUs. This is the rational end-state for a board with a dead GPU package but healthy NVLink, power, and cooling subassemblies, especially once the generation is off the new-build supply and the parts are otherwise unobtainable.
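One way to read the four dispositions is as two decisions: what happens to the slot on the floor, and what happens to the failed unit once it is out of the rack. The sketch below encodes that reading as a default rule set. The `FailedUnit` fields and the rule ordering are assumptions made for illustration, not a policy stated in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class SlotAction(Enum):
    HOT_SWAP = "hot-swap from local spare"
    WAIT_FOR_PART = "slot stays empty until a part arrives"


class UnitPath(Enum):
    RMA_TO_OEM = "RMA to OEM"
    DEPOT_REPAIR = "depot board-repair"
    HARVEST = "harvest/cannibalize"


@dataclass
class FailedUnit:
    """Hypothetical triage record; every field name is illustrative."""
    field_replaceable: bool
    local_spare_available: bool
    in_warranty: bool
    meets_rma_criteria: bool
    cheap_discrete_fault: bool


def slot_action(u: FailedUnit) -> SlotAction:
    """The floor slot is refilled in minutes only if the FRU is swappable and a spare is on hand."""
    if u.field_replaceable and u.local_spare_available:
        return SlotAction.HOT_SWAP
    return SlotAction.WAIT_FOR_PART


def unit_path(u: FailedUnit) -> UnitPath:
    """Default economic path for the failed unit once it is offline."""
    if u.in_warranty and u.meets_rma_criteria:
        return UnitPath.RMA_TO_OEM      # repair cost sits with the OEM
    if u.cheap_discrete_fault:
        return UnitPath.DEPOT_REPAIR    # component-level rework is worth a logistics leg
    return UnitPath.HARVEST             # write-off: keep the still-good FRUs


unit = FailedUnit(True, True, False, True, False)
print(slot_action(unit).value, "/", unit_path(unit).value)
```

Separating the two decisions mirrors the observation that a hot-swap fixes the floor in minutes while the failed unit is dispositioned on its own, slower clock.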

The fork that decides which of these you can even reach is granularity. A fault domain that is replaceable as a small FRU — a single optic, a single power-supply, a single fan, a single GPU board on an OAM/UBB-style baseboard — is fast and cheap to swap and cheap to spare. A fault domain welded into a large integrated assembly — a full compute tray, a fused NVLink backplane, a sealed liquid-cooled module — is slower to swap, more expensive to spare, and more likely to force an RMA of a large, heavy, valuable unit when only a small part of it failed. The trend in 2026 dense racks (GB200 NVL72 and successors, → Chapter 5.4) cuts against serviceability: blind-mate liquid manifolds and copper NVLink backplanes raise integration density and complicate the in-place swap, so the FRU-granularity decision has to be made at procurement, against the rack architecture you are buying.

Failure class → sparing model and disposition

Failure class	Relative failure rate	Typical FRU granularity	Default disposition	Sparing posture	Replenishment gate
GPU package / HBM	High (~9% AFR; dominant fault class)	GPU board on baseboard, or full tray	RMA under warranty; harvest end-of-life	Deep local pool + hot spares on-rack	CoWoS/HBM supply; weeks-to-months
Optics / transceivers	Highest volume of swaps (per-link)	Pluggable module (hot-swap, minutes)	Hot-swap; depot-clean or scrap	Bulk consumable; reorder buffer at site	Commodity; days-to-weeks
Cables (DAC/AEC/fiber)	High in burn-in; lower steady-state	Individual cable	Replace from bin	Bulk consumable on site	Commodity; days
PSU / fans / VRMs	Moderate; wear-driven	Discrete hot-swap module	Replace; depot board-repair for VRM	Modest local pool	Commodity; days-to-weeks
CDU / manifold / QDC (cooling)	Low count, high consequence	Pump, valve, quick-disconnect, hose	Replace critical path; depot-rebuild pumps	N+1 plant spares + critical QDC kit	Specialist; weeks (→ Chapter 14.5)
NVLink backplane / switch tray	Low but stalls a scale-up domain	Switch tray; backplane (rack-level)	RMA; on-call OEM field service	Spare switch tray per N racks	OEM allocation; weeks

The failure-rate and replenishment-gate columns are a 2026 practitioner-current synthesis; figures vary by generation, operator maturity, and supply conditions. Disposition is the default economic path, not a rule.

Field-replaceable units and design for serviceability

Serviceability is a design property you inherit from the hardware vendor, partly negotiate at procurement, and pay for every day in operation. The good FRU is hot-swappable, blind-mate, keyed, and serial-tracked: an optic slides out and a spare slides in without taking the switch down, with MTTR in minutes. (This is also the canonical reason Meta kept copper inside the rack: passive DAC has a far better per-link MTBF, so swaps are rare, and when a link does fail, replacing it does not require shutting anything down.) The hostile FRU is the one that forces you to drain a liquid loop, break a blind-mate manifold, or pull a 1.4 kA busbar connection to reach a failed part. That work is slower, higher-risk, and bounded by concurrent-maintainability rules (→ Chapter 12.1) so it does not take down neighbors.

The liquid-cooling transition rewrites the serviceability calculus. Replacing a GPU tray in a direct-to-chip rack is no longer a dry electrical operation — it is a fluid-handling operation involving dripless quick-disconnects (UQD/UQDB, → Chapter 5.4), leak-detection re-verification, and re-priming. That raises both the skill floor for the technician and the MTTR for the swap, and it means your spares strategy now has a cooling bill of materials — spare QDCs, hoses, gaskets, CDU pumps — that an air-cooled fleet never had. Design-for-serviceability reviews at procurement should score exactly this: how many minutes, how many tools, and how much fluid-handling does each likely FRU swap require, live, on a concurrently-maintainable rack.

Deep dive: why MTTR, not MTBF, is the goodput lever at fleet scale

Reliability intuition fixates on mean-time-between-failures, but at a fleet of hundreds of thousands of accelerators the failure rate is effectively a constant you cannot engineer away: best-in-class operators still see roughly one failure per 512 GPUs every ~7 days (SemiAnalysis, 2025), so a 100k-GPU cluster lives with continuous attrition, on the order of 28 failures a day at that rate. What you can engineer is the time-to-repair, and that is what the spares-and-RMA machine exists to compress. Availability is MTBF / (MTBF + MTTR); when MTBF is fixed by physics and supply and MTTR is small relative to it, every hour you shave off MTTR is a direct, near-linear gain in the fraction of the fleet that is productive.
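The availability relation and the fleet arithmetic above can be written out as a short worked derivation. The per-GPU MTBF is inferred here from the quoted one-failure-per-512-GPUs-per-7-days rate by assuming failures are spread evenly across the fleet, which is a simplification.

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
  = \frac{1}{1 + \mathrm{MTTR}/\mathrm{MTBF}}
  \approx 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}}
  \qquad (\mathrm{MTTR} \ll \mathrm{MTBF})

\mathrm{MTBF}_{\text{per GPU}} \approx 512 \times 7 \times 24\ \mathrm{h} = 86{,}016\ \mathrm{h}

\lambda_{\text{fleet}} \approx \frac{100{,}000\ \text{GPUs}}{512 \times 7\ \text{days}} \approx 27.9\ \text{failures/day}
```

The first-order form shows why repair time is the lever: with MTBF fixed, unavailability is approximately MTTR / MTBF, so it scales in direct proportion to MTTR.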

This is why the close-and-deep spare beats the cheap-and-distant one. The Delta study's 0.3-hour average MTTR was achievable because recovery did not wait on logistics — the part was already there. Push the spare to a regional depot to save inventory cost and you have re-introduced hours-to-days of logistics latency into the denominator, dragging availability down and forcing the synchronous-job overprovisioning tax back up from 5% toward 20%. The spares decision is an MTTR decision, and MTTR is the term you actually control. → goodput framing in Chapter 14.1; the reliability rethink in Chapter 12.2.
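The local-versus-distant comparison can be sketched numerically. This is a per-GPU steady-state model only: it does not reproduce the Delta study's job-level availability or overprovisioning figures, and the 48-hour depot MTTR is a hypothetical value chosen to stand in for "hours-to-days".

```python
HOURS_PER_DAY = 24


def per_gpu_mtbf_hours(gpus_per_failure: int = 512, days_per_failure: int = 7) -> int:
    """Per-GPU MTBF implied by 'one failure per N GPUs every D days' (even spread assumed)."""
    return gpus_per_failure * days_per_failure * HOURS_PER_DAY


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def gpus_down_on_average(fleet: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Expected number of GPUs out of service at any instant."""
    return fleet * (1.0 - availability(mtbf_hours, mttr_hours))


mtbf = per_gpu_mtbf_hours()
scenarios = {
    "local spare (0.3 h, Delta-measured MTTR)": 0.3,
    "regional depot (48 h, hypothetical)": 48.0,
}
for label, mttr in scenarios.items():
    down = gpus_down_on_average(100_000, mtbf, mttr)
    print(f"{label}: availability={availability(mtbf, mttr):.6f}, GPUs down={down:.1f}")
```

Even this simple model shows the ratio between the two scenarios tracking the ratio of their MTTRs. For a large synchronous job the practical effect is larger still, because one missing GPU can stall many others.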

Logistics at GW scale

At a single GW-class campus the spares-and-RMA operation is a small logistics company in its own right, and the constraints are not the consumer-electronics constraints most operators have intuitions for. The hardware is heavy (a populated NVL72-class rack runs into the tonnes; a compute tray is a two-person lift), fragile in transit (shock and tilt damage to integrated racks is a documented failure mode, and ESD discipline is mandatory on boards and optics), fluid-wetted on the reverse leg (a returned liquid-cooled tray must be drained, capped, and handled so residual coolant does not damage it or violate shipping rules), and export-controlled (advanced accelerators move under tightening export regimes, so even a warranty RMA across a border needs the right licenses, import-of-record, and customs classification or the box sits in a bonded warehouse for weeks).

The disciplines that make this work are mundane and decisive. Serialized tracking of every FRU — by serial number, not part number — so a failed unit's warranty status, RMA history, and lemon-flag (→ lemon-node detection in Chapter 14.3) are known on scan, and so a unit that has bounced through repair twice is retired instead of reinstalled. Tiered stocking that puts hot spares on-rack or in-row for the highest-failure-rate FRUs, a site cage for the medium tier, and a regional depot for the low-velocity, high-value, long-lead items — minimizing both stockout risk and total inventory carrying cost. A defined reverse-logistics path with the OEM and a repair partner, including who is import-of-record for cross-border returns, so the replenishment gate is a known number you can plan a pool depth around, not a surprise.
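The serialized-tracking discipline can be sketched as a per-serial record that answers, on scan, whether a unit is still in warranty and whether it may go back into a rack. The `FruRecord` class, its fields, and the two-bounce retirement threshold are illustrative assumptions drawn from the description above, not a reference schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Assumption taken from the text: a unit that has bounced through repair
# twice is retired instead of reinstalled.
MAX_REPAIR_BOUNCES = 2


@dataclass
class FruRecord:
    """Hypothetical tracking record keyed by serial number, not part number."""
    serial: str
    part_number: str
    warranty_end: date
    repair_history: list[str] = field(default_factory=list)
    lemon_flag: bool = False

    def in_warranty(self, today: date) -> bool:
        return today <= self.warranty_end

    def record_repair(self, rma_id: str) -> None:
        self.repair_history.append(rma_id)

    def should_retire(self) -> bool:
        """Retire flagged lemons and units that have hit the repair-bounce limit."""
        return self.lemon_flag or len(self.repair_history) >= MAX_REPAIR_BOUNCES


unit = FruRecord("SN-EXAMPLE-42", "PN-EXAMPLE", warranty_end=date(2027, 6, 30))
unit.record_repair("RMA-0001")
print(unit.in_warranty(date(2026, 1, 15)), unit.should_retire())
unit.record_repair("RMA-0002")
print(unit.should_retire())
```

Keying on the serial is what lets repair history follow a specific unit across sites and depots; two units with the same part number can have very different histories.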

Export controls and warranty cliffs turn a clean RMA into stranded silicon

Two failure modes recur at the boundary between operations and the supply chain. First, export-control friction: returning a failed advanced GPU across a border for OEM repair can require licenses and classification that a logistics team treats as routine until a return is held in customs, leaving the consumed spare un-replenished and the pool one failure shallower than the plan assumed. Build the import-of-record and license path before the first RMA, not during it. Second, the warranty cliff: standard hardware warranties run shorter than the asset's deployed life on a fleet that is run hard at high utilization, so a class of failures predictably falls outside warranty mid-life — at which point the disposition silently shifts from free RMA to paid depot-repair or harvest, and a spares budget that assumed warranty coverage is suddenly carrying real repair cost. Model the post-warranty period explicitly; it is when harvest and self-repair stop being optional.
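Modeling the post-warranty period can start as simple arithmetic: split the deployed life at the warranty boundary and count expected failures on each side. The sketch assumes a constant AFR and a population held constant by replacement; the warranty and deployment durations in the example are hypothetical, and the function name is illustrative.

```python
def failures_by_coverage(units: int, afr: float,
                         warranty_years: float, deployed_years: float) -> tuple[float, float]:
    """Return (expected in-warranty failures, expected post-warranty failures).

    Assumes a constant AFR and that failed units are replaced, so the
    population stays at `units` for the whole deployed life.
    """
    covered_years = min(warranty_years, deployed_years)
    uncovered_years = max(deployed_years - warranty_years, 0.0)
    return units * afr * covered_years, units * afr * uncovered_years


# Hypothetical example: 10,000 GPUs at the ~9% AFR quoted in this chapter,
# a 3-year warranty, and a 5-year deployed life.
covered, uncovered = failures_by_coverage(10_000, 0.09, 3.0, 5.0)
print(f"in warranty: {covered:.0f}, post-warranty: {uncovered:.0f}")
```

The post-warranty count is the volume that has to be absorbed by paid depot repair or harvest, so it is the number a spares budget needs in place of an assumed free RMA.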

~9% / yr

effective GPU annualized failure rate in production fleets; cumulative risk >25% over 3 yr

2025 · Industry synthesis off Meta Llama 3 + academic resilience studies

~1 / 512 GPUs / 7 days

best-in-class mature-cluster GPU failure rate; new clusters fail far more often during the 3-4 wk burn-in

2025 · SemiAnalysis (100k H100 clusters)

20% → 5%

synchronous-job overprovisioning cut when GPU availability rises 99.5% → 99.9% (4x)

2025 · Characterizing GPU Resilience (arXiv 2503.11901)

0.3 hr

average measured MTTR for GPU recovery when the spare is local (Delta, 1,168 GPUs)

2025 · arXiv 2503.11901

8 rows / 512 remaps

NVIDIA HBM row-remap RMA thresholds: 8 remapped rows per bank, or 512 total, or a duplicate remap

2025 · NVIDIA GPU Memory Error Management docs

30.1% + 17.2%

Llama 3 interruptions from faulty GPU (30.1%) and HBM3 (17.2%) — the dominant RMA classes

2024 · Meta (Llama 3 paper)

>2.5M hr MTBF

CPO target vs <1M hr for pluggable optics; pluggable swap = minutes, no switch downtime

2025 · OCP CPO reliability workshops / Introl

~20-40%

residual value of a 3-yr-old GPU — the floor on what a harvested or RMA'd board is worth

2025 · Hashrate Index / CNBC synthesis

Warranty-channel vs self-spare: the operating-model fork

The strategic decision that sits above all the tactics is whether you run inside the OEM warranty/RMA channel or around it with a self-spared, self-repaired pool. Inside the channel, the OEM owns the repair, eats the board-level cost, and supplies replacements — but you accept its turnaround time (weeks, gated by its own allocation and your shipping legs) and you live with the warranty cliff. Self-sparing, you pre-buy a deep local pool and stand up triage and possibly board-repair capability — turnaround collapses to a hot-swap, but you carry the inventory (depreciating, supply-constrained, export-controlled) and the engineering payroll, and you self-insure the repair cost. The largest operators converge on a hybrid: self-spare for the high-failure-rate, fast-swap FRUs to protect MTTR and goodput, while still RMAing the expensive board-level units back to the OEM to avoid eating their repair cost and to keep warranty intact where it still applies.

Warranty-channel RMA vs self-spare / self-repair

Dimension	OEM warranty / RMA channel	Self-spare / self-repair
Turnaround (floor MTTR)	Hot-swap if local spare exists; else weeks	Minutes — deep local pool by design
Replenishment latency	Weeks; gated by OEM allocation + shipping	Pre-bought; gated only by your reorder cadence
Capital tied up	Minimal spare inventory	High — depreciating, supply-constrained silicon
Repair cost owner	OEM (under warranty)	You — board-repair partner or in-house bench
Control / dependency	Dependent on OEM SLA & channel	Self-determined; needs engineering + logistics depth
Best fit	Smaller fleets, in-warranty assets, low-velocity FRUs	Large fleets, high-velocity FRUs, post-warranty harvest

The two poles of the operating-model fork; mature GW-scale operators run a deliberate hybrid across the two by failure class.

Deep dive: harvest economics and the end-of-life pool

Harvesting becomes the rational disposition the moment two conditions hold: the generation has rolled off the new-build supply (so individual FRUs are otherwise unobtainable at any reasonable lead time), and the warranty has lapsed (so RMA is no longer free). At that point a write-off board is not scrap — it is a parts donor. A GPU board with a dead package frequently retains healthy NVLink subassemblies, power stages, fans, heatsinks/cold-plates, and an intact baseboard, and the residual value of even a 3-year-old accelerator sits around 20-40% (Hashrate Index / CNBC synthesis, 2025), most of which lives in those still-good subcomponents.

The operational discipline that makes harvest work is the same serialized tracking that runs the RMA pipeline: a harvested FRU must carry its provenance and its cumulative repair/harvest history, so a twice-bounced part is retired rather than silently reinstalled into a production rack where it becomes a lemon. Done well, the end-of-life harvest pool extends the serviceable life of an aging generation past the point where the OEM will support it, deferring a forced refresh — which ties directly into refresh and ITAD strategy (→ Chapter 14.9), where the same boards eventually exit to data sanitization and resale rather than the spares bench.

Anti-patterns

The recurring mistakes all come from treating spares as an inventory problem rather than an availability instrument:

Flat-percentage sparing. Holding "2% of the fleet" over-spares reliable boards and under-spares optics and HBM-heavy GPU classes. Size each failure class against its own AFR and lead time, or you will stock out on exactly the parts that fail most.
Distant pool to save carrying cost. Pushing spares to a regional depot looks efficient on the inventory line and quietly re-introduces hours-to-days of MTTR, dragging availability down and pushing the synchronous-job overprovisioning tax back up. The carrying cost you saved is dwarfed by the goodput you lost.
RMA pipeline with no replenishment model. Trusting that a consumed spare "comes back" without modeling the weeks-long RMA latency, then walking into a stockout three failures deep. The pool depth must cover the replenishment gate, not the swap time.
Ignoring the warranty cliff and export friction. Budgeting spares as if warranty covers the full deployed life, then discovering mid-life that the disposition has shifted to paid repair or harvest — or that a cross-border RMA is stuck in customs. Both are predictable; plan for them at procurement, not at the first failure.

This chapter sits inside the operations lifecycle and inherits its inputs from upstream. The failure rates and lemon-node taxonomy that drive pool depth are quantified in Chapter 14.3; the training-resilience software that survives a failure long enough for the swap is in Chapter 14.4; the cooling-plant spares (CDUs, pumps, QDCs) overlap with predictive/preventive maintenance in Chapter 14.5. The goodput economics that price every spare live in Chapter 14.1 and the reliability rethink in Chapter 12.2; concurrent maintainability that bounds a live swap is in Chapter 12.1. Upstream, the long-lead supply gate (CoWoS/HBM) that sets replenishment latency is in Chapter 2.3; the serviceability of dense liquid-cooled racks in Chapter 5.4; and the asset economics and residual-value floor in Chapter 1.8. Downstream, the harvested end-of-life board flows into refresh, depreciation, and ITAD in Chapter 14.9.