Chapter 14.6
Spares Strategy, RMA Logistics & Repair Operations
At a fleet of hundreds of thousands of accelerators, spares are not an inventory line — they are an availability instrument: the depth of the on-site pool and the speed of the hot-swap, not the OEM warranty, set how much goodput the cluster actually earns back from every failure.
What you'll decide here
- Your spares depth per failure class — set from the component's annualized failure rate and the replenishment lead time, not a flat percentage — and where the pool physically lives (on-rack hot spares vs site cage vs regional depot).
- The repair-vs-replace-vs-harvest disposition for each failed unit: who can swap an FRU in minutes, what goes back to the OEM under RMA, what gets board-repaired at a depot, and what is cannibalized for parts.
- Whether you operate inside the OEM warranty/RMA channel or self-spare and self-insure — the fork that decides your turnaround time, your capital tied up in inventory, and who eats the depreciation on a failed board.
- The FRU granularity you design and buy to: tray-level, board-level, or module-level replaceability, because it sets both the spares bill of materials and the mean-time-to-repair on the data-hall floor.
- The logistics backbone at GW scale: serialized tracking, customs and import-of-record for cross-border RMA, ESD/shock-controlled transport, and the reverse-logistics path for liquid-cooled, weight-heavy, export-controlled hardware.
By the time a cluster is in steady-state operation, the reliability problem has stopped being a question of whether hardware fails and become a question of how fast you put it back. Chapter 14.3 quantified the failure rates; Chapter 14.4 covered the training-resilience software that survives a failure; this chapter is about the physical economy that closes the loop — the spare on the shelf, the technician with the FRU, the box going back to the OEM, and the board on a repair bench. The decision that governs all of it is rarely modeled at scoping time: how deep is your spares pool, how close does it sit to the rack, and how fast can a failed unit be dispositioned and replaced? Get that wrong and a fleet with excellent component reliability still bleeds goodput, because the bottleneck moves from the failure rate to the replacement rate.
Each disposition fork has a direct downstream cost. Spare too thin and you stall jobs waiting on parts that are weeks out on an RMA; spare too deep and you have stranded millions in depreciating silicon on a shelf, on a 2-3 year economic clock (→ Chapter 1.8). Route every failure through the OEM warranty channel and your turnaround is measured in weeks; self-spare and self-repair and you carry the inventory and the engineering, but you swap in minutes. Design for tray-level FRUs and your floor MTTR is short but your spares BoM is expensive; design for board- or module-level and you save inventory dollars but your technicians need more skill and more time at the rack. This chapter names each fork and its downstream cost, globally and vendor-neutrally, current to 2026.
Sparing models: from failure rate to pool depth
A spares pool is sized, not guessed. The honest model is a forecast: for each failure class, the expected number of failures over the replenishment lead time, plus a safety buffer sized to the variance and to the cost of a stockout. The two inputs are the component's annualized failure rate (AFR) and the lead time to get a replacement onto the shelf, and for AI accelerators in 2026 both inputs are hostile. GPUs in production fleets run an effective AFR in the high single digits: roughly 9% annualized, with cumulative failure risk crossing 25% over three years, dominated by the GPU package and its HBM stacks rather than the board substrate (industry synthesis based on Meta's Llama 3 data and academic resilience studies, 2024-2025). Lead times, meanwhile, are gated upstream by CoWoS packaging and HBM allocation (→ Chapter 2.3), so a replacement GPU tray is not a next-day commodity: it competes with new-build demand for the same constrained supply.
The naive approach — "hold 2% of the fleet as spares" — fails because failure rates are wildly non-uniform across the bill of materials. Optics and cables fail far more often than GPU boards; HBM-related faults dominate the GPU class; power-supply and fan FRUs fail on their own curves; CDUs, manifolds, and quick-disconnects fail on a mechanical/fluids curve that has nothing to do with silicon. A single flat percentage over-spares the reliable parts and under-spares the failure-prone ones. The correct model sizes each failure class against its own AFR and its own lead time, which is why the disposition table below is organized by component, not by a global ratio.
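The sizing rule above (expected failures over the lead time, plus a buffer sized to the variance) can be sketched as a Poisson service-level calculation. This is a minimal sketch, not a production model: it assumes failures arrive independently at a constant rate, and the function name `spares_depth`, the 99% service level, and every fleet count, AFR, and lead time other than the roughly 9% GPU figure are illustrative placeholders.

```python
import math


def spares_depth(units: int, afr: float, lead_time_days: float,
                 service_level: float = 0.99) -> int:
    """Smallest pool depth k with P(failures during lead time <= k) >= service_level.

    Assumes failures arrive as a Poisson process at a constant rate, which
    ignores burn-in spikes and correlated (batch) failures.
    """
    if not 0.0 < service_level < 1.0:
        raise ValueError("service_level must be strictly between 0 and 1")
    mean = units * afr * lead_time_days / 365.0  # expected failures in lead time
    if mean <= 0.0:
        return 0
    cdf = 0.0
    # Generous upper bound so floating-point rounding can never loop forever.
    limit = int(mean + 40.0 * math.sqrt(mean) + 40.0)
    for k in range(limit + 1):
        # Poisson pmf computed in log space to avoid underflow for large means.
        cdf += math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
        if cdf >= service_level:
            return k
    return limit


# Hypothetical fleet: unit counts, the optic and PSU AFRs, and all lead times
# are placeholders chosen only to show the per-class contrast.
fleet = [
    ("gpu_board", 100_000, 0.09, 60),
    ("optic", 400_000, 0.02, 14),
    ("psu", 50_000, 0.01, 14),
]
for name, units, afr, lead_days in fleet:
    depth = spares_depth(units, afr, lead_days)
    print(f"{name}: hold {depth} spares ({100 * depth / units:.2f}% of installed base)")
```

Run per failure class, the same function returns very different percentages of the installed base, which is the argument against a single flat ratio. A real model would also price the stockout and account for correlated failures, both of which this sketch leaves out.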
The RMA lifecycle
RMA — Return Merchandise Authorization — is the formal channel by which a failed unit goes back to the OEM for warranty repair or replacement. At fleet scale it is a pipeline with measurable stages, and the goodput you lose is the integral of how long a unit spends in each: detect → triage/qualify → swap → return → repair/replace → replenish. The expensive, often-skipped stages are the bookends. Triage/qualify determines whether the unit actually meets the RMA criteria — an OEM will reject a return that its field-diagnostic tool says is healthy, and a rejected RMA is wasted shipping plus a still-empty slot. Replenish is the silent killer: the slot on the floor is refilled from your local spare in minutes, but the spare you consumed is not back on the shelf until the RMA closes weeks later — so RMA latency does not gate floor MTTR, it gates how fast your pool recovers, and a slow RMA pipeline with a shallow pool is how you walk into a stockout three failures deep.
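The replenish-stage argument can be made concrete with a deliberately simple, deterministic simulation: every failure consumes one shelf spare immediately, and the replacement lands back on the shelf only after the RMA closes. The function name and the fixed daily failure count are assumptions of this sketch; real arrivals are random, so a real pool needs a buffer on top of what this shows.

```python
from collections import deque
from typing import Optional


def first_stockout_day(pool: int, failures_per_day: int, rma_days: int,
                       horizon_days: int = 365) -> Optional[int]:
    """Return the first day a failure finds the shelf empty, or None if never.

    Floor swaps are treated as instant; only the shelf refill waits on the RMA.
    """
    shelf = pool
    in_flight: deque = deque()  # (day the replacement lands, count)
    for day in range(horizon_days):
        # Replacements whose RMA has closed go back on the shelf first.
        while in_flight and in_flight[0][0] <= day:
            shelf += in_flight.popleft()[1]
        if shelf < failures_per_day:
            return day
        shelf -= failures_per_day
        in_flight.append((day + rma_days, failures_per_day))
    return None


# Hypothetical inputs: one failure per day and a 21-day RMA turnaround.
for pool in (3, 20, 21):
    print(pool, first_stockout_day(pool, failures_per_day=1, rma_days=21))
```

With these inputs the shelf survives only once the pool reaches the failure rate multiplied by the RMA turnaround (Little's law applied to the return pipeline). That is the sense in which pool depth must cover the replenishment gate, not the swap time.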
The qualify stage is more disciplined than most operators expect, and it is worth getting right because it is where RMAs are rejected. For GPU memory, the vendor publishes objective field-diagnosable thresholds: NVIDIA's row-remapping RMA policy qualifies a GPU once a bank accumulates eight remapped rows from uncorrectable errors, or on a duplicate remap of an already-remapped row, or after 512 total remappings — and on Blackwell a third remap attempt can trigger an on-package HBM channel repair against a spare channel, potentially avoiding the RMA entirely (NVIDIA GPU Memory Error Management docs, 2025). The operational lesson: instrument the row-remap failure flag in-band (NVML/nvidia-smi) and out-of-band (SMBPBI) so triage is automated and your returns are accepted on the first pass, not bounced back across a customs border.
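Automated triage on the row-remap failure flag might look like the sketch below. The `RemapStatus` fields are modeled on the four values that NVML's remapped-rows query (`nvmlDeviceGetRemappedRows` in the C API) reports; the class, the field names, and the triage labels are this sketch's own, and the telemetry collection itself is left out so the logic can run without a GPU. It keys on the failure flag rather than re-deriving the vendor thresholds, and a flagged unit would still need the OEM's field diagnostic before it ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemapStatus:
    """Per-GPU row-remap telemetry; field names are hypothetical."""
    correctable_rows: int
    uncorrectable_rows: int
    remap_pending: bool   # a remap is queued and takes effect on GPU reset
    remap_failure: bool   # the row-remapping failure flag


def triage(status: RemapStatus) -> str:
    """Map remap telemetry to a triage label (labels are illustrative)."""
    if status.remap_failure:
        # Candidate only: confirm with the vendor field diagnostic first.
        return "rma-candidate"
    if status.remap_pending:
        return "reset-required"
    if status.uncorrectable_rows > 0:
        return "watch"
    return "healthy"


print(triage(RemapStatus(0, 2, remap_pending=False, remap_failure=False)))
```

Keeping the decision a pure function of the telemetry lets the same rule run against in-band and out-of-band readings and be replayed against historical data when an RMA is rejected.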
Repair vs replace vs harvest: the disposition fork
Every failed unit gets one of four dispositions, and choosing wrong is either slow or wasteful. Hot-swap from local spare is the default for anything field-replaceable: the floor MTTR is minutes and the failed unit is dealt with offline. RMA to OEM sends the unit back under warranty — zero repair cost to you, but weeks of pipeline latency and a spare consumed in the interim. Depot board-repair sends the unit to a specialist bench that reworks at component level (reflow a failed VRM, replace a fan module, re-seat or replace an optic) — viable for boards out of warranty or where the failure is a cheap discrete part, but it requires a repair partner and a logistics leg. Harvest/cannibalize strips a write-off unit for its still-good FRUs — the rational end-state for a board with a dead GPU package but healthy NVLink, power, and cooling subassemblies, especially once the generation is off the new-build supply and the parts are otherwise unobtainable.
The fork that decides which of these you can even reach is granularity. A fault domain that is replaceable as a small FRU — a single optic, a single power-supply, a single fan, a single GPU board on an OAM/UBB-style baseboard — is fast and cheap to swap and cheap to spare. A fault domain welded into a large integrated assembly — a full compute tray, a fused NVLink backplane, a sealed liquid-cooled module — is slower to swap, more expensive to spare, and more likely to force an RMA of a large, heavy, valuable unit when only a small part of it failed. The trend in 2026 dense racks (GB200 NVL72 and successors, → Chapter 5.4) cuts against serviceability: blind-mate liquid manifolds and copper NVLink backplanes raise integration density and complicate the in-place swap, so the FRU-granularity decision has to be made at procurement, against the rack architecture you are buying.
| Failure class | Relative failure rate | Typical FRU granularity | Default disposition | Sparing posture | Replenishment gate |
|---|---|---|---|---|---|
| GPU package / HBM | High (~9% AFR; dominant fault class) | GPU board on baseboard, or full tray | RMA under warranty; harvest end-of-life | Deep local pool + hot spares on-rack | CoWoS/HBM supply; weeks-to-months |
| Optics / transceivers | Highest volume of swaps (per-link) | Pluggable module (hot-swap, minutes) | Hot-swap; depot-clean or scrap | Bulk consumable; reorder buffer at site | Commodity; days-to-weeks |
| Cables (DAC/AEC/fiber) | High in burn-in; lower steady-state | Individual cable | Replace from bin | Bulk consumable on site | Commodity; days |
| PSU / fans / VRMs | Moderate; wear-driven | Discrete hot-swap module | Replace; depot board-repair for VRM | Modest local pool | Commodity; days-to-weeks |
| CDU / manifold / QDC (cooling) | Low count, high consequence | Pump, valve, quick-disconnect, hose | Replace critical path; depot-rebuild pumps | N+1 plant spares + critical QDC kit | Specialist; weeks (→ Chapter 14.5) |
| NVLink backplane / switch tray | Low but stalls a scale-up domain | Switch tray; backplane (rack-level) | RMA; on-call OEM field service | Spare switch tray per N racks | OEM allocation; weeks |
Field-replaceable units and design for serviceability
Serviceability is a design property you inherit from the hardware vendor, partly negotiate at procurement, and pay for every day in operation. The good FRU is hot-swappable, blind-mate, keyed, and serial-tracked: an optic slides out and a spare slides in without taking the switch down, MTTR in minutes (the canonical reason Meta kept copper inside the rack — passive DAC's far better per-link MTBF makes the rare swap trivial, while the link that does fail does not require shutting anything down). The hostile FRU is the one that forces you to drain a liquid loop, break a blind-mate manifold, or pull a 1.4 kA busbar connection to reach a failed part — work that is slower, higher-risk, and bounded by concurrent-maintainability rules (→ Chapter 12.1) so it does not take down neighbors.
The liquid-cooling transition rewrites the serviceability calculus. Replacing a GPU tray in a direct-to-chip rack is no longer a dry electrical operation — it is a fluid-handling operation involving dripless quick-disconnects (UQD/UQDB, → Chapter 5.4), leak-detection re-verification, and re-priming. That raises both the skill floor for the technician and the MTTR for the swap, and it means your spares strategy now has a cooling bill of materials — spare QDCs, hoses, gaskets, CDU pumps — that an air-cooled fleet never had. Design-for-serviceability reviews at procurement should score exactly this: how many minutes, how many tools, and how much fluid-handling does each likely FRU swap require, live, on a concurrently-maintainable rack.
Deep dive: why MTTR, not MTBF, is the goodput lever at fleet scale
Reliability intuition fixates on mean-time-between-failures, but at a fleet of hundreds of thousands of accelerators the failure rate is effectively a constant you cannot engineer away: best-in-class operators still see roughly one failure per 512 GPUs every 7 days (SemiAnalysis, 2025), so a 100k-GPU cluster lives with continuous attrition. What you can engineer is the time-to-repair, and that is what the spares-and-RMA machine exists to compress. Availability is MTBF / (MTBF + MTTR); when MTBF is fixed by physics and supply and MTTR is small relative to it, unavailability is approximately MTTR / MTBF, so every hour you shave off MTTR is a direct, near-linear gain in the fraction of the fleet that is productive.
This is why the close-and-deep spare beats the cheap-and-distant one. The Delta study's 0.3-hour average MTTR was achievable because recovery did not wait on logistics — the part was already there. Push the spare to a regional depot to save inventory cost and you have re-introduced hours-to-days of logistics latency into the denominator, dragging availability down and forcing the synchronous-job overprovisioning tax back up from 5% toward 20%. The spares decision is an MTTR decision, and MTTR is the term you actually control. → goodput framing in Chapter 14.1; the reliability rethink in Chapter 12.2.
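A short worked calculation makes the lever visible. The per-GPU MTBF below is derived from the one-failure-per-512-GPUs-per-7-days figure quoted above, and the 0.3-hour MTTR is the Delta number; the site-cage and regional-depot MTTRs are hypothetical placeholders, and the model treats every GPU as independent, which a synchronous job is not.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_units_down(fleet_size: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Average number of units out of service at any instant."""
    return fleet_size * (1.0 - availability(mtbf_hours, mttr_hours))


# One failure per 512 GPUs per 7 days implies this per-GPU MTBF in hours.
PER_GPU_MTBF_HOURS = 512 * 7 * 24

# 0.3 h is the Delta figure; the other two MTTRs are illustrative guesses.
for label, mttr in [("on-rack spare", 0.3), ("site cage", 4.0), ("regional depot", 48.0)]:
    down = expected_units_down(100_000, PER_GPU_MTBF_HOURS, mttr)
    print(f"{label}: MTTR {mttr} h -> ~{down:.1f} GPUs down on average")
```

Because MTTR is tiny next to MTBF, the number of units down scales almost exactly with MTTR. For a synchronous job the real cost is larger than this per-unit count suggests, since one missing GPU can stall the whole job.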
Logistics at GW scale
At a single GW-class campus the spares-and-RMA operation is a small logistics company in its own right, and the constraints are not the consumer-electronics constraints most operators have intuitions for. The hardware is heavy (a populated NVL72-class rack runs into the tonnes; a compute tray is a two-person lift), fragile in transit (shock and tilt damage to integrated racks is a documented failure mode, and ESD discipline is mandatory on boards and optics), fluid-wetted on the reverse leg (a returned liquid-cooled tray must be drained, capped, and handled so residual coolant does not damage it or violate shipping rules), and export-controlled (advanced accelerators move under tightening export regimes, so even a warranty RMA across a border needs the right licenses, import-of-record, and customs classification or the box sits in a bonded warehouse for weeks).
The disciplines that make this work are mundane and decisive. Serialized tracking of every FRU — by serial number, not part number — so a failed unit's warranty status, RMA history, and lemon-flag (→ lemon-node detection in Chapter 14.3) are known on scan, and so a unit that has bounced through repair twice is retired instead of reinstalled. Tiered stocking that puts hot spares on-rack or in-row for the highest-failure-rate FRUs, a site cage for the medium tier, and a regional depot for the low-velocity, high-value, long-lead items — minimizing both stockout risk and total inventory carrying cost. A defined reverse-logistics path with the OEM and a repair partner, including who is import-of-record for cross-border returns, so the replenishment gate is a known number you can plan a pool depth around, not a surprise.
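Serialized tracking reduces to a record keyed by serial number plus a reinstall gate. The sketch below is one hypothetical shape for it: the class names, fields, and identifiers are illustrative, and the only rules encoded are the two stated above, namely that a lemon-flagged unit and a unit that has been through repair twice are retired rather than reinstalled.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_REPAIR_BOUNCES = 2  # a unit that has been through repair twice is retired


@dataclass
class FruRecord:
    serial: str
    part_number: str
    in_warranty: bool
    lemon_flag: bool = False
    repair_history: List[str] = field(default_factory=list)  # RMA or repair IDs

    def can_reinstall(self) -> bool:
        """True only if the unit is neither a flagged lemon nor over the bounce limit."""
        return not self.lemon_flag and len(self.repair_history) < MAX_REPAIR_BOUNCES


class FruRegistry:
    """In-memory stand-in for an asset database, keyed by serial number."""

    def __init__(self) -> None:
        self._by_serial: Dict[str, FruRecord] = {}

    def register(self, record: FruRecord) -> None:
        self._by_serial[record.serial] = record

    def record_repair(self, serial: str, repair_id: str) -> None:
        self._by_serial[serial].repair_history.append(repair_id)

    def scan(self, serial: str) -> FruRecord:
        """What a technician sees on scan: warranty, history, lemon flag."""
        return self._by_serial[serial]


registry = FruRegistry()
registry.register(FruRecord("SN-0001", "PN-GPU-BOARD", in_warranty=True))
registry.record_repair("SN-0001", "RMA-1001")
registry.record_repair("SN-0001", "RMA-1042")
print(registry.scan("SN-0001").can_reinstall())
```

Keying on serial number rather than part number is what lets the gate work: two boards with the same part number can have very different histories, and only the serial carries that history through the repair loop and, later, through harvest.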
Warranty-channel vs self-spare: the operating-model fork
The strategic decision that sits above all the tactics is whether you run inside the OEM warranty/RMA channel or around it with a self-spared, self-repaired pool. Inside the channel, the OEM owns the repair, eats the board-level cost, and supplies replacements — but you accept its turnaround time (weeks, gated by its own allocation and your shipping legs) and you live with the warranty cliff. Self-sparing, you pre-buy a deep local pool and stand up triage and possibly board-repair capability — turnaround collapses to a hot-swap, but you carry the inventory (depreciating, supply-constrained, export-controlled) and the engineering payroll, and you self-insure the repair cost. The largest operators converge on a hybrid: self-spare for the high-failure-rate, fast-swap FRUs to protect MTTR and goodput, while still RMAing the expensive board-level units back to the OEM to avoid eating their repair cost and to keep warranty intact where it still applies.
| Dimension | OEM warranty / RMA channel | Self-spare / self-repair |
|---|---|---|
| Turnaround (floor MTTR) | Hot-swap if local spare exists; else weeks | Minutes — deep local pool by design |
| Replenishment latency | Weeks; gated by OEM allocation + shipping | Pre-bought; gated only by your reorder cadence |
| Capital tied up | Minimal spare inventory | High — depreciating, supply-constrained silicon |
| Repair cost owner | OEM (under warranty) | You — board-repair partner or in-house bench |
| Control / dependency | Dependent on OEM SLA & channel | Self-determined; needs engineering + logistics depth |
| Best fit | Smaller fleets, in-warranty assets, low-velocity FRUs | Large fleets, high-velocity FRUs, post-warranty harvest |
Deep dive: harvest economics and the end-of-life pool
Harvesting becomes the rational disposition the moment two conditions hold: the generation has rolled off the new-build supply (so individual FRUs are otherwise unobtainable at any reasonable lead time), and the warranty has lapsed (so RMA is no longer free). At that point a write-off board is not scrap — it is a parts donor. A GPU board with a dead package frequently retains healthy NVLink subassemblies, power stages, fans, heatsinks/cold-plates, and an intact baseboard, and the residual value of even a 3-year-old accelerator sits around 20-40% (Hashrate Index / CNBC synthesis, 2025), most of which lives in those still-good subcomponents.
The operational discipline that makes harvest work is the same serialized tracking that runs the RMA pipeline: a harvested FRU must carry its provenance and its cumulative repair/harvest history, so a twice-bounced part is retired rather than silently reinstalled into a production rack where it becomes a lemon. Done well, the end-of-life harvest pool extends the serviceable life of an aging generation past the point where the OEM will support it, deferring a forced refresh — which ties directly into refresh and ITAD strategy (→ Chapter 14.9), where the same boards eventually exit to data sanitization and resale rather than the spares bench.
Anti-patterns
The recurring mistakes all come from treating spares as an inventory problem rather than an availability instrument:
- Flat-percentage sparing. Holding "2% of the fleet" over-spares reliable boards and under-spares optics and HBM-heavy GPU classes. Size each failure class against its own AFR and lead time, or you will stock out on exactly the parts that fail most.
- Distant pool to save carrying cost. Pushing spares to a regional depot looks efficient on the inventory line and quietly re-introduces hours-to-days of MTTR, dragging availability down and pushing the synchronous-job overprovisioning tax back up. The carrying cost you saved is dwarfed by the goodput you lost.
- RMA pipeline with no replenishment model. Trusting that a consumed spare "comes back" without modeling the weeks-long RMA latency, then walking into a stockout three failures deep. The pool depth must cover the replenishment gate, not the swap time.
- Ignoring the warranty cliff and export friction. Budgeting spares as if warranty covers the full deployed life, then discovering mid-life that the disposition has shifted to paid repair or harvest — or that a cross-border RMA is stuck in customs. Both are predictable; plan for them at procurement, not at the first failure.