
Chapter 10.6

Observability, Telemetry & GPU Health

Observability for a GPU fleet is the closed loop that converts a noisy hardware-failure stream into goodput, and the decisions you make about what to detect, how fast, and what to store determine whether your cluster spends its life training or restarting.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

What you instrument as the headline metric — raw GPU utilization (which lies) versus goodput (effective training/serving time), because the metric you put on the wall is the one your operators optimize.
Where you draw the detection line on the XID/SXID and SDC taxonomy — which errors auto-cordon a node, which page a human, and which are merely logged — and the false-positive cost of drawing it too aggressively.
Whether you run continuous silent-data-corruption detection (opportunistic drain-and-test plus in-production sampling) or accept that some fraction of your training tokens are quietly wrong.
How telemetry from three planes — compute (DCGM/NVML), fabric (PFC/ECN counters, queue depth), and facility (CDU/leak/inlet-temp) — is correlated to a single node, so a thermal event and a throttled GPU resolve to one root cause instead of three unrelated alerts.
Self-hosted versus managed observability, and the cardinality budget that governs both — because per-GPU, per-job, per-tenant labels multiply into a time-series count that can cost more than the GPUs it watches.

A frontier training cluster fails constantly. Meta's published Llama 3 405B snapshot recorded 419 unplanned interruptions over 54 days on 16,384 H100s (roughly one every three hours), with 78% hardware-caused and 58.7% GPU-related (Meta, Llama 3 paper, 2024). A best-in-class operator running mature H100 fleets sees a mean time between failures of about seven days per 512 GPUs (SemiAnalysis, 2025); scale that linearly to a 100k-GPU site and you are absorbing a hardware fault roughly every 50 minutes, around the clock. The question is never whether components fail. It is whether your observability stack notices the failure in seconds, attributes it to the right node, and triggers the right recovery, or whether it lets a synchronous job limp along behind a straggler or, worse, keep training on corrupted gradients that no counter flagged.
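The scaling arithmetic behind these figures is easy to check. A minimal sketch, assuming faults are independent and the fault rate scales linearly with GPU count (a simplification); the function names are illustrative, not from any cited source:

```python
def hours_between_interruptions(interruptions: int, days: float) -> float:
    """Mean wall-clock hours between interruptions over an observation window."""
    return days * 24 / interruptions


def fleet_fault_interval_minutes(mtbf_days: float, block_gpus: int, fleet_gpus: int) -> float:
    """Expected minutes between faults across a whole fleet.

    Assumption: faults are independent and the rate scales linearly
    with GPU count, so a fleet of N blocks fails N times as often.
    """
    blocks = fleet_gpus / block_gpus
    return mtbf_days * 24 * 60 / blocks


# Figures quoted in the text: 419 interruptions in 54 days; 7-day MTBF per 512 GPUs.
llama3_hours = hours_between_interruptions(419, 54)
site_minutes = fleet_fault_interval_minutes(7, 512, 100_000)
print(f"Llama 3 run: one interruption every {llama3_hours:.1f} h")
print(f"100k-GPU site: one fault every {site_minutes:.0f} min")
```

New clusters fail more often than the mature-fleet figure used here, so the interval this produces is best read as an optimistic bound.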

This chapter treats observability as the operational nervous system that closes that loop. We build it from the bottom: the compute-plane telemetry (DCGM/NVML and the XID/SXID error taxonomy), the fabric-plane telemetry (congestion and health counters that tell you whether the network or the GPU is the bottleneck), and the facility-plane telemetry (liquid-cooling and inlet correlation). We then confront the hardest detection problem of all — silent data corruption, the failure that produces no error at all — and elevate goodput from a marketing number to the headline metric that every other signal exists to protect. We close on the architecture and cost decision that quietly dominates the others: self-hosted versus managed, and the cardinality budget that governs what you can afford to watch. Each detection fork is scored by the goodput or dollar cost of getting it wrong.

The headline metric is the design decision

Put GPU utilization on the operations wall and your team will drive it to 100% and learn nothing, because a GPU spinning at 100% on a wedged all-reduce, recomputing a corrupted step, or busy-waiting on a straggler reads identically to one doing useful work. Put goodput on the wall (the fraction of wall-clock GPU-time spent on committed, correct progress) and every subsystem reorganizes around it: cordon latency, checkpoint cadence, SDC detection, congestion control. Industry-average effective training time sits near 90%, best-in-class near 96% (SemiAnalysis ClusterMAX / CoreWeave, 2025). The six-point gap between average and best is almost entirely an observability-and-recovery gap, and at 100k-GPU scale, six points of goodput is hundreds of millions of dollars of compute either earned or burned.
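To put a rough size on that gap, the sketch below converts goodput points into GPU-hours per year. The GPU-hour price is a hypothetical placeholder, not a figure from the sources cited here; substitute your own blended cost.

```python
HOURS_PER_YEAR = 8760


def goodput_gap_gpu_hours(gpus: int, gap_points: int, hours: int = HOURS_PER_YEAR) -> float:
    """GPU-hours per period lost to a goodput gap given in percentage points."""
    return gpus * hours * gap_points / 100


# Hypothetical blended cost per GPU-hour: an assumption for illustration only.
ASSUMED_USD_PER_GPU_HOUR = 2.50

lost_gpu_hours = goodput_gap_gpu_hours(gpus=100_000, gap_points=6)
lost_usd = lost_gpu_hours * ASSUMED_USD_PER_GPU_HOUR
print(f"{lost_gpu_hours:,.0f} GPU-hours/year, about ${lost_usd:,.0f}/year at the assumed price")
```

The result is an annual figure, so the total over a multi-year fleet lifetime is several times larger.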

The three telemetry planes (and why correlation is the hard part)

A GPU node emits signal from three independent stacks that historically belonged to three different teams. The compute plane (accelerator utilization, memory, power, temperature, clocks, ECC counts, XID/SXID events) flows from NVIDIA's NVML library and is aggregated by DCGM (Data Center GPU Manager), whose dcgm-exporter publishes Prometheus-format metrics that any open observability platform can ingest (NVIDIA DCGM docs, 2025). The fabric plane (link flaps, CRC/symbol errors, PFC pause frames, ECN marks, queue depth, NCCL collective timings) comes from switch counters (UFM on InfiniBand; NetQ or switch-native telemetry on RoCE/Ethernet) and from the collective library itself. The facility plane (coolant inlet temperature and flow, CDU pressure, leak detection, rack PDU draw) comes from DCIM and BMS systems on entirely different protocols (Redfish, Modbus, SNMP).

Collecting each plane is solved. Correlating them to a single node, at the moment of an incident, is where most stacks fall down — and the failure is expensive. A GB200 NVL72 GPU throttles up to ~50% when its coolant inlet drifts above the DLC envelope; if the compute plane reports a throttled, underperforming GPU while the facility plane separately reports a CDU pressure excursion, and nothing joins them, you dispatch a hardware tech to RMA a perfectly good GPU while the real fault — a cooling loop — keeps degrading the rest of the rack. The correlation key is mundane but decisive: every metric, in every plane, must carry a consistent node/rack/job label set so a thermal event, a throttle, and a slow collective resolve to one root cause instead of three unrelated pages. This is the same correlation discipline the facility side builds in Chapter 14.2; the difference here is that the join must happen at job granularity, fast enough to cordon a node before it poisons a synchronous step.
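The join described above can be sketched in a few lines. This is an illustrative toy, assuming every event already carries a consistent rack label and a timestamp; the `Event` type, the 120-second window, and the "facility outranks fabric outranks compute" heuristic are all assumptions made for the example, not a documented product behavior.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float     # seconds since epoch
    plane: str    # "compute", "fabric", or "facility"
    rack: str     # shared label that makes the join possible
    node: str     # empty string for rack-level facility events
    signal: str


def correlate(events, window_s: float = 120.0):
    """Group events on the same rack that fall within window_s of the group's first event."""
    by_rack = defaultdict(list)
    for event in sorted(events, key=lambda e: e.ts):
        by_rack[event.rack].append(event)
    incidents = []
    for rack_events in by_rack.values():
        current = [rack_events[0]]
        for event in rack_events[1:]:
            if event.ts - current[0].ts <= window_s:
                current.append(event)
            else:
                incidents.append(current)
                current = [event]
        incidents.append(current)
    return incidents


def likely_root_plane(incident) -> str:
    """Heuristic (assumption): a facility signal in the window outranks fabric, then compute."""
    planes = {event.plane for event in incident}
    for plane in ("facility", "fabric", "compute"):
        if plane in planes:
            return plane
    raise ValueError("incident has no events from a known plane")


events = [
    Event(1000.0, "facility", "r12", "", "cdu_pressure_excursion"),
    Event(1010.0, "compute", "r40", "r40-n01", "ecc_correctable_burst"),
    Event(1030.0, "compute", "r12", "r12-n03", "gpu_thermal_throttle"),
    Event(1045.0, "fabric", "r12", "r12-n03", "slow_collective"),
]
incidents = correlate(events)
for incident in incidents:
    print(incident[0].rack, likely_root_plane(incident), [e.signal for e in incident])
```

In this example the throttle and the slow collective on rack r12 fold into the CDU excursion as one incident, which is the outcome that avoids the needless RMA.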

The XID/SXID taxonomy: drawing the cordon line

NVIDIA's XID errors are the primary hardware-fault vocabulary for GPUs; SXID is the NVSwitch equivalent for the scale-up fabric. The taxonomy is not flat — XIDs span the full severity range from purely informational to instantly fatal — and the central operational decision is where you draw the line between auto-remediate, page a human, and log only. Draw it too loosely and corrupt work propagates; draw it too aggressively and you cordon healthy nodes on transient noise, shrinking the cluster and tanking goodput from the recovery side.

The canonical fatal signatures are unambiguous and should trigger automated cordon-and-drain with no human in the path. XID 79 ("GPU has fallen off the bus") means the GPU is gone from the host's view — dead or dying, immediate node eviction (NVIDIA XID reference, 2025). XID 48 (double-bit ECC) is an uncorrectable memory error; a single event can be a cosmic-ray fluke, but a repeat within a week is progressive DRAM failure and an RMA. XID 94 (contained ECC) versus XID 95 (uncontained ECC) is the sharpest fork in the catalog: 94 means the hardware contained the error to a single application and the rest of the node is trustworthy; 95 means it could not be contained and workload outputs may be corrupted — an XID 95, especially alongside a rising rate of XID 94, is a memory-degradation trajectory, not a point event. The table below is the decision the chapter exists to force: a default routing policy you tune to your false-positive tolerance.

XID/SXID routing policy — the auto-remediate vs page vs log decision

Signature	Meaning	Severity	Default action	Goodput consequence if mis-routed
XID 79	GPU fell off the bus	Fatal	Auto-evict node, RMA if persists after reseat	Log-only ⇒ the whole synchronous job hangs on a dead GPU until timeout
XID 48	Double-bit (uncorrectable) ECC	Fatal	Cordon; RMA on repeat within a week	Ignore ⇒ corrupted compute; over-cordon a one-off ⇒ needless node loss
XID 94 / 95	Contained vs uncontained ECC	Warn / Urgent	94: monitor trend; 95: cordon + investigate outputs	Treat 95 as 94 ⇒ corrupted training tokens enter the run silently
XID 13 / 31	Graphics/MMU fault (often app bug)	Variable	Attribute to job, not node; page only on cluster-wide pattern	Auto-evict ⇒ you cordon healthy nodes for a tenant's buggy kernel
XID 119 / 120	GSP / firmware RPC timeout	Transient	Reset GPU; cordon only on repeat	Hair-trigger evict ⇒ flapping nodes; ignore repeats ⇒ recurring stalls
SXID (NVSwitch)	Scale-up fabric fault	Variable	Map to NVLink domain; drain the affected rack-scale unit	Treat as single-GPU ⇒ you miss a fault that degrades the whole NVL72

A default severity routing for the most common signatures. 'Cordon' = stop scheduling new work and drain; 'evict' = also kill running jobs on the node. Tune thresholds to your false-positive tolerance; an over-aggressive policy shrinks the cluster and costs goodput from the recovery side. XID semantics per NVIDIA XID Errors reference (2025).
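The table's default policy can be written down as a small routing function. A minimal sketch: the XID-to-action mapping follows the table, while the repeat threshold, the cluster-wide node threshold, and the "unknown codes page a human" fallback are assumptions chosen for illustration. SXID domain mapping is not modeled.

```python
from enum import Enum


class Action(Enum):
    EVICT = "evict"                  # kill running jobs and drain the node
    CORDON = "cordon"                # stop scheduling new work and drain
    RESET_GPU = "reset_gpu"
    ATTRIBUTE_TO_JOB = "attribute_to_job"
    PAGE = "page"
    LOG = "log"


# Assumed thresholds; tune to your own false-positive tolerance.
REPEAT_THRESHOLD = 2
CLUSTER_PATTERN_NODES = 5


def route_xid(xid: int, repeats_in_window: int = 1, nodes_affected: int = 1) -> Action:
    """Map an XID to a default action, following the routing table above."""
    if xid == 79:                    # GPU fell off the bus
        return Action.EVICT
    if xid in (48, 95):              # uncorrectable or uncontained ECC
        return Action.CORDON
    if xid == 94:                    # contained ECC: trend it, do not alert
        return Action.LOG
    if xid in (13, 31):              # usually a tenant's kernel, not the node
        if nodes_affected >= CLUSTER_PATTERN_NODES:
            return Action.PAGE
        return Action.ATTRIBUTE_TO_JOB
    if xid in (119, 120):            # GSP / firmware timeout
        if repeats_in_window >= REPEAT_THRESHOLD:
            return Action.CORDON
        return Action.RESET_GPU
    return Action.PAGE               # assumption: unknown codes go to a human


for code in (79, 48, 94, 95, 13, 119):
    print(code, route_xid(code).value)
```

Keeping the policy in one reviewable function (or an equivalent config file) makes each threshold change an explicit, auditable decision.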

Two routing subtleties separate a mature stack from a noisy one. First, attribution direction: an XID 13 or 31 (MMU/graphics fault) is usually a tenant's bad kernel, not a sick node — route it to the job's owner, and only escalate to hardware suspicion when the same signature appears across many nodes (a real hardware or driver-version pattern). Auto-evicting on every app-level XID hands tenants the power to shrink your cluster with buggy code. Second, SXID maps to a domain, not a device: an NVSwitch fault degrades an entire NVLink scale-up domain (8, 72, heading to 576 GPUs), so the remediation unit is the rack-scale block, not the single GPU — a distinction that gets sharper as scale-up domains widen on the density ramp toward Kyber-class racks. The fabric-side counters that disambiguate "is it the GPU or the network?" — PFC pause frames, ECN marks, queue depth — are the bridge to congestion engineering in Chapter 8.6: a collective that suddenly runs slow is a network-health question first and a GPU-health question second, and only correlated telemetry tells you which.

Silent data corruption: the failure with no error code

Every error discussed so far announces itself. Silent data corruption (SDC) does not. A marginal core or memory cell computes the wrong answer — a flipped bit in a multiply, a miscomputed gradient — and emits no XID, no ECC count, no log line. The hardware believes it succeeded. In a training run, a single SDC can quietly poison a gradient, corrupt an optimizer state, or push a loss curve subtly off-trajectory; you may not discover it for days, and when you do, the only safe recovery is to roll back to a checkpoint taken before the corruption — discarding all the goodput in between. This is the failure mode that makes a stack with 99% XID coverage still untrustworthy.

The scale of the problem is now well-characterized and uncomfortable. Hyperscalers — Meta, Google, Alibaba — independently converged on roughly 1 in 1,000 machines harboring an SDC-prone defect (Meta Engineering, 2025; corroborated by OCP's SDC-in-AI whitepaper). Meta reports that for large-scale AI training, an SDC event is expected every one to two weeks, and recorded six SDCs during its 54-day, 16K-H100 Llama-3 run. Because the defect is silent, you cannot wait for it to surface — you have to go hunting. Meta's published architecture is the reference: two complementary detectors, Fleetscanner (opportunistic — a machine is fully drained, exhaustively tested, then quarantined or returned to rotation) and Ripple (in-production — test loads are sliced into the gaps between real workloads), together running about 2.5 billion test seeds per month across the fleet (Meta Engineering, 2025).
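The hunting idea can be illustrated with a toy. This is not Meta's Fleetscanner or Ripple implementation, only a minimal sketch of the underlying principle: run a deterministic, seeded workload and compare its digest with a golden value from trusted hardware. The `multiply` hook that simulates a marginal unit, and every name here, is hypothetical.

```python
import hashlib
import random


def seeded_digest(seed: int, multiply=lambda a, b: a * b, rows: int = 64, width: int = 32) -> str:
    """Run a deterministic integer dot-product workload and hash every result."""
    rng = random.Random(seed)
    digest = hashlib.sha256()
    for _ in range(rows):
        a = [rng.randrange(1 << 16) for _ in range(width)]
        b = [rng.randrange(1 << 16) for _ in range(width)]
        dot = sum(multiply(x, y) for x, y in zip(a, b))
        digest.update(dot.to_bytes(16, "little"))
    return digest.hexdigest()


def failing_seeds(multiply, golden: dict) -> list:
    """Return the seeds whose digest on this 'machine' differs from the golden digest."""
    return [seed for seed, want in golden.items() if seeded_digest(seed, multiply) != want]


# Golden digests, as if computed once on known-good hardware.
golden = {seed: seeded_digest(seed) for seed in range(8)}


def marginal_multiply(a: int, b: int) -> int:
    """Simulated defect: flips one product bit for a narrow class of operands, silently."""
    product = a * b
    return product ^ 0b1000 if a % 251 == 0 else product


print("healthy machine, failing seeds:", failing_seeds(lambda a, b: a * b, golden))
print("marginal machine, failing seeds:", failing_seeds(marginal_multiply, golden))
```

A defect that triggers only on particular operand patterns may pass many seeds before one catches it, which is why seed volume matters.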

The SDC detection fork is a goodput fork

You have three options on SDC, and the third is the trap. (1) Drain-and-test (Fleetscanner-style) is exhaustive but steals GPU-hours from training to run tests — a direct goodput tax you pay up front. (2) In-production sampling (Ripple-style) is nearly free but probabilistic — it lowers, not eliminates, the corruption window. (3) Do nothing and hope ECC catches it. Option 3 is the one that quietly destroys value: ECC does not catch logic-unit SDCs, so a defect computes wrong answers for days, and when you finally detect a diverged loss, you roll back potentially weeks of training to a clean checkpoint. The drain-and-test tax is a small, visible, scheduled cost; the do-nothing path is a large, invisible, unscheduled one. At ~1-in-1,000 machine prevalence and 100k-GPU fleets, "do nothing" is not a strategy — it is an unbudgeted, unbounded liability against the most expensive asset in the building.

Figure	What it measures	Source (year)
419 / 54 days	Unplanned interruptions on 16,384 H100s (~1 every 3 hr); 78% hardware, 58.7% GPU-related	Meta (Llama 3 paper) / Tom's Hardware (2024)
~7 days	MTBF per 512 GPUs at a best-in-class mature H100 operator (new clusters fail far more)	SemiAnalysis, 100k H100 clusters (2025)
~1 in 1,000	Machines harboring an SDC-prone defect; SDC expected every 1-2 weeks in large training	Meta Engineering; OCP SDC-in-AI whitepaper (2025)
~2.5 billion	SDC test seeds per month across Meta's fleet (Fleetscanner + Ripple)	Meta Engineering, How Meta keeps AI hardware reliable (2025)
~90% / ~96%	Industry-average vs best-in-class goodput (effective training time)	SemiAnalysis ClusterMAX / CoreWeave (2025)
~43.4%	Large-LLM-job failure rate (~37% hardware-attributed; ~73% recoverable via restart)	Alibaba (Unicron) via SemiAnalysis (2024)
6-21% of TCO	Reliability/recovery overhead, the cost the observability loop exists to shrink	SemiAnalysis ClusterMAX 2.0 (2025)
<2 min	MTTR achievable with multi-tier checkpointing vs 15-30 min naive restart	Google Cloud, multi-tier checkpointing (2025)

Goodput as the headline metric, and badput accounting

Goodput is the metric every other signal in this chapter exists to protect, and it has a precise definition worth adopting verbatim. Google Cloud frames it as productive ML throughput net of badput — and the discipline is in the badput taxonomy, because it forces you to attribute every lost GPU-second to a category you can attack. Badput buckets include: scheduling badput (waiting for resources), provisioning/initialization badput (the slow ramp before steady state — see Chapter 10.5), disruption badput (the failure-detect-and-restart loop), wasted-progress badput (work done since the last checkpoint, thrown away on restart), and SDC badput (work that must be rolled back because it was corrupt). Each bucket points at a different fix: disruption badput is a detection-latency problem; wasted-progress badput is a checkpoint-cadence problem (the Young/Daly optimal-interval math is canonical in Chapter 9.4); SDC badput is a hunting problem.

The job-level observability that produces these numbers must be assembled deliberately. Raw DCGM utilization is necessary but not sufficient — it cannot distinguish a GPU doing useful work from one busy-waiting on a straggler or recomputing a corrupted step. Real goodput accounting requires joining hardware telemetry to training-loop telemetry: step time, samples/sec, optimizer-step commit events, and checkpoint-write completions, emitted by the framework itself. Only at that join can you compute the fraction of wall-clock GPU-time that produced committed, correct progress — and only then does the headline number on the wall mean what operators think it means. SemiAnalysis's ClusterMAX rating makes goodput and health-check rigor a first-class scoring dimension precisely because it is the cleanest single proxy for whether an operator's whole stack works.
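Once every interval of a job's wall clock is attributed to a category, the accounting itself is simple. A minimal sketch, assuming the framework and scheduler already emit labeled intervals; the bucket names mirror the taxonomy above, and the one-day example numbers are invented for illustration.

```python
from collections import Counter

BADPUT_BUCKETS = {"scheduling", "initialization", "disruption", "wasted_progress", "sdc_rollback"}


def goodput_breakdown(intervals, gpus: int) -> dict:
    """Return each category's share of total GPU-seconds.

    intervals: iterable of (category, wall_clock_seconds) covering the job's
    lifetime, where category is "productive" or one of BADPUT_BUCKETS.
    """
    gpu_seconds = Counter()
    for category, seconds in intervals:
        if category != "productive" and category not in BADPUT_BUCKETS:
            raise ValueError(f"unknown category: {category}")
        gpu_seconds[category] += seconds * gpus
    total = sum(gpu_seconds.values())
    if total == 0:
        raise ValueError("no intervals to account for")
    return {category: spent / total for category, spent in gpu_seconds.items()}


# Invented example: one 24-hour day (86,400 s) of a 16,384-GPU job.
day = [
    ("scheduling", 600),
    ("initialization", 900),
    ("productive", 40_000),
    ("disruption", 1_200),       # detect-and-restart loop
    ("wasted_progress", 1_620),  # work since the last checkpoint, discarded
    ("productive", 42_080),
]
report = goodput_breakdown(day, gpus=16_384)
for category, share in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{category:16s} {share:6.2%}")
```

The "productive" share is the goodput number for the wall; the remaining rows say which bucket to attack first.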

Deep dive: the detection-to-recovery loop, and why detection latency dominates lost goodput

Observability is worthless if it only describes; its job is to trigger. The loop is: detect → attribute → cordon → recover → return-to-service, and the dominant term in lost goodput is almost always detection latency on a synchronous job. Here is the mechanism. In synchronous data-parallel training, every GPU blocks at the all-reduce at the end of each step. If one GPU is failing slowly — a thermal throttle, a degrading NVLink, an intermittent ECC storm that has not yet crossed an XID threshold — it becomes a straggler, and the entire 16K-GPU job runs at the speed of that one sick device until something cordons it. Every second of detection latency is multiplied by the full GPU count. A hang on a dead GPU (XID 79) that takes a 10-minute NCCL timeout to surface is 10 minutes times 16,384 GPUs of pure badput.
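The multiplication in that last example is worth making explicit. A minimal sketch; `lost_fraction` is an illustrative parameter meaning the share of wall-clock progress lost while the fault goes undetected (1.0 for a full hang, a small value for a straggler).

```python
def undetected_fault_badput_gpu_hours(detect_latency_s: float, job_gpus: int,
                                      lost_fraction: float = 1.0) -> float:
    """GPU-hours of badput accrued by a synchronous job before a fault is detected."""
    return detect_latency_s * job_gpus * lost_fraction / 3600


# The hang from the text: a 10-minute NCCL timeout on a 16,384-GPU job.
hang = undetected_fault_badput_gpu_hours(600, 16_384)

# Assumed comparison: a straggler costing 5% of progress, unnoticed for an hour.
straggler = undetected_fault_badput_gpu_hours(3600, 16_384, lost_fraction=0.05)

print(f"10-minute hang:       {hang:,.0f} GPU-hours")
print(f"1-hour 5% straggler:  {straggler:,.0f} GPU-hours")
```

Under these assumptions, a mild slowdown that goes unnoticed for an hour costs a meaningful fraction of what the hard hang does.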

This is why mature stacks invest in pre-failure signals, not just hard faults: per-rank step-time outlier detection (the straggler is the rank that is consistently 5% slow), ECC-rate trend alarms that fire before the double-bit error, and NVLink/CRC-error slopes that flag a link before it flaps. The recovery side is owned by Chapter 10.7 (autonomous hardware recovery, hot spares, elastic training) and the checkpoint math by Chapter 9.4; the observability stack's contribution is to make detect-and-attribute fast and certain enough that recovery has something correct to act on. Multi-tier checkpointing collapses MTTR from 15-30 minutes of naive restart to under two minutes (Google Cloud, 2025) — but only if the detector hands it an unambiguous "this node, now" signal. Fast, certain detection is the multiplier on every recovery investment downstream.
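Per-rank outlier detection can be sketched with nothing more than a median. This is an illustrative example, assuming each rank reports its own compute time per step before the collective (the post-barrier step time is identical across ranks and carries no signal); the 5% threshold comes from the text, while the persistence fraction is an assumed tuning knob.

```python
from statistics import median


def find_stragglers(step_times: dict, threshold: float = 1.05, min_fraction: float = 0.8) -> list:
    """Return ranks slower than threshold x the per-step median in at least min_fraction of steps.

    step_times: {rank: [pre-collective seconds for step 0, step 1, ...]}
    """
    ranks = list(step_times)
    n_steps = min(len(times) for times in step_times.values())
    slow_steps = dict.fromkeys(ranks, 0)
    for step in range(n_steps):
        step_median = median(step_times[rank][step] for rank in ranks)
        for rank in ranks:
            if step_times[rank][step] > threshold * step_median:
                slow_steps[rank] += 1
    return [rank for rank in ranks if slow_steps[rank] / n_steps >= min_fraction]


# Synthetic data: 8 ranks near 1.00 s with small jitter; rank 5 is consistently 8% slow.
times = {rank: [1.00 + 0.001 * ((rank + step) % 3) for step in range(10)] for rank in range(8)}
times[5] = [1.08] * 10
print("stragglers:", find_stragglers(times))
```

Requiring persistence across most steps is what separates a sick rank from ordinary jitter, at the cost of a few steps of added detection latency.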

Self-hosted vs managed — and the cardinality budget that governs both

The architecture fork is the same one every observability buyer faces, sharpened by GPU economics. Self-hosted — Prometheus (or VictoriaMetrics/Mimir/Thanos for scale) plus Grafana, fed by dcgm-exporter and node/fabric exporters — gives full control, no per-host SaaS bill, and data that never leaves your security boundary (which matters for sovereign and air-gapped builds; see Chapter 11.7). The cost is that you now operate a high-cardinality time-series database at fleet scale as a production service of its own. Managed — Datadog, Chronosphere, Grafana Cloud, and the GPU-cloud-native stacks — offloads that operational burden but bills on ingest and active time-series, and at GPU-fleet cardinality those bills get large fast. Chronosphere's own framing is blunt: high observability costs steal budget from GPUs, training, and staff.

The variable that dominates both paths is cardinality, and it is the one most teams underestimate. Cardinality is the number of unique time series, and it multiplies: every metric, times every GPU, times every job, times every tenant, times every region, times every label you attach. DCGM alone exposes dozens of metrics per GPU; multiply by 100,000 GPUs and then by per-job and per-tenant labels and you are into hundreds of millions of active series — high cardinality is consistently cited as the single biggest driver of observability cost overruns across Prometheus, Datadog, and every time-series backend (Grafana Labs; Last9, 2025). The trap specific to AI fleets is cardinality drift: every new tenant, job ID, and model SKU mints fresh label values, so the series count grows without anyone deciding to spend more.
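The multiplication is easy to reproduce. A minimal sketch with assumed inputs: 50 metrics per GPU stands in for "dozens", and 40 distinct job/tenant label combinations per GPU within the retention window is an invented churn figure.

```python
def series_count(metrics_per_gpu: int, gpus: int, label_combos_per_gpu: int = 1) -> int:
    """Unique time series: metrics x GPUs x distinct label combinations each GPU has carried.

    label_combos_per_gpu models churn: every new (job, tenant) pair scheduled
    onto a GPU mints a fresh series for every metric on that GPU.
    """
    return metrics_per_gpu * gpus * label_combos_per_gpu


base = series_count(metrics_per_gpu=50, gpus=100_000)
with_churn = series_count(metrics_per_gpu=50, gpus=100_000, label_combos_per_gpu=40)
print(f"hardware labels only: {base:,} series")
print(f"with job/tenant churn: {with_churn:,} series")
```

The hardware-only count is fixed by the fleet; the churn multiplier is the part that grows without anyone approving it.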

Observability architecture fork — self-hosted vs managed, governed by cardinality

Dimension	Self-hosted (Prometheus/VM + Grafana)	Managed (Datadog/Chronosphere/Grafana Cloud)
Up-front control	Full — schema, retention, sampling all yours	Bounded by the vendor's model and limits
Cost shape	Capex + ops headcount; flat at scale	Opex per ingest / active series; grows with cardinality
Cardinality risk	Yours to engineer (relabel, drop, downsample)	Yours to pay for unless you engineer it the same way
Data residency	Stays in your boundary (sovereign / air-gap ready)	Leaves your boundary unless self-hosted-managed hybrid
Operational burden	You run an HA TSDB as a production service	Offloaded to the vendor
Best fit	Large durable fleets, sovereignty needs, cost-at-scale	Fast start, smaller fleets, lean ops teams

The choice is rarely pure; tiered pipelines (route high-cardinality raw to cheap object storage, send aggregates to the primary backend) are the dominant 2026 pattern. Cardinality discipline applies to both columns. Cost-driver framing per Grafana Labs / Last9 / Chronosphere (2025).

The mature answer is neither pure column — it is a tiered telemetry pipeline with deliberate cardinality control, and it is now the dominant 2026 pattern regardless of which backend you buy. The mechanics: keep high-resolution, high-cardinality raw telemetry (per-GPU, per-second DCGM) in cheap object storage for forensic and ad-hoc analysis, while sending aggregated, lower-cardinality summaries to the primary backend that powers live dashboards and alerts (Last9; NVIDIA DCGM collector docs, 2025). Relabel to drop labels nobody queries, downsample old data, and put a hard cardinality budget on each team's namespace so a new tenant cannot silently 10x the bill. The principle is the same one that governs the whole chapter: you cannot afford to watch everything at full fidelity forever, so you decide — explicitly — what you keep hot, what you keep cold, and what you drop, and you make that a budgeted decision rather than an emergent one. Get it wrong on a power-bound, capital-intense fleet and the observability stack becomes a line item that competes with the GPUs it was built to protect.
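The hot-tier half of that pipeline amounts to dropping labels and aggregating what remains, then checking the result against a budget. A minimal sketch in plain Python rather than any particular backend's relabeling syntax; the label names, the sample values, and the budget are illustrative assumptions.

```python
from collections import defaultdict


def downsample_labels(samples, keep_labels):
    """Average one scrape's samples over every label not listed in keep_labels.

    samples: iterable of (labels_dict, value) pairs.
    """
    groups = defaultdict(list)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep_labels)
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def check_budget(series, budget: int) -> int:
    """Fail loudly if a namespace exceeds its series budget instead of paying silently."""
    if len(series) > budget:
        raise RuntimeError(f"{len(series)} series exceeds budget of {budget}")
    return len(series)


# Raw tier: per-GPU power draw (watts) for 2 nodes x 4 GPUs, one sample each.
raw = [
    ({"node": f"n{node}", "gpu": str(gpu), "job": "job-a", "tenant": "t1"}, 600.0 + 10 * gpu)
    for node in range(2)
    for gpu in range(4)
]

# Hot tier: keep only the labels that dashboards and alerts actually query.
hot = downsample_labels(raw, keep_labels=("node", "job"))
print(f"raw series: {len(raw)}, hot series: {check_budget(hot, budget=4)}")
```

The raw samples would still be written to object storage at full fidelity; only the aggregated view reaches the backend that is billed per active series.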

Deep dive: a worked alerting taxonomy — what cordons, what pages, what only logs

A concrete default, to make the routing decisions above tangible.

Auto-remediate, no human (cordon/evict + drain): XID 79 (off the bus), XID 48 repeated, XID 95 (uncontained ECC), NCCL collective hard-timeout, sustained coolant-inlet over-temp on a DLC node. These have unambiguous remediations, and a human in the loop only adds latency that synchronous jobs pay for at full GPU count.

Page a human (SLO-burn / investigate): a rising slope of XID 94 toward 95, cluster-wide appearance of a single XID/driver signature (firmware or driver-version regression), goodput dropping below an SLO floor, an SDC detector quarantine, a fabric congestion pattern (PFC/ECN) that does not clear. These need judgment: is it hardware, a bad driver rollout, or a tenant?

Log only (trend, do not alert): single XID 94 (contained ECC, by definition handled), single transient GSP timeout that auto-recovered, per-job app-level MMU faults attributed to a tenant.

The art is in the thresholds and slopes, not the categories: a single contained-ECC event is noise, a rising rate of them is a pre-failure signal, and the difference between the two is the difference between a stack that cordons sick nodes before they corrupt a step and one that drowns operators in pages or misses the fault entirely. This SLO-burn-alerting discipline ties directly to the fleet-reliability automation in Chapter 10.7 and the facility-side correlation in Chapter 14.2.
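The "single event is noise, rising rate is signal" rule can be sketched as a sliding-window counter. This is a deliberately simple stand-in for a true slope estimate; the 24-hour window and the three-event paging threshold are assumed values, and the class name is hypothetical.

```python
from collections import deque


class ContainedEccTrend:
    """Escalate contained-ECC (XID 94) events from 'log' to 'page' when they cluster in time."""

    def __init__(self, window_s: float = 86_400, page_at: int = 3):
        self.window_s = window_s
        self.page_at = page_at
        self._events = deque()

    def observe(self, ts: float) -> str:
        """Record one event at time ts (seconds) and return the routing decision."""
        self._events.append(ts)
        # Drop events that have aged out of the sliding window.
        while ts - self._events[0] > self.window_s:
            self._events.popleft()
        return "page" if len(self._events) >= self.page_at else "log"


trend = ContainedEccTrend()
decisions = [trend.observe(ts) for ts in (0, 100_000, 110_000, 120_000)]
print(decisions)
```

One tracker per GPU keeps the decision local; the first event ages out of the window, and only the later cluster of three crosses the paging threshold.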

The recovery side of the loop that this chapter's detection feeds (autonomous hardware recovery, hot spares, and elastic training) is owned by Chapter 10.7; the checkpoint-cadence math that bounds wasted-progress badput is canonical in Chapter 9.4. The fabric counters used here to disambiguate GPU-vs-network faults feed congestion engineering in Chapter 8.6. Provisioning and bring-up, the source of initialization badput and of the burn-in gates that catch faults before production, are covered in Chapter 10.5. Facility-side DCIM telemetry and the liquid-cooling correlation are deepened in Chapter 14.2; component failure rates and the fleet-reliability dataset behind the MTBF numbers are in Chapter 14.3. Security boundaries for a self-hosted telemetry plane connect to Chapter 11.7.