The Definitive Guide to AI Data Centers

Chapter 13.7

Network Fabric Commissioning & Validation

A GPU fabric does not fail loudly at commissioning — it fails quietly, one marginal optic and one mis-cabled rail at a time, and every defect you do not screen out at layer 1 reappears as a straggler, a stalled all-reduce, or an uncorrelatable incident once the cluster is earning depreciation.

GOODPUT · DENSITY-RAMP

What you'll decide here

  1. The acceptance bar at the physical layer — the pre-FEC/post-FEC BER threshold, the link-flap budget, and the soak duration each link must survive before you let any higher-layer test touch it.
  2. How you prove topology correctness — that every link lands on the rail and tier the cabling map says it should — before a single collective runs, because a mis-cabled rail passes BER and still destroys bandwidth.
  3. Which layers you validate with synthetic point-to-point traffic at commissioning (this chapter) versus which you defer to cluster-scale collective benchmarking under real GPUs (Chapter 13.9) — and why the seam falls where it does.
  4. Whether time-sync accuracy is a hard acceptance gate or a best-effort configure-and-hope step — and the holdover and offset-under-load criteria the timing plane must demonstrate to pass.
  5. What baseline 'fingerprint' you capture before handover, and the link-health register you hand to operations — because day-2 fabric reliability is only as good as the as-commissioned baseline you can diff against.

By the time the network team arrives, the hall has power that has been load-banked, cooling that has been flushed and proven, and an integrated-systems test that pulled the plug and watched the building survive (Chapter 13.6). What it does not yet have is a fabric anyone can trust. The cables are run, the optics are seated, the switches are racked and powered — and somewhere in the tens of thousands of links there are marginal transceivers, a handful of rails cross-patched into the wrong leaf, a few ports negotiating the wrong FEC, and one boundary clock nobody configured. None of these announce themselves. They surface later, as a 4,000-GPU job that runs at 70% of the bandwidth it should, or a collective that intermittently times out, or an incident whose telemetry timeline is too blurry to read.

This chapter is the discipline that drags those defects into the light before the cluster is declared ready. Each acceptance gate (the BER threshold, the topology-verification method, the synthetic-vs-real-workload seam, the time-sync gate) carries a specific downstream cost when it is set wrong. The lesson of fabric commissioning is that defects are cheapest to find at the lowest layer that can reveal them. A marginal optic found at layer 1 is a five-minute swap by a technician already on the floor; the same optic found after training starts is days of wall-clock loss, a forensic hunt across thousands of streams, and a straggler that throttles a synchronous job to the speed of its weakest link. Commission the fabric bottom-up (physical, then topology, then point-to-point, then timing) and you convert a population of latent defects into a finite punch list. Skip it, and you have shipped the punch list to operations with the GPUs already spinning.

The bottom-up sequence: why layer order is not negotiable

Fabric commissioning runs strictly bottom-up, and the ordering is not a stylistic preference — it is a debugging-economics argument. Each layer can only be trusted once the layer beneath it is clean, because a defect at a lower layer masquerades as a defect at every layer above it. A marginal optic (layer 1) presents as packet loss (layer 2/3), which presents as collapsed RDMA bandwidth (transport), which presents as a slow all-reduce (collective), which presents as a straggler (the job). If you start the diagnosis at the top, you spend days chasing a software ghost that is really a $200 transceiver. So you validate the bottom first and refuse to let any higher-layer test run over a link that has not passed the one below it.

The sequence is: (1) physical-layer acceptance — every link clean, every optic in spec, every port stable under soak; (2) topology validation — every link landing exactly where the cabling map says, with the right rail and tier; (3) point-to-point bandwidth and latency — synthetic RDMA traffic confirming each path delivers its design bandwidth and latency; (4) congestion and QoS verification — PFC/ECN/QoS behaving as configured under deliberate stress; (5) scale-up (NVLink) validation — the intra-node/intra-rack domain proven distinct from the scale-out fabric; and (6) the timing plane — PTP accuracy gated as an acceptance criterion. Cluster-scale collective benchmarking — NCCL all-reduce sweeps, OSU/MLPerf, the bisection-bandwidth acceptance number that becomes a contractual SLA — sits one layer above this chapter and requires real GPUs, so it lives in Chapter 13.9. The seam is deliberate: this chapter proves the fabric is sound with synthetic traffic; the next proves the cluster is fast with real workloads.
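The ordering rule can be made mechanical. The following is a minimal sketch, assuming a strictly linear dependency between the six gates (a simplification of the sequence above); the gate identifiers and function names are illustrative, not taken from any tool.

```python
from typing import Dict, Optional

# Gate order as listed in this chapter; the identifiers are illustrative.
GATE_ORDER = [
    "physical",
    "topology",
    "point_to_point",
    "congestion_qos",
    "scale_up",
    "time_sync",
]


def may_run(gate: str, results: Dict[str, bool]) -> bool:
    """True only if every gate below `gate` has already passed."""
    idx = GATE_ORDER.index(gate)
    return all(results.get(g, False) for g in GATE_ORDER[:idx])


def next_runnable_gate(results: Dict[str, bool]) -> Optional[str]:
    """Return the first gate that has not passed, or None if all have."""
    for gate in GATE_ORDER:
        if not results.get(gate, False):
            return gate
    return None


if __name__ == "__main__":
    results = {"physical": True}
    print(may_run("point_to_point", results))  # topology has not passed yet
    print(next_runnable_gate(results))
```

A test harness built this way refuses to start a point-to-point run while topology is still open, which is the refusal the sequence depends on.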

The first gate is the physical layer, and it is where the largest population of defects lives. Modern AI fabrics run PAM4 SerDes at 100–200 Gb/s per lane (800G and 1.6T links), a signal regime so tight that every link relies on forward error correction to be usable at all. The raw, uncorrected channel — the pre-FEC BER — is genuinely error-prone; FEC (RS-544 KP4 on Ethernet, the equivalent on InfiniBand) corrects it down to a post-FEC BER that is meant to be effectively error-free. The acceptance discipline turns on reading both numbers, because they answer different questions. Post-FEC BER tells you whether the link is currently delivering clean frames; pre-FEC BER tells you how much margin FEC is burning to do it — and a link running clean today on the last few dB of FEC headroom is a link that will start dropping frames the first warm afternoon.

The convention, inherited from IEEE 802.3 and the IBTA, is a post-FEC BER ceiling of roughly 1e-12 (tightening toward 1e-13 at the highest lane rates) as the pass line, with pre-FEC BER read against the transceiver's specified margin as the real screen. Any link whose pre-FEC BER sits above its optic's spec, even if FEC is still masking it, is flagged for an optic swap, reseat, or cable replacement before any higher-layer test runs over it. This is the single highest-yield step in fabric commissioning: a 512-GPU pod with ~2,000 back-end links will have a handful of marginal optics straight out of the integration factory, and finding them here costs hours of a technician's time. Finding them after the proxy training run starts costs days of wall-clock loss and a straggler hunt across the whole job.
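The two-number screen can be sketched as a small classifier. This is an illustration under stated assumptions: the 1e-12 pass line comes from the text, while `LinkReading`, `screen`, and the per-optic `pre_fec_spec` input are hypothetical names, and the real pre-FEC limit must come from the transceiver's datasheet.

```python
from dataclasses import dataclass

POST_FEC_PASS = 1e-12  # post-FEC pass line quoted in the text


@dataclass
class LinkReading:
    name: str
    pre_fec_ber: float   # raw channel error ratio, before correction
    post_fec_ber: float  # error ratio after FEC
    pre_fec_spec: float  # optic's specified pre-FEC limit (assumed input)


def ber(errors: int, bits: int) -> float:
    """Bit error ratio from raw error and bit counters."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    return errors / bits


def screen(link: LinkReading) -> str:
    """Classify a link as 'fail', 'flag', or 'pass'."""
    if link.post_fec_ber > POST_FEC_PASS:
        return "fail"  # frames are not clean even after FEC
    if link.pre_fec_ber > link.pre_fec_spec:
        return "flag"  # clean today, but FEC is masking an out-of-spec channel
    return "pass"


if __name__ == "__main__":
    marginal = LinkReading("leaf3/17", pre_fec_ber=1e-5,
                           post_fec_ber=0.0, pre_fec_spec=1e-6)
    print(marginal.name, screen(marginal))
```

The "flag" outcome is the one that matters: it is the link that a post-FEC-only check would have accepted.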

The second physical-layer screen is link-flap. A link that trains up, drops, and re-trains — even once an hour — is a link that will eventually drop mid-collective, and on a synchronous fabric a single transient flap can stall the entire job. The discipline is to zero every port's flap counter at baseline and then run a sustained soak — typically a 24-hour minimum, often longer at scale — under line-rate synthetic load and at realistic inlet temperature, watching for any port that logs a single flap, a single FEC-uncorrectable event, or a creep in pre-FEC BER. Soak is where thermal-marginal optics reveal themselves: a transceiver that passes a cold-start BER check at 22 °C can fail once the hall warms to its operating point. Run the soak cold and you have not tested the link you are going to operate.
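A soak verdict along these lines could be computed from periodic counter samples. This is a minimal sketch: the 24-hour minimum and the zero-flap, zero-uncorrectable rules follow the text, while the `SoakSample` structure and the 10x creep ratio are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoakSample:
    hours: float            # elapsed time since counters were zeroed
    flaps: int              # cumulative link-flap count
    fec_uncorrectable: int  # cumulative FEC-uncorrectable events
    pre_fec_ber: float      # pre-FEC BER measured at this sample


def soak_failures(samples: List[SoakSample],
                  min_hours: float = 24.0,
                  max_creep_ratio: float = 10.0) -> List[str]:
    """Return the reasons a port fails soak; an empty list means it passed."""
    if not samples:
        return ["no samples recorded"]
    reasons = []
    last = samples[-1]
    if last.hours < min_hours:
        reasons.append(f"soak ran {last.hours:.1f} h, below {min_hours:.0f} h")
    if last.flaps > 0:
        reasons.append(f"{last.flaps} link flap(s)")
    if last.fec_uncorrectable > 0:
        reasons.append(f"{last.fec_uncorrectable} FEC-uncorrectable event(s)")
    start = samples[0].pre_fec_ber
    worst = max(s.pre_fec_ber for s in samples)
    # max_creep_ratio is an assumed threshold, not a figure from the text.
    if start > 0 and worst / start > max_creep_ratio:
        reasons.append(f"pre-FEC BER crept {worst / start:.0f}x during soak")
    return reasons


if __name__ == "__main__":
    warm = [SoakSample(0, 0, 0, 1e-9), SoakSample(12, 0, 0, 5e-8),
            SoakSample(24, 1, 0, 4e-8)]
    print(soak_failures(warm))
```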

Topology validation: the mis-cabled rail that passes every BER test

A link can be physically perfect — zero flaps, BER far inside spec — and still be wrong, because it lands on the wrong port. AI fabrics are rail-optimized: GPU 0 on every node connects to leaf 0, GPU 1 to leaf 1, and so on, so that a given rail's collective traffic stays within its own plane and never has to cross to another rail's spine (Chapter 8.5). When a single cable is cross-patched — GPU 3's link landing on leaf 5 instead of leaf 3 — nothing at the physical layer complains. The link trains, BER is clean, the soak passes. But the rail-optimized invariant is broken, and traffic that should have stayed on one rail now traverses extra hops, congesting the spine and collapsing effective bandwidth for every job that touches that node. This is the defect class that BER screening cannot catch, and it is endemic at scale: with tens of thousands of cables run by hand, a sub-percent miswire rate still means hundreds of wrong links.

Topology validation is therefore a distinct, mandatory gate: an automated reconciliation of the discovered topology against the designed cabling map. The fabric manager (UFM/NMX for InfiniBand, the equivalent SDN/telemetry controller for Ethernet) enumerates every link's two endpoints by switch GUID and port, and a script diffs that discovered graph against the reference architecture's expected adjacency — every leaf-to-spine link, every node-to-leaf link, every rail assignment. Mismatches print as a punch list of specific ports to re-patch. The discipline is to run this before point-to-point bandwidth testing, because a mis-cabled topology will produce bandwidth numbers that look like a congestion problem or a tuning problem, and you will burn days chasing the wrong layer. Verify the graph is the graph you designed, then test how fast it is.
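The reconciliation step reduces to a graph diff. The sketch below assumes both the cabling map and the fabric manager's discovery output have already been exported into a plain mapping from local port to remote port; the export format, the device names, and `diff_topology` are hypothetical, not a UFM or NMX API.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # (device identifier, port number)


def diff_topology(expected: Dict[Port, Port],
                  discovered: Dict[Port, Port]) -> List[str]:
    """Return a punch list of ports whose far end differs from the map."""
    punch = []
    for local, want in sorted(expected.items()):
        got = discovered.get(local)
        if got is None:
            punch.append(f"{local}: expected {want}, link missing")
        elif got != want:
            punch.append(f"{local}: expected {want}, found {got}")
    # Links the design does not know about are also defects.
    for local in sorted(set(discovered) - set(expected)):
        punch.append(f"{local}: unexpected link to {discovered[local]}")
    return punch


if __name__ == "__main__":
    # Rail-optimized design: GPU NIC n on each node lands on leaf n.
    expected = {("node7-nic3", 1): ("leaf3", 7),
                ("node7-nic5", 1): ("leaf5", 7)}
    # Cross-patched: the two cables were swapped at the leaf end.
    discovered = {("node7-nic3", 1): ("leaf5", 7),
                  ("node7-nic5", 1): ("leaf3", 7)}
    for line in diff_topology(expected, discovered):
        print(line)
```

Both swapped links train and pass BER, so the diff is the only place they show up.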

Point-to-point bandwidth and latency: synthetic traffic before real workloads

With the physical layer clean and the topology proven correct, you measure whether each path actually delivers its design bandwidth and latency — using synthetic RDMA traffic, not real collectives. On InfiniBand the canonical tools are the perftest suite: ib_send_bw and ib_write_bw for per-pair bandwidth, ib_send_lat/ib_write_lat for one-way and round-trip latency. On RoCEv2 the same perftest binaries run over the Ethernet fabric, plus rping and vendor equivalents. The acceptance criterion is that every node-to-node and node-to-leaf path delivers its line-rate bandwidth (a 400G NIC should show ~390+ Gb/s of goodput after protocol overhead) and its design latency (~1–2 µs on InfiniBand, ~1.5–2.5 µs on a tuned RoCEv2 fabric — see Chapter 8.4), with no path anomalously slow. A single path that benchmarks low after BER and topology have both passed points at a congestion-control or QoS misconfiguration on that path — exactly the kind of defect the next gate is built to catch.
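Once per-pair results have been collected, the acceptance check is a threshold test plus an outlier scan. This sketch evaluates numbers that are assumed to have been parsed from the benchmark output already; it does not invoke perftest. The 95% fraction and the 2 µs ceiling reflect the figures in this chapter for a 400G InfiniBand path, and the 5% median tolerance is an assumption.

```python
from statistics import median
from typing import Dict, List


def check_pair(bw_gbps: float, lat_us: float,
               line_rate_gbps: float = 400.0,
               min_fraction: float = 0.95,
               max_lat_us: float = 2.0) -> List[str]:
    """Return the reasons one node pair fails; empty means it passed."""
    reasons = []
    floor = line_rate_gbps * min_fraction
    if bw_gbps < floor:
        reasons.append(f"bandwidth {bw_gbps:.0f} Gb/s below {floor:.0f} Gb/s")
    if lat_us > max_lat_us:
        reasons.append(f"latency {lat_us:.2f} us above {max_lat_us:.2f} us")
    return reasons


def slow_outliers(bw_by_pair: Dict[str, float],
                  tolerance: float = 0.05) -> List[str]:
    """Pairs whose bandwidth is more than `tolerance` below the fleet median."""
    if not bw_by_pair:
        return []
    mid = median(bw_by_pair.values())
    return sorted(p for p, bw in bw_by_pair.items()
                  if bw < mid * (1 - tolerance))


if __name__ == "__main__":
    print(check_pair(392.0, 1.4))
    print(slow_outliers({"a-b": 392.0, "a-c": 391.0, "a-d": 340.0}))
```

The median comparison covers the "no path anomalously slow" clause: a path can clear an absolute threshold and still stand out against its peers.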

The reason point-to-point comes before collective benchmarking is debugging isolation. A pair-wise ib_write_bw test exercises one link in isolation, so a low number localizes to that link and its two endpoints. An all-reduce sweep exercises thousands of links simultaneously, so a low number tells you the aggregate is slow without telling you which link dragged it down. You establish per-pair correctness first, then layer collective behavior on top — and you defer the collective sweep to Chapter 13.9 because a true NCCL all-reduce acceptance run needs real GPUs driving the NICs, which means it overlaps node burn-in (Chapter 13.8) rather than pure fabric commissioning.

Fabric commissioning gates — what each layer proves, the tool, the pass criterion, and the consequence of skipping it
| Gate | What it proves | Method / tool | Pass criterion | Cost of skipping |
|---|---|---|---|---|
| Physical-layer (BER + flap) | Every link is clean and has FEC margin to spare | Switch/NIC counters; pre-FEC & post-FEC BER read at line rate | Post-FEC BER ≤ ~1e-12; pre-FEC within optic spec; zero flaps over 24h+ soak | Marginal optic becomes a production straggler; days of forensic hunt |
| Topology validation | Every link lands on the designed rail/tier | Fabric-manager discovery (UFM/NMX/SDN) diffed vs cabling map | Discovered adjacency graph matches reference architecture exactly | Cross-patched rail collapses bandwidth; mimics a tuning problem |
| Point-to-point BW/latency | Each path delivers design bandwidth and latency | ib_send_bw / ib_write_bw / ib_*_lat (IB); perftest / rping (RoCE) | ≥ ~95% line-rate goodput per pair; latency within design band | A slow path hides in the collective average; localization lost |
| Congestion / QoS | PFC/ECN/QoS behave as configured under stress | Engineered incast/many-to-one; watch PFC pause, ECN marks, drops | No PFC deadlock/storm; ECN engages; no victim-flow loss; fairness holds | PFC storm or HOL blocking stalls the whole job under real load |
| Scale-up (NVLink) | Intra-rack domain is whole and distinct from scale-out | Fabric/NVLink topology query; per-link NVLink BW check | All NVLink/NVSwitch links present at full BW; domain size correct | A degraded NVLink lane silently caps TP/EP bandwidth in-node |
| Time-sync (PTP) | Every clock is within offset bound, under load and in holdover | Offset-from-master sweep; load test; simulated GNSS-loss holdover | All nodes within stated offset; bound holds under load; holdover/failover meet spec | Uncorrelatable telemetry; broken multi-DC ordering; blind RoCE diagnosis |
Synthetic-traffic gates that belong to fabric commissioning (this chapter). Collective-level acceptance (NCCL all-reduce busbw, MLPerf/OSU, contractual bisection bandwidth) needs real GPUs and lives in Chapter 13.9. Tool names are representative, not exhaustive; vendor-neutral equivalents exist for each.

Congestion and QoS verification: testing the failure case, not the happy path

A fabric that benchmarks beautifully on idle point-to-point tests can still collapse the moment real collective traffic creates contention — because the mechanisms that handle contention (priority flow control, ECN/DCQCN, adaptive routing, QoS classes) are exactly the mechanisms that idle tests never exercise. Congestion verification deliberately creates the failure case: an engineered incast (many senders to one receiver), a synchronized many-to-one pattern that mimics an all-reduce's reduction phase, and sustained saturation across shared spine links. Under that load you watch the congestion telemetry — PFC pause frames, ECN marks and CNP responses, switch egress drops, queue occupancy — and confirm the fabric does what its configuration claims.

The defects this gate exists to catch are the pathologies of lossless Ethernet (Chapter 8.6): a PFC deadlock or pause storm that propagates backward and freezes a whole region of the fabric; head-of-line blocking where one congested flow stalls unrelated flows sharing a queue; victim flows dropped because a misconfigured ECN threshold never engaged; and QoS classes that do not actually isolate (storage traffic starving compute traffic, or vice versa). On a tuned fabric this gate proves the PFC watchdog fires, DCQCN backs off senders before buffers overflow, packet spray / adaptive routing spreads load without reordering past the transport's tolerance, and the lossless guarantee holds with zero drops under saturation — the behavior NVIDIA markets as Spectrum-X's near-zero flow-collision loss, and the behavior any acceptance plan must verify rather than assume. The discipline, identical to the timing plane's, is that an operational mechanism tested only in the happy path has not been tested.
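The pass criteria for an engineered incast can be expressed over counter deltas captured across the test window. This is a sketch under assumptions: the `IncastResult` fields are hypothetical names for telemetry the text lists, fairness is scored with Jain's index (a common choice, not one the chapter prescribes), and the 0.9 fairness floor is illustrative.

```python
from dataclasses import dataclass
from typing import List


def jain_fairness(throughputs: List[float]) -> float:
    """Jain's index: 1.0 when all senders get equal share, 1/n at worst."""
    n = len(throughputs)
    sq = sum(x * x for x in throughputs)
    if n == 0 or sq == 0:
        return 0.0
    total = sum(throughputs)
    return (total * total) / (n * sq)


@dataclass
class IncastResult:
    lossless_drops: int       # drops in the lossless class during the test
    ecn_marks: int            # ECN-marked packets seen at the switch
    cnp_count: int            # congestion notification packets returned
    victim_drops: int         # drops on unrelated flows sharing the path
    sender_gbps: List[float]  # achieved rate per incast sender


def congestion_failures(r: IncastResult,
                        min_fairness: float = 0.9) -> List[str]:
    """Return the reasons the congestion gate fails; empty means it passed."""
    reasons = []
    if r.lossless_drops > 0:
        reasons.append(f"{r.lossless_drops} drops in the lossless class")
    if r.ecn_marks == 0 or r.cnp_count == 0:
        reasons.append("ECN never engaged under saturation")
    if r.victim_drops > 0:
        reasons.append(f"{r.victim_drops} victim-flow drops")
    fairness = jain_fairness(r.sender_gbps)
    if fairness < min_fairness:
        reasons.append(f"fairness index {fairness:.2f} below {min_fairness:.2f}")
    return reasons


if __name__ == "__main__":
    bad = IncastResult(0, 0, 0, 12, [90.0, 5.0, 3.0, 2.0])
    print(congestion_failures(bad))
```

Note that zero ECN marks is treated as a failure: under deliberate saturation, silence from the congestion-control path means it never engaged.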

Everything above concerns the scale-out fabric — the InfiniBand/Ethernet network of NICs, leaves, and spines that carries inter-node collectives. The scale-up fabric is a different animal entirely (Chapter 8.2): the NVLink/NVSwitch (or UALink, or vendor-equivalent) mesh that binds the GPUs within a node or rack into a single coherent accelerator, at roughly an order of magnitude more per-GPU bandwidth than the scale-out NIC. It gets its own acceptance gate because it fails its own way. A degraded NVLink lane does not drop packets you can count on a switch — it silently runs the intra-domain bandwidth below spec, capping tensor- and expert-parallel throughput inside the node without ever tripping a scale-out alarm.

The validation queries the NVLink/NVSwitch topology to confirm the domain is whole — that an NVL72 presents all 72 GPUs in one NVLink domain with every expected link present, not 71 with one GPU silently demoted to PCIe — and then measures per-link NVLink bandwidth to confirm each lane runs at full rate (NVLink 5 at 1.8 TB/s per GPU on Blackwell; ~130 TB/s aggregate per NVL72 rack). This gate also confirms the boundary between the two fabrics is clean: that scale-up traffic stays on NVLink and does not spill onto the scale-out fabric, and that the IMEX/MNNVL partitioning is configured so the domain is schedulable as designed. As copper gives way to optical scale-up in the Rubin-Ultra/Kyber generation (Chapter 8.2), the scale-up fabric inherits the same optic-screening discipline as scale-out — pushing BER and flap acceptance inside the rack, where it used to be a copper given.
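The domain-completeness check and the aggregate figure can be sketched together. The 72-GPU domain and the 1.8 TB/s nominal per-GPU rate come from the text (72 × 1.8 = 129.6, which rounds to the ~130 TB/s quoted); the function name, the input mapping, and the 5% tolerance are assumptions. In practice the pass threshold would be the vendor's expected measured value rather than the nominal rate.

```python
from typing import Dict, List


def nvlink_failures(measured_tbps: Dict[int, float],
                    expected_gpus: int = 72,
                    expected_tbps: float = 1.8,
                    tolerance: float = 0.05) -> List[str]:
    """Return the reasons the scale-up gate fails; empty means it passed.

    `measured_tbps` maps GPU index to its measured NVLink bandwidth.
    """
    reasons = []
    missing = sorted(set(range(expected_gpus)) - set(measured_tbps))
    if missing:
        reasons.append(f"GPUs missing from NVLink domain: {missing}")
    floor = expected_tbps * (1 - tolerance)
    for gpu, bw in sorted(measured_tbps.items()):
        if bw < floor:
            reasons.append(f"GPU {gpu}: {bw:.2f} TB/s below {floor:.2f} TB/s")
    return reasons


if __name__ == "__main__":
    print(f"nominal aggregate: {72 * 1.8:.1f} TB/s")
    domain = {i: 1.8 for i in range(72)}
    del domain[71]    # one GPU silently absent from the domain
    domain[10] = 1.2  # one GPU running on degraded links
    print(nvlink_failures(domain))
```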

PTP/IEEE-1588 time-sync as an acceptance gate

Chapter 8.7 made the case for a real PTP timing plane and ended by insisting that time-sync accuracy be a commissioning acceptance gate, not a configure-and-hope step. This is where that gate is actually executed. A timing plane that is designed but never measured is trusted on faith, and faith fails at the worst moment — the night a 4,000-GPU run stalls and the forensic reconstruction across thousands of telemetry streams turns out to be a smear because every clock was drifting by milliseconds. The acceptance gate converts "we configured PTP" into "the clock is provably trustworthy."

The gate has three measured criteria. First, offset-from-master across the fleet: a sweep confirming every node's PHC sits within a stated offset bound of the grandmaster (sub-microsecond is the target; tens-of-nanoseconds is achievable on modern Spectrum/ConnectX-class silicon). Second, accuracy under load: the offset bound must hold when the data plane is saturated, because PTP accuracy that degrades under traffic is exactly the accuracy you lose during the incident you are trying to diagnose. Third, holdover and failover: a deliberately injected GNSS-loss event to confirm the grandmaster coasts on its oscillator within the drift budget for the required holdover window, and a forced primary-GM failure to confirm BMCA elects the standby within spec. The honest version of this gate pulls the GNSS feed and measures the drift, rather than trusting the oscillator datasheet — the same design-for-the-failure-case discipline as the black-building test in Chapter 13.6. Pass these three and the timeline every later incident is reconstructed against is real.
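The three criteria can be combined into one verdict. This sketch assumes the offsets have already been collected per node in nanoseconds, once idle and once under data-plane saturation; the sub-microsecond bound follows the text, while the holdover and failover budgets are required inputs because the chapter leaves them to the site's spec. All names are illustrative.

```python
from typing import Dict, List


def ptp_failures(idle_ns: Dict[str, float],
                 loaded_ns: Dict[str, float],
                 holdover_drift_ns: float,
                 holdover_budget_ns: float,
                 failover_s: float,
                 failover_budget_s: float,
                 bound_ns: float = 1000.0) -> List[str]:
    """Return the reasons the time-sync gate fails; empty means it passed."""
    reasons = []
    # Criteria 1 and 2: every node inside the bound, idle and under load.
    for label, sweep in (("idle", idle_ns), ("loaded", loaded_ns)):
        for node, offset in sorted(sweep.items()):
            if abs(offset) >= bound_ns:
                reasons.append(f"{node} ({label}): offset {offset:.0f} ns")
    unmeasured = sorted(set(idle_ns) - set(loaded_ns))
    if unmeasured:
        reasons.append(f"no under-load measurement for: {unmeasured}")
    # Criterion 3: measured holdover drift and grandmaster failover time.
    if abs(holdover_drift_ns) > holdover_budget_ns:
        reasons.append(f"holdover drift {holdover_drift_ns:.0f} ns over budget")
    if failover_s > failover_budget_s:
        reasons.append(f"failover took {failover_s:.1f} s, over budget")
    return reasons


if __name__ == "__main__":
    idle = {"n1": 40.0, "n2": -55.0}
    loaded = {"n1": 120.0, "n2": 2300.0}  # n2 degrades under traffic
    print(ptp_failures(idle, loaded, 300.0, 1000.0, 2.0, 5.0))
```

The example shows the case the second criterion exists for: a node that is comfortably in bound when idle and out of bound when the data plane is saturated.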

  • ~1e-12: post-FEC BER pass line for AI fabric links (tightening toward 1e-13 at the highest lane rates). Source: IEEE 802.3 / IBTA link specifications; practitioner acceptance plans (2025).
  • 100–200 Gb/s: PAM4 SerDes per-lane rate driving 800G/1.6T links; FEC-mandatory, BER-screening-critical. Source: SemiAnalysis (AI networks); provenance.js optics ladder (2025).
  • ≥ 24 h: minimum link-flap soak under line-rate load at operating temperature before a link is accepted. Source: practitioner fabric-commissioning practice; Keysight test methodology (2025).
  • ~1–2 µs: InfiniBand point-to-point latency; tuned RoCEv2 ~1.5–2.5 µs. This is the acceptance band for ib_*_lat. Source: SemiAnalysis / NVIDIA; provenance.js IB-vs-RoCE (2025).
  • ~10 ns: PTP accuracy held across a Spectrum switch; ConnectX-class NIC timestamping under ~4 ns variance. Source: NVIDIA Technical Blog, Spectrum switch time-sync (2025).
  • sub-µs: fleet PTP offset-from-master target the time-sync gate must demonstrate, under load and across every node. Source: Engineering at Meta (SPTP); IEEE 1588 practice (2024).
  • ~95%: effective throughput a well-tuned AI Ethernet fabric (Spectrum-X) sustains; the congestion-gate target. Source: NVIDIA (Spectrum-X xAI Colossus) (2025).
  • ~130 TB/s: NVLink aggregate per GB200 NVL72 rack that the scale-up gate verifies whole (1.8 TB/s/GPU, NVLink 5). Source: NVIDIA; provenance.js NVLink (2025).

Deep dive: why the synthetic-vs-real-workload seam falls between 13.7 and 13.9

The natural question is why fabric commissioning stops at point-to-point and congestion tests and hands the collective benchmarking — the NCCL all-reduce sweep, the OSU/MLPerf runs, the bisection-bandwidth number that becomes a contractual SLA — to Chapter 13.9. The seam is set by a hard dependency: a meaningful collective benchmark needs real GPUs driving the NICs. A true all-reduce's bus bandwidth (busbw) depends on the GPUs computing the reduction, the NVLink domain feeding the NICs, and the full software stack (CUDA, the collective library, the topology-aware algorithm selection) — none of which exist until node bring-up (Chapter 13.8) is done. You can drive synthetic RDMA traffic from the NICs alone, with no GPU compute, and that is exactly what perftest does — which is why point-to-point and congestion validation belong here, before the GPUs are even integrated.

The consequence of getting this seam wrong is wasted schedule in both directions. Try to run collective benchmarks during fabric commissioning and you block on GPU availability the fabric team does not control, stalling the network punch list. Try to defer all fabric validation until the GPUs arrive and you discover your marginal optics and mis-cabled rails during the proxy training run, when every defect is now entangled with GPU, HBM, and software variables and the localization that was trivial at layer 1 becomes a multi-dimensional hunt. The right sequence is: fabric clean and proven with synthetic traffic (this chapter) → GPUs burned in (13.8) → cluster-scale collectives and the reference training run validating goodput against the SLA (13.9). Each stage hands the next a layer it no longer has to suspect.

Deep dive: the as-commissioned fabric baseline — and why operations cannot live without it

The last act of fabric commissioning is not a test but a capture: the as-commissioned baseline 'fingerprint' that day-2 operations will diff every future reading against (Chapter 13.2; operations consume it in Chapter 14.2). The reason this matters is that fabric health is fundamentally a change-detection problem. A pre-FEC BER of 5e-9 on a link means nothing in isolation — but if that link was commissioned at 5e-11, the two-order-of-magnitude creep is an optic degrading toward failure, and you can only see it if you saved the commissioned value. Without a baseline, every operational reading is a number with no referent, and the fleet flies blind until a link fails outright instead of being swapped during a maintenance window.
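The change-detection argument comes down to a ratio against the saved value. The sketch below measures creep in decades (orders of magnitude) and reproduces the example in the text; the one-decade alert threshold is an assumption for illustration.

```python
import math


def ber_creep_decades(baseline: float, current: float) -> float:
    """Orders of magnitude by which pre-FEC BER has moved since commissioning."""
    if baseline <= 0 or current <= 0:
        raise ValueError("BER values must be positive to compare on a log scale")
    return math.log10(current / baseline)


def is_degrading(baseline: float, current: float,
                 alert_decades: float = 1.0) -> bool:
    """True if the link has crept at least `alert_decades` above its baseline."""
    return ber_creep_decades(baseline, current) >= alert_decades


if __name__ == "__main__":
    # The example from the text: commissioned at 5e-11, now reading 5e-9.
    print(round(ber_creep_decades(5e-11, 5e-9), 3))
    print(is_degrading(5e-11, 5e-9))
```

Without the stored `baseline` argument the function cannot be called at all, which is the point of the paragraph above.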

The baseline that ships to operations is concrete: per-link pre-FEC and post-FEC BER and FEC-correction rates at acceptance; per-port flap-counter zeros and the soak result; the validated topology graph (the discovered adjacency that passed, as the golden reference for future drift detection); per-pair bandwidth and latency distributions; the congestion-test telemetry signatures; the NVLink domain inventory and per-link bandwidths; and the PTP offset-from-master distribution with the measured holdover curve. Alongside the numbers comes the link-health handoff: the punch list of links that were swapped or re-patched during commissioning (the population most likely to recur), the marginal-but-passing links flagged for early replacement, and the optic serial/lot data that lets operations correlate a future failure cluster back to a bad batch. This is the artifact that seeds the day-2 reliability program — it converts commissioning from a one-time gate into the first data point of a continuous fabric-health time series.
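One possible shape for the per-link part of that handoff is a flat record serialized to JSON. This is a sketch only: the field names, the `LinkBaseline` class, and the JSON layout are hypothetical, and a real register would also carry the topology graph, bandwidth and latency distributions, congestion signatures, NVLink inventory, and PTP results listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LinkBaseline:
    link_id: str
    pre_fec_ber: float   # value at acceptance
    post_fec_ber: float  # value at acceptance
    soak_flaps: int      # flap count over the soak (zero for accepted links)
    optic_serial: str
    optic_lot: str
    reworked: bool       # swapped or re-patched during commissioning
    marginal: bool       # passed, but flagged for early replacement


def to_register(links: List[LinkBaseline]) -> str:
    """Serialize the per-link baseline as stable, diff-friendly JSON."""
    return json.dumps([asdict(link) for link in links],
                      indent=2, sort_keys=True)


def watch_list(links: List[LinkBaseline]) -> List[str]:
    """Links operations should watch first: reworked or marginal-but-passing."""
    return [link.link_id for link in links if link.reworked or link.marginal]


if __name__ == "__main__":
    links = [
        LinkBaseline("leaf0/1", 5e-11, 0.0, 0, "SN0001", "LOT-A", False, False),
        LinkBaseline("leaf0/2", 3e-7, 0.0, 0, "SN0002", "LOT-A", True, False),
    ]
    print(watch_list(links))
    print(to_register(links))
```

Keeping the optic serial and lot on every record is what makes the later bad-batch correlation a query instead of an investigation.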

Anti-patterns

The same fabric-commissioning failures recur, because each comes from skipping a layer or testing only the happy path:

  • Accepting on post-FEC BER alone. Passing every link that delivers clean frames today, ignoring the pre-FEC margin FEC is burning to do it. The marginal optics pass commissioning and become production stragglers the first warm afternoon. Screen on margin, not on the masked symptom.
  • Skipping topology validation because BER passed. A cross-patched rail trains clean and benchmarks low, mimicking a congestion or tuning problem. Days are lost at the wrong layer because nobody diffed the discovered graph against the cabling map first.
  • Commissioning the fabric cold. Running BER and flap checks at ambient and accepting the result, then watching thermal-marginal optics fail once the hall reaches its operating temperature. Soak under realistic inlet temperature or you have not tested the link you will operate.
  • PTP configured but never gated. Treating time-sync as a setup step rather than a measured acceptance criterion — no offset sweep across every node, no under-load test, no holdover pull. The first uncorrelatable incident reveals the timeline was never trustworthy. → Chapter 8.7.
  • No baseline captured. Declaring the fabric ready without saving the as-commissioned fingerprint, leaving operations with no referent to diff against and no way to catch a degrading link before it fails outright. → Chapter 14.2.

Fabric commissioning consumes the design decisions made across Part 8: rail-optimized topology and oversubscription (Chapter 8.5), the transport and protocol semantics being validated (Chapter 8.4), the congestion-control mechanisms stress-tested here (Chapter 8.6), the scale-up NVLink domain (Chapter 8.2), the physical-layer optics and FEC (Chapter 8.9), and the PTP timing plane whose acceptance gate this chapter executes (Chapter 8.7). Within Part 13, this gate sits between integrated-systems testing (Chapter 13.6) and node burn-in (Chapter 13.8), and hands cluster-scale collective benchmarking, the reference training run, and the contractual goodput SLA to Chapter 13.9. The acceptance-script and baseline-capture discipline is set in Chapter 13.2; the link-health register seeds the day-2 observability program in Chapter 14.2.