Chapter 8.4
Scale-Out Fabric: Protocols, Standards & Transport
The scale-out protocol you commit to is not a wiring detail — it is a 3–5 year bet on who supplies your switches, how much of your link rate survives as goodput under collective load, and whether you can ever leave the vendor whose congestion-control firmware your training run silently depends on.
What you'll decide here
- Which back-end transport you standardize on: InfiniBand, a tuned RoCEv2 underlay, NVIDIA Spectrum-X, or an Ultra Ethernet (UEC) fabric. The choice sets your switch vendor list, your operational skill base, and your effective-throughput ceiling for the life of the cluster.
- Whether you accept the lossless-Ethernet bargain (PFC + ECN/DCQCN) and its head-of-line-blocking and deadlock failure modes, or move to a lossy, packet-spray transport that reorders in the NIC and tolerates drops by design.
- How much you are willing to pay — in switch premium, NIC lock-in, and integration risk — to close the gap between raw link rate and delivered all-reduce bandwidth, versus living with the RoCE in-order penalty.
- Whether the fabric is single-tenant (a captive training supercomputer) or must enforce hard multi-tenant isolation (PKeys, VXLAN/EVPN, DPU-enforced VPC) — a decision that constrains protocol choice and the security boundary at once.
- Which parts of the stack you are willing to leave proprietary-and-fast today and re-decide as an open standard (UEC) matures — i.e. where you buy InfiniBand or Spectrum-X now and keep an Ethernet exit open.
A scale-out fabric exists to do one thing well: carry the collective communication of a distributed training or inference job — all-reduce, all-gather, reduce-scatter, all-to-all — across the back-end network that stitches scale-up domains together, without becoming the bottleneck that idles the most expensive silicon in the building. The accelerators are bought; the power is contracted; the only variable left is how much of every step is spent computing versus waiting on the network. That fraction is set, more than by any topology decision, by the transport protocol running on the wire and in the NIC. This chapter is about that choice and its consequences.
The decision presents as four options — InfiniBand, RoCEv2 on lossless Ethernet, NVIDIA Spectrum-X, and Ultra Ethernet (UEC) — but it is really one fork asked twice. First: do you run a purpose-built lossless transport (InfiniBand) or Ethernet? Second, if Ethernet: do you accept RoCE's in-order, lossless-by-PFC model, or do you adopt a modern AI-Ethernet (Spectrum-X today, UEC tomorrow) that sprays packets across paths and reorders in the NIC? Each answer cascades into a switch-vendor list, an operational skill base, a congestion-control parameter space, and — the part nobody prices at scoping time — a multi-year lock-in to whoever owns the firmware your goodput silently depends on.
The protocol war: four answers to one question
For most of a decade the scale-out question had two default answers: InfiniBand for frontier training, and RoCEv2 on Ethernet for everyone cost-sensitive. That binary has shattered. Two forces broke it: the hyperscalers' refusal to single-source their largest capex line on one vendor's proprietary fabric, and the arrival of AI-tuned Ethernet that claims InfiniBand-class effective throughput on a merchant-silicon supply chain. The result is a genuine four-way fork, and the right answer now depends on cluster scale, tenancy model, in-house operational depth, and how much vendor lock-in you will tolerate against how much goodput you are willing to leave on the table.
InfiniBand (NVIDIA Quantum) is the incumbent for tightly-coupled training. It is lossless by construction — credit-based flow control means a sender never transmits a packet the receiver has no buffer for, so the fabric does not drop frames under congestion the way Ethernet does. It carries native RDMA, adaptive routing, and in-network reduction via SHARP (collective offload). The cost is a single-vendor supply chain (NVIDIA end-to-end: switch, NIC, cable, subnet manager), a separate operational discipline most Ethernet teams do not have, and a price premium. → in-network compute and SHARP are engineered in Chapter 8.6; the switch and NIC silicon in Chapter 8.3.
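The credit mechanism is easier to see in a toy model. The sketch below is a minimal illustration of one credit-controlled hop, not the InfiniBand wire protocol: the class `CreditLink`, the function `run`, and the tick-based timing are all assumptions made for the example. It shows the one property the text relies on, namely that a sender gated by credits stalls rather than overrunning the receiver.

```python
from collections import deque


class CreditLink:
    """Toy model of one credit-flow-controlled hop (illustrative only)."""

    def __init__(self, rx_buffer_slots: int):
        self.capacity = rx_buffer_slots
        self.credits = rx_buffer_slots   # sender's view of free receiver slots
        self.rx_buffer = deque()         # packets waiting at the receiver
        self.dropped = 0                 # never incremented: no overrun is possible

    def try_send(self, packet) -> bool:
        """Transmit only if a credit is available; otherwise the sender waits."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(packet)
        assert len(self.rx_buffer) <= self.capacity
        return True

    def drain_one(self):
        """Receiver consumes one packet and hands one credit back."""
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


def run(total_packets: int, slots: int, drain_every: int) -> dict:
    """Sender offers one packet per tick; receiver drains once every N ticks."""
    link = CreditLink(slots)
    sent = delivered = stalls = tick = 0
    while delivered < total_packets:
        tick += 1
        if sent < total_packets:
            if link.try_send(sent):
                sent += 1
            else:
                stalls += 1              # back-pressure: wait, do not drop
        if tick % drain_every == 0 and link.drain_one() is not None:
            delivered += 1
    return {"delivered": delivered, "dropped": link.dropped, "stalls": stalls}


result = run(total_packets=100, slots=4, drain_every=3)
print(result)
```

With a receiver three times slower than the sender, the run finishes with every packet delivered and a non-zero stall count: congestion shows up as waiting, never as loss.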
RoCEv2 (RDMA over Converged Ethernet v2) puts RDMA semantics on a routed UDP/IP Ethernet underlay. It promises the commodity economics and multi-vendor supply of Ethernet with RDMA's CPU-bypass data path. The catch is that classic RoCE inherited the go-back-N assumption of an ordered, lossless link: a single out-of-order or dropped packet forces retransmission from that point, so the fabric must be engineered lossless with Priority Flow Control (PFC) and congestion-managed with ECN/DCQCN. That engineering is the hard part, and the failure modes (head-of-line blocking, PFC deadlock, victim flows) recur throughout this chapter. Meta runs production RoCE at 24k-to-100k-GPU scale, which proves it works — but their published account makes clear how much fabric-engineering investment that took.
NVIDIA Spectrum-X is Ethernet that behaves like a purpose-built AI fabric: a Spectrum-4/5 switch plus a BlueField/ConnectX SuperNIC, doing adaptive routing, per-packet spraying, NIC-side reordering, and programmatic congestion control so that effective throughput approaches InfiniBand without the in-order penalty. NVIDIA markets ~95% effective throughput versus the ~60% an untuned vanilla-RoCE fabric can collapse to under all-to-all collisions, and xAI's Colossus is the flagship production proof. The cost: it is Ethernet on the wire but still effectively an NVIDIA end-to-end story. The switch and the SuperNIC come from one vendor, so the lock-in is real even though the protocol is nominally open.
Ultra Ethernet (UEC) is the industry's open answer — a full transport stack (UET) specified by the Ultra Ethernet Consortium, whose v1.0 specification landed in June 2025. It is designed from scratch for AI/HPC: packet spray across multipath, out-of-order delivery decoupled from message ordering with reassembly in the NIC, modern congestion control (UCCM), packet trimming for fast loss signaling, and native RDMA. The promise is Spectrum-X-class behavior on a genuinely multi-vendor (AMD, Broadcom, Cisco, Arista, HPE, Intel, Meta, Microsoft, and others — 100+ members) supply chain. The cost in 2026 is maturity: shipping silicon and interoperable, battle-tested deployments are still ramping, so adopting UEC today is a bet on a roadmap, not a purchase of a finished thing. → standards trajectory consolidated in Chapter 16.2.
| Transport | Loss model | Ordering | Effective throughput | Supply chain / lock-in | Best-fit |
|---|---|---|---|---|---|
| InfiniBand (NVIDIA Quantum) | Lossless — credit-based flow control | In-order; adaptive routing | Highest known; SHARP offloads reductions | Single-vendor end-to-end (switch+NIC+SM) | Frontier training; max goodput, willing to lock in |
| RoCEv2 (lossless Ethernet) | Lossless engineered via PFC; lossy if mis-tuned | In-order — go-back-N penalty on reorder/drop | ~95% tuned; collapses to ~60% untuned | Multi-vendor merchant silicon; you own the tuning | Cost-sensitive at scale with deep fabric-eng teams |
| NVIDIA Spectrum-X | Lossy-tolerant; adaptive routing + NIC reorder | Out-of-order on wire; reassembled in SuperNIC | ~95% effective; near-IB without in-order penalty | Ethernet wire, but NVIDIA switch+SuperNIC pair | Large Ethernet AI clusters wanting near-IB now |
| Ultra Ethernet / UEC (UET) | Lossy by design; packet trimming for fast signal | Out-of-order; reorder decoupled from message order | Targets IB-class; depends on NIC/switch maturity | Open, multi-vendor (100+ members); maturing | Hyperscaler/neocloud avoiding single-source |
Transport semantics: lossless vs lossy, ordered vs out-of-order
Underneath the four product names sit two orthogonal semantic axes that actually determine behavior. First, does the fabric drop packets under congestion (lossy) or refuse to (lossless)? Second, must packets arrive in order, or can the endpoint reassemble an out-of-order stream? Every transport is a point in that 2×2, and the two AI-Ethernet entrants exist precisely because the historically dominant corner, lossless-and-ordered, turned out to be a trap at scale.
Lossless means the fabric exerts back-pressure to prevent buffer overflow rather than dropping frames. InfiniBand does this natively with credits. Ethernet retrofits it with PFC (IEEE 802.1Qbb): when an ingress buffer fills, the switch sends a PAUSE upstream for that traffic class, stopping the sender. The problem is that PFC is a blunt, per-class, per-link hammer — it pauses all flows in the class on that link, not just the one causing congestion. That is head-of-line blocking: an innocent flow sharing the link is paused for a congested neighbor it has nothing to do with. Worse, PAUSE propagates hop-by-hop upstream, and in a multi-tier Clos with a cyclic buffer-dependency it can deadlock the fabric entirely — a class of failure that needs watchdogs to detect and break. → the PFC/ECN/DCQCN parameter space and its pathologies are the subject of Chapter 8.6.
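The deadlock condition can be stated as a graph property: if "buffer A cannot drain until buffer B drains" forms a cycle, PAUSE can hold every buffer in the cycle forever. The sketch below is a plain depth-first cycle search over such a dependency graph. The function name `find_pause_cycle` and the buffer labels are hypothetical, and a real PFC watchdog works from live pause counters rather than a static graph, so treat this as an illustration of the condition rather than of any vendor's detector.

```python
from typing import Dict, List, Optional


def find_pause_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in a buffer-dependency graph, or None if it is acyclic.

    depends_on[a] lists the buffers that must drain before buffer a can drain,
    i.e. a stays paused until they free space.
    """
    WHITE, GREY, BLACK = 0, 1, 2         # unvisited, on current path, finished
    color = {node: WHITE for node in depends_on}
    for targets in depends_on.values():
        for target in targets:
            color.setdefault(target, WHITE)
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in depends_on.get(node, []):
            if color[nxt] == GREY:       # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical buffer names; an up/down Clos path has no cycle.
healthy = {"leaf1": ["spine1"], "spine1": ["leaf2"], "leaf2": []}
# A rerouted path that bounces back up creates a cyclic dependency.
looped = {"leaf1": ["spine1"], "spine1": ["leaf2"],
          "leaf2": ["spine2"], "spine2": ["leaf1"]}

print(find_pause_cycle(healthy))  # None
print(find_pause_cycle(looped))   # ['leaf1', 'spine1', 'leaf2', 'spine2', 'leaf1']
```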
Ordered delivery is the second trap. Classic RoCE assumes a packet stream arrives in sequence; an out-of-order arrival is treated as a loss, triggering go-back-N retransmission — the receiver discards everything after the gap and the sender retransmits from the lost packet forward. This is the RoCE in-order penalty, and it is why classic RoCE cannot freely spray a flow across multiple equal-cost paths: if two paths have different latency, the resulting reordering looks like loss and triggers cascading retransmits. So vanilla RoCE pins each flow to a single path (ECMP hashing on the flow tuple), which means a single hot link can bottleneck a flow while parallel links sit idle — exactly the all-to-all collision pattern that collapses untuned RoCE goodput.
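A few lines of code make the size of the penalty concrete. The sketch below is a deliberately simplified receiver model (no NAK timing, no windows, hypothetical function names): it counts how many packets the sender must resend after a given arrival order, once under go-back-N rules and once under selective retransmission.

```python
def go_back_n_retransmits(arrivals, total):
    """Packets resent when the receiver accepts only the next expected sequence."""
    expected = 0
    for seq in arrivals:
        if seq == expected:
            expected += 1
        # Any other sequence number is treated as out of sequence and discarded.
    return total - expected              # sender rewinds to `expected`


def selective_retransmits(arrivals, total):
    """Packets resent when the receiver keeps whatever arrives, in any order."""
    return total - len(set(arrivals))


reordered = [0, 2, 1, 3, 4, 5, 6, 7]     # packets 1 and 2 swapped, nothing lost
one_drop = [0, 1, 3, 4, 5, 6, 7]         # packet 2 lost

print(go_back_n_retransmits(reordered, 8), selective_retransmits(reordered, 8))  # 6 0
print(go_back_n_retransmits(one_drop, 8), selective_retransmits(one_drop, 8))    # 6 1
```

In this model a single swap with no loss at all costs six retransmissions under go-back-N and none under selective retransmission, which is why a flow-pinned transport cannot afford multipath reordering.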
Spectrum-X and UEC both attack the same root cause: they decouple wire order from message order. The NIC sprays packets of a single flow across many paths, accepts them out of order, and reassembles the message in hardware before delivering it in order to the application. A delay, a bit of congestion, or a single dropped packet on one path no longer poisons the whole flow — selective retransmission replaces go-back-N, and the cost of a drop falls from a full round-trip stall to one packet. UEC adds packet trimming: instead of dropping a congested packet outright, the switch truncates it to its header and forwards the stub, so the receiver learns of the loss immediately and signals a fast, surgical retransmit. This is the single most important semantic shift in the chapter — and it is why the protocol war is really a war over who owns the NIC that does the reordering.
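The receive side of that mechanism can be sketched as a per-message reassembly buffer. Everything named here (`MessageReassembler`, `on_packet`, the `trimmed` flag) is hypothetical and ignores what real NICs must handle in hardware, such as many concurrent messages, timers, and buffer limits. It only shows the three behaviors described above: accept any order, turn a trimmed stub into an immediate single-packet retransmit request, and deliver the message in order once complete.

```python
class MessageReassembler:
    """Toy reorder buffer for one message split into fixed sequence slots."""

    def __init__(self, num_packets: int):
        self.slots = [None] * num_packets
        self.received = 0
        self.retransmit_requests = []    # sequence numbers to ask the sender for

    def on_packet(self, seq: int, payload: bytes = b"", trimmed: bool = False):
        if trimmed:
            # Header-only stub: the payload was cut by a congested switch,
            # so request just this packet instead of waiting for a timeout.
            if self.slots[seq] is None and seq not in self.retransmit_requests:
                self.retransmit_requests.append(seq)
            return
        if self.slots[seq] is None:      # ignore duplicates
            self.slots[seq] = payload
            self.received += 1
            if seq in self.retransmit_requests:
                self.retransmit_requests.remove(seq)

    def complete(self) -> bool:
        return self.received == len(self.slots)

    def message(self) -> bytes:
        """Deliver in message order, regardless of arrival order."""
        if not self.complete():
            raise ValueError("message incomplete")
        return b"".join(self.slots)


rx = MessageReassembler(4)
rx.on_packet(2, b"CC")                   # arrives first via a fast path
rx.on_packet(0, b"AA")
rx.on_packet(3, trimmed=True)            # stub only: payload was trimmed
rx.on_packet(1, b"BB")
print(rx.retransmit_requests)            # [3]
rx.on_packet(3, b"DD")                   # the single retransmitted packet
print(rx.message())                      # b'AABBCCDD'
```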
The RoCE in-order penalty, quantified
The headline number that drives the entire AI-Ethernet movement is the gap between tuned and untuned RoCE effective throughput. An untuned RoCEv2 fabric running a heavy all-to-all can deliver as little as ~60% of raw link rate, because ECMP flow-pinning collides multiple elephant flows onto the same link while parallel paths idle, and the go-back-N penalty amplifies every collision into retransmission. A well-tuned RoCEv2 underlay (careful PFC thresholds, ECN marking, DCQCN parameters, and sometimes per-packet load-balancing tricks) closes ~80–90% of the gap to InfiniBand, landing at ~95% effective throughput. That delta is not academic: at 100,000-GPU scale, where the network is the second-largest line item after the accelerators themselves, the difference between 60% and 95% goodput is the difference between a cluster that hits its job-completion-time targets and one that quietly burns a third of its fabric capex on retransmits.
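The figures above are mutually consistent, which a short calculation confirms. The only number not taken from the text is the reference point: the sketch assumes InfiniBand delivers essentially the full link rate (`REFERENCE = 1.00`), which is a simplification for illustration.

```python
def tuned_goodput(untuned: float, reference: float, gap_closed: float) -> float:
    """Goodput after closing a fraction of the gap between untuned and reference."""
    return untuned + gap_closed * (reference - untuned)


UNTUNED = 0.60     # from the text: untuned RoCEv2 under heavy all-to-all
TUNED = 0.95       # from the text: well-tuned RoCEv2
REFERENCE = 1.00   # ASSUMPTION: InfiniBand treated as full link rate

low = tuned_goodput(UNTUNED, REFERENCE, 0.80)
high = tuned_goodput(UNTUNED, REFERENCE, 0.90)
lost_share = TUNED - UNTUNED             # link rate left unused by not tuning

print(f"tuned range: {low:.0%} to {high:.0%}")          # tuned range: 92% to 96%
print(f"link rate forfeited untuned: {lost_share:.0%}")  # link rate forfeited untuned: 35%
```

Closing 80 to 90 percent of a 40-point gap lands between 92% and 96%, bracketing the ~95% headline, and the 35-point shortfall of an untuned fabric is the "third" of capacity the paragraph refers to.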
The consequence for the decision: RoCE's cost advantage is real only if you have the team to capture it. The merchant-silicon switch is cheaper than InfiniBand, but the tuning labor, the validation rigs, and the on-call fabric expertise are not. Spectrum-X and UEC exist precisely to sell that tuning as a product — to deliver the ~95% number out of the box by moving the load-balancing and reordering into the silicon, so the operator does not have to become a congestion-control researcher. You are, in effect, choosing whether to buy the goodput as a capability (Spectrum-X/UEC) or build it as a competency (hardened RoCE).
Standards trajectory: where the protocols are heading
The strategic shape of 2026 is convergence-toward-Ethernet with a proprietary lead. InfiniBand retains the goodput and SHARP-offload crown for the most tightly-coupled frontier runs, but its addressable share is being squeezed from both sides: Spectrum-X gives NVIDIA an Ethernet answer for customers who demand it, and UEC gives everyone else an open one. The UEC 1.0 specification (June 2025) is the inflection — it is a 560-plus-page, vertically-integrated stack (physical, link, transport, software, storage, management) that, for the first time, gives the merchant ecosystem a complete blueprint to build AI-Ethernet NICs and switches that interoperate. The bet the industry is making is that open multi-vendor UEC silicon reaches Spectrum-X parity within a generation or two, at which point the question flips from "why would I leave InfiniBand?" to "why would I single-source?"
For the decision-maker, this trajectory argues for explicit optionality management. If you must deploy at frontier scale today, buy the finished proprietary fabric and capture the goodput — but build the cluster's physical layer (cabling, optics, structured plant) to a generic Ethernet/InfiniBand-agnostic standard so the transport can be re-decided at refresh. The physical investment is the irreversible part; the protocol running over it is comparatively reversible if you did not hard-wire your topology to one vendor's switch radix. → the full subsystem roadmap, including UEC milestones and the 800G→1.6T→3.2T optics ladder, is consolidated in Chapter 16.2; the physical-layer choices that preserve or destroy that optionality are in Chapter 8.9 and Chapter 8.10.
Deep dive: why packet spray + NIC reorder wins (and what it costs)
Strip away the brand names and the entire AI-Ethernet revolution reduces to one mechanism: spray a single flow's packets across every available path, accept them out of order, and reassemble in the NIC. Understanding why this is so powerful — and what it costs — is understanding the chapter.
Why it wins. AI collectives generate a small number of enormous "elephant" flows (a single GPU's all-reduce contribution can be gigabytes). Classic ECMP hashes each flow to one path, so two elephants can collide on one link while seven parallel links sit empty — and because RoCE is in-order, you cannot simply split the elephant across the seven free links without triggering go-back-N. Packet spray breaks the flow into per-packet (or per-"entropy") units load-balanced across all paths, so an elephant uses the full bisection bandwidth and a single congested link costs one packet, not a flow. This is why sprayed fabrics hold ~95% effective throughput where flow-pinned RoCE collapses under the same all-to-all.
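The load-balancing difference can be simulated in a few lines. The sketch below assumes eight equal-cost links and eight equal-size elephant flows. The addresses and source ports are made up, CRC32 stands in for whatever hash a real switch uses, and round-robin stands in for a real spraying policy. It compares the busiest link under flow pinning with the busiest link under per-packet spray; completion time tracks the busiest link.

```python
import zlib


def ecmp_link(flow, num_links: int) -> int:
    """Pin a flow to one link by hashing its tuple (CRC32 as a stand-in hash)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_links


def link_loads_pinned(flows, packets_per_flow: int, num_links: int):
    """Every packet of a flow follows the flow's hashed link."""
    loads = [0] * num_links
    for flow in flows:
        loads[ecmp_link(flow, num_links)] += packets_per_flow
    return loads


def link_loads_sprayed(flows, packets_per_flow: int, num_links: int):
    """Each packet takes the next link in turn (round-robin spray)."""
    loads = [0] * num_links
    for idx, _flow in enumerate(flows):
        for pkt in range(packets_per_flow):
            loads[(idx + pkt) % num_links] += 1
    return loads


NUM_LINKS = 8
PACKETS = 1000
# Hypothetical flow tuples: (src, dst, dst_port, src_port); 4791 is the RoCEv2 UDP port.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 4791, 49152 + i) for i in range(8)]

pinned = link_loads_pinned(flows, PACKETS, NUM_LINKS)
sprayed = link_loads_sprayed(flows, PACKETS, NUM_LINKS)
ideal = len(flows) * PACKETS / NUM_LINKS

print("pinned  loads:", pinned, "busiest/ideal =", max(pinned) / ideal)
print("sprayed loads:", sprayed, "busiest/ideal =", max(sprayed) / ideal)
```

Spray always lands on the ideal load here because the packet count divides evenly across links. The pinned result depends on the hash, and with as many flows as links a collision-free assignment is the exception rather than the rule.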
What it costs. The reassembly is not free. The NIC must buffer out-of-order packets, track per-message completion, and reorder in hardware at line rate — which is precisely why these are SuperNICs (ConnectX/BlueField for Spectrum-X) or UET-compliant NICs, not commodity Ethernet adapters. That is the lock-in vector: the protocol is open-ish, but the NIC that makes it fast is sophisticated, and the switch and NIC must agree on the spraying/trimming scheme. So "open Ethernet" still means "a NIC and switch that implement the same advanced transport," which in 2026 is a short list of suppliers. UEC's value is that it standardizes the contract between switch and NIC so that, eventually, a Broadcom switch and an AMD NIC interoperate — turning today's vendor-pair lock-in into tomorrow's mix-and-match. Until that interop is proven in production, the open path carries integration risk that the proprietary path has already retired.
Multi-tenant fabrics: isolation as a fabric concern
Everything above assumes a single-tenant supercomputer. The moment the fabric serves multiple customers — any GPU neocloud, any cloud GPU service — the transport choice collides with an isolation requirement, and the two cannot be decided separately. A multi-tenant back-end fabric must guarantee that one tenant cannot see, congest, or attack another tenant's traffic, and it must do so on a fabric whose entire reason for existing is to be lossless and high-bandwidth — properties that fight against the buffering and policing that isolation traditionally relies on.
The mechanisms differ by transport. InfiniBand isolates with partition keys (PKeys) enforced by the subnet manager — a tenant's nodes share a PKey and cannot communicate across partition boundaries. Ethernet fabrics overlay tenancy with VXLAN/EVPN (network virtualization that gives each tenant an isolated L2/L3 segment over a shared underlay). The strongest 2026 pattern pushes enforcement into the DPU: a BlueField/Pensando-class DPU at the host edge terminates the tenant's VPC, enforces microsegmentation and encryption, and rate-limits per-tenant — so isolation is enforced in hardware at the boundary of every server rather than trusted to the fabric core. SemiAnalysis's ClusterMAX rating treats DPU-enforced VPC isolation as a maturity signal that separates serious neoclouds from those still relying on coarse VLAN/VXLAN segmentation.
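The PKey check itself is simple enough to sketch. One detail below goes beyond what this chapter states and comes from the InfiniBand architecture: a P_Key is a 16-bit value whose top bit marks full versus limited membership, and two limited members of the same partition cannot talk to each other. The key values and function names are hypothetical, and real enforcement happens in the HCA and switch hardware using tables programmed by the subnet manager.

```python
FULL_MEMBER_BIT = 0x8000   # top bit of the 16-bit P_Key: 1 = full, 0 = limited
KEY_MASK = 0x7FFF          # low 15 bits identify the partition


def pkeys_allow(a: int, b: int) -> bool:
    """Two P_Keys match if they name the same partition and at least one is full."""
    same_partition = (a & KEY_MASK) == (b & KEY_MASK)
    one_is_full = bool((a | b) & FULL_MEMBER_BIT)
    return same_partition and one_is_full


def can_communicate(table_a, table_b) -> bool:
    """Ports can talk if any pair of entries in their P_Key tables matches."""
    return any(pkeys_allow(a, b) for a in table_a for b in table_b)


# Hypothetical tenants: partition 1 and partition 2.
tenant_a_node = [0x8001]           # full member of partition 1
tenant_b_node = [0x8002]           # full member of partition 2
shared_storage = [0x8001, 0x8002]  # full member of both partitions
limited_client = [0x0001]          # limited member of partition 1

print(can_communicate(tenant_a_node, tenant_b_node))    # False
print(can_communicate(tenant_a_node, shared_storage))   # True
print(can_communicate(limited_client, tenant_a_node))   # True
print(can_communicate(limited_client, limited_client))  # False
```

The model makes the limitation in the text visible: the check answers only "who may talk to whom" and carries no notion of encryption, rate limits, or per-tenant addressing.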
The consequence for protocol choice: hard multi-tenancy biases toward Ethernet + DPU. InfiniBand PKeys give partitioning but the DPU-enforced VPC model — full network virtualization, per-tenant encryption, line-rate microsegmentation — maps more naturally onto an Ethernet fabric with programmable DPUs at the edge. A neocloud that must rent isolated slices to mutually-distrusting tenants on shared GPUs is therefore pulled toward the Spectrum-X / UEC + DPU side of the fork, where the isolation primitives and the security tooling are richer — even before goodput enters the conversation. This is where the fabric decision and the security decision merge: the same DPU that enforces the tenant boundary is the zero-trust enforcement point. → the tenant security boundary is engineered in Chapter 11.6, and zero-trust microsegmentation in Chapter 11.7; the DPU silicon itself in Chapter 8.3.
| Mechanism | Fabric | What it isolates | Strength | Cost / limitation |
|---|---|---|---|---|
| Partition keys (PKeys) | InfiniBand | Membership — who can talk to whom | Strong partitioning, SM-enforced | Coarse; no per-tenant encryption or VPC semantics |
| VLAN | Ethernet | L2 broadcast domain | Weak — 4,094 ID limit, no overlay | Does not scale to cloud tenancy; trivial to misconfigure |
| VXLAN / EVPN | Ethernet | Virtualized L2/L3 per tenant over shared underlay | Scalable overlay isolation | Underlay still shared; congestion isolation imperfect |
| DPU-enforced VPC | Ethernet (+ DPU) | Full per-tenant VPC, microseg, encryption, rate-limit | Strongest — hardware enforcement at host edge | Requires a DPU per host; cost and integration |
Deep dive: the operational-skill tax nobody scopes
The line item that wrecks fabric decisions is not on any vendor quote: the operational-skill tax. Each transport demands a different competency, and the cluster's reliability is hostage to whether you have it.
InfiniBand needs a subnet-manager discipline: managing the SM, partitioning, routing tables, and ibdiagnet-driven BER validation — a skill set that lives mostly in HPC shops and inside NVIDIA's ecosystem, scarce in general cloud-ops teams. RoCE needs a congestion-control research competency: someone who genuinely understands PFC thresholds, ECN marking points, DCQCN dynamics, and how they interact at your scale and topology — and who can debug a PFC deadlock from telemetry at 3 a.m. Spectrum-X trades much of that for vendor-managed behavior, but ties you to NVIDIA's switch-and-SuperNIC support model and its firmware cadence. UEC, in its current maturity, demands the most integration skill — making a multi-vendor NIC/switch combination interoperate and perform — which is exactly why its early adopters are hyperscalers with deep in-house networking teams, not enterprises.
The decision rule: pick the transport whose skill tax you can actually pay, not the one with the best spec sheet. A cost-optimal RoCE fabric operated by a team without congestion-control depth will under-deliver goodput so badly that the cheaper switch becomes the more expensive cluster. Conversely, a hyperscaler with a standing fabric-engineering function can extract RoCE/UEC's economics that an enterprise cannot. The fabric is only as good as the team that tunes it — and that team's cost belongs in the BOM. → fabric commissioning and the validation gates that catch a mis-tuned fabric before production are in Chapter 13.7; congestion telemetry and observability in Chapter 10.6 and Chapter 14.2.
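One way to put the team's cost in the BOM is to price the fabric per unit of delivered goodput rather than per port. The sketch below does that arithmetic. Every input except the 60% and 95% goodput fractions is an arbitrary placeholder chosen for illustration, not a market figure, and the function name is hypothetical.

```python
def cost_per_delivered_unit(fabric_capex: float, annual_team_cost: float,
                            years: float, goodput_fraction: float) -> float:
    """Total fabric cost divided by the share of link rate actually delivered."""
    total_cost = fabric_capex + annual_team_cost * years
    return total_cost / goodput_fraction


YEARS = 4  # placeholder service life

# PLACEHOLDER inputs in arbitrary cost units, for illustration only.
cheap_untuned = cost_per_delivered_unit(
    fabric_capex=100, annual_team_cost=5, years=YEARS, goodput_fraction=0.60)
premium_tuned = cost_per_delivered_unit(
    fabric_capex=130, annual_team_cost=5, years=YEARS, goodput_fraction=0.95)

print(f"cheap switch, untuned : {cheap_untuned:.1f} per delivered unit")
print(f"premium fabric, tuned : {premium_tuned:.1f} per delivered unit")
```

With these placeholders the fabric that is 30% more expensive to buy is about a fifth cheaper per unit of delivered bandwidth. The crossover point moves with the inputs, which is the reason to run the calculation with your own quotes and staffing costs.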
Putting it together: how to choose
The choice resolves along four questions, in order. (1) Scale and coupling. Frontier, tightly-coupled training that lives and dies on all-reduce goodput, with a team fluent in HPC fabrics, still defaults to InfiniBand — the SHARP offload and in-order losslessness remain the highest-goodput known quantity. (2) Supply-chain strategy. If single-sourcing your largest network capex on one vendor is strategically unacceptable, you are on the Ethernet side of the fork regardless of goodput — the only question is which Ethernet. (3) Time vs maturity. If you need near-InfiniBand goodput on Ethernet today, Spectrum-X is the finished product and you accept the NVIDIA switch+SuperNIC pairing; if you can absorb integration risk to own a multi-vendor future, UEC is the bet. (4) Tenancy. Hard multi-tenant isolation pulls toward Ethernet + DPU, where VPC virtualization and microsegmentation are native.
Two anti-patterns recur. The first is buying link rate instead of goodput — specifying a faster port and getting a slower cluster because the transport collapses under collisions; the fix is to benchmark delivered all-reduce bandwidth, not read the port label. The second is adopting RoCE without the team to tune it — capturing Ethernet's switch discount while forfeiting its goodput, so the "cheaper" fabric runs at 60% and idles the accelerators it was supposed to feed. Both come from deciding on the spec sheet rather than on delivered goodput and total operational cost. The fabric exists to keep the goodput high; price it, validate it, and staff it against that job. → the topology and oversubscription decisions that sit on top of this transport choice are in Chapter 8.5; the traffic characterization that motivates all of it in Chapter 8.1; the scale-up domain this fabric stitches together in Chapter 8.2.
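Benchmarking delivered all-reduce bandwidth usually means reporting bus bandwidth, the convention used by the nccl-tests suite: algorithm bandwidth (bytes divided by time) scaled by 2(n-1)/n for an all-reduce across n ranks, which makes the result comparable to the per-port link rate. The sketch below applies that formula. The measurement values are hypothetical, and the function names are illustrative.

```python
def allreduce_bus_bandwidth(message_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in bytes/s for an all-reduce, per the nccl-tests convention."""
    algbw = message_bytes / seconds
    return algbw * 2 * (ranks - 1) / ranks


def goodput_fraction(busbw_bytes_per_s: float, link_gbps: float) -> float:
    """Delivered bus bandwidth as a share of the port's raw line rate."""
    link_bytes_per_s = link_gbps * 1e9 / 8
    return busbw_bytes_per_s / link_bytes_per_s


# HYPOTHETICAL measurement: 1 GB all-reduce across 8 ranks in 40 ms on 400 Gb/s ports.
busbw = allreduce_bus_bandwidth(message_bytes=1e9, seconds=0.040, ranks=8)
share = goodput_fraction(busbw, link_gbps=400)

print(f"bus bandwidth: {busbw / 1e9:.2f} GB/s")   # bus bandwidth: 43.75 GB/s
print(f"share of line rate: {share:.1%}")         # share of line rate: 87.5%
```

Tracking this fraction across message sizes and job scales, rather than the port label, is what separates a fabric that was validated from one that was merely purchased.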