Chapter 8.6

Congestion Control, Load Balancing & In-Network Compute

The fabric you bought in Chapter 8.5 only delivers its bandwidth if you win three fights at once — keeping the lossless mechanism from eating itself, spreading elephant flows across every path, and moving the reduction off the GPU into the switch — and losing any one of them turns a non-blocking Clos into a 50%-efficient one.

GOODPUT · POWER-BOUND

What you'll decide here

Whether you run a lossless fabric (InfiniBand credit-flow, or RoCE on PFC) and inherit head-of-line blocking and deadlock risk, or go lossy with packet trimming under Ultra Ethernet — the choice sets your entire congestion-control posture.
Whether you accept flow-level ECMP and its hash-collision tax on a handful of fat elephant flows, or commit to adaptive routing / per-packet spray and pay for NIC-side reordering to use it.
Whether you offload collectives into the switch (SHARP, NVLink-SHARP) to roughly double effective all-reduce bandwidth and reclaim GPU SMs — and accept the vendor lock-in and the radix/topology constraints that come with it.
How aggressively to tune the DCQCN parameter space (ECN marking thresholds, CNP cadence, PFC headroom) versus buying a fabric where the vendor has pre-tuned it — because an untuned RoCE fabric runs at a fraction of its rated throughput.
What congestion telemetry you instrument now (per-queue depth, ECN/CNP counters, PFC pause duration, RNR/retransmit) so the operations team in Part 10 and Part 14 can find the one straggler link before it taxes every step.

Chapter 8.5 sized a topology that, on paper, is non-blocking — enough bisection bandwidth that every GPU can talk to every other GPU at line rate. This chapter is about why that paper number is a lie until you solve three problems that the topology alone does not. AI traffic is the pathological worst case for a packet network: a handful of enormous, long-lived elephant flows per host, all synchronized to the same collective, all hitting the wire at the same microsecond, all converging on the same set of destinations during an all-reduce. There are no mice to fill the gaps and no statistical multiplexing to smooth the bursts. The fabric is either carrying a coordinated stampede or it is idle. That is the regime in which congestion control, load balancing, and in-network compute stop being tuning knobs and become the difference between the bandwidth you paid for and the bandwidth you get.

This chapter works through the lossless-vs-lossy fork (and the PFC pathologies that haunt the lossless path), the DCQCN parameter space that you must tune or buy pre-tuned, the load-balancing fork between flow-hashed ECMP and adaptive/per-packet spray (and the reordering bill the latter sends to the NIC), and in-network compute — SHARP and friends — the rare lever that improves goodput and cuts power at the same time. We close on the telemetry that makes all of it observable, because a congestion problem you cannot see is a goodput problem you cannot fix. → fundamentals and the goodput framing in Chapter 8.1; the protocol/transport choices these mechanisms ride on in Chapter 8.4.

Why AI traffic breaks ordinary congestion control

Datacenter congestion control was designed for the web-scale workload: millions of short flows, bursty but uncorrelated, where TCP's loss-and-backoff and a little ECN keep buffers shallow and tails short. AI training inverts every one of those assumptions. A single all-reduce on a 100k-GPU cluster is one logical operation decomposed into thousands of simultaneous, multi-gigabyte point-to-point transfers that all start on the same barrier and must all finish before the next step can begin. The collective runs at the speed of its slowest flow — this is the tail-latency tyranny of Chapter 8.1 made concrete at the packet layer. One congested link, one hash collision, one PFC pause that ripples the wrong way, and the straggler it creates stalls every GPU in the job on the next bulk-synchronous barrier.

That structure produces three distinct failure modes, and the rest of the chapter is organized around defending against each. Buffer pressure and loss from incast — many senders, one receiver, classic during the reduce phase — which the congestion-control loop must absorb. Path imbalance — a few elephant flows colliding on the same physical link while parallel links sit idle — which load balancing must spread. And endpoint overhead — every byte of the reduction traversing the GPU, the PCIe bus, the NIC, and back — which in-network compute can eliminate. Each has its own fork, and each fork has a downstream cost in goodput, power, or lock-in.

The lossless-vs-lossy fork and the PFC trap

RDMA — the zero-copy, kernel-bypass transport that makes GPU-to-GPU networking fast — was born assuming a lossless fabric. InfiniBand delivers losslessness natively with credit-based flow control: a sender never transmits unless the receiver has advertised buffer credits, so packets are never dropped for lack of room. Ethernet has no such mechanism in the base standard, so RoCEv2 borrows one: Priority Flow Control (PFC), a per-priority PAUSE frame that tells the upstream link to stop sending before a buffer overflows. PFC works — and it is the source of the nastiest pathologies in AI networking.

The first is head-of-line blocking. PFC pauses an entire traffic class, not a flow. When a downstream buffer fills, the PAUSE stops every flow in that priority on that link, including flows whose destinations are perfectly idle — innocent bystanders, the "victim flows." The second is the PFC pause storm and, in the worst case, deadlock: because PAUSE is hop-by-hop back-pressure, a congestion hotspot propagates upstream toward the sources, and in a topology with a cyclic buffer dependency (which fat-trees can form under certain failure or routing conditions) the pauses can form a permanent standstill where no switch can drain because every switch is waiting on another. A deadlocked fabric does not degrade — it stops. The defense is a PFC watchdog: a timer on every queue that, if a port stays paused beyond a threshold, drops the stuck traffic and logs it rather than letting the deadlock persist — trading a localized packet loss for fabric-wide survival.
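The watchdog behavior described above can be sketched as a per-queue timer. This is a minimal illustration, not any vendor's implementation: the class name `QueueWatchdog`, the polling model, and the 200 ms threshold used in the usage note are all assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueWatchdog:
    """Watchdog state for one (port, priority) queue (hypothetical model)."""
    threshold_ms: float                      # illustrative trip point, not a vendor default
    paused_since_ms: Optional[float] = None  # start of the current pause, if any
    tripped: bool = False                    # True while stuck traffic is being dropped

    def observe(self, now_ms: float, is_paused: bool) -> bool:
        """Feed one poll sample; return True while the watchdog is tripped."""
        if not is_paused:
            # Pause cleared: reset the timer and resume normal forwarding.
            self.paused_since_ms = None
            self.tripped = False
            return False
        if self.paused_since_ms is None:
            # First sample of a new pause: start timing it.
            self.paused_since_ms = now_ms
        if now_ms - self.paused_since_ms >= self.threshold_ms:
            # Paused for too long: drop this queue's traffic rather than hold it.
            self.tripped = True
        return self.tripped
```

With `QueueWatchdog(threshold_ms=200.0)`, samples at 0 ms and 100 ms of continuous pause leave the queue alone; the sample at 200 ms trips the watchdog, and the first unpaused sample resets it. A production watchdog would also log the event and rate-limit recovery, which this sketch omits.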

This is why the lossless-vs-lossy question is the master fork of the chapter. Stay lossless and you inherit PFC's pathologies and the operational burden of keeping them caged. Meta's production RoCE fabric famously ran PFC-only at 400G on its back-end, leaning on careful topology and path management rather than a tuned end-to-end loop — a deliberate bet that simplicity beat the DCQCN tuning treadmill at their scale. The Ultra Ethernet Consortium takes the opposite bet: design RDMA to tolerate loss. UEC 1.0 (published June 11, 2025) pairs packet spraying with packet trimming — a congested switch drops the payload but forwards a truncated header so the endpoint learns instantly which packet was lost and retransmits just that one — turning loss from a catastrophe into a fast, surgical signal. → transport semantics and the RoCE in-order penalty in Chapter 8.4; the deep-buffer-vs-shallow switch architecture that shapes how much PFC you need in Chapter 8.3.
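To make the trimming idea concrete, here is a toy model of a switch queue that forwards a header-only packet when the payload does not fit, and a receiver that derives its retransmit list from the trimmed arrivals. The `Packet` type, the `forward` function, and the byte-based queue limit are hypothetical names for illustration; the UEC specification defines the real wire behavior.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


def forward(packet: Packet, queue_bytes: int, queue_limit_bytes: int) -> Packet:
    """Forward the packet whole if it fits, otherwise forward only its header."""
    if queue_bytes + len(packet.payload) <= queue_limit_bytes:
        return packet
    # Congested: drop the payload but keep the sequence number visible.
    return replace(packet, payload=b"", trimmed=True)


def retransmit_requests(arrivals: List[Packet]) -> List[int]:
    """Sequence numbers the receiver should ask the sender to resend."""
    return [p.seq for p in arrivals if p.trimmed]
```

The receiver never has to wait for a timeout to infer loss: every trimmed header names exactly one packet to resend, which is what makes the signal both fast and selective.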

PFC is a loaded gun pointed at your whole fabric

The seductive thing about PFC is that it makes RoCE work on day one with almost no tuning — turn it on, the fabric is lossless, RDMA is happy. The trap is that the same mechanism that prevents loss also propagates congestion upstream toward the cores, so a single hot destination during an all-reduce can back-pressure its way across the spine and slow every job sharing the fabric, not just the one that caused it. Three rules are non-negotiable if you run PFC: (1) enable the PFC watchdog on every switch so a deadlock self-heals instead of freezing the cluster; (2) provision adequate PFC headroom — buffer reserved to absorb in-flight packets after a PAUSE is sent, sized to the link's bandwidth-delay product, or you drop anyway; (3) never let PFC be your primary congestion control — it is the lossless backstop, and DCQCN (next section) must do the real work of keeping you out of pause in the first place. A fabric that is constantly pausing is a fabric running on its emergency brake.
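As a rough guide to rule (2), headroom can be estimated from the bytes still in flight after a PAUSE is sent. The sketch below is a simplified model under stated assumptions: about 5 ns/m propagation in fiber, one maximum-size frame already being serialized in each direction, and an optional sender response time. The function name and parameters are illustrative; real switch vendors publish their own sizing formulas, which should take precedence.

```python
import math


def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int = 4096,
                       response_ns: float = 0.0, ns_per_m: float = 5.0) -> int:
    """Estimate per-queue PFC headroom in bytes (simplified model)."""
    # Time for the PAUSE to reach the sender plus time for the last bits to arrive.
    round_trip_ns = 2 * cable_m * ns_per_m + response_ns
    bytes_per_ns = link_gbps / 8.0  # 1 Gb/s is 0.125 bytes per nanosecond
    in_flight = round_trip_ns * bytes_per_ns
    # Add one frame mid-transmission on each side when the PAUSE is generated.
    return int(math.ceil(in_flight + 2 * mtu_bytes))
```

Under these assumptions, `pfc_headroom_bytes(400, 100)` gives 58,192 bytes for a 400G link over 100 m of fiber, and the in-flight term doubles at 800G. The point is that headroom scales with speed and distance, so a value that was safe at one generation is not automatically safe at the next.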

The DCQCN parameter space: tune it or buy it pre-tuned

DCQCN — Data Center Quantized Congestion Notification — is the end-to-end loop that keeps a RoCE fabric out of PFC pause and away from loss. The mechanism is elegant: a switch experiencing early congestion marks packets with ECN (sets the Congestion Experienced bit) before its buffer is full; the receiving NIC notices the mark and fires a Congestion Notification Packet (CNP) back to the sender; the sending NIC quantizes its injection rate down in response, then probes back up as the congestion clears. Done right, DCQCN holds buffer occupancy in a sweet spot — full enough to keep links busy, empty enough to never trigger PFC — and PFC becomes the backstop it was meant to be rather than the daily driver.
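The sender side of that loop can be sketched in a few lines. This loosely follows the published DCQCN algorithm description but is deliberately simplified: it omits the byte counter, the hyper-increase stage, and the timer plumbing, and the gain `g` and additive step `rate_ai_gbps` are illustrative values, not recommended settings.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point (sender NIC) rate state."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256,
                 rate_ai_gbps: float = 5.0):
        self.line_rate = line_rate_gbps
        self.g = g                      # gain for the congestion estimate alpha
        self.rate_ai = rate_ai_gbps     # additive-increase step
        self.current = line_rate_gbps   # rate the NIC is sending at now
        self.target = line_rate_gbps    # rate to recover toward
        self.alpha = 1.0                # congestion estimate, starts pessimistic

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut, and raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_alpha_timer(self) -> None:
        """No CNP during the alpha window: decay the congestion estimate."""
        self.alpha = (1 - self.g) * self.alpha

    def on_increase_timer(self, additive: bool = False) -> None:
        """Probe back up: halve the gap to target, optionally raising target."""
        if additive:
            self.target = min(self.target + self.rate_ai, self.line_rate)
        self.current = (self.target + self.current) / 2
```

Starting at 400 Gb/s, the first CNP halves the rate to 200 Gb/s; successive increase timers then recover to 300 and 350 Gb/s. Even this toy version shows why the parameters are coupled: `g`, the timer periods, and the step size together decide how hard the sender brakes and how fast it returns.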

Done wrong, it is a disaster, and "wrong" is the default. DCQCN exposes a large, coupled parameter space — the ECN marking thresholds (Kmin and Kmax: when to start marking, when to mark every packet), the marking probability slope, the CNP generation cadence, the rate-increase and rate-decrease step sizes and timers, all interacting with the per-queue PFC thresholds and headroom. Set the ECN threshold too high and you mark too late, the buffer overruns, PFC fires, and you are back in pause-storm territory. Set it too low and you throttle senders that were not actually congested, leaving bandwidth on the table and inflating job completion time. The parameters are also topology-, speed-, and workload-dependent: a tuning that is perfect at 400G on a two-tier Clos is wrong at 800G on a three-tier one. This is the single largest reason an untuned RoCE fabric runs at a fraction of its rated throughput, while a tuned one closes 80–90% of the gap to InfiniBand.
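The Kmin/Kmax relationship is a RED-style marking curve, sketched below. The function name, the kilobyte units, and the example thresholds in the usage note are assumptions for illustration; actual values depend on the switch, link speed, and workload.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float, kmax_kb: float,
                         pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth."""
    if queue_kb <= kmin_kb:
        return 0.0   # below Kmin: never mark
    if queue_kb >= kmax_kb:
        return 1.0   # at or above Kmax: mark every packet
    # Between the thresholds: ramp linearly from 0 up to pmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

With hypothetical settings of Kmin = 100 KB, Kmax = 400 KB, and pmax = 0.2, a queue at 250 KB marks about 10% of packets. Raising Kmin delays the first mark and lets the queue grow closer to the PFC threshold; lowering it throttles senders earlier. That is the trade-off the paragraph above describes.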

Congestion-control posture: the four real choices

Posture	Loss model	Primary mechanism	Load balancing	Tuning burden	Best fit
InfiniBand (credit flow)	Lossless by design	Link-level credits + adaptive routing	Adaptive routing native	Low — vendor-integrated	Largest synchronous training; lowest tail; accept single-vendor
RoCE + PFC + DCQCN (tuned)	Lossless (PFC backstop)	DCQCN (ECN→CNP→rate)	ECMP, or adaptive on capable silicon	High — continuous tuning	Ethernet shops wanting near-IB goodput with merchant silicon
RoCE + PFC-only	Lossless (PFC primary)	PFC back-pressure	Path-pinned / careful ECMP	Medium — topology discipline over loop tuning	Hyperscaler with deep network engineering (e.g. Meta 400G)
Ultra Ethernet (UET)	Lossy-tolerant (trim + retransmit)	UCCM + packet trimming	Packet spray + NIC reorder native	Low–medium — designed to self-balance	Open multi-vendor fabrics; 2026+ as silicon ships

These are the deployed postures as of 2026, not a feature checklist. "Tuning burden" is the recurring operational cost of keeping the fabric at its rated throughput. Latency/throughput figures are workload-dependent; see the key numbers later in this chapter.

The fork: tune DCQCN yourself, or buy a fabric that hides it

If you build a RoCE fabric from merchant switch silicon and standard NICs, you own the DCQCN parameter space, and owning it means a standing investment in network engineers who can read ECN/CNP counters, run incast tests, and re-tune every time you change speed, scale, or topology. The alternative is to buy a fabric where the vendor has done that work and validated it: NVIDIA Spectrum-X bundles its switches and SuperNICs with congestion control and adaptive routing pre-integrated and markets ~95% effective throughput with near-zero flow-collision loss out of the box; InfiniBand sidesteps DCQCN entirely with credit flow. The decision is a classic build-vs-buy on the network: self-tuned merchant RoCE is the lowest capex and the highest operational risk; an integrated fabric costs more per port and gives up some control, but it converts a recurring tuning liability into a line item. Pick based on whether you have the network-engineering bench to keep a self-tuned fabric tuned; most operators discover, expensively, that they do not.

Load balancing: ECMP's hash-collision tax vs adaptive routing and packet spray

A fat-tree gives you many equal-cost paths between any two endpoints. The question is how you spread traffic across them, and AI traffic makes the naive answer fail badly. The default is ECMP — Equal-Cost Multi-Path — which hashes each flow's 5-tuple to pick one of the available uplinks. For web traffic with thousands of small flows, the hash spreads load beautifully by the law of large numbers. For AI traffic with a handful of elephant flows per host, the law of large numbers does not apply: two fat flows can hash to the same physical link and collide, saturating it while a parallel link sits half-empty. The collision is not transient — these flows live for the whole training run — so the straggler it creates is permanent until something reshuffles the hash. On a synchronous collective, that one congested link taxes every step. This is the structural reason flow-level ECMP under-delivers on AI fabrics: there are too few flows to hash well, and they last too long to absorb a bad hash.

Two answers exist, and they trade simplicity against the cost of fixing in-order delivery. Adaptive routing keeps the flow concept but lets the switch re-pick the egress port dynamically based on real-time queue depth, steering a flow away from a congested link onto a quieter one — InfiniBand has done this for years, and Ethernet silicon increasingly supports it. Packet spraying goes further: it abandons per-flow pinning entirely and sprays the packets of a single flow across every available path, achieving near-perfect link utilization. The cost is that packets now arrive out of order, and classic RoCE's go-back-N retransmission treats out-of-order as loss — the in-order penalty that has historically made spray a non-starter on RoCE. The whole point of Ultra Ethernet's transport is to break that penalty: UET sprays at the switch and reorders at the NIC, making out-of-order delivery a feature rather than a fault, and exposing per-packet multipathing that was previously locked inside proprietary fabrics. → the in-order penalty and how each transport handles it in Chapter 8.4; the topology that defines how many paths exist to spread across in Chapter 8.5.
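The receive-side half of "spray at the switch, reorder at the NIC" can be pictured as a reorder buffer keyed by sequence number. This is a conceptual model only, with hypothetical names, and does not describe how any particular NIC implements reordering in silicon.

```python
from typing import Dict, List


class ReorderBuffer:
    """Releases payloads to the application strictly in sequence order."""

    def __init__(self) -> None:
        self.next_seq = 0                   # next sequence number to deliver
        self.pending: Dict[int, str] = {}   # early arrivals waiting for a gap to fill

    def receive(self, seq: int, payload: str) -> List[str]:
        """Accept one packet; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate of something already delivered or held
        self.pending[seq] = payload
        delivered = []
        # Drain every consecutive packet starting at the expected sequence.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

A packet that arrives early is simply held, not treated as a loss, which is the behavioral difference from go-back-N: out-of-order arrival costs a little buffer instead of a retransmission of everything after the gap.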

Deep dive: why elephant-flow ECMP collisions are worse than they look

The intuition that "a few collisions average out" is exactly wrong for synchronous training, and the math is worth seeing. Suppose each GPU server runs 8 flows during an all-reduce and the spine offers 16 equal-cost paths. With uniform random hashing, the probability that no two of the 8 flows share a path is a birthday-problem calculation: (16 × 15 × ... × 9) / 16^8 ≈ 0.12. Roughly 88% of the time, then, a server's flows include at least one collision, so collisions are the expected case rather than the exception. Each collision halves the effective bandwidth of the two flows sharing the link. And because the collective is bulk-synchronous, the whole operation completes at the speed of the slowest flow, so a single collision among thousands of flows across the cluster can set the pace for the entire step. The damage does not average out; it is dominated by the worst case.
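The birthday-problem figure for the 8-flow, 16-path example can be checked directly. The sketch assumes each flow hashes independently and uniformly across the paths, which is an idealization of real ECMP hashing; the function name is illustrative.

```python
import math


def p_no_collision(flows: int, paths: int) -> float:
    """Probability that all flows land on distinct paths under uniform hashing."""
    if flows > paths:
        return 0.0  # pigeonhole: some path must carry two flows
    # Ordered ways to pick distinct paths, divided by all possible assignments.
    return math.perm(paths, flows) / paths ** flows


p_clean = p_no_collision(8, 16)
p_collide = 1.0 - p_clean
```

For 8 flows over 16 paths, `p_clean` comes out near 0.12, so roughly 88% of servers would see at least one pair of flows sharing a link. Doubling the path count to 32 helps but does not make collisions rare, which is why the fix is structural rather than a matter of adding paths.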

This is why the industry moved from "better hashing" to "stop pinning flows." Adaptive routing fixes the worst case reactively — when a link congests, flows are steered off it — but it operates at flow granularity and reacts on a control-loop timescale. Packet spray fixes it structurally: with packets striped across all paths, no single path can become the bottleneck for a flow, and utilization approaches the theoretical bisection. The reason spray was not universal years ago is the reorder problem: RDMA's transport assumed in-order arrival, and reordering at line rate at 400/800G needs hardware support in the NIC. That hardware now exists — Spectrum-X SuperNICs and UEC-class NICs reorder in silicon — which is what finally made per-packet load balancing viable on Ethernet. The load-balancing story of 2026 is, in one line: the NIC got smart enough to let the switch stop caring about flows. → silicon in Chapter 8.3.

In-network compute: SHARP and collective offload

The previous two sections fight congestion by managing the traffic. In-network compute attacks the problem from the other side: send less traffic in the first place by doing the collective's arithmetic inside the switch. In a conventional all-reduce, every GPU's gradient buffer is shuffled across the network in a ring or tree, summed at each hop on the endpoints, and the result scattered back — the data crosses the fabric multiple times, and every reduction step burns GPU streaming-multiprocessor cycles and PCIe/NIC bandwidth that could have been training. SHARP — NVIDIA's Scalable Hierarchical Aggregation and Reduction Protocol — moves the reduction into the switch ASIC: GPUs send their data up an aggregation tree, the switches sum it in-network as it passes, and only the single reduced result comes back down. The data crosses the fabric roughly once instead of many times.

The consequences are unusually clean for an infrastructure decision because they cut in the right direction on multiple axes at once. SHARP roughly doubles effective all-reduce bandwidth versus a non-SHARP configuration, because the aggregation happens in transit rather than via repeated endpoint exchanges. It frees GPU SMs that were busy doing reduction arithmetic, handing them back to compute. And it cuts the bytes on the wire, which means less congestion to control and less power burned moving data: a rare lever that improves goodput and reduces energy per step simultaneously. SHARPv4 on the Quantum-X800 generation pushes 14.4 TFLOPS of in-network compute (a 9x jump over the prior generation) and supports FP8, All-Reduce, and MPI_Alltoall offload, integrated into NCCL 2.27 so frameworks get it transparently. The same idea lives inside the scale-up domain as NVLink-SHARP, doing in-switch reduction across the NVL72/NVL576 NVLink fabric. → canonical scale-up treatment in Chapter 8.2.
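One way to see where the "roughly double" figure comes from is to count the bytes each GPU must inject. The sketch below uses the standard ring all-reduce traffic volume and an idealized in-network model in which each GPU sends its buffer up the tree once. It ignores latency, protocol overhead, and aggregation-tree depth, so treat it as intuition rather than a performance prediction; the function names are illustrative.

```python
def ring_allreduce_bytes_sent(n_gpus: int, buffer_bytes: float) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes


def in_network_bytes_sent(buffer_bytes: float) -> float:
    """Bytes each GPU sends when the switch tree performs the reduction."""
    return float(buffer_bytes)


def traffic_ratio(n_gpus: int) -> float:
    """How many times more data the ring injects per GPU than in-network reduction."""
    return ring_allreduce_bytes_sent(n_gpus, 1.0) / in_network_bytes_sent(1.0)
```

At 8 GPUs the ring injects 1.75 times the buffer size per GPU, and the ratio approaches 2 as the group grows. In this simplified model the same link carries about half the bytes for the same result, which is consistent with the doubling of effective all-reduce bandwidth described above.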

The catch is the catch you expect: in-network compute is, today, deeply vendor-coupled. SHARP is an InfiniBand/Quantum (and NVLink) feature; using it commits you to that fabric and its switch radix, topology, and aggregation-tree constraints. It also imposes structure — the reduction tree must map onto the physical topology, which interacts with how you laid out scalable units in Chapter 8.5. The open-standards answer is younger: collective offload is on the roadmap for Ultra Ethernet and the broader merchant ecosystem, but as of 2026 the production, double-the-bandwidth, frees-the-SMs version of in-network reduction is a single-vendor capability. The decision is therefore a goodput-vs-lock-in trade made at fabric-selection time, not later — you cannot bolt SHARP onto a fabric that was not built for it.

Where the goodput goes — and which lever recovers it

Symptom in the cluster	Underlying cause	Wrong fix	Right lever
Fabric-wide slowdown traced to one hot port	PFC pause propagating upstream	Add buffer / add bandwidth	Tune DCQCN to stay out of pause; enable PFC watchdog
One spine link saturated, parallels idle	Elephant-flow ECMP hash collision	Re-hash and hope	Adaptive routing or per-packet spray + NIC reorder
All-reduce slower than link rate implies	Reduction overhead on GPUs and wire	Buy faster GPUs	In-network reduction (SHARP / NVLink-SHARP)
Sporadic stalls, retransmits, RNR NAKs	Out-of-order treated as loss (RoCE in-order penalty)	Disable multipathing	Loss-tolerant transport (UET) or path-pinning discipline
Cluster healthy but JCT creeping up	Untuned congestion loop leaving BW on the table	Accept it	Re-tune ECN thresholds; instrument telemetry (below)

Diagnostic map from symptom to mechanism. The point is that 'add more bandwidth' fixes none of these; each has a specific congestion-control, load-balancing, or in-network-compute remedy.

Key numbers

~95%: effective throughput on a tuned/integrated AI Ethernet fabric (Spectrum-X) with near-zero flow-collision loss. Source: NVIDIA (Spectrum-X / xAI Colossus); SemiAnalysis, 2025.

~2x: effective all-reduce bandwidth from in-network reduction (SHARP) vs non-SHARP config. Source: NVIDIA SHARP In-Network Computing, 2025.

14.4 TFLOPS: in-network compute per SHARPv4 (Quantum-X800), 9x prior gen; FP8, NCCL 2.27 integrated. Source: NVIDIA Quantum-X800 / SHARP, 2025.

Jun 11 2025: UEC 1.0 published, covering packet spray at switch, reorder at NIC, UET transport, packet trimming, native RDMA (560+ pp spec). Source: Ultra Ethernet Consortium 1.0, 2025.

~1–2 µs: InfiniBand fabric latency; tuned RoCEv2 ~1.5–2.5 µs; untuned RoCEv2 5–10 µs. Source: SemiAnalysis / NVIDIA, 2025.

80–90%: of the IB throughput gap that careful DCQCN tuning closes on RoCEv2; untuned leaves the rest on the table. Source: SemiAnalysis (100k H100 clusters), 2025.

400G: speed at which Meta ran production RoCE PFC-only on the back-end, leaning on topology over loop tuning. Source: Engineering at Meta (RoCE at scale), 2024.

8.4%: share of Llama-3 training interruptions attributed to network root causes (vs 30% faulty GPU). Source: Meta (Llama 3 paper), 2024.

Congestion telemetry: you cannot fix what you cannot see

Every mechanism above produces a counter, and the difference between a fabric that runs at 95% goodput and one that mysteriously runs at 70% is almost always whether the operations team is watching the right ones. The congestion-control loop emits ECN-marked packet counts and CNP rates — a rising CNP rate on a link is the earliest signal that DCQCN is being asked to throttle, and a leading indicator of a hotspot before it becomes a PFC pause. PFC itself emits pause-frame counts and cumulative pause duration per priority per port — any sustained pause time is a red flag, and pause storms are visible here long before they deadlock. The transport emits retransmit, RNR-NAK, and out-of-order counters that distinguish a real loss problem from a load-balancing reordering problem. And the switches expose per-queue buffer occupancy and microburst histograms that show whether you are riding the DCQCN sweet spot or skating the edge of overflow.
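Most of these counters are cumulative, so the useful signal is the rate of change between two samples. The sketch below turns two hypothetical port samples into a CNP rate and a paused-time fraction. The field names, the `PortSample` type, and the 1% alert threshold are assumptions for illustration, not a real switch or NIC telemetry schema.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class PortSample:
    t_s: float      # sample timestamp in seconds
    cnp_rx: int     # cumulative CNPs received on this port
    pause_us: int   # cumulative time spent paused, in microseconds


def port_health(prev: PortSample, curr: PortSample,
                max_pause_fraction: float = 0.01) -> Dict[str, Union[float, bool]]:
    """Convert two cumulative samples into rates and a simple pause alert."""
    dt = curr.t_s - prev.t_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    cnp_per_s = (curr.cnp_rx - prev.cnp_rx) / dt
    # Fraction of the interval this port spent paused by PFC.
    pause_fraction = (curr.pause_us - prev.pause_us) / (dt * 1e6)
    return {
        "cnp_per_s": cnp_per_s,
        "pause_fraction": pause_fraction,
        "pause_alert": pause_fraction > max_pause_fraction,
    }
```

For example, a port that gains 5,000 CNPs and 500,000 µs of pause time over a 10-second window reports 500 CNPs/s and a 5% paused fraction, which trips the alert. In practice the threshold would come from a baseline of the healthy fabric rather than a fixed constant.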

The hard part is correlation, not collection. A straggler flow shows up as elevated JCT at the framework, a congested queue at one switch, a CNP spike at one NIC, and a pause counter three hops upstream — and tying those together to the one link that needs attention requires time-aligned telemetry across the whole fabric, which is exactly why precise time synchronization (PTP/IEEE-1588, → Chapter 8.7) is a prerequisite for congestion diagnosis, not a separate concern. The operational practice of streaming these counters, baselining them, and alerting on deviation belongs to the observability stack: fleet observability and GPU/network health in Chapter 10.6, and DCIM-grade facility-plus-fabric telemetry correlation in Chapter 14.2. The design rule for this chapter: instrument the counters before first light, because the first large training run is the worst time to discover you have no visibility into why a step is slow.

Congestion control is a goodput-and-power lever, not just a latency lever

It is tempting to file this chapter under "latency tuning" and hand it to the network team in isolation. That undersells it on both threads this book tracks. On GOODPUT: an untuned or badly load-balanced fabric can run a non-blocking topology at half its rated effective bandwidth, which directly inflates job completion time and burns GPU-hours — the most expensive resource in the building — on waiting rather than computing. On POWER-BOUND: in-network compute cuts the bytes that cross the fabric, and bytes on the wire are watts; SHARP-style reduction lowers energy per training step at the same time it raises throughput. In a power-constrained facility, a mechanism that does more useful work per joule moved is not a nicety — it is capacity. Congestion control, done well, is one of the few levers that pays out on goodput and power at once. → the goodput economics in Chapter 14.1.

Deep dive: the standards trajectory — why 2026 is the inflection for open congestion control

For most of the RoCE era, the honest summary was: InfiniBand solves congestion and load balancing cleanly but locks you to one vendor; Ethernet is open but makes you fight PFC and tune DCQCN by hand, and per-packet spray was off-limits because the transport could not tolerate reordering. That is the asymmetry Ultra Ethernet was built to erase. UEC 1.0 (June 2025) standardizes, in an open spec, the three things that were previously proprietary advantages: switch-level packet spraying for load balancing, NIC-side reordering so spray does not punish you, and a modern congestion-control method (UCCM) plus packet trimming so loss becomes a fast signal instead of a stall. The bet is that merchant silicon implementing UEC closes the gap to InfiniBand on goodput while keeping Ethernet's multi-vendor economics.

The caveat is timing. A 560-page specification is not a shipping fabric. As of 2026 the UEC-native NICs and switches are arriving rather than ubiquitous, in-network collective offload is on the roadmap but not yet at SHARP's production maturity, and the operators who need congestion control solved today still choose between tuned/integrated RoCE (Spectrum-X) and InfiniBand. The strategic read: the load-balancing and loss-tolerance problems are converging on an open answer, but in-network compute — the goodput-and-power lever — remains the area where the proprietary fabrics lead, and likely will through the near term. Treat the fabric-selection decision as a snapshot of where the standard is now, with a re-decide point as UEC silicon matures. The consolidated roadmap for all of this lives in Chapter 16.2.

This chapter sits on top of the rest of Part 8. The goodput, tail-latency, and collective fundamentals that motivate every mechanism here are in Chapter 8.1; the scale-up fabric where NVLink-SHARP does in-switch reduction is Chapter 8.2; the switch ASICs, NICs, and SuperNICs that implement spray, reorder, ECN, and SHARP are Chapter 8.3; the protocol/transport fork (InfiniBand vs RoCE vs Spectrum-X vs UEC) and the in-order penalty are Chapter 8.4; and the topology that determines how many paths there are to load-balance across is Chapter 8.5. The time synchronization that makes congestion telemetry correlatable is Chapter 8.7. Downstream, fabric observability and health live in Chapter 10.6, facility-plus-fabric telemetry correlation in Chapter 14.2, the goodput economics that price all of this in Chapter 14.1, and the standards roadmap in Chapter 16.2.