Chapter 8.6
Congestion Control, Load Balancing & In-Network Compute
The fabric you bought in Chapter 8.5 only delivers its bandwidth if you win three fights at once — keeping the lossless mechanism from eating itself, spreading elephant flows across every path, and moving the reduction off the GPU into the switch — and losing any one of them turns a non-blocking Clos into a 50%-efficient one.
What you'll decide here
- Whether you run a lossless fabric (InfiniBand credit-flow, or RoCE on PFC) and inherit head-of-line blocking and deadlock risk, or go lossy with packet trimming under Ultra Ethernet — the choice sets your entire congestion-control posture.
- Whether you accept flow-level ECMP and its hash-collision tax on a handful of fat elephant flows, or commit to adaptive routing / per-packet spray and pay for NIC-side reordering to use it.
- Whether you offload collectives into the switch (SHARP, NVLink-SHARP) to roughly double effective all-reduce bandwidth and reclaim GPU SMs — and accept the vendor lock-in and the radix/topology constraints that come with it.
- How aggressively to tune the DCQCN parameter space (ECN marking thresholds, CNP cadence, PFC headroom) versus buying a fabric where the vendor has pre-tuned it — because an untuned RoCE fabric runs at a fraction of its rated throughput.
- What congestion telemetry you instrument now (per-queue depth, ECN/CNP counters, PFC pause duration, RNR/retransmit) so the operations team in Part 10 and Part 14 can find the one straggler link before it taxes every step.
Chapter 8.5 sized a topology that, on paper, is non-blocking — enough bisection bandwidth that every GPU can talk to every other GPU at line rate. This chapter is about why that paper number is a lie until you solve three problems that the topology alone does not. AI traffic is the pathological worst case for a packet network: a handful of enormous, long-lived elephant flows per host, all synchronized to the same collective, all hitting the wire at the same microsecond, all converging on the same set of destinations during an all-reduce. There are no mice to fill the gaps and no statistical multiplexing to smooth the bursts. The fabric is either carrying a coordinated stampede or it is idle. That is the regime in which congestion control, load balancing, and in-network compute stop being tuning knobs and become the difference between the bandwidth you paid for and the bandwidth you get.
This chapter works through the lossless-vs-lossy fork (and the PFC pathologies that haunt the lossless path), the DCQCN parameter space that you must tune or buy pre-tuned, the load-balancing fork between flow-hashed ECMP and adaptive/per-packet spray (and the reordering bill the latter sends to the NIC), and in-network compute — SHARP and friends — the rare lever that improves goodput and cuts power at the same time. We close on the telemetry that makes all of it observable, because a congestion problem you cannot see is a goodput problem you cannot fix. → fundamentals and the goodput framing in Chapter 8.1; the protocol/transport choices these mechanisms ride on in Chapter 8.4.
Why AI traffic breaks ordinary congestion control
Datacenter congestion control was designed for the web-scale workload: millions of short flows, bursty but uncorrelated, where TCP's loss-and-backoff and a little ECN keep buffers shallow and tails short. AI training inverts every one of those assumptions. A single all-reduce on a 100k-GPU cluster is one logical operation decomposed into thousands of simultaneous, multi-gigabyte point-to-point transfers that all start on the same barrier and must all finish before the next step can begin. The collective runs at the speed of its slowest flow — this is the tail-latency tyranny of Chapter 8.1 made concrete at the packet layer. One congested link, one hash collision, one PFC pause that ripples the wrong way, and the straggler it creates stalls every GPU in the job on the next bulk-synchronous barrier.
That structure produces three distinct failure modes, and the rest of the chapter is organized around defending against each:
- Buffer pressure and loss from incast (many senders, one receiver, classic during the reduce phase), which the congestion-control loop must absorb.
- Path imbalance (a few elephant flows colliding on the same physical link while parallel links sit idle), which load balancing must spread.
- Endpoint overhead (every byte of the reduction traversing the GPU, the PCIe bus, the NIC, and back), which in-network compute can eliminate.
Each has its own fork, and each fork has a downstream cost in goodput, power, or lock-in.
The lossless-vs-lossy fork and the PFC trap
RDMA — the zero-copy, kernel-bypass transport that makes GPU-to-GPU networking fast — was born assuming a lossless fabric. InfiniBand delivers losslessness natively with credit-based flow control: a sender never transmits unless the receiver has advertised buffer credits, so packets are never dropped for lack of room. Ethernet has no such mechanism in the base standard, so RoCEv2 borrows one: Priority Flow Control (PFC), a per-priority PAUSE frame that tells the upstream link to stop sending before a buffer overflows. PFC works — and it is the source of the nastiest pathologies in AI networking.
The first is head-of-line blocking. PFC pauses an entire traffic class, not a flow. When a downstream buffer fills, the PAUSE stops every flow in that priority on that link, including flows whose destinations are perfectly idle: the innocent bystanders, or "victim flows." The second is the PFC pause storm and, in the worst case, deadlock. Because PAUSE is hop-by-hop back-pressure, a congestion hotspot propagates upstream toward the sources. In a topology with a cyclic buffer dependency (which fat-trees can form under certain failure or routing conditions), the pauses can lock into a permanent standstill where no switch can drain because every switch is waiting on another. A deadlocked fabric does not degrade; it stops. The defense is a PFC watchdog: a timer on each lossless queue that, if the queue stays paused beyond a threshold, drops the stuck traffic and logs the event rather than letting the deadlock persist, trading a localized packet loss for fabric-wide survival.
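The watchdog behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: `QueueState`, `PfcWatchdog`, the polling model, and the threshold value are hypothetical names and numbers chosen for the example, and real switches run this timer in hardware per port and priority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueState:
    """Hypothetical per-queue state tracked by the watchdog."""
    paused_since: Optional[float] = None  # time the current pause began
    drops: int = 0                        # packets discarded by the watchdog


class PfcWatchdog:
    """Trips when a queue stays paused for threshold_s or longer."""

    def __init__(self, threshold_s: float):
        self.threshold_s = threshold_s

    def poll(self, q: QueueState, paused: bool, now: float, backlog: int) -> str:
        if not paused:
            q.paused_since = None          # pause cleared, reset the timer
            return "ok"
        if q.paused_since is None:
            q.paused_since = now           # first observation of this pause
            return "paused"
        if now - q.paused_since >= self.threshold_s:
            q.drops += backlog             # discard stuck traffic, keep a count
            q.paused_since = None
            return "tripped"
        return "paused"
```

Polling a queue that stays paused past the threshold returns `"tripped"` once and records the dropped backlog, which is the localized loss the fabric accepts in exchange for breaking the deadlock.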
This is why the lossless-vs-lossy question is the master fork of the chapter. Stay lossless and you inherit PFC's pathologies and the operational burden of keeping them caged. Meta's production RoCE fabric famously ran PFC-only at 400G on its back-end, leaning on careful topology and path management rather than a tuned end-to-end loop — a deliberate bet that simplicity beat the DCQCN tuning treadmill at their scale. The Ultra Ethernet Consortium takes the opposite bet: design RDMA to tolerate loss. UEC 1.0 (published June 11, 2025) pairs packet spraying with packet trimming — a congested switch drops the payload but forwards a truncated header so the endpoint learns instantly which packet was lost and retransmits just that one — turning loss from a catastrophe into a fast, surgical signal. → transport semantics and the RoCE in-order penalty in Chapter 8.4; the deep-buffer-vs-shallow switch architecture that shapes how much PFC you need in Chapter 8.3.
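The trimming idea can be shown with a toy model. The sketch below is a simplification under stated assumptions, not the UEC wire format: `Packet`, `enqueue`, and `receive` are hypothetical names, the queue is a plain list, and a real switch would typically forward the trimmed header ahead of queued payloads rather than behind them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


def enqueue(queue: List[Packet], pkt: Packet, capacity_bytes: int) -> None:
    """Queue the packet; if the payload does not fit, keep only the header."""
    used = sum(len(p.payload) for p in queue)
    if used + len(pkt.payload) > capacity_bytes:
        # Congested: drop the payload but forward a header-only stub.
        pkt = Packet(seq=pkt.seq, payload=b"", trimmed=True)
    queue.append(pkt)


def receive(queue: List[Packet]) -> Tuple[List[int], List[int]]:
    """Return (delivered sequence numbers, sequence numbers to retransmit)."""
    delivered, retransmit = [], []
    for p in queue:
        (retransmit if p.trimmed else delivered).append(p.seq)
    return delivered, retransmit
```

The receiver learns exactly which sequence number lost its payload, so the sender can retransmit that one packet instead of waiting for a timeout.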
The DCQCN parameter space: tune it or buy it pre-tuned
DCQCN — Data Center Quantized Congestion Notification — is the end-to-end loop that keeps a RoCE fabric out of PFC pause and away from loss. The mechanism is elegant: a switch experiencing early congestion marks packets with ECN (sets the Congestion Experienced bit) before its buffer is full; the receiving NIC notices the mark and fires a Congestion Notification Packet (CNP) back to the sender; the sending NIC quantizes its injection rate down in response, then probes back up as the congestion clears. Done right, DCQCN holds buffer occupancy in a sweet spot — full enough to keep links busy, empty enough to never trigger PFC — and PFC becomes the backstop it was meant to be rather than the daily driver.
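The sender side of that loop can be sketched as follows. This is a simplified model that follows the update rules as DCQCN is commonly described (a multiplicative decrease scaled by a congestion estimate `alpha`, then recovery toward a remembered target rate); the class name, the parameter values, and the explicit timer calls are assumptions for illustration, and byte counters and the hyper-increase stage are omitted.

```python
class DcqcnSender:
    """Toy DCQCN reaction point. Rates are in Gb/s; parameters are illustrative."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16,
                 rate_ai_gbps: float = 5.0):
        self.line_rate = line_rate_gbps
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered from before the cut
        self.alpha = 1.0           # running estimate of congestion severity
        self.g = g                 # gain used when updating alpha
        self.rate_ai = rate_ai_gbps

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut, and raise alpha."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the last alpha interval: decay the estimate."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self, additive: bool = False) -> None:
        """Recover halfway to the target; optionally probe above it first."""
        if additive:
            self.rt = min(self.rt + self.rate_ai, self.line_rate)
        self.rc = min((self.rt + self.rc) / 2, self.line_rate)
```

Starting at 400 Gb/s with `alpha = 1`, the first CNP halves the rate to 200 Gb/s and one recovery step brings it back to 300 Gb/s. The interaction between these step sizes and the timers is the coupling that makes the parameter space hard to tune.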
Done wrong, it is a disaster, and "wrong" is the default. DCQCN exposes a large, coupled parameter space — the ECN marking thresholds (Kmin and Kmax: when to start marking, when to mark every packet), the marking probability slope, the CNP generation cadence, the rate-increase and rate-decrease step sizes and timers, all interacting with the per-queue PFC thresholds and headroom. Set the ECN threshold too high and you mark too late, the buffer overruns, PFC fires, and you are back in pause-storm territory. Set it too low and you throttle senders that were not actually congested, leaving bandwidth on the table and inflating job completion time. The parameters are also topology-, speed-, and workload-dependent: a tuning that is perfect at 400G on a two-tier Clos is wrong at 800G on a three-tier one. This is the single largest reason an untuned RoCE fabric runs at a fraction of its rated throughput, while a tuned one closes 80–90% of the gap to InfiniBand.
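The Kmin/Kmax marking curve is easy to state in code. The function below is a sketch of the usual RED-style shape (no marking below Kmin, a linear ramp up to a maximum probability at Kmax, every packet marked above Kmax); the function name, the `pmax` parameter, and the kilobyte units are assumptions for illustration, and the values in the usage note are not tuning recommendations.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth."""
    if kmax_kb <= kmin_kb:
        raise ValueError("kmax_kb must be greater than kmin_kb")
    if queue_kb <= kmin_kb:
        return 0.0                      # below Kmin: never mark
    if queue_kb > kmax_kb:
        return 1.0                      # above Kmax: mark every packet
    # Between the thresholds: linear ramp from 0 up to pmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

With `kmin_kb=100`, `kmax_kb=400`, and `pmax=0.2`, a 250 KB queue is marked with probability 0.1. Raising both thresholds shifts the whole ramp to the right, which is the "mark too late" failure described above.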
| Posture | Loss model | Primary mechanism | Load balancing | Tuning burden | Best fit |
|---|---|---|---|---|---|
| InfiniBand (credit flow) | Lossless by design | Link-level credits + adaptive routing | Adaptive routing native | Low — vendor-integrated | Largest synchronous training; lowest tail; accept single-vendor |
| RoCE + PFC + DCQCN (tuned) | Lossless (PFC backstop) | DCQCN (ECN→CNP→rate) | ECMP, or adaptive on capable silicon | High — continuous tuning | Ethernet shops wanting near-IB goodput with merchant silicon |
| RoCE + PFC-only | Lossless (PFC primary) | PFC back-pressure | Path-pinned / careful ECMP | Medium — topology discipline over loop tuning | Hyperscaler with deep network engineering (e.g. Meta 400G) |
| Ultra Ethernet (UET) | Lossy-tolerant (trim + retransmit) | UCCM + packet trimming | Packet spray + NIC reorder native | Low–medium — designed to self-balance | Open multi-vendor fabrics; 2026+ as silicon ships |
Load balancing: ECMP's hash-collision tax vs adaptive routing and packet spray
A fat-tree gives you many equal-cost paths between any two endpoints. The question is how you spread traffic across them, and AI traffic makes the naive answer fail badly. The default is ECMP — Equal-Cost Multi-Path — which hashes each flow's 5-tuple to pick one of the available uplinks. For web traffic with thousands of small flows, the hash spreads load beautifully by the law of large numbers. For AI traffic with a handful of elephant flows per host, the law of large numbers does not apply: two fat flows can hash to the same physical link and collide, saturating it while a parallel link sits half-empty. The collision is not transient — these flows live for the whole training run — so the straggler it creates is permanent until something reshuffles the hash. On a synchronous collective, that one congested link taxes every step. This is the structural reason flow-level ECMP under-delivers on AI fabrics: there are too few flows to hash well, and they last too long to absorb a bad hash.
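A toy version of flow-hashed ECMP makes the pinning visible. This sketch uses CRC32 as a stand-in hash; real switch ASICs use their own hash functions and field selections, and `ecmp_pick`, `link_load`, and the example addresses are hypothetical. The destination port 4791 is the UDP port RoCEv2 uses.

```python
import zlib
from collections import Counter
from typing import Iterable, Tuple

# (src_ip, dst_ip, protocol, src_port, dst_port)
FiveTuple = Tuple[str, str, str, int, int]


def ecmp_pick(flow: FiveTuple, n_uplinks: int) -> int:
    """Hash the 5-tuple to one uplink; the same flow always gets the same link."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_uplinks


def link_load(flows: Iterable[FiveTuple], n_uplinks: int) -> Counter:
    """Count how many flows land on each uplink."""
    return Counter(ecmp_pick(f, n_uplinks) for f in flows)


# Eight long-lived flows from one host, differing only in UDP source port.
flows = [("10.0.0.1", "10.0.1.1", "udp", 49152 + i, 4791) for i in range(8)]
load = link_load(flows, n_uplinks=16)
```

Any uplink with a count above 1 in `load` carries colliding elephants for as long as those flows live, because nothing in the hash depends on how busy the link is.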
Two answers exist, and they trade simplicity against the cost of fixing in-order delivery. Adaptive routing keeps the flow concept but lets the switch re-pick the egress port dynamically based on real-time queue depth, steering a flow away from a congested link onto a quieter one — InfiniBand has done this for years, and Ethernet silicon increasingly supports it. Packet spraying goes further: it abandons per-flow pinning entirely and sprays the packets of a single flow across every available path, achieving near-perfect link utilization. The cost is that packets now arrive out of order, and classic RoCE's go-back-N retransmission treats out-of-order as loss — the in-order penalty that has historically made spray a non-starter on RoCE. The whole point of Ultra Ethernet's transport is to break that penalty: UET sprays at the switch and reorders at the NIC, making out-of-order delivery a feature rather than a fault, and exposing per-packet multipathing that was previously locked inside proprietary fabrics. → the in-order penalty and how each transport handles it in Chapter 8.4; the topology that defines how many paths exist to spread across in Chapter 8.5.
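The spray-then-reorder pattern can be sketched with a round-robin path chooser and a small reorder buffer. This is an illustration of the technique only: `spray` and `ReorderBuffer` are hypothetical names, real implementations choose paths using congestion feedback rather than strict round robin, and NIC hardware may place data directly into memory instead of buffering it.

```python
from typing import Any, Dict, List


def spray(num_packets: int, num_paths: int) -> List[int]:
    """Assign each packet of one flow to a path, round robin."""
    return [seq % num_paths for seq in range(num_packets)]


class ReorderBuffer:
    """Releases data in sequence order no matter how packets arrive."""

    def __init__(self) -> None:
        self.next_seq = 0
        self.pending: Dict[int, Any] = {}

    def receive(self, seq: int, data: Any) -> List[Any]:
        """Store the packet, then release every in-order packet now available."""
        self.pending[seq] = data
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

A packet that arrives early is simply held until the gap before it fills, so out-of-order arrival no longer triggers a go-back-N retransmission.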
Deep dive: why elephant-flow ECMP collisions are worse than they look
The intuition that "a few collisions average out" is exactly wrong for synchronous training, and the math is worth seeing. Suppose each GPU server runs 8 flows during an all-reduce and the spine offers 16 equal-cost paths. With uniform random hashing, the probability that no two of the 8 flows land on the same path is a birthday-problem calculation: (16 × 15 × … × 9) / 16^8 ≈ 0.12. Roughly 88% of servers therefore see at least one collision; collisions are not the exception, they are expected. Each collision halves the effective bandwidth of the two flows sharing the link, assuming both are trying to run at line rate. And because the collective is bulk-synchronous, the whole operation completes at the speed of the slowest flow, so a single collision among thousands of flows across the cluster can set the pace for the entire step. The damage does not average out; it is dominated by the worst case.
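The birthday-problem figure for 8 flows over 16 paths can be checked directly. The sketch assumes an idealized, uniformly random hash and independent servers; the function names and the 1,000-server example are hypothetical.

```python
from math import prod


def p_no_collision(flows: int, paths: int) -> float:
    """Probability that `flows` uniformly hashed flows all land on distinct paths."""
    if flows > paths:
        return 0.0  # pigeonhole: a collision is certain
    return prod((paths - i) / paths for i in range(flows))


def p_any_server_collides(flows: int, paths: int, servers: int) -> float:
    """Probability that at least one of `servers` independent servers collides."""
    return 1.0 - p_no_collision(flows, paths) ** servers


per_server = p_no_collision(8, 16)            # about 0.12
fleet = p_any_server_collides(8, 16, 1000)    # effectively 1.0
```

Under these assumptions, a collision-free hash across a thousand servers is vanishingly unlikely, which is why the slowest flow, not the average flow, sets the step time.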
This is why the industry moved from "better hashing" to "stop pinning flows." Adaptive routing fixes the worst case reactively — when a link congests, flows are steered off it — but it operates at flow granularity and reacts on a control-loop timescale. Packet spray fixes it structurally: with packets striped across all paths, no single path can become the bottleneck for a flow, and utilization approaches the theoretical bisection. The reason spray was not universal years ago is the reorder problem: RDMA's transport assumed in-order arrival, and reordering at line rate at 400/800G needs hardware support in the NIC. That hardware now exists — Spectrum-X SuperNICs and UEC-class NICs reorder in silicon — which is what finally made per-packet load balancing viable on Ethernet. The load-balancing story of 2026 is, in one line: the NIC got smart enough to let the switch stop caring about flows. → silicon in Chapter 8.3.
In-network compute: SHARP and collective offload
The previous two sections fight congestion by managing the traffic. In-network compute attacks the problem from the other side: send less traffic in the first place by doing the collective's arithmetic inside the switch. In a conventional all-reduce, every GPU's gradient buffer is shuffled across the network in a ring or tree, summed at each hop on the endpoints, and the result scattered back — the data crosses the fabric multiple times, and every reduction step burns GPU streaming-multiprocessor cycles and PCIe/NIC bandwidth that could have been training. SHARP — NVIDIA's Scalable Hierarchical Aggregation and Reduction Protocol — moves the reduction into the switch ASIC: GPUs send their data up an aggregation tree, the switches sum it in-network as it passes, and only the single reduced result comes back down. The data crosses the fabric roughly once instead of many times.
The consequences are unusually clean for an infrastructure decision because they cut in the right direction on multiple axes at once. SHARP roughly doubles effective all-reduce bandwidth versus a non-SHARP configuration, because the aggregation happens in transit rather than via repeated endpoint exchanges. It frees GPU SMs that were busy doing reduction arithmetic, handing them back to compute. And it cuts the bytes on the wire, which means less congestion to control and less power burned moving data: a rare lever that improves goodput and reduces energy per step simultaneously. SHARPv4 on the Quantum-X800 generation pushes 14.4 TFLOPS of in-network compute (a 9x jump over the prior generation) and supports FP8, All-Reduce, and MPI_Alltoall offload, integrated into NCCL 2.27 so frameworks get it transparently. The same idea lives inside the scale-up domain as NVLink-SHARP, doing in-switch reduction across the NVL72/NVL576 NVLink fabric. → canonical scale-up treatment in Chapter 8.2.
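The "roughly double" figure follows from a simple byte count. The sketch below compares the standard cost model for a bandwidth-optimal ring all-reduce, in which each GPU sends 2(N-1)/N times the message size, with an idealized in-network reduction in which each GPU sends its buffer up once. The function names are hypothetical, and the model ignores latency, protocol overhead, and switch limits, so it is an upper bound rather than a measured result.

```python
def ring_allreduce_bytes_sent(n_gpus: int, msg_bytes: float) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes


def in_network_bytes_sent(msg_bytes: float) -> float:
    """Bytes each GPU sends when the switches do the reduction: one pass up."""
    return float(msg_bytes)


def bandwidth_gain_bound(n_gpus: int) -> float:
    """Idealized ratio of per-GPU send volume, ring versus in-network."""
    return ring_allreduce_bytes_sent(n_gpus, 1.0) / in_network_bytes_sent(1.0)
```

For 8 GPUs the bound is 1.75, and it approaches 2 as the group grows, which matches the doubling claim for large collectives.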
The catch is the catch you expect: in-network compute is, today, deeply vendor-coupled. SHARP is an InfiniBand/Quantum (and NVLink) feature; using it commits you to that fabric and its switch radix, topology, and aggregation-tree constraints. It also imposes structure — the reduction tree must map onto the physical topology, which interacts with how you laid out scalable units in Chapter 8.5. The open-standards answer is younger: collective offload is on the roadmap for Ultra Ethernet and the broader merchant ecosystem, but as of 2026 the production, double-the-bandwidth, frees-the-SMs version of in-network reduction is a single-vendor capability. The decision is therefore a goodput-vs-lock-in trade made at fabric-selection time, not later — you cannot bolt SHARP onto a fabric that was not built for it.
| Symptom in the cluster | Underlying cause | Wrong fix | Right lever |
|---|---|---|---|
| Fabric-wide slowdown traced to one hot port | PFC pause propagating upstream | Add buffer / add bandwidth | Tune DCQCN to stay out of pause; enable PFC watchdog |
| One spine link saturated, parallels idle | Elephant-flow ECMP hash collision | Re-hash and hope | Adaptive routing or per-packet spray + NIC reorder |
| All-reduce slower than link rate implies | Reduction overhead on GPUs and wire | Buy faster GPUs | In-network reduction (SHARP / NVLink-SHARP) |
| Sporadic stalls, retransmits, RNR NAKs | Out-of-order treated as loss (RoCE in-order penalty) | Disable multipathing | Loss-tolerant transport (UET) or path-pinning discipline |
| Cluster healthy but JCT creeping up | Untuned congestion loop leaving BW on the table | Accept it | Re-tune ECN thresholds; instrument telemetry (below) |
Congestion telemetry: you cannot fix what you cannot see
Every mechanism above produces a counter, and the difference between a fabric that runs at 95% goodput and one that mysteriously runs at 70% is almost always whether the operations team is watching the right ones. The congestion-control loop emits ECN-marked packet counts and CNP rates — a rising CNP rate on a link is the earliest signal that DCQCN is being asked to throttle, and a leading indicator of a hotspot before it becomes a PFC pause. PFC itself emits pause-frame counts and cumulative pause duration per priority per port — any sustained pause time is a red flag, and pause storms are visible here long before they deadlock. The transport emits retransmit, RNR-NAK, and out-of-order counters that distinguish a real loss problem from a load-balancing reordering problem. And the switches expose per-queue buffer occupancy and microburst histograms that show whether you are riding the DCQCN sweet spot or skating the edge of overflow.
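Turning those counters into signals is mostly a matter of differencing two samples. The sketch below is a minimal example under stated assumptions: `PortSample`, its field names, and the limits are hypothetical and do not correspond to any specific switch or NIC counter names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortSample:
    """Hypothetical cumulative counters read from one port at time t (seconds)."""
    t: float
    cnp_count: int       # CNPs seen so far
    pfc_pause_us: int    # cumulative microseconds spent paused


def congestion_flags(prev: PortSample, cur: PortSample,
                     cnp_rate_limit: float, pause_frac_limit: float) -> List[str]:
    """Flag a port whose CNP rate or paused-time fraction exceeds its limit."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    cnp_rate = (cur.cnp_count - prev.cnp_count) / dt            # CNPs per second
    pause_frac = (cur.pfc_pause_us - prev.pfc_pause_us) / (dt * 1e6)
    flags = []
    if cnp_rate > cnp_rate_limit:
        flags.append("cnp_rate")
    if pause_frac > pause_frac_limit:
        flags.append("pfc_pause")
    return flags
```

In practice the limits would come from a baseline of the healthy fabric rather than fixed constants, so that the alert fires on deviation and not on an arbitrary number.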
The hard part is correlation, not collection. A straggler flow shows up as elevated JCT at the framework, a congested queue at one switch, a CNP spike at one NIC, and a pause counter three hops upstream — and tying those together to the one link that needs attention requires time-aligned telemetry across the whole fabric, which is exactly why precise time synchronization (PTP/IEEE-1588, → Chapter 8.7) is a prerequisite for congestion diagnosis, not a separate concern. The operational practice of streaming these counters, baselining them, and alerting on deviation belongs to the observability stack: fleet observability and GPU/network health in Chapter 10.6, and DCIM-grade facility-plus-fabric telemetry correlation in Chapter 14.2. The design rule for this chapter: instrument the counters before first light, because the first large training run is the worst time to discover you have no visibility into why a step is slow.
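The dependence on synchronized clocks is easy to see in a toy correlator. The sketch below groups events from different sources when their timestamps fall within a small window of the first event in the group; the function name, the event tuple layout, and the window size are assumptions for illustration, and it only works if every source stamps events against the same clock.

```python
from typing import List, Tuple

Event = Tuple[float, str, str]  # (timestamp_s, source, detail)


def correlate(events: List[Event], window_s: float) -> List[List[Event]]:
    """Group events that occur within window_s of the first event in a group."""
    groups: List[List[Event]] = []
    for ev in sorted(events):
        if groups and ev[0] - groups[-1][0][0] <= window_s:
            groups[-1].append(ev)      # close enough in time: same incident
        else:
            groups.append([ev])        # start a new incident
    return groups


events = [
    (10.002, "nic", "cnp spike"),
    (10.000, "switch-a", "queue deep"),
    (10.004, "switch-b", "pfc pause"),
    (42.000, "framework", "slow step"),
]
incidents = correlate(events, window_s=0.010)
```

If the clocks on those devices drifted by more than the window, the three related events would split into separate incidents and the link between them would be lost.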
Deep dive: the standards trajectory — why 2026 is the inflection for open congestion control
For most of the RoCE era, the honest summary was: InfiniBand solves congestion and load balancing cleanly but locks you to one vendor; Ethernet is open but makes you fight PFC and tune DCQCN by hand, and per-packet spray was off-limits because the transport could not tolerate reordering. That is the asymmetry Ultra Ethernet was built to erase. UEC 1.0 (June 2025) standardizes, in an open spec, the three things that were previously proprietary advantages: switch-level packet spraying for load balancing, NIC-side reordering so spray does not punish you, and a modern congestion-control method (UCCM) plus packet trimming so loss becomes a fast signal instead of a stall. The bet is that merchant silicon implementing UEC closes the gap to InfiniBand on goodput while keeping Ethernet's multi-vendor economics.
The caveat is timing. A 560-page specification is not a shipping fabric. As of 2026 the UEC-native NICs and switches are arriving rather than ubiquitous, in-network collective offload is on the roadmap but not yet at SHARP's production maturity, and the operators who need congestion control solved today still choose between tuned/integrated RoCE (Spectrum-X) and InfiniBand. The strategic read: the load-balancing and loss-tolerance problems are converging on an open answer, but in-network compute — the goodput-and-power lever — remains the area where the proprietary fabrics lead, and likely will through the near term. Treat the fabric-selection decision as a snapshot of where the standard is now, with a re-decide point as UEC silicon matures. The consolidated roadmap for all of this lives in Chapter 16.2.