
Chapter 8.5

Scale-Out Topology, Sizing & Oversubscription

Topology, switch radix, and oversubscription are not networking aesthetics — they are a single sizing decision that converts a GPU count into a bill of materials, a blocking factor, and an MFU ceiling, and the dominant mistake of the 2026 era is buying a training fabric for an inference workload (or the reverse).

GOODPUT · POWER-BOUND

What you'll decide here

Which topology family — rail-optimized fat-tree, rail-only, dragonfly, or OCS/torus — your scale and your dominant collective actually demand, because the choice sets switch count, optics count, worst-case hop diameter, and bisection bandwidth all at once.
The oversubscription ratio on the scale-out east-west path: 1:1 non-blocking for synchronous training versus 2:1–3:1 for loosely coupled inference. This is the single lever credited with moving roughly 31% of back-end cost, a contested and largely single-source figure.
The scalable-unit (SU) boundary you build in — the repeatable GPU/switch/optic block you replicate — because sizing in SUs, not in ad-hoc GPU counts, is what keeps the BOM, the cabling, and the fault domains tractable as the cluster ramps.
How your parallelism plan (TP/EP inside the scale-up domain, PP and DP across the scale-out fabric) maps onto the topology — get the mapping wrong and a non-blocking fabric still starves the collective it was built to serve.
Where you spend the reach budget: copper inside the rack and the SU, optics across the spine — because at 800G/1.6T the copper cliff, not the switch radix, often decides the physical topology.

By the time you reach this chapter you have already chosen a protocol and transport (Chapter 8.4) and the switch, NIC, and DPU silicon (Chapter 8.3). What remains is the decision that turns those components into an actual machine: how do you wire N GPUs together, and how much bandwidth do you refuse to pay for? Topology and oversubscription are the same decision viewed from two angles. Topology fixes the shape — how many tiers, how many switches, how long the longest path, how much bisection bandwidth the cluster can sustain when every GPU talks to every other GPU at once. Oversubscription fixes the price — how much of that theoretical bisection you actually build versus deliberately starve, because the workload will never use it.

Every topology choice propagates deterministically into a switch count, an optics count, a cable-length distribution, a power draw, and an effective bisection bandwidth that caps the collective throughput of the parallelism plan running on top of it. This chapter builds the topology taxonomy, derives the sizing math in scalable units, confronts the oversubscription fork that separates a training fabric from an inference fabric, and maps the parallelism strategy onto the wires. The back-end fabric is the second-largest line item after the accelerators themselves, and it is the one where a mis-sized decision is silently paid for every step of every job, forever, as lost goodput.

The topology taxonomy: four families and what each buys

Four topology families account for essentially every AI cluster shipping in 2026. They trade the same three currencies against each other: bisection bandwidth (how much the fabric can move when the traffic is all-to-all), switch and optic count (the cost and the failure surface), and worst-case hop diameter (latency and the blast radius of congestion). No family wins on all three; the right one is a function of scale and of which collective dominates your workload.

Fat-tree / Clos is the default and the safe choice. A folded multi-tier Clos with the right radix delivers full bisection bandwidth at every tier (genuinely non-blocking) and scales by adding tiers: a 64-port switch (32 down, 32 up) builds a three-tier fat-tree to as many as 65,536 endpoints (k³/4 for radix k); NVIDIA's 144-port Quantum-X800 reaches 10,368 NICs in just two tiers (k²/2), and tens of thousands of GPUs in three. The cost is switch count and optics: a non-blocking three-tier Clos is the most switch- and transceiver-heavy topology there is, and at 100k-GPU scale the optics alone become a top-tier line item and the leading hardware failure source.
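The endpoint ceilings quoted for a folded Clos follow from a simple counting rule, sketched here under the textbook assumption that every tier except the top splits its ports half down and half up, while the top tier faces all ports down. The function name is illustrative, not taken from any vendor tool.

```python
def fat_tree_max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints for a non-blocking folded Clos.

    Assumption: each non-top tier uses radix/2 ports down and radix/2 up,
    and the top tier uses all ports down, giving 2 * (radix / 2) ** tiers.
    """
    if radix < 2 or radix % 2 or tiers < 1:
        raise ValueError("radix must be even and >= 2, tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


# A 144-port switch in two tiers, matching the Quantum-X800 figure.
print(fat_tree_max_endpoints(144, 2))  # 10368
```

Real deployments usually land below this bound, because ports are reserved for storage, management, or future growth.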

Rail-optimized fat-tree is the AI-specific refinement, and it is the dominant production pattern for NVIDIA-class clusters. Instead of one flat Clos, the scale-out fabric is split into rails — typically 8, one per GPU NIC in the server — where GPU i in every node connects to leaf switch i. Same-rail traffic (the common case for rail-aligned collectives) traverses a single leaf hop; cross-rail traffic falls back to NVLink inside the scale-up domain. The payoff is that the all-reduce, which is rail-aligned by construction, mostly stays one hop away, dramatically cutting spine load. Rail-only pushes this further — it removes the spine tier entirely for workloads that never need cross-rail scale-out, trading generality for a radically cheaper fabric.
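The rail rule (GPU i in every node connects to leaf switch i) can be written as a small path classifier. This is a minimal sketch that assumes one scale-up domain per node and uses hypothetical names (`Gpu`, `scale_out_path`); it labels the path a flow would take and does not model routing or load balancing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    node: int   # server index (assumed to be one scale-up domain)
    local: int  # GPU index inside the server, which is also its rail


def scale_out_path(src: Gpu, dst: Gpu) -> str:
    """Classify the path between two GPUs in a rail-optimized fabric."""
    if src.node == dst.node:
        return "nvlink"      # stays inside the scale-up domain
    if src.local == dst.local:
        return "leaf"        # same rail: a single leaf hop
    return "cross-rail"      # NVLink to the right rail, or up via the spine


print(scale_out_path(Gpu(0, 3), Gpu(7, 3)))  # leaf
```

A collective whose peers all share the same local index produces only `leaf` paths, which is the property the rail layout is built to exploit.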

Dragonfly trades bisection for switch count. Groups of switches are fully connected internally and sparsely connected to each other, yielding at most three hops endpoint-to-endpoint with far fewer switches and optics than a fat-tree of equal size, but lower bisection bandwidth, which is why it favors HPC and latency-bound patterns over bandwidth-bound all-reduce.

OCS / torus is the hyperscaler-custom lane: Google's TPU pods use optical circuit switches to reconfigure the physical topology on demand (a 3D torus or twisted variant), decoupling logical from physical wiring and cutting worst-case diameter. Google's reconfigurable approach reduces a 1,024-chip worst case from 16 hops on a plain 3D torus to ~7. OCS removes a whole tier of packet switches and their optics, but it is a vertically-integrated play most operators cannot buy off the shelf.

Topology family → bisection, cost, diameter, fit

Topology	Bisection BW	Switch / optic count	Worst-case hop diameter	Best fit
Fat-tree / Clos (non-blocking)	Full (1:1) at every tier	Highest — most switches & optics	2 tiers ~3 hops; 3 tiers ~5 hops	General training; the safe default at any scale
Rail-optimized fat-tree	Full on-rail; spine carries cross-rail only	High, but spine tier shrinks vs flat Clos	On-rail 1 leaf hop; cross-rail via NVLink + spine	NVIDIA-class synchronous training (the 2026 default)
Rail-only (spineless)	Full on-rail; none cross-rail	Lowest fat-tree variant — no spine	1 hop on-rail; no cross-rail scale-out path	Workloads that never need cross-rail east-west
Dragonfly	Lower than fat-tree; sparse global links	Low — far fewer switches/optics	≤3 hops by design	HPC, latency-bound, cost-sensitive at scale
OCS / torus (reconfigurable)	High within a pod; topology-on-demand	Lowest packet-switch count; OCS replaces a tier	Reduced vs static torus (e.g. 16 → ~7)	Vertically-integrated XPU pods (TPU-class)

Synthesis of NVIDIA DGX SuperPOD network-fabric reference, Meta RoCE/DSF, Google TPU OCS, and Introl/SemiAnalysis topology analyses, 2025-2026. 'Bisection' is the all-to-all worst case the fabric sustains; 'hop diameter' is endpoint-to-endpoint worst case.

In this table the leftmost column is a choice and everything to its right is a consequence, the same structure as the archetype cascade in Chapter 1.1. Choose a non-blocking fat-tree and you have implicitly chosen the highest optics count and the largest failure surface in the building. Choose dragonfly and you have accepted lower bisection in exchange for far fewer transceivers — a rational trade only if your dominant collective is not bandwidth-bound. The families do not move independently, which is exactly why topology is a sizing decision and not a diagram.

Sizing in scalable units: the BOM math

The professional way to size a fabric is not to count GPUs — it is to count scalable units (SUs). An SU is the smallest repeatable block of GPUs, leaf switches, optics, and cabling that you replicate to grow the cluster. Sizing in SUs is what keeps the BOM, the cabling plan, the fault domains, and the procurement schedule tractable as a cluster ramps from one SU to dozens. In NVIDIA's GB200 SuperPOD reference, 1 SU = 8 rack-scale systems = 576 GPUs; a 16-SU cluster of 9,216 GPUs needs on the order of 512 leaf + 384 spine + 144 core switches for a non-blocking three-tier compute fabric, plus a separate storage fabric and a separate management fabric.

The math that turns an SU into a BOM is mechanical once the radix and the blocking factor are fixed. For a non-blocking fat-tree built from radix-k switches, each tier must provide as much uplink bandwidth as downlink bandwidth — so for every port facing the GPUs there is a port facing the next tier up. That 1:1 uplink:downlink rule is the entire definition of 'non-blocking,' and it is also the entire cost of it: half of every switch's ports, and the optics on them, exist solely to preserve bisection bandwidth. Relax the rule to 2:1 (two downlinks per uplink) and you delete a large fraction of the spine switches and their transceivers — which is precisely the oversubscription lever the next section turns.
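The uplink:downlink rule can be turned into a rough two-tier switch count. The sketch below is a simplification under stated assumptions: identical radix at leaf and spine, spine ports all facing down, and no attempt to balance each leaf evenly across spines. Function names are illustrative.

```python
import math


def leaf_port_split(radix: int, ratio: float) -> tuple[int, int]:
    """Split a leaf's ports into (downlinks, uplinks) at a down:up ratio.

    The split never exceeds the requested ratio, so a port may be left
    unused when the radix does not divide evenly.
    """
    if radix < 2 or ratio <= 0:
        raise ValueError("radix must be >= 2 and ratio must be positive")
    uplinks = max(1, math.floor(radix / (ratio + 1)))
    downlinks = min(radix - uplinks, math.floor(uplinks * ratio))
    return downlinks, uplinks


def two_tier_counts(endpoints: int, radix: int, ratio: float) -> dict:
    """Rough leaf/spine/uplink counts for a two-tier fabric."""
    down, up = leaf_port_split(radix, ratio)
    if down < 1:
        raise ValueError("ratio leaves no downlink ports on the leaf")
    leaves = math.ceil(endpoints / down)
    uplinks = leaves * up                # each uplink is one leaf-spine link
    spines = math.ceil(uplinks / radix)  # spine ports all face the leaves
    return {"leaves": leaves, "spines": spines, "uplinks": uplinks}


print(leaf_port_split(64, 1))  # (32, 32)
print(leaf_port_split(64, 2))  # (42, 21)
print(two_tier_counts(2048, 64, 1))
print(two_tier_counts(2048, 64, 2))
```

Comparing the two printed dictionaries shows both the leaf count and the uplink count falling at 2:1, and every removed uplink takes its optics with it.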

Three counting traps recur. First, the optics dominate the back-end BOM, not the switches — at 100k-GPU scale a cluster carries tens of thousands of miles of fiber and the transceiver line can rival the switch line. Second, the storage and management fabrics are not free: the compute fabric is the headline, but a production SU also carries a non-blocking storage fabric (e.g. 16x800G per SU) and an out-of-band management network (Chapter 8.7). Third, the reach budget, not the radix, often caps the physical topology: passive DAC dies around 1–2 m at 800G, active copper (AEC) stretches to 3–7 m, and beyond that you are paying for optics on every link — so the cable-length distribution implied by your floor plan is part of the sizing math, not an afterthought.
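The third trap, the reach budget, can be checked against a floor plan by bucketing link lengths into media. This sketch assumes the upper ends of the ranges quoted above as cutoffs (2 m for passive DAC, 7 m for AEC at 800G); both thresholds are parameters, since real limits depend on the cable and the SerDes.

```python
from collections import Counter


def link_medium(length_m: float, dac_max_m: float = 2.0,
                aec_max_m: float = 7.0) -> str:
    """Pick the cheapest medium that reaches; cutoffs are assumptions."""
    if length_m <= 0:
        raise ValueError("length must be positive")
    if length_m <= dac_max_m:
        return "DAC"
    if length_m <= aec_max_m:
        return "AEC"
    return "optics"


def medium_mix(lengths_m) -> Counter:
    """Count how many links of a cable plan fall into each medium."""
    return Counter(link_medium(length) for length in lengths_m)


# Hypothetical cable-length list, in meters, for illustration only.
print(medium_mix([0.5, 1.5, 3.0, 6.5, 12.0, 40.0]))
```

Running the same function over two candidate floor plans gives a quick view of how many links each plan pushes past copper reach.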

Deep dive: from one SU to a BOM — worked switch and optic counting

Take the NVIDIA GB200 reference and walk the count. One SU is 8 NVL72-class systems, 576 GPUs, each GPU with one ~400–800G scale-out SuperNIC plus separate CPU/storage NICs. For a non-blocking compute fabric, every GPU NIC needs a leaf-switch downlink, and every leaf needs an equal count of uplinks to spine, so leaf radix is split half-down, half-up. With a 64-port leaf you land 32 GPU links down and 32 spine links up per leaf; the SU's 576 NICs therefore consume 18 leaf switches' worth of downlinks (576 / 32), and the spine tier must absorb an equal uplink count. Replicate to 16 SUs and this simplified split gives 288 leaves; the SuperPOD reference lists a larger ~512 leaf / ~384 spine / ~144 core for its own port and rail configuration, with the core tier organized into groups so any leaf reaches any other in a bounded hop count.

Now the optics. Every spine and core link is beyond copper reach, so each is a transceiver pair — two optics per link, two ends. A non-blocking three-tier fabric at 9,216 GPUs therefore carries on the order of tens of thousands of transceivers before you add storage and management. Drop to a 2:1 oversubscribed spine and you delete roughly a third of the spine switches and the optics riding them — the ~31% back-end saving the oversubscription section quantifies. The lesson: the SU is the unit you reason in, the radix and blocking factor are the multipliers, and the optics count — not the switch count — is the number that decides the back-end budget. → physical-layer reach and optic taxonomy in Chapter 8.9; structured cabling in Chapter 8.10.
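The walk-through can be reproduced in a few lines. This is a sketch under the simplified assumptions of the text (64-port leaves split 32/32, non-blocking at every tier, every inter-switch link optical, GPU-to-leaf links not counted); it is not the SuperPOD reference's own bill of materials.

```python
import math

GPUS_PER_SU = 576  # one scale-out NIC per GPU, per the walk-through
LEAF_RADIX = 64    # the 64-port leaf used in the walk-through


def leaves_per_su(gpus: int = GPUS_PER_SU, radix: int = LEAF_RADIX) -> int:
    """Leaves needed when half of each leaf's ports face the GPUs."""
    return math.ceil(gpus / (radix // 2))


def optical_transceivers(num_sus: int, switch_tiers: int = 3) -> int:
    """Transceivers on inter-switch links in a non-blocking fabric.

    Assumption: each boundary between switch tiers carries one link per
    GPU NIC, and each optical link needs a transceiver at both ends.
    """
    links_per_boundary = num_sus * GPUS_PER_SU
    boundaries = switch_tiers - 1
    return 2 * links_per_boundary * boundaries


print(leaves_per_su())           # 18
print(optical_transceivers(16))  # 36864
```

The second result lands in the "tens of thousands" range before storage and management fabrics are added.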

The oversubscription fork: 1:1 for training, 2:1–3:1 for inference

This fork follows directly from coupling. Oversubscription is the ratio of downlink bandwidth to uplink bandwidth at a tier — a 1:1 fabric provisions as much bandwidth leaving a tier as entering it (non-blocking); a 3:1 fabric provisions a third, betting that the traffic offered will never simultaneously demand full bisection. The bet is sometimes safe and sometimes catastrophic, and which one it is depends entirely on the workload's coupling.

Synchronous training must be 1:1 non-blocking in the GPU east-west path. A pre-training step is dominated by all-reduce/all-gather collectives that move gradient and activation tensors across the entire data-parallel group on every iteration. Those collectives are bandwidth-bound and bursty in lockstep — every GPU hits the fabric at the same instant — so any oversubscription directly throttles the collective, stalls the slowest rank, and collapses model FLOPs utilization (MFU). There is no statistical multiplexing to exploit because the traffic is not statistically independent; it is synchronized by construction. Oversubscribe a training fabric and you do not save money, you buy a slower supercomputer that costs the same to run and finishes the job later.

Inference tolerates 2:1–3:1, and refusing to oversubscribe it is a waste. An inference request fits inside a node or a small scale-up domain (Chapter 8.2), so most traffic never touches the scale-out spine at all; what does is statistically independent across requests, so a tier can multiplex many bursty flows and rarely saturate. A 2:1 'optimized' fabric cuts back-end cost roughly 31% versus non-blocking (a contested, largely single-source figure), and that saving is real money you redeploy into geo-distribution, uptime, or more accelerators. Build a non-blocking fabric for an inference business and you have stranded a third of your back-end budget on bisection bandwidth that never carries a packet — the exact anti-pattern Chapter 1.1 names.

The nuance is that real clusters are tiered, and the ratio can vary by tier. The bottom tier (leaf-to-GPU) is almost always non-blocking even on inference fabrics: the leaf is cheap and head-of-line (HOL) blocking there is unforgiving. Oversubscription is applied up the tree, at the inter-zone and inter-building tiers where flows have aggregated and statistical multiplexing is strongest. Meta has publicly run as high as 7:1 on parts of a 24k-H100 fabric, and on training fabrics has run rack training switch (RTSW) uplinks deliberately under-subscribed (1:2) as a congestion mitigation. Both are evidence that the ratio is a per-tier knob, not a single global number.
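Because the ratio is set per tier, the worst case for traffic that climbs the whole tree is the product of the per-tier ratios. The sketch below uses that standard worst-case model and hypothetical helper names; it says nothing about average-case behavior, which depends on traffic locality.

```python
from fractions import Fraction
from math import prod


def tier_ratio(downlink_gbps: int, uplink_gbps: int) -> Fraction:
    """Oversubscription at one tier: downlink bandwidth over uplink."""
    return Fraction(downlink_gbps, uplink_gbps)


def end_to_end_ratio(tier_ratios) -> Fraction:
    """Worst-case ratio for a flow that crosses every tier."""
    return prod(tier_ratios, start=Fraction(1))


# Illustrative fabric: non-blocking leaf, 2:1 at the next tier up.
tiers = [tier_ratio(32 * 800, 32 * 800), tier_ratio(2, 1)]
total = end_to_end_ratio(tiers)
print(total, "-> usable share of line rate:", 1 / total)
```

A value below 1, such as `Fraction(1, 2)`, represents a deliberately under-subscribed tier.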

The fork that prices the fabric: coupling, not vibes, sets the ratio

Decide this before you order switches. If the dominant workload is synchronous training, the GPU east-west path is non-blocking — full stop — and you pay for every spine switch and every optic, because the all-reduce is bandwidth-bound and lockstep-synchronized and will expose any starvation as lost MFU on every step. If the dominant workload is inference or batch, you oversubscribe the upper tiers 2:1–3:1, bank the ~31% back-end saving, and redeploy it into the things inference actually values — proximity, uptime, and more serving capacity. The expensive errors are symmetric: a training fabric oversubscribed is a supercomputer that runs slow forever; an inference fabric built non-blocking is a third of the back-end budget set on fire. The ratio is a consequence of coupling, and coupling is a property of the workload, not a preference.

Oversubscription decision matrix by workload

Workload	Dominant traffic	East-west ratio	Rationale	Cost consequence
Pre-training (synchronous)	All-reduce / all-gather, lockstep	1:1 non-blocking	Bandwidth-bound, synchronized; no multiplexing to exploit	Full spine + optics cost; starvation = lost MFU forever
Post-training / RL	Async rollouts + bursty trainer sync	Tight 1:1 trainer; tolerant rollout tier	Disaggregated; only the trainer path is lockstep	Non-blocking where the gradient step lives; relax the rest
Online inference	Per-request, fits node/scale-up domain	2:1–3:1 upper tiers	Statistically independent flows multiplex well	~31% back-end saving vs non-blocking; redeploy to uptime
Batch inference	Throughput, queue-tolerant	3:1+ acceptable	No latency SLO; congestion is reschedulable	Cheapest fabric; spend the saving elsewhere

Ratios are east-west GPU-fabric guidance from Juniper AI-cluster design, SemiAnalysis Neocloud Playbook, and Meta RoCE-at-scale, 2025-2026. Lower tiers run non-blocking even when upper tiers are oversubscribed.

Mapping parallelism onto the topology

A non-blocking fabric is necessary but not sufficient — you also have to place the parallelism on it correctly, or you starve the collective the fabric was built to serve. The discipline is to match each parallelism dimension's bandwidth appetite to the tier that can feed it, working outward from the fattest pipe.

Tensor parallelism (TP) and expert parallelism (EP) belong inside the scale-up domain. These are the most bandwidth-hungry dimensions: TP shards a single layer across GPUs and exchanges activations every layer; EP routes tokens to experts with all-to-all on every MoE layer. They demand the ~1.8 TB/s/GPU of NVLink (NVL72 = 130 TB/s rack aggregate), which is roughly 18x a ~400G scale-out NIC when both are counted bidirectionally (400 Gb/s per direction is about 100 GB/s in both directions combined). The rule is unambiguous: fit TP and EP entirely inside the scale-up domain; the moment they spill onto the scale-out fabric, the collective runs at NIC bandwidth instead of NVLink bandwidth and throughput falls off a cliff. This is the deepest reason scale-up domain size is a workload decision: a 72-GPU NVL72 domain admits far wider EP (and thus larger MoE models served efficiently) than an 8-GPU HGX domain. → scale-up fabric in Chapter 8.2; wide-EP economics in Chapter 10.11.
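The ~18x gap is unit arithmetic, shown here under one reading that reproduces the figure: the 1.8 TB/s NVLink number is treated as bidirectional, and the 400 Gb/s NIC line rate is doubled to compare like with like. That pairing is an assumption of this sketch, not a statement from a vendor datasheet.

```python
NVLINK_BIDIR_GBYTES = 1800   # 1.8 TB/s per GPU, counted in both directions
NIC_LINE_RATE_GBITS = 400    # Gb/s per direction
GPUS_PER_NVL72 = 72

# Convert the NIC to bidirectional gigabytes per second.
nic_bidir_gbytes = 2 * NIC_LINE_RATE_GBITS / 8   # 100.0 GB/s

scale_up_advantage = NVLINK_BIDIR_GBYTES / nic_bidir_gbytes
rack_aggregate_gbytes = GPUS_PER_NVL72 * NVLINK_BIDIR_GBYTES

print(scale_up_advantage)       # 18.0
print(rack_aggregate_gbytes)    # 129600, i.e. about 130 TB/s
```

With an 800G NIC the same arithmetic gives roughly 9x, so the cliff narrows but does not disappear.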

Pipeline parallelism (PP) and data parallelism (DP) ride the scale-out fabric. PP passes activations between pipeline stages — point-to-point, latency-sensitive but not bandwidth-crushing — and DP runs the gradient all-reduce across replicas. These tolerate the scale-out fabric precisely because they are designed to: PP overlaps communication with computation, and DP's all-reduce, while bandwidth-heavy, is the collective the rail-optimized fat-tree is laid out to serve one hop at a time. Place DP groups along rails so the all-reduce stays on-rail, and you keep the spine carrying only what genuinely must cross rails.

The consequence of getting the mapping wrong is invisible in a topology diagram and brutal in production: a perfectly non-blocking fabric, fully provisioned, running a job whose TP dimension was placed across the scale-out path, with every layer's activation exchange crawling at NIC speed, MFU on the floor, and a network team insisting the fabric is healthy because it is. Topology-aware scheduling (Chapter 10.2) exists to prevent exactly this; the fabric and the scheduler are co-designed, not independent.
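A scheduler-side guard for this failure can be as small as one comparison. The sketch below assumes a deliberately simple model in which the TP and EP groups are orthogonal and must both sit inside one scale-up domain, so their product is the footprint; real frameworks can fold EP into other dimensions, so treat it as a first-pass check with a hypothetical function name.

```python
def fits_scale_up(tp: int, ep: int, domain_gpus: int) -> bool:
    """True if a TP x EP group fits inside one scale-up domain.

    Assumption: TP and EP are orthogonal, so the group spans tp * ep GPUs.
    """
    if min(tp, ep, domain_gpus) < 1:
        raise ValueError("tp, ep and domain_gpus must all be >= 1")
    return tp * ep <= domain_gpus


NVL72_DOMAIN = 72  # GPUs per NVL72 scale-up domain
HGX_DOMAIN = 8     # GPUs per HGX scale-up domain

print(fits_scale_up(tp=8, ep=8, domain_gpus=NVL72_DOMAIN))  # True
print(fits_scale_up(tp=8, ep=8, domain_gpus=HGX_DOMAIN))    # False
```

A plan that fails this check would push TP or EP traffic onto the scale-out NICs, which is the condition the paragraph above describes.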

Key figures

Figure	What it describes	Source (year)
1:1 vs 2:1–3:1	Training non-blocking vs inference oversubscription; 2:1 cuts back-end cost ~31% (contested); Meta ran 7:1 on 24k H100	SemiAnalysis Neocloud Playbook; Juniper AI-cluster design; Meta (2025)
576 GPUs / SU	NVIDIA GB200 SuperPOD scalable unit (8 systems); 9,216-GPU cluster ≈ 512 leaf + 384 spine + 144 core switches	NVIDIA DGX SuperPOD reference architecture (2025)
144 × 800G	Quantum-X800 radix (72 OSFP); 2-tier fat-tree to 10,368 NICs, tens of thousands of GPUs in 3 tiers	NVIDIA Quantum-X800 platform (2025)
102.4 Tbps	Broadcom Tomahawk 6 per-chip switching; 512×200G or 1024×100G SerDes; targets 100k–1M XPU	Broadcom Tomahawk 6 product release (2025)
~18×	NVLink5 scale-up (1.8 TB/s/GPU, 130 TB/s NVL72 rack) over ~400G scale-out NIC; keep TP/EP inside scale-up	NVIDIA NVLink; SemiAnalysis (2025)
~1–7 m	Copper reach budget at 800G: passive DAC ~1–2 m, active AEC ~3–7 m, optics beyond; caps physical topology	SemiAnalysis GB200 architecture (2025)
≤3 hops	Dragonfly worst-case diameter; Google OCS cuts a 1,024-chip torus worst case from 16 to ~7 hops	Introl topology synthesis; Google TPU OCS (2025)
~10.7%	Share of significant GPU-job failures Meta attributes to network config/topology-dependent congestion	Introl (Meta Llama-3 synthesis) (2025)

The cost / reach / blocking tradeoff, made explicit

Every topology decision resolves to a point in a three-axis space of cost, reach, and blocking, and the three are coupled, so you cannot optimize one without paying in the others. Blocking (the inverse of the bisection bandwidth you provision) is set by the oversubscription ratio: non-blocking is the most expensive and the only acceptable choice for synchronous training. Reach is set by physics: at 800G and 1.6T the copper cliff arrives within a couple of meters, so a floor plan that spreads racks out converts copper links into optical links and multiplies the transceiver count, which makes the building layout a fabric-cost variable. Cost is the resultant: it rises with switch tiers (more switches and links to cross), with non-blocking provisioning (more spine), and with optical reach (more transceivers).

The practical optimization is to spend copper where you can and optics where you must, and to push as much traffic as possible into the cheapest tier. Rail-optimized topologies are popular precisely because they keep the dominant collective one leaf hop away — minimizing the spine traffic that forces expensive optical links. Compact, dense floor plans keep more links inside the copper reach budget. And the SU boundary is drawn, in part, to keep intra-SU links on copper and reserve optics for SU-to-SU. The fabric that looks cheapest on a per-port basis is rarely cheapest in the building; the all-in number is dominated by how many of your links crossed the copper cliff into optics.

The optics are the budget — and the failure source

The single most under-modeled line in a back-end fabric is the optics, on both cost and reliability. At 100k-GPU scale a cluster carries on the order of 40,000+ miles of fiber and a transceiver count that can rival or exceed the switch budget; 1.6T OSFP modules run roughly $1,300–1,500 each, and they are simultaneously a top-tier capital line and the leading hardware failure source in the fabric. Two consequences follow. First, any topology decision that converts a copper link into an optical one — spreading racks out, choosing a flat Clos over rail-optimized, oversizing the spine — is a decision to buy more of the most failure-prone component in the building. Second, optics reliability is a goodput problem, not just a capex one: a flapping transceiver degrades a collective long before it hard-fails. Model optics as a recurring reliability load, not a one-time BOM line. → optics as a failure/cost driver in Chapter 8.9; CPO's power and serviceability case in Chapter 8.10.
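Treating optics as both a capital line and a recurring load can be sketched with two multiplications. The price range is the 1.6T OSFP figure quoted above; the transceiver count and the annual failure rate in the example are hypothetical placeholders, not measured values, and should be replaced with your own fleet data.

```python
OSFP_1P6T_PRICE_USD = (1300, 1500)  # per-module range quoted in the text


def optics_capex_range(transceivers: int,
                       price_range: tuple[int, int] = OSFP_1P6T_PRICE_USD
                       ) -> tuple[int, int]:
    """Low and high capital cost for a given transceiver count."""
    low, high = price_range
    return transceivers * low, transceivers * high


def expected_annual_replacements(transceivers: int,
                                 annual_failure_rate: float) -> float:
    """Expected module replacements per year at an assumed failure rate."""
    if not 0 <= annual_failure_rate <= 1:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    return transceivers * annual_failure_rate


# 10,000 modules is an illustrative count; 0.02 is a hypothetical rate.
print(optics_capex_range(10_000))  # (13000000, 15000000)
print(expected_annual_replacements(10_000, 0.02))
```

The second function only counts hard failures; flapping modules that degrade a collective without failing would need a separate goodput term.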

Deep dive: why rail-optimized beat flat Clos for synchronous training

A flat non-blocking Clos and a rail-optimized fat-tree can both deliver full bisection bandwidth, so why did rail-optimized become the 2026 default for NVIDIA-class training? The answer is traffic locality. In a synchronous training job the dominant collective — the data-parallel all-reduce — is rail-aligned: GPU i in every node communicates predominantly with GPU i in every other node. A rail-optimized topology wires exactly that pattern as a single leaf hop: GPU i across all nodes hangs off leaf i, so the all-reduce that defines the step never climbs to the spine. The spine exists only for the rarer cross-rail traffic, which means it can be smaller — fewer switches, fewer of the optical links that dominate cost and failures.

A flat Clos, by contrast, makes no assumption about traffic locality, so it must provision full bisection across every path, spreading the same all-reduce across the entire spine and paying for optical links the rail-optimized design avoids. The trade is generality for efficiency: rail-optimized is optimal for the rail-aligned collective and worse for arbitrary all-to-all (which is why EP-heavy MoE inference, with its all-to-all token routing, leans harder on the scale-up domain and complicates the rail assumption). The design lesson generalizes: the cheapest correct fabric is the one whose topology encodes the dominant collective's communication pattern — match the wires to the math, and the spine shrinks. → collective and traffic characterization in Chapter 8.1.

Anti-patterns

The recurring fabric-sizing mistakes all share a root cause: sizing from a component spec or a topology preference instead of from the workload's coupling and parallelism plan. Four are worth naming:

Non-blocking fabric for an inference business. Building 1:1 bisection for a workload whose requests fit inside a node strands ~31% of the back-end budget on bandwidth that never carries a packet. Oversubscribe the upper tiers and spend the saving on uptime and proximity.
Oversubscribing a training fabric. The symmetric error. A 2:1 ratio on a synchronous all-reduce throttles every step forever: you bought a permanently slower supercomputer, and the capex you saved is repaid many times over in lost MFU.
Parallelism spilling out of the scale-up domain. Placing TP or wide EP across the ~400G scale-out path instead of the ~1.8 TB/s NVLink domain runs the most bandwidth-hungry collective ~18x too slow on a fabric the network team will swear is healthy. Fit TP/EP inside the scale-up domain or accept the collapse.
Sizing in GPUs instead of SUs. Ad-hoc GPU counts produce ad-hoc cabling, irregular fault domains, and a BOM that does not replicate. Draw the SU boundary first; size, cable, and procure in SUs.
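Sizing in SUs, the fix for the last anti-pattern, amounts to rounding a GPU target up to a whole number of scalable units. A minimal sketch, assuming the 576-GPU SU of the GB200 SuperPOD reference as the default and a hypothetical function name:

```python
import math


def size_in_sus(target_gpus: int, gpus_per_su: int = 576) -> tuple[int, int]:
    """Return (SU count, GPUs actually built) for a GPU target."""
    if target_gpus < 1 or gpus_per_su < 1:
        raise ValueError("target_gpus and gpus_per_su must be >= 1")
    sus = math.ceil(target_gpus / gpus_per_su)
    return sus, sus * gpus_per_su


print(size_in_sus(9_216))   # (16, 9216)
print(size_in_sus(10_000))  # (18, 10368)
```

The gap between the target and the built count is the price of keeping cabling, fault domains, and procurement in repeatable blocks.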

Traffic characterization and the collectives this chapter sizes for are in Chapter 8.1; the scale-up domain that TP/EP must fit inside is Chapter 8.2; the switch, NIC, and DPU silicon whose radix sets the topology is Chapter 8.3; the protocol and transport layered on this topology is Chapter 8.4. Congestion control and load balancing that keep an oversubscribed fabric honest are Chapter 8.6; the management/OOB fabric and timing are Chapter 8.7; scale-across to multi-campus is Chapter 8.8; the physical-layer reach budget and optic taxonomy that cap the topology are Chapter 8.9, and the fiber plant and structured cabling are Chapter 8.10. The oversubscription fork inherits its logic from the archetype cascade in Chapter 1.1; topology-aware placement of the parallelism plan is scheduled in Chapter 10.2 and exploited for wide-EP inference in Chapter 10.11; fabric commissioning and bisection-bandwidth validation are an acceptance gate in Chapter 13.7.