Part 8

Networking, Fabrics & Optics

10 chapters

Network Fundamentals & AI Traffic Characterization

In a synchronous AI cluster the network is not plumbing between computers — it is the part of the computer that decides how fast every other part is allowed to run, because the whole job moves at the speed of its slowest collective and stalls on its longest tail.
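The "longest tail" point can be made concrete with a small back-of-envelope model. The sketch below is illustrative only: it assumes each rank independently hits a slow event with probability `p_tail` per step, and that any slow event stretches the whole step by a fixed `tail_multiplier`. Both parameters and the function names are hypothetical, not taken from the guide.

```python
def prob_step_hits_tail(num_ranks: int, p_tail: float) -> float:
    """Probability that at least one rank is slow in a given step.

    Assumes independent per-rank tail events (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_tail) ** num_ranks


def expected_step_time(num_ranks: int, base_s: float,
                       p_tail: float, tail_multiplier: float) -> float:
    """Expected duration of one synchronous step.

    The step ends only when the slowest rank ends, so a single slow rank
    stretches the step for every rank (hypothetical two-level model:
    a step is either 'normal' or 'slow').
    """
    p = prob_step_hits_tail(num_ranks, p_tail)
    return base_s * ((1.0 - p) + p * tail_multiplier)


if __name__ == "__main__":
    # Same per-rank tail probability, growing job size.
    for n in (8, 1024, 16384):
        t = expected_step_time(n, base_s=1.0, p_tail=0.001, tail_multiplier=5.0)
        print(f"{n:>6} ranks: expected step time {t:.2f} s")
```

Under these assumptions a per-rank tail that is negligible at 8 ranks dominates the step time at thousands of ranks, which is why tail latency rather than average latency governs a synchronous job.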

Scale-Up Fabric (Intra-Node / Intra-Rack)

The scale-up domain is the set of accelerators that talk at memory speed over a switched fabric an order of magnitude faster than the back-end network. It is the single hardware boundary that sets your tensor- and expert-parallel ceilings, your MoE inference economics, and your largest blast radius. How big you can make it, and over what medium, is now the most contested decision in AI networking.

Network Silicon: Switch ASICs, NICs & DPUs

The switch ASIC, the NIC, and the DPU are three silicon decisions that set the ceiling on every fabric you can build on top of them — pick the SerDes generation, the buffer architecture, and the offload engine before you draw a topology, because the topology is downstream of all three.

Scale-Out Fabric: Protocols, Standards & Transport

The scale-out protocol you commit to is not a wiring detail — it is a 3–5 year bet on who supplies your switches, how much of your link rate survives as goodput under collective load, and whether you can ever leave the vendor whose congestion-control firmware your training run silently depends on.

Scale-Out Topology, Sizing & Oversubscription

Topology, switch radix, and oversubscription are not networking aesthetics — they are a single sizing decision that converts a GPU count into a bill of materials, a blocking factor, and an MFU ceiling, and the dominant mistake of the 2026 era is buying a training fabric for an inference workload (or the reverse).
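As a rough illustration of how a GPU count turns into a bill of materials, the sketch below sizes a two-tier leaf-spine fabric. It is a simplified model under stated assumptions, not the guide's sizing method: one network port per GPU, identical switches of a single radix, all ports at the same speed, and no rail or pod structure. The function name `size_two_tier` and its parameters are hypothetical.

```python
import math


def size_two_tier(num_gpus: int, radix: int, oversub: float = 1.0) -> dict:
    """Size a two-tier leaf-spine fabric (simplified, assumption-laden model).

    oversub is the target ratio of leaf downlinks to leaf uplinks;
    1.0 means non-blocking at the leaf.
    """
    if num_gpus < 1 or radix < 2 or oversub < 1.0:
        raise ValueError("need num_gpus >= 1, radix >= 2, oversub >= 1.0")

    # Split each leaf's ports between uplinks and downlinks.
    up = max(1, int(radix // (1 + oversub)))
    down = radix - up

    leaves = math.ceil(num_gpus / down)
    if leaves > radix:
        # Every leaf needs a path to every spine, so a spine's radix
        # caps the leaf count; beyond that a third tier is required.
        raise ValueError("does not fit in two tiers at this radix")

    leaf_spine_links = leaves * up
    spines = math.ceil(leaf_spine_links / radix)

    return {
        "leaves": leaves,
        "spines": spines,
        "gpu_links": num_gpus,
        "leaf_spine_links": leaf_spine_links,
        "actual_oversub": down / up,
    }


if __name__ == "__main__":
    print(size_two_tier(2048, radix=64))               # non-blocking
    print(size_two_tier(3072, radix=64, oversub=3.0))  # 3:1 oversubscribed
```

With a 64-port radix, this model tops out at 2,048 GPUs non-blocking in two tiers; accepting 3:1 oversubscription fits 3,072 GPUs on the same leaf count with half the spines, which is the kind of trade the blocking factor expresses.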

Congestion Control, Load Balancing & In-Network Compute

The fabric you bought in Chapter 8.5 only delivers its bandwidth if you win three fights at once — keeping the lossless mechanism from eating itself, spreading elephant flows across every path, and moving the reduction off the GPU into the switch — and losing any one of them turns a non-blocking Clos into a 50%-efficient one.

Management, Out-of-Band Fabric & PTP/IEEE-1588 Timing

An AI cluster has two fabrics that almost nobody scopes on purpose: the out-of-band network that lets you reach a wedged node when the data plane is gone, and the timing plane that gives every telemetry record, RoCE counter, and training step a common clock. When either is missing you discover it during the incident, which is the most expensive possible time to learn the lesson.

Scale-Across: Multi-Campus & Cross-Region Fabric (DCI for Distributed Training)

When no single campus can be energized fast enough to hold the run, the fork is no longer how to wire one building but how to split a synchronous job across buildings — and that decision propagates into your optics, your transport layer, your training algorithm, and your failure model all at once.

Physical-Layer & Interconnect Taxonomy

Every link in an AI cluster is a bet that a given reach can be crossed by the cheapest, lowest-power medium that still closes the link budget at the required bit-error rate. At 224G per lane that bet is lost a metre sooner than it was one SerDes generation ago, and across a cluster that lost metre moves a stadium's worth of links onto pluggable optics.

CPO, Fiber Plant & Structured Cabling

The link is a pluggable transceiver, the fiber plant is a structured-cabling system, and the next decade's question is whether the optics stay on the faceplate or move onto the package — a fork that trades the worst power problem in the building against the worst serviceability problem in the building.