Chapter 8.9
Physical-Layer & Interconnect Taxonomy
Every link in an AI cluster is a bet that a given reach can be crossed by the cheapest, lowest-power medium that still closes the budget at the required bit-error rate — and at 224G-per-lane that bet is now lost a metre sooner, an entire stadium's worth of pluggable optics earlier, than it was one SerDes generation ago.
What you'll decide here
- Where the copper-to-optics boundary falls for your generation — passive DAC, active copper (ACC/AEC), or optics — given that the reach each medium can close roughly halves every time the per-lane SerDes rate doubles.
- Which DSP placement (fully-retimed pluggable vs LPO vs LRO vs half-retimed) you certify, trading 4-10 W per port of optics power against host-board co-design burden and multi-vendor interoperability risk.
- Which pluggable form factor (QSFP-DD, OSFP, OSFP-XD) and thermal lid (flat-top vs finned) you standardize on — a cage decision that is mechanically locked into the switch and the rack airflow for the life of the hall.
- How much of your reliability and goodput budget you spend on optics — because at fleet scale the binding metric is not hard-failure MTBF but mean-time-to-flap, and a single unstable link can burn tens of thousands of GPU-hours before it is found.
- Which FEC and link-margin (TDECQ) discipline you commission to, since the difference between a clean cluster and one losing 20-30% of productivity to flapping links is set at acceptance test, not in production.
The fabric chapters that precede this one — scale-up (8.2), scale-out protocols and topology (8.4, 8.5), the switch and NIC silicon (8.3) — all assume that bits arrive at the far end of every link intact. This chapter is about the wire and the glass that make that assumption true, and about the single physical decision that propagates through all of them: for a given reach at a given lane rate, what is the cheapest, lowest-power medium that still closes the link budget? That question has exactly three answers — a passive copper cable, an electrically-boosted copper cable, or an optical link — and the boundary between them is not a matter of taste. It is set by signal-integrity physics, and it moves toward the operator's wallet every time the SerDes ladder steps up.
The taxonomy matters because the answer is a portfolio, not a single choice. A 2026-vintage AI hall ships copper inside the rack and optics between racks — "copper inside, optics outside" — and the exact location of that inside/outside seam is the most consequential physical-layer fork in the building. Put it too far toward copper and you stretch DACs past their margin and eat link flaps. Put it too far toward optics and you burn megawatts of transceiver DSP power and a fortune in pluggables on spans copper could have carried for free. This chapter walks the link budget, the SerDes ladder, the FEC/TDECQ margin discipline, the interconnect medium taxonomy, the form factors, and the DSP-placement spectrum — then treats optics as what it actually is at scale: the dominant single failure and operating-cost driver in the network.
The link budget: the equation every interconnect decision answers
Strip away the marketing and a link is an accounting problem. The transmitter launches a signal with some power and some quality; the channel — copper trace, cable, connectors, or fiber and its splices — subtracts loss and adds distortion; the receiver needs the signal to arrive above a threshold with enough eye-opening that the forward-error-correction engine can clean up what remains and hit the target bit-error rate. The link budget is that ledger: launch power and equalization on one side, insertion loss and impairments in the middle, receiver sensitivity and FEC coding gain on the other. Every entry in the interconnect taxonomy is just a different way of balancing that ledger over a different distance.
Two physical facts dominate the ledger in 2026. First, modern datacom links are PAM4, not the older NRZ. PAM4 encodes two bits per symbol using four amplitude levels, doubling throughput at a given symbol rate — but it cuts the vertical eye opening to roughly a third of NRZ's, so the same channel loss costs far more margin and the link leans hard on equalization and FEC to survive. Second, copper loss rises steeply with frequency. Doubling the per-lane rate (50G to 100G to 200G to 400G) roughly doubles the Nyquist frequency, and the channel attenuates the higher band so aggressively that the reach a passive cable can carry halves with each step. That single fact — reach halving as rate doubles — is the engine that drives the entire copper-to-optics migration this chapter describes.
The SerDes ladder and the FEC/TDECQ margin discipline
The SerDes (serializer/deserializer) is the gating spec for the entire fabric — it sets the per-lane electrical rate that switch ASICs, NICs, and optics must all agree on. The ladder runs 50G to 100G to 200G (224G signaling) to the 400G/lane (448G signaling) era now demonstrated at OFC 2026. Aggregate port speeds are just lane-count multiples: 800G is 8x100G or 4x200G; 1.6T is 8x200G; 3.2T arrives as 8x400G. Because every link partner has to speak the same lane rate, a SerDes generation transition is a synchronized, fleet-wide event — you cannot mix a 224G switch port with a 112G transceiver and expect them to negotiate.
What keeps a PAM4 link alive across the budget is forward error correction. Datacom standardized on Reed-Solomon FEC (RS(544,514), the "KP4" code), which delivers enough coding gain that a raw pre-FEC bit-error rate around 1e-4 is corrected to a post-FEC rate of 1e-12 or better. FEC is not free: it adds latency and a small bandwidth tax, and — critically — it has a cliff. Below its correction threshold the link is pristine; cross the threshold and errors leak through catastrophically. That makes margin the real engineering quantity. For optical transmitters the margin metric is TDECQ (Transmitter Dispersion Eye Closure Quaternary) — a single number, in dB, capturing how much the transmitter's eye is already closed before the channel touches it. A transmitter that ships with marginal TDECQ has no headroom for connector contamination, temperature drift, or aging, and it is the link that starts flapping six months into the deployment. TDECQ discipline at acceptance test is therefore a goodput decision dressed up as a metrology one.
Deep dive: PAM4, TDECQ, and why temperature drift is a commissioning trap
PAM4's three stacked eyes are not equal, and they are not stable. Transmitter non-linearity squeezes the upper and lower eyes differently, and the laser/driver behavior shifts with junction temperature — so an optic that passes TDECQ on a cool bench at the factory can drift out of margin once it is jammed into a finned cage behind a hot switch ASIC pulling air that has already crossed a 130 kW rack. This is why TDECQ is specified with temperature, and why field-relevant qualification matters more than a datasheet number. The failure mode is insidious: the link does not fail at install, it fails statistically. As the eye closes under thermal stress, pre-FEC errors creep up toward the RS-FEC threshold; for most of the day the code corrects them and the link looks healthy on the dashboard, but at the thermal peak it tips over the cliff and the link flaps. The operator sees an intermittent, load-correlated, time-of-day-correlated fault that is maddening to chase — and the root cause was a transmitter commissioned with two-tenths of a dB too little TDECQ margin. The discipline that prevents it is boring and non-negotiable: clean every connector (a single fiber contamination event can cost more margin than the entire fiber span), qualify optics at the temperature they will actually run, and reject marginal TDECQ at acceptance rather than discovering it in production. The connector-level EMC and bonding that protect copper links from a different class of impairment are canonical in Chapter 4.11.
The interconnect medium taxonomy: copper inside, optics outside
With the budget and the ladder established, the taxonomy itself is a reach ladder. As required distance grows, you climb from passive copper, to electrically-assisted copper, to optics — paying more power and more dollars at each rung in exchange for more reach. The governing principle every hyperscaler converges on is copper inside, optics outside: keep the densest, highest-bandwidth, latency-critical links (scale-up inside the rack) on copper, and spend optics only where copper physically cannot reach (scale-out between racks and across the row).
Passive DAC (direct-attach copper) is a plain shielded twinax cable with no active electronics — zero added power, lowest cost, lowest latency, and effectively zero failure rate, but the shortest reach (~1-2 m at 224G). ACC/AEC (active copper / active electrical cable) adds a small redriver or retimer IC in the connector to boost and re-clean the signal, buying a few more metres (~3-7 m) at the cost of a couple of watts and a small active-component failure population. AOC (active optical cable) is a captive, factory-terminated optical assembly — optics on both ends, fiber in between, sold as a single fixed-length cable; it reaches tens to hundreds of metres but you cannot break it out or re-length it. Pluggable transceivers are the fully modular option: a transceiver in a cage on each end with field-installed structured fiber between them, the most flexible and the basis of every large scale-out fabric — and the most power-hungry and failure-prone rung on the ladder.
| Medium | Added electronics | Typical reach (224G) | Added power / port | Failure profile | Where it lives in an AI cluster |
|---|---|---|---|---|---|
| Passive DAC | None (plain twinax) | ~1-2 m | ~0 W | Effectively nil; no active parts | Scale-up inside the rack; shortest in-row hops |
| ACC / AEC (active copper) | Redriver or retimer in connector | ~3-7 m | ~1-3 W | Small active-component population | Rack-to-rack where copper still reaches; top-of-rack to spine within a row |
| AOC (active optical cable) | Optics fixed on both ends | Tens to ~100+ m | ~10-17 W (pair) | Optics failure rate, but factory-terminated | Fixed point-to-point runs where length is known |
| Pluggable optics + structured fiber | Transceiver each end, field fiber between | 100 m to multi-km | ~14-17 W (DSP pluggable, 800G) | Dominant network failure/flap source | Scale-out leaf-spine and beyond; the modular default |
Pluggable form factors: the cage you are mechanically married to
Once you are on pluggable optics, the form factor is a long-lived commitment because the cage — the receptacle soldered to the switch faceplate — is fixed for the life of the box. The three that matter in 2026 are QSFP-DD, OSFP, and OSFP-XD. QSFP-DD (8 lanes, the evolution of the ubiquitous QSFP) is the volume Ethernet workhorse and is backward-compatible with older QSFP optics in the same cage. OSFP is slightly larger, carries 8 lanes, and — crucially — was designed with a larger thermal envelope and an integrated heat-management path, which is why it dominates the highest-power 800G/1.6T AI deployments where the older QSFP-DD cage struggles to dissipate the module's heat. OSFP-XD doubles to 16 lanes, enabling 1.6T today (16x100G) and a path to 3.2T (16x200G) — the form factor GB200-class fabrics reach for when a single port must carry two NVLink/NIC widths.
The thermal lid is its own fork. Finned-top OSFP modules carry their own heatsink and rely on the switch's own airflow — standard in air-cooled halls. Flat-top OSFP modules omit the integral fin so the cage can press against a cold plate or a liquid-cooled riding heatsink — the variant you need when the switch itself is liquid-cooled. The trap is that finned and flat-top modules are not freely interchangeable in a given cage and airflow design, so the form-factor-plus-lid choice is locked in lockstep with the cooling architecture of the hall. Choose a liquid-cooled switch and you have implicitly chosen flat-top optics and a cold-plate-compatible cage; retrofitting that later is a faceplate redesign, not a swap.
The DSP-placement spectrum: retimed, LRO, LPO, half-retimed
Inside a conventional pluggable transceiver sits a power-hungry DSP that re-times and re-equalizes the signal on both the host (electrical) and line (optical) sides. That DSP is the single biggest power and cost item in the module — and the largest lever the industry has to cut optics power, which is why a spectrum of architectures has emerged that progressively removes it. The decision is a four-way fork, and it trades optics power against host-board co-design burden and interoperability risk.
Fully-retimed pluggable is the incumbent: a full DSP retimes both sides, isolating the optic from the host channel. It is the most robust, the most plug-and-play, the most multi-vendor-interoperable — and the most power-hungry, at roughly 14-17 W for an 800G module. LRO (linear receive optics, or linear-drive on the receive side) removes the DSP from one direction, keeping a retimer where it helps most and going linear where the channel is friendly — a pragmatic middle that recovers several watts with modest co-design. LPO (linear pluggable optics) removes the DSP entirely, relying on the host ASIC's own SerDes equalization to drive the optic directly; it cuts 800G optics power to roughly 7-8.5 W — close to halving it — but demands tight host-board co-design and end-to-end link qualification, because the optic no longer cleans up the host channel. Half-retimed architectures split the difference per-direction. The further you move toward LPO, the more power and cost you save and the more you take on the burden of owning the whole electrical path and validating it against a narrower vendor set.
| Architecture | DSP placement | ~Power / 800G | Host co-design burden | Interop / serviceability | Best fit |
|---|---|---|---|---|---|
| Fully-retimed pluggable | Full DSP, both sides | ~14-17 W | Low — optic isolates host | Highest; multi-vendor, field-swappable | Default scale-out; multi-vendor fabrics |
| Half-retimed | Retimer one direction only | ~10-13 W | Moderate | High | Power-sensitive links with friendly channel one way |
| LRO (linear receive) | Retimer kept where needed; linear elsewhere | ~9-11 W | Moderate | High | Pragmatic power recovery without full LPO risk |
| LPO (linear pluggable) | No DSP; host SerDes drives optic | ~7-8.5 W | High — own the whole electrical path | Narrower vendor set; still field-swappable | Hyperscale fabrics that control host design |
| CPO (co-packaged, for contrast) | Optics co-packaged with switch ASIC | under ~6 W | Highest — switch/optics co-design | Lowest; not field-serviceable (→ 8.10) | Highest-radix switches at the power wall |
Optics as the dominant failure and operating-cost driver
Here is the part the datasheets bury. At the scale of a modern AI cluster, optics are not a component — they are a population, and populations fail statistically. The reliability metric that matters is the mean-time-to-flap — the far shorter interval between transient link drops that re-establish in seconds — not the headline hard-failure MTBF, which at millions of hours per module reads comfortably on paper. The arithmetic is brutal. A cluster wiring up a few hundred thousand GPUs runs on the order of millions of optical links; at fleet scale that translates to a hard optic failure every few days but a link flap roughly every 48 seconds. Each flap is a momentary loss of a link, and in a synchronous training job a single flapping link on the wrong path can stall a collective and waste the work of every GPU waiting on it.
The goodput consequence is first-order, not marginal. With 32,000 GPUs and a one-hour checkpoint interval, one unstable link that forces a restart wastes up to 32,000 GPU-hours of compute. Operators report losing 20-30% of cluster productivity to persistent link instability before the offending transceivers are isolated and replaced — a number that dwarfs the capex of the optics themselves. This is why a generation of "flap-resistant" transceivers (e.g. Credo's ZeroFlap family) emerged in 2025-2026 marketing better mean-time-to-flap rather than better MTBF: the industry finally measured the metric that was actually costing money. The Llama 3 herd's published failure breakdown is the canonical at-scale dataset here — network and cable issues were a meaningful slice of the 466 interruptions over 54 days, and the run still achieved over 90% effective training time only because of disciplined fault isolation and fast restart. The checkpoint math that converts a flap into wasted GPU-hours is derived in Chapter 9.4; the goodput-vs-availability reframing is in Chapter 12.2.
Deep dive: commissioning and operating the optical plant as a reliability program
Treating optics as a managed population changes what "done" means at every lifecycle stage. At factory and incoming inspection, qualify TDECQ at the temperature the optic will actually run, not bench-cool; sample-test incoming lots rather than trusting the supplier's coupon. At install, connector cleanliness is the single highest-leverage discipline — a contaminated ferrule can erase more link margin than the entire fiber run, and contamination is cumulative across every mate/de-mate, so inspect-and-clean before every connection is a hard rule, not a courtesy. At commissioning, run the fabric under synthetic all-reduce load at thermal soak and read pre-FEC BER on every link; reject any link without comfortable FEC margin even if it "passes" — a link at the FEC threshold at acceptance is a flap in production. In operation, the win is telemetry-driven: trend per-link pre-FEC BER and flag the optic whose error rate is climbing toward the cliff before it stalls a job, so the swap is scheduled into a maintenance window rather than triggered by a 32,000-GPU-hour restart. Keep a hot-spare optic population sized to the statistical replacement rate, and design the cabling so a single module is swappable without disturbing its neighbors. None of this is exotic; all of it is the difference between a cluster at 96% goodput and one at 75%. The structured-cabling, polarity, and loss-budget engineering that the field-fiber half of this depends on is canonical in Chapter 8.10.
Putting the taxonomy to work: drawing your boundaries
The physical layer is a sequence of three boundaries, each derived from the layer above it, and each costly if drawn wrong. First, the copper-to-optics boundary: set by your SerDes generation's reach physics. Keep scale-up and the shortest scale-out hops on passive copper (zero power, zero flaps); cross to active copper only for the reach copper still closes; reserve optics for spans copper physically cannot carry. Draw it toward optics and you burn ~20 kW/rack of needless DSP power; draw it toward copper and you eat link flaps. Second, the DSP-placement boundary: how much optics power you cut versus how much host co-design and interop risk you absorb. Fully-retimed is safe and power-hungry; LPO halves the power but makes you the owner of the whole electrical path. Third, the form-factor boundary: locked in lockstep with the switch's cooling — flat-top optics for liquid-cooled switches, finned for air — and unwound only by a faceplate redesign.
Wrapping all three is the reliability discipline, because the physical layer is the layer where goodput is silently won or lost. A fabric that looks identical on a topology diagram can deliver 96% or 75% effective training time depending entirely on whether its optics were commissioned with TDECQ margin and operated as a managed population. The taxonomy tells you which medium to use; the commissioning discipline tells you whether it will actually carry your bits for the life of the cluster.