
Chapter 8.9

Physical-Layer & Interconnect Taxonomy

Every link in an AI cluster is a bet that a given reach can be crossed by the cheapest, lowest-power medium that still closes the budget at the required bit-error rate. At 224G per lane, copper loses that bet about a metre sooner than it did one SerDes generation ago, and at fleet scale every link that crosses over is one more pluggable optic to power and keep alive.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Where the copper-to-optics boundary falls for your generation — passive DAC, active copper (ACC/AEC), or optics — given that the reach each medium can close roughly halves every time the per-lane SerDes rate doubles.
Which DSP placement (fully-retimed pluggable vs LPO vs LRO vs half-retimed) you certify, trading 4-10 W per port of optics power against host-board co-design burden and multi-vendor interoperability risk.
Which pluggable form factor (QSFP-DD, OSFP, OSFP-XD) and thermal lid (flat-top vs finned) you standardize on — a cage decision that is mechanically locked into the switch and the rack airflow for the life of the hall.
How much of your reliability and goodput budget you spend on optics — because at fleet scale the binding metric is not hard-failure MTBF but mean-time-to-flap, and a single unstable link can burn tens of thousands of GPU-hours before it is found.
Which FEC and link-margin (TDECQ) discipline you commission to, since the difference between a clean cluster and one losing 20-30% of productivity to flapping links is set at acceptance test, not in production.

The fabric chapters that precede this one — scale-up (8.2), scale-out protocols and topology (8.4, 8.5), the switch and NIC silicon (8.3) — all assume that bits arrive at the far end of every link intact. This chapter is about the wire and the glass that make that assumption true, and about the single physical decision that propagates through all of them: for a given reach at a given lane rate, what is the cheapest, lowest-power medium that still closes the link budget? That question has exactly three answers — a passive copper cable, an electrically-boosted copper cable, or an optical link — and the boundary between them is not a matter of taste. It is set by signal-integrity physics, and it moves toward the operator's wallet every time the SerDes ladder steps up.

The taxonomy matters because the answer is a portfolio, not a single choice. A 2026-vintage AI hall ships copper inside the rack and optics between racks — "copper inside, optics outside" — and the exact location of that inside/outside seam is the most consequential physical-layer fork in the building. Put it too far toward copper and you stretch DACs past their margin and eat link flaps. Put it too far toward optics and you burn megawatts of transceiver DSP power and a fortune in pluggables on spans copper could have carried for free. This chapter walks the link budget, the SerDes ladder, the FEC/TDECQ margin discipline, the interconnect medium taxonomy, the form factors, and the DSP-placement spectrum — then treats optics as what it actually is at scale: the dominant single failure and operating-cost driver in the network.

The link budget: the equation every interconnect decision answers

Strip away the marketing and a link is an accounting problem. The transmitter launches a signal with some power and some quality; the channel — copper trace, cable, connectors, or fiber and its splices — subtracts loss and adds distortion; the receiver needs the signal to arrive above a threshold with enough eye-opening that the forward-error-correction engine can clean up what remains and hit the target bit-error rate. The link budget is that ledger: launch power and equalization on one side, insertion loss and impairments in the middle, receiver sensitivity and FEC coding gain on the other. Every entry in the interconnect taxonomy is just a different way of balancing that ledger over a different distance.
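The ledger described above can be sketched as a few lines of arithmetic in dB. This is a minimal illustration, not a standards-conformant budget: the class and field names are hypothetical, the example numbers are invented for the demonstration, and FEC coding gain is assumed to be folded into the receiver sensitivity (that is, sensitivity is quoted at the pre-FEC error rate the code can still correct).

```python
from dataclasses import dataclass


@dataclass
class LinkBudget:
    """Hypothetical link-budget ledger; all values in dB or dBm."""
    launch_power_dbm: float       # transmitter output power
    insertion_loss_db: float      # cable or fiber loss plus connectors and splices
    impairment_penalty_db: float  # allowance for dispersion, reflections, crosstalk
    rx_sensitivity_dbm: float     # minimum power at which FEC still reaches target BER

    def received_power_dbm(self) -> float:
        # Power arriving at the receiver after the channel subtracts its loss.
        return self.launch_power_dbm - self.insertion_loss_db

    def margin_db(self) -> float:
        # What is left once impairments and the receiver threshold are paid for.
        return (self.received_power_dbm()
                - self.impairment_penalty_db
                - self.rx_sensitivity_dbm)

    def closes(self, required_margin_db: float = 0.0) -> bool:
        # The link "closes the budget" when margin meets the required headroom.
        return self.margin_db() >= required_margin_db


# Illustrative numbers only (assumed, not taken from any datasheet).
clean = LinkBudget(launch_power_dbm=0.0, insertion_loss_db=3.0,
                   impairment_penalty_db=1.5, rx_sensitivity_dbm=-6.0)
# Same link with 2 dB of extra loss, e.g. from a contaminated connector.
dirty = LinkBudget(launch_power_dbm=0.0, insertion_loss_db=5.0,
                   impairment_penalty_db=1.5, rx_sensitivity_dbm=-6.0)

print(clean.margin_db(), clean.closes())
print(dirty.margin_db(), dirty.closes())
```

In this toy case the clean link has 1.5 dB of margin and the contaminated one is 0.5 dB short, which is the kind of swing the commissioning discipline later in the chapter is meant to catch.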

Two physical facts dominate the ledger in 2026. First, modern datacom links are PAM4, not the older NRZ. PAM4 encodes two bits per symbol using four amplitude levels, doubling throughput at a given symbol rate — but it cuts the vertical eye opening to roughly a third of NRZ's, so the same channel loss costs far more margin and the link leans hard on equalization and FEC to survive. Second, copper loss rises steeply with frequency. Doubling the per-lane rate (50G to 100G to 200G to 400G) roughly doubles the Nyquist frequency, and the channel attenuates the higher band so aggressively that the reach a passive cable can carry halves with each step. That single fact — reach halving as rate doubles — is the engine that drives the entire copper-to-optics migration this chapter describes.
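The "reach halves as rate doubles" rule of thumb is an inverse-proportional scaling, and can be sketched as below. The function name is hypothetical, and the reference point (1.5 m at a nominal 200G lane) is an assumed midpoint of the ~1-2 m range quoted in this chapter; real reach depends on cable gauge, connectors, and host channel loss, so treat the output as a first-pass estimate only.

```python
def passive_reach_m(lane_rate_gbps: float,
                    ref_rate_gbps: float = 200.0,
                    ref_reach_m: float = 1.5) -> float:
    """Rule-of-thumb passive copper reach: halves each time the lane rate doubles.

    ref_rate_gbps and ref_reach_m are assumed anchor values, not measured data.
    """
    if lane_rate_gbps <= 0:
        raise ValueError("lane rate must be positive")
    # Inverse proportionality: doubling the rate halves the reach.
    return ref_reach_m * ref_rate_gbps / lane_rate_gbps


for rate in (100, 200, 400):
    print(f"{rate}G/lane: ~{passive_reach_m(rate):.2f} m passive reach")
```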

Why 224G-per-lane is the hinge of the 2026 fabric

At 50G/lane, passive copper comfortably crossed a rack. At 100G it still reached a couple of metres. At 224 Gb/s per lane (the lane rate behind 1.6T optics and the GB200/GB300-class fabric), passive DAC reach has collapsed to roughly 1-2 m and active copper to a few metres (~3-7 m). That is precisely why NVIDIA engineered the NVL72 so the worst-case in-rack NVLink span is about 0.83 m: it keeps the most bandwidth-hungry links of the whole machine on passive copper, where they cost no DSP power and effectively never flap. Step beyond copper's reach and you are forced onto optics, where a DSP-based pluggable costs roughly 14-17 W per port at 800G and joins the largest failure population in the fabric. The 224G transition did not change the rules; it moved the copper-to-optics boundary inward by about a metre and made that boundary costly to draw wrong.

The SerDes ladder and the FEC/TDECQ margin discipline

The SerDes (serializer/deserializer) is the gating spec for the entire fabric — it sets the per-lane electrical rate that switch ASICs, NICs, and optics must all agree on. The ladder runs 50G to 100G to 200G (224G signaling) to the 400G/lane (448G signaling) era now demonstrated at OFC 2026. Aggregate port speeds are just lane-count multiples: 800G is 8x100G or 4x200G; 1.6T is 8x200G; 3.2T arrives as 8x400G. Because every link partner has to speak the same lane rate, a SerDes generation transition is a synchronized, fleet-wide event — you cannot mix a 224G switch port with a 112G transceiver and expect them to negotiate.
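Because port speed is simply lanes times lane rate, the combinations can be enumerated mechanically. The sketch below is plain arithmetic under assumed lane counts (4, 8, 16) and nominal lane rates; the names are hypothetical, and a combination appearing in the output does not imply that a shipping product exists for it.

```python
# Nominal per-lane rates in Gb/s (the 50G rung is omitted for brevity).
LANE_RATES_GBPS = (100, 200, 400)


def lane_options(port_gbps: int, lane_counts=(4, 8, 16)) -> list[tuple[int, int]]:
    """Return every (lanes, lane_rate) pair whose product equals the port speed."""
    return [(n, r)
            for r in LANE_RATES_GBPS
            for n in lane_counts
            if n * r == port_gbps]


def can_link(lane_rate_a_gbps: int, lane_rate_b_gbps: int) -> bool:
    """Link partners must run the same per-lane rate to come up."""
    return lane_rate_a_gbps == lane_rate_b_gbps


for port in (800, 1600, 3200):
    opts = ", ".join(f"{n}x{r}G" for n, r in lane_options(port))
    print(f"{port}G: {opts}")
```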

What keeps a PAM4 link alive across the budget is forward error correction. Datacom standardized on Reed-Solomon FEC (RS(544,514), the "KP4" code), which delivers enough coding gain that a raw pre-FEC bit-error rate around 1e-4 is corrected to a post-FEC rate of 1e-12 or better. FEC is not free: it adds latency and a small bandwidth tax, and, critically, it has a cliff. While the pre-FEC error rate stays below the code's correction threshold, the post-FEC link is effectively error-free; once it crosses the threshold, errors leak through catastrophically. That makes margin the real engineering quantity. For optical transmitters the margin metric is TDECQ (Transmitter and Dispersion Eye Closure Quaternary): a single number, in dB, capturing how much the transmitter's eye is already closed before the channel touches it. A transmitter that ships with marginal TDECQ has no headroom for connector contamination, temperature drift, or aging, and it is the link that starts flapping six months into the deployment. TDECQ discipline at acceptance test is therefore a goodput decision dressed up as a metrology one.
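The "small bandwidth tax" and the notion of margin to the FEC cliff can both be put into numbers. The sketch below uses the RS(544,514) parameters named above; the helper names are hypothetical, and the 1e-4 threshold is the approximate pre-FEC figure quoted in this chapter, used here as an assumed default rather than a specification limit.

```python
import math

RS_N, RS_K = 544, 514  # RS(544,514): codeword and message length, in 10-bit symbols


def correctable_symbols(n: int = RS_N, k: int = RS_K) -> int:
    # A Reed-Solomon code corrects up to (n - k) / 2 symbol errors per codeword.
    return (n - k) // 2


def overhead_fraction(n: int = RS_N, k: int = RS_K) -> float:
    # Parity symbols added per payload symbol: the bandwidth tax.
    return (n - k) / k


def fec_margin_decades(pre_fec_ber: float, threshold_ber: float = 1e-4) -> float:
    """Distance from the FEC cliff in decades of BER; negative means past the cliff."""
    if pre_fec_ber <= 0 or threshold_ber <= 0:
        raise ValueError("BER values must be positive")
    return math.log10(threshold_ber / pre_fec_ber)


print(correctable_symbols())               # symbols correctable per codeword
print(f"{overhead_fraction():.2%}")        # parity overhead
print(f"{fec_margin_decades(1e-6):.1f}")   # healthy link: decades of headroom
print(f"{fec_margin_decades(5e-4):.1f}")   # degraded link: negative margin
```

Expressing margin in decades of pre-FEC BER is one convenient convention among several; the point is that a link can "pass" while sitting a fraction of a decade from the cliff.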

Deep dive: PAM4, TDECQ, and why temperature drift is a commissioning trap

PAM4's three stacked eyes are not equal, and they are not stable. Transmitter non-linearity squeezes the upper and lower eyes differently, and the laser/driver behavior shifts with junction temperature — so an optic that passes TDECQ on a cool bench at the factory can drift out of margin once it is jammed into a finned cage behind a hot switch ASIC pulling air that has already crossed a 130 kW rack. This is why TDECQ is specified with temperature, and why field-relevant qualification matters more than a datasheet number. The failure mode is insidious: the link does not fail at install, it fails statistically. As the eye closes under thermal stress, pre-FEC errors creep up toward the RS-FEC threshold; for most of the day the code corrects them and the link looks healthy on the dashboard, but at the thermal peak it tips over the cliff and the link flaps. The operator sees an intermittent, load-correlated, time-of-day-correlated fault that is maddening to chase — and the root cause was a transmitter commissioned with two-tenths of a dB too little TDECQ margin. The discipline that prevents it is boring and non-negotiable: clean every connector (a single fiber contamination event can cost more margin than the entire fiber span), qualify optics at the temperature they will actually run, and reject marginal TDECQ at acceptance rather than discovering it in production. The connector-level EMC and bonding that protect copper links from a different class of impairment are canonical in Chapter 4.11.

The interconnect medium taxonomy: copper inside, optics outside

With the budget and the ladder established, the taxonomy itself is a reach ladder. As required distance grows, you climb from passive copper, to electrically-assisted copper, to optics — paying more power and more dollars at each rung in exchange for more reach. The governing principle every hyperscaler converges on is copper inside, optics outside: keep the densest, highest-bandwidth, latency-critical links (scale-up inside the rack) on copper, and spend optics only where copper physically cannot reach (scale-out between racks and across the row).

Passive DAC (direct-attach copper) is a plain shielded twinax cable with no active electronics — zero added power, lowest cost, lowest latency, and effectively zero failure rate, but the shortest reach (~1-2 m at 224G). ACC/AEC (active copper / active electrical cable) adds a small redriver or retimer IC in the connector to boost and re-clean the signal, buying a few more metres (~3-7 m) at the cost of a couple of watts and a small active-component failure population. AOC (active optical cable) is a captive, factory-terminated optical assembly — optics on both ends, fiber in between, sold as a single fixed-length cable; it reaches tens to hundreds of metres but you cannot break it out or re-length it. Pluggable transceivers are the fully modular option: a transceiver in a cage on each end with field-installed structured fiber between them, the most flexible and the basis of every large scale-out fabric — and the most power-hungry and failure-prone rung on the ladder.

Interconnect medium taxonomy → reach / power / failure tradeoff (224G/1.6T-class)

Medium	Added electronics	Typical reach (224G)	Added power / port	Failure profile	Where it lives in an AI cluster
Passive DAC	None (plain twinax)	~1-2 m	~0 W	Effectively nil; no active parts	Scale-up inside the rack; shortest in-row hops
ACC / AEC (active copper)	Redriver or retimer in connector	~3-7 m	~1-3 W	Small active-component population	Rack-to-rack where copper still reaches; top-of-rack to spine within a row
AOC (active optical cable)	Optics fixed on both ends	Tens to ~100+ m	~10-17 W (pair)	Optics failure rate, but factory-terminated	Fixed point-to-point runs where length is known
Pluggable optics + structured fiber	Transceiver each end, field fiber between	100 m to multi-km	~14-17 W (DSP pluggable, 800G)	Dominant network failure/flap source	Scale-out leaf-spine and beyond; the modular default

Reach figures are 2026 practitioner ranges at 224G-per-lane / 1.6T aggregate; the pluggable power figure is for an 800G DSP module, as marked in the table (SemiAnalysis GB200 architecture; NVIDIA cabling docs; Semtech/Credo OFC 2026). Reach shrinks at each SerDes step; the copper/optics boundary moves inward with rate.
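Read as a decision rule, the table above reduces to a reach ladder. The sketch below is a simplification under stated assumptions: the function name is hypothetical, the thresholds are the upper ends of the 224G ranges in the table (a conservative design might use the lower ends), and it ignores the AOC versus pluggable distinction, which depends on whether the run length is fixed.

```python
def choose_medium(reach_m: float,
                  dac_max_m: float = 2.0,
                  aec_max_m: float = 7.0) -> str:
    """Pick the lowest rung of the reach ladder that still covers the span.

    dac_max_m and aec_max_m are assumed limits for a 224G-per-lane link.
    """
    if reach_m <= 0:
        raise ValueError("reach must be positive")
    if reach_m <= dac_max_m:
        return "passive DAC"   # no added power, no active parts
    if reach_m <= aec_max_m:
        return "ACC/AEC"       # redriver or retimer in the connector
    return "optics"            # AOC or pluggable transceiver plus fiber


for span in (0.83, 1.5, 5.0, 30.0):
    print(f"{span:>5} m -> {choose_medium(span)}")
```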

The copper-to-optics boundary is a power and reliability decision, not a cabling one

Drawing this boundary one rack-unit too far toward optics has a measurable cost. A passive DAC adds zero watts and never flaps; the 800G pluggable that replaces it adds 14-17 W per port and joins the largest failure population in the fabric. At cluster scale that delta compounds: SemiAnalysis estimated that keeping NVL72 scale-up on copper rather than optics saves on the order of 20 kW per rack in transceiver power alone — power that, in a power-bound facility, is GPUs you could have energized instead. Conversely, stretching a DAC past its margin to save a transceiver buys you intermittent link flaps that cost far more in lost goodput than the optic would have cost in capex. The right boundary is the one where copper's reach margin runs out — not a rack-unit before, not a rack-unit after. This is the same logic that drives the scale-up fabric's copper-first design in Chapter 8.2 and the CPO transition in Chapter 8.10.

Pluggable form factors: the cage you are mechanically married to

Once you are on pluggable optics, the form factor is a long-lived commitment because the cage — the receptacle soldered to the switch faceplate — is fixed for the life of the box. The three that matter in 2026 are QSFP-DD, OSFP, and OSFP-XD. QSFP-DD (8 lanes, the evolution of the ubiquitous QSFP) is the volume Ethernet workhorse and is backward-compatible with older QSFP optics in the same cage. OSFP is slightly larger, carries 8 lanes, and — crucially — was designed with a larger thermal envelope and an integrated heat-management path, which is why it dominates the highest-power 800G/1.6T AI deployments where the older QSFP-DD cage struggles to dissipate the module's heat. OSFP-XD doubles to 16 lanes, enabling 1.6T today (16x100G) and a path to 3.2T (16x200G) — the form factor GB200-class fabrics reach for when a single port must carry two NVLink/NIC widths.

The thermal lid is its own fork. Finned-top OSFP modules carry their own heatsink and rely on the switch's own airflow — standard in air-cooled halls. Flat-top OSFP modules omit the integral fin so the cage can press against a cold plate or a liquid-cooled riding heatsink — the variant you need when the switch itself is liquid-cooled. The trap is that finned and flat-top modules are not freely interchangeable in a given cage and airflow design, so the form-factor-plus-lid choice is locked in lockstep with the cooling architecture of the hall. Choose a liquid-cooled switch and you have implicitly chosen flat-top optics and a cold-plate-compatible cage; retrofitting that later is a faceplate redesign, not a swap.

The DSP-placement spectrum: retimed, LRO, LPO, half-retimed

Inside a conventional pluggable transceiver sits a power-hungry DSP that re-times and re-equalizes the signal on both the host (electrical) and line (optical) sides. That DSP is the single biggest power and cost item in the module — and the largest lever the industry has to cut optics power, which is why a spectrum of architectures has emerged that progressively removes it. The decision is a four-way fork, and it trades optics power against host-board co-design burden and interoperability risk.

Fully-retimed pluggable is the incumbent: a full DSP retimes both sides, isolating the optic from the host channel. It is the most robust, the most plug-and-play, the most multi-vendor-interoperable — and the most power-hungry, at roughly 14-17 W for an 800G module. LRO (linear receive optics, or linear-drive on the receive side) removes the DSP from one direction, keeping a retimer where it helps most and going linear where the channel is friendly — a pragmatic middle that recovers several watts with modest co-design. LPO (linear pluggable optics) removes the DSP entirely, relying on the host ASIC's own SerDes equalization to drive the optic directly; it cuts 800G optics power to roughly 7-8.5 W — close to halving it — but demands tight host-board co-design and end-to-end link qualification, because the optic no longer cleans up the host channel. Half-retimed architectures split the difference per-direction. The further you move toward LPO, the more power and cost you save and the more you take on the burden of owning the whole electrical path and validating it against a narrower vendor set.

DSP placement → optics power vs co-design burden (800G class)

Architecture	DSP placement	~Power / 800G	Host co-design burden	Interop / serviceability	Best fit
Fully-retimed pluggable	Full DSP, both sides	~14-17 W	Low — optic isolates host	Highest; multi-vendor, field-swappable	Default scale-out; multi-vendor fabrics
Half-retimed	Retimer one direction only	~10-13 W	Moderate	High	Power-sensitive links with friendly channel one way
LRO (linear receive)	Retimer kept where needed; linear elsewhere	~9-11 W	Moderate	High	Pragmatic power recovery without full LPO risk
LPO (linear pluggable)	No DSP; host SerDes drives optic	~7-8.5 W	High — own the whole electrical path	Narrower vendor set; still field-swappable	Hyperscale fabrics that control host design
CPO (co-packaged, for contrast)	Optics co-packaged with switch ASIC	under ~6 W	Highest — switch/optics co-design	Lowest; not field-serviceable (→ 8.10)	Highest-radix switches at the power wall

Power figures are 2026 per-800G-link practitioner ranges (SemiAnalysis / Broadcom / Credo / Nokia). CPO row shown for contrast; CPO is engineered in Chapter 8.10. Co-design burden rises as the DSP is removed.
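To see what the per-module ranges in the table mean at fleet scale, they can be multiplied out. The sketch below copies the 800G power ranges from the table; the function names are hypothetical and the 100,000-module fleet is an arbitrary illustrative size, not a figure from this chapter. CPO is left out because the table gives only an upper bound for it.

```python
# Per-module power ranges for an 800G link, in watts (low, high), from the table.
POWER_W_PER_800G = {
    "fully-retimed": (14.0, 17.0),
    "half-retimed": (10.0, 13.0),
    "LRO": (9.0, 11.0),
    "LPO": (7.0, 8.5),
}


def fleet_power_kw(arch: str, modules: int) -> tuple[float, float]:
    """Low and high estimate of total optics power for a module population."""
    low_w, high_w = POWER_W_PER_800G[arch]
    return (low_w * modules / 1000.0, high_w * modules / 1000.0)


def midpoint_saving_kw(from_arch: str, to_arch: str, modules: int) -> float:
    """Saving when switching architectures, using the midpoint of each range."""
    mid = lambda arch: sum(POWER_W_PER_800G[arch]) / 2.0
    return (mid(from_arch) - mid(to_arch)) * modules / 1000.0


MODULES = 100_000  # assumed fleet size for illustration
print(fleet_power_kw("fully-retimed", MODULES))
print(fleet_power_kw("LPO", MODULES))
print(midpoint_saving_kw("fully-retimed", "LPO", MODULES))
```

The midpoint is a crude summary of a range; a real comparison would use measured module power at operating temperature.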

50G→400G/lane. SerDes ladder: 800G = 8x100G or 4x200G; 1.6T = 8x200G; 3.2T at 448G signaling (OFC 2026). Source (2026): SemiAnalysis (AI networks); Semtech/Synopsys OFC 2026.

~1-2 m. Passive DAC reach at 224G; active copper (AEC) ~3-7 m; optics beyond. Reach halves as lane rate doubles. Source (2025): SemiAnalysis (GB200 architecture).

~0.83 m. NVL72 worst-case in-rack NVLink span, engineered to keep scale-up on copper. Source (2025): SemiAnalysis (GB200 hardware architecture).

14-17 W → 7-8.5 W. 800G optics power: DSP pluggable vs LPO; CPO under ~6 W. Source (2025): SemiAnalysis / Broadcom.

~20 kW/rack. Transceiver power saved by keeping NVL72 scale-up on copper vs optics. Source (2025): SemiAnalysis (Nvidia's Optical Boogeyman).

Flap every ~48 s. Link-flap cadence in a 10M-optic fleet (hard failure only every few days); mean-time-to-flap far below MTBF. Source (2025): Credo ZeroFlap analysis.

20-30%. Cluster productivity operators report losing to persistent link instability before isolating bad optics. Source (2025): Credo / EDGE Optical synthesis.

First volume year 2026. 1.6T switch deployment; ramp faster than 800G, with >5M ports within 1-2 years of shipping. Source (2026): Dell'Oro Group.

Optics as the dominant failure and operating-cost driver

Here is the part the datasheets bury. At the scale of a modern AI cluster, optics are not a component — they are a population, and populations fail statistically. The reliability metric that matters is the mean-time-to-flap — the far shorter interval between transient link drops that re-establish in seconds — not the headline hard-failure MTBF, which at millions of hours per module reads comfortably on paper. The arithmetic is brutal. A cluster wiring up a few hundred thousand GPUs runs on the order of millions of optical links; at fleet scale that translates to a hard optic failure every few days but a link flap roughly every 48 seconds. Each flap is a momentary loss of a link, and in a synchronous training job a single flapping link on the wrong path can stall a collective and waste the work of every GPU waiting on it.
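The fleet arithmetic above can be inverted to see what it implies for a single link. The sketch below is a back-of-envelope illustration with hypothetical function names. It assumes links flap independently at a constant rate, which is a simplification (real flaps concentrate on marginal optics), and it uses the 10M-optic fleet size quoted alongside the ~48 s figure.

```python
SECONDS_PER_HOUR = 3600.0


def per_link_interval_hours(fleet_interval_s: float, links: int) -> float:
    """Mean time between events on one link, given the fleet-wide cadence."""
    return fleet_interval_s * links / SECONDS_PER_HOUR


def fleet_interval_s(per_link_hours: float, links: int) -> float:
    """Fleet-wide mean time between events, given the per-link interval."""
    return per_link_hours * SECONDS_PER_HOUR / links


LINKS = 10_000_000  # fleet size quoted with the ~48 s flap cadence
hours = per_link_interval_hours(48.0, LINKS)
print(f"per-link mean time to flap: {hours:,.0f} h (~{hours / 8766:.1f} years)")
print(f"round trip: one flap every {fleet_interval_s(hours, LINKS):.0f} s fleet-wide")
```

Under these assumptions each link flaps only about once every 133,000 hours, roughly 15 years, yet the fleet sees one every 48 seconds. A per-module figure that looks comfortable on paper says little about how the population behaves.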

The goodput consequence is first-order, not marginal. With 32,000 GPUs and a one-hour checkpoint interval, one unstable link that forces a restart wastes up to 32,000 GPU-hours of compute. Operators report losing 20-30% of cluster productivity to persistent link instability before the offending transceivers are isolated and replaced — a number that dwarfs the capex of the optics themselves. This is why a generation of "flap-resistant" transceivers (e.g. Credo's ZeroFlap family) emerged in 2025-2026 marketing better mean-time-to-flap rather than better MTBF: the industry finally measured the metric that was actually costing money. The Llama 3 herd's published failure breakdown is the canonical at-scale dataset here — network and cable issues were a meaningful slice of the 466 interruptions over 54 days, and the run still achieved over 90% effective training time only because of disciplined fault isolation and fast restart. The checkpoint math that converts a flap into wasted GPU-hours is derived in Chapter 9.4; the goodput-vs-availability reframing is in Chapter 12.2.
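The 32,000 GPU-hour figure is the worst case of a simple product, sketched below. The function names are hypothetical, the optional restart time is an added parameter not discussed above, and the "expected" variant assumes the failure lands uniformly at random within the checkpoint interval, which is an assumption of this sketch rather than a claim from the source.

```python
def wasted_gpu_hours(gpus: int,
                     hours_since_checkpoint: float,
                     restart_hours: float = 0.0) -> float:
    """Work lost when a synchronous job rolls back to its last checkpoint."""
    if gpus < 0 or hours_since_checkpoint < 0 or restart_hours < 0:
        raise ValueError("inputs must be non-negative")
    # Every GPU loses the work since the checkpoint, plus any idle restart time.
    return gpus * (hours_since_checkpoint + restart_hours)


def expected_wasted_gpu_hours(gpus: int, checkpoint_interval_hours: float) -> float:
    """Average loss if the failure time is uniform within the interval (assumed)."""
    return wasted_gpu_hours(gpus, checkpoint_interval_hours / 2.0)


print(wasted_gpu_hours(32_000, 1.0))           # worst case: failure just before a checkpoint
print(expected_wasted_gpu_hours(32_000, 1.0))  # average case under the uniform assumption
```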

The flap, not the failure, is what you commission against

A standard reliability spec — "MTBF in the millions of hours" — is a true statement that tells you almost nothing about how an AI fabric will behave. It counts hard failures and ignores flaps, and flaps are where the goodput goes. Commission against the flap: qualify optics at field temperature with real TDECQ margin, mandate connector cleanliness as a hard gate, instrument every link for pre-FEC BER trending (so a degrading optic is found before it tips over the FEC cliff, not after it has stalled a job), and build the operational muscle to isolate and hot-swap a flapping module fast. The cluster that does this loses single-digit percentages to optics; the one that trusts the MTBF number loses 20-30% and spends weeks hunting ghosts. Optics reliability is an operations program, not a procurement checkbox.

Deep dive: commissioning and operating the optical plant as a reliability program

Treating optics as a managed population changes what "done" means at every lifecycle stage. At factory and incoming inspection, qualify TDECQ at the temperature the optic will actually run, not bench-cool; sample-test incoming lots rather than trusting the supplier's coupon. At install, connector cleanliness is the single highest-leverage discipline — a contaminated ferrule can erase more link margin than the entire fiber run, and contamination is cumulative across every mate/de-mate, so inspect-and-clean before every connection is a hard rule, not a courtesy. At commissioning, run the fabric under synthetic all-reduce load at thermal soak and read pre-FEC BER on every link; reject any link without comfortable FEC margin even if it "passes" — a link at the FEC threshold at acceptance is a flap in production. In operation, the win is telemetry-driven: trend per-link pre-FEC BER and flag the optic whose error rate is climbing toward the cliff before it stalls a job, so the swap is scheduled into a maintenance window rather than triggered by a 32,000-GPU-hour restart. Keep a hot-spare optic population sized to the statistical replacement rate, and design the cabling so a single module is swappable without disturbing its neighbors. None of this is exotic; all of it is the difference between a cluster at 96% goodput and one at 75%. The structured-cabling, polarity, and loss-budget engineering that the field-fiber half of this depends on is canonical in Chapter 8.10.
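The telemetry-driven step, flagging an optic whose pre-FEC BER is climbing toward the cliff, can be sketched as a straight-line fit on log10(BER) over time. This is a minimal illustration, not a production detector: the function name and sample data are hypothetical, the 1e-4 threshold is the approximate figure used in this chapter, and a log-linear trend is an assumption that real optics (with thermal cycling and step changes) will often violate.

```python
import math


def hours_to_threshold(samples, threshold_ber: float = 1e-4):
    """Estimate hours until pre-FEC BER crosses the threshold.

    samples: chronological list of (hours, pre_fec_ber) readings.
    Returns None when the trend is flat or improving, 0.0 when already past.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = [t for t, _ in samples]
    ys = [math.log10(ber) for _, ber in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope, in decades of BER per hour.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    t_cross = (math.log10(threshold_ber) - intercept) / slope
    return max(0.0, t_cross - xs[-1])


# Invented readings: BER rising one decade per day.
degrading = [(0.0, 1e-7), (24.0, 1e-6), (48.0, 1e-5)]
stable = [(0.0, 1e-8), (24.0, 1e-8), (48.0, 1e-8)]
print(hours_to_threshold(degrading))  # hours left before the projected crossing
print(hours_to_threshold(stable))     # None: no upward trend
```

An estimate like this is only useful for ranking which modules to swap in the next maintenance window; it does not replace thermal-soak qualification.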

Putting the taxonomy to work: drawing your boundaries

The physical layer is a sequence of three boundaries, each derived from the layer above it, and each costly if drawn wrong. First, the copper-to-optics boundary: set by your SerDes generation's reach physics. Keep scale-up and the shortest scale-out hops on passive copper (zero power, zero flaps); cross to active copper only for the reach copper still closes; reserve optics for spans copper physically cannot carry. Draw it toward optics and you burn ~20 kW/rack of needless DSP power; draw it toward copper and you eat link flaps. Second, the DSP-placement boundary: how much optics power you cut versus how much host co-design and interop risk you absorb. Fully-retimed is safe and power-hungry; LPO halves the power but makes you the owner of the whole electrical path. Third, the form-factor boundary: locked in lockstep with the switch's cooling — flat-top optics for liquid-cooled switches, finned for air — and unwound only by a faceplate redesign.

Wrapping all three is the reliability discipline, because the physical layer is the layer where goodput is silently won or lost. A fabric that looks identical on a topology diagram can deliver 96% or 75% effective training time depending entirely on whether its optics were commissioned with TDECQ margin and operated as a managed population. The taxonomy tells you which medium to use; the commissioning discipline tells you whether it will actually carry your bits for the life of the cluster.

The fabrics these links serve are covered elsewhere: scale-up (where copper-first design lives) in Chapter 8.2, the switch/NIC SerDes silicon in Chapter 8.3, scale-out protocols in Chapter 8.4, topology and oversubscription in Chapter 8.5, and cross-campus coherent DCI optics in Chapter 8.8. The co-packaged-optics endgame, fiber plant, and structured cabling that this chapter's pluggable taxonomy hands off to are engineered in Chapter 8.10. Connector-level grounding, bonding, and EMC are canonical in Chapter 4.11; the cooling architecture that the flat-top/finned form-factor fork is married to is in Chapter 5.4. The goodput cost of a link flap is converted into wasted GPU-hours via the checkpoint math in Chapter 9.4 and reframed as goodput-vs-availability in Chapter 12.2; the consolidated optics roadmap to 3.2T and beyond lives in Chapter 16.2.