
Chapter 7.14

Server & System Integration

The rack does not arrive — it is integrated, and the level at which it is integrated (DGX appliance, HGX-OEM, ODM-direct, or OCP self-design) decides who owns the factory burn-in, who owns the acceptance gate, who owns the RMA, and ultimately how many days of stranded goodput sit between a powered shell and a producing cluster.

GOODPUT · DENSITY-RAMP · POWER-BOUND

What you'll decide here

The integration model — DGX/turnkey vs HGX-from-an-OEM vs ODM-direct vs OCP self-design — which sets your margin stack, your serviceability terms, and how much systems-integration risk you are insourcing.
Factory (L11/L12) vs field integration — where the rack actually gets built and tested, and therefore whether you ship wet or dry, how you move a 1.5–3 t rack, and how much install-day risk you carry.
The acceptance gate you will sign against — goodput/MFU and a multi-day burn-in, not a power-on smoke test — because the gate, not the datasheet, is what you are actually buying.
Where your true lead-time gate is — CoWoS and HBM allocation upstream, not rack assembly downstream — so the build plan is sequenced against the constraint that actually slips.
The spares, RMA, and serviceability posture you contract for, because at one failure every few days per cluster, mean-time-to-repair on a tray is a goodput line item, not an afterthought.

A modern AI rack is not a product you buy off a shelf and bolt to the floor. It is the output of a manufacturing pipeline that starts at silicon and ends at a benchmarked cluster, and the decision that most shapes cost and risk is where the operator enters it. Enter at the top, and a vendor hands you a turnkey, validated NVL72 with a warranty and a phone number. Enter at the bottom, and you are the systems integrator: you own the bill of materials, the firmware matrix, the burn-in scripts, the acceptance gate, and every tray that fails at 3 a.m. The two ends of that spectrum differ by double-digit points of gross margin, by months of time-to-goodput, and by who is liable when a cluster does not hit its MFU number. This chapter is about that entry decision and everything it cascades into.

This chapter starts with the L1–L12 manufacturing-level model — the shared vocabulary the industry uses to say who builds what — and the ODM / OEM / systems-integrator roles mapped onto it. We then walk the build-vs-buy fork (DGX vs HGX-OEM vs ODM-direct vs OCP self-design), the OCP Open Rack standards and the 2026 reference systems (HGX, MGX, GB200/GB300 NVL72, AMD Helios), the factory-vs-field integration question and the logistics of shipping a wet 1.5–3 t rack, the goodput-oriented acceptance gate, the CoWoS/HBM lead-time reality that gates the whole thing, and the commissioning handoff to operations. The rack as a physical integration unit is treated in Chapter 7.13; this chapter is about integrating it.

The L1–L12 manufacturing-level model

The industry talks about hardware integration in numbered "levels," and getting fluent in them is the precondition for every contract you sign. The scale runs from raw components to a benchmarked cluster, and the level at which a vendor delivers is exactly the line that separates "you bought a server" from "you bought a working AI factory." The numbering varies slightly by vendor, but the structure is stable: L1–L5 are component and PCB assembly (bare board, SMT placement, the GPU baseboard or UBB); L6 is the populated motherboard/baseboard; L10 is a fully assembled server that boots an OS; L11 is a fully cabled rack — compute trays, NVSwitch trays, busbar, manifolds, top-of-rack switches, in-rack network and power cabling, tested as a unit; and L12 is a multi-rack cluster with cross-rack cabling, the customer's software loaded, and rack-scale benchmarks run to prove the thing actually performs (DCD; AMAX; Hyperscalers, 2025).

The classic division of labor: ODMs (Foxconn/Hon Hai, Quanta/QCT, Wistron/Wiwynn, Inventec, Supermicro on its ODM side) own roughly L1–L6 and increasingly push up into L10–L11; OEMs (Dell, HPE, Lenovo, Supermicro on its brand side) take L6–L10 and add brand, warranty, supply assurance, and a global service organization; the systems integrator — which can be the OEM, a specialist, or the operator itself — owns L11–L12, the part where a pile of validated servers becomes a producing cluster. The strategic point: the L11/L12 boundary is where time-to-goodput is won or lost. Whoever owns it owns the burn-in, the acceptance gate, and the install-day risk.

The integration levels and who typically owns them

Level	What it produces	Typical owner	What you are buying	Where the risk sits
L1–L5	Bare PCB → SMT-populated board → GPU baseboard (UBB/SXM)	ODM / contract manufacturer	Components and sub-assemblies	Yield, HBM/CoWoS supply
L6	Populated motherboard / GPU baseboard	ODM (handed to OEM)	A tested board	Firmware, board-level defects
L10	Fully assembled server that boots an OS	OEM / ODM	A working node	Node burn-in, thermal validation
L11	Fully cabled, tested rack (trays, busbar, manifolds, ToR, cabling)	OEM / systems integrator	A deployable rack	Mis-cabling, leak test, rack-level burn-in
L12	Multi-rack cluster, cross-rack cabling, software loaded, benchmarked	Systems integrator / operator	A producing cluster	Goodput/MFU acceptance, fabric validation

Level numbering follows common ODM/OEM usage (DCD, AMAX, Hyperscalers 2025); exact boundaries vary by vendor. The right-hand columns are the load-bearing ones — they say where your money and your risk actually sit.

Build vs buy: the four entry points

Map the entry decision onto four archetypes, ordered from most-bought to most-built. Each trades margin paid against integration risk insourced and control gained. There is no universally right answer — the right cell is a function of your scale, your engineering depth, and how much of the systems-integration burden you can actually carry.

DGX / turnkey (NVIDIA DGX, the GB-series "NVL72" sold as a system). You buy a fully-integrated, factory-validated, single-throat-to-choke supercomputer with NVIDIA's software stack, reference fabric, and warranty. Highest unit cost, lowest integration risk, fastest path to a known-good cluster — and the deepest lock-in (Chapter 7.9). This is the right call for an enterprise standing up its first cluster or anyone who values a single accountable vendor over unit economics.

HGX-from-an-OEM (Dell, HPE, Lenovo, Supermicro building on the NVIDIA HGX/MGX baseboard). The middle path and the volume of the market. NVIDIA sells the HGX 8-GPU baseboard (or the MGX modular rack reference); the OEM does L6–L11 integration, adds its own chassis, thermals, BMC, service, and supply assurance. You get brand-name support and a global RMA org while escaping the full DGX premium. The cost: you inherit the OEM's firmware/validation cadence and pay an integration margin the ODM-direct buyer skips.

ODM-direct (buying L10/L11 straight from Quanta, Wiwynn, Foxconn, Supermicro's ODM arm). You strip out the OEM brand margin and contract the integrator directly, often to your own spec. Lower unit cost, more control over BOM and firmware — but you are now closer to owning the acceptance gate and the RMA logistics yourself. This is the hyperscaler and large-neocloud default once volume justifies an in-house hardware team.

OCP self-design (you specify the rack against Open Compute standards and have ODMs build to it). Maximum control, lowest unit cost at scale, no brand margin at all — and you are the systems integrator. You own the design, the BOM, the firmware matrix, the burn-in scripts, the acceptance criteria, and every serviceability decision. Only justified at hyperscale, where a point of efficiency across hundreds of thousands of GPUs dwarfs the cost of an in-house infrastructure org. Meta, Microsoft, Google, and Amazon live here.

Build-vs-buy: DGX vs HGX-OEM vs ODM-direct vs OCP self-design

Entry point	Who integrates L11/L12	Relative unit cost	Integration risk you own	Best-fit buyer
DGX / turnkey NVL72	Vendor (factory-validated)	Highest (full premium)	Minimal — vendor owns the gate	First cluster; single-vendor accountability
HGX-from-an-OEM	OEM (Dell/HPE/Lenovo/SMCI)	High (brand + integration margin)	Low — OEM warranty & RMA	Enterprise/mid-scale wanting brand support
ODM-direct	ODM, to your spec	Low (no brand margin)	Moderate — you co-own acceptance	Large neoclouds; in-house HW team
OCP self-design	You (the operator)	Lowest at scale	Full — you are the integrator	Hyperscalers; fleets >100k GPUs

Relative unit cost is a directional 2026 practitioner ranking with ODM-direct as the baseline, not a quote; integration risk is qualitative. The fork is who owns L11/L12.

The fork that actually matters: who owns the L11/L12 acceptance gate

Strip away the brand names and the build-vs-buy decision reduces to one question: when the cluster fails to hit its goodput number, whose problem is it? Buy DGX or HGX-from-an-OEM and the answer is the vendor's — you have outsourced the acceptance gate and you pay for that insurance in margin. Go ODM-direct or OCP self-design and the answer is yours — you have insourced the gate and the savings, and you had better have the validation engineering to back it up. The trap is buying ODM-direct economics while assuming OEM-style accountability: nobody owns the goodput gate, and the stranded weeks between "racks installed" and "cluster producing" are the most expensive line item nobody budgeted for. Decide who owns L11/L12 in writing before the first PO.
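One way to make "in writing" concrete is to list each gate and check that a named party has signed for it. The sketch below is illustrative only: the gate names follow this chapter, while the party labels, the dictionary layout, and the `unowned_gates` helper are assumptions, not a contract template.

```python
from typing import Dict, List, Optional

# Gates named after the levels in this chapter. The list itself is an
# illustrative assumption, not a standard checklist.
GATES = (
    "L10 node burn-in",
    "L11 rack validation",
    "L12 goodput acceptance",
    "RMA logistics",
)


def unowned_gates(contract: Dict[str, Optional[str]]) -> List[str]:
    """Return gates with no party recorded as the owner."""
    return [gate for gate in GATES if not contract.get(gate)]


# Hypothetical ODM-direct draft in which the buyer assumed OEM-style cover.
odm_direct_draft = {
    "L10 node burn-in": "ODM",
    "L11 rack validation": "ODM",
    "L12 goodput acceptance": None,  # nobody has signed for this gate
    "RMA logistics": "operator",
}

print(unowned_gates(odm_direct_draft))  # ['L12 goodput acceptance']
```

An empty result is the condition to reach before the first PO; any entry in the list is a gate that will default to nobody.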

OCP, Open Rack & the 2026 reference systems

The Open Compute Project is the standards substrate that makes ODM-direct and self-design viable: it turns proprietary rack designs into shared, multi-vendor specifications so an operator can second-source the same rack from Quanta, Wiwynn, or Foxconn instead of being captive to one builder. The relevant standards for AI in 2026 are the Open Rack family. ORV3 (Open Rack v3, Meta-led, published 2022) moved the industry to a 21-inch rack with a vertical DC busbar, blind-mate power, native 48 V distribution, and provisions for direct liquid cooling — the form factor most current high-density AI racks descend from (OCP; Introl, 2025). At OCP 2025, Meta introduced Open Rack Wide (ORW), a double-wide standard explicitly designed for the power, cooling, and serviceability demands of next-generation rack-scale AI — the spec AMD's Helios is built on (OCP / Meta; AMD, 2025).

The reference systems are where these standards meet silicon. HGX is NVIDIA's 8-GPU baseboard reference — the building block OEMs integrate into air- or liquid-cooled servers; it is the inference and small-training workhorse. MGX is NVIDIA's modular rack-level reference that lets partners mix CPUs, GPUs, and DPUs into standardized rack designs. GB200/GB300 NVL72 is the rack-as-the-unit: 72 Blackwell (or Blackwell Ultra) GPUs and 36 Grace CPUs fused into a single ~1.36 t, liquid-cooled, ~120–135 kW NVLink domain — the densest tightly-coupled training/inference unit in volume in 2026. AMD Helios is the open challenger: an ORW double-wide rack carrying up to 72 MI450-series GPUs, ~1.4 EF FP8 / 2.9 EF FP4 and 31 TB HBM4 at rack scale, compliant with OCP, UALink, and Ultra Ethernet — the open-standards answer to a single-vendor NVL72 (AMD; NextPlatform; DCD, 2025–2026).

2026 rack-scale reference systems

System	Unit of integration	Accelerators	Power / weight	Fabric & standards posture
NVIDIA HGX (B200/B300)	8-GPU server baseboard	8 Blackwell/Ultra	~30–60 kW/rack (air or liquid)	NVLink in-board; vendor-proprietary
NVIDIA MGX	Modular rack reference	Mix-and-match GPU/CPU/DPU	Density by configuration	NVLink/NVSwitch; NVIDIA reference
GB200 NVL72	The rack (72-GPU NVLink domain)	72 Blackwell + 36 Grace	~120–132 kW, ~1.36 t	NVLink5 (130 TB/s rack); proprietary
GB300 NVL72	The rack (Blackwell Ultra)	72 Blackwell Ultra + 36 Grace	~135 kW TDP (to ~155 kW peak), ~1.36 t; ~90% of heat to liquid / ~10% to air	NVLink5; proprietary
AMD Helios (ORW)	Double-wide rack	Up to 72 MI450-series	ORW double-wide; weight spread across 2 bays	UALink + Ultra Ethernet; OCP-open

Vendor specs as of late 2025–early 2026; FP4/FP8 figures are sparse/peak vendor numbers. NVL72 weights/power per NVIDIA OCP and OEM datasheets; Helios per AMD/NextPlatform. Helios is a 2H-2026 reference-to-volume design.

Read the last column as the real strategic axis. NVL72 is a vertically-integrated, single-vendor unit: you get a validated NVLink domain and a proprietary scale-up fabric, and you accept the lock-in. Helios is the open bet: a double-wide ORW rack on UALink and Ultra Ethernet, multi-sourceable through OCP, with the explicit design goal of spreading weight and improving serviceability by going wide rather than tall. The double-wide move is not cosmetic — it directly attacks the floor-loading and field-serviceability problems that a 1.36 t single-bay NVL72 creates, which is the next section's subject. The scale-up fabric choices behind NVLink vs UALink are treated in Chapter 8.2; the merchant-vs-captive silicon framing in Chapter 7.1.

Factory vs field integration: where the rack gets built

Once you know who integrates, the next fork is where: is the rack built and tested at the factory (L11/L12 done before it ships) or assembled in the field at your site? This is the central velocity decision of the deployment, and it pivots on a hard physical fact — a populated NVL72 weighs roughly 1.36 t (~3,000 lb), concentrated in a single ~48U footprint, and is plumbed with ~200 L of coolant and thousands of in-rack cables.

Factory integration (ship the rack whole) is the 2026 default for dense liquid-cooled systems precisely because mis-cabling and leak risk are too high to absorb on the install floor. The integrator assembles trays, busbar, manifolds, ToR switches, and in-rack cabling in a controlled environment, runs rack-level burn-in, and ships a tested unit. NVIDIA's rack-scale partners explicitly factory-integrate the liquid loop and re-test at the rack level so the rack can be deployed directly at the customer site. The cost is logistics: you are now moving a 1.36 t object, and the question becomes whether it ships wet (coolant already in the loop, factory-tested as-shipped) or dry (drained for transit, then filled and leak-tested in the field). Shipping wet preserves the factory test state and shaves field commissioning time but adds weight, freeze/spill risk, and stricter handling; shipping dry is lighter and safer in transit but reintroduces a fill-and-leak-test step on the critical path. Most high-density racks ship dry-of-coolant for transit and are filled on site, with the factory loop integrity certified separately — but the choice is contractual and worth pinning down explicitly.

Field integration (build the rack on site), meaning populating an empty rack with trays and cabling it in the data hall, survives only for lower-density, air-cooled, or 19-inch-EIA configurations where the weight and cabling risk are manageable. For NVL72-class systems it is an anti-pattern: you are doing precision liquid plumbing and thousands of cable terminations in an uncontrolled environment, against an install clock, with mis-cabling as the dominant acceptance failure (the install velocity and cabling discipline this demands are the subject of Chapter 7.15).

The 1.5–3 t rack breaks your floor, your dock, and your aisle

The logistics of a factory-integrated rack are an engineering problem in their own right, and skipping the analysis strands the rack on the loading dock. A ~1.36 t NVL72 (and the heavier multi-rack pallets behind it) imposes a concentrated point load most legacy raised floors cannot bear — the structural floor-loading basis is one of the irreversible scoping decisions for exactly this reason (Chapter 6.7). Then check the rest of the path: dock height and capacity, freight-elevator rating, door widths, aisle turning radius, ramp gradients, and the rigging gear to move a 3,000 lb object without tipping it. AMD's double-wide Helios is, in part, an answer to this — spreading the same compute across two bays cuts the point load and eases the move. Walk the physical path from truck to final position before the rack ships, not after it arrives.
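The path walk can be reduced to two quick checks: the average load the rack puts on the floor, and whether every element on the route is rated for the rack's mass. The sketch below is a rough screen under stated assumptions. The 1,360 kg mass comes from the text; the footprint, the route elements, and their ratings are hypothetical, and the mass of crating and rigging gear is ignored. It is not a substitute for a structural engineer's assessment.

```python
from typing import Dict, List

RACK_MASS_KG = 1360.0     # ~1.36 t NVL72, from the text
FOOTPRINT_M2 = 0.6 * 1.2  # ASSUMPTION: illustrative rack footprint in m^2


def distributed_load_kg_m2(mass_kg: float, footprint_m2: float) -> float:
    """Average load over the footprint; caster point loads are higher."""
    return mass_kg / footprint_m2


def path_bottlenecks(mass_kg: float, ratings_kg: Dict[str, float]) -> List[str]:
    """Return route elements whose rated capacity is below the rack mass."""
    return [name for name, rating in ratings_kg.items() if rating < mass_kg]


# Hypothetical ratings for one route from truck to final position.
route = {
    "dock leveler": 2500.0,
    "freight elevator": 1200.0,
    "aisle ramp": 1800.0,
}

print(round(distributed_load_kg_m2(RACK_MASS_KG, FOOTPRINT_M2)))  # 1889
print(path_bottlenecks(RACK_MASS_KG, route))  # ['freight elevator']
```

In this made-up route the elevator, not the floor, is what strands the rack, which is why the whole path is checked rather than only the final position.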

Burn-in, validation & the goodput acceptance gate

The most expensive mistake in system integration is accepting a cluster on a power-on smoke test — "it boots, it pings, sign here." AI clusters fail in ways a smoke test never sees: a GPU that trains fine for an hour and then throttles on a thermal excursion, an HBM stack with marginal bit-error rates, an optic that flaps under load, a single mis-cabled link that quietly halves bisection bandwidth. The acceptance gate that catches these is goodput-oriented: a multi-day burn-in that drives the cluster at full power and measures whether it sustains its target goodput / MFU, not merely whether it powers on.

The empirical case for a long gate is overwhelming. New clusters fail far more than mature ones — the burn-in period runs 3–4 weeks before failure rates settle, and infant-mortality components (GPUs, HBM, optics) surface precisely under sustained thermal and electrical stress. Meta's Llama 3 405B run logged 419 unplanned interruptions over 54 days on 16,384 H100s — about one every three hours — with 78% hardware-caused and the majority GPU/HBM-related (Meta Llama 3 paper, 2024). A best-in-class mature H100 cluster still sees roughly one failure per 512 GPUs every ~7 days (SemiAnalysis, 2025). An acceptance gate that does not stress the cluster long enough to surface the infant-mortality tail is not a gate; it is a handshake that defers the failures into your production goodput.
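The arithmetic behind those two figures is short enough to check directly. The sketch below reproduces the "one every three hours" number from the Meta run and scales the mature-cluster rate to the same GPU count. Two assumptions are baked in: the mature figure is read as one failure per 512-GPU block per 7 days, and failures are assumed to scale linearly with GPU count. The function names are illustrative.

```python
def mean_hours_between(interruptions: int, days: float) -> float:
    """Average hours between interruptions over an observation window."""
    return days * 24.0 / interruptions


def expected_failures_per_day(
    gpus: int, block_size: int = 512, days_per_block_failure: float = 7.0
) -> float:
    """Failures per day, ASSUMING the per-block rate scales linearly."""
    return (gpus / block_size) / days_per_block_failure


# Meta Llama 3 405B run: 419 unplanned interruptions in 54 days.
meta_hours = mean_hours_between(419, 54)
print(f"Meta run: one interruption every {meta_hours:.1f} h")

# Mature best-in-class rate extrapolated to 16,384 GPUs.
mature_per_day = expected_failures_per_day(16_384)
print(f"Mature rate at 16,384 GPUs: {mature_per_day:.1f} failures/day")
print(f"That is one every {24.0 / mature_per_day:.2f} h")
```

Under these assumptions even a mature cluster of that size would expect several failures a day, so the gate is measuring how far above that floor a new cluster sits, not whether it fails at all.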

A defensible acceptance program therefore layers tests at each level: L10 node burn-in (thermal soak, memory test, per-GPU stress); L11 rack-level validation (leak test on the liquid loop, power-sequencing, in-rack link integrity, mis-cabling verification); and L12 cluster-level acceptance (collective-communication benchmarks like all-reduce bandwidth, a representative training run held to a target MFU, and a sustained multi-day goodput soak). The gate is contractual: it defines the number the integrator must hit, the duration the cluster must hold it, and the remedy if it does not. This connects directly to the formal commissioning levels in Part 13 — the integrated-systems and rack-scale acceptance machinery is built out in Chapter 13.1 and the cooling/electrical acceptance specifics in their respective chapters there.
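The three contractual terms (the number, the duration, and the remedy trigger) can be expressed as a simple check over a soak log. The sketch below assumes one MFU sample per hour and treats the gate as "an unbroken run at or above target for the required number of hours". The target value, the sample data, and the function names are hypothetical; a real contract would define its own sampling and reset rules.

```python
from typing import Sequence


def longest_run_at_target(hourly_mfu: Sequence[float], target: float) -> int:
    """Length, in samples, of the longest unbroken run at or above target."""
    best = run = 0
    for sample in hourly_mfu:
        run = run + 1 if sample >= target else 0
        best = max(best, run)
    return best


def gate_passes(
    hourly_mfu: Sequence[float], target: float, required_hours: int
) -> bool:
    """True if the cluster held the target for the contracted duration."""
    return longest_run_at_target(hourly_mfu, target) >= required_hours


# Hypothetical soak: 30 good hours, a one-hour dip, then 72 good hours.
soak = [0.41] * 30 + [0.30] + [0.41] * 72

print(gate_passes(soak, target=0.40, required_hours=72))  # True
print(gate_passes(soak, target=0.40, required_hours=96))  # False
```

The second call fails even though the log contains 102 good hours in total, because the dip resets the run. Whether a dip resets the clock is exactly the kind of term worth settling in the contract rather than on the install floor.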

Deep dive: what a goodput acceptance gate actually measures (and the failures it catches)

A goodput gate is not one test — it is a sequence designed so each layer catches a class of defect the layer below misses. Run them in order, because a fabric benchmark on a cluster with a thermally-marginal GPU just gives you a confusing number.

1. Component & node (L10). Per-GPU stress (compute + memory bandwidth at full TDP), HBM bit-error screening, and a thermal soak that holds the node at its power limit long enough to surface throttling. This is where the bulk of infant mortality — the faulty GPUs and marginal HBM stacks that dominated Meta's failure breakdown — is supposed to die before the rack is sealed.

2. Rack (L11). Liquid-loop leak test and pressure-hold; power-sequencing and busbar integrity; and the one that catches the most acceptance failures — cabling verification. A single transposed or under-seated link can pass a ping and still cripple collective bandwidth; automated link-map verification against the intended topology is the only reliable catch. The NVL72 packs thousands of in-rack copper NVLink cables, so the failure surface is large.

3. Cluster (L12). Collective benchmarks (all-reduce / all-gather bandwidth at scale, the operations a real training step is dominated by — see Chapter 8.2) to prove the fabric delivers its non-blocking promise; then a sustained, representative workload held to a target MFU for multiple days. Best-in-class operators target ~96% goodput against an industry average near ~90% (SemiAnalysis ClusterMAX, 2025); the gate decides which you are buying. The output is a benchmarked, signed-off cluster — the L12 deliverable — not a rack of servers that boots.
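The cabling check in step 2 amounts to a set difference between the intended link map and the one actually observed. The sketch below shows that comparison in its simplest form. The port naming scheme and the `diff_link_map` helper are hypothetical, and how the observed map is collected (switch neighbor tables, vendor tooling, or otherwise) is left out.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[str, str]


def normalize(links: Iterable[Link]) -> Set[frozenset]:
    """Treat each link as unordered so (a, b) and (b, a) compare equal."""
    return {frozenset(pair) for pair in links}


def diff_link_map(
    expected: Iterable[Link], observed: Iterable[Link]
) -> Tuple[List[Link], List[Link]]:
    """Return (missing, unexpected) links relative to the intended topology."""
    exp, obs = normalize(expected), normalize(observed)
    missing = sorted(tuple(sorted(link)) for link in exp - obs)
    unexpected = sorted(tuple(sorted(link)) for link in obs - exp)
    return missing, unexpected


# Hypothetical two-link example with the cables transposed at the switch.
expected = [("tray01/p0", "sw01/p0"), ("tray02/p0", "sw01/p1")]
observed = [("tray01/p0", "sw01/p1"), ("tray02/p0", "sw01/p0")]

missing, unexpected = diff_link_map(expected, observed)
print("missing:", missing)
print("unexpected:", unexpected)
```

A transposed pair shows up as two missing and two unexpected links even though every port is connected and would answer a ping, which is why the comparison is made against the topology rather than against link state.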

The real lead-time gate is upstream: CoWoS & HBM

It is tempting to plan the build around rack assembly, the visible, schedulable step. That is a mistake, because the binding constraint is not on the integration floor; it is two tiers upstream in advanced packaging. TSMC's CoWoS (Chip-on-Wafer-on-Substrate) capacity is the single most contended resource in the AI supply chain in 2026: backend packaging facilities are sold out through 2027 with 52–78 week lead times, and NVIDIA alone reportedly booked the bulk of available capacity, on the order of 800,000–850,000 wafers for 2026, even as TSMC raced CoWoS capacity from ~35k wafers/month (end 2024) toward a 125k–130k/month target by end 2026 (TSMC; SemiAnalysis; siliconanalysts, 2026). The companion gate is HBM: 2026 HBM3E output is sold out, the estimated supply gap is roughly 30%, and prices are escalating quarter on quarter (SemiAnalysis / TrendForce, 2026). HBM is the binding constraint on AI compute in its own right (Chapter 7.6); the packaging substrate that fuses it to logic is treated in Chapter 7.7.

The consequence for system integration is a sequencing rule: allocation, not assembly, sets your delivery date. A flawless L11/L12 integration line standing idle waiting for accelerators is the default failure mode of a build plan that scheduled against the wrong constraint. The operators who deploy fastest secure CoWoS/HBM allocation a year or more ahead, treat the accelerator delivery curve as the master schedule, and stage the powered shell, cooling plant, and integration capacity to be waiting on silicon rather than the reverse. Procurement and allocation strategy is the subject of Chapter 7.11.
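The sequencing rule can be written as a one-line schedule model: integration cannot start until both the silicon and the shell are ready, so the later of the two sets the date. The sketch below runs that model across the 52–78 week lead-time range from the text. The shell-ready week and the integration and acceptance durations are hypothetical placeholders, and the function names are illustrative.

```python
def go_live_week(
    accelerator_arrival_wk: float,
    shell_ready_wk: float,
    integration_wks: float,
    acceptance_wks: float,
) -> float:
    """Week the cluster starts producing, counted from the order date."""
    start = max(accelerator_arrival_wk, shell_ready_wk)
    return start + integration_wks + acceptance_wks


def shell_wait_weeks(accelerator_arrival_wk: float, shell_ready_wk: float) -> float:
    """Weeks the finished shell sits idle waiting for accelerators."""
    return max(0.0, accelerator_arrival_wk - shell_ready_wk)


SHELL_READY_WK = 40.0   # ASSUMPTION: powered shell complete at week 40
INTEGRATION_WKS = 4.0   # ASSUMPTION: L11/L12 integration duration
ACCEPTANCE_WKS = 4.0    # ASSUMPTION: burn-in and acceptance duration

for arrival in (52.0, 78.0):  # CoWoS lead-time range from the text
    live = go_live_week(arrival, SHELL_READY_WK, INTEGRATION_WKS, ACCEPTANCE_WKS)
    wait = shell_wait_weeks(arrival, SHELL_READY_WK)
    print(f"silicon at wk {arrival:.0f}: go-live wk {live:.0f}, shell waits {wait:.0f} wk")
```

With these placeholder numbers the 26-week swing in allocation passes straight through to the go-live date, while nothing done on the integration floor moves it at all.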

Figure	What it measures	Source
~1.36 t	GB200/GB300 NVL72 shipping weight (~3,000 lb in a ~48U footprint); the logistics-defining number	NVIDIA OCP / OEM datasheets (HPE, Lenovo, Supermicro), 2025
~135 kW	GB300 NVL72 rack TDP (to ~155 kW peak); ~90% heat to liquid, ~10% to air	Supermicro / Lenovo GB300 NVL72 datasheets, 2025
52–78 wk	CoWoS advanced-packaging lead time; backend lines sold out through 2027	SemiAnalysis / TSMC; siliconanalysts, 2026
~125–130k	TSMC CoWoS wafers/month target by end 2026 (up from ~35k end-2024, ~75k end-2025)	TSMC guidance / SemiAnalysis, 2026
419	Unplanned interruptions over 54 days on 16,384 H100s (~1 every 3 hr); 78% hardware-caused; the case for a long burn-in gate	Meta, Llama 3 405B paper, 2024
~7 days	Best-in-class H100 MTBF per 512 GPUs in a mature cluster; new clusters far worse (3–4 wk burn-in)	SemiAnalysis, 100k H100 clusters, 2025
~96% / ~90%	Best-in-class vs industry-average goodput; the number the L12 acceptance gate buys	SemiAnalysis ClusterMAX, 2025
~2.9 EF FP4	AMD Helios (72× MI450) rack-scale FP4 (1.4 EF FP8), 31 TB HBM4, ORW double-wide	AMD / NextPlatform / DCD, 2026

Deployment, commissioning & the operations handoff

The last act of integration is the handoff to operations — the moment the cluster stops being the integrator's project and becomes the operator's producing asset. A clean handoff is itself an acceptance gate: it transfers not just hardware but the documentation that makes the hardware operable — the as-built rack and link maps, the firmware/driver baseline, the burn-in and acceptance results, the asset-and-port inventory that feeds DCIM (Chapter 14.2), and the spares and RMA terms. Skip the documentation transfer and you have a cluster nobody can service without reverse-engineering it.

Spares, RMA, and serviceability are where the goodput thread closes the loop. At one failure every few days per cluster, mean-time-to-repair on a tray is not an operational footnote — it is a direct multiplier on effective availability and therefore on goodput. The serviceability decisions made at integration time govern it: front-serviceable trays vs racks you must pull from the aisle; blind-mate power and liquid quick-disconnects that let you swap a tray without draining the loop; an on-site spares depot sized to the fleet's failure rate rather than a vendor's standard SLA; and an RMA path whose turnaround you actually measured. The build-vs-buy fork resurfaces here: a turnkey buyer inherits the vendor's RMA org and SLA, while the OCP self-designer owns the spares pool and the repair logistics outright — another reason the entry decision is a multi-year operational commitment, not a one-time purchase. The reliability math behind why repair time dominates goodput is developed in Chapter 12.2; checkpointing, the software complement that bounds the cost of each failure, in Chapter 9.4.
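The "direct multiplier" claim follows from the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR). The sketch below applies it to a single 512-GPU block using the ~7-day failure interval quoted earlier in this chapter. It is a deliberate simplification: it assumes the whole block is out of service for the full repair time and ignores hot spares and checkpoint restarts. The two repair times are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


MTBF_HOURS = 7 * 24.0  # ~7 days per 512-GPU block, from the text

# Hypothetical repair postures; the hours are placeholders, not vendor SLAs.
postures = {
    "on-site spare, 4 h swap": 4.0,
    "depot RMA, 48 h turnaround": 48.0,
}

for label, mttr_hours in postures.items():
    print(f"{label}: {availability(MTBF_HOURS, mttr_hours):.1%}")
```

With the failure rate held fixed, the only lever in the formula is repair time, which is why the spares depot and the measured RMA turnaround belong in the integration contract.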

Integration level and serviceability are the same decision, seen twice

The entry point you chose at the top of this chapter (DGX → OCP self-design) and the serviceability posture you live with for the cluster's life are one decision observed at two points in time. Buy turnkey and you bought the vendor's RMA org, spares logistics, and SLA along with the rack; the margin you paid is the serviceability insurance. Self-design and you saved that margin but now own the spares depot, the repair turnaround, and the firmware baseline yourself — permanently. The operators who get burned are the ones who priced the integration decision on day-one unit cost and discovered the serviceability bill three months into production, when the goodput tax of a slow tray swap showed up in the numbers. Price the whole life of the asset, not the PO.

The rack as a physical integration unit (busbars, manifolds, anatomy) is Chapter 7.13; this chapter integrated it. The install velocity and cabling discipline that factory integration is designed to protect are Chapter 7.15. The upstream gates that set your delivery date: HBM in Chapter 7.6, advanced packaging in Chapter 7.7, and procurement/allocation strategy in Chapter 7.11. Scale-up fabric (NVLink vs UALink) behind the NVL72-vs-Helios fork is Chapter 8.2; software lock-in behind the DGX choice is Chapter 7.9. The floor-loading basis the 1.36 t rack demands is Chapter 6.7; formal commissioning and acceptance are built out in Chapter 13.1; the reliability and goodput math behind the acceptance gate is Chapter 12.2 with checkpointing in Chapter 9.4; and the DCIM asset/port handoff is Chapter 14.2.