Chapter 7.14
Server & System Integration
The rack does not arrive — it is integrated, and the level at which it is integrated (DGX appliance, HGX-OEM, ODM-direct, or OCP self-design) decides who owns the factory burn-in, who owns the acceptance gate, who owns the RMA, and ultimately how many days of stranded goodput sit between a powered shell and a producing cluster.
What you'll decide here
- The integration model — DGX/turnkey vs HGX-from-an-OEM vs ODM-direct vs OCP self-design — which sets your margin stack, your serviceability terms, and how much systems-integration risk you are insourcing.
- Factory (L11/L12) vs field integration — where the rack actually gets built and tested, and therefore whether you ship wet or dry, how you move a 1.5–3 t rack, and how much install-day risk you carry.
- The acceptance gate you will sign against — goodput/MFU and a multi-day burn-in, not a power-on smoke test — because the gate, not the datasheet, is what you are actually buying.
- Where your true lead-time gate is — CoWoS and HBM allocation upstream, not rack assembly downstream — so the build plan is sequenced against the constraint that actually slips.
- The spares, RMA, and serviceability posture you contract for, because at one failure every few days per cluster, mean-time-to-repair on a tray is a goodput line item, not an afterthought.
A modern AI rack is not a product you buy off a shelf and bolt to the floor. It is the output of a manufacturing pipeline that starts at silicon and ends at a benchmarked cluster, and the decision that most shapes cost and risk is where the operator enters it. Enter at the top, and a vendor hands you a turnkey, validated NVL72 with a warranty and a phone number. Enter at the bottom, and you are the systems integrator: you own the bill of materials, the firmware matrix, the burn-in scripts, the acceptance gate, and every tray that fails at 3 a.m. The two ends of that spectrum differ by double-digit points of gross margin, by months of time-to-goodput, and by who is liable when a cluster does not hit its MFU number. This chapter is about that entry decision and everything it cascades into.
This chapter starts with the L1–L12 manufacturing-level model — the shared vocabulary the industry uses to say who builds what — and the ODM / OEM / systems-integrator roles mapped onto it. We then walk the build-vs-buy fork (DGX vs HGX-OEM vs ODM-direct vs OCP self-design), the OCP Open Rack standards and the 2026 reference systems (HGX, MGX, GB200/GB300 NVL72, AMD Helios), the factory-vs-field integration question and the logistics of shipping a wet 1.5–3 t rack, the goodput-oriented acceptance gate, the CoWoS/HBM lead-time reality that gates the whole thing, and the commissioning handoff to operations. The rack as a physical integration unit is treated in Chapter 7.13; this chapter is about integrating it.
The L1–L12 manufacturing-level model
The industry talks about hardware integration in numbered "levels," and getting fluent in them is the precondition for every contract you sign. The scale runs from raw components to a benchmarked cluster, and the level at which a vendor delivers is exactly the line that separates "you bought a server" from "you bought a working AI factory." The numbering varies slightly by vendor, but the structure is stable: L1–L5 are component and PCB assembly (bare board, SMT placement, the GPU baseboard or UBB); L6 is the populated motherboard/baseboard; L10 is a fully assembled server that boots an OS; L11 is a fully cabled rack — compute trays, NVSwitch trays, busbar, manifolds, top-of-rack switches, in-rack network and power cabling, tested as a unit; and L12 is a multi-rack cluster with cross-rack cabling, the customer's software loaded, and rack-scale benchmarks run to prove the thing actually performs (DCD; AMAX; Hyperscalers, 2025).
The classic division of labor: ODMs (Foxconn/Hon Hai, Quanta/QCT, Wistron/Wiwynn, Inventec, Supermicro on its ODM side) own roughly L1–L6 and increasingly push up into L10–L11; OEMs (Dell, HPE, Lenovo, Supermicro on its brand side) take L6–L10 and add brand, warranty, supply assurance, and a global service organization; the systems integrator — which can be the OEM, a specialist, or the operator itself — owns L11–L12, the part where a pile of validated servers becomes a producing cluster. The strategic point: the L11/L12 boundary is where time-to-goodput is won or lost. Whoever owns it owns the burn-in, the acceptance gate, and the install-day risk.
| Level | What it produces | Typical owner | What you are buying | Where the risk sits |
|---|---|---|---|---|
| L1–L5 | Bare PCB → SMT-populated board → GPU baseboard (UBB/SXM) | ODM / contract manufacturer | Components and sub-assemblies | Yield, HBM/CoWoS supply |
| L6 | Populated motherboard / GPU baseboard | ODM (handed to OEM) | A tested board | Firmware, board-level defects |
| L10 | Fully assembled server that boots an OS | OEM / ODM | A working node | Node burn-in, thermal validation |
| L11 | Fully cabled, tested rack (trays, busbar, manifolds, ToR, cabling) | OEM / systems integrator | A deployable rack | Mis-cabling, leak test, rack-level burn-in |
| L12 | Multi-rack cluster, cross-rack cabling, software loaded, benchmarked | Systems integrator / operator | A producing cluster | Goodput/MFU acceptance, fabric validation |
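The table can also be read as a lookup: given the level at which a vendor delivers, everything above that level is integration work the buyer still has to do or contract separately. The sketch below is a minimal illustration of that reading. The `Level` class and `levels_left_to_buyer` function are hypothetical names introduced here, and the row text is abbreviated from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str           # level label as used in the table
    produces: str       # what comes out of this stage
    typical_owner: str  # who usually performs it


# Rows abbreviated from the table above, in build order.
LEVELS = [
    Level("L1-L5", "GPU baseboard and sub-assemblies", "ODM / contract manufacturer"),
    Level("L6", "Populated motherboard / GPU baseboard", "ODM"),
    Level("L10", "Assembled server that boots an OS", "OEM / ODM"),
    Level("L11", "Fully cabled, tested rack", "OEM / systems integrator"),
    Level("L12", "Benchmarked multi-rack cluster", "Systems integrator / operator"),
]


def levels_left_to_buyer(delivered_at: str) -> list[str]:
    """Return the levels above the vendor's delivery level.

    Everything in the returned list is integration work the buyer
    must perform or contract for separately.
    """
    names = [level.name for level in LEVELS]
    return names[names.index(delivered_at) + 1:]


print(levels_left_to_buyer("L10"))  # ['L11', 'L12']
print(levels_left_to_buyer("L12"))  # []
```

A vendor delivering at L10 leaves the buyer holding both the rack and cluster integration steps; a vendor delivering at L12 leaves nothing, which is the "working AI factory" end of the scale.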
Build vs buy: the four entry points
Map the entry decision onto four archetypes, ordered from most-bought to most-built. Each trades margin paid against integration risk insourced and control gained. There is no universally right answer — the right cell is a function of your scale, your engineering depth, and how much of the systems-integration burden you can actually carry.
DGX / turnkey (NVIDIA DGX, the GB-series "NVL72" sold as a system). You buy a fully-integrated, factory-validated, single-throat-to-choke supercomputer with NVIDIA's software stack, reference fabric, and warranty. Highest unit cost, lowest integration risk, fastest path to a known-good cluster — and the deepest lock-in (Chapter 7.9). This is the right call for an enterprise standing up its first cluster or anyone who values a single accountable vendor over unit economics.
HGX-from-an-OEM (Dell, HPE, Lenovo, Supermicro building on the NVIDIA HGX/MGX baseboard). The middle path and the volume of the market. NVIDIA sells the HGX 8-GPU baseboard (or the MGX modular rack reference); the OEM does L6–L11 integration, adds its own chassis, thermals, BMC, service, and supply assurance. You get brand-name support and a global RMA org while escaping the full DGX premium. The cost: you inherit the OEM's firmware/validation cadence and pay an integration margin the ODM-direct buyer skips.
ODM-direct (buying L10/L11 straight from Quanta, Wiwynn, Foxconn, Supermicro's ODM arm). You strip out the OEM brand margin and contract the integrator directly, often to your own spec. Lower unit cost, more control over BOM and firmware — but you are now closer to owning the acceptance gate and the RMA logistics yourself. This is the hyperscaler and large-neocloud default once volume justifies an in-house hardware team.
OCP self-design (you specify the rack against Open Compute standards and have ODMs build to it). Maximum control, lowest unit cost at scale, no brand margin at all — and you are the systems integrator. You own the design, the BOM, the firmware matrix, the burn-in scripts, the acceptance criteria, and every serviceability decision. Only justified at hyperscale, where a point of efficiency across hundreds of thousands of GPUs dwarfs the cost of an in-house infrastructure org. Meta, Microsoft, Google, and Amazon live here.
| Entry point | Who integrates L11/L12 | Relative unit cost | Integration risk you own | Best-fit buyer |
|---|---|---|---|---|
| DGX / turnkey NVL72 | Vendor (factory-validated) | Highest (full premium) | Minimal — vendor owns the gate | First cluster; single-vendor accountability |
| HGX-from-an-OEM | OEM (Dell/HPE/Lenovo/SMCI) | High (brand + integration margin) | Low — OEM warranty & RMA | Enterprise/mid-scale wanting brand support |
| ODM-direct | ODM, to your spec | Low (no brand margin) | Moderate — you co-own acceptance | Large neoclouds; in-house HW team |
| OCP self-design | You (the operator) | Lowest at scale | Full — you are the integrator | Hyperscalers; fleets >100k GPUs |
OCP, Open Rack & the 2026 reference systems
The Open Compute Project is the standards substrate that makes ODM-direct and self-design viable: it turns proprietary rack designs into shared, multi-vendor specifications so an operator can second-source the same rack from Quanta, Wiwynn, or Foxconn instead of being captive to one builder. The relevant standards for AI in 2026 are the Open Rack family. ORV3 (Open Rack v3, Meta-led, published 2022) carries forward Open Rack's 21-inch equipment bay and standardizes a vertical DC busbar, blind-mate power, native 48 V distribution, and provisions for direct liquid cooling; it is the form factor most current high-density AI racks descend from (OCP; Introl, 2025). At OCP 2025, Meta introduced Open Rack Wide (ORW), a double-wide standard explicitly designed for the power, cooling, and serviceability demands of next-generation rack-scale AI, and the spec AMD's Helios is built on (OCP / Meta; AMD, 2025).
The reference systems are where these standards meet silicon. HGX is NVIDIA's 8-GPU baseboard reference — the building block OEMs integrate into air- or liquid-cooled servers; it is the inference and small-training workhorse. MGX is NVIDIA's modular rack-level reference that lets partners mix CPUs, GPUs, and DPUs into standardized rack designs. GB200/GB300 NVL72 is the rack-as-the-unit: 72 Blackwell (or Blackwell Ultra) GPUs and 36 Grace CPUs fused into a single ~1.36 t, liquid-cooled, ~120–135 kW NVLink domain — the densest tightly-coupled training/inference unit in volume in 2026. AMD Helios is the open challenger: an ORW double-wide rack carrying up to 72 MI450-series GPUs, ~1.4 EF FP8 / 2.9 EF FP4 and 31 TB HBM4 at rack scale, compliant with OCP, UALink, and Ultra Ethernet — the open-standards answer to a single-vendor NVL72 (AMD; NextPlatform; DCD, 2025–2026).
| System | Unit of integration | Accelerators | Power / weight | Fabric & standards posture |
|---|---|---|---|---|
| NVIDIA HGX (B200/B300) | 8-GPU server baseboard | 8 Blackwell/Ultra | ~30–60 kW/rack (air or liquid) | NVLink in-board; vendor-proprietary |
| NVIDIA MGX | Modular rack reference | Mix-and-match GPU/CPU/DPU | Density by configuration | NVLink/NVSwitch; NVIDIA reference |
| GB200 NVL72 | The rack (72-GPU NVLink domain) | 72 Blackwell + 36 Grace | ~120–132 kW, ~1.36 t | NVLink5 (130 TB/s rack); proprietary |
| GB300 NVL72 | The rack (Blackwell Ultra) | 72 Blackwell Ultra + 36 Grace | ~135 kW TDP (to ~155 kW peak), ~1.36 t; ~90% liquid / ~10% air cooled | NVLink5; proprietary |
| AMD Helios (ORW) | Double-wide rack | Up to 72 MI450-series | ORW double-wide; weight spread across 2 bays | UALink + Ultra Ethernet; OCP-open |
Read the last column as the real strategic axis. NVL72 is a vertically-integrated, single-vendor unit: you get a validated NVLink domain and a proprietary scale-up fabric, and you accept the lock-in. Helios is the open bet: a double-wide ORW rack on UALink and Ultra Ethernet, multi-sourceable through OCP, with the explicit design goal of spreading weight and improving serviceability by going wide rather than tall. The double-wide move is not cosmetic — it directly attacks the floor-loading and field-serviceability problems that a 1.36 t single-bay NVL72 creates, which is the next section's subject. The scale-up fabric choices behind NVLink vs UALink are treated in Chapter 8.2; the merchant-vs-captive silicon framing in Chapter 7.1.
Factory vs field integration: where the rack gets built
Once you know who integrates, the next fork is where: is the rack built and tested at the factory (L11/L12 done before it ships) or assembled in the field at your site? This is the central velocity decision of the deployment, and it pivots on a hard physical fact: a populated NVL72 weighs roughly 1.36 t (~3,000 lb), concentrated on a single-bay footprint in a ~48U-tall rack, and is plumbed with ~200 L of coolant and thousands of in-rack cables.
Factory integration (ship the rack whole) is the 2026 default for dense liquid-cooled systems precisely because mis-cabling and leak risk are too high to absorb on the install floor. The integrator assembles trays, busbar, manifolds, ToR switches, and in-rack cabling in a controlled environment, runs rack-level burn-in, and ships a tested unit. NVIDIA's rack-scale partners explicitly factory-integrate the liquid loop and re-test at the rack level so the rack can be deployed directly at the customer site. The cost is logistics: you are now moving a 1.36 t object, and the question becomes whether it ships wet (coolant already in the loop, factory-tested as-shipped) or dry (drained for transit, then filled and leak-tested in the field). Shipping wet preserves the factory test state and shaves field commissioning time but adds weight, freeze/spill risk, and stricter handling; shipping dry is lighter and safer in transit but reintroduces a fill-and-leak-test step on the critical path. Most high-density racks ship dry-of-coolant for transit and are filled on site, with the factory loop integrity certified separately — but the choice is contractual and worth pinning down explicitly.
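To put a rough number on the weight side of the wet-versus-dry choice, the sketch below computes the mass of the coolant charge. The ~200 L volume and ~1.36 t rack figure come from the text; the coolant density is an assumption (a value typical of a water-glycol mix), and the text does not say whether the quoted rack mass already includes coolant, so the result is shown only as a ratio to that figure.

```python
RACK_MASS_KG = 1360.0      # ~1.36 t, quoted in the text
COOLANT_VOLUME_L = 200.0   # ~200 L, quoted in the text
# ASSUMPTION: density of a water-glycol mix, in kg per litre.
COOLANT_DENSITY_KG_PER_L = 1.03


def coolant_mass_kg(volume_l: float,
                    density_kg_per_l: float = COOLANT_DENSITY_KG_PER_L) -> float:
    """Mass of the coolant charge that a wet shipment carries."""
    return volume_l * density_kg_per_l


wet_delta = coolant_mass_kg(COOLANT_VOLUME_L)
share = wet_delta / RACK_MASS_KG
print(f"coolant mass: {wet_delta:.0f} kg ({share:.0%} of the quoted rack mass)")
```

Under these assumptions the coolant is on the order of 200 kg, which is large enough to matter for lift equipment and floor-loading checks and is one reason the wet-or-dry term belongs in the contract.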
Field integration (build the rack on site) — populating an empty rack with trays and cabling it in the data hall — survives only for lower-density, air-cooled, or 19-inch-EIA configurations where the weight and cabling risk are manageable. For NVL72-class systems it is an anti-pattern: you are doing precision liquid plumbing and thousands of cable terminations in an uncontrolled environment, against an install clock, with mis-cabling as the dominant acceptance failure (the velocity and cabling discipline this demands is the whole subject of Chapter 7.15).
Burn-in, validation & the goodput acceptance gate
The most expensive mistake in system integration is accepting a cluster on a power-on smoke test — "it boots, it pings, sign here." AI clusters fail in ways a smoke test never sees: a GPU that trains fine for an hour and then throttles on a thermal excursion, an HBM stack with marginal bit-error rates, an optic that flaps under load, a single mis-cabled link that quietly halves bisection bandwidth. The acceptance gate that catches these is goodput-oriented: a multi-day burn-in that drives the cluster at full power and measures whether it sustains its target goodput / MFU, not merely whether it powers on.
The empirical case for a long gate is overwhelming. New clusters fail far more often than mature ones: failure rates take roughly 3–4 weeks of burn-in to settle, and infant-mortality components (GPUs, HBM, optics) surface precisely under sustained thermal and electrical stress. Meta's Llama 3 405B run logged 419 unplanned interruptions over 54 days on 16,384 H100s, about one every three hours, with roughly 78% attributed to confirmed or suspected hardware issues and the majority GPU/HBM-related (Meta Llama 3 paper, 2024). A best-in-class mature H100 cluster still sees roughly one failure per 512 GPUs every ~7 days (SemiAnalysis, 2025). An acceptance gate that does not stress the cluster long enough to surface the infant-mortality tail is not a gate; it is a handshake that defers the failures into your production goodput.
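The two failure-rate figures above can be put on a common footing with a few lines of arithmetic. The sketch below uses only the numbers quoted in the paragraph; the function names are hypothetical, and scaling the mature-cluster rate linearly with GPU count is a simplifying assumption.

```python
HOURS_PER_DAY = 24


def mean_hours_between_interruptions(interruptions: int, days: float) -> float:
    """Average gap between interruptions over an observation window."""
    return days * HOURS_PER_DAY / interruptions


def expected_failures_per_day(gpus: int,
                              gpus_per_failure: int = 512,
                              days_per_failure: float = 7.0) -> float:
    """Scale 'one failure per N GPUs every D days' to a cluster size.

    ASSUMPTION: failures scale linearly with GPU count.
    """
    return gpus / gpus_per_failure / days_per_failure


llama_gap_h = mean_hours_between_interruptions(419, 54)
mature_per_day = expected_failures_per_day(16_384)

print(f"Llama 3 run: one interruption every {llama_gap_h:.1f} h")
print(f"Mature-cluster rate at 16,384 GPUs: {mature_per_day:.1f} failures/day")
```

The first figure reproduces the "about one every three hours" in the text (1,296 hours divided by 419). The second shows that even at the best-in-class mature rate, a cluster of the same size would still expect several failures a day, so the gate should be screening for an elevated early rate rather than for zero failures.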
A defensible acceptance program therefore layers tests at each level: L10 node burn-in (thermal soak, memory test, per-GPU stress); L11 rack-level validation (leak test on the liquid loop, power-sequencing, in-rack link integrity, mis-cabling verification); and L12 cluster-level acceptance (collective-communication benchmarks like all-reduce bandwidth, a representative training run held to a target MFU, and a sustained multi-day goodput soak). The gate is contractual: it defines the number the integrator must hit, the duration the cluster must hold it, and the remedy if it does not. This connects directly to the formal commissioning levels in Part 13 — the integrated-systems and rack-scale acceptance machinery is built out in Chapter 13.1 and the cooling/electrical acceptance specifics in their respective chapters there.
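Because the gate is defined by a number and a duration, its pass/fail logic can be stated precisely. The sketch below is one possible formulation, not a standard: it assumes goodput is sampled hourly and that the contract requires the target to be held for a continuous run of hours. The `Gate` class, its fields, and the sample values are all hypothetical, and the contractual remedy is not modelled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Gate:
    target: float    # minimum acceptable goodput fraction (hypothetical value)
    hold_hours: int  # consecutive hours the target must be held


def longest_run_at_target(samples: Sequence[float], target: float) -> int:
    """Length of the longest unbroken run of samples at or above target."""
    best = run = 0
    for sample in samples:
        run = run + 1 if sample >= target else 0
        best = max(best, run)
    return best


def passes(gate: Gate, hourly_goodput: Sequence[float]) -> bool:
    """True if the cluster held the target for the required duration."""
    return longest_run_at_target(hourly_goodput, gate.target) >= gate.hold_hours


gate = Gate(target=0.90, hold_hours=3)
print(passes(gate, [0.95, 0.91, 0.92]))        # True
print(passes(gate, [0.95, 0.85, 0.95, 0.95]))  # False: the dip resets the run
```

Requiring a continuous run rather than an average is a deliberate choice in this sketch: an average would let a long healthy stretch hide a thermal excursion or a flapping optic, which are exactly the defects the gate exists to surface.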
Deep dive: what a goodput acceptance gate actually measures (and the failures it catches)
A goodput gate is not one test — it is a sequence designed so each layer catches a class of defect the layer below misses. Run them in order, because a fabric benchmark on a cluster with a thermally-marginal GPU just gives you a confusing number.
1. Component & node (L10). Per-GPU stress (compute + memory bandwidth at full TDP), HBM bit-error screening, and a thermal soak that holds the node at its power limit long enough to surface throttling. This is where the bulk of infant mortality — the faulty GPUs and marginal HBM stacks that dominated Meta's failure breakdown — is supposed to die before the rack is sealed.
2. Rack (L11). Liquid-loop leak test and pressure-hold; power-sequencing and busbar integrity; and the one that catches the most acceptance failures — cabling verification. A single transposed or under-seated link can pass a ping and still cripple collective bandwidth; automated link-map verification against the intended topology is the only reliable catch. The NVL72 packs thousands of in-rack copper NVLink cables, so the failure surface is large.
3. Cluster (L12). Collective benchmarks (all-reduce / all-gather bandwidth at scale, the operations a real training step is dominated by — see Chapter 8.2) to prove the fabric delivers its non-blocking promise; then a sustained, representative workload held to a target MFU for multiple days. Best-in-class operators target ~96% goodput against an industry average near ~90% (SemiAnalysis ClusterMAX, 2025); the gate decides which you are buying. The output is a benchmarked, signed-off cluster — the L12 deliverable — not a rack of servers that boots.
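The cabling check in step 2 reduces to a set comparison between the intended topology and the links actually discovered. The sketch below is a minimal illustration under stated assumptions: links are undirected pairs of port names, the port-naming scheme is hypothetical, and how the observed links are collected (for example from switch or fabric telemetry) is out of scope.

```python
from typing import Iterable

Pair = tuple[str, str]


def normalize(links: Iterable[Pair]) -> set[frozenset[str]]:
    """Treat each link as undirected: (a, b) and (b, a) are the same cable."""
    return {frozenset(pair) for pair in links}


def diff_link_map(expected: Iterable[Pair],
                  observed: Iterable[Pair]) -> dict[str, set[frozenset[str]]]:
    """Compare the as-designed link map with the as-built one."""
    exp, obs = normalize(expected), normalize(observed)
    return {
        "missing": exp - obs,     # designed links that were not found
        "unexpected": obs - exp,  # found links that were not designed
    }


# Hypothetical port names; two cables have been swapped at the switch end.
expected = [("tray01:p0", "switch01:p0"), ("tray02:p0", "switch01:p1")]
observed = [("tray01:p0", "switch01:p1"), ("tray02:p0", "switch01:p0")]

result = diff_link_map(expected, observed)
print(len(result["missing"]), len(result["unexpected"]))  # 2 2
```

In the transposed case both trays still reach the switch, so a connectivity test passes; only the comparison against the intended map reports the two missing and two unexpected links.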
The real lead-time gate is upstream: CoWoS & HBM
It is tempting to plan the build around rack assembly, the visible, schedulable step. That is a mistake, because the binding constraint is not on the integration floor; it is two tiers upstream in advanced packaging. TSMC's CoWoS (Chip-on-Wafer-on-Substrate) capacity is the single most contended resource in the AI supply chain in 2026: backend packaging facilities have run sold out through 2027 with 52–78 week lead times, and NVIDIA alone reportedly booked the bulk of available capacity, roughly 800,000–850,000 wafers for 2026, even as TSMC raced CoWoS capacity from ~35k wafers/month (end 2024) toward a ~125k–130k/month target by end 2026 (TSMC; SemiAnalysis; siliconanalysts, 2026). The companion gate is HBM: 2026 HBM3E output is sold out, the estimated supply gap is around 30%, and prices are escalating quarter on quarter (SemiAnalysis / TrendForce, 2026). HBM is the binding constraint on AI compute in its own right (Chapter 7.6); the packaging substrate that fuses it to logic is treated in Chapter 7.7.
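As a rough consistency check on "the bulk of available capacity", the sketch below estimates total 2026 CoWoS output from the monthly figures quoted above and compares the reported booking against it. The linear month-by-month ramp between end 2024 and end 2026 is an assumption made purely for illustration; the real ramp profile is not given in the text.

```python
def monthly_capacity_k(month: int, start_k: float = 35.0,
                       end_k: float = 130.0, ramp_months: int = 24) -> float:
    """Capacity in thousand wafers/month, `month` months after end 2024.

    ASSUMPTION: capacity ramps linearly from start_k to end_k.
    """
    return start_k + (end_k - start_k) * month / ramp_months


def capacity_2026_k(end_k: float = 130.0) -> float:
    """Total 2026 output: months 13 through 24 after end 2024."""
    return sum(monthly_capacity_k(m, end_k=end_k) for m in range(13, 25))


def booked_share(booked_k: float, end_k: float) -> float:
    """Fraction of estimated 2026 output represented by a booking."""
    return booked_k / capacity_2026_k(end_k)


low = booked_share(800.0, end_k=130.0)   # smallest booking, largest capacity
high = booked_share(850.0, end_k=125.0)  # largest booking, smallest capacity
print(f"estimated share of 2026 output: {low:.0%} to {high:.0%}")
```

Under the linear-ramp assumption the reported booking comes out at roughly 60 to 70 percent of the year's output, which is consistent with the text's characterization.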
The consequence for system integration is a sequencing rule: allocation, not assembly, sets your delivery date. A flawless L11/L12 integration line standing idle waiting for accelerators is the default failure mode of a build plan that scheduled against the wrong constraint. The operators who deploy fastest secure CoWoS/HBM allocation a year or more ahead, treat the accelerator delivery curve as the master schedule, and stage the powered shell, cooling plant, and integration capacity to be waiting on silicon rather than the reverse. Procurement and allocation strategy is the subject of Chapter 7.11.
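The sequencing rule amounts to a critical-path calculation over parallel workstreams: the cluster can start integrating only when the slowest track is ready, and every other track's early finish is idle time. The sketch below illustrates this. Only the 52-week accelerator figure is taken from the text (the low end of the quoted 52–78 week range); the other durations and all names are hypothetical.

```python
def gating_track(ready_week: dict[str, int]) -> tuple[str, int]:
    """Return the track that finishes last and the week it finishes."""
    name = max(ready_week, key=ready_week.get)
    return name, ready_week[name]


def idle_weeks(ready_week: dict[str, int]) -> dict[str, int]:
    """Weeks each track spends waiting on the gating track."""
    _, finish = gating_track(ready_week)
    return {name: finish - week for name, week in ready_week.items()}


# Weeks from project start until each track is ready.
tracks = {
    "accelerators": 52,      # low end of the quoted CoWoS lead time
    "powered_shell": 40,     # hypothetical
    "cooling_plant": 36,     # hypothetical
    "integration_line": 12,  # hypothetical
}

print(gating_track(tracks))  # ('accelerators', 52)
print(idle_weeks(tracks))
```

In this example the integration line would sit idle for 40 weeks if it were started at project kickoff, which is the argument for scheduling every other track backward from the accelerator delivery curve.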
Deployment, commissioning & the operations handoff
The last act of integration is the handoff to operations — the moment the cluster stops being the integrator's project and becomes the operator's producing asset. A clean handoff is itself an acceptance gate: it transfers not just hardware but the documentation that makes the hardware operable — the as-built rack and link maps, the firmware/driver baseline, the burn-in and acceptance results, the asset-and-port inventory that feeds DCIM (Chapter 14.2), and the spares and RMA terms. Skip the documentation transfer and you have a cluster nobody can service without reverse-engineering it.
Spares, RMA, and serviceability are where the goodput thread closes the loop. At one failure every few days per cluster, mean-time-to-repair on a tray is not an operational footnote — it is a direct multiplier on effective availability and therefore on goodput. The serviceability decisions made at integration time govern it: front-serviceable trays vs racks you must pull from the aisle; blind-mate power and liquid quick-disconnects that let you swap a tray without draining the loop; an on-site spares depot sized to the fleet's failure rate rather than a vendor's standard SLA; and an RMA path whose turnaround you actually measured. The build-vs-buy fork resurfaces here: a turnkey buyer inherits the vendor's RMA org and SLA, while the OCP self-designer owns the spares pool and the repair logistics outright — another reason the entry decision is a multi-year operational commitment, not a one-time purchase. The reliability math behind why repair time dominates goodput is developed in Chapter 12.2; checkpointing, the software complement that bounds the cost of each failure, in Chapter 9.4.
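The claim that repair time multiplies directly into availability follows from the standard steady-state relation, availability = MTBF / (MTBF + MTTR). The sketch below applies it under simplifying assumptions: "one failure every few days" is taken as a 72-hour mean time between failures, each failure is assumed to stop the job until the repair completes, and the repair times are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the job is running."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# ASSUMPTION: "one failure every few days" taken as one every 3 days.
MTBF_HOURS = 72.0

# Hypothetical repair times: hot swap from on-site spares, same-shift
# repair, and next-day vendor dispatch.
for mttr in (1.0, 8.0, 24.0):
    print(f"MTTR {mttr:>4.0f} h -> availability {availability(MTBF_HOURS, mttr):.1%}")
```

With the failure rate held fixed, moving from a one-hour swap to a next-day repair takes availability from roughly 99% to 75% in this example, which is why tray serviceability and spares placement are treated here as goodput decisions.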