Chapter 13.1
Commissioning Fundamentals, Levels & Program Governance
Commissioning is where the building's design intent is converted into evidence. For an AI factory that evidence must span two parallel, interlocked tracks (facility and cluster): either you sequence their acceptance gates deliberately, or the schedule sequences them for you, badly and at the worst possible time.
What you'll decide here
- How far up the L1–L5 ladder you actually intend to test — and specifically whether you commit to a true Level 5 Integrated Systems Test (IST) with a black-building / pull-the-plug demonstration, or stop at functional performance and accept the residual integration risk.
- Which resilience model the IST must prove — concurrent maintainability (Tier III: take any one path down for service with no IT load loss) versus fault tolerance (Tier IV: survive an unplanned single fault, including a fire or flood in one path) — because that choice writes the failure-mode test matrix.
- How the two parallel tracks — facility Cx and IT/cluster validation — interlock, and where the explicit overlapping gates sit (mechanical-Cx ↔ GPU burn-in; electrical acceptance ↔ load-bank IST ↔ first real workload).
- Who owns the program: an independent Commissioning Authority (CxA) engaged at design, not a general-contractor afterthought — and which governing documents (OPR, BOD, SOO) the whole acceptance chain traces back to.
- Which standards spine you commission against — ASHRAE Guideline 0 / 1.1 / 1.6 for process and documentation, Uptime Tier or BICSI 002 for the resilience target — and how you reconcile the facility-centric standards with a cluster the standards do not yet describe.
Commissioning (Cx) is the discipline that refuses to take the design's word for it. Every prior part of this guide makes claims: that the power chain will ride through a utility dip, that the cold plates will hold the GPUs under sustained load, that the fabric will carry a non-blocking all-reduce, that the standby plant will pick up the building before the UPS runs flat. Commissioning is the structured process of generating the evidence that those claims are true under the conditions that matter, before the facility carries revenue load. It is the bridge between Part 2's design and Part 14's day-2 operations, and it is the last point in the project where a latent integration defect costs hours of rework rather than a multi-day outage on a live cluster.
For a conventional enterprise data center this is a mature, well-codified ritual. For an AI factory it is not — and the gap is the subject of this entire Part. The reason is structural: an AI facility is two machines pretending to be one. There is the facility — substation, switchgear, generators, UPS/BESS, CDUs, chillers, piping, BMS — which the commissioning industry knows how to accept. And there is the cluster — tens of thousands of GPUs, the scale-up and scale-out fabric, storage, and the scheduler — which the facility commissioning standards barely acknowledge exists. The two tracks run in parallel, share critical resources (power and cooling), and must be interlocked at specific gates, or each track signs off in isolation against a load the other half cannot actually deliver. This chapter establishes the fundamentals — the levels, the two tracks, the roles, and the governing documents — that the rest of Part 13 builds on.
The L1–L5 ladder
Mission-critical commissioning is organized as a ladder of increasing integration, conventionally Level 1 through Level 5 (with some programs adding a Level 0 design-review stage and a Level 6 post-occupancy / seasonal stage). Each rung tests a wider boundary than the last; you do not climb to the next rung until the current one is signed and its deficiencies closed. The discipline of the ladder is that integration faults are found at the lowest level at which they can possibly appear: a mis-wired current transformer (CT) caught at an L3 standalone test is a morning's rework; the same fault discovered at L5 during a black-building test invalidates the run and resets days of scripted work.
The names below are the most common form; vendors and owners vary the labels, but the boundary each rung tests is stable. The two parallel tracks (next section) each have their own analogue of this ladder, and Chapter 13.2 turns each rung into a quantitative script with explicit pass/fail gates.
| Level | Name | Boundary tested | Where it happens | Cluster-track analogue |
|---|---|---|---|---|
| L1 | Factory Acceptance Test (FAT) / factory witness | A single piece of equipment meets spec before it ships | Vendor factory / integrator floor | Rack/node integration test & firmware baseline at the integrator |
| L2 | Site Acceptance / installation verification (SAT) | Equipment arrived undamaged and is installed correctly | On site, pre-energization | Inventory, cabling/optics verification, link-up checks |
| L3 | Pre-functional / standalone (PFT) | One system works on its own, energized | On site, per-system | Single-node burn-in & DCGM diagnostics (13.8) |
| L4 | Functional performance (FPT) | One system performs across its operating envelope, including its own failure modes | On site, per-discipline | Point-to-point fabric BW/latency, single-rail validation (13.7) |
| L5 | Integrated Systems Test (IST) | All systems behave correctly together under load and fault — the building as one machine | On site, whole-facility | Cluster-scale NCCL/collective acceptance + proxy training run (13.9) |
Two things about the ladder are easy to get wrong and expensive to recover from. First, L4 already includes failure-mode testing within a discipline — a UPS must demonstrate its own transfer and its battery ride-through at L4, not wait for L5. L5 then tests failure modes across disciplines, where the interesting cascades live (a generator that starts fine and a cooling plant that rides through, but a BMS sequence that does not hand the load off in the right order). Second, the temptation under schedule pressure is to truncate the ladder — to skip a real L5 and call functional testing 'good enough.' That truncation pushes the integration risk onto day two.
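The rung-by-rung rule (no climbing until the current level is signed and its deficiencies are closed) can be sketched as a small gate check. This is a minimal illustration, not a prescribed tool: the `LevelRecord` structure, the function names, and the sample deficiency text are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

LEVELS = ["L1", "L2", "L3", "L4", "L5"]  # ladder order from the table above


@dataclass
class LevelRecord:
    """Hypothetical sign-off record for one rung of the ladder."""
    signed: bool = False
    open_deficiencies: list = field(default_factory=list)

    def is_closed(self) -> bool:
        # A rung counts as closed only when signed AND free of open deficiencies.
        return self.signed and not self.open_deficiencies


def lowest_open_level(records: dict) -> Optional[str]:
    """Return the lowest rung that is not closed, or None if all are closed."""
    for level in LEVELS:
        if not records.get(level, LevelRecord()).is_closed():
            return level
    return None


def may_start(level: str, records: dict) -> bool:
    """A rung may start only when every rung below it is closed."""
    below = LEVELS[:LEVELS.index(level)]
    return all(records.get(name, LevelRecord()).is_closed() for name in below)


records = {
    "L1": LevelRecord(signed=True),
    "L2": LevelRecord(signed=True),
    "L3": LevelRecord(signed=True, open_deficiencies=["CT polarity reversed"]),
    "L4": LevelRecord(signed=True),  # signed on paper, but L3 is still open
}
print(lowest_open_level(records))  # L3
print(may_start("L4", records))    # False
```

In this sketch a signature alone does not close a rung: the open L3 deficiency blocks L4 even though L4 carries a signature, which is the behavior the ladder is meant to enforce.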
Concurrent maintainability vs fault tolerance
The L5 test matrix is dictated by the resilience claim the facility was designed to meet, and the two claims that matter are the Uptime Institute's concurrent maintainability (Tier III) and fault tolerance (Tier IV). They sound similar and they are not, and commissioning is where the difference becomes physical.
Concurrent maintainability (Tier III) means any single capacity component or distribution path can be removed from service for planned maintenance with no impact on the IT load. The IST must therefore prove that you can take down each path in turn, one at a time, deliberately, and the load never notices. It does not require surviving an unplanned fault. Fault tolerance (Tier IV) is the stronger claim: the facility survives any single unplanned worst-case failure, including the loss of an entire distribution path to fire or flood, with no load impact, while still being concurrently maintainable. Tier IV adds compartmentalization and continuous cooling requirements that Tier III does not, and the IST must demonstrate the automatic response to a fault, not just an orderly manual switchover.
The consequence for the commissioning program is direct: the resilience target you committed to in design writes your failure-mode demonstration script. A Tier III program scripts orderly, one-path-at-a-time maintenance scenarios. A Tier IV program scripts unannounced faults, the loss of an entire path compartment, and continuous-cooling ride-through, which makes for a materially larger, riskier, longer IST. Choosing the tier is a design and capital decision (Chapter 12.1) that must be settled long before the IST; commissioning is where you discover whether the building actually earns it.
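How the tier choice writes the test matrix can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name, the row fields, and the path labels are hypothetical, and a real IST script would carry far more detail per scenario.

```python
def failure_mode_matrix(tier: str, paths: list) -> list:
    """Build an illustrative L5 scenario list from the resilience target."""
    if tier not in ("III", "IV"):
        raise ValueError(f"unsupported tier: {tier!r}")

    # Both tiers: remove each path in turn, deliberately, one at a time.
    matrix = [
        {"target": p, "event": "planned removal for maintenance", "announced": True}
        for p in paths
    ]

    if tier == "IV":
        # Tier IV adds unannounced single faults, up to loss of a whole path.
        matrix += [
            {"target": p, "event": "unplanned loss of entire path", "announced": False}
            for p in paths
        ]
        # Tier IV also adds a continuous-cooling demonstration.
        matrix.append(
            {"target": "cooling plant",
             "event": "continuous-cooling ride-through",
             "announced": False}
        )
    return matrix


tier3 = failure_mode_matrix("III", ["path A", "path B"])
tier4 = failure_mode_matrix("IV", ["path A", "path B"])
print(len(tier3), len(tier4))  # 2 5
```

Even in this toy form the Tier IV matrix is a superset of the Tier III one, which reflects the requirement that a fault-tolerant facility remain concurrently maintainable.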
The two parallel tracks and how they interlock
Here is the structural fact that makes AI-factory commissioning different from every commissioning program that came before it. There are two acceptance ladders running at once, on different schedules, owned by different organizations, governed by different standards — and they are coupled by shared physical resources.
The facility track (Chapters 13.3–13.6) is the classical mission-critical Cx program: utility energization, switchgear, generators, UPS/BESS, CDUs, chillers, piping, controls, culminating in the L5 IST. It is owned by the owner's CxA and the MEP trades, governed by ASHRAE Guideline 0/1.1/1.6, Uptime, and BICSI 002. The cluster track (Chapters 13.7–13.9) is the IT validation program: fabric commissioning (BER, link-flap, topology, bandwidth/latency), GPU node burn-in and diagnostics, cluster-scale NCCL/collective acceptance, and a reference/proxy training run. It is owned by the platform/SRE/ML-infra team, governed by vendor deployment guides and de facto standards like SemiAnalysis ClusterMAX, none of which the facility CxA typically touches.
If those two tracks each sign off in isolation, you have certified nothing useful. The facility track accepts a power-and-cooling envelope using load banks — resistive heaters that reject to air and draw a smooth, steady load. The cluster track accepts compute using real GPUs that draw a violent, synchronized, switching load and reject heat into liquid. Neither track, alone, exercises the seam between them. The interlock gates are where that seam is tested.
| Interlock gate | Facility track delivers | Cluster track delivers | What the seam actually tests | Chapter |
|---|---|---|---|---|
| Mechanical-Cx ↔ GPU burn-in | Flushed, filled, leak-tested liquid loop; CDU at flow/temperature setpoint | Nodes drawing real heat flux into the cold plates | Whether the loop, CDU controls and worst-case branch hold under real transient GPU heat — load banks cannot do this | 13.5 |
| Electrical acceptance ↔ load-bank IST | Energized power chain proven concurrently maintainable / fault tolerant | (Not yet present — emulated by load banks) | Redundancy topology under steady load; but NOT the dynamic load-swing the GPUs will impose | 13.3 |
| Load-bank IST ↔ first real workload | Building proven against resistive/reactive/AI-emulating load banks | Proxy training run imposing real synchronized power and thermal dynamics | Power smoothing, dynamic-load-swing tolerance, thermal ride-through under the only true emulator: a real run | 13.6 / 13.9 |
The middle row is the defining limitation of facility commissioning for AI and the reason this Part exists. A load bank is a deliberately boring load: a bank of resistive elements that draws a smooth, constant current and rejects its heat straight to the room air. A frontier GPU cluster is the opposite — tens of thousands of accelerators that ramp from idle to full and back in milliseconds in lockstep across a synchronous step, imposing power transients and harmonic content that a resistive bank never produces, and dumping that heat into a liquid loop the load bank never touches. The facility track can prove the power chain is wired correctly and redundant; it cannot prove the chain absorbs the dynamic load swing of a real all-reduce, because the tool it uses to emulate load cannot generate that swing. This is the dynamic-load realism gap, and it is canonicalized in Chapter 13.6; the transient physics behind it lives in Chapter 4.5. The mitigation is sequencing: the proxy training run of Chapter 13.9 is the first and only acceptance test that closes the gap, which is why it must come before — not after — go-live.
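The realism gap can be made concrete with a toy comparison of the two load shapes. Every number below is an assumption chosen for illustration (GPU count, per-GPU idle and full power, a 10 ms sample interval, a square-wave compute/communicate pattern); none of them is a measurement, and the real transient physics is the subject of Chapter 4.5.

```python
def swing_metrics(profile_kw, dt_ms):
    """Return (peak-to-peak swing in kW, steepest ramp in kW per ms)."""
    swing = max(profile_kw) - min(profile_kw)
    steepest = max(abs(b - a) for a, b in zip(profile_kw, profile_kw[1:])) / dt_ms
    return swing, steepest


# Illustrative assumptions only, not vendor or measured figures.
N_GPUS = 10_000
IDLE_W, FULL_W = 200, 1000   # assumed per-GPU power at idle and at full load
DT_MS = 10                   # assumed sample interval
SAMPLES = 20

# Load bank: a smooth, constant draw at the full-load level.
load_bank_kw = [N_GPUS * FULL_W / 1000] * SAMPLES

# Synchronized cluster: all GPUs flip between full and idle together,
# five samples in each phase (a crude stand-in for a lockstep training step).
gpu_kw = [
    N_GPUS * (FULL_W if (i // 5) % 2 == 0 else IDLE_W) / 1000
    for i in range(SAMPLES)
]

print(swing_metrics(load_bank_kw, DT_MS))  # (0.0, 0.0)
print(swing_metrics(gpu_kw, DT_MS))        # (8000.0, 800.0)
```

Both profiles peak at the same power, so a steady-state acceptance criterion cannot tell them apart; only the swing and ramp figures differ, and the load bank produces neither.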
Deep dive: why facility Cx and cluster Cx cannot simply run independently
The intuitive program-management instinct is to treat the two tracks as independent workstreams — let the MEP CxA finish the building, hand over a 'powered shell with cooling,' and then let the ML-infra team bring up the cluster on top. For a low-density, air-cooled enterprise hall that instinct is mostly fine. For a liquid-cooled AI factory it produces three specific failures.
One: the load realism handoff is silent. The facility team signs off cooling against load banks that reject to air. The first time the liquid loop sees a real cold-plate heat-flux transient is when a $40M cluster is already racked and running — which means the worst-case-branch thermal-hydraulic behavior, the CDU control-loop stability under a real step change, and the leak-detection response under real pressure cycling are all being discovered in production. The fix is the explicit mechanical-Cx ↔ burn-in overlap (Chapter 13.5): the loop is accepted with real nodes drawing heat, not with load banks, by sequencing GPU burn-in to begin while mechanical Cx is still open.
Two: the power-transient handoff is silent. Electrical acceptance proves the redundancy topology under a smooth load. The dynamic-load-swing tolerance (the ability of the UPS/BESS and any rack-level energy storage to absorb a synchronized GPU power step) is never exercised until the proxy run. A cluster that passed every facility gate can still trip protection or sag a bus the first time 10,000 GPUs enter a collective in lockstep. The mitigation stack (BBU → BESS → on-board capacitance) is validated under realistic dynamics only by the proxy run, per Chapter 13.6.
Three: ownership gaps become finger-pointing. When a node throttles, is it a GPU fault (cluster team), a coolant-temperature excursion (facility team), or a CDU control-tuning issue (the seam)? Without an interlocked program and a shared deficiency log (Chapter 13.2), each track's commissioning record shows 'pass' and the defect lives in the gap between them. The governance answer is a single integrated commissioning schedule with named interlock gates and a CxA whose scope explicitly spans the seam.
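Two of the sequencing rules argued above (GPU burn-in begins while mechanical Cx is still open, and the proxy training run completes before go-live) can be expressed as checks on an integrated schedule. The sketch below is hypothetical throughout: the `Window` type, the activity keys, and the day numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical schedule window in project days; end_day is exclusive."""
    start_day: int
    end_day: int


def check_interlocks(schedule: dict, go_live_day: int) -> list:
    """Return a list of interlock violations; an empty list means none found."""
    problems = []
    mech = schedule["mechanical_cx"]
    burn = schedule["gpu_burn_in"]
    proxy = schedule["proxy_run"]

    # Burn-in must begin while mechanical Cx is still open.
    if not (mech.start_day <= burn.start_day < mech.end_day):
        problems.append("GPU burn-in does not begin while mechanical Cx is open")

    # The proxy run must finish before go-live, not after.
    if proxy.end_day > go_live_day:
        problems.append("proxy training run finishes after go-live")
    return problems


independent = {
    "mechanical_cx": Window(0, 60),
    "gpu_burn_in": Window(70, 90),   # starts after the building is handed over
    "proxy_run": Window(100, 110),
}
interlocked = {
    "mechanical_cx": Window(0, 60),
    "gpu_burn_in": Window(45, 80),   # overlaps the open mechanical Cx window
    "proxy_run": Window(85, 95),
}
print(check_interlocks(independent, go_live_day=95))
print(check_interlocks(interlocked, go_live_day=100))  # []
```

The "independent workstreams" schedule fails both checks while each track, viewed alone, would still report a pass, which is the silent-handoff pattern described above.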
Roles and the governing documents (OPR / BOD / SOO)
Commissioning is only as good as the requirements it tests against, and those requirements live in a short chain of governing documents that every acceptance script must trace back to. The chain is deliberately a chain — each document derives from the one before it, so that a pass/fail gate in an L5 script can be followed all the way back to a stated owner intent.
- Owner's Project Requirements (OPR). The owner's intent in measurable terms: the availability target (and therefore the Tier or BICSI class), the density and ramp the building must accommodate, the environmental envelopes, the maintainability expectations, and the acceptance criteria the owner will hold the project to. Everything downstream is an answer to the OPR. For an AI factory the OPR must state the workload intent (training-shaped vs inference-shaped, per Chapter 1.1), because that is what makes the difference between commissioning for goodput and commissioning for availability.
- Basis of Design (BOD). The design engineer's documented explanation of how the proposed systems satisfy each OPR requirement: the topology, the redundancy scheme, the cooling architecture, the setpoints, and the assumptions. ASHRAE Guideline 1.1 supplies the HVAC&R technical requirements that the BOD is reviewed against. The CxA verifies the BOD answers the OPR before construction, not after.
- Sequence of Operations (SOO). The control-logic specification — exactly how the BMS/EPMS/DCIM is supposed to behave in every normal, maintenance, and failure mode. The SOO is the script-writer's bible: an L4/L5 functional test is, in essence, a line-by-line proof that the real controls match the SOO under real conditions. A vague or incomplete SOO is the most common root cause of a failed IST, because there is nothing precise to test against.
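The traceability this chain implies (an acceptance gate follows back through an SOO item and a BOD item to an OPR requirement) can be checked mechanically. The sketch below assumes a simple parent-pointer model; the `Item` type and every identifier such as `OPR-AVAIL-01` are hypothetical examples, not a real document-numbering scheme.

```python
from dataclasses import dataclass
from typing import Optional

# Each kind of item must derive from exactly this kind of parent.
EXPECTED_PARENT = {"GATE": "SOO", "SOO": "BOD", "BOD": "OPR"}


@dataclass(frozen=True)
class Item:
    id: str
    kind: str                       # "OPR", "BOD", "SOO" or "GATE"
    parent_id: Optional[str] = None


def trace_to_opr(gate_id: str, items: dict) -> list:
    """Walk parent links from a gate up to an OPR item, or raise ValueError."""
    path = []
    current = items[gate_id]
    while current.kind != "OPR":
        path.append(current.id)
        wanted = EXPECTED_PARENT[current.kind]
        parent = items.get(current.parent_id)
        if parent is None or parent.kind != wanted:
            raise ValueError(f"{current.id} does not trace to a {wanted} item")
        current = parent
    path.append(current.id)
    return path


items = {i.id: i for i in [
    Item("OPR-AVAIL-01", "OPR"),
    Item("BOD-ELEC-03", "BOD", "OPR-AVAIL-01"),
    Item("SOO-GEN-12", "SOO", "BOD-ELEC-03"),
    Item("L5-GEN-START-07", "GATE", "SOO-GEN-12"),
    Item("L5-ORPHAN-99", "GATE"),   # a gate with nothing to test against
]}
print(trace_to_opr("L5-GEN-START-07", items))
```

Calling `trace_to_opr("L5-ORPHAN-99", items)` raises `ValueError`, which is the situation a vague or incomplete SOO creates: a test step with no precise requirement behind it.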
The role that owns this chain is the Commissioning Authority (CxA), and the single most important governance decision is to engage the CxA at the design phase, as an independent party, not as a general-contractor self-check bolted on at the end. ASHRAE Guideline 0 (the process), ASHRAE Standard 202 (the commissioning process formalized as a standard), and the data-center-specific ASHRAE Guideline 1.6 all assume an independent CxA verifying the OPR→BOD→SOO chain across the whole project lifecycle, from design review (L0) through post-occupancy. An owner who engages the CxA only to 'witness the IST' has already lost the design-phase reviews where the cheapest defects are caught.
Why the AI factory breaks the conventional program
Pull the threads together and the shape of the problem is clear. A conventional data-center commissioning program is a single, well-standardized ladder culminating in an IST against load banks, governed end-to-end by ASHRAE/Uptime/BICSI, signed by an independent CxA. That program, applied unchanged to an AI factory, certifies a building that has never seen its actual load and a cluster the standards never mention.
The three structural breaks are: (1) two tracks, not one: facility and cluster, on different schedules and standards, requiring explicit interlock gates; (2) load realism: the load banks that accept the facility cannot reproduce the dynamic power and liquid-side thermal behavior of real GPUs, so the proxy training run becomes a mandatory acceptance gate rather than an optional nicety; and (3) density ramp: the facility is commissioned today against a 130 kW/rack generation while the design must accommodate the 600 kW-class racks of the next one, so the IST must validate not only the steady state but the headroom the ramp will consume (floor, water, electrical, cooling-plant turndown). A program that handles these three is a Part 13 program. A program that ignores them is a conventional program that will hand a long deficiency list to day-2 operations, which is exactly what a real training run, already living with 419 interruptions in 54 days, cannot afford.
Deep dive: the commissioning schedule is a redundancy problem, not just a calendar
One subtlety that separates AI-factory commissioning from a fresh greenfield enterprise build: much AI capacity is energized in stages, with live blocks already carrying load while later blocks are still being commissioned. That turns the Cx schedule into a redundancy-engineering problem. An L5 IST that pulls utility or fails a generator on a shared distribution path can threaten the live blocks unless the test boundary is drawn to preserve their redundancy throughout. This is why staged energization (Chapter 13.10) and the IST sequence (Chapter 13.6) are co-designed: you cannot run a destructive integrated test on a path that a revenue-bearing block depends on without first proving you can isolate it.
The governance consequence is that the CxA and operations must agree, in writing, on which faults the IST is permitted to inject at which stage, and what the live-block protection is during each. A program that treats commissioning as a pre-occupancy gate that finishes before any load arrives does not match how AI capacity is actually brought online — it arrives block by block, against a power-bound interconnection clock that rewards energizing early and commissioning around live load. → staged ramp and the Operational Readiness gate in Chapter 13.10.
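The written agreement on permitted faults can be backed by a simple boundary check: before a fault is injected, confirm that no live block would lose its redundancy. The sketch below assumes each live block is fed by a set of named paths and that "redundancy preserved" means at least two healthy paths remain; the block names, path names, and that threshold are illustrative assumptions.

```python
def blocks_losing_redundancy(live_blocks: dict, paths_down: set,
                             min_healthy_paths: int = 2) -> list:
    """Return live blocks left with fewer than min_healthy_paths feeds."""
    return sorted(
        block for block, feeds in live_blocks.items()
        if len(feeds - paths_down) < min_healthy_paths
    )


# Hypothetical topology: two revenue-bearing blocks sharing path A1.
live_blocks = {
    "block-1": {"A1", "B1"},
    "block-2": {"A1", "B2"},
}

# Fault injected on a path no live block depends on: nothing at risk.
print(blocks_losing_redundancy(live_blocks, {"A2"}))  # []

# Fault injected on the shared path: both live blocks lose redundancy.
print(blocks_losing_redundancy(live_blocks, {"A1"}))  # ['block-1', 'block-2']
```

A non-empty result means the test boundary has to be redrawn or the path isolated first, before the IST step is allowed to run.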