Chapter 13.1

Commissioning Fundamentals, Levels & Program Governance

Commissioning is where the building's design intent is converted into evidence. For an AI factory that evidence must span two parallel, interlocked tracks (facility and cluster), and their acceptance gates are either sequenced deliberately by you or sequenced for you by the schedule, badly and at the worst possible time.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

How far up the L1–L5 ladder you actually intend to test — and specifically whether you commit to a true Level 5 Integrated Systems Test (IST) with a black-building / pull-the-plug demonstration, or stop at functional performance and accept the residual integration risk.
Which resilience model the IST must prove — concurrent maintainability (Tier III: take any one path down for service with no IT load loss) versus fault tolerance (Tier IV: survive an unplanned single fault, including a fire or flood in one path) — because that choice writes the failure-mode test matrix.
How the two parallel tracks — facility Cx and IT/cluster validation — interlock, and where the explicit overlapping gates sit (mechanical-Cx ↔ GPU burn-in; electrical acceptance ↔ load-bank IST ↔ first real workload).
Who owns the program: an independent Commissioning Authority (CxA) engaged at design, not a general-contractor afterthought — and which governing documents (OPR, BOD, SOO) the whole acceptance chain traces back to.
Which standards spine you commission against — ASHRAE Guideline 0 / 1.1 / 1.6 for process and documentation, Uptime Tier or BICSI 002 for the resilience target — and how you reconcile the facility-centric standards with a cluster the standards do not yet describe.

Commissioning (Cx) is the discipline that refuses to take the design's word for it. Every prior part of this guide makes claims: that the power chain will ride through a utility dip, that the cold plates will hold the GPUs under sustained load, that the fabric will carry a non-blocking all-reduce, that the standby plant will pick up the building before the UPS runs flat. Commissioning is the structured process of generating the evidence that those claims are true under the conditions that matter, before the facility carries revenue load. It is the bridge between Part 2's design and Part 14's day-2 operations, and it is the last point in the project where a latent integration defect costs hours of rework rather than a multi-day outage on a live cluster.

For a conventional enterprise data center this is a mature, well-codified ritual. For an AI factory it is not — and the gap is the subject of this entire Part. The reason is structural: an AI facility is two machines pretending to be one. There is the facility — substation, switchgear, generators, UPS/BESS, CDUs, chillers, piping, BMS — which the commissioning industry knows how to accept. And there is the cluster — tens of thousands of GPUs, the scale-up and scale-out fabric, storage, and the scheduler — which the facility commissioning standards barely acknowledge exists. The two tracks run in parallel, share critical resources (power and cooling), and must be interlocked at specific gates, or each track signs off in isolation against a load the other half cannot actually deliver. This chapter establishes the fundamentals — the levels, the two tracks, the roles, and the governing documents — that the rest of Part 13 builds on.

The L1–L5 ladder

Mission-critical commissioning is organized as a ladder of increasing integration, conventionally Level 1 through Level 5 (with some programs adding a Level 0 design-review and a Level 6 post-occupancy / seasonal stage). Each rung tests a wider boundary than the last; you do not climb to the next rung until the current one is signed and its deficiencies closed. The discipline of the ladder is that integration faults are found at the lowest level at which they can possibly appear — a mis-wired CT caught at L3 standalone test is a morning's rework; the same fault discovered at L5 during a black-building test invalidates the run and resets days of scripted work.

The names below are the most common form; vendors and owners vary the labels, but the boundary each rung tests is stable. The two parallel tracks (next section) each have their own analogue of this ladder, and Chapter 13.2 turns each rung into a quantitative script with explicit pass/fail gates.

The commissioning ladder — boundary tested at each level

Level	Name	Boundary tested	Where it happens	Cluster-track analogue
L1	Factory Acceptance Test (FAT) / factory witness	A single piece of equipment meets spec before it ships	Vendor factory / integrator floor	Rack/node integration test & firmware baseline at the integrator
L2	Site Acceptance / installation verification (SAT)	Equipment arrived undamaged and is installed correctly	On site, pre-energization	Inventory, cabling/optics verification, link-up checks
L3	Pre-functional / standalone (PFT)	One system works on its own, energized	On site, per-system	Single-node burn-in & DCGM diagnostics (13.8)
L4	Functional performance (FPT)	One system performs across its operating envelope, including its own failure modes	On site, per-discipline	Point-to-point fabric BW/latency, single-rail validation (13.7)
L5	Integrated Systems Test (IST)	All systems behave correctly together under load and fault — the building as one machine	On site, whole-facility	Cluster-scale NCCL/collective acceptance + proxy training run (13.9)

Conventional L1–L5 mission-critical commissioning. Labels vary by owner/CxA; the integration boundary each level tests is the stable concept. Cluster-track analogues per the right column are developed across Chapters 13.7–13.9.
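The climbing rule stated above (no rung is started until the previous one is signed and its deficiencies are closed) can be sketched as a small gate check. This is a minimal illustration only; the `LevelStatus` record and the `highest_level_allowed` function are hypothetical names, not part of any Cx tool or standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LevelStatus:
    level: int              # 1..5, matching L1..L5 in the ladder
    signed_off: bool        # has this level been signed?
    open_deficiencies: int  # deficiencies still open against this level


def highest_level_allowed(statuses: List[LevelStatus]) -> int:
    """Return the highest level the program may currently work at.

    A level may start only when every lower level is signed off with
    zero open deficiencies. A missing record counts as not complete.
    """
    allowed = 1  # L1 has no prerequisite
    for status in sorted(statuses, key=lambda s: s.level):
        if status.level != allowed:
            break  # gap in the records: stop climbing
        if not status.signed_off or status.open_deficiencies > 0:
            break  # current rung is not closed out
        allowed = status.level + 1
    return min(allowed, 5)


# Hypothetical program state: L3 is signed but still has two open items.
program = [
    LevelStatus(1, True, 0),
    LevelStatus(2, True, 0),
    LevelStatus(3, True, 2),
    LevelStatus(4, False, 0),
]
print(highest_level_allowed(program))  # 3: L4 may not start yet
```

The design choice worth noting is that a signature alone is not enough: a signed level with open deficiencies still blocks the next rung, which is the discipline the ladder depends on.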

Two things about the ladder are easy to get wrong and expensive to recover from. First, L4 already includes failure-mode testing within a discipline — a UPS must demonstrate its own transfer and its battery ride-through at L4, not wait for L5. L5 then tests failure modes across disciplines, where the interesting cascades live (a generator that starts fine and a cooling plant that rides through, but a BMS sequence that does not hand the load off in the right order). Second, the temptation under schedule pressure is to truncate the ladder — to skip a real L5 and call functional testing 'good enough.' That truncation pushes the integration risk onto day two.

Commit to a true Level 5, or relocate the integration risk onto a live cluster

A real L5 IST is expensive, schedule-hungry, and the thing most likely to slip under deadline pressure. It requires the building substantially complete, temporary load (load banks) staged at scale, every subsystem at L4 sign-off, and a multi-day scripted sequence that deliberately breaks things — pulling utility, failing a generator, dropping a CDU. The cheap alternative is to stop at L4: every system passed its own functional test, so 'the building works.' It does not follow. The faults L5 exists to find — control-sequence ordering, shared-resource contention, the BMS handing off a load in the wrong order during a real transient — are precisely the ones that L4's per-system isolation cannot surface. Skipping L5 does not remove the integration risk; it relocates it onto a live, revenue-bearing cluster, where the same fault is a multi-day goodput outage instead of a scripted afternoon. Decide this at scoping. If the answer is 'we will do a real IST,' the load-bank logistics, the schedule float, and the CxA scope all have to be funded now, not negotiated later. → IST scope in Chapter 13.6.

Concurrent maintainability vs fault tolerance

The L5 test matrix is dictated by the resilience claim the facility was designed to meet, and the two claims that matter are the Uptime Institute's concurrent maintainability (Tier III) and fault tolerance (Tier IV). They sound similar and they are not, and commissioning is where the difference becomes physical.

Concurrent maintainability (Tier III) means any single capacity component or distribution path can be removed from service for planned maintenance with no impact on the IT load. The IST must therefore prove that you can take down each path in turn, one at a time, deliberately, and the load never notices. It does not require the facility to ride through an unplanned fault. Fault tolerance (Tier IV) is the stronger claim: the facility survives any single unplanned worst-case failure, including the loss of an entire distribution path to fire or flood, with no load impact, while still being concurrently maintainable. Tier IV adds compartmentalization and continuous cooling requirements that Tier III does not, and the IST must demonstrate the automatic response to a fault, not just an orderly manual switchover.

The consequence for the commissioning program is direct: the resilience target you committed to in design writes your failure-mode demonstration script. A Tier III program scripts orderly, one-path-at-a-time maintenance scenarios. A Tier IV program scripts unannounced single faults, the loss of an entire compartmentalized path, and continuous-cooling ride-through, which makes for a materially larger, riskier, longer IST. Choosing the tier is a design and capital decision (Chapter 12.1) that cannot be deferred to the IST; commissioning is where you discover whether the building actually earns the tier it was designed for.
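To make the point that the tier writes the script concrete, the sketch below generates a skeleton failure-mode list from a set of distribution paths. It is an illustration under stated assumptions: the `Scenario` record, the `failure_mode_matrix` function and the path names are hypothetical, and the compartmentalization and continuous-cooling demonstrations that Tier IV also requires are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    path: str        # distribution path under test (hypothetical name)
    kind: str        # "planned_removal" or "unplanned_fault"
    announced: bool  # is the operations team told in advance?


def failure_mode_matrix(paths: List[str], tier: str) -> List[Scenario]:
    """Build a skeleton IST scenario list for a resilience tier.

    Tier III: each path is removed in turn, deliberately, one at a time.
    Tier IV: the same planned removals (it must remain concurrently
    maintainable) plus an unannounced single fault on each path.
    """
    if tier not in ("III", "IV"):
        raise ValueError("tier must be 'III' or 'IV'")
    scenarios = [Scenario(p, "planned_removal", True) for p in paths]
    if tier == "IV":
        scenarios += [Scenario(p, "unplanned_fault", False) for p in paths]
    return scenarios


paths = ["A", "B"]  # hypothetical two-path facility
for tier in ("III", "IV"):
    print(tier, len(failure_mode_matrix(paths, tier)), "scenarios")
```

Even in this toy form the Tier IV list is twice as long as the Tier III list, before any cross-discipline cascades are added.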

The AI twist: you may be commissioning to a tier the workload does not value

Tier IV carries a ~20–40% capital premium over Tier III, and the IST to prove it is correspondingly heavier. For a synchronous training cluster, much of that premium buys availability the workload does not reward: a training job already checkpoints and restarts, so it tolerates a brief facility interruption far better than an always-on inference business does. This is the goodput-vs-availability rethink of Chapter 12.2 arriving at the commissioning gate. The practical implication is not 'commission to a lower tier' — it is that the failure-mode script should be weighted toward the faults that actually destroy goodput (a thermal excursion that throttles the whole cluster, a power transient the UPS cannot absorb) rather than mechanically chasing every Tier-IV compartment scenario for a load that would happily ride a checkpoint. Commission to what the workload values, not only to what the plaque says.

The two parallel tracks and how they interlock

Here is the structural fact that makes AI-factory commissioning different from every commissioning program that came before it. There are two acceptance ladders running at once, on different schedules, owned by different organizations, governed by different standards — and they are coupled by shared physical resources.

The facility track (Chapters 13.3–13.6) is the classical mission-critical Cx program: utility energization, switchgear, generators, UPS/BESS, CDUs, chillers, piping, controls, culminating in the L5 IST. It is owned by the owner's CxA and the MEP trades, governed by ASHRAE Guideline 0/1.1/1.6, Uptime, and BICSI 002. The cluster track (Chapters 13.7–13.9) is the IT validation program: fabric commissioning (BER, link-flap, topology, bandwidth/latency), GPU node burn-in and diagnostics, cluster-scale NCCL/collective acceptance, and a reference/proxy training run. It is owned by the platform/SRE/ML-infra team, governed by vendor deployment guides and de-facto standards like SemiAnalysis ClusterMAX — none of which the facility CxA typically touches.

If those two tracks each sign off in isolation, you have certified nothing useful. The facility track accepts a power-and-cooling envelope using load banks — resistive heaters that reject to air and draw a smooth, steady load. The cluster track accepts compute using real GPUs that draw a violent, synchronized, switching load and reject heat into liquid. Neither track, alone, exercises the seam between them. The interlock gates are where that seam is tested.

The interlock gates between the facility and cluster tracks

Interlock gate	Facility track delivers	Cluster track delivers	What the seam actually tests	Chapter
Mechanical-Cx ↔ GPU burn-in	Flushed, filled, leak-tested liquid loop; CDU at flow/temperature setpoint	Nodes drawing real heat flux into the cold plates	Whether the loop, CDU controls and worst-case branch hold under real transient GPU heat — load banks cannot do this	13.5
Electrical acceptance ↔ load-bank IST	Energized power chain proven concurrently maintainable / fault tolerant	(Not yet present — emulated by load banks)	Redundancy topology under steady load; but NOT the dynamic load-swing the GPUs will impose	13.3
Load-bank IST ↔ first real workload	Building proven against resistive/reactive/AI-emulating load banks	Proxy training run imposing real synchronized power and thermal dynamics	Power smoothing, dynamic-load-swing tolerance, thermal ride-through under the only true emulator: a real run	13.6 / 13.9

The explicit overlapping, sequenced gates where the two tracks must hand off. Each is developed in the cited chapter; the realism limits at each gate are the recurring theme of Part 13.

The middle row is the defining limitation of facility commissioning for AI and the reason this Part exists. A load bank is a deliberately boring load: a bank of resistive elements that draws a smooth, constant current and rejects its heat straight to the room air. A frontier GPU cluster is the opposite — tens of thousands of accelerators that ramp from idle to full and back in milliseconds in lockstep across a synchronous step, imposing power transients and harmonic content that a resistive bank never produces, and dumping that heat into a liquid loop the load bank never touches. The facility track can prove the power chain is wired correctly and redundant; it cannot prove the chain absorbs the dynamic load swing of a real all-reduce, because the tool it uses to emulate load cannot generate that swing. This is the dynamic-load realism gap, and it is canonicalized in Chapter 13.6; the transient physics behind it lives in Chapter 4.5. The mitigation is sequencing: the proxy training run of Chapter 13.9 is the first and only acceptance test that closes the gap, which is why it must come before — not after — go-live.
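A back-of-envelope calculation shows the scale of the swing a resistive bank never produces. The sketch below is illustrative only: the per-GPU idle and full power values and the ramp time are placeholder assumptions, not measurements or vendor figures, and the function names are hypothetical.

```python
def synchronized_step_mw(n_gpus: int, idle_w: float, full_w: float) -> float:
    """Aggregate power step if every GPU moves from idle to full together."""
    return n_gpus * (full_w - idle_w) / 1e6


def ramp_rate_mw_per_s(step_mw: float, ramp_ms: float) -> float:
    """Average rate of change if the step completes in ramp_ms milliseconds."""
    return step_mw / (ramp_ms / 1000.0)


# Placeholder inputs (assumptions for illustration only).
step = synchronized_step_mw(n_gpus=10_000, idle_w=200.0, full_w=1_000.0)
rate = ramp_rate_mw_per_s(step, ramp_ms=10.0)

print(f"synchronized step: {step:.1f} MW")
print(f"average ramp rate: {rate:.0f} MW/s")
```

A load bank held at a fixed setpoint has a step of zero during the hold, which is why it can prove the steady-state redundancy topology but not the dynamic behavior described here.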

Deep dive: why facility Cx and cluster Cx cannot simply run independently

The intuitive program-management instinct is to treat the two tracks as independent workstreams — let the MEP CxA finish the building, hand over a 'powered shell with cooling,' and then let the ML-infra team bring up the cluster on top. For a low-density, air-cooled enterprise hall that instinct is mostly fine. For a liquid-cooled AI factory it produces three specific failures.

One: the load realism handoff is silent. The facility team signs off cooling against load banks that reject to air. The first time the liquid loop sees a real cold-plate heat-flux transient is when a $40M cluster is already racked and running — which means the worst-case-branch thermal-hydraulic behavior, the CDU control-loop stability under a real step change, and the leak-detection response under real pressure cycling are all being discovered in production. The fix is the explicit mechanical-Cx ↔ burn-in overlap (Chapter 13.5): the loop is accepted with real nodes drawing heat, not with load banks, by sequencing GPU burn-in to begin while mechanical Cx is still open.

Two: the power-transient handoff is silent. Electrical acceptance proves the redundancy topology under a smooth load. The dynamic-load-swing tolerance — the UPS/BESS and any rack-level energy storage absorbing a synchronized GPU power step — is never exercised until the proxy run. A cluster that passed every facility gate can still trip protection or sag a bus the first time 10,000 GPUs enter a collective in lockstep. The mitigation stack (BBU → BESS → on-board capacitance under realistic dynamics) is only validated by the proxy run, per Chapter 13.6.

Three: ownership gaps become finger-pointing. When a node throttles, is it a GPU fault (cluster team), a coolant-temperature excursion (facility team), or a CDU control-tuning issue (the seam)? Without an interlocked program and a shared deficiency log (Chapter 13.2), each track's commissioning record shows 'pass' and the defect lives in the gap between them. The governance answer is a single integrated commissioning schedule with named interlock gates and a CxA whose scope explicitly spans the seam.
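The fix for the first failure, starting GPU burn-in while mechanical Cx is still open, is a property of the integrated schedule that can be checked mechanically. The sketch below assumes the schedule exposes three dates; the function name and the dates are hypothetical.

```python
from datetime import date


def burn_in_overlaps_mechanical_cx(
    mech_cx_open: date, mech_cx_close: date, burn_in_start: date
) -> bool:
    """True if burn-in begins inside the mechanical Cx window.

    If burn-in starts on or after the close date, the liquid loop was
    accepted without ever seeing real cold-plate heat flux.
    """
    return mech_cx_open <= burn_in_start < mech_cx_close


# Hypothetical schedule dates.
mech_open, mech_close = date(2026, 3, 1), date(2026, 4, 15)

print(burn_in_overlaps_mechanical_cx(mech_open, mech_close, date(2026, 4, 1)))   # True
print(burn_in_overlaps_mechanical_cx(mech_open, mech_close, date(2026, 4, 20)))  # False
```

A check like this belongs in the single integrated schedule, so that a slip on either track that removes the overlap is flagged as a gate violation and not discovered at handover.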

Roles and the governing documents (OPR / BOD / SOO)

Commissioning is only as good as the requirements it tests against, and those requirements live in a short chain of governing documents that every acceptance script must trace back to. The chain is deliberately a chain — each document derives from the one before it, so that a pass/fail gate in an L5 script can be followed all the way back to a stated owner intent.

Owner's Project Requirements (OPR). The owner's intent in measurable terms: the availability target (and therefore the Tier or BICSI class), the density and ramp the building must accommodate, the environmental envelopes, the maintainability expectations, and the acceptance criteria the owner will hold the project to. Everything downstream is an answer to the OPR. For an AI factory the OPR must state the workload intent (training-shaped vs inference-shaped, per Chapter 1.1), because that is what makes the difference between commissioning for goodput and commissioning for availability.
Basis of Design (BOD). The design engineer's documented explanation of how the proposed systems satisfy each OPR requirement — the topology, the redundancy scheme, the cooling architecture, the setpoints, and the assumptions. ASHRAE Guideline 1.1 governs the BOD specifically. The CxA verifies the BOD answers the OPR before construction, not after.
Sequence of Operations (SOO). The control-logic specification — exactly how the BMS/EPMS/DCIM is supposed to behave in every normal, maintenance, and failure mode. The SOO is the script-writer's bible: an L4/L5 functional test is, in essence, a line-by-line proof that the real controls match the SOO under real conditions. A vague or incomplete SOO is the most common root cause of a failed IST, because there is nothing precise to test against.
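The traceability requirement, that any pass/fail gate can be followed back through SOO and BOD to a stated owner intent, can be sketched as a walk over parent links. This is a minimal illustration; the `Item` record, the `trace_to_opr` function and all identifiers such as `OPR-07` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Item:
    doc: str                     # "OPR", "BOD", "SOO" or "TEST"
    ident: str                   # hypothetical identifier
    derives_from: Optional[str]  # ident of the parent item; None for OPR


def trace_to_opr(ident: str, registry: Dict[str, Item]) -> List[str]:
    """Follow derives_from links from a test step back to an OPR item."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = ident
    while current is not None:
        if current in seen:
            raise ValueError(f"circular derivation at {current}")
        if current not in registry:
            raise KeyError(f"{current} is referenced but not registered")
        seen.add(current)
        item = registry[current]
        chain.append(f"{item.doc}:{item.ident}")
        current = item.derives_from
    if not chain[-1].startswith("OPR:"):
        raise ValueError(f"{ident} does not trace back to an OPR item")
    return chain


items = [
    Item("OPR", "OPR-07", None),
    Item("BOD", "BOD-12", "OPR-07"),
    Item("SOO", "SOO-3.4", "BOD-12"),
    Item("TEST", "L5-IST-021", "SOO-3.4"),
]
registry = {item.ident: item for item in items}
print(" -> ".join(trace_to_opr("L5-IST-021", registry)))
```

A test step that raises here, because its chain ends somewhere other than the OPR, is a gate with nothing precise to test against.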

The role that owns this chain is the Commissioning Authority (CxA) — and the single most important governance decision is to engage the CxA at the design phase, as an independent party, not as a general-contractor self-check bolted on at the end. ASHRAE Guideline 0 (the process), ASHRAE Standard 202 (the formalized commissioning process / Cx-Process), and the data-center-specific ASHRAE Guideline 1.6 all assume an independent CxA verifying the OPR→BOD→SOO chain across the whole project lifecycle, from design review (L0) through post-occupancy. An owner who engages the CxA only to 'witness the IST' has already lost the design-phase reviews where the cheapest defects are caught.

The standards describe the building, not the cluster

ASHRAE Guideline 0/1.1/1.6, ASHRAE Standard 202, Uptime's Tier Standard, and BICSI 002 are excellent, mature, and, for the cluster track, almost silent. None of them tells you what BER threshold accepts an InfiniBand link, what NCCL busbw gate accepts the fabric, how long a GPU burn-in soak should run, or what goodput a proxy run must hit to accept the cluster. That body of practice is carried by vendor deployment guides (NVIDIA DGX/BasePOD), benchmark methodologies (MLPerf, OSU), and de-facto industry standards (SemiAnalysis ClusterMAX). The governance gap is real: an owner who commissions strictly 'to standard' commissions only half the machine. The CxA scope, the acceptance documentation, and the deficiency log must be deliberately extended to cover the cluster track, because no standards body will do it for you. → fabric gates in Chapter 13.7, burn-in in Chapter 13.8, cluster acceptance in Chapter 13.9.

Figure	What it describes	Year	Source
L1–L5	The commissioning ladder: FAT → SAT → pre-functional → functional → Integrated Systems Test (IST)	2025	Construct & Commission; BMP MEP; CxPlanner
Tier III vs IV	Concurrent maintainability (any path serviceable, no load impact) vs fault tolerance (survive any single unplanned fault)	2025	Uptime Institute Tier Standard
99.982% / 99.995%	Tier III (~1.6 hr/yr) vs Tier IV (~26 min/yr) availability; ~20–40% capital premium for IV	2025	Uptime Institute (% figures Uptime-disavowed)
Gd 0 / 1.1 / 1.6	ASHRAE commissioning-process / Basis-of-Design / data-center-specific Cx guidelines; Std 202 formalizes the Cx-Process	2025	ASHRAE; ACHR News
0.5–2%	Commissioning as a share of construction cost; CxAs now locked in 12–18 months ahead of energization	2025	CxPlanner; iRecruit / industry practice
~$14.2M/mo	Lost-revenue cost of delaying commissioning of a 60 MW facility; the schedule pressure that tempts truncating L5	2025	Mastt / industry build-cost analyses
419 / 54 days	Unplanned interruptions on a 16,384-GPU Llama 3 run (~1 every 3 hr); the day-2 reality a thin Cx program hands forward	2024	Meta (Llama 3 paper) / Tom's Hardware
~575 pp	ANSI/BICSI 002-2024, the most comprehensive lifecycle design+implementation standard; 2024 ed. expanded liquid/immersion	2024	BICSI
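Two of the figures above are simple arithmetic and can be reproduced directly: the annual downtime implied by an availability percentage, and the mean interval implied by 419 interruptions in 54 days. The sketch below is a worked check only; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year of downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def mean_hours_between_interruptions(interruptions: int, days: float) -> float:
    """Average spacing of interruptions over a run of the given length."""
    return days * 24 / interruptions


tier3 = annual_downtime_hours(99.982)  # about 1.58 hours per year
tier4 = annual_downtime_hours(99.995)  # about 0.44 hours, i.e. about 26 minutes
gap = mean_hours_between_interruptions(419, 54)  # about 3.1 hours

print(f"Tier III: {tier3:.2f} hr/yr")
print(f"Tier IV:  {tier4 * 60:.0f} min/yr")
print(f"Interruption interval: {gap:.1f} hr")
```

The comparison is the useful part: the facility-side figures are measured in minutes per year, while the training run sees an interruption roughly every three hours.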

Why the AI factory breaks the conventional program

Pull the threads together and the shape of the problem is clear. A conventional data-center commissioning program is a single, well-standardized ladder culminating in an IST against load banks, governed end-to-end by ASHRAE/Uptime/BICSI, signed by an independent CxA. That program, applied unchanged to an AI factory, certifies a building that has never seen its actual load and a cluster the standards never mention.

The three structural breaks are: (1) two tracks, not one — facility and cluster, on different schedules and standards, requiring explicit interlock gates; (2) load realism — the load banks that accept the facility cannot reproduce the dynamic power and liquid-side thermal behavior of real GPUs, so the proxy training run becomes a mandatory acceptance gate rather than an optional nicety; and (3) density ramp — the facility is commissioned today against a 130 kW/rack generation while the design must accommodate the 600 kW-class racks of the next one, so the IST must validate not only the steady state but the headroom the ramp will consume (floor, water, electrical, cooling-plant turndown). A program that handles these three is a Part 13 program. A program that ignores them is a conventional program that will hand a long deficiency list to day-2 operations — exactly the failure environment the 419-interruptions-in-54-days reality of a real training run cannot afford.

Deep dive: the commissioning schedule is a redundancy problem, not just a calendar

One subtlety that separates AI-factory commissioning from a fresh greenfield enterprise build: much AI capacity is energized in stages, with live blocks already carrying load while later blocks are still being commissioned. That turns the Cx schedule into a redundancy-engineering problem. An L5 IST that pulls utility or fails a generator on a shared distribution path can threaten the live blocks unless the test boundary is drawn to preserve their redundancy throughout. This is why staged energization (Chapter 13.10) and the IST sequence (Chapter 13.6) are co-designed: you cannot run a destructive integrated test on a path that a revenue-bearing block depends on without first proving you can isolate it.

The governance consequence is that the CxA and operations must agree, in writing, on which faults the IST is permitted to inject at which stage, and what the live-block protection is during each. A program that treats commissioning as a pre-occupancy gate that finishes before any load arrives does not match how AI capacity is actually brought online — it arrives block by block, against a power-bound interconnection clock that rewards energizing early and commissioning around live load. → staged ramp and the Operational Readiness gate in Chapter 13.10.
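The written agreement on which faults may be injected at which stage can be backed by a simple pre-test check: for a proposed fault, does every live block keep the number of healthy paths it is required to have? The sketch below is illustrative; the `blocks_at_risk` function, the block and path names, and the default requirement of two remaining paths are all assumptions.

```python
from typing import Dict, List, Set


def blocks_at_risk(
    live_block_paths: Dict[str, Set[str]],
    faulted_paths: Set[str],
    required_paths: int = 2,  # assumption: live blocks must keep two healthy paths
) -> List[str]:
    """Return the live blocks whose redundancy a proposed fault would reduce
    below the required number of healthy paths."""
    at_risk = []
    for block, paths in live_block_paths.items():
        remaining = paths - faulted_paths
        if len(remaining) < required_paths:
            at_risk.append(block)
    return sorted(at_risk)


# Hypothetical topology: two live blocks, each fed by two paths.
live = {"block-1": {"A", "B"}, "block-2": {"C", "D"}}

print(blocks_at_risk(live, faulted_paths={"B"}))  # ['block-1']
print(blocks_at_risk(live, faulted_paths={"E"}))  # []
```

An empty result is the condition for running the destructive step; a non-empty one means the test boundary has to be redrawn or the block isolated first.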

This chapter is the spine of Part 13. The L1–L5 scripts and quantitative pass/fail gates are built in Chapter 13.2. The facility track runs through electrical acceptance in Chapter 13.3, microgrid/generation in Chapter 13.4, cooling in Chapter 13.5, and the L5 IST and failure-mode demonstration in Chapter 13.6. The cluster track runs through fabric commissioning in Chapter 13.7, node burn-in in Chapter 13.8, and cluster-scale benchmarking and the proxy training run in Chapter 13.9; the staged ramp and handover close it out in Chapter 13.10. The resilience targets this chapter commissions against are designed in Chapter 12.1 and reframed for AI goodput in Chapter 12.2; the transient physics behind the load-realism gap is in Chapter 4.5; the workload intent that the OPR must capture is in Chapter 1.1; and the day-2 program this hands forward begins in Chapter 14.1.