Part 13

Commissioning & Go-Live

10 chapters

Commissioning Fundamentals, Levels & Program Governance

Commissioning is where the building's design intent is converted into evidence. For an AI factory, that evidence must span two parallel, interlocked tracks (facility and cluster). Either you sequence their acceptance gates deliberately, or the schedule sequences them for you, badly and at the worst possible time.

Documentation, Scripts & Acceptance Test Plans

A commissioning script is a contract written in numbers: every test either has a pre-agreed quantitative pass/fail gate and a witnessed signature, or it is theatre — and on an AI factory the only test that proves the building works is one a facility load bank physically cannot run.

Electrical Power Acceptance (L3/L4)

Electrical acceptance is the gate where the paper power chain (the interconnect, the switchgear, the generators, the UPS/BESS) is forced to prove, with instrumented evidence, that it can hold one of the most violent loads a power system will see: a synchronized cluster of GPUs that can swing tens of megawatts in milliseconds.

Commissioning On-Site Generation & Microgrid Controls

When you build behind the meter you are no longer a customer of a grid — you are a grid, and commissioning is the only point at which you prove your generation, controls, and storage can actually keep a gigawatt-class AI load alive when the utility lets go.

Cooling Acceptance: Air, Liquid-to-Chip & CDU Commissioning

Cooling acceptance is the one part of commissioning where the facility cannot test what it is built to do: a load bank rejects heat to air, never into a cold plate. The liquid loop, the CDU controls, and the worst-case branch therefore see realistic transient heat flux only when real GPUs arrive, which makes mechanical Cx and GPU burn-in a single overlapping gate, not two sequential ones.

Level 5 Integrated Systems Testing (IST) & Failure-Mode Demonstration

Integrated Systems Testing is the last chance to fail the building on your own terms. A load bank can prove that the power and cooling chains survive a fault, yet it cannot reproduce the millisecond electrical dynamics or the worst-case cold-plate heat flux that a real GPU cluster imposes. IST acceptance must therefore be written as an explicit bridge from what the load bank can demonstrate to what only the first real workload can.

Network Fabric Commissioning & Validation

A GPU fabric does not fail loudly at commissioning. It fails quietly, one marginal optic and one mis-cabled rail at a time, and every defect you do not screen out at layer 1 reappears as a straggler, a stalled all-reduce, or an uncorrelatable incident once the cluster is in production and accruing depreciation.

GPU Node Burn-In, Diagnostics & Stress Validation

A GPU node that boots and passes a smoke test is not a commissioned node — burn-in is the deliberate, time-bounded campaign that converts a hall full of accelerators into a fleet whose failures have already happened on your clock instead of mid-training-run on the customer's.

Cluster-Scale Benchmarking, Reference Training & Storage/Scheduler Validation

The cluster is not accepted when every component passes its own test — it is accepted when ten thousand GPUs, a non-blocking fabric, a parallel filesystem, and a gang-scheduler together hold a measured goodput number under a real training load, and that number, not a sum of green checkmarks, is what you sign.

Staged Power/Load Ramp, Go-Live & Handover to Operations

Go-live is not a switch you throw. It is a staged ramp of megawatts and synchronized GPU load through an operational-readiness gate, and the two ways operators get it wrong are energizing faster than the grid (or the cooling plant) can absorb the swing, and declaring a facility 'live' before the people, procedures, and telemetry that keep it alive have been handed over.