The Definitive Guide to AI Data Centers

Chapter 14.14

Continuous & Re-Commissioning on a Live Campus

Commissioning is not a one-time gate you pass at go-live; on a live AI campus that adds a new accelerator generation every 12–18 months, the proof that the building still does what its drawings claim decays continuously — so re-commissioning becomes a standing operational program, triggered by drift and density steps, executed against a running revenue factory you cannot turn off.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

  1. Which integrated tests (black-building, failover, thermal ride-through) you re-run on a live campus, and at what cadence — annual calendar, drift-triggered, or both — given that every re-test puts revenue-earning load at risk.
  2. Which re-commissioning triggers you treat as mandatory full re-validation (density step-up, topology change) versus partial re-test (equipment swap, firmware push), and who has authority to sign the difference.
  3. How you isolate the test scope so a re-commissioning event stresses one block or fault domain at a time, never the whole factory — and whether your topology even permits that isolation.
  4. How the drift-detection loop from your DCIM and telemetry (Chapters 14.2 / 14.8) closes into a re-validation trigger automatically, rather than waiting for an outage to reveal the gap.
  5. Which Appendix F failure modes you actively demonstrate on the live campus on a recurring schedule, versus the ones you only model — and the cost of being wrong about that line.

The commissioning program in Part 13 ended at handover: the L5 integrated systems test passed, the black-building test demonstrated, the proxy training run hit its goodput gate, and operations took the keys. That is the moment the building was proven. It is also the moment the proof started to expire. An AI campus is not a static asset that drifts slowly like a legacy enterprise hall — it is a continuously mutating machine. Racks get denser, CDUs get re-piped, firmware gets pushed fleet-wide, a UPS module gets RMA'd, a fault domain gets re-cabled, a setpoint gets nudged to chase PUE. Each of those changes invalidates some slice of the commissioning evidence, and none of them announce that they have done so. Re-commissioning is the discipline of re-proving the building against its own design intent, on a cadence and on triggers, while it is full of paying load.

Re-commissioning runs under a constraint the original Cx team never faced: you cannot take the factory down to test it. The original IST happened in an empty building with load banks and a sacrificial proxy run; the re-test happens on a live campus where the load banks are real GPUs running real jobs worth real money per hour. That single difference reorders everything — what you test, how you isolate it, when you accept the risk, and what you simply decline to demonstrate because the demonstration itself is more dangerous than the failure mode it would reveal.

Why the proof decays: the half-life of a commissioning result

A commissioning result is a statement of the form "under these conditions, this system behaves this way." It stays true exactly as long as the conditions hold. On a frontier campus the conditions never hold for long. Three forces erode the proof, and they erode it at very different rates.

Physical drift is the slow one. Coolant chemistry degrades and biofouls the cold plates; quick-disconnects seep; UPS and BESS cells age and lose ride-through margin; generator fuel polymerizes; CRAH coils foul; breaker contacts and bus connections develop resistance. None of this is visible on a single-line diagram, and all of it moves the real failover and ride-through behavior away from the commissioned number. This is the regime that classic retro-commissioning was invented for, and the DOE/LBNL evidence base is unambiguous that the savings and reliability left on the table are large — but that program was built for buildings that barely change, not for ones that double in density every other year.

Configuration drift is the fast one and the dangerous one. Every change ticket — a firmware bump, a setpoint edit, a re-cabled rail, a swapped PDU, a BMS logic tweak to silence a nuisance alarm — moves the system off the configuration that was commissioned. Individually each is trivial. In aggregate, after eighteen months of a live campus, the running configuration and the as-commissioned configuration have diverged so far that the original test evidence describes a building that no longer exists. The Uptime Institute's outage analysis keeps finding the same thing: the dominant root cause of serious outages is human error in changes and procedures, not equipment failure — which is precisely configuration drift coming due.

Step changes are the violent ones. A density step-up from one accelerator generation to the next does not drift the building off its proof — it invalidates the proof wholesale. A hall commissioned for 40 kW air racks that now hosts 132 kW liquid-cooled NVL72, or a 132 kW hall absorbing a ~600 kW Kyber-class generation, is a different thermal and electrical machine. The original IST said nothing about how this building behaves; nobody has ever black-building-tested this configuration. → Chapter 5.1 on the density wall the step crosses.

Periodic re-testing on a live campus

Three integrated tests defined the original IST, and all three have to be re-run periodically because they prove things that physically degrade: the black-building (pull-the-plug) test, the redundancy failover test, and the thermal ride-through test. The hard part is not the test — it is running it without taking the factory down. The discipline is fault-domain isolation: you never re-test the whole campus at once; you re-test one block, one fault domain, or one concurrent-maintainability boundary at a time, while the rest of the campus carries load and stands ready to absorb the block under test if it fails the test for real.

Black-building re-test. The original test cut utility to an empty building and watched the UPS/BESS-to-generator handoff carry imaginary load. The live re-test cuts utility to one block carrying real GPU load, and the stakes invert: if the BBU/BESS ride-through or the generator start now falls short — because cells aged, because fuel degraded, because a setpoint drifted — you do not lose a test, you drop a live training run and breach an SLA. This is why on a concurrently-maintainable (Tier III–class) campus the live black-building re-test is run per-block on a rolling schedule, and why on a fault-tolerant (Tier IV–class) campus it can be run more aggressively because a second autonomous path is always live. The topology you commissioned at go-live is the topology that decides whether you can afford to re-test at all.

Failover re-test. Redundant paths that are never exercised silently stop being redundant: the standby CDU whose isolation valve was left shut after maintenance, the N+1 generator that will not parallel because a control card drifted, the 2N feed whose static-transfer switch sticks. The only way to know a redundancy claim is still true is to force the transfer under load and watch it hold. Skipping the re-test invites the worst kind of outage: the one where the redundancy you paid a 20–40% capital premium for does not engage when the primary actually fails, turning a maintainable event into a full block drop.

Thermal ride-through re-test. This is the AI-specific one, and the one the original commissioning could never fully prove. Facility load banks reject heat to air; they do not reject it into cold plates and CDUs at realistic transient heat flux. So the question "how long does the liquid loop hold temperature when a CDU trips or a pump fails, with real GPUs at full draw?" was only ever partially answered at go-live (→ Chapter 13.5 on the load-realism limit; Chapter 13.6 on the dynamic-load gap). On a live campus you finally have the real heat source — and you also have the real revenue at risk. Thermal ride-through is the re-test most worth running on live load precisely because it is the one load banks lied about, and the one a density step-up changes most.

Live-campus re-test: what you prove, what it risks, how you isolate it
| Re-test | What it re-proves | What degrades it | Live-campus risk | Isolation strategy | Typical cadence |
| --- | --- | --- | --- | --- | --- |
| Black-building (pull-the-plug) | Utility-loss ride-through and generator handoff under real load | BBU/BESS cell aging; fuel degradation; setpoint drift | Dropped run + SLA breach if handoff falls short | One block at a time; rest of campus absorbs the block | Annual per block; sooner after BESS/genset work |
| Redundancy failover | N+1 / 2N paths still engage on demand | Stuck STS, shut valves, control drift, latent standby faults | Failed transfer escalates a maintainable event to an outage | Force one transfer under load; keep alternate path hot | Semi-annual to annual per fault domain |
| Thermal ride-through | Liquid-loop hold time on CDU/pump loss at real heat flux | Coolant fouling, flow drift, control-loop retuning, density step | Thermal throttle or trip cascades across racks | Single CDU/branch; cap affected racks; pre-stage cooling | Annual; mandatory after any density step-up |
Cadence figures are practitioner ranges; the isolation column is the load-bearing decision. On a concurrently-maintainable campus all three are run per-block on a rolling schedule; on a fault-tolerant campus the standby path lets you re-test more aggressively.
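
The per-block rolling schedule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a scheduling tool: the function name `rolling_schedule`, the block labels, and the start date are hypothetical, and the 365-day window simply reflects the annual cadence in the table above.

```python
from datetime import date, timedelta

def rolling_schedule(blocks, start, cadence_days):
    """Spread one re-test per block evenly across the cadence window,
    so that no two blocks are under test on the same day."""
    if not blocks:
        return {}
    step = cadence_days // len(blocks)
    if step < 1:
        raise ValueError("more blocks than days in the cadence window")
    return {b: start + timedelta(days=i * step) for i, b in enumerate(blocks)}

# Hypothetical campus of four blocks on an annual black-building cadence.
schedule = rolling_schedule(["A", "B", "C", "D"], date(2026, 1, 5), 365)
for block, day in schedule.items():
    print(block, day.isoformat())
```

A real program would also pull a block's date forward after BESS or genset work, as the cadence column notes; that event-driven override is left out here.
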

Re-commissioning triggers, ranked by risk

Re-commissioning fires on triggers, and the triggers are not equal. Ranking them by risk is the core decision of the program, because the rank sets the scope of re-validation, the sign-off authority required, and the amount of live load you are willing to put at risk to get the proof. Three triggers dominate, in descending order of how much of the original commissioning evidence they invalidate.

1 — Density step-up (highest risk). A new accelerator generation that raises rack power is the single most invalidating event on the campus, because it changes the electrical and thermal machine simultaneously and at the rack scale where the original proof was most specific. The power chain that was commissioned for 132 kW racks now sees a different load profile, different transient (NVL72-class power swings and the BBU→BESS→GPU-capacitance smoothing stack), different heat flux, different CDU branch loading, different floor mass. None of the original IST evidence transfers. A density step-up demands the closest thing to a full re-IST that a live campus permits: re-validate the power chain transient behavior (→ Chapter 13.3 electrical acceptance), re-flush and re-accept the liquid loop for the new heat load (→ Chapter 13.5), re-run the thermal ride-through, re-baseline goodput. This is why Chapter 1.1 calls the density ramp the most expensive irreversible mistake of the era — and why the substrate (floor, water, electrical headroom) had to be provisioned for it at scoping time, because re-commissioning cannot create headroom that was never built.

2 — Topology change (high risk). Re-cabling a fault domain, splitting or merging blocks, re-piping a cooling distribution, changing a redundancy boundary, or repurposing a hall changes which failure is contained where, and containment is the whole point of the redundancy topology. The original commissioning proved that a specific set of fault domains held; a topology change can silently merge two domains that were supposed to be independent, so a single fault now takes both. Topology changes demand a re-run of the failover tests across every boundary the change touched, plus a documentation re-baseline so the as-built single-lines match reality. The consequence of skipping it is a redundancy claim that is true on paper and false in the field: the diagram says independent, the copper says shared.

3 — Equipment replacement (managed risk). Swapping a UPS module, a CDU, a generator, a PDU, or a pump under RMA is the most frequent trigger and the most procedural. The replaced unit was never commissioned in this building; the manufacturer's factory test is not your site acceptance. Equipment replacement demands a unit-level re-acceptance (the L3/L4 script for that component) plus a re-test of the redundancy path it sits in, because the act of swapping it is itself a configuration change that can leave an isolation valve shut or a control setpoint at default. This is a partial re-test, signable at a lower authority than a density step — but only if the change-management discipline (→ Chapter 14.12) actually enforces the re-test rather than treating the swap as done when the unit powers on.

Re-commissioning trigger → required re-validation scope and authority
| Trigger | Risk rank | Evidence invalidated | Required re-validation | Sign-off authority |
| --- | --- | --- | --- | --- |
| Density step-up (new generation) | 1 (highest) | Power-chain transient, thermal, fabric, goodput baseline (nearly all) | Near-full re-IST on the affected block: electrical transient + liquid re-accept + thermal ride-through + goodput re-baseline | Cx authority + ops leadership; design-basis change |
| Topology change (re-cable / re-pipe / re-boundary) | 2 (high) | Fault-domain containment and redundancy claims | Failover re-test on every boundary touched + as-built single-line re-baseline | Cx authority + reliability owner |
| Equipment replacement (RMA / swap) | 3 (managed) | Unit-level acceptance + the redundancy path it sits in | Component L3/L4 re-acceptance + path failover re-test | Operations change-board |
| Firmware / software push (fleet-wide) | Cross-cutting | Control behavior, power/thermal response, fabric timing | Canary block soak + behavioral diff vs baseline fingerprint before fleet rollout | Operations + fleet-software owner |
The rank is by how much original commissioning evidence the trigger invalidates. Firmware/software pushes are included as the fast-moving fourth trigger; they are partial-scope but fleet-wide, which makes blast radius the dominant concern. Cross-reference the L1–L5 ladder in Chapter 13.1.
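
One way to make the trigger table enforceable is to hold it as data that a change ticket is checked against. The sketch below is an assumption-laden illustration: the key names, the `Revalidation` type, and the `plan` helper are hypothetical, and the rule that a ticket firing several triggers needs the union of their scopes and sign-offs is one reasonable reading of the table, not something the table itself states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Revalidation:
    rank: str
    scope: tuple      # re-validation steps, in the order the table lists them
    sign_off: str

# Contents copied from the trigger table; only the key names are invented.
TRIGGERS = {
    "density_step_up": Revalidation(
        "1 (highest)",
        ("electrical transient", "liquid re-accept",
         "thermal ride-through", "goodput re-baseline"),
        "Cx authority + ops leadership"),
    "topology_change": Revalidation(
        "2 (high)",
        ("failover re-test on every boundary touched",
         "as-built single-line re-baseline"),
        "Cx authority + reliability owner"),
    "equipment_replacement": Revalidation(
        "3 (managed)",
        ("component L3/L4 re-acceptance", "path failover re-test"),
        "Operations change-board"),
    "firmware_push": Revalidation(
        "cross-cutting",
        ("canary block soak", "behavioral diff vs baseline fingerprint"),
        "Operations + fleet-software owner"),
}

def plan(fired):
    """Union of scopes and sign-offs for every trigger a ticket fires
    (order kept, duplicates dropped). The union rule is an assumption."""
    scope, sign_offs = [], []
    for name in fired:
        entry = TRIGGERS[name]
        scope += [s for s in entry.scope if s not in scope]
        if entry.sign_off not in sign_offs:
            sign_offs.append(entry.sign_off)
    return scope, sign_offs

# A CDU swap that also ships new controller firmware fires two triggers.
scope, sign_offs = plan(["equipment_replacement", "firmware_push"])
```
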
| Figure | What it measures | Date | Source |
| --- | --- | --- | --- |
| 70–80% | Share of serious data center outages with human error in changes/procedures as a root cause, i.e. configuration drift coming due | 2025 | Uptime Institute Global Outage Analysis 2025 |
| ~45% | Share of significant outages caused by on-site power problems, the systems re-commissioning re-proves | 2025 | Uptime Institute Global Data Center Survey 2025 |
| 10–20% | Average energy savings from retro-/re-commissioning of an existing facility (often more) | 2023 | U.S. DOE FEMP / LBNL retro-commissioning studies |
| ~7 days | MTBF per 512 H100 GPUs at a best-in-class operator; the failure cadence that keeps re-test triggers firing | 2025 | SemiAnalysis (100k H100 clusters) |
| Industry ~90% / best ~96% | Training goodput; the baseline a density-step re-commissioning must re-establish | 2025 | SemiAnalysis ClusterMAX / CoreWeave |
| 99.982% / 99.995% | Tier III vs Tier IV availability (~1.6 hr vs ~26 min/yr); sets whether you can re-test on live load at all | 2025 | Uptime Institute (Tier classes, % disavowed) |
| 20–40% | Capital premium for Tier IV over Tier III; the redundancy a failover re-test confirms still engages | 2025 | Uptime Institute |
| 132 kW → ~600 kW | Per-rack density step (NVL72 → Kyber-class) that fully invalidates prior commissioning evidence | 2025–H2 2027 (announced) | NVIDIA GB200 NVL72 / GTC Rubin Ultra Kyber |
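
The downtime figures quoted alongside the Tier availability percentages follow from simple arithmetic, shown here as a check. The helper name is hypothetical and the calculation assumes a non-leap 8,760-hour year; as the source note indicates, Uptime Institute itself disavows the percentages, so treat the result as an order-of-magnitude budget rather than a guarantee.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year

def annual_downtime_hours(availability_pct):
    """Expected downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

tier_iii_hours = annual_downtime_hours(99.982)       # about 1.58 hours
tier_iv_minutes = annual_downtime_hours(99.995) * 60  # about 26.3 minutes
print(f"Tier III: {tier_iii_hours:.2f} h/yr, Tier IV: {tier_iv_minutes:.1f} min/yr")
```

Seen this way, a live re-test that goes wrong on a Tier III-class block can consume the whole annual downtime budget in a single event.
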

The drift-detection → re-validation loop

Triggering re-commissioning on the calendar alone is necessary but not sufficient — it catches the slow physical drift but misses the fast configuration drift that does the real damage between scheduled tests. The mature program closes a loop: the same DCIM, telemetry, and observability stack that runs the campus (→ Chapter 14.2) continuously compares the live configuration and live behavior against the commissioning fingerprint — the baseline captured at go-live (→ Chapter 13.2) — and fires a re-validation trigger when the gap crosses a threshold.

The loop has four stages. Baseline: the as-commissioned fingerprint — power-chain transient signatures, CDU flow/thermal curves, fabric BER and timing, redundancy-path states, firmware inventory. Observe: the live telemetry stream, including the firmware/software lifecycle state from fleet management (→ Chapter 14.8). Diff: the automated comparison that flags when the running configuration has drifted off the fingerprint — a setpoint that no longer matches the SOO, a firmware version that diverged from the approved baseline, a failover path that has not been exercised within its window, a thermal margin that has eroded. Trigger: the diff escalates into a re-test scope sized to what drifted — a firmware divergence triggers a canary soak; an eroded thermal margin triggers a ride-through re-test; an unexercised path triggers a failover re-test.
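
The four stages can be sketched as a small comparison routine. Everything named here is hypothetical: the `Snapshot` fields are a drastically reduced stand-in for a real fingerprint, the firmware strings and numeric limits are invented, and the response to a setpoint mismatch is an assumption because the text names the finding but not the re-test it should trigger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    firmware: str              # approved or running firmware version
    supply_setpoint_c: float   # coolant supply setpoint per the SOO
    thermal_margin_c: float    # measured margin to the thermal limit
    days_since_failover: int   # days since the standby path was exercised

def diff(baseline, live, margin_floor_c, failover_window_days):
    """Diff stage: list every way the live state has left the fingerprint."""
    findings = []
    if live.firmware != baseline.firmware:
        findings.append("firmware_divergence")
    if live.supply_setpoint_c != baseline.supply_setpoint_c:
        findings.append("setpoint_mismatch")
    if live.thermal_margin_c < margin_floor_c:
        findings.append("thermal_margin_eroded")
    if live.days_since_failover > failover_window_days:
        findings.append("path_unexercised")
    return findings

# Trigger stage: re-test scope sized to what drifted.
RETEST = {
    "firmware_divergence": "canary soak",
    "thermal_margin_eroded": "thermal ride-through re-test",
    "path_unexercised": "failover re-test",
    "setpoint_mismatch": "setpoint review against SOO",  # assumed response
}

baseline = Snapshot("fw-1.0", 30.0, 8.0, 0)   # Baseline stage (hypothetical values)
live = Snapshot("fw-1.1", 30.0, 3.5, 200)     # Observe stage (hypothetical values)
findings = diff(baseline, live, margin_floor_c=5.0, failover_window_days=180)
retests = [RETEST[f] for f in findings]
```
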

The decision this loop forces is where you set the threshold. Too tight and you re-commission constantly, putting live load at risk for noise and burning the operations team out. Too loose and you re-discover the gap as an outage. The right setting is workload-aware: a campus running checkpointable training (which already tolerates interruption) can run looser thresholds and accept more drift before re-testing; a campus running latency-bound inference under a hard SLA must run tighter, because for it the outage the drift would cause is unrecoverable revenue. → the goodput-vs-availability framing in Chapter 12.2.
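
A workload-aware threshold might look like the sketch below. The tolerance values and the workload class names are purely illustrative assumptions; the point is only that the same measured drift can warrant a re-test on one campus and not on another.

```python
# Hypothetical relative-drift tolerances per workload class.
DRIFT_TOLERANCE = {
    "checkpointable_training": 0.10,   # looser: interruption is tolerated
    "latency_bound_inference": 0.03,   # tighter: hard SLA, unrecoverable revenue
}

def should_retest(baseline, live, workload):
    """True when relative drift from the commissioned value exceeds
    the tolerance set for this workload class."""
    drift = abs(live - baseline) / abs(baseline)
    return drift > DRIFT_TOLERANCE[workload]

# Example: commissioned liquid-loop hold time 300 s, now measured at 280 s.
training_retest = should_retest(300.0, 280.0, "checkpointable_training")
inference_retest = should_retest(300.0, 280.0, "latency_bound_inference")
```

Here a drift of roughly 6.7% stays inside the training tolerance but crosses the inference one, so only the inference campus schedules a ride-through re-test.
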

Deep dive: the firmware push as a fleet-wide re-commissioning event in disguise

A fleet firmware update looks like a software operation and is treated like one — schedule a maintenance window, push the image, confirm version. But a firmware change to a GPU, a BMC, a PDU controller, a CDU controller, or a UPS module changes the behavior the campus was commissioned around: power-capping response, thermal throttle curves, transient draw under NVL72-class swings, fabric timing, alarm logic. A push that subtly alters how thousands of accelerators respond to a power excursion has, in effect, re-configured the load the electrical system was commissioned against — across the entire fleet at once. The original IST's evidence about transient stability no longer describes the running building.

This is why a disciplined fleet-software program treats every behavior-affecting firmware push as a partial re-commissioning event with a canary: roll to one isolated block, soak it, and diff its power/thermal/fabric behavior against the commissioning fingerprint before the fleet rollout proceeds. The blast radius is what makes this trigger uniquely dangerous — equipment replacement touches one unit, a density step touches one block, but a bad firmware push touches the whole campus simultaneously and removes the option of isolating the failure after the fact. The canary is the only place to catch it. The OCP GPU firmware update specification (Redfish, PLDM-over-MCTP, secure out-of-band) gives the mechanism; the canary-and-diff discipline gives the safety. → firmware lifecycle in Chapter 14.8; change control in Chapter 14.12.
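
The canary-and-diff gate reduces to a per-metric comparison between the soaked canary block and the commissioning fingerprint. The sketch below assumes hypothetical metric names, values, and tolerances, and it says nothing about how the firmware is delivered; it only shows the decision that holds or releases the fleet rollout.

```python
def canary_gate(fingerprint, canary, tolerance):
    """Compare each soaked canary metric with the fingerprint.
    Returns (ok, violations), where violations maps a metric name to
    its relative deviation when that exceeds the allowed tolerance."""
    violations = {}
    for metric, base in fingerprint.items():
        deviation = abs(canary[metric] - base) / abs(base)
        if deviation > tolerance[metric]:
            violations[metric] = round(deviation, 4)
    return (not violations, violations)

# Hypothetical fingerprint, canary readings, and tolerances.
fingerprint = {"peak_rack_kw": 132.0, "power_ramp_kw_per_s": 40.0, "coolant_return_c": 45.0}
canary = {"peak_rack_kw": 133.0, "power_ramp_kw_per_s": 60.0, "coolant_return_c": 45.5}
tolerance = {"peak_rack_kw": 0.02, "power_ramp_kw_per_s": 0.10, "coolant_return_c": 0.05}

ok, violations = canary_gate(fingerprint, canary, tolerance)
if not ok:
    print("hold fleet rollout:", violations)
```

In this example peak power and coolant temperature look unchanged, yet the power ramp rate has moved by half. That is the kind of transient-behavior shift a version check alone would never surface.
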

Tie-in: DR drills and the FMEA catalog

Re-commissioning does not live alone — it is one of three recurring proof activities that share the same machinery and should be planned together. The other two are disaster-recovery drills and FMEA-driven failure-mode demonstration.

DR drills (→ Chapter 12.3) are re-commissioning at the campus-and-region scale. A failover drill that fails workload from one site to another, or exercises geographic redundancy, is testing the same kind of claim a black-building re-test does — "the redundant path engages on demand" — just at a larger fault domain. Co-scheduling them is efficient and honest: the DR drill that proves cross-site failover should be the same event that re-proves the local black-building handoff feeding it, because a DR failover that lands on a site whose own generator handoff has drifted is a drill that proves the wrong thing.

The FMEA catalog (Appendix F) is the master register that decides what gets demonstrated, how, and how often. Every failure mode in the catalog carries a re-commissioning treatment: live-demonstrable on a single block (re-test it on a schedule), sub-scale or next-empty-block demonstrable (prove it before that block goes live), or model-only (too dangerous to demonstrate on live load — prove it analytically and via the quantitative reliability model in Chapter 12.5). The catalog is also dual-use: each failure mode is treated as both random and attacker-induced, so the re-commissioning that proves you survive a CDU trip is the same evidence that proves you survive a CDU trip an adversary caused. The discipline that keeps this from becoming theater is the same one that governs the whole chapter — demonstrate what you safely can on live load, model what you cannot, and never let a checklist push you across that line on a running factory.
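
The three treatments can be represented as a simple classification. The treatment names come from the text above, but the decision inputs (`isolable_to_one_block`, `safe_on_live_load`, `reproducible_sub_scale`) and the example entries are assumptions made for illustration; Appendix F's actual criteria are not reproduced here.

```python
from enum import Enum

class Treatment(Enum):
    LIVE_SINGLE_BLOCK = "re-test on a schedule"
    SUB_SCALE_OR_EMPTY_BLOCK = "prove before the block goes live"
    MODEL_ONLY = "prove analytically and via the reliability model"

def classify(isolable_to_one_block, safe_on_live_load, reproducible_sub_scale):
    """Pick a re-commissioning treatment for one failure mode.
    The three boolean criteria are hypothetical, not from Appendix F."""
    if isolable_to_one_block and safe_on_live_load:
        return Treatment.LIVE_SINGLE_BLOCK
    if reproducible_sub_scale:
        return Treatment.SUB_SCALE_OR_EMPTY_BLOCK
    return Treatment.MODEL_ONLY

# Hypothetical catalog entries: (isolable, safe on live load, sub-scale reproducible).
catalog = {
    "single CDU trip": (True, True, True),
    "new-generation rack thermal runaway": (True, False, True),
    "whole-campus utility loss": (False, False, False),
}
treatments = {mode: classify(*flags) for mode, flags in catalog.items()}
```
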

This chapter closes the lifecycle loop opened in Part 13: the original program in Chapter 13.1 (L1–L5 ladder), the fingerprint baseline in Chapter 13.2, electrical acceptance in Chapter 13.3, cooling acceptance and its load-realism limit in Chapter 13.5, and the L5 IST / dynamic-load gap in Chapter 13.6 are all the evidence this chapter re-proves. The drift it tracks comes from the DCIM/telemetry stack in Chapter 14.2 and the firmware lifecycle in Chapter 14.8; the human-error and change-control discipline that makes re-testing safe is in Chapter 14.12; the density step that triggers the highest-risk re-commissioning is scoped in Chapter 1.1 and engineered in Chapter 5.1. The DR drills it co-schedules with are in Chapter 12.3; the goodput-vs-availability tradeoff that sets its thresholds is in Chapter 12.2; and the quantitative model for the failure modes too dangerous to demonstrate live is in Chapter 12.5.