Chapter 14.14

Continuous Commissioning & Re-Commissioning on a Live Campus

Commissioning is not a one-time gate you pass at go-live; on a live AI campus that adds a new accelerator generation every 12–18 months, the proof that the building still does what its drawings claim decays continuously — so re-commissioning becomes a standing operational program, triggered by drift and density steps, executed against a running revenue factory you cannot turn off.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Which integrated tests (black-building, failover, thermal ride-through) you re-run on a live campus, and at what cadence — annual calendar, drift-triggered, or both — given that every re-test puts revenue-earning load at risk.
Which re-commissioning triggers you treat as mandatory full re-validation (density step-up, topology change) versus partial re-test (equipment swap, firmware push), and who has authority to sign the difference.
How you isolate the test scope so a re-commissioning event stresses one block or fault domain at a time, never the whole factory — and whether your topology even permits that isolation.
How the drift-detection loop from your DCIM and telemetry (Chapters 14.2 / 14.8) closes into a re-validation trigger automatically, rather than waiting for an outage to reveal the gap.
Which Appendix F failure modes you actively demonstrate on the live campus on a recurring schedule, versus the ones you only model — and the cost of being wrong about that line.

The commissioning program in Part 13 ended at handover: the L5 integrated systems test passed, the black-building test demonstrated, the proxy training run hit its goodput gate, and operations took the keys. That is the moment the building was proven. It is also the moment the proof started to expire. An AI campus is not a static asset that drifts slowly like a legacy enterprise hall — it is a continuously mutating machine. Racks get denser, CDUs get re-piped, firmware gets pushed fleet-wide, a UPS module gets RMA'd, a fault domain gets re-cabled, a setpoint gets nudged to chase PUE. Each of those changes invalidates some slice of the commissioning evidence, and none of them announce that they have done so. Re-commissioning is the discipline of re-proving the building against its own design intent, on a cadence and on triggers, while it is full of paying load.

Re-commissioning runs under a constraint the original Cx team never faced: you cannot take the factory down to test it. The original IST happened in an empty building with load banks and a sacrificial proxy run; the re-test happens on a live campus where the load banks are real GPUs running real jobs worth real money per hour. That single difference reorders everything — what you test, how you isolate it, when you accept the risk, and what you simply decline to demonstrate because the demonstration itself is more dangerous than the failure mode it would reveal.

Why the proof decays: the half-life of a commissioning result

A commissioning result is a statement of the form "under these conditions, this system behaves this way." It stays true exactly as long as the conditions hold. On a frontier campus the conditions never hold for long. Three forces erode the proof, and they erode it at very different rates.

Physical drift is the slow one. Coolant chemistry degrades and biofouls the cold plates; quick-disconnects seep; UPS and BESS cells age and lose ride-through margin; generator fuel polymerizes; CRAH coils foul; breaker contacts and bus connections develop resistance. None of this is visible on a single-line diagram, and all of it moves the real failover and ride-through behavior away from the commissioned number. This is the regime that classic retro-commissioning was invented for, and the DOE/LBNL evidence base is unambiguous that the savings and reliability left on the table are large — but that program was built for buildings that barely change, not for ones that double in density every other year.

Configuration drift is the fast one and the dangerous one. Every change ticket (a firmware bump, a setpoint edit, a re-cabled rail, a swapped PDU, a BMS logic tweak to silence a nuisance alarm) moves the system off the configuration that was commissioned. Individually each is trivial. In aggregate, after eighteen months of a live campus, the running configuration and the as-commissioned configuration have diverged so far that the original test evidence describes a building that no longer exists. The Uptime Institute's outage analysis keeps finding the same thing: human error in changes and procedures is a root cause in the large majority of serious outages, which is precisely configuration drift coming due.

Step changes are the violent ones. A density step-up from one accelerator generation to the next does not drift the building off its proof — it invalidates the proof wholesale. A hall commissioned for 40 kW air racks that now hosts 132 kW liquid-cooled NVL72, or a 132 kW hall absorbing a ~600 kW Kyber-class generation, is a different thermal and electrical machine. The original IST said nothing about how this building behaves; nobody has ever black-building-tested this configuration. → Chapter 5.1 on the density wall the step crosses.

Periodic re-testing on a live campus

Three integrated tests defined the original IST, and all three have to be re-run periodically because they prove things that physically degrade: the black-building (pull-the-plug) test, the redundancy failover test, and the thermal ride-through test. The hard part is not the test — it is running it without taking the factory down. The discipline is fault-domain isolation: you never re-test the whole campus at once; you re-test one block, one fault domain, or one concurrent-maintainability boundary at a time, while the rest of the campus carries load and stands ready to absorb the block under test if it fails the test for real.
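The isolation rule above can be read as a pre-check that runs before any block goes under test: the remaining blocks must hold enough spare capacity to absorb the test block's load if the test fails for real. The following is a minimal sketch under stated assumptions; the `Block` fields, the `can_isolate` helper, and all kW figures are hypothetical, and a real check would also have to respect electrical and cooling path limits, not just a campus-wide sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    capacity_kw: float  # commissioned capacity (hypothetical figure)
    load_kw: float      # load currently carried


def can_isolate(under_test: str, blocks: list[Block]) -> bool:
    """True if the other blocks have enough spare capacity to absorb the
    load of the block under test, should it fail the test for real."""
    target = next(b for b in blocks if b.name == under_test)
    spare_kw = sum(b.capacity_kw - b.load_kw
                   for b in blocks if b.name != under_test)
    return spare_kw >= target.load_kw


# Hypothetical campuses: one with headroom, one running hot.
relaxed = [Block("A", 10_000, 3_000), Block("B", 10_000, 6_000),
           Block("C", 10_000, 9_500)]
loaded = [Block("A", 10_000, 9_000), Block("B", 10_000, 9_000),
          Block("C", 10_000, 9_500)]

print(can_isolate("A", relaxed), can_isolate("A", loaded))
```

A campus that fails this check for every block has no safe re-test window at all, which is the topology question the chapter asks you to settle first.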

Black-building re-test. The original test cut utility to an empty building and watched the UPS/BESS-to-generator handoff carry load-bank load. The live re-test cuts utility to one block carrying real GPU load, and the stakes invert: if the BBU/BESS ride-through or the generator start now falls short (because cells aged, fuel degraded, or a setpoint drifted), you do not lose a test; you drop a live training run and breach an SLA. This is why on a concurrently-maintainable (Tier III–class) campus the live black-building re-test is run per-block on a rolling schedule, and why on a fault-tolerant (Tier IV–class) campus it can be run more aggressively: a second autonomous path is always live. The topology you commissioned at go-live is the topology that decides whether you can afford to re-test at all.

Failover re-test. Redundant paths that are never exercised silently stop being redundant: the standby CDU whose isolation valve was left shut after maintenance, the N+1 generator that will not parallel because a control card drifted, the 2N feed whose static-transfer switch sticks. The only way to know a redundancy claim is still true is to force the transfer under load and watch it hold. The consequence of not re-testing is the worst kind of outage: the one where the redundancy you paid a 20–40% capital premium for does not engage when the primary actually fails, turning a maintainable event into a full block drop.

Thermal ride-through re-test. This is the AI-specific one, and the one the original commissioning could never fully prove. Facility load banks reject heat to air; they do not reject it into cold plates and CDUs at realistic transient heat flux. So the question "how long does the liquid loop hold temperature when a CDU trips or a pump fails, with real GPUs at full draw?" was only ever partially answered at go-live (→ Chapter 13.5 on the load-realism limit; Chapter 13.6 on the dynamic-load gap). On a live campus you finally have the real heat source — and you also have the real revenue at risk. Thermal ride-through is the re-test most worth running on live load precisely because it is the one load banks lied about, and the one a density step-up changes most.
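To see why a density step changes the ride-through answer so sharply, a lumped estimate helps: with heat rejection lost, a loop of coolant mass m and specific heat c_p warms by a margin ΔT in roughly t = m · c_p · ΔT / Q under heat load Q. The sketch below assumes a hypothetical 500 kg of water-like coolant and 10 K of margin; it ignores the thermal mass of metal, residual flow, and GPU throttling, so it illustrates the scaling and does not replace the live re-test.

```python
def hold_time_s(coolant_mass_kg: float, cp_j_per_kg_k: float,
                margin_k: float, heat_load_w: float) -> float:
    """Seconds until the loop warms by margin_k with no heat rejection.

    Lumped estimate: t = m * c_p * dT / Q.
    """
    if heat_load_w <= 0:
        raise ValueError("heat_load_w must be positive")
    return coolant_mass_kg * cp_j_per_kg_k * margin_k / heat_load_w


# Hypothetical loop parameters (assumptions, not figures from this guide).
LOOP = dict(coolant_mass_kg=500.0, cp_j_per_kg_k=4186.0, margin_k=10.0)

t_132 = hold_time_s(heat_load_w=132_000.0, **LOOP)  # one 132 kW rack
t_600 = hold_time_s(heat_load_w=600_000.0, **LOOP)  # one ~600 kW rack

print(f"{t_132:.0f} s at 132 kW, {t_600:.0f} s at 600 kW")
```

Hold time scales inversely with heat load, so the same loop that buys minutes at one generation may buy well under a minute at the next.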

Live-campus re-test: what you prove, what it risks, how you isolate it

Re-test	What it re-proves	What degrades it	Live-campus risk	Isolation strategy	Typical cadence
Black-building (pull-the-plug)	Utility-loss ride-through and generator handoff under real load	BBU/BESS cell aging; fuel degradation; setpoint drift	Dropped run + SLA breach if handoff falls short	One block at a time; rest of campus absorbs the block	Annual per block; sooner after BESS/genset work
Redundancy failover	N+1 / 2N paths still engage on demand	Stuck STS, shut valves, control drift, latent standby faults	Failed transfer escalates a maintainable event to an outage	Force one transfer under load; keep alternate path hot	Semi-annual to annual per fault domain
Thermal ride-through	Liquid-loop hold time on CDU/pump loss at real heat flux	Coolant fouling, flow drift, control-loop retuning, density step	Thermal throttle or trip cascades across racks	Single CDU/branch; cap affected racks; pre-stage cooling	Annual; mandatory after any density step-up

Cadence figures are practitioner ranges; the isolation column is the load-bearing decision. On a concurrently-maintainable campus all three are run per-block on a rolling schedule; on a fault-tolerant campus the standby path lets you re-test more aggressively.
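The per-block rolling schedule described above can be sketched as an even spread of test dates across the year, so that no two blocks are under test on the same day. The `rolling_schedule` helper, the block names, and the start date are hypothetical; a real plan would also dodge maintenance windows and major training runs.

```python
from datetime import date, timedelta


def rolling_schedule(blocks: list[str], start: date,
                     period_days: int = 365) -> dict[str, date]:
    """Assign each block one re-test date, evenly spaced over the period."""
    if not blocks:
        raise ValueError("at least one block is required")
    step = period_days // len(blocks)
    if step < 1:
        raise ValueError("more blocks than days in the period")
    return {name: start + timedelta(days=i * step)
            for i, name in enumerate(blocks)}


plan = rolling_schedule(["A", "B", "C", "D"], date(2026, 1, 5))
for name, day in plan.items():
    print(name, day.isoformat())
```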

The test that is more dangerous than the failure it demonstrates

Not every commissioning test should be re-run on a live campus. A full concurrent-fault demonstration — pulling utility and failing a generator and losing a CDU simultaneously, the kind of cascading scenario the original L5 IST might have included — can be more destructive on live load than the compound failure it is meant to prove you survive. The discipline is to demonstrate single faults on live load and reserve compound/cascading faults for modeling, sub-scale rigs, or the next empty block before it goes live. The FMEA catalog in Appendix F is the register that records this line: each failure mode is tagged as live-demonstrable, sub-scale-demonstrable, or model-only. Crossing that line because a checklist says "demonstrate" is how a re-commissioning event becomes the incident it was supposed to prevent. → human-error control in Chapter 14.12.
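One way to keep a checklist from pushing a compound test onto live load is to encode the demonstration line as a guard. The sketch below is illustrative only: the three tags come from the text, but the catalog entries and the `may_demonstrate_live` helper are hypothetical, and the real tags live in Appendix F.

```python
from enum import Enum


class Treatment(Enum):
    LIVE = "live-demonstrable"
    SUB_SCALE = "sub-scale-demonstrable"
    MODEL_ONLY = "model-only"


# Hypothetical catalog entries, for illustration only.
CATALOG = {
    "utility-loss": Treatment.LIVE,
    "cdu-trip": Treatment.LIVE,
    "generator-fail-to-start": Treatment.SUB_SCALE,
    "utility-loss+generator-fail+cdu-trip": Treatment.MODEL_ONLY,
}


def may_demonstrate_live(failure_modes: list[str]) -> bool:
    """Allow a live demonstration only for a single, live-tagged fault."""
    if len(failure_modes) != 1:
        return False  # compound faults are never demonstrated on live load
    # Unknown modes return None from .get and are therefore refused.
    return CATALOG.get(failure_modes[0]) is Treatment.LIVE


print(may_demonstrate_live(["cdu-trip"]))
print(may_demonstrate_live(["utility-loss", "cdu-trip"]))
```

Refusing unknown and compound requests by default means the burden falls on whoever wants to move a failure mode across the line, not on the operator holding the checklist.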

Re-commissioning triggers, ranked by risk

Re-commissioning fires on triggers, and the triggers are not equal. Ranking them by risk is the core decision of the program, because the rank sets the scope of re-validation, the sign-off authority required, and the amount of live load you are willing to put at risk to get the proof. Three triggers dominate, in descending order of how much of the original commissioning evidence they invalidate; a fourth, the fleet-wide firmware or software push, cuts across all three and appears in the table that follows the ranked list.

1 — Density step-up (highest risk). A new accelerator generation that raises rack power is the single most invalidating event on the campus, because it changes the electrical and thermal machine simultaneously and at the rack scale where the original proof was most specific. The power chain that was commissioned for 132 kW racks now sees a different load profile, different transients (NVL72-class power swings and the BBU→BESS→GPU-capacitance smoothing stack), different heat flux, different CDU branch loading, different floor mass. Almost none of the original IST evidence transfers. A density step-up demands the closest thing to a full re-IST that a live campus permits: re-validate the power chain transient behavior (→ Chapter 13.3 electrical acceptance), re-flush and re-accept the liquid loop for the new heat load (→ Chapter 13.5), re-run the thermal ride-through, re-baseline goodput. This is why Chapter 1.1 calls the density ramp the most expensive irreversible mistake of the era, and why the substrate (floor, water, electrical headroom) had to be provisioned for it at scoping time: re-commissioning cannot create headroom that was never built.

2 — Topology change (high risk). Re-cabling a fault domain, splitting or merging blocks, re-piping a cooling distribution, changing a redundancy boundary, or repurposing a hall changes which failure is contained where, and containment is the whole point of the redundancy topology. The original commissioning proved a specific set of fault domains held; a topology change can silently merge two domains that were supposed to be independent, so a single fault now takes both. Topology changes demand a re-run of the failover tests across every boundary the change touched, plus a documentation re-baseline so the as-built single-lines match reality. The consequence of skipping it is a redundancy claim that is true on paper and false in the copper: the diagram says independent, the copper says shared.

3 — Equipment replacement (managed risk). Swapping a UPS module, a CDU, a generator, a PDU, or a pump under RMA is the most frequent trigger and the most procedural. The replaced unit was never commissioned in this building; the manufacturer's factory test is not your site acceptance. Equipment replacement demands a unit-level re-acceptance (the L3/L4 script for that component) plus a re-test of the redundancy path it sits in, because the act of swapping it is itself a configuration change that can leave an isolation valve shut or a control setpoint at default. This is a partial re-test, signable at a lower authority than a density step — but only if the change-management discipline (→ Chapter 14.12) actually enforces the re-test rather than treating the swap as done when the unit powers on.

Re-commissioning trigger → required re-validation scope and authority

Trigger	Risk rank	Evidence invalidated	Required re-validation	Sign-off authority
Density step-up (new generation)	1 — highest	Power-chain transient, thermal, fabric, goodput baseline — nearly all	Near-full re-IST on the affected block: electrical transient + liquid re-accept + thermal ride-through + goodput re-baseline	Cx authority + ops leadership; design-basis change
Topology change (re-cable / re-pipe / re-boundary)	2 — high	Fault-domain containment and redundancy claims	Failover re-test on every boundary touched + as-built single-line re-baseline	Cx authority + reliability owner
Equipment replacement (RMA / swap)	3 — managed	Unit-level acceptance + the redundancy path it sits in	Component L3/L4 re-acceptance + path failover re-test	Operations change-board
Firmware / software push (fleet-wide)	Cross-cutting	Control behavior, power/thermal response, fabric timing	Canary block soak + behavioral diff vs baseline fingerprint before fleet rollout	Operations + fleet-software owner

The rank is by how much original commissioning evidence the trigger invalidates. Firmware/software pushes are included as the fast-moving fourth trigger; they are partial-scope but fleet-wide, which makes blast radius the dominant concern. Cross-reference the L1–L5 ladder in Chapter 13.1.
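The trigger table lends itself to a lookup that a change-management tool could consult when a ticket is classified. This is a minimal sketch: the scopes and sign-off authorities are copied from the table, while the key names, the `Revalidation` type, and the `required_revalidation` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revalidation:
    rank: str
    scope: tuple[str, ...]
    sign_off: str


TRIGGERS = {
    "density_step_up": Revalidation(
        "1 (highest)",
        ("electrical transient", "liquid-loop re-acceptance",
         "thermal ride-through", "goodput re-baseline"),
        "Cx authority + ops leadership"),
    "topology_change": Revalidation(
        "2 (high)",
        ("failover re-test on every boundary touched",
         "as-built single-line re-baseline"),
        "Cx authority + reliability owner"),
    "equipment_replacement": Revalidation(
        "3 (managed)",
        ("component L3/L4 re-acceptance", "path failover re-test"),
        "Operations change-board"),
    "firmware_push": Revalidation(
        "cross-cutting",
        ("canary block soak", "behavioral diff vs baseline fingerprint"),
        "Operations + fleet-software owner"),
}


def required_revalidation(trigger: str) -> Revalidation:
    """Look up scope and authority; unknown triggers are an error,
    because an unclassified change must not default to 'no re-test'."""
    try:
        return TRIGGERS[trigger]
    except KeyError:
        raise ValueError(f"unclassified trigger: {trigger!r}") from None


print(required_revalidation("equipment_replacement").sign_off)
```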

70–80%

share of serious data center outages with human error in changes/procedures as a root cause — i.e. configuration drift coming due

2025 · Uptime Institute Global Outage Analysis 2025

~45%

share of significant outages caused by on-site power problems — the systems re-commissioning re-proves

2025 · Uptime Institute Global Data Center Survey 2025

10–20%

average energy savings from retro-/re-commissioning of an existing facility (often more)

2023 · U.S. DOE FEMP / LBNL retro-commissioning studies

~7 days

MTBF per 512 H100 GPUs at a best-in-class operator — the failure cadence that keeps re-test triggers firing

2025 · SemiAnalysis (100k H100 clusters)

industry ~90% / best ~96%

training goodput — the baseline a density-step re-commissioning must re-establish

2025 · SemiAnalysis ClusterMAX / CoreWeave

99.982% / 99.995%

Tier III vs Tier IV availability (~1.6 hr vs ~26 min/yr) — sets whether you can re-test on live load at all

2025 · Uptime Institute (Tier classes; percentages disavowed)

20–40%

capital premium for Tier IV over Tier III — the redundancy a failover re-test confirms still engages

2025 · Uptime Institute

132 kW → ~600 kW

per-rack density step (NVL72 → Kyber-class) that fully invalidates prior commissioning evidence

2025–H2 2027 (announced) · NVIDIA GB200 NVL72 / GTC Rubin Ultra Kyber
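The Tier availability figures quoted above convert to annual downtime with simple arithmetic, which is worth checking when you budget how much of that allowance a live re-test could consume. A small sketch, assuming a non-leap year of 8,760 hours; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Annual downtime implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * 60


tier_iii = downtime_minutes_per_year(99.982)  # about 94.6 min, i.e. ~1.6 hr
tier_iv = downtime_minutes_per_year(99.995)   # about 26.3 min

print(f"Tier III: {tier_iii / 60:.1f} hr/yr, Tier IV: {tier_iv:.0f} min/yr")
```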

The drift-detection → re-validation loop

Triggering re-commissioning on the calendar alone is necessary but not sufficient — it catches the slow physical drift but misses the fast configuration drift that does the real damage between scheduled tests. The mature program closes a loop: the same DCIM, telemetry, and observability stack that runs the campus (→ Chapter 14.2) continuously compares the live configuration and live behavior against the commissioning fingerprint — the baseline captured at go-live (→ Chapter 13.2) — and fires a re-validation trigger when the gap crosses a threshold.

The loop has four stages. Baseline: the as-commissioned fingerprint — power-chain transient signatures, CDU flow/thermal curves, fabric BER and timing, redundancy-path states, firmware inventory. Observe: the live telemetry stream, including the firmware/software lifecycle state from fleet management (→ Chapter 14.8). Diff: the automated comparison that flags when the running configuration has drifted off the fingerprint — a setpoint that no longer matches the SOO, a firmware version that diverged from the approved baseline, a failover path that has not been exercised within its window, a thermal margin that has eroded. Trigger: the diff escalates into a re-test scope sized to what drifted — a firmware divergence triggers a canary soak; an eroded thermal margin triggers a ride-through re-test; an unexercised path triggers a failover re-test.
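The Diff and Trigger stages can be sketched as one function that compares a live snapshot against the commissioning fingerprint and returns a re-test scope sized to what drifted. Everything here is an assumption for illustration: the `Snapshot` fields are a small subset of a real fingerprint, and the 2 K erosion limit and 180-day failover window are placeholder thresholds, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    firmware: dict[str, str]             # component -> version
    thermal_margin_k: float              # margin to thermal limit
    days_since_failover: dict[str, int]  # path -> days since exercised


def revalidation_triggers(baseline: Snapshot, live: Snapshot,
                          max_erosion_k: float = 2.0,
                          failover_window_days: int = 180) -> list[str]:
    """Diff live state against the fingerprint; return re-tests to schedule."""
    triggers = []
    for component, approved in baseline.firmware.items():
        if live.firmware.get(component) != approved:
            triggers.append(f"canary soak: {component}")
    if baseline.thermal_margin_k - live.thermal_margin_k > max_erosion_k:
        triggers.append("thermal ride-through re-test")
    for path, days in live.days_since_failover.items():
        if days > failover_window_days:
            triggers.append(f"failover re-test: {path}")
    return triggers


baseline = Snapshot({"cdu-ctrl": "1.4.2"}, 10.0, {"feed-B": 0})
live = Snapshot({"cdu-ctrl": "1.5.0"}, 7.0, {"feed-B": 200})
print(revalidation_triggers(baseline, live))
```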

The decision this loop forces is where you set the threshold. Too tight and you re-commission constantly, putting live load at risk for noise and burning the operations team out. Too loose and you re-discover the gap as an outage. The right setting is workload-aware: a campus running checkpointable training (which already tolerates interruption) can run looser thresholds and accept more drift before re-testing; a campus running latency-bound inference under a hard SLA must run tighter, because for it the outage the drift would cause is unrecoverable revenue. → the goodput-vs-availability framing in Chapter 12.2.
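A workload-aware threshold can be as simple as a per-workload tolerance on how much commissioned margin may erode before a re-test fires. The tolerances below (30% for checkpointable training, 10% for latency-bound inference) are hypothetical placeholders chosen only to show the asymmetry; the names are illustrative as well.

```python
# Hypothetical tolerances: fraction of commissioned margin allowed to erode.
DRIFT_TOLERANCE = {
    "checkpointable_training": 0.30,
    "latency_bound_inference": 0.10,
}


def needs_retest(workload: str, baseline_margin: float,
                 live_margin: float) -> bool:
    """True when margin erosion exceeds the tolerance for this workload."""
    if baseline_margin <= 0:
        raise ValueError("baseline_margin must be positive")
    eroded = (baseline_margin - live_margin) / baseline_margin
    return eroded > DRIFT_TOLERANCE[workload]


# Same 20% erosion, different answers depending on what the block runs.
print(needs_retest("checkpointable_training", 10.0, 8.0))
print(needs_retest("latency_bound_inference", 10.0, 8.0))
```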

Deep dive: the firmware push as a fleet-wide re-commissioning event in disguise

A fleet firmware update looks like a software operation and is treated like one — schedule a maintenance window, push the image, confirm version. But a firmware change to a GPU, a BMC, a PDU controller, a CDU controller, or a UPS module changes the behavior the campus was commissioned around: power-capping response, thermal throttle curves, transient draw under NVL72-class swings, fabric timing, alarm logic. A push that subtly alters how thousands of accelerators respond to a power excursion has, in effect, re-configured the load the electrical system was commissioned against — across the entire fleet at once. The original IST's evidence about transient stability no longer describes the running building.

This is why a disciplined fleet-software program treats every behavior-affecting firmware push as a partial re-commissioning event with a canary: roll to one isolated block, soak it, and diff its power/thermal/fabric behavior against the commissioning fingerprint before the fleet rollout proceeds. The blast radius is what makes this trigger uniquely dangerous — equipment replacement touches one unit, a density step touches one block, but a bad firmware push touches the whole campus simultaneously and removes the option of isolating the failure after the fact. The canary is the only place to catch it. The OCP GPU firmware update specification (Redfish, PLDM-over-MCTP, secure out-of-band) gives the mechanism; the canary-and-diff discipline gives the safety. → firmware lifecycle in Chapter 14.8; change control in Chapter 14.12.
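The canary-and-diff gate can be sketched as a comparison of the soaked block's behavior against the commissioning fingerprint, with fleet rollout allowed only if every metric stays within tolerance. The metric names, the 5% relative tolerance, and the `canary_passes` helper are hypothetical; a real gate would draw its metrics from the fingerprint captured at go-live.

```python
def canary_passes(baseline: dict[str, float], canary: dict[str, float],
                  tolerance: float = 0.05) -> bool:
    """True if every fingerprinted metric on the canary block stays
    within a relative tolerance of its commissioning baseline."""
    for metric, expected in baseline.items():
        observed = canary.get(metric)
        if observed is None:
            return False  # a missing metric is treated as a failed soak
        if abs(observed - expected) > tolerance * abs(expected):
            return False
    return True


# Hypothetical fingerprint metrics for one block.
fingerprint = {"peak_transient_kw": 150.0, "throttle_temp_c": 85.0}
good_soak = {"peak_transient_kw": 151.0, "throttle_temp_c": 85.5}
bad_soak = {"peak_transient_kw": 170.0, "throttle_temp_c": 85.0}

print(canary_passes(fingerprint, good_soak), canary_passes(fingerprint, bad_soak))
```

Treating a missing metric as a failure is deliberate: a firmware push that stops reporting a signal has changed observable behavior, which is exactly what the soak exists to catch.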

Tie-in: DR drills and the FMEA catalog

Re-commissioning does not live alone — it is one of three recurring proof activities that share the same machinery and should be planned together. The other two are disaster-recovery drills and FMEA-driven failure-mode demonstration.

DR drills (→ Chapter 12.3) are re-commissioning at the campus-and-region scale. A failover drill that fails workload from one site to another, or exercises geographic redundancy, is testing the same kind of claim a black-building re-test does — "the redundant path engages on demand" — just at a larger fault domain. Co-scheduling them is efficient and honest: the DR drill that proves cross-site failover should be the same event that re-proves the local black-building handoff feeding it, because a DR failover that lands on a site whose own generator handoff has drifted is a drill that proves the wrong thing.

The FMEA catalog (Appendix F) is the master register that decides what gets demonstrated, how, and how often. Every failure mode in the catalog carries a re-commissioning treatment: live-demonstrable on a single block (re-test it on a schedule), sub-scale or next-empty-block demonstrable (prove it before that block goes live), or model-only (too dangerous to demonstrate on live load — prove it analytically and via the quantitative reliability model in Chapter 12.5). The catalog is also dual-use: each failure mode is treated as both random and attacker-induced, so the re-commissioning that proves you survive a CDU trip is the same evidence that proves you survive a CDU trip an adversary caused. The discipline that keeps this from becoming theater is the same one that governs the whole chapter — demonstrate what you safely can on live load, model what you cannot, and never let a checklist push you across that line on a running factory.

The standing re-commissioning program: three decisions to lock

Operationalizing this chapter is three commitments made once and held. First, cadence vs trigger: run a calendar re-test schedule (per-block, rolling) and a drift-triggered one; the calendar catches physical decay, the trigger catches configuration drift. Second, the demonstration line: tag every Appendix F failure mode as live-demonstrable, sub-scale, or model-only, and bind your sign-off authority to that tag so no one re-runs a compound-fault test on live load to satisfy a checklist. Third, the density-step gate: make the next density step-up a mandatory near-full re-IST of the affected block, planned and funded as part of the refresh — not a surprise discovered when the new racks throttle. Lock these three and re-commissioning is a program; leave them implicit and it is whatever the last incident scared you into doing.

This chapter closes the lifecycle loop opened in Part 13: the original program in Chapter 13.1 (L1–L5 ladder), the fingerprint baseline in Chapter 13.2, electrical acceptance in Chapter 13.3, cooling acceptance and its load-realism limit in Chapter 13.5, and the L5 IST / dynamic-load gap in Chapter 13.6 are all the evidence this chapter re-proves. The drift it tracks comes from the DCIM/telemetry stack in Chapter 14.2 and the firmware lifecycle in Chapter 14.8; the human-error and change-control discipline that makes re-testing safe is in Chapter 14.12; the density step that triggers the highest-risk re-commissioning is scoped in Chapter 1.1 and engineered in Chapter 5.1. The DR drills it co-schedules with are in Chapter 12.3; the goodput-vs-availability tradeoff that sets its thresholds is in Chapter 12.2; and the quantitative model for the failure modes too dangerous to demonstrate live is in Chapter 12.5.