Chapter 13.6

Level 5 Integrated Systems Testing (IST) & Failure-Mode Demonstration

Integrated Systems Testing is the last and only chance to fail the building on your own terms — but a load bank can prove the power and cooling chains survive a fault while completely failing to reproduce the millisecond electrical dynamics and worst-case cold-plate heat flux that a real GPU cluster imposes, so IST acceptance must be written as an explicit bridge from what the load bank can demonstrate to what only the first real workload can.

POWER-BOUND | GOODPUT | DENSITY-RAMP

What you'll decide here

What the IST master sequence proves and in what order — the black-building/pull-the-plug test, the cascading and concurrent-fault matrix, and the thermal ride-through window — and which of these are pass/fail gates versus instrumented observations.
Which load-bank technology you commission against — resistive, reactive, or AI-emulating dynamic load banks — and, deliberately, which failure modes that choice leaves un-demonstrated until the proxy training run.
How far down the mitigation stack (BBU → facility BESS → in-rack capacitance → workload power-smoothing) you require IST to exercise, given that a static load bank cannot trigger the very transients that stack exists to absorb.
Which Appendix-F failure modes are demonstrated live on the real building versus signed off by analysis or vendor witness test — and who carries the residual risk of the ones you choose not to trip.
The acceptance criteria that bridge load-bank IST to first-real-workload: the conditions under which you energize GPUs, the instrumented thresholds that gate the ramp, and the deficiencies you are knowingly carrying into the proxy run.

Every level of commissioning before this one tested a subsystem in isolation against its own design intent: the switchgear in Chapter 13.3, the generators and microgrid in Chapter 13.4, the cooling plant and CDUs in Chapter 13.5. Level 5 Integrated Systems Testing (IST) is the first and last time the whole building is run as one machine and then deliberately broken. Its job is not to confirm that things work — that was L1 through L4 — but to confirm that when something fails, the response works: that the automatic transfer fires, the UPS rides through, the cooling does not lapse while power transfers, the BMS and DCIM see the event correctly, and the load survives. IST is the rehearsal of every bad day the facility will ever have, performed once, on purpose, before any revenue-bearing GPU is at risk.

And it is built on a lie of convenience. IST is run against load banks — heaters that consume the design power so the building has something to cool and power while you trip its redundancy. Load banks are how you can pull the plug on 100 MW without owning 100 MW of irreplaceable accelerators. But a load bank is a fundamentally different electrical and thermal object than a GPU cluster. It draws a smooth, steady, controllable load; a GPU fleet draws a violent, synchronized, millisecond-scale sawtooth. This chapter is the canonical treatment of that gap. After the IST master sequence and the failure-mode demonstration, it sets out what each load-bank technology can and cannot reproduce, across both the electrical and the thermal dimensions, and why the proxy training run of Chapter 13.9 is the only true emulator the building will ever see.

IST scope and the master sequence

IST does not begin until every feeding subsystem has its own L4 acceptance signed and its baseline fingerprint captured (Chapter 13.2). The reason is diagnostic, not bureaucratic: if a cooling pump fails during a black-building test, you must already know it passed its standalone test, or you cannot tell whether IST found an integration fault or merely an un-commissioned component. IST tests seams, not parts. Sequencing it after the parts pass is what makes a failure during IST interpretable.
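The entry condition can be written as a simple readiness check. This is a minimal sketch only: the `Subsystem` record, its field names, and the sample entries are hypothetical and do not come from any commissioning tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsystem:
    name: str
    l4_signed: bool          # L4 acceptance signed off
    baseline_captured: bool  # baseline fingerprint recorded (Chapter 13.2)


def ist_blockers(subsystems):
    """Return the names of subsystems that block the start of IST."""
    return [s.name for s in subsystems
            if not (s.l4_signed and s.baseline_captured)]


# Hypothetical sample data for illustration.
subsystems = [
    Subsystem("switchgear", True, True),
    Subsystem("generators", True, True),
    Subsystem("cdu-loop-a", True, False),  # passed L4 but has no baseline yet
]

print(ist_blockers(subsystems))
```

An empty list is the precondition for scheduling IST. Anything else means a failure during IST could not be attributed cleanly to an integration seam.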

The master sequence is a planned escalation, not a single event. It typically runs steady-state proving first (the building holds design load at design conditions for a sustained soak), then single-fault scenarios (lose one utility feed, one generator, one UPS module, one CDU), then the marquee test — the black-building / pull-the-plug test, a hard, unannounced loss of all utility power with the facility at full load — then the concurrent and cascading-fault matrix that proves the redundancy topology actually delivers the tier it was sold as. Planning a full-facility IST takes weeks to months; the execution window for a hyperscale hall is commonly a continuous multi-day campaign so that thermal soak and battery-recharge behavior are observed across real time, not inferred. The deliverable is a witnessed, time-synchronized data record of how the integrated building behaved at every transition, not a pass/fail stamp.

The black-building test: the one transition you cannot model your way out of

The pull-the-plug test is the IST's reason for existing. At full load you open the utility breaker with no warning and watch a chain of automatic responses that must all fire correctly within their windows: the UPS/BESS picks up the IT and critical-cooling load with zero dropped cycles; the generators start, reach voltage and frequency, and the ATS transfers the mechanical plant before the UPS depletes or the chilled-water buffer runs dry; the CDUs and pumps re-energize before coolant temperature drifts past the throttle threshold; and the BMS narrates the entire sequence correctly. The decision IST forces is whether you trust this chain enough to run it live at full load. Operators who skip or de-rate the black-building test — testing at partial load, or with the IT load simulated only on paper — are deferring the failure to the first real utility event, when the load is GPUs and the audience is customers. The downstream cost of a deferred pull-the-plug test is a real outage discovered during a real fault: the single most expensive way to learn that an ATS was mis-coordinated.
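The chain of responses above lends itself to a timeline check against the time-synchronized record. The sketch below assumes each response is logged as seconds after the utility breaker opens. The event names and every deadline are hypothetical placeholders; the real windows come from the UPS autonomy, the generator start specification, and the coolant throttle threshold for the specific facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    event: str
    deadline_s: float  # latest acceptable time after utility loss


def check_timeline(observed, windows):
    """Return (event, reason) tuples for every missed window; empty means all met."""
    failures = []
    for w in windows:
        t = observed.get(w.event)
        if t is None:
            failures.append((w.event, "not observed"))
        elif t > w.deadline_s:
            failures.append((w.event, f"late by {t - w.deadline_s:.1f} s"))
    return failures


# Hypothetical deadlines, for illustration only.
windows = [
    Window("ups_pickup", 0.02),
    Window("generator_ready", 15.0),
    Window("ats_transfer", 20.0),
    Window("cdu_pumps_restored", 25.0),
]

# Hypothetical observed timestamps from one black-building run.
observed = {
    "ups_pickup": 0.0,
    "generator_ready": 11.2,
    "ats_transfer": 13.0,
    "cdu_pumps_restored": 31.5,
}

print(check_timeline(observed, windows))
```

In this invented run the electrical chain meets its windows but the CDU pumps return late, which is exactly the kind of seam the test exists to expose.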

Cascading and concurrent faults; thermal ride-through

A redundancy claim is a claim about concurrent failures, and IST is where it is cashed. A 2N electrical topology is not proven by losing one feed — that is the easy case. It is proven by losing one feed and then, while the building is running on the survivor, inducing a fault on the path that should still be carrying load, to confirm there is no hidden single point of failure where two nominally independent systems share a breaker, a controller, a cable tray, or a cooling loop. The cascading-fault matrix is built directly from the facility FMEA (the consolidated catalog lives in Appendix F): each high-severity failure mode becomes an IST scenario, executed in the order most likely to surface a shared dependency.
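Before the matrix is executed, the hunt for shared dependencies can be rehearsed on paper. The following sketch assumes each nominally independent path has been listed as a set of the components it relies on. The path names and component names are invented for illustration.

```python
from itertools import combinations

# Hypothetical dependency sets for two "independent" feeds.
paths = {
    "feed_A": {"utility_A", "swgr_A", "ups_A", "tray_1", "bms_ctrl_1"},
    "feed_B": {"utility_B", "swgr_B", "ups_B", "tray_1", "bms_ctrl_2"},
}


def shared_dependencies(paths):
    """Map each pair of paths to the components they both depend on."""
    shared = {}
    for (a, deps_a), (b, deps_b) in combinations(sorted(paths.items()), 2):
        common = deps_a & deps_b
        if common:
            shared[(a, b)] = sorted(common)
    return shared


print(shared_dependencies(paths))
```

Any non-empty result is a candidate hidden single point of failure, and a natural scenario to place early in the cascading-fault matrix. The check is only as good as the dependency lists, which is why IST still trips the paths for real.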

For AI factories, the dominant integration risk is no longer electrical. It is thermal ride-through, the seam that did not meaningfully exist in legacy IT. In an air-cooled hall, a brief cooling lapse during power transfer is absorbed by the thermal mass of the room and the air; you have minutes. In a direct-to-chip liquid hall, the thermal mass at the die is almost nothing. A GB200-class rack must hold coolant inlet within roughly 20-25 °C; deviation throttles the GPUs by up to ~50%, and a sustained loss of flow puts the cold plates at risk within seconds, not minutes. So the IST question becomes brutally specific: during the worst-case power transition, do coolant flow and temperature stay inside the envelope, or does the cluster throttle or trip? Critical cooling (CDU pumps, primary loop pumps, the heat-rejection path) must be on the protected (UPS/generator-backed) bus and must re-energize fast enough to bridge the gap. IST is where you measure that gap against a stopwatch, with a load bank standing in for the heat the GPUs would have produced. This is the explicit interlock between facility Cx and cluster burn-in flagged in Chapter 13.5.
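A rough sense of the ride-through window comes from a lumped energy balance: if heat rejection stops but coolant keeps circulating, the loop warms at a rate set by its mass and specific heat. The sketch below is a back-of-envelope estimate under stated assumptions, not a design calculation. The 200 kg coolant inventory, the 120 kW rack load, and the 5 K allowable rise are hypothetical; 4186 J/(kg·K) is the specific heat of water, and glycol mixtures are somewhat lower.

```python
def ride_through_s(coolant_kg, cp_j_per_kg_k, allowed_rise_k, heat_w):
    """Seconds until a perfectly mixed loop warms by allowed_rise_k with no heat rejection."""
    return coolant_kg * cp_j_per_kg_k * allowed_rise_k / heat_w


# Hypothetical inputs: 200 kg of water-like coolant, 5 K of margin, 120 kW of heat.
t = ride_through_s(coolant_kg=200.0, cp_j_per_kg_k=4186.0,
                   allowed_rise_k=5.0, heat_w=120_000.0)
print(f"{t:.1f} s")
```

With these invented inputs the margin is on the order of half a minute. Perfect mixing makes this an optimistic bound, and it applies only while flow continues. If flow itself stops, only the small coolant volume inside the cold plates counts, which is why the text speaks of seconds.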

BMS / DCIM / SCADA integration as a first-class test object

The control and monitoring stack is one of the things under test, not a witness to IST. Three layers must be proven to agree: the BMS (mechanical/electrical building automation), the SCADA / power-management system (switchgear, generators, the microgrid controller of Chapter 13.4), and the DCIM that operations will actually watch on day 2 (Chapter 14.2). The classic IST finding is not a hardware failure at all — it is that the building did the right thing while the DCIM displayed the wrong thing, or raised forty alarms for one event, or missed the event entirely because a Modbus/BACnet mapping was transposed during integration. An alarm flood is itself a failure: an operator who cannot find the root-cause alarm under a cascade of consequential ones will mis-diagnose the next real incident. IST validates the alarm hierarchy, the automatic control sequences (not just manual operation), and the point-to-point mapping from physical sensor to operator screen — the data path that the entire day-2 reliability program (Chapter 14.1) is built on top of.
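The transposed-mapping finding can be screened for before the live test by comparing the design point list with what was actually configured. This is a sketch under assumptions: the point names and register addresses are hypothetical, and a real check would read both lists from the integration database instead of literals.

```python
def transposed_points(expected, configured):
    """Return pairs of points whose configured addresses are swapped with each other."""
    mismatched = [p for p in expected if configured.get(p) != expected[p]]
    swaps = []
    for i, a in enumerate(mismatched):
        for b in mismatched[i + 1:]:
            if configured.get(a) == expected[b] and configured.get(b) == expected[a]:
                swaps.append((a, b))
    return swaps


# Hypothetical design mapping: physical point -> register address.
expected = {
    "cdu1_supply_temp": 40001,
    "cdu1_return_temp": 40002,
    "cdu1_flow": 40003,
}

# Hypothetical as-built mapping with supply and return transposed.
configured = {
    "cdu1_supply_temp": 40002,
    "cdu1_return_temp": 40001,
    "cdu1_flow": 40003,
}

print(transposed_points(expected, configured))
```

A paper check like this catches configuration swaps only. A sensor wired to the wrong input still looks correct in the database, so IST verifies the path physically, from sensor to operator screen.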

The dynamic-load realism gap (the canonical treatment)

Here is the central limitation of every IST ever run, stated plainly: the load you test against is not the load you will run. A load bank exists to consume power and reject heat on command. A GPU cluster running a synchronous training step does something a load bank was never built to do: it swings its entire draw, in lockstep, across thousands of accelerators, on the cadence of the collective-communication pattern. When the all-reduce stalls compute, tens of megawatts can fall in milliseconds; when compute resumes, it returns just as fast. This is the transient physics made canonical in Chapter 4.5. It is the reason the GB300 NVL72 ships with ~65 J/GPU of in-shelf energy storage, which smooths the draw enough to cut peak grid demand by ~30%, and the reason facility BESS designs add their own power-smoothing role. The fork at IST is: which load-bank technology do you commission against, and therefore which of these dynamics do you leave un-demonstrated?

Load-bank technology vs. what it can and cannot reproduce

Load-bank type	What it emulates	Electrical realism	Thermal realism	What it CANNOT demonstrate
Resistive (air-rejecting)	Real power (kW) at unity power factor; steady-state heat into air	Magnitude only; smooth, no di/dt, no power-factor stress	Heats the room/air, not cold plates; no secondary-loop heat flux	Any transient; reactive/harmonic behavior; the entire liquid loop and CDU thermal-hydraulics
Reactive (R + L/C)	Real + reactive power; lagging/leading PF; some inrush/ripple	Adds PF and ripple stress on UPS/gen/AVR; still not GPU di/dt	Same as resistive — rejects to air	The synchronized collective sawtooth; cold-plate/CDU dynamics; worst-case-branch thermal-hydraulics
Dynamic / AI-emulating (transistor or DC programmable)	Programmed step-load and ramp profiles approximating workload swings	Best available proxy for di/dt and step loads; still a scripted approximation	Most reject to air; coolant-loop dynamic loads emerging but rare and partial	The exact, cross-rack-synchronized power pattern of a real model on a real fabric; true cold-plate transient heat flux
Real GPUs (proxy training run)	The actual workload on the actual fabric	Complete — by definition the real electrical dynamics	Complete — real heat flux into cold plates, real CDU and worst-case-branch behavior	Nothing — but it puts irreplaceable, supply-constrained hardware at risk to do so

The decision fork at the heart of IST. 'Electrical realism' = ability to reproduce real-power magnitude and the millisecond di/dt swings of synchronized GPU collectives. 'Thermal realism' = ability to impose realistic worst-case heat flux into cold plates and the CDU/secondary loop. Capabilities are 2026-current practitioner ranges; sources are listed with the key numbers later in this chapter.

The table is a ladder of fidelity bought at rising cost and risk. Resistive load banks are cheap, ubiquitous, and prove the steady-state power and cooling capacity — they are the right tool for the bulk of L4 and the soak portion of IST. Reactive banks add power-factor and ripple stress that resistive banks miss, exercising the UPS, the generator AVR, and protection devices closer to real conditions. Dynamic / AI-emulating banks — transistor-based or, for 800 VDC architectures, programmable DC loads — are the newest tier and the only load-bank class that even attempts the millisecond step-load that defines AI draw; they let you fire a scripted swing at the BBU and BESS mitigation stack and watch it absorb. But every load bank shares one fatal limitation for liquid-cooled facilities: almost all of them reject heat to air, not into cold plates. The liquid loop, the CDU control response, and the worst-case-branch thermal-hydraulics — the things Chapter 13.5 flagged as un-testable without real silicon — stay un-exercised at realistic transient heat flux no matter how good your electrical emulation is. You can perfect the electrical side of the lie and the thermal side remains unproven.
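The ladder can be turned into a small coverage calculation: given the load-bank classes used in a campaign, which dynamics remain un-demonstrated? The sketch below is a simplified reading of the table, not a vendor capability statement. The behavior labels and the assignment of behaviors to each class are assumptions made for illustration.

```python
# Behaviors the building must eventually be proven against (labels are illustrative).
REQUIRED = {
    "steady_state_kw",
    "power_factor_ripple",
    "scripted_step_load",
    "synchronized_collective_sawtooth",
    "cold_plate_heat_flux",
}

# Simplified reading of the load-bank table above.
CAN_DEMONSTRATE = {
    "resistive": {"steady_state_kw"},
    "reactive": {"steady_state_kw", "power_factor_ripple"},
    "dynamic": {"steady_state_kw", "scripted_step_load"},
    "real_gpus": set(REQUIRED),
}


def undemonstrated(banks_used):
    """Return the behaviors no load source in banks_used can reproduce."""
    covered = set()
    for bank in banks_used:
        covered |= CAN_DEMONSTRATE[bank]
    return sorted(REQUIRED - covered)


print(undemonstrated(["resistive", "reactive", "dynamic"]))
```

Even with all three load-bank classes in play, the synchronized sawtooth and the cold-plate heat flux stay open under this reading, which is the residual that the proxy run is chartered to close.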

The two-dimensional gap: do not let electrical realism flatter you into skipping thermal proof

The realism gap has two axes and they fail independently. A dynamic load bank can give you genuinely good electrical realism — real di/dt against the power chain — while giving you essentially zero thermal realism, because it dumps its heat into a fan, not a cold plate. The trap is to see a successful dynamic-load IST, conclude the building is proven, and energize the cluster — only to discover that the CDU's PID loop overshoots on the first real synchronized heat pulse, or that the worst-case branch starves under a flow pattern no air-rejecting bank ever produced. Electrical IST success says nothing about secondary-loop thermal-hydraulic stability under real transient heat flux. The only instrument that closes the thermal axis is real GPUs producing real heat into the real loop — which is why the proxy training run is not an optional benchmark but the terminal commissioning gate.

The mitigation stack under realistic dynamics

Modern AI facilities defend against GPU transients with a layered stack, and IST is where you decide how much of it you exercise on the real building versus accept on vendor witness test. The layers, from grid inward: the facility-level BESS (and any synchronous condenser) absorbing multi-megawatt swings and supporting ride-through (Chapter 4.5, Chapter 13.4); the BBU / UPS layer bridging power transfers; the in-rack / in-shelf capacitance (GB300's ~65 J/GPU, Vera Rubin's larger reservoir) catching the fastest edges; and the workload-side power smoothing — ramp-rate limiting and power capping via SMI/Redfish — that shaves the peak before it reaches the wire. The problem is recursive: a static load bank cannot create the transient the stack exists to absorb, so an IST run on resistive or reactive banks proves the stack is installed and healthy but never proves it does its job. Only a dynamic load bank (partially) or a real workload (fully) closes that loop. The decision is explicit risk allocation: which mitigation layers you require to be demonstrated absorbing a real swing during IST, and which you sign off by analysis and recharge-test, knowing the first true exercise comes during the proxy run.
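The scale of the fastest layer is easy to check with arithmetic. The sketch below takes the ~65 J/GPU figure from the text and the 72 GPUs of an NVL72 rack, then asks how long that energy could cover a given power swing. The 30 kW per-rack swing is a hypothetical input, and the calculation ignores converter limits and usable depth of discharge, so treat the result as an order-of-magnitude bound.

```python
GPUS_PER_RACK = 72  # NVL72


def rack_storage_j(joules_per_gpu):
    """Total in-shelf energy for one rack, in joules."""
    return joules_per_gpu * GPUS_PER_RACK


def bridge_time_ms(storage_j, swing_w):
    """Milliseconds the stored energy could supply a constant swing of swing_w watts."""
    return 1000.0 * storage_j / swing_w


gb300_rack_j = rack_storage_j(65)            # ~65 J/GPU from the text
print(gb300_rack_j)                          # joules per rack
print(bridge_time_ms(gb300_rack_j, 30_000))  # hypothetical 30 kW swing
```

Under these assumptions the in-shelf reservoir covers edges lasting on the order of a hundred milliseconds. Anything longer falls to the BBU and facility BESS layers, which is why a static load bank that never produces such an edge cannot show the layers handing off to one another.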

Key number	What it describes	Year	Source
65 J/GPU	GB300 NVL72 in-shelf energy storage for power smoothing; ~30% peak-grid reduction on Megatron training	2025	NVIDIA Developer (GB300 steady power)
~400 J/GPU	Vera Rubin power-smoothing reservoir target; facility BESS roles for transient/ride-through/DR	2025	NVIDIA (production-ready BESS for AI factories)
~1,500 MW	single-event large-load loss on a 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024); the ride-through problem IST must prove against	2026	NERC Level 3 Alert / Utility Dive
20-25 °C	GB200/GB300 NVL72 coolant inlet window; deviation throttles GPUs up to ~50%; the thermal ride-through envelope	2025	NVIDIA OCP / Introl
3% vs 21%	power-oversubscription headroom, training vs inference; why transient behavior differs by workload IST cannot run	2025	Uptime Institute Journal
~55%	single-phase direct-to-chip share of liquid-cooling market; the loop IST load banks cannot exercise at real heat flux	2026	DCD / IDTechEx
weeks-to-months	typical IST planning horizon before a full-facility Level 5 campaign	2025	Construct & Commission (L5 IST guide)

Deep dive: why the proxy training run is the only true emulator

Everything upstream of the proxy run is an approximation chosen for safety and cost. The proxy run — a real, scaled model training on the real fabric, treated in Chapter 13.9 — is the only event that simultaneously closes both axes of the realism gap, and it does so because it is the workload, not a stand-in for it. Electrically, it produces the exact cross-rack-synchronized sawtooth that the collective-communication pattern dictates, including the di/dt edges no scripted load profile fully captures, because the timing is set by NCCL and the network, not by a test engineer. Thermally, it dumps real heat into real cold plates, driving the CDU control loop, the secondary-loop pumps, and the worst-case branch through the transient regime that air-rejecting load banks structurally cannot reach.

The consequence for sequencing: IST and the proxy run are not redundant, and you cannot substitute one for the other. IST proves the building survives faults at full magnitude with a safe, smooth load — you can pull the plug because the load is heaters, not GPUs. The proxy run proves the building survives the workload's own dynamics with no induced fault. A facility that passed IST brilliantly can still fail its first proxy run because the CDU was never asked to track a real heat pulse, or because the power-smoothing config was tuned against a load profile that did not match the real model. The correct posture is to treat the proxy run as the final commissioning gate, not a day-2 benchmark — and to accept GPUs into the building only under the staged, instrumented ramp of Chapter 13.10 so that the first real dynamics arrive against a known-good electrical and thermal baseline.

Failure-mode demonstration: live trip vs. witnessed vs. analyzed

Not every failure mode in Appendix F can or should be physically tripped on the real building. Some are too destructive (you do not deliberately rupture a coolant line on a live hall to watch the leak-detection-and-isolation sequence), some are impractical (you cannot induce a real utility-side 230 kV fault on demand), and some are covered acceptably by a witnessed factory or vendor test plus an installed-and-healthy check. IST forces an explicit triage of the FMEA catalog into three buckets, and the allocation is a risk decision the owner signs, not the Cx agent.

FMEA demonstration triage at IST

Demonstration method	Typical failure modes	What it proves	Residual risk carried
Live trip during IST	Loss of utility (black-building); loss of one feed/UPS/generator/CDU; ATS transfer; concurrent-fault matrix	The integrated response fires correctly at full magnitude, in real time	Low: the actual transition was observed
Live trip with safe load (heaters)	Thermal ride-through; cooling-failover; critical-bus continuity; alarm-hierarchy under cascade	Power/cooling chains survive the fault; controls narrate it	Medium: survives the fault, but under smooth load, not GPU dynamics
Witnessed factory / vendor test	BESS cell-level behavior; breaker interruption ratings; generator load-acceptance curves	Component meets spec under controlled conditions	Medium: integration seam still unproven on site
Analysis + installed-and-healthy check	Coolant-line rupture cascade; real utility fault waveform; multi-MW grid-side ride-through	Design intent is sound; protection is present and configured	Higher: first real exercise is the live event (or proxy run)

How each Appendix-F failure mode is proven. The owner accepts residual risk on everything outside the first 'live trip' row. Allocation is project-specific; this is the decision framework, not a prescription.

The table's second row is the uncomfortable one. Thermal ride-through, cooling failover, and the alarm hierarchy can be tripped live, but only against load banks, so they land in the 'medium residual risk' bucket no matter how rigorously you run them. That is the structural consequence of the realism gap: the failure modes most specific to AI factories are precisely the ones a load bank can demonstrate against a fault but not against the workload. This is why the acceptance criteria for IST cannot be a clean binary. They must read as a bridge.
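The triage can be kept as a small register that maps each failure mode to its demonstration method and the residual risk that follows. This is an illustrative sketch: the method keys, the failure-mode names, and their allocation are hypothetical examples drawn loosely from the table, and a real register would be generated from the Appendix-F catalog.

```python
# Residual risk implied by each demonstration method, following the triage table.
RESIDUAL_RISK = {
    "live_trip": "low",
    "live_trip_safe_load": "medium",
    "vendor_witness": "medium",
    "analysis": "higher",
}

# Hypothetical allocation of failure modes to methods.
allocation = {
    "loss_of_utility": "live_trip",
    "thermal_ride_through": "live_trip_safe_load",
    "bess_cell_behavior": "vendor_witness",
    "coolant_line_rupture": "analysis",
}


def owner_signoff_items(allocation):
    """Return (failure mode, residual risk) for everything not observed as a plain live trip."""
    return sorted(
        (mode, RESIDUAL_RISK[method])
        for mode, method in allocation.items()
        if method != "live_trip"
    )


for mode, risk in owner_signoff_items(allocation):
    print(f"{mode}: {risk}")
```

The output is the list the owner signs. Keeping it machine-generated from the allocation makes it harder for a carried risk to drop out of the acceptance package unnoticed.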

Acceptance criteria: bridging load-bank IST to first-real-workload

A defensible IST acceptance package does three things the legacy IT version never had to. First, it states what was proven and against what load — every passed scenario annotated with the load-bank class used, so the residual-risk register is honest about which results carry the realism caveat. Second, it defines the instrumented thresholds that gate GPU energization: coolant temperature and flow stability under the worst observed transition, critical-bus continuity with zero dropped cycles, CDU re-energization time inside the throttle window, and a clean, de-duplicated alarm record. Third, it names the deficiencies knowingly carried into the proxy run — the thermal-hydraulic dynamics and workload-synchronized electrical behavior that no load bank reached — so the proxy run is explicitly chartered to close them rather than being treated as a victory lap.

Gate to energize GPUs: black-building and concurrent-fault matrix passed at full load; thermal ride-through measured inside the coolant envelope; critical cooling confirmed on protected bus with re-energization time bounded; BMS/DCIM/SCADA agreement and a clean alarm hierarchy demonstrated.
Carried to the proxy run (Chapter 13.9): CDU and secondary-loop control response to real transient heat flux; the cross-rack-synchronized power swing against the full mitigation stack; worst-case-branch thermal-hydraulics under realistic, not air-rejected, load.
Carried to staged ramp (Chapter 13.10): validation of power-smoothing and ramp-rate-limit configuration against the actual model, preserving live-block redundancy as load builds.

The acceptance gate is therefore not 'IST passed' but 'IST passed, residual risk named, and the conditions for safely meeting the real workload are written down.' That handoff — from a building proven against faults to a building about to meet its first real dynamics — is the seam IST exists to make safe.
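The gate described above can be sketched as an explicit check that returns reasons instead of a bare pass or fail. The record fields and the 10-second CDU re-energization limit are hypothetical; the 20-25 °C inlet window is the figure used in this chapter. The last rule encodes the point that an acceptance package naming no carried deficiencies is itself incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IstRecord:
    black_building_passed: bool
    concurrent_matrix_passed: bool
    min_inlet_c: float         # coldest coolant inlet seen during transitions
    max_inlet_c: float         # warmest coolant inlet seen during transitions
    dropped_cycles: int        # critical-bus dropped cycles
    cdu_restore_s: float       # worst observed CDU re-energization time
    duplicate_alarms: int      # alarms remaining after de-duplication review
    carried_deficiencies: tuple = ()


def energize_blockers(r, inlet_window_c=(20.0, 25.0), cdu_restore_limit_s=10.0):
    """Return the reasons GPUs may not yet be energized; empty means the gate is open."""
    blockers = []
    if not r.black_building_passed:
        blockers.append("black-building test not passed at full load")
    if not r.concurrent_matrix_passed:
        blockers.append("concurrent-fault matrix not passed")
    if r.min_inlet_c < inlet_window_c[0] or r.max_inlet_c > inlet_window_c[1]:
        blockers.append("coolant inlet left the envelope")
    if r.dropped_cycles > 0:
        blockers.append("critical bus dropped cycles")
    if r.cdu_restore_s > cdu_restore_limit_s:
        blockers.append("CDU re-energization too slow")
    if r.duplicate_alarms > 0:
        blockers.append("alarm record not clean")
    if not r.carried_deficiencies:
        blockers.append("no carried deficiencies named for the proxy run")
    return blockers


# Hypothetical record: everything passes except a slow CDU and an empty deficiency list.
record = IstRecord(True, True, 20.5, 24.1, 0, 12.0, 0)
print(energize_blockers(record))
```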

IST is a goodput investment, not a compliance cost

It is tempting to read IST as the last box to tick before go-live — a contractual hurdle owned by the Cx agent. Reframe it through the goodput lens of Chapter 12.2. Every failure mode you trip on purpose during IST, against safe load, is a failure you do not discover during a real workload against irreplaceable GPUs at full revenue exposure. The Llama 3 405B run logged a hardware-attributable interruption roughly every three hours on 16,384 GPUs; an AI factory will live inside a near-continuous failure environment from day one. IST's value is that it moves the integration failures — the mis-coordinated ATS, the starved CDU, the alarm flood — out of the goodput-bearing period and into a controlled rehearsal. The honest accounting is that thorough IST plus a real proxy run buys you the right to trust the building's automatic responses when the load is finally GPUs and the clock is running.
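The Llama 3 figure can be scaled to other cluster sizes with one line of arithmetic. The sketch below assumes interruptions arrive at a constant per-GPU rate, so the interval between them shrinks in proportion to GPU count. That is a simplification: real fleets differ in hardware, topology, and workload. The 100,000-GPU cluster is a hypothetical example.

```python
def interruption_interval_h(gpus, ref_gpus=16_384, ref_interval_h=3.0):
    """Expected hours between interruptions, scaling the reference rate linearly with GPU count."""
    return ref_interval_h * ref_gpus / gpus


# Implied GPU-hours per interruption at the reference point quoted in the text.
print(3.0 * 16_384)

# Hypothetical 100,000-GPU cluster.
hours = interruption_interval_h(100_000)
print(f"{hours * 60:.0f} minutes between interruptions")
```

Under this assumption a cluster of that size would see an interruption roughly every half hour, which is the environment in which the building's automatic responses must already be trusted.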

IST sits inside the commissioning program governed by Chapter 13.1 and scripted per Chapter 13.2, downstream of electrical acceptance (Chapter 13.3), microgrid commissioning (Chapter 13.4), and cooling acceptance (Chapter 13.5). The transient physics it cannot fully reproduce is canonical in Chapter 4.5; the mitigation stack it exercises is engineered in Chapter 4.5 and Chapter 13.4. The FMEA catalog it demonstrates against is consolidated in Appendix F. The realism gap it leaves open is closed only by the proxy training run in Chapter 13.9 and the staged go-live ramp in Chapter 13.10; the goodput framing that justifies the whole exercise lives in Chapter 12.2, with the DCIM handoff in Chapter 14.2.