The Definitive Guide to AI Data Centers

Chapter 5.12

Cooling-Controls Transient Dynamics & Setpoint Stability

A direct-to-chip loop has almost no thermal inertia to hide behind, so when thousands of GPUs slam their power in unison the cooling controls must answer in seconds — and the line between a stable answer and a self-oscillating one is set at design time by slew limits, loop tuning, and a dew-point margin you cannot tune away.

POWER-BOUND · GOODPUT

What you'll decide here

  1. How fast your CDU control valves and pump VFDs are allowed to slew — fast enough to chase a synchronized GPU load step before the chip throttles, slow enough that the loop does not hunt — and where that tuning lives (CDU PLC vs facility BMS vs a predictive supervisor).
  2. Whether you ride a synchronized power slam with stored thermal/electrical inertia (BBU/BESS power-smoothing, a warm-water buffer volume, a flywheel-like reserve) or with a faster control response — and which is cheaper for your density tier.
  3. The dew-point margin and the condensation-risk policy on a rapid load drop: how many degrees above space dew point you hold secondary supply, and what the controls do when load collapses faster than the setpoint can rise.
  4. How cooling controls coordinate with GPU power-capping and the chip-to-BBU-to-BESS electrical spine so that a thermal excursion throttles compute gracefully instead of tripping it — the goodput-vs-trip decision.
  5. What you commission and demonstrate at Level 5: a real synchronized load-step test, an anti-hunting margin, and a dew-point excursion test — not just a steady-state capacity proof.

Chapter 4.5 told the electrical side of this story: a frontier training cluster is the worst load the grid has ever been asked to serve, because tens of thousands of synchronized GPUs ramp from idle to full power and back on millisecond timescales, and the power chain has to absorb the di/dt with stored energy before it propagates upstream. This chapter is the thermal twin of that problem. The same synchronized slam that stresses the busbar also lands, a few hundred milliseconds later, as a step change in heat flux at the cold plate — and a direct-to-chip loop has almost none of the thermal inertia that a legacy chilled-water plant used to ride through. The decision this chapter forces is not how much cooling you install (Chapter 5.1 owns the capacity wall) but how the controls behave in the seconds around a transient: how fast they may move, how they avoid oscillating, and what they do when the load disappears as fast as it arrived.

Four forks structure the chapter: slew-limit tuning, inertia-vs-response, dew-point margin, and capping coordination. Each has a downstream cost when set wrong — a throttled GPU (lost goodput), a hunting loop (mechanical wear and unstable die temperatures), or condensation on a cold manifold inside an energized rack (the failure mode that ends up in Appendix F). None of these are exotic. They are the ordinary consequence of running a low-inertia loop under a load profile that swings harder and faster than any cooling plant was historically asked to follow.

Why DLC has no inertia to ride through a slam

Start with the physics, because it is the whole reason this chapter exists. A legacy air-cooled hall was a giant thermal flywheel. Tens of thousands of cubic metres of room air, a raised-floor plenum, the thermal mass of the slab and the racks themselves, plus a chilled-water plant carrying hundreds of thousands of litres in buffer tanks and pipe volume — all of it stored enough sensible heat that a control loop could be sluggish and the room would barely notice a load step. A CRAH that took thirty seconds to ramp a fan was fine, because the air temperature could not move quickly in a volume that large. The plant rode through transients on stored thermal mass.

A direct-to-chip loop deletes that buffer. The coolant in contact with the die is a thin film in a microchannel cold plate; the technology cooling system (TCS) loop volume serving an NVL72-class rack is on the order of 200 L, and the transport time from cold plate to CDU heat exchanger and back is seconds, not minutes. The thermal resistance from junction to coolant is deliberately tiny (that is the entire point of liquid cooling), which means the die temperature tracks coolant supply temperature almost rigidly. When GPU power steps by more than half of TDP in milliseconds (measured: power variations exceeding 50% of TDP within a few AC cycles; current swings from a 5–10 A idle baseline to 20–25 A in under 200 ms on a single inference node), the heat arriving at the cold plate steps with it, and there is no large mass of warm water to soak it up. The coolant supply temperature is now the only thing standing between the load step and a junction-temperature excursion, and that supply temperature is whatever your control loop is making it, right now.

This is the inversion that catches operators migrating from air: in the old world, the cooling plant was slow and that was safe; in the DLC world, the cooling plant is the fast element in a tightly-coupled loop, and slowness is what gets you throttled. The GB200 envelope is unforgiving precisely because the coupling is tight — coolant inlet held near 20–25 °C, ~80 L/min flow per rack, a small rise across the cold plates — and deviation throttles the GPUs by up to ~50%. The control system is not a comfort function anymore. It is in the goodput path.
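To put rough numbers on the missing flywheel, the sketch below compares how long an unrejected heat step takes to warm a small TCS loop versus a legacy chilled-water inventory. It is a crude lumped estimate, not a loop model: it assumes a perfectly mixed water volume, no change in heat rejection during the step, and illustrative values (a 60 kW step and a 200,000 L legacy inventory are assumptions chosen for the example; the ~200 L loop volume is the figure quoted above).

```python
WATER_CP_KJ_PER_KG_C = 4.18    # specific heat of water
WATER_DENSITY_KG_PER_L = 1.0   # approximate density of water


def rise_rate_c_per_s(step_kw: float, volume_l: float) -> float:
    """Temperature rise rate (°C/s) when an unrejected heat step lands on a mixed water volume."""
    mass_kg = volume_l * WATER_DENSITY_KG_PER_L
    return step_kw / (WATER_CP_KJ_PER_KG_C * mass_kg)


def seconds_to_rise(step_kw: float, volume_l: float, delta_t_c: float) -> float:
    """Time for the volume to warm by delta_t_c if nothing else responds."""
    return delta_t_c / rise_rate_c_per_s(step_kw, volume_l)


# Hypothetical comparison: same 60 kW step, 2 °C allowed rise.
tcs_seconds = seconds_to_rise(60.0, 200.0, 2.0)          # small DLC loop
legacy_seconds = seconds_to_rise(60.0, 200_000.0, 2.0)   # large chilled-water inventory
print(f"TCS loop: {tcs_seconds:.0f} s, legacy plant: {legacy_seconds / 3600:.1f} h")
```

Under these assumptions the small loop uses up a 2 °C allowance in under half a minute, while the large inventory takes hours, which is the gap the controls now have to cover.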

CDU slew limits and the lag-vs-stability fork

The CDU is where the control decision physically happens. Its two actuators are the pump VFD (setting flow) and a modulating control valve — usually a three-way mixing valve on the facility-water side of the heat exchanger, which sets how much heat the TCS loop rejects and therefore the secondary supply temperature. Both actuators have slew limits: the maximum rate at which they are permitted to change. Vendor-typical envelopes for a predictive deployment sit around ±3% pump RPM per minute and ≤10% valve travel per minute, and a well-built CDU holds secondary supply within ±0.5 °C of setpoint in steady state.
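A slew limit is easy to state and easy to underestimate, so a minimal sketch may help. The function names and the percent-of-travel units are assumptions made for illustration; the 10%/min figure is the vendor-typical valve limit quoted above, not a recommendation.

```python
def slew_limit(current: float, target: float, max_rate_per_min: float, dt_s: float) -> float:
    """Move current toward target, but no faster than max_rate_per_min."""
    max_step = max_rate_per_min * dt_s / 60.0
    delta = target - current
    if delta > max_step:
        return current + max_step
    if delta < -max_step:
        return current - max_step
    return target


def seconds_to_reach(current: float, target: float, max_rate_per_min: float) -> float:
    """Minimum travel time between two positions at the given slew limit."""
    return abs(target - current) / max_rate_per_min * 60.0


# Hypothetical move: valve from 40% to 60% travel at a 10%/min limit.
travel_s = seconds_to_reach(40.0, 60.0, 10.0)
next_position = slew_limit(40.0, 60.0, 10.0, dt_s=1.0)
print(travel_s, round(next_position, 3))
```

At that limit a 20-point valve move takes 120 s, so the rate the actuator is permitted to move has to be weighed against how quickly the heat actually arrives.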

The slew limit is the fork, and it cuts both ways. Set it too slow and the loop cannot chase a synchronized load step: secondary supply temperature lags the heat input, the coolant arriving at the cold plate is warmer than the die can tolerate, and the GPU throttles to protect itself — lost goodput, the exact thing a frontier training run cannot afford. Set it too fast and you invite the opposite failure: the valve and pump overshoot the correction, the over-corrected temperature transports back to the sensor a few seconds later, the controller over-corrects the other way, and the loop hunts — a sustained oscillation in flow, temperature, and pressure that wears valve seats and pump bearings, fatigues quick-disconnects and hoses with pressure cycling, and swings die temperatures around the throttle threshold so the GPUs themselves oscillate between full and capped. Hunting is the cooling-controls failure mode catalogued in Appendix F, and it is almost always a tuning fault, not a hardware one.

The reason this is hard is transport lag, the seconds of dead time between actuation and feedback: any loop with significant dead time is prone to oscillation if its gain is too high. Classical fixes apply: lower proportional gain, lengthen the integral time, add a setpoint deadband so small excursions do not chase the actuator, and enable anti-windup so the integrator does not saturate while the valve is slew-limited or at its travel cap. But the more durable answer in 2026 practice is to stop relying on feedback alone.

Three control strategies for a low-inertia loop under synchronized slams
| Strategy | How it answers a slam | Throttle risk | Added capital / complexity | Best fit |
|---|---|---|---|---|
| Reactive feedback (PID only) | Senses temperature rise, then slews valve/pump within limits | High: transport lag means coolant reacts after the die has already heated | Lowest; native CDU PLC | Loosely-coupled inference, modest density, gentle load profiles |
| Feedforward / predictive supervisor | Reads the GPU power/utilization signal and pre-positions valve and pump before the heat arrives | Low: cooling moves with the load, not after it; die held within ~2 °C of throttle on steep spikes | Moderate; telemetry integration + supervisory controller; ~22–28% CDU energy savings reported | Dense training/RL with synchronized, predictable steps |
| Absorb with inertia (buffer + smoothing) | Rides the step on stored energy: warm-water buffer volume thermally, BBU/BESS power-smoothing electrically | Low: the step is blunted before controls must respond | Highest; buffer tankage, BESS/BBU sized for transient duty | Highest-density racks where slew limits cannot keep up at all |
The fork is feedback-only reactive control vs feedforward/predictive vs absorb-with-inertia. Figures are 2026 practitioner/vendor ranges; see the key numbers for sources and vintages. Each row trades response speed, capital, and stability differently.

The three strategies are a spectrum, and real facilities blend them. The center of gravity in 2026 has moved decisively from the top row toward the middle: a purely reactive PID loop is fine for an inference hall with gentle load, but it is increasingly inadequate for a dense training cluster whose load steps are large, fast, and — critically — synchronized and predictable. Predictability is the lever. A training step is a repeating waveform: the cluster slams to full power during compute, drops during a collective or a checkpoint, slams again. If the controls can see that waveform — via the GPU power-management telemetry, the scheduler, or even the BMC power signal — they can feedforward the valve and pump position so the coolant is already cold when the heat arrives, instead of chasing it. Predictive deployments that do this report holding die temperatures within ~2 °C of the throttle limit through the steepest spikes while cutting CDU energy 22–28% (because the loop stops over-pumping and over-cooling defensively). The bottom row — buy your way out with inertia — is the fallback when even feedforward cannot keep up, and it is where the cooling and electrical problems converge.
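The feedforward idea can be sketched in a few lines: convert a forecast heat load into the flow that would carry it at a target coolant temperature rise, then issue that command a few samples early. This is an illustration under stated assumptions, not a supervisor design. The function names, the water-like coolant properties, the 10 °C target rise, and the sample forecast are all hypothetical.

```python
WATER_CP = 4.18  # kJ/(kg·°C); coolant assumed water-like at ~1 kg/L


def feedforward_flow_lpm(heat_kw: float, target_rise_c: float) -> float:
    """Flow (L/min) that carries heat_kw at the target coolant temperature rise."""
    kg_per_s = heat_kw / (WATER_CP * target_rise_c)
    return kg_per_s * 60.0  # ~1 kg per litre, so kg/s * 60 is L/min


def lead_schedule(forecast_kw: list, lead_steps: int, target_rise_c: float) -> list:
    """Flow command per sample, looking lead_steps ahead so the command precedes the heat."""
    last = len(forecast_kw) - 1
    return [
        feedforward_flow_lpm(forecast_kw[min(i + lead_steps, last)], target_rise_c)
        for i in range(len(forecast_kw))
    ]


# Hypothetical repeating step: idle, slam to 80 kW, drop again.
forecast = [20.0, 20.0, 20.0, 80.0, 80.0, 20.0]
schedule = lead_schedule(forecast, lead_steps=2, target_rise_c=10.0)
print([round(flow, 1) for flow in schedule])
```

With a two-sample lead, the command for the 80 kW plateau is issued at index 1 while the heat arrives at index 3. In a real loop the lead would need to cover both transport time and the slew-limited travel time of the actuators, and a feedback trim would still correct the residual.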

The electrical spine: capping, BBU, and BESS as transient absorbers

The cleanest way to make a cooling-controls problem tractable is to make the load step smaller before it ever reaches the cold plate — which is exactly what the chip-to-BBU-to-BESS spine of Chapter 4.5 does. Three coordinated levers blunt the slam:

GPU power-capping is the fastest and cheapest. The accelerator's own power-management firmware can clamp or ramp-limit board power, smoothing the di/dt at the source. Vendors have moved this on-die: NVIDIA's Rubin generation builds in power-smoothing on the order of a few hundred joules per GPU to flatten the worst transients before they leave the board. A cap is also the graceful-degradation lever for cooling: when a thermal excursion is developing, throttling compute a few percent to hold die temperature is almost always cheaper than letting the loop chase it into instability or, worse, letting a chip trip. This is the goodput decision — a small, deliberate cap costs a sliver of throughput; an uncontrolled excursion costs a restart.

Battery-backed units (BBU) and facility BESS absorb the electrical transient so the grid — and indirectly the thermal load profile — sees a smoother draw. NVIDIA's production-BESS guidance for AI factories names transient absorption and power-smoothing as first-class BESS roles alongside ride-through, with closed-loop state-of-charge control sized to the workload's transient signature. A facility that smooths its electrical transient also smooths the thermal one a beat later, because flatter power means flatter heat flux — the cooling controls inherit an easier waveform to follow.

The coordination decision is who arbitrates. If GPU capping, BBU smoothing, and CDU controls each react independently to the same slam, they can fight: the cap reduces heat just as the CDU finishes slewing colder, and now the loop overshoots cold (toward the dew-point floor, below). The mature pattern is a supervisory layer that knows the load forecast and sequences the responses — cap and smooth first, feedforward the cooling to the post-cap heat profile — rather than three reactive loops racing each other. → Chapter 4.5 owns the electrical-spine design; Chapter 14.7 owns the in-operation power-and-thermal management policy.

The dew-point floor interacts with the warm-water trend in an awkward way. Everything in Part 5 pushes coolant warmer — 30 °C-plus facility water to maximize free cooling and heat reuse (Chapter 5.1) — and warmer supply is comfortably above any realistic dew point, so condensation risk falls as supply temperature rises. Good. But warmer supply also means a smaller temperature margin between coolant and the die's throttle limit, which makes the loop less forgiving of a transient overshoot: there is less thermal headroom to absorb a momentary lag before the GPU throttles. So the warm-water decision is itself a transient-stability decision — it trades condensation risk down for throttle-margin risk up, and the control loop has to be tuned for the regime you actually run, near the middle of the operating band, not at a corner.
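A setpoint clamp against the dew-point floor can be sketched as follows. The dew point is estimated with the Magnus approximation, using one commonly published coefficient pair (an assumption here; a real CDU may use a different psychrometric routine or a direct dew-point sensor). The 2 °C default margin is the roughly 2 °C figure this chapter cites; the function names are hypothetical.

```python
import math

MAGNUS_B = 17.62     # dimensionless Magnus coefficient (assumed coefficient set)
MAGNUS_C = 243.12    # °C, Magnus coefficient (assumed coefficient set)


def dew_point_c(air_temp_c: float, rh_percent: float) -> float:
    """Approximate dew point (°C) from dry-bulb temperature and relative humidity."""
    gamma = math.log(rh_percent / 100.0) + MAGNUS_B * air_temp_c / (MAGNUS_C + air_temp_c)
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def clamp_setpoint_c(requested_c: float, air_temp_c: float, rh_percent: float,
                     margin_c: float = 2.0) -> float:
    """Never let the secondary-supply setpoint fall below dew point plus margin."""
    floor_c = dew_point_c(air_temp_c, rh_percent) + margin_c
    return max(requested_c, floor_c)


# Hypothetical hall condition: 25 °C air at 50% relative humidity.
td = dew_point_c(25.0, 50.0)
print(round(td, 1), round(clamp_setpoint_c(12.0, 25.0, 50.0), 1))
```

The clamp only binds when the requested setpoint is low; a warm-water setpoint of 30 °C passes through untouched, which matches the point above that warmer supply moves the risk from condensation to throttle margin.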

| Key number | What it measures | Vintage and source |
|---|---|---|
| >50% TDP | GPU power swing within a few AC cycles (~40–80 ms); idle-to-full step under ~200 ms on an inference node | 2025, arXiv 2502.01647 (AI Load Dynamics: A Power Electronics Perspective) |
| 30–50 ms | PSU "inertia" lag between GPU current step and AC-input response; upstream converter bandwidth only a few kHz | 2025, arXiv 2502.01647 |
| ±3% RPM/min, ≤10% valve/min | Vendor-typical CDU pump and valve slew limits in predictive deployments | 2026, ProphetStor, Predictive Liquid Cooling for AI Data Centers |
| 22–28% | CDU energy saved by predictive control while holding die temps within ~2 °C of throttle on steep spikes | 2026, ProphetStor; corroborated by digital-twin studies |
| ±0.5 °C | Secondary-supply temperature band a well-tuned CDU holds around setpoint in steady state | 2025, Vertiv / CDU vendor control specs |
| ~2 °C above dew point | Minimum secondary-supply margin CDUs hold to prevent condensation on the TCS loop | 2025, ASHRAE TC 9.9; CDU control-method patents (US 12,225,688) |
| 20–25 °C / ~80 L/min | GB200 NVL72 coolant inlet and flow; deviation throttles GPUs up to ~50% | 2025, NVIDIA OCP GB200 NVL72 contribution / Introl |
| ~400 J/GPU | On-die power-smoothing energy (Rubin generation) flattening transients at the source | 2025, NVIDIA, Designing Production-Ready BESS for AI Factories |

Anti-hunting: tuning the loop so it does not eat itself

Hunting deserves its own treatment because it is the most common preventable failure in a commissioned liquid plant, and because it is invisible at steady state — it only shows up under the synchronized-load conditions the workload actually creates. The mechanism is dead-time-driven oscillation: a loop with several seconds of transport delay between actuator and sensor will oscillate if its loop gain is high enough that a correction returns to the sensor still large enough to provoke an opposite correction. Three knobs and one architecture decision keep it stable.

  • Gain and integral time. Lower proportional gain so a single correction does not overshoot; lengthen integral time so the controller does not wind up faster than the loop can respond. The classic symptom of too-aggressive tuning here is a regular sinusoidal swing in supply temperature with a period of roughly 4–6x the transport dead time.
  • Deadband. A small setpoint deadband (e.g. ±0.3–0.5 °C) stops the loop from chasing sensor noise and micro-excursions, which is what drives high-frequency valve dithering and seat wear. The cost is a slightly looser steady-state band — an easy trade.
  • Anti-windup. Because the actuators are slew-limited, the integrator will try to demand more correction than the valve can deliver during a fast transient. Without anti-windup the integrator saturates, and when the transient passes the loop dumps the accumulated error as a giant overshoot — straight toward the dew-point floor on a load drop. Anti-windup (clamping or back-calculation) is not optional on a slew-limited loop.
  • Architecture: feedforward beats feedback for dead-time loops. No amount of feedback tuning beats the transport lag; the structural fix is to add a feedforward term from the load signal so the loop is mostly open-loop-correct and the PID only trims the residual. This is the same insight as the predictive-supervisor row in the table — it is the difference between chasing the load and leading it.
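The four items above can be combined in one small controller sketch: a PI trim with a deadband, a slew-limited and clamped output, anti-windup, and a feedforward term supplied from outside. Everything here is illustrative. The class name, the gains, the sign convention (output is percent cooling effort, so a warm supply raises it), and the choice of conditional integration as the anti-windup method are assumptions, not a vendor's algorithm.

```python
from dataclasses import dataclass


@dataclass
class TrimController:
    kp: float               # proportional gain, % output per °C of error
    ti_s: float             # integral time in seconds
    deadband_c: float       # errors smaller than this are treated as zero
    slew_pct_per_s: float   # maximum output change per second
    out_min: float = 0.0
    out_max: float = 100.0
    integral: float = 0.0   # accumulated integral contribution, % output
    output: float = 0.0     # last delivered output, % cooling effort

    def step(self, setpoint_c: float, measured_c: float,
             feedforward_pct: float, dt_s: float) -> float:
        error = measured_c - setpoint_c  # positive when supply is too warm
        if abs(error) < self.deadband_c:
            error = 0.0                  # deadband: do not chase noise
        increment = self.kp * error * dt_s / self.ti_s
        desired = feedforward_pct + self.kp * error + self.integral + increment
        clamped = min(self.out_max, max(self.out_min, desired))
        max_step = self.slew_pct_per_s * dt_s
        delivered = min(self.output + max_step, max(self.output - max_step, clamped))
        # Anti-windup by conditional integration: keep the integrator update only
        # if it does not push the demand further from what the actuator delivered.
        if (desired - delivered) * increment <= 0.0:
            self.integral += increment
        self.output = delivered
        return delivered


# Hypothetical tuning, for illustration only.
ctrl = TrimController(kp=5.0, ti_s=30.0, deadband_c=0.3, slew_pct_per_s=1.0)
first = ctrl.step(setpoint_c=25.0, measured_c=27.0, feedforward_pct=0.0, dt_s=1.0)
print(first, ctrl.integral)
```

In the first step the loop asks for about 10% but the slew limit delivers 1%, so the integrator is held rather than allowed to accumulate demand the actuator cannot meet. Back-calculation is an equally common alternative to the conditional scheme used here.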

One more anti-pattern worth naming: controls fighting across layers. If the CDU PLC, the facility BMS, and a GPU-side thermal manager each run their own loop against overlapping setpoints, they interact like coupled oscillators and the whole system hunts even when each loop is individually stable. Decide which layer owns the secondary-supply setpoint and let the others observe, not actuate. This is the thermal analogue of the electrical arbitration problem above — same disease, same cure: one arbiter, clear ownership.

Deep dive: sizing a warm-water buffer to ride a synchronized step

When slew limits genuinely cannot keep up — the highest-density racks, the steepest steps — the absorb-with-inertia row of the table becomes the answer, and the question is how much buffer volume you need. The arithmetic is a thermal-capacitance balance. A buffer of volume V (litres) of water (specific heat ~4.18 kJ/kg·°C, density ~1 kg/L) absorbing a heat-load step ΔQ (kW) over the loop's response time t (seconds) before the controls catch up will rise in temperature by roughly ΔT ≈ (ΔQ · t) / (4.18 · V). Invert it: to cap the supply-temperature excursion at, say, 2 °C while a 60 kW step is absorbed over a 10-second control-response window, you need V ≈ (60 · 10) / (4.18 · 2) ≈ 72 L of dedicated buffer beyond the loop's working volume.

That is a modest tank for one rack and a large one for a hall — which is exactly why buffering is the fallback, not the default. It also explains why warmer facility water helps twice: a larger absolute temperature band to the throttle limit means you can tolerate a bigger ΔT excursion, so the same buffer volume rides a bigger step. The competing approach is to move the inertia onto the electrical side instead — BBU/BESS power-smoothing flattens the load step so ΔQ itself shrinks, which is often cheaper than tankage because the BESS is already there for ride-through (Chapter 4.5). The design choice between thermal buffer and electrical smoothing is a capital comparison, and it is density-dependent: below the slew-limit-keeps-up threshold you need neither; above it, electrical smoothing usually wins until the very top of the density curve, where you do both. → heat-rejection and loop sizing in Chapter 5.1; DLC loop architecture in Chapter 5.4.
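The buffer arithmetic above is simple enough to check in code. The sketch below reproduces the worked example using the same water properties the text uses; the function names are hypothetical and the model is the same lumped thermal-capacitance balance, with no mixing or stratification effects.

```python
def buffer_volume_l(step_kw: float, response_s: float, max_rise_c: float,
                    cp_kj_per_kg_c: float = 4.18, density_kg_per_l: float = 1.0) -> float:
    """Buffer volume (L) that limits the temperature rise to max_rise_c."""
    return step_kw * response_s / (cp_kj_per_kg_c * density_kg_per_l * max_rise_c)


def excursion_c(step_kw: float, response_s: float, volume_l: float,
                cp_kj_per_kg_c: float = 4.18, density_kg_per_l: float = 1.0) -> float:
    """Temperature rise (°C) of a buffer absorbing the step for response_s seconds."""
    return step_kw * response_s / (cp_kj_per_kg_c * density_kg_per_l * volume_l)


# The worked example from the text: 60 kW step, 10 s response window, 2 °C cap.
volume = buffer_volume_l(60.0, 10.0, 2.0)
# Same step with a 4 °C allowance, as a hypothetical wider throttle band.
volume_wide_band = buffer_volume_l(60.0, 10.0, 4.0)
print(round(volume), round(volume_wide_band))
```

Doubling the tolerable excursion halves the required volume, which is the sense in which a wider band to the throttle limit lets the same tank ride a bigger step.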

Commissioning the transient, not just the capacity

Here is the operational consequence that most directly costs money: a cooling plant that passes a steady-state capacity test can still fail catastrophically under the workload's real transient profile, because the two are different physics. A Level 4 capacity proof says the plant can reject the rated heat at the rated flow indefinitely. It says nothing about whether the loop hunts when the load steps, whether the controls catch a slam before the die throttles, or whether the setpoint clamps above dew point on a fast drop. Those have to be demonstrated explicitly — which is why the integrated-systems-test (IST) for a liquid-cooled hall has to include transient cases, not just the load-bank-at-full sequence.

The transient acceptance set is short and specific: a synchronized step test (drive the load from low to full and back as fast as the load banks or a real workload allow, and confirm the controls catch it without throttling and without overshoot), an anti-hunting demonstration (hold the load at the worst-case operating point and confirm the loop settles inside its band without sustained oscillation), and a dew-point excursion test (drop the load fast and confirm the setpoint clamp holds secondary supply above the measured dew point throughout). Pair these with the metering and telemetry that make them observable in operation — the per-rack power, flow, supply-temperature, and dew-point signals are exactly the ones cooling controls need and the ones Chapter 4.9 meters. → cooling acceptance and CDU commissioning in Chapter 13.5; the full Level 5 IST and failure-mode demonstration in Chapter 13.6.
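One way to make the acceptance set repeatable is to score a logged test trace against explicit criteria. The sketch below is a hypothetical post-processing check, not a commissioning standard: the function names, the sample traces, and the four-sample settling window are assumptions, while the ±0.5 °C band and 2 °C dew-point margin are the figures used elsewhere in this chapter.

```python
def min_dew_margin_c(supply_c: list, dew_point_c: list) -> float:
    """Smallest gap between secondary supply and measured dew point over the trace."""
    return min(s - d for s, d in zip(supply_c, dew_point_c))


def settles_in_band(supply_c: list, setpoint_c: float, band_c: float,
                    tail_samples: int) -> bool:
    """True if the last tail_samples readings all sit within ±band_c of setpoint."""
    tail = supply_c[-tail_samples:]
    return len(tail) == tail_samples and all(abs(s - setpoint_c) <= band_c for s in tail)


def transient_passes(supply_c: list, dew_point_c: list, setpoint_c: float,
                     band_c: float = 0.5, margin_c: float = 2.0,
                     tail_samples: int = 4) -> bool:
    """Pass only if the dew-point margin held throughout and the loop settled."""
    return (min_dew_margin_c(supply_c, dew_point_c) >= margin_c
            and settles_in_band(supply_c, setpoint_c, band_c, tail_samples))


# Hypothetical traces (°C), one sample per logging interval.
dew = [14.0] * 7
settled = [25.0, 26.4, 25.9, 25.3, 25.1, 24.9, 25.0]   # overshoots, then settles
hunting = [25.0, 26.0, 24.0, 26.0, 24.0, 26.0, 24.0]   # sustained oscillation
overcooled = [25.0, 18.0, 15.0, 24.8, 25.0, 25.1, 25.0]  # dips within 1 °C of dew point
print(transient_passes(settled, dew, 25.0),
      transient_passes(hunting, dew, 25.0),
      transient_passes(overcooled, dew, 25.0))
```

A trace that ends in band can still fail if it dipped too close to dew point on the way, which is why the margin is checked over the whole record rather than only at the end.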

This chapter is the thermal twin of the electrical-transient problem in Chapter 4.5 (UPS/BBU/BESS ride-through and transient absorption), and the two must be designed together. The capacity wall and warm-water rationale that set the throttle and dew-point margins live in Chapter 5.1; the DLC loop architecture (TCS/FWS separation, CDU, flow and delta-T) in Chapter 5.4; thermal reliability, leak detection, and the consolidated coolant-leak/thermal-runaway catalog in Chapter 5.11 and Appendix F. The telemetry that makes transients observable is metered in Chapter 4.9; the goodput-vs-availability framing behind the capping decision in Chapter 12.2; cooling/CDU commissioning in Chapter 13.5 and Level-5 IST in Chapter 13.6; and in-operation capacity, power, and thermal management in Chapter 14.7.