
Chapter 5.12

Cooling-Controls Transient Dynamics & Setpoint Stability

A direct-to-chip loop has almost no thermal inertia to hide behind, so when thousands of GPUs slam their power in unison the cooling controls must answer in seconds — and the line between a stable answer and a self-oscillating one is set at design time by slew limits, loop tuning, and a dew-point margin you cannot tune away.

POWER-BOUND · GOODPUT

What you'll decide here

How fast your CDU control valves and pump VFDs are allowed to slew — fast enough to chase a synchronized GPU load step before the chip throttles, slow enough that the loop does not hunt — and where that tuning lives (CDU PLC vs facility BMS vs a predictive supervisor).
Whether you ride a synchronized power slam with stored thermal/electrical inertia (BBU/BESS power-smoothing, a warm-water buffer volume, a flywheel-like reserve) or with a faster control response — and which is cheaper for your density tier.
The dew-point margin and the condensation-risk policy on a rapid load drop: how many degrees above space dew point you hold secondary supply, and what the controls do when load collapses faster than the setpoint can rise.
How cooling controls coordinate with GPU power-capping and the chip-to-BBU-to-BESS electrical spine so that a thermal excursion throttles compute gracefully instead of tripping it — the goodput-vs-trip decision.
What you commission and demonstrate at Level 5: a real synchronized load-step test, an anti-hunting margin, and a dew-point excursion test — not just a steady-state capacity proof.

Chapter 4.5 told the electrical side of this story: a frontier training cluster is the worst load the grid has ever been asked to serve, because tens of thousands of synchronized GPUs ramp from idle to full power and back on millisecond timescales, and the power chain has to absorb the di/dt with stored energy before it propagates upstream. This chapter is the thermal twin of that problem. The same synchronized slam that stresses the busbar also lands, a few hundred milliseconds later, as a step change in heat flux at the cold plate — and a direct-to-chip loop has almost none of the thermal inertia that a legacy chilled-water plant used to ride through. The decision this chapter forces is not how much cooling you install (Chapter 5.1 owns the capacity wall) but how the controls behave in the seconds around a transient: how fast they may move, how they avoid oscillating, and what they do when the load disappears as fast as it arrived.

Four forks structure the chapter: slew-limit tuning, inertia-vs-response, dew-point margin, and capping coordination. Each has a downstream cost when set wrong — a throttled GPU (lost goodput), a hunting loop (mechanical wear and unstable die temperatures), or condensation on a cold manifold inside an energized rack (the failure mode that ends up in Appendix F). None of these are exotic. They are the ordinary consequence of running a low-inertia loop under a load profile that swings harder and faster than any cooling plant was historically asked to follow.

Why DLC has no inertia to ride through a slam

Start with the physics, because it is the whole reason this chapter exists. A legacy air-cooled hall was a giant thermal flywheel. Tens of thousands of cubic metres of room air, a raised-floor plenum, the thermal mass of the slab and the racks themselves, plus a chilled-water plant carrying hundreds of thousands of litres in buffer tanks and pipe volume — all of it stored enough sensible heat that a control loop could be sluggish and the room would barely notice a load step. A CRAH that took thirty seconds to ramp a fan was fine, because the air temperature could not move quickly in a volume that large. The plant rode through transients on stored thermal mass.

A direct-to-chip loop deletes that buffer. The coolant in contact with the die is a thin film in a microchannel cold plate; the technology cooling system (TCS) loop volume serving an NVL72-class rack is on the order of 200 L, and the transport time from cold plate to CDU heat exchanger and back is seconds, not minutes. The thermal resistance from junction to coolant is deliberately tiny (that is the entire point of liquid cooling), which means the die temperature tracks coolant supply temperature almost rigidly. When GPU power steps by more than half of TDP in tens of milliseconds (measured: power variations exceeding 50% of TDP within a few AC cycles; current swings from a 5–10 A idle baseline to 20–25 A in under 200 ms on a single inference node), the heat arriving at the cold plate steps with it, and there is no large mass of warm water to soak it up. The coolant supply temperature is now the only thing standing between the load step and a junction-temperature excursion, and that supply temperature is whatever your control loop is making it, right now.

This is the inversion that catches operators migrating from air: in the old world, the cooling plant was slow and that was safe; in the DLC world, the cooling plant is the fast element in a tightly-coupled loop, and slowness is what gets you throttled. The GB200 envelope is unforgiving precisely because the coupling is tight — coolant inlet held near 20–25 °C, ~80 L/min flow per rack, a small rise across the cold plates — and deviation throttles the GPUs by up to ~50%. The control system is not a comfort function anymore. It is in the goodput path.

The lag stack: where the seconds go

A synchronized load step propagates through a chain of delays, and stability lives in the gaps. The electrical side moves first and fastest: local board capacitance buffers microseconds, the PSU exhibits a ~30–50 ms "inertia" between the GPU current step and its AC-input response, and the upstream converters have a control bandwidth of only a few kilohertz. The thermal side is slower by two to three orders of magnitude: heat reaches the coolant in milliseconds, but the temperature signal has to transport from cold plate to the CDU sensor (seconds), the CDU PID has to react, the control valve or pump VFD has to slew (seconds, by design limit), and the new coolant temperature has to transport back to the rack (seconds again). Add it up and the loop's closed-loop response to a step is on the order of 5–30 seconds, against a load that moved in milliseconds. You cannot close that gap by reacting faster; the transport lag is physical. You close it by anticipating (feedforward from the power signal) or by absorbing (electrical and thermal buffering), which is the central fork of this chapter.

CDU slew limits and the lag-vs-stability fork

The CDU is where the control decision physically happens. Its two actuators are the pump VFD (setting flow) and a modulating control valve — usually a three-way mixing valve on the facility-water side of the heat exchanger, which sets how much heat the TCS loop rejects and therefore the secondary supply temperature. Both actuators have slew limits: the maximum rate at which they are permitted to change. Vendor-typical envelopes for a predictive deployment sit around ±3% pump RPM per minute and ≤10% valve travel per minute, and a well-built CDU holds secondary supply within ±0.5 °C of setpoint in steady state.

The slew limit is the fork, and it cuts both ways. Set it too slow and the loop cannot chase a synchronized load step: secondary supply temperature lags the heat input, the coolant arriving at the cold plate is warmer than the die can tolerate, and the GPU throttles to protect itself — lost goodput, the exact thing a frontier training run cannot afford. Set it too fast and you invite the opposite failure: the valve and pump overshoot the correction, the over-corrected temperature transports back to the sensor a few seconds later, the controller over-corrects the other way, and the loop hunts — a sustained oscillation in flow, temperature, and pressure that wears valve seats and pump bearings, fatigues quick-disconnects and hoses with pressure cycling, and swings die temperatures around the throttle threshold so the GPUs themselves oscillate between full and capped. Hunting is the cooling-controls failure mode catalogued in Appendix F, and it is almost always a tuning fault, not a hardware one.
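Mechanically, a slew limit is a rate limiter on the actuator command. The sketch below is a minimal illustration, not a vendor implementation: the function names and the 1 s control interval are hypothetical, and only the ≤10% valve travel per minute figure comes from the text above.

```python
def slew_limit(current: float, target: float,
               max_rate_per_min: float, dt_s: float) -> float:
    """Move `current` toward `target` by at most max_rate_per_min * dt_s / 60."""
    max_step = max_rate_per_min * dt_s / 60.0
    delta = target - current
    if delta > max_step:
        return current + max_step
    if delta < -max_step:
        return current - max_step
    return target  # close enough to land exactly on the target


def seconds_to_reach(start: float, target: float,
                     max_rate_per_min: float, dt_s: float = 1.0) -> float:
    """Simulate the limiter and return how long the actuator takes to arrive."""
    position, elapsed = start, 0.0
    while position != target:
        position = slew_limit(position, target, max_rate_per_min, dt_s)
        elapsed += dt_s
    return elapsed


if __name__ == "__main__":
    # A 10-point valve move (40% -> 50%) under a 10 %/min limit.
    print(seconds_to_reach(40.0, 50.0, max_rate_per_min=10.0))
```

Under these assumptions a 10-point change in valve position takes about a minute to complete, which shows how much of the response budget the slew limit itself consumes.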

The reason this is hard is the transport lag in the callout above: any loop with significant dead time between actuation and feedback is prone to oscillation if its gain is too high. Classical fixes apply — lower proportional gain, lengthen the integral time, add a setpoint deadband so small excursions do not chase the actuator, and enable anti-windup so the integrator does not saturate while the valve is slew-limited at its travel cap. But the more durable answer in 2026 practice is to stop relying on feedback alone.

Three control strategies for a low-inertia loop under synchronized slams

Strategy	How it answers a slam	Throttle risk	Added capital / complexity	Best fit
Reactive feedback (PID only)	Senses temperature rise, then slews valve/pump within limits	High — transport lag means coolant reacts after the die has already heated	Lowest; native CDU PLC	Loosely-coupled inference, modest density, gentle load profiles
Feedforward / predictive supervisor	Reads the GPU power/utilization signal and pre-positions valve and pump before the heat arrives	Low — cooling moves with the load, not after it; die held within ~2 °C of throttle on steep spikes	Moderate; telemetry integration + supervisory controller; ~22–28% CDU energy savings reported	Dense training/RL with synchronized, predictable steps
Absorb with inertia (buffer + smoothing)	Rides the step on stored energy — warm-water buffer volume thermally, BBU/BESS power-smoothing electrically	Low — the step is blunted before controls must respond	Highest; buffer tankage, BESS/BBU sized for transient duty	Highest-density racks where slew limits cannot keep up at all

The fork is feedback-only reactive control vs feedforward/predictive vs absorb-with-inertia. Figures are 2026 practitioner/vendor ranges; see the key numbers later in this chapter for sources and vintages. Each row trades response speed, capital, and stability differently.

The three strategies are a spectrum, and real facilities blend them. The center of gravity in 2026 has moved decisively from the top row toward the middle: a purely reactive PID loop is fine for an inference hall with gentle load, but it is increasingly inadequate for a dense training cluster whose load steps are large, fast, and — critically — synchronized and predictable. Predictability is the lever. A training step is a repeating waveform: the cluster slams to full power during compute, drops during a collective or a checkpoint, slams again. If the controls can see that waveform — via the GPU power-management telemetry, the scheduler, or even the BMC power signal — they can feedforward the valve and pump position so the coolant is already cold when the heat arrives, instead of chasing it. Predictive deployments that do this report holding die temperatures within ~2 °C of the throttle limit through the steepest spikes while cutting CDU energy 22–28% (because the loop stops over-pumping and over-cooling defensively). The bottom row — buy your way out with inertia — is the fallback when even feedforward cannot keep up, and it is where the cooling and electrical problems converge.
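At its simplest, the pre-positioning described above is a heat balance: given a forecast heat load, compute the flow that carries it at a chosen coolant temperature rise, and command the pump there before the heat arrives. The sketch below assumes plain water (glycol mixes have a lower specific heat), a pump whose flow scales linearly with speed, and hypothetical names and limits throughout. How the forecast is obtained from telemetry or the scheduler is not shown.

```python
CP_KJ_PER_KG_C = 4.18      # specific heat of water (assumption: no glycol)
DENSITY_KG_PER_L = 1.0     # density of water


def feedforward_flow_lpm(forecast_heat_kw: float, target_rise_c: float) -> float:
    """Flow (L/min) that carries forecast_heat_kw at the target coolant rise."""
    if target_rise_c <= 0:
        raise ValueError("target_rise_c must be positive")
    kg_per_s = forecast_heat_kw / (CP_KJ_PER_KG_C * target_rise_c)
    return kg_per_s / DENSITY_KG_PER_L * 60.0


def pump_command_pct(forecast_heat_kw: float, target_rise_c: float,
                     flow_at_full_speed_lpm: float, min_pct: float = 20.0) -> float:
    """Feedforward pump speed in percent, clamped to [min_pct, 100].

    Assumes flow is proportional to speed, which only holds approximately
    on a fixed system curve.
    """
    flow = feedforward_flow_lpm(forecast_heat_kw, target_rise_c)
    pct = 100.0 * flow / flow_at_full_speed_lpm
    return min(100.0, max(min_pct, pct))


if __name__ == "__main__":
    # Hypothetical step: 60 kW forecast, 10 °C allowed rise, 120 L/min pump.
    print(round(feedforward_flow_lpm(60.0, 10.0), 1), "L/min")
    print(round(pump_command_pct(60.0, 10.0, 120.0), 1), "% speed")
```

In a blended design this value would be the starting point the feedback loop trims, not a replacement for it.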

The electrical spine: capping, BBU, and BESS as transient absorbers

The cleanest way to make a cooling-controls problem tractable is to make the load step smaller before it ever reaches the cold plate — which is exactly what the chip-to-BBU-to-BESS spine of Chapter 4.5 does. Three coordinated levers blunt the slam:

GPU power-capping is the fastest and cheapest. The accelerator's own power-management firmware can clamp or ramp-limit board power, smoothing the di/dt at the source. Vendors have moved this on-die: NVIDIA's Rubin generation builds in power-smoothing on the order of a few hundred joules per GPU to flatten the worst transients before they leave the board. A cap is also the graceful-degradation lever for cooling: when a thermal excursion is developing, throttling compute a few percent to hold die temperature is almost always cheaper than letting the loop chase it into instability or, worse, letting a chip trip. This is the goodput decision — a small, deliberate cap costs a sliver of throughput; an uncontrolled excursion costs a restart.

Battery backup units (BBUs) and facility BESS absorb the electrical transient so the grid, and indirectly the thermal load profile, sees a smoother draw. NVIDIA's production-BESS guidance for AI factories names transient absorption and power-smoothing as first-class BESS roles alongside ride-through, with closed-loop state-of-charge control sized to the workload's transient signature. A facility that smooths its electrical transient also smooths the thermal one a beat later, because flatter power means flatter heat flux: the cooling controls inherit an easier waveform to follow.

The coordination decision is who arbitrates. If GPU capping, BBU smoothing, and CDU controls each react independently to the same slam, they can fight: the cap reduces heat just as the CDU finishes slewing colder, and now the loop overshoots cold (toward the dew-point floor, below). The mature pattern is a supervisory layer that knows the load forecast and sequences the responses — cap and smooth first, feedforward the cooling to the post-cap heat profile — rather than three reactive loops racing each other. → Chapter 4.5 owns the electrical-spine design; Chapter 14.7 owns the in-operation power-and-thermal management policy.

The condensation window on a rapid load drop

The dangerous transient is not the slam — it is the drop. When a synchronized job finishes a step, checkpoints, or simply fails, cluster heat load can collapse from full to near-idle in under a second. The CDU was holding cold supply to feed the slam; now the heat is gone but the coolant is still cold and the loop's setpoint can only rise at its slew limit. For a window of seconds, secondary supply is colder than the load needs — and if it drops below the dew point of the air around the rack, water condenses on the cold manifold, the quick-disconnects, and the supply hoses, inside an energized rack. That is a direct path to a short, corrosion, or a leak-detection trip. The defense is a hard floor: hold secondary supply at least ~2 °C above the measured space dew point at all times, clamp the setpoint so no control action — reactive, feedforward, or operator — can drive it below that floor, and measure dew point locally (it varies across a hall) rather than trusting a single facility sensor. The dew-point floor is the one setpoint in this chapter you do not get to tune for performance.
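The hard floor described above can be sketched as a clamp applied after every other control decision. This is a minimal illustration under stated assumptions: dew point is estimated with the Magnus approximation (coefficients 17.62 and 243.12 °C), the clamp takes the highest of several local readings as a conservative choice, and the function names are hypothetical. Only the ~2 °C margin comes from the text.

```python
import math
from typing import Sequence

MAGNUS_A = 17.62       # Magnus coefficient (dimensionless)
MAGNUS_B_C = 243.12    # Magnus coefficient (°C)


def dew_point_c(air_temp_c: float, rh_pct: float) -> float:
    """Approximate dew point from dry-bulb temperature and relative humidity."""
    if not 0.0 < rh_pct <= 100.0:
        raise ValueError("rh_pct must be in (0, 100]")
    gamma = (math.log(rh_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B_C + air_temp_c))
    return MAGNUS_B_C * gamma / (MAGNUS_A - gamma)


def clamp_supply_setpoint(requested_c: float,
                          local_dew_points_c: Sequence[float],
                          margin_c: float = 2.0) -> float:
    """Never let the secondary-supply setpoint fall below dew point + margin.

    Uses the worst (highest) local dew point so one humid corner of the
    hall sets the floor for the loop that serves it.
    """
    floor_c = max(local_dew_points_c) + margin_c
    return max(requested_c, floor_c)


if __name__ == "__main__":
    sensors = [dew_point_c(24.0, 50.0), dew_point_c(26.0, 55.0)]
    print(clamp_supply_setpoint(12.0, sensors))   # raised to the floor
    print(clamp_supply_setpoint(25.0, sensors))   # already safe, unchanged
```

Putting the clamp last in the chain is the design point: reactive, feedforward, and operator requests all pass through it, so none of them can command a setpoint below the floor.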

The dew-point floor interacts with the warm-water trend in an awkward way. Everything in Part 5 pushes coolant warmer — 30 °C-plus facility water to maximize free cooling and heat reuse (Chapter 5.1) — and warmer supply is comfortably above any realistic dew point, so condensation risk falls as supply temperature rises. Good. But warmer supply also means a smaller temperature margin between coolant and the die's throttle limit, which makes the loop less forgiving of a transient overshoot: there is less thermal headroom to absorb a momentary lag before the GPU throttles. So the warm-water decision is itself a transient-stability decision — it trades condensation risk down for throttle-margin risk up, and the control loop has to be tuned for the regime you actually run, near the middle of the operating band, not at a corner.

Key number	What it measures	Vintage and source
>50% TDP	GPU power swing within a few AC cycles (~40–80 ms); idle-to-full step under ~200 ms on an inference node	2025; arXiv 2502.01647 (AI Load Dynamics: A Power Electronics Perspective)
30–50 ms	PSU "inertia" lag between GPU current step and AC-input response; upstream converter bandwidth only a few kHz	2025; arXiv 2502.01647
±3% RPM/min, ≤10% valve/min	Vendor-typical CDU pump and valve slew limits in predictive deployments	2026; ProphetStor, Predictive Liquid Cooling for AI Data Centers
22–28%	CDU energy saved by predictive control while holding die temps within ~2 °C of throttle on steep spikes	2026; ProphetStor; corroborated by digital-twin studies
±0.5 °C	Secondary-supply temperature band a well-tuned CDU holds around setpoint in steady state	2025; Vertiv / CDU vendor control specs
~2 °C above dew point	Minimum secondary-supply margin CDUs hold to prevent condensation on the TCS loop	2025; ASHRAE TC 9.9; CDU control-method patents (US 12,225,688)
20–25 °C / ~80 L/min	GB200 NVL72 coolant inlet and flow; deviation throttles GPUs up to ~50%	2025; NVIDIA OCP GB200 NVL72 contribution / Introl
~400 J/GPU	On-die power-smoothing energy (Rubin generation) flattening transients at the source	2025; NVIDIA, Designing Production-Ready BESS for AI Factories

Anti-hunting: tuning the loop so it does not eat itself

Hunting deserves its own treatment because it is the most common preventable failure in a commissioned liquid plant, and because it is invisible at steady state — it only shows up under the synchronized-load conditions the workload actually creates. The mechanism is dead-time-driven oscillation: a loop with several seconds of transport delay between actuator and sensor will oscillate if its loop gain is high enough that a correction returns to the sensor still large enough to provoke an opposite correction. Three knobs and one architecture decision keep it stable.

Gain and integral time. Lower proportional gain so a single correction does not overshoot; lengthen integral time so the controller does not wind up faster than the loop can respond. The classic symptom of too-aggressive tuning here is a regular sinusoidal swing in supply temperature with a period of roughly 4–6x the transport dead time.
Deadband. A small setpoint deadband (e.g. ±0.3–0.5 °C) stops the loop from chasing sensor noise and micro-excursions, which is what drives high-frequency valve dithering and seat wear. The cost is a slightly looser steady-state band — an easy trade.
Anti-windup. Because the actuators are slew-limited, the integrator will try to demand more correction than the valve can deliver during a fast transient. Without anti-windup the integrator saturates, and when the transient passes the loop dumps the accumulated error as a giant overshoot — straight toward the dew-point floor on a load drop. Anti-windup (clamping or back-calculation) is not optional on a slew-limited loop.
Architecture: feedforward beats feedback for dead-time loops. No amount of feedback tuning beats the transport lag; the structural fix is to add a feedforward term from the load signal so the loop is mostly open-loop-correct and the PID only trims the residual. This is the same insight as the predictive-supervisor row in the table — it is the difference between chasing the load and leading it.
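The three knobs and the feedforward term above can be combined in one controller step. The sketch below is illustrative only: the class name, gains, and units are hypothetical, the output is a generic cooling demand in percent, and anti-windup is done by clamping (conditional integration), one of the two options named above. The integrator is frozen whenever the slew or travel limit stops the actuator from delivering what was asked, unless the error is already pulling the demand back toward what the actuator can deliver.

```python
from dataclasses import dataclass


@dataclass
class TrimPI:
    """PI trim around a feedforward term, with deadband, slew limit, anti-windup."""
    kp: float                  # % demand per °C of error (hypothetical gain)
    ti_s: float                # integral time in seconds
    deadband_c: float          # errors smaller than this are ignored
    max_rate_pct_per_s: float  # actuator slew limit
    out_min: float = 0.0
    out_max: float = 100.0
    integral: float = 0.0      # accumulated integral contribution, in %
    last_out: float = 0.0      # last command actually applied, in %

    def step(self, setpoint_c: float, measured_c: float,
             feedforward_pct: float, dt_s: float) -> float:
        error = measured_c - setpoint_c        # positive = too warm = more cooling
        if abs(error) < self.deadband_c:
            error = 0.0                        # deadband: do not chase noise
        wanted = feedforward_pct + self.kp * error + self.integral

        # Apply the slew limit, then the travel limits.
        max_step = self.max_rate_pct_per_s * dt_s
        applied = min(max(wanted, self.last_out - max_step),
                      self.last_out + max_step)
        applied = min(max(applied, self.out_min), self.out_max)

        # Anti-windup by clamping: integrate only if the actuator kept up,
        # or if the error is driving the demand back toward the limit.
        limited = applied != wanted
        unwinding = (wanted - applied) * error < 0
        if not limited or unwinding:
            self.integral += self.kp / self.ti_s * error * dt_s

        self.last_out = applied
        return applied
```

With the feedforward term carrying most of the correction, `kp` can stay low and `ti_s` long, which is the tuning direction the list above recommends for a loop with significant dead time.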

One more anti-pattern worth naming: controls fighting across layers. If the CDU PLC, the facility BMS, and a GPU-side thermal manager each run their own loop against overlapping setpoints, they interact like coupled oscillators and the whole system hunts even when each loop is individually stable. Decide which layer owns the secondary-supply setpoint and let the others observe, not actuate. This is the thermal analogue of the electrical arbitration problem above — same disease, same cure: one arbiter, clear ownership.

Deep dive: sizing a warm-water buffer to ride a synchronized step

When slew limits genuinely cannot keep up — the highest-density racks, the steepest steps — the absorb-with-inertia row of the table becomes the answer, and the question is how much buffer volume you need. The arithmetic is a thermal-capacitance balance. A buffer of volume V (litres) of water (specific heat ~4.18 kJ/kg·°C, density ~1 kg/L) absorbing a heat-load step ΔQ (kW) over the loop's response time t (seconds) before the controls catch up will rise in temperature by roughly ΔT ≈ (ΔQ · t) / (4.18 · V). Invert it: to cap the supply-temperature excursion at, say, 2 °C while a 60 kW step is absorbed over a 10-second control-response window, you need V ≈ (60 · 10) / (4.18 · 2) ≈ 72 L of dedicated buffer beyond the loop's working volume.
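The balance above is easy to wrap as a pair of helper functions for comparing scenarios. This is a minimal sketch using the same water properties as the text; the function names are hypothetical, and the model ignores mixing effectiveness and heat already being rejected during the response window.

```python
WATER_CP_KJ_PER_KG_C = 4.18   # specific heat of water
WATER_DENSITY_KG_PER_L = 1.0  # density of water


def buffer_temp_rise_c(step_kw: float, response_s: float, volume_l: float) -> float:
    """Temperature rise of a water buffer absorbing step_kw for response_s."""
    mass_kg = volume_l * WATER_DENSITY_KG_PER_L
    return step_kw * response_s / (WATER_CP_KJ_PER_KG_C * mass_kg)


def buffer_volume_l(step_kw: float, response_s: float, max_rise_c: float) -> float:
    """Buffer volume needed to cap the excursion at max_rise_c."""
    if max_rise_c <= 0:
        raise ValueError("max_rise_c must be positive")
    mass_kg = step_kw * response_s / (WATER_CP_KJ_PER_KG_C * max_rise_c)
    return mass_kg / WATER_DENSITY_KG_PER_L


if __name__ == "__main__":
    # The worked case from the text: 60 kW step, 10 s window, 2 °C cap.
    print(round(buffer_volume_l(60.0, 10.0, 2.0)), "L")
```

Because volume scales linearly with both the step size and the response window, halving either one (by smoothing the load or by leading it with feedforward) halves the tank.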

That is a modest tank for one rack and a large one for a hall — which is exactly why buffering is the fallback, not the default. It also explains why warmer facility water helps twice: a larger absolute temperature band to the throttle limit means you can tolerate a bigger ΔT excursion, so the same buffer volume rides a bigger step. The competing approach is to move the inertia onto the electrical side instead — BBU/BESS power-smoothing flattens the load step so ΔQ itself shrinks, which is often cheaper than tankage because the BESS is already there for ride-through (Chapter 4.5). The design choice between thermal buffer and electrical smoothing is a capital comparison, and it is density-dependent: below the slew-limit-keeps-up threshold you need neither; above it, electrical smoothing usually wins until the very top of the density curve, where you do both. → heat-rejection and loop sizing in Chapter 5.1; DLC loop architecture in Chapter 5.4.

Commissioning the transient, not just the capacity

Here is the operational consequence that most directly costs money: a cooling plant that passes a steady-state capacity test can still fail catastrophically under the workload's real transient profile, because the two are different physics. A Level 4 capacity proof says the plant can reject the rated heat at the rated flow indefinitely. It says nothing about whether the loop hunts when the load steps, whether the controls catch a slam before the die throttles, or whether the setpoint clamps above dew point on a fast drop. Those have to be demonstrated explicitly — which is why the integrated-systems-test (IST) for a liquid-cooled hall has to include transient cases, not just the load-bank-at-full sequence.

The transient acceptance set is short and specific: a synchronized step test (drive the load from low to full and back as fast as the load banks or a real workload allow, and confirm the controls catch it without throttling and without overshoot), an anti-hunting demonstration (hold the load at the worst-case operating point and confirm the loop settles inside its band without sustained oscillation), and a dew-point excursion test (drop the load fast and confirm the setpoint clamp holds secondary supply above the measured dew point throughout). Pair these with the metering and telemetry that make them observable in operation — the per-rack power, flow, supply-temperature, and dew-point signals are exactly the ones cooling controls need and the ones Chapter 4.9 meters. → cooling acceptance and CDU commissioning in Chapter 13.5; the full Level 5 IST and failure-mode demonstration in Chapter 13.6.
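The acceptance set above is only useful if the logged traces are judged against explicit criteria. The sketch below shows one possible shape for those checks on sampled data; the function names, the pass thresholds, and the use of setpoint crossings as a rough hunting indicator are all assumptions, not a commissioning standard.

```python
from typing import Sequence


def dew_point_margin_ok(supply_c: Sequence[float], dew_point_c: Sequence[float],
                        min_margin_c: float = 2.0) -> bool:
    """True if supply stayed at least min_margin_c above dew point at every sample."""
    return all(s - d >= min_margin_c for s, d in zip(supply_c, dew_point_c))


def settles_within_band(supply_c: Sequence[float], setpoint_c: float,
                        band_c: float, settle_samples: int) -> bool:
    """True if the final settle_samples readings all sit inside ±band_c."""
    if settle_samples <= 0:
        raise ValueError("settle_samples must be positive")
    tail = list(supply_c)[-settle_samples:]
    return (len(tail) == settle_samples
            and all(abs(s - setpoint_c) <= band_c for s in tail))


def count_setpoint_crossings(supply_c: Sequence[float], setpoint_c: float) -> int:
    """Count sign changes around the setpoint; many crossings suggest hunting."""
    signs = [1 if s > setpoint_c else -1 for s in supply_c if s != setpoint_c]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


if __name__ == "__main__":
    damped = [22.0, 23.5, 21.2, 22.6, 21.8, 22.2, 22.1, 22.0, 22.0]
    hunting = [23.0, 21.0, 23.0, 21.0, 23.0, 21.0]
    print(settles_within_band(damped, 22.0, 0.5, 4))    # settles
    print(settles_within_band(hunting, 22.0, 0.5, 4))   # does not settle
    print(count_setpoint_crossings(hunting, 22.0))
```

The same checks can run continuously on operational telemetry, so a loop that drifts into hunting after commissioning is caught by the same criteria it was accepted against.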

Where does the setpoint live?

The single most consequential architecture decision in this chapter is which layer owns the secondary-supply setpoint and the actuator commands. Three candidates: the CDU PLC (fast, local, vendor-supplied, but blind to the load forecast), the facility BMS (sees the whole hall, but slow and not built for sub-minute control), or a predictive supervisor that ingests GPU power telemetry and the schedule to feedforward the loop. The reactive default — CDU PLC alone — is adequate for inference but throttles dense training under synchronized slams. The 2026 trajectory is a thin predictive supervisor sitting above the CDU PLC, feeding it feedforward setpoints while the PLC retains the fast local safety loop (including the dew-point clamp, which must live where it cannot be overridden). Decide this before you commission: retrofitting a feedforward path onto a plant tuned for reactive control means re-tuning every loop and re-running the transient acceptance set.

This chapter is the thermal twin of the electrical-transient problem in Chapter 4.5 (UPS/BBU/BESS ride-through and transient absorption), and the two must be designed together. The capacity wall and warm-water rationale that set the throttle and dew-point margins live in Chapter 5.1; the DLC loop architecture (TCS/FWS separation, CDU, flow and delta-T) in Chapter 5.4; thermal reliability, leak detection, and the consolidated coolant-leak/thermal-runaway catalog in Chapter 5.11 and Appendix F. The telemetry that makes transients observable is metered in Chapter 4.9; the goodput-vs-availability framing behind the capping decision in Chapter 12.2; cooling/CDU commissioning in Chapter 13.5 and Level-5 IST in Chapter 13.6; and in-operation capacity, power, and thermal management in Chapter 14.7.