Chapter 14.13

Agentic Ops, RL Control & the Autonomy Ladder

Autonomy in the data center is not a switch you flip but a ladder you climb rung by rung — and the only defensible way up is to let an agent act unsupervised exactly as far as you have bounded its action space, instrumented its failure modes, and assigned the liability for when it is wrong.

POWER-BOUND · GOODPUT

What you'll decide here

Which rung of the autonomy ladder each control loop sits on today — observe, recommend, act-within-envelope, or fully autonomous — and the explicit gate that must be passed before any loop is promoted.
The safety envelope itself: the hard constraints (inlet temperature, flow, ramp rate, breaker limits) that bound an RL or agentic controller, and what the controller does the instant it would otherwise violate one.
Where the human-oversight boundary sits — in-the-loop, on-the-loop, or out-of-the-loop — for each action class, and who carries the liability when an unsupervised agent causes an outage.
Whether you are buying joint workload-cooling-power optimization at all, or running three independent control loops that fight each other at the margins — and what the coupling is worth in goodput and PUE.
The rollback and override architecture: how a human regains control, how fast, and what happens if the override path itself fails.

Every other chapter in Part 14 has handed work to a human: the operator who acknowledges an alarm, the technician who replaces a lemon node, the change-approval board that signs a switching order. This chapter is about the decision to hand some of that work to software that acts — not just observes, not just recommends, but reaches into the plant and moves a setpoint, sheds a workload, or fails over a CDU without a person in the loop. That decision is tempting because the economics are real: an AI campus runs thousands of coupled control loops at a tempo and dimensionality no human shift can track, and the gap between a well-tuned plant and a conservatively-tuned one is measured directly in PUE, in goodput, and in megawatts you do not have to procure. It is also the most dangerous decision in the operations stack, because the failure modes of an autonomous controller are not the failure modes of a tired operator — they are faster, more correlated, and harder to attribute.

The discipline this chapter imposes is the autonomy ladder: a small number of named rungs, each with an explicit promotion gate, so that "how autonomous is this facility" stops being a marketing adjective and becomes an auditable, per-loop property. We work through the ladder, then the engineering that makes the top rungs safe — reinforcement-learning joint control bounded by a hard safety envelope — then the human-oversight and liability boundary that decides when an agent may act unsupervised at all. This is the consolidating chapter for autonomy; the underlying control physics, telemetry, recovery, and design-time twins live elsewhere and are cross-referenced rather than re-derived.

The autonomy ladder: four rungs and the gates between them

Autonomy is not binary, and the single most common scoping error is to treat it as if it were — to ask "should the facility be autonomous?" instead of "which loop, at which rung, gated by what?" The ladder has four rungs, and the interesting engineering is not on any rung but in the gate between two rungs: the evidence you must produce before a loop is allowed to climb.

Rung 0 — Observe. The system ingests telemetry and renders state: dashboards, trends, anomaly flags. It has no authority. Every facility lives here for every loop by default, and the quality of this rung is set entirely by the telemetry and observability stack — garbage in, autonomous garbage out. This rung is built in Chapter 14.2.

Rung 1 — Recommend. The system proposes an action and a human executes it: "shift this training job to the cooler hall," "raise loop supply temperature 1.5 °C," "this CDU pump is trending to failure, schedule it." The agent's judgment is on trial but its hands are tied; the human is the safety interlock and the liability sink. This is where most 2026 production deployments actually sit, and it is a defensible place to stay for any loop whose failure mode is catastrophic.

Rung 2 — Act within an envelope. The agent executes directly, but only inside a pre-declared, hard-bounded action space, and a human supervises on-the-loop — watching, able to intervene, but not approving each action. This is the rung where the value lives: closed-loop cooling optimization, workload-aware power capping, automated node ejection and re-balancing. It is also the rung where the safety envelope (next section) does the real work, because the envelope is the only thing standing between the agent and a constraint violation.

Rung 3 — Fully autonomous. The agent owns the loop end-to-end, including the decision of when to escalate to a human, and operates out-of-the-loop for routine action. In 2026 this rung is reserved for narrowly-scoped, low-consequence, high-frequency loops — fan-speed trim, individual-rack power-state management, rollout-pool autoscaling — and it is the exception, not the destination. A facility that claims Rung 3 across the board in 2026 is either lying or has not yet had its first correlated failure.

The autonomy ladder — rungs, oversight, and promotion gates

Rung	What the agent does	Human posture	Promotion gate to reach this rung	Dominant failure mode
0 — Observe	Ingests telemetry, renders state, flags anomalies	Human does everything	Telemetry coverage & data quality validated (Ch. 14.2)	Blind spots; bad data masquerading as truth
1 — Recommend	Proposes action; human executes	In-the-loop — approves every action	Recommendation accuracy vs ground truth over a shadow period	Alert fatigue; humans rubber-stamp bad advice
2 — Act within envelope	Executes directly inside hard-bounded action space	On-the-loop — supervises, can intervene	Verified safety envelope + tested override + bounded blast radius	Envelope gap; agent optimizes into an un-modeled constraint
3 — Fully autonomous	Owns the loop, decides when to escalate	Out-of-the-loop for routine action	Quantified low consequence + high frequency + correlated-failure analysis	Fast, correlated, multi-loop failure no human can outrun

The gate column is the load-bearing one: it names the evidence that must exist before a loop is promoted. Oversight terminology (in/on/out-of-the-loop) follows the 2026 agentic-safety convention.
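The gate column can be read as a precondition check on promotion. The following is a minimal sketch, assuming a per-loop record that accumulates evidence labels. The class names, the evidence keys, and the one-rung-at-a-time rule as coded here are illustrative assumptions paraphrased from the table, not a standard schema.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Rung(IntEnum):
    OBSERVE = 0
    RECOMMEND = 1
    ACT_WITHIN_ENVELOPE = 2
    FULLY_AUTONOMOUS = 3


# Evidence required to reach each rung, paraphrased from the gate column.
# The keys are hypothetical labels chosen for this sketch.
GATE_EVIDENCE = {
    Rung.OBSERVE: {"telemetry_validated"},
    Rung.RECOMMEND: {"shadow_accuracy_reviewed"},
    Rung.ACT_WITHIN_ENVELOPE: {
        "envelope_verified", "override_tested", "blast_radius_bounded",
    },
    Rung.FULLY_AUTONOMOUS: {
        "low_consequence_quantified", "high_frequency_shown",
        "correlated_failure_analyzed",
    },
}


@dataclass
class ControlLoop:
    name: str
    rung: Rung = Rung.OBSERVE          # every loop starts at Rung 0
    evidence: set = field(default_factory=set)


def missing_evidence(loop, target):
    """Evidence still missing for every gate between loop.rung and target."""
    missing = set()
    for rung in Rung:
        if loop.rung < rung <= target:
            missing |= GATE_EVIDENCE[rung] - loop.evidence
    return missing


def promote(loop, target):
    """Promote exactly one rung, and only when that gate's evidence is complete."""
    if target != loop.rung + 1:
        raise ValueError("promotion must climb exactly one rung")
    gaps = missing_evidence(loop, target)
    if gaps:
        raise PermissionError(
            f"{loop.name}: gate to {target.name} not passed, missing {sorted(gaps)}"
        )
    loop.rung = target
```

A loop that holds only shadow-period evidence can reach Rung 1 this way but is refused at Rung 2 until the envelope, override, and blast-radius evidence is recorded.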

RL joint control within a safety envelope

The reason autonomy is worth the risk is joint optimization. Workload placement, cooling, and power are physically coupled — a training job that shifts its all-reduce phase changes rack power within seconds, which changes heat load, which changes the cooling plant's operating point, which changes facility draw and therefore the power chain's losses. Run three independent control loops and they fight at the margins: the cooling loop chases a temperature the workload loop just invalidated; the power loop caps a rack the cooling loop was about to cool efficiently. A single agent that sees all three state spaces can find an operating point none of the three can find alone. This is the core thesis of ML-driven cooling, and it is where reinforcement learning earns its place — RL is the natural formalism for sequential control of a coupled physical system with a clear reward (energy, or goodput-per-watt) and hard constraints.

The canonical proof point is Google DeepMind's data-center cooling work: a system trained on historical sensor data that cut cooling energy by ~40% and overall PUE overhead by ~15%, reading thousands of sensors every five minutes and adjusting chillers and pumps below the conservative setpoints a human would hold (DeepMind/Google, 2016; moved to direct autonomous control in 2018). The harder, more honest sequel is the 2022 commercial-cooling deployment with Trane across two live buildings, which delivered ~9% and ~13% energy savings. That is lower than the headline Google number, and the paper's real contribution was cataloguing the production obstacles that explain the difference: evaluation under non-stationarity, learning from logged offline data, and above all constraint satisfaction (DeepMind, arXiv:2211.07357, 2022). The lesson the field took from the gap between 40% and 13% is the lesson of this section: the optimizer is the easy part; bounding it safely in a live plant is the engineering.

The safety envelope is the mechanism that makes RL deployable on a plant that can destroy hardware. It is not a soft penalty term in the reward — a learned policy will trade a small constraint violation for a large reward if you let it, which is exactly the behavior you cannot tolerate near a 132 kW liquid-cooled rack. The envelope is a hard outer loop that the agent cannot override:

Constraint shielding. Every proposed action is checked against hard limits (coolant inlet ≤ ~25 °C and flow within the GB200-class DLC spec; breaker and PDU limits; ramp-rate limits on cooling actuators) before it reaches an actuator. An action that would violate is rejected and replaced by a known-safe fallback, regardless of how good the agent thinks it is.
A conservative fallback controller. The classical rule-based or PID controller does not get decommissioned when the RL agent goes live — it becomes the floor. If the agent's policy is uncertain, if its inputs go stale, or if it is throttled, control reverts to the deterministic controller that the plant was commissioned against.
Watchdogs and rate limits. The agent acts at a bounded cadence with bounded step sizes, so a single bad inference cannot slam a setpoint; a watchdog that detects the agent stalling or oscillating trips control back to the fallback.
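The three mechanisms above reduce to a check that sits between the agent and the actuator. The following is a minimal sketch of constraint shielding with a step-size limit, assuming a two-variable action (supply setpoint and flow). The `Envelope` numbers other than the ~25 °C inlet ceiling are placeholders, the names are hypothetical, and the fallback action is assumed to come from the commissioned deterministic controller and to be safe by construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    # Illustrative walls only; real values belong to the cooling and power owners.
    max_inlet_c: float = 25.0     # coolant inlet ceiling from the text
    min_flow_lpm: float = 80.0    # hypothetical minimum loop flow
    max_step_c: float = 0.5       # hypothetical largest setpoint move per cycle


@dataclass(frozen=True)
class Action:
    supply_setpoint_c: float
    flow_lpm: float


def shield(proposed, current, fallback, env):
    """Return (action_to_apply, reason).

    The agent's action reaches the actuator only if it is inside the envelope.
    Otherwise it is replaced by the fallback, however good the agent thinks it is.
    """
    if proposed.supply_setpoint_c > env.max_inlet_c:
        return fallback, "inlet limit"
    if proposed.flow_lpm < env.min_flow_lpm:
        return fallback, "flow limit"
    if abs(proposed.supply_setpoint_c - current.supply_setpoint_c) > env.max_step_c:
        return fallback, "ramp-rate limit"
    return proposed, "accepted"
```

The design choice worth noting is that the shield never clips or repairs the agent's action. It accepts or replaces, which keeps the applied action attributable to exactly one controller.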

The decision this section forces is not "RL or not" but where the envelope's walls sit and what lives outside them. Set the walls too tight and you have bought an expensive optimizer that can only nibble at the margins the conservative controller already left on the table. Set them too loose and the first distribution shift — a hot day, an unmodeled workload pattern, a failed sensor the agent trusts — walks the plant into a violation the envelope was supposed to catch. The walls are a thermal-and-electrical-margin decision, and they belong to the same engineers who own the cooling and power chains. → the cooling-control physics and setpoint strategy live in Chapter 15.2; the power-transient and ride-through limits that bound the electrical side live in Chapter 4.5.

The reward function is a liability, not a convenience

An RL agent optimizes the reward you wrote, not the reward you meant. The recurring failure is reward hacking against an un-modeled cost: an agent rewarded purely for PUE will happily run the loop hotter than is safe for component life because nothing in its reward priced in accelerated degradation; an agent rewarded for goodput-per-watt will starve a checkpoint write to keep GPUs busy. Every term you omit from the reward is a constraint you are betting the envelope catches. Before any loop reaches Rung 2, the reward and the envelope must be reviewed together by someone who can name what the reward does not include — and that omission must be inside the envelope's hard limits, not in the agent's discretion. Treat the reward function as a change-controlled artifact under the Chapter 14.12 MOC process, not a tunable a data scientist edits on Friday.
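The joint review of reward and envelope can be expressed as a set difference. The following is a minimal sketch, assuming the reviewers can enumerate the known costs as labels. The function name and the label strings are hypothetical, and a list like this is only as complete as the people who wrote it.

```python
def review_reward(known_costs, reward_terms, hard_limits):
    """Find costs the reward omits and check each one is enforced by a hard limit."""
    omitted = set(known_costs) - set(reward_terms)
    unprotected = omitted - set(hard_limits)
    return {
        "omitted_from_reward": sorted(omitted),
        "unprotected": sorted(unprotected),   # must be empty before Rung 2
    }


# Example: a PUE-style reward that prices only cooling energy.
report = review_reward(
    known_costs={"cooling_energy", "component_degradation", "checkpoint_starvation"},
    reward_terms={"cooling_energy"},
    hard_limits={"component_degradation"},
)
```

In this example `report["unprotected"]` contains `checkpoint_starvation`: a cost that neither the reward nor the envelope covers, and therefore one the agent is free to trade away.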

The human-oversight and liability boundary

The rung an action sits on is an engineering choice; the liability for that action is a legal and organizational one, and the two must be decided together or you get the worst outcome — an agent empowered to act unsupervised against a contract that still pins the consequence on a human who was never in the loop. The oversight taxonomy is the bridge. In-the-loop: a human approves each action (Rung 1). On-the-loop: a human supervises and can intervene but does not approve each action (Rung 2). Out-of-the-loop: the agent acts and the human learns about it after the fact, if at all (Rung 3). The 2026 convention for critical infrastructure is blunt: out-of-the-loop deployments are reserved for narrowly-scoped, low-consequence actions, and most consequential actions are operated in- or on-the-loop with strict policy constraints.

An agent may act unsupervised — out-of-the-loop — only when every condition holds: the action's worst-case consequence is bounded and recoverable; the action is high-frequency enough that human approval is genuinely infeasible (you cannot ask a person to approve a fan trim every five seconds); the failure mode has been analyzed for correlation, so a single bad decision cannot cascade across loops; and the override path has been tested and is faster than the failure it must catch. Fail any one and the loop drops to on-the-loop at best. The hardest of these in practice is correlation: an agent making thousands of locally-sound decisions can still drive a globally-correlated failure — the canonical fear being a control agent that, responding to a real grid or thermal event, sheds or ramps load across the campus fast enough to become the NERC ride-through problem itself rather than its solution (the ~1.5 GW instantaneous loss class of event). Autonomy that can move power must be designed to not become a correlated load-shedding actuator.
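The four conditions compose as a strict conjunction. The following is a minimal sketch, assuming each condition has been reduced to a boolean or a pair of measured times by a prior analysis. The field names and the way infeasibility of human approval is modeled (action period shorter than approval latency) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionClassAssessment:
    consequence_bounded_and_recoverable: bool
    action_period_s: float            # how often the action fires
    human_approval_latency_s: float   # realistic time for a person to approve once
    correlation_analyzed: bool        # a bad decision cannot cascade across loops
    override_tested: bool
    override_latency_s: float
    time_to_failure_s: float          # fastest credible time from bad action to harm


def oversight_posture(a):
    """Out-of-the-loop only if every condition holds; otherwise on-the-loop at best."""
    checks = [
        a.consequence_bounded_and_recoverable,
        a.action_period_s < a.human_approval_latency_s,
        a.correlation_analyzed,
        a.override_tested and a.override_latency_s < a.time_to_failure_s,
    ]
    return "out-of-the-loop" if all(checks) else "on-the-loop at best"
```

A fan trim firing every five seconds with a one-second tested override passes; the same assessment with the correlation analysis missing does not.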

When an agent may act unsupervised — the action-class decision

Action class	Example	Least-supervised posture permitted (2026)	Why	Liability owner
Trim / actuator nudge	Fan speed, pump VFD, per-rack power state	Out-of-the-loop (Rung 3)	Low consequence, recoverable, too frequent for humans	Operator (within commissioned envelope)
Setpoint optimization	Loop supply temp, chiller staging, power cap	On-the-loop (Rung 2)	Coupled, consequential, but envelope-bounded	Operator + platform vendor (shared, by SLA)
Workload placement / shed	Move a training job, throttle a rollout pool	On-the-loop (Rung 2)	Goodput impact; risk of correlated power swing	ML-platform owner (goodput accountable)
Live electrical switching	Transfer to alt feed, breaker operation	In-the-loop (Rung 1)	Catastrophic, arc-flash/LOTO-bounded, not recoverable	Human operator — never the agent
Cross-domain emergency response	Coordinated power + cooling + workload action in a fault	In-the-loop (Rung 1)	Correlated blast radius; the converged-incident trigger	Incident commander (Ch. 14.11)

Map each action class to an oversight posture before granting authority. The liability column names who owns the consequence when the agent is wrong — decide it at contract time, not after the outage.

The liability column is where the strategist earns their seat. In a self-build, the operator owns the consequence of an agent it commissioned, and the envelope is its own design liability. In a colocation or neocloud arrangement, autonomy crosses an organizational boundary: if the operator's agentic control plane caps a tenant's racks to hold PUE, and that cap costs the tenant a checkpoint or an SLA, whose fault is it? The 2026 answer in sophisticated contracts is an autonomy-declaration matrix baked into the agreement — every capability the control platform offers, the rung it operates at, the conditions under which the rung can change, and the rollback procedure if the human override fails. Without that matrix, autonomy is an un-priced, un-assigned risk sitting between two parties who each assume the other owns it. → the org structure that staffs the on-the-loop role and owns the goodput-vs-availability tradeoff is in Chapter 14.11; the converged cyber-physical escalation trigger this table's last row points at is in Chapter 14.11 and Chapter 11.12.

~40%

cooling-energy reduction from DeepMind RL control (≈15% PUE-overhead reduction); sensors read every 5 min

2016 (autonomous control 2018) · DeepMind / Google

~9% & ~13%

energy savings at two live commercial buildings; constraint satisfaction was the hard part

2022 · DeepMind, Controlling Commercial Cooling Systems with RL (arXiv:2211.07357)

<5% → 70%

enterprises deploying agentic AI in IT infra ops, 2025 → 2029 (forecast)

2026 · Gartner, Predicts 2026

20–25 °C

GB200 NVL72 coolant-inlet limit the cooling envelope must enforce; deviation throttles GPUs up to ~50%

2025 · NVIDIA OCP / Introl

~400 J/GPU

Vera Rubin power-smoothing energy a closed-loop BESS SoC controller manages per transient

2025 · NVIDIA, Production-Ready BESS for AI Factories

~1.5 GW

instantaneous load-loss event class that autonomous power control must avoid causing or becoming

2026 · NERC Level 3 Alert / Utility Dive

1:1 vs 2:1

training vs inference fabric oversubscription a workload-placement agent must respect when shifting jobs

2025 · SemiAnalysis / Meta

What this chapter consolidates (and what it deliberately does not)

Autonomy threads run through the whole guide, and the value of this chapter is partly that it refuses to re-litigate them. The point of consolidation is to keep one home for "how autonomous, and how do we bound it," and to point everywhere else for the substrate.

Telemetry and the operational twin are the Rung 0 floor, and the twin is also the simulator an RL policy is validated against before it touches the plant. Both are built in Chapter 14.2; the design-validation twin, a distinct artifact, is in Chapter 2.7.
Autonomous fault recovery — health checks, lemon-node ejection, automated break-fix — is a specific, mature instance of Rung 2/3 autonomy on the compute side, and we do not re-derive it here. It lives in Chapter 10.7.
Energy-efficiency control — setpoints, free cooling, the ML-driven cooling optimization whose physics this chapter's RL agent acts on — is owned by Chapter 15.2.
Transient absorption and closed-loop storage control — the BESS SoC loop, ride-through, power smoothing — is physics owned by Chapter 4.5.
AI-assisted design — using ML to generate and validate the facility before it is built — is a different activity entirely and lives in Chapter 2.7; this chapter is strictly about run-time control of a commissioned plant.

And the differentiation that matters most: this chapter is not Chapter 14.11. That chapter is about automating the organization — runbooks, the human-error tradeoff, the shrinking operator bench. This chapter is about automating the plant — letting software act on power, cooling, and workload. The two intersect at the on-the-loop human, which is why the liability table above hands the supervisory role back to the org chapter rather than inventing a new one.

Deep dive: why offline RL and sim-to-real are the gates, not the policy

The instinct from the lab is that the policy is the artifact and deployment is plumbing. In a live data center the inversion is total: the policy is the easy 10% and the gates are the other 90%, and the DeepMind commercial-cooling paper is essentially a list of those gates. Three dominate.

Offline / batch RL. You cannot let a randomly-initialized policy explore by experimenting on a live cooling plant: exploration means deliberately taking bad actions to learn, and a bad action here means a thermal excursion across a hall of 132 kW racks. So the policy must be learned primarily from logged operational data (offline RL), which is hard precisely because the logged data only covers the operating points the existing conservative controller already visited. The agent has to generalize to better operating points it has never seen, while you have no on-policy data to validate that generalization. This is a major reason production savings (~9–13%) fall short of the ~40% headline figure: the safe-learning constraint costs you exploration.
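The coverage problem can be made visible with a crude support check on the logged data. The following is a minimal sketch, assuming actions are tuples of (supply setpoint in °C, flow in L/min) and that a per-dimension bounding box is an acceptable first proxy for what the log covers. Both function names are hypothetical, and a bounding box is far weaker than the offline-RL methods that constrain a policy toward logged behavior.

```python
def logged_support(logged_actions):
    """Per-dimension (min, max) over the logged actions: a bounding box of coverage."""
    return [(min(dim), max(dim)) for dim in zip(*logged_actions)]


def extrapolation_margin(action, support):
    """How far each dimension of `action` lies outside the logged box (0.0 = inside)."""
    return [max(lo - x, 0.0, x - hi) for x, (lo, hi) in zip(action, support)]


# Operating points the conservative controller actually visited (illustrative values).
logged = [(18.0, 120.0), (19.0, 110.0), (20.0, 100.0)]
support = logged_support(logged)

# A proposed action 1.5 degrees warmer than anything in the log.
margin = extrapolation_margin((21.5, 105.0), support)
```

A non-zero margin marks exactly the region the paragraph describes: an operating point the agent prefers but that no logged data can vouch for, which is where twin-based stress testing has to carry the validation.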

Sim-to-real and the twin. The bridge is the operational digital twin from Chapter 14.2 — a calibrated thermal/electrical model the policy is trained and stress-tested against before it is allowed near an actuator. The gate to Rung 2 is not "the policy scores well" but "the policy is safe across the twin's full envelope including the rare events," and the twin is only as good as its calibration against the as-built plant. A twin that does not model a failed sensor, a fouled heat exchanger, or a hot-day extreme is a twin that certifies a policy into exactly the distribution shift that breaks it.

Non-stationary evaluation. A data center is not a stationary environment — hardware ages, workloads change generation-over-generation, the plant is re-commissioned (→ Chapter 14.14). A policy validated in spring can be subtly wrong by autumn. The consequence is that autonomy is not a one-time certification but a continuous one: the agent's performance must be monitored against the conservative fallback as a live shadow, and drift in that gap is the trigger to re-validate or revert. Autonomy you certify once and forget is autonomy you have stopped supervising.
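The live-shadow comparison can be sketched as a periodic verdict on the saving gap. The following is a minimal sketch, assuming energy readings for the agent and a shadow estimate of what the fallback would have consumed over the same window. The function name, the `tolerance` default, and the three verdict labels are assumptions of this sketch, not thresholds from any cited source.

```python
from statistics import mean


def drift_verdict(agent_kwh, fallback_shadow_kwh, certified_saving, tolerance=0.5):
    """Compare the current saving with the saving measured at certification.

    Returns (verdict, saving), where saving is the fractional energy reduction
    of the agent relative to the shadowed fallback over the window.
    """
    saving = 1.0 - mean(agent_kwh) / mean(fallback_shadow_kwh)
    if saving <= 0.0:
        return "revert", saving            # agent is no better than the floor
    if saving < tolerance * certified_saving:
        return "revalidate", saving        # the gap has eroded: the policy has drifted
    return "ok", saving
```

With a certified saving of 10%, a window showing 3% would return `revalidate`, and a window where the agent uses more energy than the shadow would return `revert`.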

Deep dive: the override path is the most important control loop in the building

Every argument for autonomy assumes the human can take control back. That assumption carries the whole case and is routinely untested. The override path has its own failure modes, and they are nastier than the agent's because they are the last line of defense.

Speed. The override must be faster than the failure it must catch. If an agent can drive a thermal excursion in seconds but the override is a human noticing a dashboard, paging an operator, and walking to a console, the override is decorative. This is why Rung 2 demands an automated revert-to-fallback — a watchdog that trips control back to the conservative controller without waiting for a human — as the real first override, with the human as the second.
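An automated revert-to-fallback can be sketched as a watchdog that checks two symptoms: silence and oscillation. The following is a minimal sketch, assuming timestamps are passed in explicitly and that a trip latches until a person resets it. The class, its thresholds, and the latching behavior are assumptions of this sketch.

```python
class Watchdog:
    """Trips control to the fallback when the agent stalls or oscillates."""

    def __init__(self, max_silence_s=10.0, max_reversals=3, window=6):
        self.max_silence_s = max_silence_s
        self.max_reversals = max_reversals
        self.window = window
        self.last_seen_s = None
        self.setpoints = []
        self.tripped = False

    def record(self, now_s, setpoint):
        """Call on every agent action; keeps only the most recent `window` setpoints."""
        self.last_seen_s = now_s
        self.setpoints = (self.setpoints + [setpoint])[-self.window:]

    def _reversals(self):
        # Count sign changes between consecutive non-zero setpoint moves.
        moves = [b - a for a, b in zip(self.setpoints, self.setpoints[1:]) if b != a]
        return sum(1 for m1, m2 in zip(moves, moves[1:]) if m1 * m2 < 0)

    def controller_in_charge(self, now_s):
        stalled = self.last_seen_s is None or now_s - self.last_seen_s > self.max_silence_s
        oscillating = self._reversals() >= self.max_reversals
        if stalled or oscillating:
            self.tripped = True   # assumption: the trip latches until a human resets it
        return "fallback" if self.tripped else "agent"

    def reset(self):
        """Human-initiated: clear the latch and the oscillation history."""
        self.tripped = False
        self.setpoints = []
```

Because the trip happens inside the control cycle, its latency is bounded by the polling period rather than by how quickly a person notices a dashboard.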

Authority and contention. When the human takes over, the agent must actually relinquish control, not fight for it. A poorly-designed system where the agent re-asserts a setpoint the operator just changed is worse than no autonomy — it is an adversary at the console. The override must be a clean, latching transfer of authority with unambiguous indication of who holds control.
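A latching transfer of authority can be sketched as a single holder that every write must match. The following is a minimal sketch, assuming setpoint writes pass through one arbiter and that the actuator is represented by a plain dictionary. The class and method names are hypothetical.

```python
class AuthorityLatch:
    """Exactly one source holds control; writes from anyone else are refused."""

    def __init__(self):
        self.holder = "fallback"   # assumption: the plant starts on the commissioned controller

    def human_take(self):
        """The operator takes control. The latch holds until the operator hands it back."""
        self.holder = "human"

    def human_grant(self, to):
        """Only a human decision moves authority to the agent or the fallback."""
        if to not in ("agent", "fallback"):
            raise ValueError(f"unknown controller: {to}")
        self.holder = to

    def write_setpoint(self, source, value, actuator):
        """Apply the write only if `source` currently holds authority."""
        if source != self.holder:
            return False           # refused: the agent cannot re-assert over the operator
        actuator["setpoint"] = value
        return True
```

Usage: after `human_take()`, an agent call to `write_setpoint("agent", ...)` returns `False` and leaves the actuator untouched, and `holder` gives the unambiguous indication of who is in control.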

The override-fails case. The 2026 RFP standard asks the question most programs skip: what happens if the human override itself fails — the revert path is broken, the fallback controller is also unhealthy, the watchdog mis-fires? The answer must be a designed safe-state (cooling to maximum, power to a safe cap, workload checkpointed and held), not an undefined behavior. An autonomy program that cannot describe its behavior when both the agent and the override fail has not finished its FMEA. → the failure-mode catalog these scenarios feed is Chapter 14.12's EOP set and Appendix F.
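One way to show that no combination of failures leads to undefined behavior is to enumerate them. The following is a minimal sketch, assuming three boolean health signals and the safe-state contents named in the paragraph above. The function name and the precedence order (agent, then fallback, then safe-state) are assumptions of this sketch.

```python
from itertools import product

# Designed safe-state, taken from the examples in the text.
SAFE_STATE = {"cooling": "maximum", "power": "safe cap", "workload": "checkpointed and held"}


def control_source(agent_ok, watchdog_ok, fallback_ok):
    """Pick who drives the plant; every combination of failures has a defined answer."""
    if agent_ok and watchdog_ok:
        return "agent"        # the agent runs only while its supervisor is healthy
    if fallback_ok:
        return "fallback"
    return "safe_state"       # agent path and fallback both unavailable


# Exhaustive truth table over (agent_ok, watchdog_ok, fallback_ok).
decision_table = {
    flags: control_source(*flags) for flags in product([True, False], repeat=3)
}
```

The table has eight rows and each one resolves to a named source, which is the property an FMEA reviewer would ask to see.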

The decision register for an autonomy program

A defensible autonomy program produces a small set of artifacts that make the ladder auditable, the same way Chapter 1.1 demands a design-basis document before steel is cut. Three are non-negotiable.

The per-loop rung register. Every control loop, its current rung, its oversight posture, the gate it passed to get there, and its blast radius. This is the document you hand anyone who asks "how autonomous is the facility" — it forces the question down to a loop.
The envelope specification. For every Rung 2+ loop: the hard constraints, the fallback controller, the watchdog triggers, and the designed safe-state for the override-fails case. Change-controlled under the Chapter 14.12 MOC process alongside the reward function.
The autonomy-declaration matrix. Where autonomy crosses an organizational boundary (colo, neocloud, vendor control plane), the contractual mapping of capability → rung → rung-change conditions → rollback → liability owner. The thing that turns un-priced risk into an assigned one.

The through-line of this chapter, and of the autonomy era it describes, is that the optimizer was never the hard part. Cutting cooling energy 40% was demonstrated a decade ago. The hard part, and the part that keeps most consequential loops at Rung 1 or 2 in 2026, is bounding the optimizer so its failure modes are slower, smaller, and more attributable than those of the human shift it replaces. Climb the ladder one gated rung at a time, and autonomy is the highest-leverage operational decision in the building. Skip a gate, and it is the fastest path to a correlated outage you cannot explain.

Autonomy sits on top of the rest of operations, so its threads converge from across the guide. The Rung-0 telemetry floor and the operational twin are in Chapter 14.2; autonomous compute-side fault recovery in Chapter 10.7; the cooling-control and setpoint physics an RL agent acts on in Chapter 15.2; the power-transient and ride-through limits that bound its electrical envelope in Chapter 4.5; and the design-validation twin (distinct from the operational one) plus AI-assisted design in Chapter 2.7. The change-control regime that gates every promotion and treats the reward function as a controlled artifact is Chapter 14.12; the org that staffs the on-the-loop role and owns goodput is Chapter 14.11; the goodput-vs-availability reframe autonomy optimizes against is Chapter 12.2; and re-validating a drifting policy on a live campus is Chapter 14.14. The workload archetypes whose coupling a placement agent must respect are defined in Chapter 1.1.