Part 14

Day-2 Operations, Upgrades & Lifecycle

14 chapters

Operational KPIs, Goodput & the Reliability Economics of AI Factories

An AI factory does not earn money when it is 'up' — it earns money when accelerators are doing useful work on the critical path, so the number that governs day-2 economics is not facility availability but goodput, and every reliability dollar must be justified against the goodput it buys, not the nines it adds.

DCIM, Telemetry & Observability for GPU-Dense, Liquid-Cooled Facilities

A GPU-dense liquid-cooled facility runs two telemetry universes — sub-second IT health and minute-scale OT plant — and the operator's defining decision is whether to correlate them into one goodput-aware control plane or leave them as two siloed pipelines that each blame the other when a cluster stalls.

Component Failure Modes, Failure Rates & Fleet Reliability Data

A GPU fleet does not fail like a traditional data center: it fails constantly, in three distinct ways (hard, transient, silent), and the only defensible design is to measure the per-component failure rate, accept that the mean time between cluster interruptions is hours, not months, and engineer detection and recovery around the failures you cannot prevent.

Reliability Engineering for Training (Operational)

At day-2 scale a frontier training job fails not as an exception but at a baseline rate, one interruption every few hours, so the operational job is not preventing failures but detecting, isolating, and recovering from them faster than they accumulate, because every minute of detect-to-recover comes straight out of goodput.

Predictive & Preventive Maintenance of Power and Cooling Plant

In a 24/7 synchronous AI factory the maintenance question is not 'is the plant reliable?' but 'can you service it without dropping the job?' — and the answer is set years earlier by whether you bought concurrent maintainability and built the condition-based program that lets you intervene on the equipment's schedule instead of the failure's.

Spares Strategy, RMA Logistics & Repair Operations

In a fleet of hundreds of thousands of accelerators, spares are not an inventory line but an availability instrument: the depth of the on-site pool and the speed of the hot-swap, not the OEM warranty, set how much goodput the cluster actually earns back from every failure.

Capacity, Power & Thermal Management in Operation

In a power-bound facility the megawatts you energized are a fixed, capital-intensive ceiling — the operational job is to fill that ceiling with goodput without tripping it, and every fork in this chapter is a trade between how full you run the budget and how violently the workload can swing it.

Firmware & Software Lifecycle Management at Fleet Scale

Firmware and software are the only fleet variables you change thousands of times a year on hardware that costs $30k+/GPU and earns ~$10–12B/GW/yr, so the discipline is not whether to update but how to roll change across a synchronized estate without sacrificing goodput, blowing a maintenance window, or shipping a bad bit to 100,000 GPUs at once.

Hardware Refresh, Depreciation Strategy, Decommissioning & ITAD

Refresh is the moment the depreciation assumption you underwrote in Chapter 1.8 stops being an accounting choice and becomes a physical operation — and how you execute it (when you pull the part, where it cascades, how you sanitize it, who you sell or shred it through) decides whether the residual value the whole financing case rests on is real money or a write-down with extra steps.

Facility Decommissioning, Repowering & Site Remediation

The facility outlives the silicon by a decade or more, and its end-of-life is a financial and environmental fork decided years before the last server leaves: repower the shell and keep the interconnection, demolish and restore the dirt, or convert to another use — and in a power-bound market the energized shell you are tempted to tear down is often the most valuable asset on the books.

Operations Organization, Workforce, Talent & Incident Command

The AI factory is a machine for converting megawatts into goodput, and it is run by people — so the org chart, the talent bench, and the incident-command model are not HR concerns downstream of the engineering, they are first-order reliability infrastructure, and the single largest controllable cause of outage is who is on shift and what procedure they are holding.

Operational Procedures, Change Management & Human-Error Control

Human error is the plurality root cause of preventable outages, and almost none of it is carelessness — it is the predictable output of bad procedure design, so the lever that moves availability most is not hiring better people but engineering the MOP/SOP/EOP regime and the change process that put those people in front of live plant.

Agentic Ops, RL Control & the Autonomy Ladder

Autonomy in the data center is not a switch you flip but a ladder you climb rung by rung — and the only defensible way up is to let an agent act unsupervised exactly as far as you have bounded its action space, instrumented its failure modes, and assigned the liability for when it is wrong.

Continuous & Re-Commissioning on a Live Campus

Commissioning is not a one-time gate you pass at go-live; on a live AI campus that adds a new accelerator generation every 12–18 months, the proof that the building still does what its drawings claim decays continuously — so re-commissioning becomes a standing operational program, triggered by drift and density steps, executed against a running revenue factory you cannot turn off.