The Definitive Guide to AI Data Centers

Chapter 0.1

Orientation: The AI Data Center as a Single Co-Designed Machine

An AI data center is not a building that happens to contain computers — it is one co-designed machine whose compute, fabric, power, cooling, and software are a single optimization, and the only honest objective function for that machine is TCO-per-unit-of-useful-work, not the local efficiency of any one subsystem.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

  1. Whether you will reason about the facility as one co-designed system — power, cooling, compute, fabric, software, all optimized jointly for goodput-per-dollar — or as five procurement silos that meet for the first time at integration, which is the most expensive place to discover they disagree.
  2. Which objective you are actually maximizing: subsystem efficiency (PUE, switch utilization, rack fill) or system goodput (useful tokens or training-steps per dollar per watt) — they are not the same target, and optimizing the first routinely degrades the second.
  3. Where in the two interlocking tracks — the facility track (grid → power → cooling → building) and the IT track (compute → fabric → storage → scheduler) — your organization actually has authority, because the seams between them are where projects fail.
  4. How you will hold the grid-to-chip power spine and the chip-to-atmosphere heat spine as continuous chains rather than handoffs, since every interface in those chains is a place to strand capacity or lose a nine.
  5. Whether you have internalized that the binding constraint is now megawatts and time-to-power, not chips — and therefore that the whole design must be read backward from the substation, not forward from the GPU.

This guide makes one claim before all others, and the rest of it is an unpacking of that claim: an AI data center is a single machine. The accelerators, the back-end fabric that lashes them into one computer, the power chain that feeds them, the cooling plant that carries their heat to the sky, and the software that schedules work across all of it are not five products that get assembled on a slab. They are five faces of one object, and the object has one job — to convert megawatts and capital into useful AI work as cheaply as possible. The traditional data center could afford to treat these as separate disciplines with separate vendors and separate org charts, because a 10 kW rack of web servers is loosely coupled to its building. A 132 kW liquid-cooled rack of accelerators running a synchronous training job is not. It is welded to its power chain, its water loop, and its fabric by physics, and a decision in any one of them propagates into all the others within milliseconds and across the entire capital stack.

The consequence of getting this wrong is incoherence, not mere inefficiency. A facility scoped as separate subsystems, each locally optimized by a different team to a different metric, produces a building where the power chain is sized for a density the cooling plant cannot remove, where the fabric is non-blocking for a workload that never needed it, where the redundancy is gold-plated for jobs that checkpoint anyway, and where the PUE is excellent and the goodput is mediocre. Every subsystem passes its own acceptance test, yet the machine as a whole underperforms its TCO model by a double-digit percentage. This chapter sets up the lens (co-design, one objective function, two interlocking tracks, two narrative spines) through which the sixteen Parts that follow are read. It is short on numbers and long on orientation by design; the numbers and their derivations live in the chapters this one points to.

The thesis: one machine, one objective function

The discipline that separates an AI factory from a server hall is that it optimizes a system objective, and that objective is goodput-per-dollar-per-watt — the useful work (training-steps that survive, tokens that meet their SLO) extracted per unit of capital and energy across the asset's economic life. Almost every subsystem metric you can name is a proxy for this, and every proxy lies in a different direction. PUE rewards spending less energy on cooling — but the lowest-PUE choice can throttle the GPUs and destroy more goodput than it saves in overhead. Rack fill rewards packing accelerators densely — but past the cooling cliff it strands power you paid to interconnect. Switch utilization rewards oversubscribing the fabric — but on a synchronous training job it collapses model-FLOPs-utilization and the whole cluster idles waiting on a straggler. Each local optimum is a real number a team can be measured against, and each one, pursued alone, can move the system objective the wrong way.

Co-design means the forks in this guide are coupled: the voltage you step down to, the temperature of the water you push to the cold plate, the oversubscription ratio on the back-end, the redundancy tier you commission, the county you site in — none of them can be decided in isolation without quietly pessimizing the others. The right mental model is a constrained optimization with one objective and many coupled variables, not a checklist of independently-graded subsystems. When you read a decision in a later chapter, the question to keep asking is not "is this subsystem efficient?" but "does this choice raise system goodput-per-dollar, net of what it costs the subsystems it touches?" → the metric stack that makes this measurable is defined in Chapter 0.3 and built out in Chapter 15.1; the reliability reframing of goodput-over-availability is in Chapter 12.2.
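The gap between a proxy metric and the system objective can be shown with a small calculation. The following is a minimal sketch, not the metric stack defined in Chapter 0.3: the class `DesignOption`, the tariff, the fixed cost, the PUE values, and the throttle fractions are all hypothetical numbers chosen only to illustrate how a lower-PUE option can still lose on goodput per dollar. Both options draw the same IT power, so the per-watt normalization cancels in the comparison.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760
ENERGY_PRICE_USD_PER_MWH = 60.0  # hypothetical tariff


@dataclass(frozen=True)
class DesignOption:
    name: str
    it_power_mw: float            # power delivered to the IT load
    pue: float                    # facility power / IT power
    throttle_fraction: float      # share of accelerator throughput lost to throttling
    annual_fixed_cost_usd: float  # annualized capex + non-energy opex (hypothetical)


def annual_goodput(option: DesignOption) -> float:
    """Useful work per year, in normalized units of unthrottled IT MWh."""
    return option.it_power_mw * HOURS_PER_YEAR * (1.0 - option.throttle_fraction)


def annual_cost_usd(option: DesignOption) -> float:
    """Fixed cost plus the energy bill for the whole facility (IT power x PUE)."""
    facility_mwh = option.it_power_mw * option.pue * HOURS_PER_YEAR
    return option.annual_fixed_cost_usd + facility_mwh * ENERGY_PRICE_USD_PER_MWH


def goodput_per_dollar(option: DesignOption) -> float:
    return annual_goodput(option) / annual_cost_usd(option)


# Two hypothetical designs for the same 100 MW of IT load.
lowest_pue = DesignOption("warmest loop, lowest PUE", 100.0, 1.08, 0.10, 400e6)
cooler_loop = DesignOption("cooler loop, no throttling", 100.0, 1.15, 0.00, 400e6)

for option in (lowest_pue, cooler_loop):
    print(f"{option.name}: PUE {option.pue:.2f}, "
          f"goodput/$ {goodput_per_dollar(option):.6f}")
```

Under these assumed inputs the option with the worse PUE delivers more useful work per dollar, because the energy saved by the warmer loop is smaller than the throughput lost to throttling. With different inputs the ranking can flip, which is the point: the comparison has to be made on the system objective, not on the proxy.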

Why 2024–2026 is the inflection: chip-bound to power-bound

The co-design thesis would have been true but academic in 2020, when racks drew 10–15 kW, air cooling was universal, and the binding constraint was how many accelerators you could buy. Three things broke simultaneously between 2024 and 2026 and turned co-design from good practice into survival.

First, density exploded. A 2023 H100 rack drew roughly 40 kW; a 2024 GB200 NVL72 draws ~120–132 kW; a 2025 GB300 lands near ~142 kW; and NVIDIA's roadmap puts the 2027 Rubin Ultra Kyber rack at ~600 kW on an 800 VDC power architecture (SemiAnalysis / NVIDIA roadmap, 2026). That is a roughly 15x climb in per-rack power in four years. Air cooling saturates around 41 kW/rack, so this trajectory does not merely strain the cooling plant — it crosses a discontinuity that makes direct-to-chip liquid mandatory and rewrites the building, the floor loading, and the water strategy. The density step is a wall the facility either was or was not designed to climb, not a slope you can ride.
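The arithmetic behind the density step can be checked directly. This sketch uses the per-rack figures quoted above (taking the upper end of each range) and treats the ~41 kW air ceiling as a hard threshold, which is a simplification; the function names are illustrative assumptions.

```python
AIR_COOLING_CEILING_KW = 41.0  # approximate air-cooling saturation point from the text

# Per-rack power from the text; ranges are represented by their upper end.
RACK_POWER_KW = {
    "H100 rack (2023)": 40.0,
    "GB200 NVL72 (2024)": 132.0,
    "GB300 (2025)": 142.0,
    "Rubin Ultra Kyber (2027 roadmap)": 600.0,
}


def cooling_modality(rack_kw: float) -> str:
    """Classify a rack against the air ceiling (simplified to a hard cutoff)."""
    return "air" if rack_kw <= AIR_COOLING_CEILING_KW else "direct-to-chip liquid"


def density_growth(rack_power_kw: dict) -> float:
    """Ratio of the last generation's rack power to the first one's."""
    values = list(rack_power_kw.values())
    return values[-1] / values[0]


for generation, kw in RACK_POWER_KW.items():
    print(f"{generation}: {kw:.0f} kW -> {cooling_modality(kw)}")
print(f"growth across the ramp: {density_growth(RACK_POWER_KW):.0f}x")
```

Only the first generation in the list sits under the ceiling; every later one is on the liquid side of the discontinuity.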

Second, the bottleneck moved from the chip to the substation. The industry is now power-bound: the scarce input is not accelerators but energizable megawatts and the time to get them. US large-load grid interconnection now runs roughly 3–7+ years end-to-end, with the worst queues quoted past a decade (ERCOT/PJM filings synthesis, 2025), and HV power transformers carry ~128-week-plus lead times that often dominate the schedule (Wood Mackenzie, 2025). When power and its long-lead apparatus are the constraint, the whole design must be read backward from the point of interconnection, not forward from the GPU — a reversal of the traditional design order that Part 3 and Part 4 make concrete.
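Reading the design backward from the point of interconnection amounts to subtracting lead times from the energization date. The sketch below is a rough illustration under stated assumptions: the target date and the cooling-plant lead time are hypothetical, the interconnection figure is the low end of the 3–7+ year range, and every item is treated as an independent path that must simply finish by energization.

```python
from datetime import date, timedelta

# Lead times in weeks. The first two come from figures quoted in the text
# (low end of the interconnection range; standard transformer lead time).
# The third is a hypothetical placeholder for illustration only.
LEAD_TIME_WEEKS = {
    "grid interconnection (low end, ~3 yr)": 156,
    "HV power transformer": 128,
    "cooling plant (hypothetical)": 60,
}


def latest_commit_date(energization: date, lead_weeks: int) -> date:
    """Latest date an item can be committed and still arrive by energization."""
    return energization - timedelta(weeks=lead_weeks)


def long_pole(lead_times: dict) -> str:
    """Name of the item with the longest lead time."""
    return max(lead_times, key=lead_times.get)


target = date(2030, 1, 1)  # hypothetical energization date
for item, weeks in sorted(LEAD_TIME_WEEKS.items(), key=lambda kv: -kv[1]):
    print(f"{item}: commit by {latest_commit_date(target, weeks).isoformat()}")
print(f"long pole: {long_pole(LEAD_TIME_WEEKS)}")
```

A real integrated master schedule has dependencies between items, so this understates the problem; it only shows why the power-side items set the earliest decisions.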

Third, the workload mix tipped from training to inference. Inference is now estimated at roughly two-thirds of AI compute in 2026 (up from about half in 2025 and a third in 2023), and 80–90% of the compute draw at large operators (Deloitte TMT Predictions 2026; McKinsey). The center of gravity of the build-out is shifting from a few enormous synchronous training machines toward many always-on, latency-bound, geo-distributed serving fleets — a different machine with a different objective, which is exactly why the archetype question in Chapter 1.1 is the master variable downstream of this orientation.

| Figure | What it measures | Why it matters | Year | Source |
| --- | --- | --- | --- | --- |
| ~10–15 kW → 120–600 kW | rack power across the inflection: legacy → GB200 NVL72 (~132 kW) → Rubin Ultra Kyber (~600 kW, 2027 roadmap) | an order-of-magnitude density jump that obsoletes the building you'd build to today's spec | 2026 | SemiAnalysis / NVIDIA roadmap |
| ~41 kW | practical air-cooling ceiling per rack; the discontinuity that forces liquid and rewrites the building | the wall that forces a billion-dollar liquid-cooling commitment you can't retrofit cheaply | 2025 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy |
| ~2/3 | inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators | demand has shifted to serving: build for revenue throughput, not training prestige | 2026 | Deloitte TMT Predictions 2026; McKinsey |
| ~3–7+ yr | US large-load grid interconnection lead time end-to-end; up to ~10 yr in the worst queues; the binding constraint | power, not capital or chips, is what gates your time-to-revenue | 2025 | ERCOT / PJM filings synthesis |
| ~128 wk | HV/substation power transformer lead time (standard); up to ~60 months in constrained markets; often the schedule's long pole | a single long-lead part can slip your go-live by years if you order it late | 2025 | Wood Mackenzie / pv magazine |
| approaching ~$1T | global data center capex in 2026 (~21% CAGR through 2029; GPUs ~1/3 of capex) | the spend pool you're competing inside for scarce power, parts, and labor | 2026 | Dell'Oro Group |
| ~$6.7T | cumulative global data center capex by 2030 (~$5.2T AI-capable); the scale that makes mis-coordination catastrophic | the stakes that turn a coordination mistake into a multi-billion-dollar stranded asset | 2025 | McKinsey, 'The cost of compute' |
| >92% vs ~61–87.5% | end-to-end electrical-chain efficiency, 800 VDC chain vs legacy AC (utility-to-VRM); a system gain only co-design captures | lost watts you pay for forever: only co-design recovers them, bolt-on retrofits can't | 2025 | SemiAnalysis, Datacenter Anatomy Pt 1 |

Two interlocking tracks: facility and IT/cluster

The single machine is delivered by two tracks that must interlock, and most failures live in the seam between them. The facility track owns the grid-to-chip power spine and the chip-to-atmosphere heat spine: utility interconnection, on-site substation, MV/LV distribution, UPS and backup generation, the cooling plant, the CDUs and water loops, and the building shell that houses all of it. The IT/cluster track owns the compute and the software that turns it into a computer: the accelerators and their racks, the scale-up and scale-out fabric, the storage hierarchy that keeps the GPUs fed, and the scheduler and orchestration stack that decides what runs where. Traditionally these were two organizations, two budgets, two vocabularies, and two acceptance regimes that met at the rack PDU and otherwise ignored each other.

In an AI factory that division is the defect. The fabric topology (IT) dictates the rack adjacency and therefore the floor plan and the water manifold layout (facility). The accelerator's coolant inlet spec (IT) dictates the warm-water loop delta-T and the heat-rejection plant (facility). The redundancy posture the workload actually needs (IT, via checkpointing) determines whether the 2N power the facility team is proud of is value or waste. The capacity ramp curve (IT, generation by generation) determines the floor loading and water headroom the facility must provision irreversibly years before the racks arrive. Each of these is a sentence that starts in one track and ends in the other. The owner's organization that does not establish authority across the seam — a single technical owner of the system objective, with both tracks reporting into it — discovers the disagreements at integration, where they are most expensive and least reversible. → the owner's organization and delivery models are in Chapter 2.2; the integrated master schedule that forces the two tracks onto one critical path is in Chapter 2.1; the rack as the integration unit where the two tracks physically meet is in Chapter 7.13.

The two interlocking tracks and where they couple
| Coupling seam | Facility-track owns | IT/cluster-track owns | What goes wrong if the seam is ignored | Engineered in |
| --- | --- | --- | --- | --- |
| Density → cooling modality | Cooling plant, water loop, heat rejection | Rack power, accelerator coolant spec | Power interconnected but unremovable heat; stranded MW | Chapter 5.1 / 5.4 |
| Fabric topology → floor plan | Slab, manifold runs, rack adjacency | Scale-up domain, scale-out blocking | Cable reach blown; copper turns to optics; cost & latency | Chapter 8.5 / 7.13 |
| Capacity ramp → irreversible substrate | Floor loading, water & electrical headroom | Generation-by-generation GPU/MW curve | A hall that cannot absorb the next-gen rack; re-pour | Chapter 1.1 / 6.1 |
| Workload redundancy need → power tier | UPS, generation, 2N vs N+1 topology | Checkpoint cadence, interruption tolerance | Gold-plated nines the workload never valued | Chapter 12.2 / 9.4 |
| Power architecture → server input | MV distribution, 415/480 VAC or 800 VDC | Rack/server power-input design, sidecar | Stranded efficiency; incompatible power shelf | Chapter 4.1 / 7.12 |
The two delivery tracks are not parallel and independent — each row is a coupling where a decision in one track determines an outcome in the other. The rightmost column names the chapter where the coupling is engineered.

Two narrative spines: grid-to-chip and chip-to-atmosphere

Because the machine is continuous, this guide tracks two physical chains end to end, and returns to them in every Part. They are the two flows the building exists to manage: electrons in, and heat out.

The grid-to-chip power spine follows a single electron from the point of interconnection to the transistor: utility substation → on-site substation and MV distribution → transformation and switchgear → UPS / ride-through → power distribution → busway and rack PDU → the power shelf and the on-package voltage regulators that finally deliver sub-volt power to the accelerator die. Every transformation in that chain has an efficiency and a failure mode, and the cumulative loss is the gap between the megawatts you pay the utility for and the megawatts that reach the chips. The headline of the 2026 era is that re-architecting this spine (legacy AC at ~61–87.5% utility-to-VRM versus an 800 VDC chain over 92%; SemiAnalysis, 2025) is an end-to-end gain of roughly five points against the best legacy chains, and up to roughly thirty against the worst. No single subsystem can capture that gain, because it requires the substation, the distribution, and the server power-input to be redesigned together. → the spine is engineered across Chapter 4.1 (topology and voltage), Chapter 4.2 (interconnect and MV distribution), and Chapter 7.12 (on-package delivery); the power-bound era that frames it is Chapter 16.1.
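Because the stages of the power spine are in series, the end-to-end efficiency is the product of the per-stage efficiencies. The sketch below illustrates that multiplication only. Every per-stage value and stage name is a hypothetical placeholder, picked so that the products land inside the end-to-end bands quoted above (~61–87.5% for legacy AC, over 92% for 800 VDC); none of them is a measured figure from the source.

```python
from math import prod


def chain_efficiency(stages: dict) -> float:
    """End-to-end efficiency of series stages is the product of stage efficiencies."""
    return prod(stages.values())


def delivered_mw(utility_mw: float, stages: dict) -> float:
    """Megawatts that reach the chips for a given utility draw."""
    return utility_mw * chain_efficiency(stages)


# Hypothetical per-stage efficiencies, for illustration only.
LEGACY_AC_STAGES = {
    "MV transformer": 0.985,
    "double-conversion UPS": 0.95,
    "LV distribution / PDU": 0.98,
    "rack PSU (AC to DC)": 0.94,
    "board-level VRM": 0.90,
}
DC_800V_STAGES = {
    "MV rectification": 0.98,
    "800 VDC distribution": 0.995,
    "rack DC-DC conversion": 0.98,
    "board-level VRM": 0.97,
}

for name, stages in (("legacy AC", LEGACY_AC_STAGES), ("800 VDC", DC_800V_STAGES)):
    print(f"{name}: {chain_efficiency(stages):.1%} end-to-end, "
          f"{delivered_mw(100.0, stages):.1f} MW delivered per 100 MW drawn")
```

The structure matters more than the numbers: removing or improving one stage changes the product for every watt downstream, which is why the gain shows up only when the whole chain is redesigned together.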

The chip-to-atmosphere heat spine follows a single joule the other way: the accelerator die → thermal interface → cold plate → in-rack manifold and quick-disconnects → CDU (which isolates the technology-cooling loop from facility water) → facility water loop → heat rejection (chillers, dry coolers, towers, adiabatic, or economizers) → and finally the atmosphere or a heat-reuse offtake. Nearly all the energy that enters the power spine leaves through the heat spine; the building is, thermodynamically, a machine for moving that heat. The 2026 density climb forces this spine from air to direct-to-chip liquid, and each interface in it — the coolant inlet temperature, the loop delta-T, the approach temperature at the rejection stage — is a constraint that ties back to the accelerator on one end and the climate of the site on the other. → the spine is engineered across Chapter 5.1 (the density wall), Chapter 5.4 (DLC), Chapter 5.6 (CDUs), Chapter 5.7 (warm-water loops), and Chapter 5.8 (heat rejection).
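The link between rack heat, loop delta-T, and coolant flow follows from the sensible-heat relation Q = ṁ · c_p · ΔT. The sketch below applies it to a single rack, assuming plain water properties and assuming that all of the rack's heat leaves through the liquid loop. Both are simplifications: real loops often use glycol mixes, and some heat still leaves by air. The 10 K delta-T is a hypothetical input, not a figure from the source.

```python
WATER_CP_J_PER_KG_K = 4186.0   # specific heat of water (approximate)
WATER_DENSITY_KG_PER_L = 1.0   # approximate; glycol mixes differ


def coolant_flow_l_per_min(heat_kw: float, delta_t_k: float) -> float:
    """Flow needed to carry heat_kw at a given loop delta-T (Q = m_dot * cp * dT)."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_kw * 1000.0 / (WATER_CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


rack_kw = 132.0  # GB200 NVL72 upper-end figure from the text
for delta_t in (5.0, 10.0, 15.0):  # hypothetical loop delta-T values
    flow = coolant_flow_l_per_min(rack_kw, delta_t)
    print(f"delta-T {delta_t:.0f} K -> {flow:.0f} L/min per rack")
```

Halving the delta-T doubles the required flow, which is one way the accelerator's coolant spec reaches back into pump, pipe, and CDU sizing.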

Deep dive: the cooling cliff is where co-design stops being optional

The clearest demonstration that the machine is co-designed, and the place a siloed organization is punished most violently, is the air-cooling discontinuity. Below ~41 kW/rack you are in air's regime, and the facility track and the IT track really can operate semi-independently: the IT team picks servers, the facility team supplies cold air, and the seam at the rack is forgiving. A GB200 NVL72 draws ~120–132 kW. There is no airflow scheme, containment design, or supply-air setpoint that closes a gap of ~80–90 kW per rack. You are over the cliff, and direct-to-chip liquid is mandatory, which means the IT decision to deploy that rack has just rewritten the facility: the floor must bear a ~1.36-tonne wet rack, the hall must be plumbed for facility water it may not have, the electrical headroom must roughly triple, and the heat-rejection plant must be sized to a tight delta-T set by the accelerator's coolant spec (inlet 20–25 °C, with deviations from that band throttling GPUs by up to ~50%).

The cliff is a one-way door because a hall built for air has the wrong floor loading, no plenum for liquid distribution, insufficient power, and often no facility water provisioned at all. Crossing it in a retrofit costs on the order of $5–10M/MW and still strands capacity. This is why the decision to plumb a hall for liquid is an archetype decision, not a mechanical one, and why it cannot be delegated to either track alone. A choice that looks local (cooling) propagates into power, structure, siting, and TCO simultaneously. → engineered in Chapter 5.1; retrofit paths in Chapter 5.10.
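The size of the cliff and the order of magnitude of the retrofit bill can be worked out from the figures in this section. This is a back-of-envelope sketch: the 20 MW hall size is a hypothetical input, the function names are illustrative, and the cost range is simply the $5–10M/MW figure above scaled linearly.

```python
AIR_COOLING_CEILING_KW = 41.0            # approximate air ceiling from the text
RETROFIT_COST_USD_PER_MW = (5e6, 10e6)   # retrofit cost range from the text


def air_cooling_gap_kw(rack_kw: float) -> float:
    """Heat per rack that air cannot remove (zero if the rack is under the ceiling)."""
    return max(0.0, rack_kw - AIR_COOLING_CEILING_KW)


def retrofit_cost_range_usd(hall_mw: float) -> tuple:
    """Low and high retrofit cost for a hall, scaling the per-MW range linearly."""
    low, high = RETROFIT_COST_USD_PER_MW
    return hall_mw * low, hall_mw * high


for rack_kw in (120.0, 132.0):
    print(f"{rack_kw:.0f} kW rack: {air_cooling_gap_kw(rack_kw):.0f} kW beyond air")

low, high = retrofit_cost_range_usd(20.0)  # hypothetical 20 MW air-cooled hall
print(f"20 MW retrofit: ${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```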

The lifecycle spine of this guide

The guide is organized along the project lifecycle, because that is the order in which the co-design decisions are actually made and the order in which they become irreversible. Reading it front to back walks the machine from intent to operation; reading it by discipline or by workload archetype is equally valid and supported by the cross-references. The spine is: Strategy (Part 1, what machine and why) → Delivery (Part 2, how it gets built and financed) → Siting (Part 3, where, read backward from power) → Power and Electrical (Parts 3–4, the grid-to-chip spine) → Cooling (Part 5, the chip-to-atmosphere spine) → Building (Part 6, the shell that houses both) → Compute (Part 7) → Networking (Part 8) → Storage (Part 9, the IT track that turns chips into a computer) → Software (Part 10, the scheduler that runs the machine) → Security & Reliability (Parts 11–12) → Commissioning (Part 13, proving the machine before it earns) → Operations (Part 14, running it for goodput) → Sustainability (Part 15) → Future (Part 16).

The altitude discipline of the guide is that each concept is derived once, in its canonical chapter, and referenced everywhere else — a roadmap forward-pointer is a bullet, never a chapter. This orientation chapter is deliberately the thinnest in numbers: its job is to install the lens (one machine, one objective, two tracks, two spines, power-bound), not to engineer anything. How to read the decision-and-consequence framing that structures every later chapter is the subject of Chapter 0.2; the vocabulary and metric stack you will need throughout is Chapter 0.3; and the workload-archetype decision that is the first real fork downstream of this orientation is Chapter 1.1.

How the three threads run through the whole guide

Three threads recur in every Part, flagged at the top of each chapter so you can read the guide along any of them. POWER-BOUND is the recognition that megawatts and time-to-power, not chips, are the binding constraint: it reorders siting, dominates the schedule, and makes the grid-to-chip spine the master narrative of Parts 3 and 4. GOODPUT is the system objective itself, useful work per dollar per watt, which reframes reliability (Part 12), operations (Part 14), and efficiency (Part 15) away from subsystem metrics and toward the one number that pays the capital stack. DENSITY-RAMP is the trajectory, from 10 kW to 600 kW per rack and climbing, that drives the chip-to-atmosphere spine across the cooling cliff, dictates the irreversible building substrate, and forces every facility to be scoped for a ramp it has not yet committed to. The three threads are not independent: density-ramp drives the heat spine, power-bound constrains the electron spine, and goodput is the objective that adjudicates between them. Hold all three, and the sixteen Parts that follow read as one argument about one machine.

This orientation installs the lens; the rest of the guide engineers the machine through it. How to read the decision-and-consequence framing is Chapter 0.2; the vocabulary and metric stack are Chapter 0.3; the reliability and redundancy primer is Chapter 0.5. The first real fork — what machine you are building — is the archetype decision in Chapter 1.1. The grid-to-chip power spine is engineered in Chapter 4.1 and Chapter 4.2; the chip-to-atmosphere heat spine in Chapter 5.1 and Chapter 5.4; the building that houses both in Chapter 6.1; the fabric that makes the chips one computer in Chapter 8.5; the storage that keeps them fed in Chapter 9.1. The goodput thread is canonicalized in Chapter 12.2 and Chapter 14.1; the power-bound thread in Chapter 16.1.