The Definitive Guide to AI Data Centers

Chapter 1.1

The Archetype Decision Framework: Workload Is the Master Variable

The workload you intend to run is the single upstream decision that deterministically sets density, cooling, fabric, redundancy, and siting — pick it wrong and you do not have an inefficient data center, you have the wrong one.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

  1. Which of the five workload archetypes (pre-training, post-training/RL, online inference, batch inference, edge) your facility is actually being built for — and the requirements cascade that follows from that single choice.
  2. Which of the four procurement archetypes (greenfield self-build, retrofit, colocation, neocloud rental) matches your time-to-power, capital intensity, and workload-duration constraints.
  3. Which decisions are reversible (and can be deferred or re-decided cheaply) versus irreversible (and must be over-engineered or hedged at scoping time).
  4. The density target to design the slab, power chain, and cooling plant against — and therefore whether you are committing to air, rear-door, or direct-to-chip liquid before steel is cut.
  5. Which scoping artifacts (workload profile sheet, capacity ramp curve, design-basis document) must exist and be signed before any long-lead equipment is ordered.

Figure: Workload is the master variable. Pick the workload archetype and you have already half-decided everything below it. One upstream choice cascades through seven dependent decisions, each feeding the next.

The master input is the workload archetype (training vs inference vs mixed), which sets:

  1. Power density (kW per rack). Training packs 100–130 kW/rack; inference can run 20–40 kW. Sets the ceiling for everything below.
  2. Cooling modality (air vs liquid). Training = direct-to-chip liquid; inference can stay air. Plumbing is poured into the slab on day one.
  3. Network fabric (GB/s east–west bisection). Training: non-blocking InfiniBand; inference fine on Ethernet. Decides the switch and cable budget.
  4. Storage tier (PB and throughput vs latency). Training streams huge checkpoints; inference wants low latency. Sizes the storage fabric.
  5. Redundancy posture (N vs 2N). Training checkpoints and restarts (N); inference stays up (2N). Doubles or halves the electrical and UPS spend.
  6. Siting ($/MWh and ms to user). Training chases cheap remote power; inference hugs users. A one-way door: you build it once.
  7. TCO ($/token). The bill is the sum of all six choices above. Total cost of ownership is an output, never an input.

Get node 1 wrong and the error propagates through all seven.
The workload archetype is the upstream choice that sets every downstream cost — and the earliest ones (siting, plumbing) are one-way doors.

Every other decision in this guide — the voltage you step down to, the temperature of the water you push to the rack, the oversubscription ratio on the back-end fabric, the redundancy tier you commission to, the county you site in — is downstream of one question that is almost never asked first: what is this machine for? Not "AI" in the abstract. The specific workload, its coupling, its tolerance for interruption, and its proximity requirement to the user. That answer is the master variable. Get it right and the rest of the lifecycle is a series of well-posed engineering problems. Get it wrong and you have spent two to four years and a capital stack on a facility that is mis-matched to the revenue it was supposed to earn — and most of those mistakes cannot be undone without re-pouring concrete.

This chapter is the framework that forces the question to the front. We define the five workload archetypes and the four procurement archetypes, then trace the requirements cascade: the deterministic chain by which an archetype propagates into density, cooling, fabric, storage, redundancy, siting, and TCO. We close on which decisions are reversible (defer them, keep optionality cheap) versus irreversible (over-build or hedge them now), the artifacts that capture a defensible scope, and the anti-patterns that recur because someone skipped this step.

The two orthogonal questions

Scoping an AI data center is two questions, and they are orthogonal — you must answer both, and the answer to one does not determine the answer to the other. WHAT will run here: pre-training, post-training/RL, online inference, batch inference, or edge inference (in practice, a weighted mix, but with a dominant archetype that sets the design basis). HOW will the capacity be procured and built: greenfield self-build, retrofit of an existing hall, wholesale or retail colocation, or rental from a GPU neocloud. A frontier pre-training run can be served from a self-build, a colo, or a neocloud; an online-inference business can equally be any of the four. The cross-product is a 5x4 matrix of real, deployed facilities — and the cost of mismatching a cell to its workload is the recurring theme of Part 1.

The reason this matters more in 2026 than it did in 2020 is that the binding constraint moved. The industry was chip-bound — the question was how many accelerators you could buy. It is now power-bound: the question is how many megawatts you can energize, and when. The US generator interconnection queue held roughly 2,290 GW of active generation and storage capacity at the end of 2024 — about twice the entire US installed fleet — with median time-to-energization approaching five years and large-load waits of four to seven years in the densest hubs (LBNL Queued Up 2025 Edition; utility filings). When power is the scarce input, a mis-scoped archetype does not just waste capital — it burns an interconnection slot you cannot get back, against a depreciation clock that is already running.

The five workload archetypes

The five archetypes are distinguished by three properties that drive everything else: coupling (how tightly the accelerators must communicate within a single unit of work), interruption tolerance (whether the job survives a node failure cheaply or restarts from a checkpoint), and latency sensitivity (whether a user is waiting on the output in real time). Hold those three in mind as you read the cascade table below — they are the levers that translate "what runs here" into "how it must be built."

Pre-training is one tightly-coupled supercomputer. Thousands of GPUs run synchronous data-, tensor-, pipeline-, and expert-parallelism, dominated by all-reduce/all-gather collectives across the back-end fabric on every step. The whole job moves at the speed of its slowest straggler; a single failed GPU in a synchronous run forces a restart from the last checkpoint. This is the archetype that demands maximum density, direct-to-chip liquid cooling, a 1:1 non-blocking fabric, and the largest scale-up domains money can buy. → Chapter 1.2.

Post-training / SFT / RLHF / RL is the hybrid middle, and it is the most misunderstood archetype in the building. Supervised fine-tuning is a small, bursty training job. But large-scale RL for reasoning is inference-heavy training: the dominant cost is generating rollouts — trajectories of 10K–100K+ tokens sampled from the current policy — not the gradient update that follows. That makes an RL cluster look like a fleet of inference engines feeding a comparatively small trainer, with asynchronous, staleness-tolerant coupling between the two. Scoping it as if it were pre-training over-provisions the fabric; scoping it as pure inference starves the policy update. → Chapter 1.4.

Online (interactive) inference is the revenue workload for most operators. It is bursty, latency-bound, and always-on: traffic can swing from 30% to 90% of capacity in minutes, and an SLO measured in time-to-first-token and time-per-output-token governs the user experience. It is loosely coupled (most requests fit inside a single node or a small scale-up domain), so the back-end fabric can be oversubscribed; but it demands high facility availability and proximity to users. Modern reasoning models, which emit long decode sequences, have inflated the decode share and the KV-cache pressure, reshaping fleet sizing. → Chapter 1.3.

Batch (offline) inference — embeddings generation, document processing, synthetic-data creation, evaluation sweeps — is throughput-bound rather than latency-bound. There is no user waiting, so it tolerates interruption, queuing, and aggressive oversubscription, and it is the natural consumer of spot capacity, off-peak power, and curtailable interconnections. It is the cheapest archetype to host and the most flexible to schedule.

Edge inference pushes serving to the user: on-prem appliances, telco/MEC nodes, Tier-2 metro colos, CDN-adjacent sites. It is defined by a hard latency budget (the 30/50/100 ms perceptibility thresholds) and severe power/thermal/space constraints, operated lights-out with zero-touch provisioning. Its siting driver is the inverse of pre-training: not cheap power, but physical proximity. → Chapter 1.5.

Workload archetype → requirements cascade
| Archetype | Coupling | Rack density | Cooling | Scale-out fabric | Redundancy | Siting driver |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-training | Tight / synchronous (all-reduce every step) | 120–140 kW (GB200 NVL72) → 600 kW (Rubin Ultra Kyber, H2 2027) | Direct-to-chip liquid mandatory; warm-water loops | 1:1 non-blocking, 8-rail fat-tree; InfiniBand or Spectrum-X | N or N+1; checkpoint-and-resume tolerant | Cheap/stranded power + cold climate; power-first |
| Post-training / RL | Mixed: async rollouts + a smaller synchronous trainer | Heterogeneous: inference-class rollout pool + dense trainer | Liquid for the trainer; liquid or high-density air for rollouts | Disaggregated: tolerant rollout fabric, tight trainer fabric | N+1; staleness-tolerant, restartable | Follows the dominant sub-workload; often co-sited with training |
| Online inference | Loose: request fits a node or a small scale-up domain | 30–60 kW typical (HGX B200 class, air or liquid) | Air at the limit, rear-door, or DLC by density | 2:1–3:1 oversubscribed; Ethernet/RoCE common | 2N / Tier-IV-class + N+1 cooling on standby | Sub-50 ms proximity to users; latency-first, geo-distributed |
| Batch inference | Loose / embarrassingly parallel | 30–60 kW; flexible | Whatever the host hall already has; air often fine | Heavily oversubscribed; cost-optimized | N; interruption-tolerant, queue-and-retry | Cheapest power; curtailable/non-firm load; off-peak |
| Edge inference | None (single node / appliance) | Constrained: a few kW to ~30 kW per micro-site | Air or sealed/modular; ambient-limited | Minimal: local serving, WAN backhaul | Often N; resilience via fleet-of-sites geo-redundancy | Latency budget (30/50/100 ms); proximity over cost |
How a single archetype choice propagates into the rest of the facility. Density and fabric figures are 2026-current NVIDIA-class reference points; see the key numbers below for sources and vintages.

The table is a cascade. The leftmost two columns, archetype and coupling, are the inputs you control; everything to the right is a consequence. Choose pre-training and you have, in effect, also chosen liquid cooling, a non-blocking fabric, reinforced floors for ~3,000–5,000 lb wet racks, and a power-first siting search. Choose online inference and you have chosen a fundamentally different building: lower density, 2N power, and a latency-first site that may cost you 2–4x on energy. The columns do not move independently. That is precisely why the archetype is the master variable — it collapses a dozen subsystem decisions into one.
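
One way to make the cascade concrete is a lookup keyed on the archetype. The sketch below is illustrative only: the `Requirements` dataclass, the `CASCADE` dictionary, and the `derive` helper are hypothetical names, and the values are abbreviated from the table above rather than taken from any standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Downstream consequences of one archetype choice (values from the table)."""
    coupling: str
    rack_density: str
    cooling: str
    fabric: str
    redundancy: str
    siting: str


# Hypothetical lookup; keys and field names are illustrative assumptions.
CASCADE = {
    "pre-training": Requirements(
        "tight / synchronous", "120-140 kW", "direct-to-chip liquid",
        "1:1 non-blocking", "N or N+1", "power-first"),
    "post-training/rl": Requirements(
        "mixed: async rollouts + synchronous trainer", "heterogeneous",
        "liquid for trainer; liquid or high-density air for rollouts",
        "disaggregated", "N+1", "follows dominant sub-workload"),
    "online-inference": Requirements(
        "loose", "30-60 kW", "air, rear-door, or DLC by density",
        "2:1-3:1 oversubscribed", "2N + N+1 cooling", "latency-first"),
    "batch-inference": Requirements(
        "loose / embarrassingly parallel", "30-60 kW", "host hall default",
        "heavily oversubscribed", "N", "cheapest power"),
    "edge-inference": Requirements(
        "none (single node)", "a few kW to ~30 kW", "air or sealed/modular",
        "minimal", "often N", "proximity over cost"),
}


def derive(archetype: str) -> Requirements:
    """Return the requirement set implied by an archetype; KeyError if unknown."""
    return CASCADE[archetype]


print(derive("pre-training").cooling)
```

The point of the structure is that the archetype is the only key: nothing on the right-hand side can be set without it.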

The four procurement archetypes

The second orthogonal question — how you acquire the capacity — trades four levers against each other: time-to-power, capital intensity, control, and workload duration. The fastest options surrender the most control and carry the highest unit cost; the cheapest-per-GPU-hour options demand the most capital and the longest lead time. The right answer is a function of how long you expect the workload to run and how certain you are about it.

  • Greenfield self-build gives maximal control over density, power architecture, and cooling, and the lowest long-run unit cost at scale, at the price of 24–36 months to a live cluster and the deepest capital commitment. It is the right call only for a durable, well-forecast workload at scale.
  • Retrofit of an existing air-cooled hall trades capital and schedule for a hard physics ceiling: floor loading, plenum, electrical headroom, and available water cap how far you can push density.
  • Colocation (wholesale or retail) buys time-to-power (a live 50k+ GPU cluster in a wholesale hall in 6–12 months) by renting someone else's powered shell.
  • Neocloud / GPU rental compresses time-to-first-job to days or weeks and converts capex into opex, at the highest per-GPU-hour rate and the least control over the underlying fabric and reliability posture.

The decision is rarely all-or-nothing: hybrids (burst-to-neocloud, colo-anchor-plus-cloud-overflow, build-core-rent-edge) are the norm. → Chapter 1.6; quantitative NPV in Chapter 1.8.

Procurement archetype → build-vs-buy-vs-rent tradeoffs
| Procurement | Time-to-power | Capital intensity | Control | Best-fit workload duration |
| --- | --- | --- | --- | --- |
| Greenfield self-build | 24–36 months | Highest (capex) | Maximal: full power/cooling/fabric design | Durable, large, well-forecast (multi-year) |
| Retrofit (brownfield) | 6–18 months | Moderate: $2–3M/MW cooling, $5–10M/MW full AI retrofit | Bounded by existing slab/power/water | Bridge capacity; modest-density inference |
| Colocation (wholesale/retail) | 6–12 months to a live cluster | Capex-light (lease + IT) | Shared shell; you own the IT | Medium-term, scaling, uncertain demand |
| Neocloud / GPU rental | Days to weeks | Opex only | Least: vendor owns fabric & reliability | Spiky, short, experimental, or burst overflow |
Lead times and retrofit costs are 2026 practitioner ranges (SemiAnalysis, JLL/CBRE market data, Introl). The workload-duration column is a heuristic, not a rule.
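
As a rough illustration of the time-to-power lever, the sketch below filters procurement modes by a deadline. The `PROCUREMENT` list and `feasible_modes` function are hypothetical; the month ranges come from the table above, with "days to weeks" approximated as at most one month (an assumption).

```python
# (mode, fastest_months, slowest_months) time-to-power ranges from the table above.
PROCUREMENT = [
    ("greenfield self-build", 24, 36),
    ("retrofit", 6, 18),
    ("colocation", 6, 12),
    ("neocloud rental", 0, 1),  # "days to weeks", approximated as <= 1 month
]


def feasible_modes(deadline_months: float) -> list[str]:
    """Modes whose slowest-case time-to-power still meets the deadline."""
    return [name for name, _fast, slow in PROCUREMENT if slow <= deadline_months]


# With a 12-month deadline only colocation and neocloud rental qualify.
print(feasible_modes(12))
```

Filtering on the slow end of each range is deliberately conservative; a real selection would also weigh capital intensity, control, and expected workload duration.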

Key numbers

| Figure | What it measures | Why it matters | Vintage | Source |
| --- | --- | --- | --- | --- |
| 120–140 kW | per GB200 NVL72 rack (≈132 kW typical: ~115 kW liquid + ~17 kW air) | the load one rack now pulls; it resets the power and cooling budget for the whole hall | 2025 | NVIDIA GB200 NVL72 / HPE & Supermicro datasheets |
| ~600 kW | per Rubin Ultra Kyber NVL576 rack on 800 VDC | the 2027 density; under-provision now and the next refresh strands the building | H2 2027 (announced) | NVIDIA GTC (Jensen Huang); DCD, Tom's Hardware |
| ~41 kW | practical air-cooling ceiling per rack; RDHx ~50–100 kW; DLC 200+ kW | the wall that forces a billion-dollar liquid-cooling commitment you can't retrofit cheaply | 2025 | ASHRAE TC 9.9; SemiAnalysis Datacenter Anatomy |
| ~2/3 | inference share of AI compute in 2026 (½ in 2025, ⅓ in 2023); 80–90% of draw at large operators | demand has shifted to serving; size for revenue throughput, not training peaks | 2026 | Deloitte TMT Predictions 2026; McKinsey |
| ~2,290 GW | active generation + storage in US interconnection queues (end-2024; ~twice US installed capacity); large-load waits 4–7 yr in top hubs | the line you wait in for power; it gates when your capex starts earning anything at all | end-2024 | LBNL, Queued Up 2025 Edition |
| $283–318k | all-in cost per 8-GPU H100 server (excl. storage); ~$31k/GPU/yr enterprise all-in | the unit ticket price that makes a cluster a nine-figure decision, not a line item | 2025 | SemiAnalysis AI Neocloud Playbook |
| ~$0.74/GPU-hr | TCO at 2048-GPU scale, 90% utilization; ~$1.03 small clusters; cloud H100 ~$1.49 (contested; single-source) | the cost floor that decides whether owning beats renting at your utilization | 2025 | SemiAnalysis H100 cost/rental analyses |
| 2–3 yr | accelerated economic life vs 5–6 yr book life; used GPUs retain ~20–40% residual after 3 yr | if chips obsolete before they're depreciated, profit and collateral are overstated | 2025 | Goldman Sachs; CNBC/secondary-market analyses |
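
The own-versus-rent figures above can be turned into a break-even utilization. This is a minimal sketch under a strong assumption that is not in the source: ownership cost is treated as entirely fixed, so cost per used GPU-hour scales inversely with utilization. The function names are hypothetical, and the ~$1.49 rental rate is the figure the source itself flags as contested.

```python
def owned_cost_per_used_hour(ref_cost: float, ref_util: float, util: float) -> float:
    """Rescale a reference $/GPU-hr to another utilization (all cost assumed fixed)."""
    if not 0 < util <= 1:
        raise ValueError("util must be in (0, 1]")
    return ref_cost * ref_util / util


def breakeven_utilization(ref_cost: float, ref_util: float, rental_rate: float) -> float:
    """Utilization at which owning costs the same per used hour as renting."""
    return ref_cost * ref_util / rental_rate


# $0.74/GPU-hr at 90% utilization versus a ~$1.49/GPU-hr rental rate.
u = breakeven_utilization(0.74, 0.90, 1.49)
print(f"{u:.1%}")  # roughly 45%
```

Under these assumptions, owning wins only if the fleet stays above roughly 45% utilization; any variable cost component (energy, for example) would shift that threshold.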

The requirements cascade, derived

The cascade is a causal chain you can walk forward from a single input. It runs density → cooling → fabric → storage → redundancy → siting → TCO, and each link constrains the next. Here is the derivation made explicit.

Density sets cooling. This is the hardest constraint in the building because it is physics, not policy. Air cooling saturates around 41 kW/rack; a GB200 NVL72 draws ~132 kW. No airflow management, containment scheme, or warmer supply air closes a ~90 kW gap: you are over the cooling cliff and direct-to-chip liquid is mandatory. The GB200 DLC envelope is unforgiving: coolant inlet below ~25 °C, ~20 L/min flow, and under ~10 °C rise across the cold plates; deviations from that envelope can throttle the GPUs by up to 50%. Choosing the density target therefore commits the entire cooling plant, the facility water loop, and the heat-rejection strategy. → Chapter 5.1 (the density wall) and Chapter 5.4 (DLC).
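
The density thresholds quoted in this chapter can be expressed as a simple selector. This is a sketch, not a design rule: the function names are hypothetical, and the 41 kW and 100 kW break points are the approximate figures from the text, which in practice vary by hall and equipment.

```python
AIR_CEILING_KW = 41.0    # practical air-cooling ceiling per rack (from the text)
RDHX_CEILING_KW = 100.0  # upper end of the rear-door range (from the text)


def cooling_modality(rack_kw: float) -> str:
    """Map a rack density to the cooling regime the chapter describes."""
    if rack_kw <= AIR_CEILING_KW:
        return "air"
    if rack_kw <= RDHX_CEILING_KW:
        return "rear-door / air-assisted liquid"
    return "direct-to-chip liquid"


def air_gap_kw(rack_kw: float) -> float:
    """Heat per rack that air alone cannot remove."""
    return max(0.0, rack_kw - AIR_CEILING_KW)


print(cooling_modality(132), air_gap_kw(132))  # direct-to-chip liquid 91.0
```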

Coupling sets the fabric. A synchronous pre-training job spends a large fraction of every step in collectives, so the back-end fabric must be 1:1 non-blocking — typically an 8-rail-optimized fat-tree with 8x 400 Gb/s NICs per server (3,200 Gb/s/node). Oversubscribe it and you starve the all-reduce and watch MFU collapse. Loosely-coupled inference fits inside a node or a small scale-up domain, so the same fabric at 2:1 or 3:1 oversubscription is fine and cuts back-end cost ~31% — money that would be wasted on a non-blocking inference fabric. The scale-up domain size (8 GPUs in HGX, 72 in NVL72, heading to 576) is itself a workload decision: bigger domains lift tensor-/expert-parallel ceilings for training and widen expert parallelism for MoE inference. → Chapter 8.5 (topology & oversubscription).
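
The bandwidth arithmetic behind the oversubscription choice is easy to check. The sketch below uses hypothetical helper names and models only the ratio of injection bandwidth to uplink bandwidth; it does not attempt to reproduce the ~31% cost figure, which depends on switch and optics pricing.

```python
def node_injection_gbps(nics: int = 8, nic_gbps: int = 400) -> int:
    """Total back-end injection bandwidth of one server."""
    return nics * nic_gbps


def uplink_gbps_per_node(oversubscription: float, nics: int = 8,
                         nic_gbps: int = 400) -> float:
    """Uplink bandwidth available per node at an N:1 oversubscription ratio."""
    if oversubscription < 1:
        raise ValueError("oversubscription ratio must be >= 1")
    return node_injection_gbps(nics, nic_gbps) / oversubscription


for ratio in (1, 2, 3):
    print(f"{ratio}:1 -> {uplink_gbps_per_node(ratio):.0f} Gb/s per node")
```

At 1:1 every node keeps its full 3,200 Gb/s across the fabric; at 3:1 it keeps about a third, which is acceptable only when most traffic stays inside the node or scale-up domain.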

Interruption tolerance sets redundancy. A synchronous training job already restarts from a checkpoint when any node fails, so spending on 2N facility power to prevent a restart is largely wasted — N or N+1 plus disciplined checkpointing is the rational posture. An always-on inference business is the opposite: an outage is lost revenue and a breached SLA, so 2N / Tier-IV-class power with N+1 cooling on standby is justified. This is the cleanest example of an anti-pattern: over-provisioned redundancy for checkpointable jobs buys nines the workload does not value. → Chapter 12.2 reframes this as goodput vs availability.
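
To see why goodput rather than facility availability is the quantity a training cluster values, a first-order model helps. The sketch below is an assumption-laden approximation, not the checkpoint math of Chapter 9.4: it assumes failures arrive at a constant rate, that each failure loses on average half a checkpoint interval plus a fixed restart time, and that checkpoint writes stall the job. All input values are hypothetical.

```python
def goodput_fraction(mtbf_h: float, ckpt_interval_h: float,
                     ckpt_write_h: float, restart_h: float) -> float:
    """First-order share of wall-clock time spent on useful training."""
    if mtbf_h <= 0 or ckpt_interval_h <= 0:
        raise ValueError("mtbf_h and ckpt_interval_h must be positive")
    # Fraction of each interval spent writing the checkpoint.
    checkpoint_overhead = ckpt_write_h / ckpt_interval_h
    # Average loss per failure: half an interval of rework plus the restart.
    failure_loss = (ckpt_interval_h / 2 + restart_h) / mtbf_h
    return max(0.0, 1 - checkpoint_overhead) * max(0.0, 1 - failure_loss)


# Hypothetical cluster: one job-stopping failure per day, hourly checkpoints.
slow = goodput_fraction(mtbf_h=24, ckpt_interval_h=1, ckpt_write_h=0.05, restart_h=0.25)
fast = goodput_fraction(mtbf_h=24, ckpt_interval_h=1, ckpt_write_h=0.01, restart_h=0.25)
print(f"{slow:.3f} -> {fast:.3f}")
```

In this toy setup, cutting checkpoint write time lifts goodput by several points, which is the kind of comparison to run before paying for a higher facility redundancy tier.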

Latency sensitivity sets siting. Pre-training is indifferent to user proximity, so it chases the cheapest firm megawatts and the coldest free-cooling climate, accepting that the site may be hours from any metro. Online and edge inference invert this: they chase sub-50 ms reach to users and accept power that costs 2–4x more. Siting is the least reversible decision of all — you cannot move a slab — which is why it must be derived from the workload, never the other way around. → Chapter 3.1 (the reordered siting hierarchy) and Chapter 3.2 (speed-to-power).
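
A latency budget translates directly into a siting radius. The sketch below assumes light in optical fiber covers roughly 200 km per millisecond one way; the function name and the split between network and non-network time are hypothetical, and real routes are longer than straight-line distance and add switching and queuing delay.

```python
FIBER_KM_PER_MS = 200.0  # approximate one-way propagation in optical fiber


def max_fiber_km(rtt_budget_ms: float, non_network_ms: float) -> float:
    """Upper bound on fiber path length that fits a round-trip latency budget."""
    propagation_rtt_ms = rtt_budget_ms - non_network_ms
    if propagation_rtt_ms <= 0:
        return 0.0
    # Half the remaining round-trip time is available in each direction.
    return propagation_rtt_ms / 2 * FIBER_KM_PER_MS


# A 50 ms budget with 30 ms reserved for serving leaves about 2,000 km of fiber.
print(max_fiber_km(50, 30))
```

The bound is optimistic by construction, so it is best read as a ceiling on how far from users a latency-first site can be.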

Deep dive: why RL is inference-heavy training (and why mis-scoping it is expensive)

The instinct is to file reinforcement learning under "training" and spec it like pre-training: maximum density, non-blocking fabric, the works. That instinct is wrong, and the error is costly. Modern RL for reasoning alternates two phases with very different infrastructure profiles. In the rollout (generation) phase, the current policy samples long trajectories — commonly 10K–100K+ tokens each — to explore behaviors and collect reward signal. This is pure autoregressive inference: memory-bandwidth-bound decode, embarrassingly parallel across prompts, tolerant of a loosely-coupled and oversubscribed fabric. In the policy-update phase, the collected experience drives a comparatively small synchronous gradient step (PPO/GRPO and relatives).

The consequence: rollout generation, not the gradient update, is the dominant cost and the real bottleneck. A correctly-scoped RL cluster looks disaggregated — a large inference-class rollout pool feeding a smaller, tightly-coupled trainer, coupled asynchronously with bounded staleness so the rollout fleet never stalls waiting on the trainer. Scope it as pre-training and you pay for a non-blocking fabric and uniform max-density racks across a pool that is mostly doing inference — stranded capex. Scope it as pure inference and you have no trainer fabric for the policy update — a starved learner. RL is the archetype that most rewards reading coupling and interruption tolerance separately for each sub-workload rather than applying a single label. → Chapter 1.4.
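
The asynchronous, bounded-staleness coupling between the rollout pool and the trainer can be sketched as a small buffer. This is an illustrative toy, not the design of any particular RL framework: `Rollout`, `BoundedStalenessBuffer`, and `max_staleness` are hypothetical names, and staleness is measured as the number of policy versions between generation and consumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this trajectory
    tokens: int


class BoundedStalenessBuffer:
    """Queue between the rollout pool and the trainer."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._queue: deque = deque()

    def put(self, rollout: Rollout) -> None:
        """Called by rollout workers; never blocks on the trainer."""
        self._queue.append(rollout)

    def take_batch(self, trainer_version: int, batch_size: int) -> list:
        """Pop up to batch_size rollouts, discarding any that are too stale."""
        batch = []
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            if trainer_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
        return batch


buf = BoundedStalenessBuffer(max_staleness=2)
for version in (1, 3, 4, 5):
    buf.put(Rollout(policy_version=version, tokens=10_000))
print([r.policy_version for r in buf.take_batch(trainer_version=5, batch_size=8)])
```

Because `put` never waits, the rollout fleet keeps generating while the trainer steps; the staleness bound is what keeps the learner from training on trajectories from a policy that is too old.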

Reversible vs irreversible decisions

Not all forks cost the same to re-decide. The discipline that separates a good scope from a fragile one is sorting decisions by the cost of changing your mind, and spending your optionality budget accordingly: over-build or hedge the irreversible decisions now; defer the reversible ones and keep them cheap to change.

Irreversible (decide once, at scoping): the site itself (you cannot move a slab); the grid interconnection capacity and voltage class (the queue slot is the scarcest asset in the project); the structural floor-loading basis (retrofitting a slab for ~3,000–5,000 lb wet racks mid-life is brutal); the base power architecture (415/480 VAC vs an 800 VDC path); and the macro cooling decision (a hall plumbed for liquid vs one that is not). These are the decisions where you pay to keep options open — e.g. provisioning floor loading and water for a density step-up you have not committed to yet.

Reversible (defer, re-decide cheaply): the specific accelerator generation within a power/cooling envelope; the scheduler and orchestration platform; oversubscription ratio on a fabric you sized non-blocking; the workload mix ratio within an archetype; and — critically — the procurement mode, which is why hybrid and rental exist. The strategic move is to convert irreversible decisions into reversible ones wherever the option premium is cheap: a powered shell instead of a full build-to-suit preserves IT-fit-out optionality; a colo lease preserves the option to exit; reserving floor loading and water headroom preserves a density ramp. → procurement framing in Chapter 1.6; refresh execution in Chapter 14.9.

Scoping artifacts: what a defensible scope produces

A scope that survives board scrutiny and lender diligence rests on three artifacts that pin down the archetype decision and its consequences, each signed before long-lead equipment is ordered.

  • Workload profile sheet. The dominant archetype and the mix ratio; coupling, interruption tolerance, and latency budget; the implied scale-up domain size, GPU:CPU and GPU:storage ratios, and the back-end fabric blocking requirement. This is the single page from which the cascade is derived.
  • Capacity ramp curve. MW and GPU count over time, generation by generation, with the density step-ups called out — because the ramp, not the steady state, is what the irreversible substrate (floor, power, water, cooling plant) must accommodate.
  • Design-basis document. The frozen assumptions that everything downstream inherits: density tier, cooling modality, redundancy topology, voltage architecture, siting class, and the reversible-vs-irreversible register that records which assumptions are hedged and which are committed.
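
The capacity ramp curve can be checked mechanically against the design basis. The sketch below is a minimal illustration: the function names are hypothetical, and the ramp uses the chapter's reference densities (about 132 kW per rack now, about 600 kW announced for H2 2027) purely as example inputs.

```python
from typing import Optional


def design_density_kw(ramp: list) -> float:
    """Peak kW/rack across the ramp: what the irreversible substrate must carry."""
    return max(kw for _year, kw in ramp)


def first_year_exceeding(ramp: list, substrate_limit_kw: float) -> Optional[int]:
    """First year the ramp outgrows the substrate, or None if it never does."""
    for year, kw in sorted(ramp):
        if kw > substrate_limit_kw:
            return year
    return None


# Illustrative ramp of (year, kW per rack) pairs.
ramp = [(2026, 132.0), (2027, 600.0)]
print(design_density_kw(ramp))          # 600.0
print(first_year_exceeding(ramp, 140))  # 2027
```

A hall whose floor, power, and water were sized to the 2026 steady state would be outgrown at the first density step-up, which is why the ramp and not the opening-day load belongs in the design-basis document.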

Per-archetype reference design-basis sheets and scalable-unit budgets are built out in Chapter 1.7; the economics that score the resulting scope live in Chapter 1.8.

Deep dive: the cooling cliff as a one-way door

Of all the cascade links, density → cooling is the one that punishes a wrong scope most violently, because it is a discontinuity rather than a slope. Below ~41 kW/rack you are in air's regime: raised floor or slab, hot/cold-aisle containment, CRAH/in-row coolers, warmer ASHRAE A1–A4 supply air. Push to ~50–100 kW and rear-door heat exchangers or air-assisted liquid bridge the gap — the brownfield-friendly path, because they need no facility water at the rack. Past ~100 kW the only answer is direct-to-chip liquid: cold plates, in-rack manifolds, ~150–200 quick-disconnects per rack, a CDU isolating the technology-cooling loop from facility water, and a warm-water loop sized to a tight delta-T.

The reason this is a one-way door: a hall built for air has the wrong floor loading, no plenum for liquid distribution, insufficient electrical headroom, and often no facility water provisioned. Crossing the cliff in a retrofit costs $5–10M/MW and still leaves stranded capacity — power you cannot use because cooling caps out first, or floor area you cannot fill because the slab cannot bear wet racks. The decision to plumb a hall for liquid is therefore an archetype decision masquerading as a mechanical one. If there is any chance the facility hosts training or next-generation dense inference, you plumb for liquid at scoping time or you accept that you have built an inference-only, current-generation building. → Chapter 5.1; retrofit paths in Chapter 5.4.
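
Stranded capacity is simply the gap between what each subsystem can support. The sketch below is a toy model with hypothetical names and hypothetical site numbers: usable IT load is capped by the smaller of power and cooling, and rack count is further capped by how many positions the slab can bear.

```python
def usable_it_mw(power_mw: float, cooling_mw: float) -> float:
    """IT load the hall can actually run: the tighter of power and cooling."""
    return min(power_mw, cooling_mw)


def stranded_power_mw(power_mw: float, cooling_mw: float) -> float:
    """Energized power that cannot be used because cooling caps out first."""
    return max(0.0, power_mw - cooling_mw)


def max_racks(power_mw: float, cooling_mw: float,
              floor_rack_positions: int, rack_kw: float) -> int:
    """Racks that fit once power, cooling, and floor loading are all respected."""
    by_load = int(usable_it_mw(power_mw, cooling_mw) * 1000 // rack_kw)
    return min(by_load, floor_rack_positions)


# Hypothetical retrofit: 20 MW energized, 8 MW of heat rejection, 132 kW racks.
print(stranded_power_mw(20, 8))                                  # 12.0
print(max_racks(20, 8, floor_rack_positions=200, rack_kw=132))   # 60
```

In this example 12 MW of interconnection sits idle and only 60 of 200 rack positions can be filled, which is the stranded-capacity outcome the retrofit path risks.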

Anti-patterns

The same mis-scopes recur, because each one comes from skipping the archetype question and reasoning from the equipment or the real estate instead. Three are worth naming explicitly:

  • Training fabric for an inference business. Sizing a 1:1 non-blocking back-end fabric for a workload whose requests fit inside a node wastes ~31% of back-end cost on bisection bandwidth that never carries traffic. Inference earns the revenue but does not need the trainer's fabric — oversubscribe it and spend the savings on geo-distribution and uptime.
  • Retrofitting past the air-cooling cliff. Trying to land 132 kW liquid-cooled racks in a hall scoped for 40 kW air. The slab, the plenum, the power chain, and the absent facility water all say no. The retrofit either fails or strands capacity at a cost that would have funded a purpose-built liquid hall.
  • Over-provisioned redundancy for checkpointable jobs. Commissioning 2N / Tier-IV power for a synchronous training cluster that already tolerates checkpoint-and-resume. You are buying nines the workload does not value — capital that would return more as goodput (faster checkpointing, hot spares, more GPUs) than as facility availability. → Chapter 12.2.

Each archetype gets a full treatment of its own: training in Chapter 1.2, inference in Chapter 1.3, post-training/RL in Chapter 1.4, edge in Chapter 1.5. The procurement fork is deepened in Chapter 1.6; the per-subsystem mapping is tabulated in Chapter 1.7; the economics that score every archetype live in Chapter 1.8. The cooling cliff that this chapter treats as a fork is engineered in Chapter 5.1 and Chapter 5.4; the fabric oversubscription decision in Chapter 8.5; the checkpoint math behind training's interruption tolerance in Chapter 9.4; and the redundancy rethink in Chapter 12.2.