Appendix E
Glossary, Phase-Gate Timeline & Learning/Community Map
An appendix earns its place by being the page you keep open at the bench: this one collapses the guide's vocabulary into a single scannable glossary, lays the 24–60 month land-to-go-live schedule on one critical-path table, and points you at the certifications, conferences, and feeds that keep the rest from going stale between editions.
What you'll decide here
- Use the glossary as a decoder ring: every acronym in the body chapters resolves here, with the canonical chapter that owns the full treatment named in the third column — jump there when the one-line definition is not enough.
- Read the phase-gate timeline as a critical path, not a checklist: the rows marked YES in capitals in the last column are the ones that most often serialize the whole program; durations off the critical path can overlap, but a slip on a critical-path gate slips go-live one-for-one.
- Treat the month ranges as planning anchors, not commitments — they are 2026 practitioner medians for a >50 MW greenfield AI build; a retrofit or colo fit-out compresses the front half, and a contested interconnection or substation lead time blows out the back half.
- Work the learning/community map as a maintenance plan: pick one certification track per discipline on your team, put the two or three anchor conferences on the calendar, and subscribe to the feeds so the figures in this guide are corrected by primary sources, not by a competitor's outage.
This appendix is the reference layer the rest of the guide leans on. It does three jobs. First, it is the glossary: four thematic tables that together resolve every term of art the body chapters use, from efficiency ratios (PUE, WUE, ERF) through utilization metrics (MFU, MBU, goodput) to fabric and packaging vocabulary (NVLink domain, CoWoS, HBM, RoCE) and the program-management primitives (SU, ETTR, phase gate). Each entry names the chapter that owns the full treatment, so the glossary doubles as an index. Second, it is the phase-gate timeline: the 24-to-60-month sequence from raw land to a live cluster, with realistic durations and the critical path called out, because a common scoping error is treating parallelizable work as serial and serial work as parallelizable. Third, it is the learning and community map: the certifications, conferences, and feeds that let a practitioner keep this material current after the ink dries.
None of this is meant to be read front-to-back. It is meant to be searched. The tables are dense on purpose.
Glossary — efficiency, utilization & thermal metrics
The glossary is split across four tables so each stays scannable. This first table covers the facility-efficiency and workload-utilization ratios plus the thermal sizing terms; the canonical definitions and the post-PUE metric stack are built out in Chapter 15.1, with the utilization metrics anchored in Chapter 0.3 and the goodput reframing in Chapter 12.2.
| Term | Definition | Canonical chapter |
|---|---|---|
| PUE — Power Usage Effectiveness | Total facility power / IT power. The headline efficiency ratio; 1.0 is theoretical perfect, AI liquid-cooled halls target ~1.1–1.2. Says nothing about IT-side efficiency. | 15.1 |
| WUE — Water Usage Effectiveness | Liters of water consumed per kWh of IT energy. The water analog of PUE; evaporative cooling trades a better PUE for a worse WUE. | 15.1, 15.4 |
| ERF — Energy Reuse Factor | Fraction of facility energy exported as useful heat (district heating, etc.); unlike PUE, WUE, and CUE, higher is better. | 15.1, 15.5 |
| REF / CUE — Renewable Energy Factor / Carbon Usage Effectiveness | REF: renewable share of supply. CUE: kg CO2e per kWh of IT energy. The carbon companions to PUE. | 15.1, 15.3 |
| ITUE / TUE — IT-power / Total-power Usage Effectiveness | ITUE pushes the boundary inside the server (fans, VRMs, PSUs); TUE = PUE × ITUE, the true facility-to-transistor ratio. | 15.1 |
| MFU — Model FLOPs Utilization | Achieved FLOPs / peak FLOPs for a training run. The headline training-efficiency number; 35–55% is good at scale, collectives and stragglers erode it. | 0.3, 13.9 |
| MBU — Model Bandwidth Utilization | Achieved memory bandwidth / peak, for memory-bound decode inference. The MFU analog when the bottleneck is HBM bandwidth, not FLOPs. | 0.3, 10.11 |
| Goodput | Useful work delivered per unit time after subtracting failed/restarted/stale work. The metric that matters; distinct from raw throughput and from facility availability. | 12.2, 10.11 |
| ETTR — Effective Training Time Ratio | Productive training wall-clock / total elapsed wall-clock. Folds in interruptions, checkpoint overhead, and restart loss; the goodput metric for training. | 12.2, 9.4 |
| Tokens-per-joule | Inference energy efficiency: tokens emitted per joule of facility energy. The cross-vendor, cross-architecture comparator that survives generation changes. | 15.1, 7.10 |
| $/GPU-hr | All-in cost to operate one accelerator for one hour (capex amortization + power + cooling + staff). The unit economic for build-vs-rent. | 1.8, 7.11 |
| $/M-tokens | Cost to serve one million tokens; the revenue-side unit for inference businesses. | 1.8, 10.11 |
| EDPp — Energy-Delay Product (per op) | Energy × latency, penalizing slow-and-power-hungry designs; a silicon/architecture figure of merit that resists gaming by either axis alone. | 7.10 |
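| Delta-T / approach temperature | Delta-T is the coolant temperature rise across a cold plate or heat exchanger; approach is the gap between the two fluid streams at the exchanger (leaving temperature of one versus entering temperature of the other). The tight delta-T (under ~10 C across DLC cold plates) is what sizes the warm-water loop. | 5.1, 5.4 |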
| NTU / effectiveness | Number-of-transfer-units and heat-exchanger effectiveness; the sizing math for CDUs and dry/wet coolers. | 5.1 |
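The ratios in this table are simple enough to compute directly. The sketch below restates four of them as Python functions so the definitions can be checked against your own meter data; the function names and every input figure are illustrative assumptions, not values taken from the guide.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power."""
    return total_facility_kw / it_kw


def tue(pue_value: float, itue_value: float) -> float:
    """Total-power Usage Effectiveness: TUE = PUE x ITUE."""
    return pue_value * itue_value


def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved / peak throughput for a training run."""
    return achieved_flops_per_s / peak_flops_per_s


def ettr(productive_hours: float, elapsed_hours: float) -> float:
    """Effective Training Time Ratio: productive / total elapsed wall-clock."""
    return productive_hours / elapsed_hours


# Hypothetical hall and training run (assumed numbers, for illustration only).
hall_pue = pue(total_facility_kw=57_500, it_kw=50_000)
hall_tue = tue(hall_pue, itue_value=1.08)
run_mfu = mfu(achieved_flops_per_s=4.0e17, peak_flops_per_s=1.0e18)
run_ettr = ettr(productive_hours=648, elapsed_hours=720)

print(f"PUE={hall_pue:.2f} TUE={hall_tue:.3f} MFU={run_mfu:.0%} ETTR={run_ettr:.0%}")
```

Keeping each ratio as its own function makes the measurement boundary explicit: PUE stops at the IT load, while TUE carries the overhead inside the server as well.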
Glossary — compute, memory & packaging
The silicon and packaging vocabulary that gates supply and density. The accelerator landscape lives in Chapter 7.1, HBM as the binding constraint in Chapter 7.6, and advanced packaging in Chapter 7.7.
| Term | Definition | Canonical chapter |
|---|---|---|
| HBM — High-Bandwidth Memory | Stacked DRAM (HBM3E/HBM4) on-package with the accelerator; the bandwidth and capacity ceiling on AI compute and the true supply bottleneck. | 7.6 |
| CoWoS — Chip-on-Wafer-on-Substrate | TSMC's 2.5D advanced-packaging process that integrates logic die + HBM stacks on a silicon interposer; CoWoS wafer capacity is the upstream gate above assembly. | 7.7 |
| Interposer | The silicon (or organic/RDL) layer carrying high-density interconnect between logic and HBM in a 2.5D package; reticle-size limits drive the move to larger and stitched interposers. | 7.7 |
| XPU | Generic term for a non-GPU AI accelerator (TPU, Trainium/Inferentia, Maia, MTIA); hyperscaler custom silicon competing with merchant GPUs. | 7.4, 7.5 |
| MoE — Mixture of Experts | Sparse architecture activating a subset of expert sub-networks per token; widens expert-parallelism and reshapes both training fabric and inference KV-cache pressure. | 1.2, 8.5 |
| KV cache | Cached key/value tensors for attention during decode; its size scales with context length and concurrency, dominating inference memory and driving disaggregation. | 10.11 |
| Quantization (FP8/FP4/INT8) | Reduced numerical precision to cut memory and lift throughput; the compute-vs-accuracy lever, increasingly native in Blackwell/Rubin-class silicon. | 7.10 |
| TDP — Thermal Design Power | The sustained power (and heat) an accelerator package must dissipate; the per-chip number that propagates up to rack density and the cooling cliff. | 5.1, 7.12 |
| Power transient / load step | Synchronized GPU draw swings (idle-to-full across thousands of GPUs in milliseconds) that stress the power chain; mitigated in layers along the chip→BBU→BESS chain. | 4.5, 7.12 |
| SST — Solid-State Transformer | Power-electronics transformer (~99% efficiency) enabling MV-to-DC conversion for 800 VDC megawatt-rack architectures. | 4.1, 4.4 |
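The KV cache and quantization rows interact: the cache stores one key and one value vector per layer, per KV head, per token, so its footprint is a straight product of those terms. The sketch below is a rough sizing estimate assuming a standard multi-head or grouped-query attention layout; the model shape is hypothetical and does not describe any specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    concurrent_seqs: int,
    bytes_per_elem: int = 2,  # 2 = FP16/BF16, 1 = FP8 (assumed storage formats)
) -> int:
    """Estimate KV-cache size for a batch of sequences at full context."""
    # Factor of 2 covers the separate key and value tensors.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * concurrent_seqs


GIB = 2**30

# Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128.
fp16_bytes = kv_cache_bytes(80, 8, 128, context_len=8192, concurrent_seqs=32)
fp8_bytes = kv_cache_bytes(80, 8, 128, context_len=8192, concurrent_seqs=32,
                           bytes_per_elem=1)

print(f"FP16: {fp16_bytes / GIB:.1f} GiB, FP8: {fp8_bytes / GIB:.1f} GiB")
# prints "FP16: 80.0 GiB, FP8: 40.0 GiB"
```

The linear dependence on both context length and concurrency is what the glossary entry means by the cache dominating inference memory: doubling either one doubles the footprint, and halving the element width halves it.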
Glossary — interconnect, fabric & networking
The two-tier network vocabulary: scale-up (inside the coherent domain) versus scale-out (across the cluster). Scale-up interconnect is treated in Chapter 8.3, scale-out topology and oversubscription in Chapter 8.5, and Ethernet/RoCE transport in Chapter 8.6.
| Term | Definition | Canonical chapter |
|---|---|---|
| NVLink domain (scale-up domain) | The set of GPUs sharing a coherent high-bandwidth NVLink/NVSwitch fabric (8 in HGX, 72 in NVL72, 576 in Rubin Ultra); its size sets tensor/expert-parallel ceilings. | 8.3, 8.5 |
| NVSwitch | NVIDIA's switch ASIC that fully connects a scale-up domain; NVLink-SHARP performs in-network reduction to accelerate collectives. | 8.3 |
| UALink | Open scale-up interconnect standard (UALink 1.0, up to 1,024 accelerators); the multi-vendor alternative to NVLink, often realized over Ethernet (UALoE). | 8.3 |
| InfiniBand (IB) | Low-latency lossless scale-out fabric with native RDMA and adaptive routing; the historical default for non-blocking training back-ends. | 8.6 |
| RoCE — RDMA over Converged Ethernet | RDMA carried on Ethernet (typically lossless via PFC/ECN+DCQCN); the open, cost-driven scale-out alternative to InfiniBand. | 8.6 |
| Spectrum-X | NVIDIA's Ethernet-based scale-out platform tuning RoCE for AI collectives (adaptive routing, congestion control); the Ethernet answer to InfiniBand. | 8.6 |
| UEC — Ultra Ethernet Consortium / UET | Spec 1.0 transport (Ultra Ethernet Transport) with packet spray + reorder, UCCM congestion control, and packet trimming; the open roadmap for AI-grade Ethernet. | 8.6 |
| Rail-optimized / fat-tree | Topology pinning each GPU NIC to a dedicated 'rail' of leaf/spine switching for a non-blocking, collision-free back-end; the canonical training fabric. | 8.5 |
| Oversubscription ratio | Ratio of edge bandwidth to bisection bandwidth (1:1 = non-blocking, 3:1 = oversubscribed); the cost lever that distinguishes training fabrics from inference fabrics. | 8.5 |
| Bisection bandwidth | Aggregate bandwidth across the worst-case cut of the network; the figure of merit for all-reduce-heavy training collectives. | 8.5 |
| PFC / ECN / DCQCN | Lossless-Ethernet congestion-control mechanics: Priority Flow Control (pause), Explicit Congestion Notification, and the DCQCN tuning loop; mis-tuned, they cause head-of-line blocking and victim flows. | 8.6 |
| CPO — Co-Packaged Optics | Optics integrated into the switch/accelerator package to beat copper-reach limits at NVLink/scale-up speeds; trades serviceability for reach and power. | 8.9 |
| NVMe-oF | NVMe over Fabrics (RoCE or TCP transport) for disaggregated storage; the placement-vs-transport tradeoff for the storage rail. | 8.5, 9.1 |
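The oversubscription ratio is usually checked one switch tier at a time. The sketch below uses the common per-leaf proxy, host-facing bandwidth divided by uplink bandwidth, which is an assumption on my part and a simplification of the edge-to-bisection definition in the table; the port counts and speeds are hypothetical.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Host-facing bandwidth / uplink bandwidth for one leaf switch.

    1.0 means non-blocking at this tier; values above 1.0 are oversubscribed.
    """
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Hypothetical training leaf: ports split evenly between hosts and spines.
training_leaf = oversubscription(down_ports=32, down_gbps=400,
                                 up_ports=32, up_gbps=400)

# Hypothetical inference leaf: more host ports, fewer uplinks.
inference_leaf = oversubscription(down_ports=48, down_gbps=400,
                                  up_ports=16, up_gbps=400)

print(f"training {training_leaf:.0f}:1, inference {inference_leaf:.0f}:1")
# prints "training 1:1, inference 3:1"
```

A per-leaf ratio of 1:1 is necessary but not sufficient for a non-blocking fabric; the same check has to hold at every tier up to the worst-case cut.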
Glossary — facility, power, cooling & program
The building, electrical, mechanical, and project-management terms. Power topology lives in Chapter 4.1, DLC in Chapter 5.4, the reliability rethink in Chapter 12.2, and the integrated master schedule and critical path in Chapter 2.1.
| Term | Definition | Canonical chapter |
|---|---|---|
| SU — Scalable Unit (reference design) | The repeatable build block (a defined MW + GPU + cooling + fabric increment) that the capacity ramp is composed of; the unit of design reuse and procurement. | 1.7 |
| DLC — Direct-to-Chip Liquid Cooling | Cold plates on the hot components fed by a CDU-isolated technology loop; the 2026 default above the ~100 kW/rack air-cooling cliff. | 5.4 |
| CDU — Coolant Distribution Unit | Heat exchanger + pumps isolating the clean technology-cooling loop from facility water; sizes the warm-water delta-T and provides leak isolation. | 5.4, 5.13 |
| RDHx — Rear-Door Heat Exchanger | Liquid-cooled door bridging ~50–100 kW/rack without facility water at the rack; the brownfield-friendly step before full DLC. | 5.3, 5.10 |
| 800 VDC | Direct-current rack/distribution architecture for megawatt-class racks (NVIDIA/OCP Mt Diablo); cuts conversion stages and copper for ~600 kW–1 MW racks. | 4.1, 4.4 |
| BBU / BESS | Battery Backup Unit (rack-level ride-through) and Battery Energy Storage System (facility-level); the chip→BBU→BESS spine that absorbs GPU load transients and bridges to gensets. | 4.5, 4.7 |
| Tier (Uptime I–IV) | Uptime Institute topology classification: Tier I (basic) → Tier IV (fault-tolerant, 2N). Tier III = concurrently maintainable; the redundancy reference for inference-shaped builds. | 12.1, 12.2 |
| 2N / N+1 | Redundancy notation: N+1 = one spare component, 2N = fully mirrored. Training tolerates N/N+1 (checkpointable); always-on inference justifies 2N. | 12.2 |
| RBD / Markov / Monte-Carlo | The three availability-modeling techniques: Reliability Block Diagrams, Markov state models, and stochastic simulation; the quantitative machinery behind the nines. | 12.5 |
| Phase gate (stage gate) | A go/no-go decision point between project phases where deliverables are reviewed and capital is released; the spine of the timeline table below. | 2.1 |
| IMS / critical path | Integrated Master Schedule and its critical path: the longest dependent chain of tasks whose slip slips the whole project; everything off it has float. | 2.1 |
| Long-lead equipment | Items whose procurement lead time (transformers, switchgear, chillers, GPUs) drives the schedule; ordered against a frozen design basis before they bottleneck go-live. | 2.1, 4.1 |
| Commissioning (Cx) L1–L5 | The five commissioning levels from factory acceptance (L1) through component, system, and integrated systems testing (L5 IST); proves the facility before load. | 13.1, 13.6 |
| Speed-to-power | The time from contract to energized MW; the binding constraint of the 2026 era and the primary siting screen. | 3.2 |
| TTFT / TPOT | Time-To-First-Token and Time-Per-Output-Token; the two latency SLOs that govern online-inference fleet sizing. | 10.11 |
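The 2N / N+1 and RBD rows can be tied together with a few lines of arithmetic. The sketch below is a minimal reliability-block-diagram calculation assuming identical, statistically independent components, which real power and cooling plants only approximate; the 0.99 and 0.999 availabilities are placeholders, not figures from the guide.

```python
from math import comb, prod
from typing import Iterable


def series(availabilities: Iterable[float]) -> float:
    """Series blocks: every block must be up, so availabilities multiply."""
    return prod(availabilities)


def k_of_n(k: int, n: int, a: float) -> float:
    """Redundant group: at least k of n identical, independent units are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


unit = 0.99  # assumed availability of one unit or one power path

two_n = k_of_n(1, 2, unit)      # 2N: either of two mirrored paths suffices
n_plus_1 = k_of_n(4, 5, unit)   # N+1 with N = 4: any four of five units suffice
chain = series([two_n, 0.999])  # the 2N group in series with one shared block

print(f"2N={two_n:.6f} N+1={n_plus_1:.6f} chain={chain:.6f}")
```

The series term shows why a single shared block caps the whole chain: the result can never exceed the availability of its weakest series element. Common-cause failures break the independence assumption and pull real results below these figures.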
The project phase-gate timeline & critical path
The end-to-end schedule for a greenfield AI campus runs 24 to 60 months from land control to a live cluster, with the spread driven almost entirely by one variable: how long it takes to get firm megawatts energized. The table below sequences the program as phase gates, with each gate's realistic duration, the gate decision that releases the next phase, and whether the phase sits on the critical path (its slip slips go-live one-for-one) or has float (it can overlap or absorb delay). The critical path for a power-bound build is not construction — it is interconnection. The grid study, the utility agreement, and the substation/transformer lead time routinely dominate everything that follows, which is why land and power are secured before design is frozen and why long-lead electrical gear is ordered the moment the design basis is signed.
Read the duration ranges as 2026 practitioner medians for a >50 MW build, not commitments. A retrofit or a colocation fit-out deletes the land/permit/construct front half and compresses to 6–18 months; a contested interconnection or a multi-year transformer queue blows out the back half past 60 months.
| Phase | Typical duration | Gate decision (what releases the next phase) | On critical path? |
|---|---|---|---|
| 0. Scope & site search | 2–6 months | Workload profile, capacity ramp, and design basis signed; target market and shortlist approved. | Yes — gates everything |
| 1. Land control | 1–4 months | Site optioned/acquired; zoning and entitlement path confirmed; environmental Phase I clear. | Yes |
| 2. Power / interconnection | 12–48 months | Executed interconnection agreement and firm-capacity / energization date; the dominant critical-path item. | YES — usually binds |
| 3. Permitting & entitlement | 6–18 months | Building, environmental, water, and air permits issued; often overlaps power but can become the binding gate. | Often (overlaps power) |
| 4. Design (concept → DD → IFC) | 6–12 months | Issued-for-construction documents; design basis frozen so long-lead gear can be ordered. | Partly — front-loads procurement |
| 5. Long-lead procurement | 12–24 months (parallel) | POs placed against frozen design; transformers/switchgear/chillers ordered early to de-risk the schedule. | YES (gear lead time) |
| 6. Construction (shell + fit-out) | 12–24 months | Substantial completion; building, electrical, and mechanical infrastructure ready for commissioning. | Yes |
| 7. Commissioning (L1–L5 IST) | 3–9 months | Integrated Systems Testing (L5) passed; facility proven under simulated and staged real load. | Yes |
| 8. Cluster bring-up & burn-in | 1–4 months | GPU node burn-in, fabric validation, and reference-training/benchmark acceptance complete. | Yes |
| 9. Staged ramp & go-live | 1–3 months | Staged power/load ramp to full; handover to operations; SLA clock starts. | Yes — terminal gate |
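The critical path through the table can be computed as a longest-path pass over the phase dependencies. The sketch below uses the duration ranges from the table, but the dependency edges in `PREDS` are my own simplifying assumption and not the guide's integrated master schedule. Under those edges, stacking every low bound gives 26 months and stacking every high bound gives 74, so the totals are not expected to reproduce the 24-to-60-month planning range exactly.

```python
from typing import Dict, List, Optional, Tuple

# (low, high) duration in months per phase, copied from the table above.
DURATIONS: Dict[int, Tuple[int, int]] = {
    0: (2, 6), 1: (1, 4), 2: (12, 48), 3: (6, 18), 4: (6, 12),
    5: (12, 24), 6: (12, 24), 7: (3, 9), 8: (1, 4), 9: (1, 3),
}

# ASSUMED predecessors: power, permitting, and design all start after land
# control; procurement follows design; construction needs permits and design;
# commissioning needs energized power, delivered gear, and a finished building.
PREDS: Dict[int, List[int]] = {
    0: [], 1: [0], 2: [1], 3: [1], 4: [1],
    5: [4], 6: [3, 4], 7: [2, 5, 6], 8: [7], 9: [8],
}


def critical_path(durations: Dict[int, int],
                  preds: Dict[int, List[int]]) -> Tuple[int, List[int]]:
    """Return (total months, phases on the longest dependent chain)."""
    finish: Dict[int, int] = {}
    via: Dict[int, Optional[int]] = {}
    # Phase numbers are already a valid topological order for these edges.
    for phase in sorted(durations):
        latest = max(preds[phase], key=lambda p: finish[p], default=None)
        via[phase] = latest
        finish[phase] = durations[phase] + (finish[latest] if latest is not None else 0)
    node: Optional[int] = max(finish, key=lambda p: finish[p])
    path: List[int] = []
    while node is not None:
        path.append(node)
        node = via[node]
    return max(finish.values()), path[::-1]


low_total, low_path = critical_path({p: d[0] for p, d in DURATIONS.items()}, PREDS)
high_total, high_path = critical_path({p: d[1] for p, d in DURATIONS.items()}, PREDS)

print(low_total, low_path)
print(high_total, high_path)
```

In the high case the chain runs through phase 2, which matches the table's claim that interconnection usually binds. In the low case the chain runs through design and then procurement or construction, which tie, so the binding gate depends on where in its range each phase lands.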
Learning & community map — certifications
The certification ladder splits by discipline. There is no single credential for an AI-data-center engineer; the strong teams hold a spread across facility design, operations, and the network/compute stack. The table flags the credential, who issues it, and which role it maps to. Treat it as a hiring and development reference, not a gate — the deployed expertise in this field still outruns any certificate.
| Credential | Issuer | Maps to role / domain |
|---|---|---|
| ATD — Accredited Tier Designer | Uptime Institute | Facility design engineers / licensed PEs; the design-side credential mapping directly to the Tier classification used in commissioning. |
| ATS — Accredited Tier Specialist | Uptime Institute | Operations and facility staff managing/maintaining to Tier criteria; the operations companion to the ATD. |
| CDCDP / DCDC | CNet Training (CDCDP, BTEC-accredited) / BICSI (DCDC) | Certified Data Centre Design Professional / Data Center Design Consultant; multidisciplinary design competency. |
| CDCMP / CDCEP | CNet Training | Data Centre Management / Energy Professional; operations leadership and efficiency engineering. |
| CDCP / CDCS / CDCE | EPI | Certified Data Centre Professional → Specialist → Expert; a tiered facility-operations ladder. |
| PE (Electrical / Mechanical) | State licensing boards (US) / equivalents | The statutory license to stamp design documents; foundational for Phase 4 sign-off. |
| NVIDIA-Certified (NCP/NCA, networking & AI infra) | NVIDIA | GPU-cluster and fabric engineers; CUDA/NCCL, InfiniBand/Spectrum-X, and DGX/SuperPOD operations. |
| CCNP / network specialist (Ethernet, RoCE) | Cisco / Arista / Juniper | Scale-out fabric engineers building and tuning RoCE/lossless-Ethernet AI back-ends. |
| OCP-aligned training | Open Compute Project community | Open-hardware rack/power/cooling literacy (Open Rack, Mt Diablo 800 VDC, ORW). |
Learning & community map — conferences & feeds
Two final tables. The conference calendar is the place to calibrate against the field — hardware roadmaps break at OCP and GTC, facility practice at DCD and 7x24, network practice at the OCP networking tracks and vendor summits. The feeds are what keep the numbers in this guide honest between editions: independent analysis (SemiAnalysis), facility-industry reporting (DCD, Data Center Frontier), the standards bodies themselves, and the operator engineering blogs that publish ground truth from production fleets.
| Event | Cadence / typical window | Why it is on the calendar |
|---|---|---|
| OCP Global Summit | Annual, October (San Jose) | Where hyperscaler-grade open hardware breaks: Open Rack, 800 VDC / Mt Diablo, cooling and networking working groups. |
| NVIDIA GTC | Annual, March (San Jose) | The accelerator/roadmap keynote that sets the density-ramp expectations the rest of the industry designs against. |
| DCD>Connect (regional series) | Multiple/year (NYC, London, Virginia, APAC) | The facility-operator and capital-markets gathering; siting, power, cooling, and build-out practice. |
| 7x24 Exchange | Semiannual (US) | Mission-critical facility operations, commissioning, and reliability — the Cx/operations community. |
| Datacloud / Data Centre World | Annual (Europe + global) | European and global colocation, investment, and infrastructure deal-making and design practice. |
| DesignCon | Annual, late January (Santa Clara) | Signal/power integrity and high-speed interconnect engineering — the physical-layer fabric community. |
| Hot Chips / ISSCC / SC | Annual (academic/industry) | Silicon architecture (Hot Chips, ISSCC) and HPC/AI supercomputing (SC) — the upstream compute and packaging research. |
| Source | Type | What it is good for |
|---|---|---|
| SemiAnalysis | Independent analysis (paid) | The definitive $/GPU-hr, supply-chain (CoWoS/HBM), rack-teardown, and fabric-economics primary analysis. |
| Data Center Dynamics (DCD) | Industry news | Facility builds, power deals, interconnection news, and roadmap reporting across the global market. |
| Data Center Frontier | Industry news | US-focused facility engineering, power-architecture, and cooling-transition reporting. |
| The Next Platform / The Register (on-prem) | Technical journalism | Systems-level AI-infrastructure and interconnect analysis with an engineering bent. |
| LBNL (Berkeley Lab) — Queued Up et al. | Primary research | Authoritative interconnection-queue, grid, and data-center energy studies — the source behind the power figures. |
| Uptime Institute (research & blog) | Standards / research | Tier standards, outage analyses, and the annual data-center survey; the reliability ground truth. |
| OCP / UEC / UALink / OIF (standards bodies) | Primary specs | The authoritative spec text for open rack/power/cooling and AI-fabric standards — read the spec, not the summary. |
| Operator engineering blogs (Meta, Microsoft, Google) | Production ground truth | RoCE-at-scale, checkpointing, fleet-reliability, and cooling practice published from real production clusters. |
How to keep this appendix from going stale
Every figure in this guide has a half-life. Density numbers move each accelerator generation; interconnection-queue medians move each ISO filing; lead times move with the transformer market. The discipline that keeps a reference like this useful is the same one that keeps the guide itself current: cite the source and the as-of date for every load-bearing number, and re-check the ones that gate a decision before you act on them. The feeds table is the maintenance plan. When this guide and a primary source disagree, the primary source wins — and the glossary entry's canonical chapter is where you go to understand why the number moved, not just that it did.
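One way to make that discipline mechanical is to store each load-bearing number with its source and as-of date and flag the ones that have aged past a review window. The sketch below is a hypothetical record layout; the field names, the 180-day window, and the sample entries are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Figure:
    name: str
    value: str
    source: str           # where the number came from
    as_of: date           # date the source stated or measured it
    gates_decision: bool  # True if a go/no-go decision depends on it


def needs_recheck(fig: Figure, today: date, max_age_days: int = 180) -> bool:
    """Flag decision-gating figures older than the review window."""
    return fig.gates_decision and (today - fig.as_of).days > max_age_days


# Hypothetical entries; sources and dates are placeholders.
figures = [
    Figure("interconnection duration", "12-48 months", "placeholder source A",
           date(2026, 1, 15), gates_decision=True),
    Figure("transformer lead time", "placeholder", "placeholder source B",
           date(2026, 6, 1), gates_decision=True),
    Figure("conference month", "October", "placeholder source C",
           date(2025, 1, 1), gates_decision=False),
]

stale = [f.name for f in figures if needs_recheck(f, today=date(2026, 9, 1))]
print(stale)  # prints "['interconnection duration']"
```

Passing `today` explicitly keeps the check reproducible, and the review window is a knob to tune per figure rather than a recommended value.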
The pairing to internalize: the glossary tells you what a term means, the canonical chapter tells you why it matters, the phase-gate timeline tells you when the decision is due, and the community map tells you where to check that the answer is still true. Used together, they are the difference between a reference that ages well and one that misleads a year after it ships.