Appendix E

Glossary, Phase-Gate Timeline & Learning/Community Map

An appendix earns its place by being the page you keep open at the bench: this one collapses the guide's vocabulary into a single scannable glossary, lays the 24–60 month land-to-go-live schedule on one critical-path table, and points you at the certifications, conferences, and feeds that keep the rest from going stale between editions.

What you'll decide here

Use the glossary as a decoder ring: every acronym in the body chapters resolves here, with the canonical chapter that owns the full treatment named in the third column — jump there when the one-line definition is not enough.
Read the phase-gate timeline as a critical path, not a checklist: the rows marked Yes or YES in the last column are the ones that serialize the whole program; durations off the critical path can overlap, but a slip on a critical-path gate slips go-live one-for-one.
Treat the month ranges as planning anchors, not commitments — they are 2026 practitioner medians for a >50 MW greenfield AI build; a retrofit or colo fit-out compresses the front half, and a contested interconnection or substation lead time blows out the back half.
Work the learning/community map as a maintenance plan: pick one certification track per discipline on your team, put the two or three anchor conferences on the calendar, and subscribe to the feeds so the figures in this guide are corrected by primary sources, not by a competitor's outage.

This appendix is the reference layer the rest of the guide leans on. It does three jobs. First, it is the glossary: four themed tables that resolve every term of art the body chapters use, from efficiency ratios (PUE, WUE, ERF) through utilization metrics (MFU, MBU, goodput) to fabric and packaging vocabulary (NVLink domain, CoWoS, HBM, RoCE) and the program-management primitives (SU, ETTR, phase gate). Each entry names the chapter that owns the full treatment, so the glossary doubles as an index. Second, it is the phase-gate timeline: the 24-to-60-month sequence from raw land to a live cluster, with realistic durations and the critical path called out, because a common scoping error is treating parallelizable work as serial and serial work as parallelizable. Third, it is the learning and community map: the certifications, conferences, and feeds that let a practitioner keep this material current after the ink dries.

None of this is meant to be read front-to-back. It is meant to be searched. The tables are dense on purpose.

Glossary — efficiency, utilization & thermal metrics

The glossary is split across four tables so each stays scannable. This first table covers the facility-efficiency and workload-utilization ratios; the canonical definitions and the post-PUE metric stack are built out in Chapter 15.1, with the utilization metrics anchored in Chapter 0.3 and the goodput reframing in Chapter 12.2.

Glossary I — efficiency, utilization & thermal metrics

Term	Definition	Canonical chapter
PUE — Power Usage Effectiveness	Total facility power / IT power. The headline efficiency ratio; 1.0 is the theoretical ideal, AI liquid-cooled halls target ~1.1–1.2. Says nothing about IT-side efficiency.	15.1
WUE — Water Usage Effectiveness	Liters of water consumed per kWh of IT energy. The water analog of PUE; evaporative cooling trades a better PUE for a worse WUE.	15.1, 15.4
ERF — Energy Reuse Factor	Fraction of facility energy exported as useful heat (district heating, etc.); higher is better, unlike PUE, WUE, and CUE.	15.1, 15.5
REF / CUE — Renewable Energy Factor / Carbon Usage Effectiveness	REF: renewable share of supply (higher is better). CUE: kg CO2e per kWh of IT energy (lower is better). The carbon companions to PUE.	15.1, 15.3
ITUE / TUE — IT / Total Usage Effectiveness	ITUE pushes the boundary inside the server (fans, VRMs, PSUs); TUE = PUE x ITUE, the true facility-to-transistor ratio.	15.1
MFU — Model FLOPs Utilization	Achieved model FLOP/s / peak hardware FLOP/s for a training run. The headline training-efficiency number; 35–55% is good at scale, collectives and stragglers erode it.	0.3, 13.9
MBU — Model Bandwidth Utilization	Achieved memory bandwidth / peak, for memory-bound decode inference. The MFU analog when the bottleneck is HBM bandwidth, not FLOPs.	0.3, 10.11
Goodput	Useful work delivered per unit time after subtracting failed/restarted/stale work. The metric that matters; distinct from raw throughput and from facility availability.	12.2, 10.11
ETTR — Effective Training Time Ratio	Productive training wall-clock / total elapsed wall-clock. Folds in interruptions, checkpoint overhead, and restart loss; the goodput metric for training.	12.2, 9.4
Tokens-per-joule	Inference energy efficiency: tokens emitted per joule of facility energy. The cross-vendor, cross-architecture comparator that survives generation changes.	15.1, 7.10
$/GPU-hr	All-in cost to operate one accelerator for one hour (capex amortization + power + cooling + staff). The unit economic for build-vs-rent.	1.8, 7.11
$/M-tokens	Cost to serve one million tokens; the revenue-side unit for inference businesses.	1.8, 10.11
EDPp — Energy-Delay Product (per op)	Energy x latency, penalizing slow-and-power-hungry designs; a silicon/architecture figure of merit that resists gaming by either axis alone.	7.10
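Delta-T / approach temperature	Delta-T is the coolant temperature rise across a cold plate or heat exchanger; approach is the gap between one stream's outlet and the other stream's inlet temperature. The tight delta-T (under ~10 C across DLC cold plates) is what sizes the warm-water loop.	5.1, 5.4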
NTU / effectiveness	Number-of-transfer-units and heat-exchanger effectiveness; the sizing math for CDUs and dry/wet coolers.	5.1

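Direction varies by row: lower is better for the facility ratios (PUE, WUE, CUE, ITUE, TUE), the cost metrics, and EDPp; higher is better for ERF, REF, MFU, MBU, goodput, ETTR, and tokens-per-joule. Canonical chapter owns the full derivation; the one-liner here is the bench definition.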
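The ratio definitions in Glossary I reduce to one-line arithmetic. The sketch below is a minimal illustration of five of them, assuming metered energy totals and run-level counters are already available; the function names and every input value are hypothetical, chosen only to show the shape of each calculation.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy over IT energy."""
    return total_facility_kwh / it_kwh


def wue(water_liters: float, it_kwh: float) -> float:
    """Water Usage Effectiveness: liters consumed per kWh of IT energy."""
    return water_liters / it_kwh


def tue(pue_value: float, itue_value: float) -> float:
    """Total Usage Effectiveness: TUE = PUE x ITUE."""
    return pue_value * itue_value


def mfu(achieved_model_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak FLOP/s."""
    return achieved_model_flops_per_s / peak_flops_per_s


def ettr(productive_hours: float, elapsed_hours: float) -> float:
    """Effective Training Time Ratio: productive over total wall-clock."""
    return productive_hours / elapsed_hours


# Hypothetical inputs for illustration only; none of these are measured values.
facility_pue = pue(total_facility_kwh=1_150_000, it_kwh=1_000_000)
example = {
    "PUE": facility_pue,
    "WUE (L/kWh)": wue(water_liters=300_000, it_kwh=1_000_000),
    "TUE": tue(facility_pue, itue_value=1.10),
    "MFU": mfu(achieved_model_flops_per_s=4.0e17, peak_flops_per_s=1.0e18),
    "ETTR": ettr(productive_hours=650, elapsed_hours=720),
}

for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Keeping each ratio as its own function makes the measurement boundary explicit: PUE and WUE share an IT-energy denominator, while MFU and ETTR come from the training job, not the facility meter.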

Glossary — compute, memory & packaging

The silicon and packaging vocabulary that gates supply and density. The accelerator landscape lives in Chapter 7.1, HBM as the binding constraint in Chapter 7.6, and advanced packaging in Chapter 7.7.

Glossary II — compute, memory & packaging

Term	Definition	Canonical chapter
HBM — High-Bandwidth Memory	Stacked DRAM (HBM3E/HBM4) on-package with the accelerator; the bandwidth and capacity ceiling on AI compute and the true supply bottleneck.	7.6
CoWoS — Chip-on-Wafer-on-Substrate	TSMC's 2.5D advanced-packaging process that integrates logic die + HBM stacks on a silicon interposer; CoWoS wafer capacity is the upstream gate above assembly.	7.7
Interposer	The silicon (or organic/RDL) layer carrying high-density interconnect between logic and HBM in a 2.5D package; reticle-size limits drive the move to larger and stitched interposers.	7.7
XPU	Generic term for a non-GPU AI accelerator (TPU, Trainium/Inferentia, Maia, MTIA); hyperscaler custom silicon competing with merchant GPUs.	7.4, 7.5
MoE — Mixture of Experts	Sparse architecture activating a subset of expert sub-networks per token; widens expert-parallelism and reshapes both training fabric and inference KV-cache pressure.	1.2, 8.5
KV cache	Cached key/value tensors for attention during decode; its size scales with context length and concurrency, dominating inference memory and driving disaggregation.	10.11
Quantization (FP8/FP4/INT8)	Reduced numerical precision to cut memory and lift throughput; the compute-vs-accuracy lever, increasingly native in Blackwell/Rubin-class silicon.	7.10
TDP — Thermal Design Power	The sustained power (and heat) an accelerator package must dissipate; the per-chip number that propagates up to rack density and the cooling cliff.	5.1, 7.12
Power transient / load step	Synchronized GPU draw swings (idle-to-full across thousands of GPUs in milliseconds) that stress the power chain; mitigated in layers, chip→BBU→BESS.	4.5, 7.12
SST — Solid-State Transformer	Power-electronics transformer (~99% efficiency) enabling MV-to-DC conversion for 800 VDC megawatt-rack architectures.	4.1, 4.4

Vendor-neutral definitions; NVIDIA terms are flagged as such because they dominate the 2026 deployed base.

Glossary — interconnect, fabric & networking

The two-tier network vocabulary: scale-up (inside the coherent domain) versus scale-out (across the cluster). Scale-up interconnect is treated in Chapter 8.3, scale-out topology and oversubscription in Chapter 8.5, and Ethernet/RoCE transport in Chapter 8.6.

Glossary III — interconnect, fabric & networking

Term	Definition	Canonical chapter
NVLink domain (scale-up domain)	The set of GPUs sharing a coherent high-bandwidth NVLink/NVSwitch fabric (8 in HGX, 72 in NVL72, 576 in Rubin Ultra); its size sets tensor/expert-parallel ceilings.	8.3, 8.5
NVSwitch	NVIDIA's switch ASIC that fully connects a scale-up domain; NVLink-SHARP performs in-network reduction to accelerate collectives.	8.3
UALink	Open scale-up interconnect standard (UALink 1.0, up to 1,024 accelerators); the multi-vendor alternative to NVLink, often realized over Ethernet (UALoE).	8.3
InfiniBand (IB)	Low-latency lossless scale-out fabric with native RDMA and adaptive routing; the historical default for non-blocking training back-ends.	8.6
RoCE — RDMA over Converged Ethernet	RDMA carried on Ethernet (typically lossless via PFC/ECN+DCQCN); the open, cost-driven scale-out alternative to InfiniBand.	8.6
Spectrum-X	NVIDIA's Ethernet-based scale-out platform tuning RoCE for AI collectives (adaptive routing, congestion control); the Ethernet answer to InfiniBand.	8.6
UEC — Ultra Ethernet Consortium / UET	Spec 1.0 transport (Ultra Ethernet Transport) with packet spray + reorder, UCCM congestion control, and packet trimming; the open roadmap for AI-grade Ethernet.	8.6
Rail-optimized / fat-tree	Topology pinning each GPU NIC to a dedicated 'rail' of leaf/spine switching for a non-blocking, collision-free back-end; the canonical training fabric.	8.5
Oversubscription ratio	Ratio of edge bandwidth to bisection bandwidth (1:1 = non-blocking, 3:1 = oversubscribed); the cost lever that distinguishes training fabrics from inference fabrics.	8.5
Bisection bandwidth	Aggregate bandwidth across the worst-case cut of the network; the figure of merit for all-reduce-heavy training collectives.	8.5
PFC / ECN / DCQCN	Lossless-Ethernet congestion-control mechanics: Priority Flow Control (pause), Explicit Congestion Notification, and the DCQCN tuning loop; mis-tuned, they cause head-of-line blocking and victim flows.	8.6
CPO — Co-Packaged Optics	Optics integrated into the switch/accelerator package to beat copper-reach limits at NVLink/scale-up speeds; trades serviceability for reach and power.	8.9
NVMe-oF	NVMe over Fabrics (RoCE or TCP transport) for disaggregated storage; the placement-vs-transport tradeoff for the storage rail.	8.5, 9.1

Scale-up = tight coherent domain (NVLink/UALink class); scale-out = looser cluster fabric (InfiniBand/Ethernet class).
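The oversubscription entry in Glossary III is easiest to check with port arithmetic. The sketch below uses the common per-leaf approximation (host-facing bandwidth over uplink bandwidth), which is an assumption standing in for the edge-to-bisection definition in the table; the port counts and speeds are hypothetical.

```python
from fractions import Fraction


def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> Fraction:
    """Host-facing bandwidth divided by uplink bandwidth for one leaf switch."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


def as_ratio(value: Fraction) -> str:
    """Format a Fraction in the N:M notation used in the glossary."""
    return f"{value.numerator}:{value.denominator}"


# Hypothetical leaf configurations, for illustration only.
training_leaf = oversubscription(down_ports=32, down_gbps=400,
                                 up_ports=32, up_gbps=400)
inference_leaf = oversubscription(down_ports=48, down_gbps=400,
                                  up_ports=16, up_gbps=400)

print("training leaf:", as_ratio(training_leaf))
print("inference leaf:", as_ratio(inference_leaf))
```

Using Fraction keeps the result in lowest terms, so a non-blocking leaf reads as 1:1 and a leaf with three times more downlink than uplink reads as 3:1.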

Glossary — facility, power, cooling & program

The building, electrical, mechanical, and project-management terms. Power topology lives in Chapter 4.1, DLC in Chapter 5.4, the reliability rethink in Chapter 12.2, and the integrated master schedule and critical path in Chapter 2.1.

Glossary IV — facility, power, cooling & program management

Term	Definition	Canonical chapter
SU — Scalable Unit (reference design)	The repeatable build block (a defined MW + GPU + cooling + fabric increment) that the capacity ramp is composed of; the unit of design reuse and procurement.	1.7
DLC — Direct-to-Chip Liquid Cooling	Cold plates on the hot components fed by a CDU-isolated technology loop; the 2026 default above the ~100 kW/rack air-cooling cliff.	5.4
CDU — Coolant Distribution Unit	Heat exchanger + pumps isolating the clean technology-cooling loop from facility water; sizes the warm-water delta-T and provides leak isolation.	5.4, 5.13
RDHx — Rear-Door Heat Exchanger	Liquid-cooled door bridging ~50–100 kW/rack without facility water at the rack; the brownfield-friendly step before full DLC.	5.3, 5.10
800 VDC	Direct-current rack/distribution architecture for megawatt-class racks (NVIDIA/OCP Mt Diablo); cuts conversion stages and copper for ~600 kW–1 MW racks.	4.1, 4.4
BBU / BESS	Battery Backup Unit (rack-level ride-through) and Battery Energy Storage System (facility-level); the chip→BBU→BESS spine that absorbs GPU load transients and bridges to gensets.	4.5, 4.7
Tier (Uptime I–IV)	Uptime Institute topology classification: Tier I (basic) → Tier IV (fault-tolerant, 2N). Tier III = concurrently maintainable; the redundancy reference for inference-shaped builds.	12.1, 12.2
2N / N+1	Redundancy notation: N+1 = one spare component, 2N = fully mirrored. Training tolerates N/N+1 (checkpointable); always-on inference justifies 2N.	12.2
RBD / Markov / Monte-Carlo	The three availability-modeling techniques: Reliability Block Diagrams, Markov state models, and stochastic simulation; the quantitative machinery behind the nines.	12.5
Phase gate (stage gate)	A go/no-go decision point between project phases where deliverables are reviewed and capital is released; the spine of the timeline table below.	2.1
IMS / critical path	Integrated Master Schedule and its critical path: the longest dependent chain of tasks whose slip slips the whole project; everything off it has float.	2.1
Long-lead equipment	Items whose procurement lead time (transformers, switchgear, chillers, GPUs) drives the schedule; ordered against a frozen design basis before they bottleneck go-live.	2.1, 4.1
Commissioning (Cx) L1–L5	The five commissioning levels from factory acceptance (L1) through component, system, and integrated systems testing (L5 IST); proves the facility before load.	13.1, 13.6
Speed-to-power	The time from contract to energized MW; the binding constraint of the 2026 era and the primary siting screen.	3.2
TTFT / TPOT	Time-To-First-Token and Time-Per-Output-Token; the two latency SLOs that govern online-inference fleet sizing.	10.11

The cross-discipline vocabulary that the phase-gate timeline below assumes you already speak.
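The 2N / N+1 and RBD entries can be made concrete with the simplest block-diagram case: a group of identical units where at least k of n must be up. The sketch below assumes independent failures and a single per-unit availability, which real plants violate through common-cause faults; the availability figure is hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    a = unit_availability
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical per-unit availability, for illustration only.
UNIT = 0.99

configs = {
    "N (4 of 4 needed)": k_of_n_availability(4, 4, UNIT),
    "N+1 (4 of 5 needed)": k_of_n_availability(4, 5, UNIT),
    "2N (1 of 2 paths needed)": k_of_n_availability(1, 2, UNIT),
}

for name, availability in configs.items():
    print(f"{name}: {availability:.6f}")
```

This is only the static RBD view; the Markov and Monte-Carlo techniques named in the table are what add repair times and correlated failures.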

The project phase-gate timeline & critical path

The end-to-end schedule for a greenfield AI campus runs 24 to 60 months from land control to a live cluster, with the spread driven almost entirely by one variable: how long it takes to get firm megawatts energized. The table below sequences the program as phase gates, with each gate's realistic duration, the gate decision that releases the next phase, and whether the phase sits on the critical path (its slip slips go-live one-for-one) or has float (it can overlap or absorb delay). The critical path for a power-bound build is not construction — it is interconnection. The grid study, the utility agreement, and the substation/transformer lead time routinely dominate everything that follows, which is why land and power are secured before design is frozen and why long-lead electrical gear is ordered the moment the design basis is signed.

Read the duration ranges as 2026 practitioner medians for a >50 MW build, not commitments. A retrofit or a colocation fit-out deletes the land/permit/construct front half and compresses to 6–18 months; a contested interconnection or a multi-year transformer queue blows out the back half past 60 months.

Phase-gate timeline — land to go-live (greenfield, >50 MW AI campus)

Phase	Typical duration	Gate decision (what releases the next phase)	On critical path?
0. Scope & site search	2–6 months	Workload profile, capacity ramp, and design basis signed; target market and shortlist approved.	Yes — gates everything
1. Land control	1–4 months	Site optioned/acquired; zoning and entitlement path confirmed; environmental Phase I clear.	Yes
2. Power / interconnection	12–48 months	Executed interconnection agreement and firm-capacity / energization date; the dominant critical-path item.	YES — usually binds
3. Permitting & entitlement	6–18 months	Building, environmental, water, and air permits issued; often overlaps power but can become the binding gate.	Often (overlaps power)
4. Design (concept → DD → IFC)	6–12 months	Issued-for-construction documents; design basis frozen so long-lead gear can be ordered.	Partly — front-loads procurement
5. Long-lead procurement	12–24 months (parallel)	POs placed against frozen design; transformers/switchgear/chillers ordered early to de-risk the schedule.	YES (gear lead time)
6. Construction (shell + fit-out)	12–24 months	Substantial completion; building, electrical, and mechanical infrastructure ready for commissioning.	Yes
7. Commissioning (L1–L5 IST)	3–9 months	Integrated Systems Testing (L5) passed; facility proven under simulated and staged real load.	Yes
8. Cluster bring-up & burn-in	1–4 months	GPU node burn-in, fabric validation, and reference-training/benchmark acceptance complete.	Yes
9. Staged ramp & go-live	1–3 months	Staged power/load ramp to full; handover to operations; SLA clock starts.	Yes — terminal gate

Durations are 2026 practitioner medians; phases overlap, so the column does not sum to the 24–60 month total. Critical-path rows (marked Yes or YES in the last column) serialize the program. See Chapter 2.1 for the integrated master schedule and Chapter 3.2 for the interconnection mechanics that dominate it.
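The critical path in the table above is a longest-path calculation over a dependency graph. The sketch below takes the low and high durations from the table, but the predecessor lists are an assumption made for illustration: a strict finish-to-start simplification, not the guide's integrated master schedule.

```python
from graphlib import TopologicalSorter

# (low, high) durations in months, copied from the phase-gate table.
DURATIONS = {
    "scope": (2, 6), "land": (1, 4), "power": (12, 48),
    "permitting": (6, 18), "design": (6, 12), "procurement": (12, 24),
    "construction": (12, 24), "commissioning": (3, 9),
    "bring_up": (1, 4), "ramp": (1, 3),
}

# ASSUMPTION: hypothetical finish-to-start dependencies, for illustration only.
PREDECESSORS = {
    "scope": [],
    "land": ["scope"],
    "power": ["land"],
    "permitting": ["land"],
    "design": ["land"],
    "procurement": ["design"],
    "construction": ["permitting", "design"],
    "commissioning": ["power", "procurement", "construction"],
    "bring_up": ["commissioning"],
    "ramp": ["bring_up"],
}


def critical_path(durations, predecessors):
    """Return (total_months, path) for the longest dependent chain."""
    finish, via = {}, {}
    for phase in TopologicalSorter(predecessors).static_order():
        # The latest-finishing predecessor determines when this phase starts.
        latest = max(predecessors[phase], key=lambda p: finish[p], default=None)
        start = finish[latest] if latest is not None else 0
        finish[phase] = start + durations[phase]
        via[phase] = latest
    end = max(finish, key=finish.get)
    path = [end]
    while via[path[-1]] is not None:
        path.append(via[path[-1]])
    return finish[end], path[::-1]


low = {phase: d[0] for phase, d in DURATIONS.items()}
high = {phase: d[1] for phase, d in DURATIONS.items()}

for label, durations in (("low", low), ("high", high)):
    months, path = critical_path(durations, PREDECESSORS)
    print(f"{label}: {months} months via {' > '.join(path)}")
```

Under these assumed dependencies the all-low case totals 26 months and the all-high case totals 74 months with power on the path. The high figure overshoots the 24–60 month range because strict finish-to-start links forbid the overlap that real programs exploit.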

Where the schedule actually lives or dies

Three rows dominate the spread. Power (Phase 2) is the single longest pole: median large-load energization waits run 4–7 years in the densest US hubs, beyond even the 12–48 month planning range in the table, which is why an executed interconnection agreement is worth more than a finished design. Long-lead gear (Phase 5) is the silent critical path: large power transformers and medium-voltage switchgear have run 18–24+ month lead times through 2025–2026, so they are ordered against a frozen design basis the moment IFC documents land, in parallel with construction. Commissioning (Phase 7) is the one teams compress at their peril: skipping L5 integrated systems testing to hit a go-live date is the recurring cause of the first-quarter outage. The other phases can overlap or absorb some delay; these three cannot. → master schedule mechanics in Chapter 2.1, interconnection in Chapter 3.2, commissioning in Chapter 13.1.

Learning & community map — certifications

The certification ladder splits by discipline. There is no single credential for an AI-data-center engineer; the strong teams hold a spread across facility design, operations, and the network/compute stack. The table flags the credential, who issues it, and which role it maps to. Treat it as a hiring and development reference, not a gate — the deployed expertise in this field still outruns any certificate.

Certification ladder

Credential	Issuer	Maps to role / domain
ATD — Accredited Tier Designer	Uptime Institute	Facility design engineers / licensed PEs; the design-side credential mapping directly to the Tier classification used in commissioning.
ATS — Accredited Tier Specialist	Uptime Institute	Operations and facility staff managing/maintaining to Tier criteria; the operations companion to the ATD.
CDCDP / DCDC	CNet Training (CDCDP, BTEC-accredited) / BICSI (DCDC)	Certified Data Centre Design Professional / Data Center Design Consultant; multidisciplinary design competency.
CDCMP / CDCEP	CNet Training	Data Centre Management / Energy Professional; operations leadership and efficiency engineering.
CDCP / CDCS / CDCE	EPI	Certified Data Centre Professional → Specialist → Expert; a tiered facility-operations ladder.
PE (Electrical / Mechanical)	State licensing boards (US) / equivalents	The statutory license to stamp design documents; foundational for Phase 4 sign-off.
NVIDIA-Certified (NCP/NCA, networking & AI infra)	NVIDIA	GPU-cluster and fabric engineers; CUDA/NCCL, InfiniBand/Spectrum-X, and DGX/SuperPOD operations.
CCNP / network specialist (Ethernet, RoCE)	Cisco / Arista / Juniper	Scale-out fabric engineers building and tuning RoCE/lossless-Ethernet AI back-ends.
OCP-aligned training	Open Compute Project community	Open-hardware rack/power/cooling literacy (Open Rack, Mt Diablo 800 VDC, ORW).

Vendor-neutral facility credentials first; vendor/network credentials second. Match the credential to the role, not the resume.

Learning & community map — conferences & feeds

Two final tables. The conference calendar is the place to calibrate against the field — hardware roadmaps break at OCP and GTC, facility practice at DCD and 7x24, network practice at the OCP networking tracks and vendor summits. The feeds are what keep the numbers in this guide honest between editions: independent analysis (SemiAnalysis), facility-industry reporting (DCD, Data Center Frontier), the standards bodies themselves, and the operator engineering blogs that publish ground truth from production fleets.

Conference calendar — the anchor events

Event	Cadence / typical window	Why it is on the calendar
OCP Global Summit	Annual, October (San Jose)	Where hyperscaler-grade open hardware breaks: Open Rack, 800 VDC / Mt Diablo, cooling and networking working groups.
NVIDIA GTC	Annual, March (San Jose)	The accelerator/roadmap keynote that sets the density-ramp expectations the rest of the industry designs against.
DCD>Connect (regional series)	Multiple/year (NYC, London, Virginia, APAC)	The facility-operator and capital-markets gathering; siting, power, cooling, and build-out practice.
7x24 Exchange	Semiannual (US)	Mission-critical facility operations, commissioning, and reliability — the Cx/operations community.
Datacloud / Data Centre World	Annual (Europe + global)	European and global colocation, investment, and infrastructure deal-making and design practice.
DesignCon	Annual, late January (Santa Clara)	Signal/power integrity and high-speed interconnect engineering — the physical-layer fabric community.
Hot Chips / ISSCC / SC	Annual (academic/industry)	Silicon architecture (Hot Chips, ISSCC) and HPC/AI supercomputing (SC) — the upstream compute and packaging research.

Dates rotate; verify each year. The 'Why it is on the calendar' column is the calibration you cannot get from a feed.

Feeds to follow — keep the numbers current

Source	Type	What it is good for
SemiAnalysis	Independent analysis (paid)	The definitive $/GPU-hr, supply-chain (CoWoS/HBM), rack-teardown, and fabric-economics primary analysis.
Data Center Dynamics (DCD)	Industry news	Facility builds, power deals, interconnection news, and roadmap reporting across the global market.
Data Center Frontier	Industry news	US-focused facility engineering, power-architecture, and cooling-transition reporting.
The Next Platform / The Register (on-prem)	Technical journalism	Systems-level AI-infrastructure and interconnect analysis with an engineering bent.
LBNL (Berkeley Lab) — Queued Up et al.	Primary research	Authoritative interconnection-queue, grid, and data-center energy studies — the source behind the power figures.
Uptime Institute (research & blog)	Standards / research	Tier standards, outage analyses, and the annual data-center survey; the reliability ground truth.
OCP / UEC / UALink / OIF (standards bodies)	Primary specs	The authoritative spec text for open rack/power/cooling and AI-fabric standards — read the spec, not the summary.
Operator engineering blogs (Meta, Microsoft, Google)	Production ground truth	RoCE-at-scale, checkpointing, fleet-reliability, and cooling practice published from real production clusters.

Primary and independent sources rank above vendor marketing. These are the feeds that correct this guide between editions.

How to keep this appendix from going stale

Every figure in this guide has a half-life. Density numbers move each accelerator generation; interconnection-queue medians move each ISO filing; lead times move with the transformer market. The discipline that keeps a reference like this useful is the same one that keeps the guide itself current: cite the source and the as-of date for every load-bearing number, and re-check the ones that gate a decision before you act on them. The feeds table is the maintenance plan. When this guide and a primary source disagree, the primary source wins — and the glossary entry's canonical chapter is where you go to understand why the number moved, not just that it did.
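The cite-and-date discipline above can be kept as a small register of load-bearing numbers. The sketch below is one possible shape for it, not a prescribed tool: the Figure record, the stale helper, the one-year threshold, and the as-of dates are all hypothetical, with the values and source names taken from this appendix.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Figure:
    """One load-bearing number with its provenance."""
    name: str
    value: str
    source: str
    as_of: date


def stale(figures, today: date, max_age_days: int = 365):
    """Return the figures whose as-of date is older than max_age_days."""
    return [f for f in figures if (today - f.as_of).days > max_age_days]


# Hypothetical as-of dates, for illustration only.
register = [
    Figure("Interconnection duration", "12-48 months",
           "LBNL Queued Up", date(2024, 4, 1)),
    Figure("Transformer lead time", "18-24+ months",
           "Data Center Dynamics", date(2025, 11, 15)),
]

for figure in stale(register, today=date(2026, 1, 1)):
    print(f"re-check: {figure.name} ({figure.source}, as of {figure.as_of})")
```

A frozen dataclass fits here because a figure should be replaced by a newer record with a new as-of date, not edited in place.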

The pairing to internalize: the glossary tells you what a term means, the canonical chapter tells you why it matters, the phase-gate timeline tells you when the decision is due, and the community map tells you where to check that the answer is still true. Used together, they are the difference between a reference that ages well and one that misleads a year after it ships.

The metrics in Glossary I are derived in Chapter 15.1 (efficiency stack), Chapter 0.3 (utilization), and Chapter 12.2 (goodput vs availability). The compute/packaging vocabulary maps to Chapter 7.1, Chapter 7.6, and Chapter 7.7; the fabric vocabulary to Chapter 8.3, Chapter 8.5, and Chapter 8.6. The phase-gate timeline is the appendix view of the integrated master schedule in Chapter 2.1, the interconnection critical path in Chapter 3.2, and the commissioning program in Chapter 13.1. The scoping artifacts the timeline assumes are produced in Chapter 1.1 and detailed in Chapter 1.7.