Appendix B
Reference Designs & Worked Examples
A reference design is the cascade made arithmetic: pick the archetype, the accelerator generation, and the scalable unit, and the megawatts, liters-per-minute, fiber strands, switch ports, and dollars all fall out of a small set of multipliers you can carry in your head — this appendix supplies those multipliers and works three reference builds end-to-end (a scalable-unit budget, a 50 MW campus, and a 100k-GPU cluster BOM) so you can sanity-check any vendor's sizing in an afternoon.
What you'll decide here
- Start from the per-archetype design-basis sheet that matches your dominant workload — it fixes density tier, cooling modality, fabric blocking, redundancy, and the GPU:CPU/storage/network ratios that every later number multiplies against.
- Treat the scalable unit (SU) as the atomic costing and deployment block: size the SU once from the budget table, then multiply — campuses and clusters in this appendix are all integer counts of SUs, not bespoke arithmetic.
- Use the 50 MW campus and 100k-GPU BOM as order-of-magnitude calibrators, not bids: counts are exact from the ratios; dollar figures are 2025–2026 list/street ranges that move quarterly and must be re-quoted.
- When a vendor proposal disagrees with these tables by more than ~20% on a count (racks, CDUs, switches, optics), find out why before you sign — the divergence is usually a hidden oversubscription, redundancy, or generation assumption.
- Re-derive, do not interpolate, when you change generation (GB200 → GB300 → VR200 → Kyber): density, flow, and busbar current step discontinuously, so the multipliers in §2 are generation-stamped on purpose.
This appendix is the reusable arithmetic layer behind Part 1's archetype framework (Chapter 1.1) and Part 1's requirements matrix (Chapter 1.7). It does four things, in order: (1) a per-archetype design-basis sheet that freezes the inputs every later number inherits; (2) a scalable-unit (SU) budget giving the power, cooling, water, and network draw of one atomic deployment block per accelerator generation; (3) a 50 MW campus sized from the SU up, with the power chain, cooling plant, and water loop derived; and (4) a 100k-GPU cluster reference BOM with counts and rough 2025–2026 costs for GPUs, racks, CDUs, switches, optics, and storage.
The discipline throughout is multiplier-first. A reference design is a chain of ratios: GPUs per rack, racks per SU, kW per rack, L/min per kW, NICs per node, optics per NIC, GB/s per GPU of storage. Once those are pinned, every aggregate is a multiplication you can audit. Counts in the tables are exact arithmetic from the stated ratios. Dollar figures are 2025–2026 street/list ranges (sources stamped inline); they drift quarterly and are calibration aids, not quotes. Density, flow, and current figures are generation-stamped because they step discontinuously across GB200 → GB300 → Vera Rubin VR200 → Rubin Ultra Kyber; do not interpolate across a generation boundary.
1. Per-archetype design-basis sheets
The design-basis sheet is the single page that everything downstream inherits — it is the concrete instantiation of the workload-profile and design-basis artifacts named in Chapter 1.1. Read it as: choose the row that matches your dominant archetype, and the rest of the appendix is parameterized for you. The three reference builds in §3–§4 use the frontier-training column unless noted, because it is the most constraining; an inference-shaped build relaxes density, fabric, and redundancy and is cheaper on every axis.
| Parameter | Frontier training | Post-training / RL | Online inference | Batch inference | Edge inference |
|---|---|---|---|---|---|
| Dominant accelerator | GB200/GB300 NVL72 | Disaggregated: NVL72 trainer + HGX rollout | HGX B200/B300; GB200 for MoE | HGX B200; prior-gen acceptable | L4/L40S, Jetson, single B200 |
| Rack density (design) | 120–142 kW | Mixed: 132 kW trainer / 40–60 kW rollout | 40–60 kW | 30–60 kW | 3–30 kW per micro-site |
| Cooling modality | DLC mandatory, warm-water | DLC trainer + RDHx/air rollout | Air, RDHx, or DLC by density | Air often sufficient | Air / sealed modular |
| Scale-up domain | 72 GPUs (→144, →576) | 72 trainer / 8 rollout | 8–72 (MoE wants 72) | 8 | 1 (single node) |
| Scale-out fabric | 1:1 non-blocking, 8-rail | Tight trainer / oversub rollout | 2:1–3:1 oversubscribed | Heavily oversubscribed | Minimal; WAN backhaul |
| Fabric transport | InfiniBand XDR or Spectrum-X | IB trainer / RoCE rollout | Ethernet/RoCE common | Ethernet, cost-optimized | Standard IP |
| GPU:CPU ratio | 2:1 (NVL72: 72G:36C) | 2:1 trainer / 4–8:1 rollout | 4:1–8:1 | 8:1+ | 1:1 appliance |
| GPU:storage (BW) | ~250–400 GB/s per 1,024 GPU | Trainer like training | KV-cache + model load tier | Streaming object tier | Local NVMe only |
| Redundancy basis | N / N+1 (checkpointable) | N+1, staleness-tolerant | 2N / Tier-IV-class + N+1 cooling | N (queue-and-retry) | N; fleet geo-redundancy |
| EDPp sizing factor | ~1.4–1.5× TDP | ~1.4× trainer | ~1.3× TDP | ~1.2× TDP | ~1.2× TDP |
| Siting driver | Cheap firm MW + cold climate | Follows dominant sub-workload | Sub-50 ms to users | Cheapest / curtailable MW | Latency budget (30/50/100 ms) |
2. The scalable unit (SU): power / cooling / water / network budget
The scalable unit is the atomic deployment and costing block — order it, integrate it at the factory (L11/L12), ship it, energize it, repeat. Sizing the SU once and then multiplying is what makes campus and cluster arithmetic tractable. We anchor the SU to the NVIDIA DGX SuperPOD GB200 reference: 8 × NVL72 racks = 576 GPUs per SU, with the full SuperPOD at 16 SUs (128 racks, 9,216 GPUs). The budget below gives one SU's draw across four generations; later sections count SUs, not racks.
The cross-generation columns exist because the multipliers step. A GB200 SU is ~1.06 MW of IT; the same 8-rack SU at Kyber density (~600 kW/rack) is ~4.8 MW — a 4.5× jump in the same floor footprint. That is the density-ramp trap from Chapter 1.1 rendered as a budget line: the floor, water, and busbar you reserve today must survive it.
| Metric | GB200 NVL72 (2025) | GB300 NVL72 (2025–26) | VR200 NVL144 (H2 2026) | Kyber NVL576 (H2 2027) |
|---|---|---|---|---|
| GPUs per SU | 576 | 576 | 1,152 (144/rack) | 4,608 (576/rack) |
| Rack density (TDP) | 132 kW | 142 kW | ~200 kW | ~600 kW |
| IT power per SU (TDP) | ~1.06 MW | ~1.14 MW | ~1.60 MW | ~4.80 MW |
| IT power per SU (EDPp ~1.4×) | ~1.48 MW | ~1.59 MW | ~2.24 MW | ~6.7 MW (smoothed ~30%) |
| DLC heat to liquid (~87%) | ~0.92 MW | ~0.99 MW | ~1.74 MW (100% liquid) | ~4.8 MW (100% liquid) |
| Residual air heat | ~0.14 MW (~17 kW/rack) | ~0.15 MW | ~0 (100% liquid) | ~0 (100% liquid) |
| Secondary-loop flow (~1.5 L/min·kW) | ~1,580 L/min | ~1,700 L/min | ~2,400 L/min | ~7,200 L/min |
| Coolant inlet / ΔT target | <25 °C / <10 °C | <25 °C / ~10 °C | ~45 °C warm-water | ~45 °C warm-water |
| Back-end NICs (8× 400G/node) | 144 NICs (3.2 Tb/s/node) | 144 NICs | 288 (CX-9 800G) | 1,152 (CX-9/CPO) |
| Leaf switch ports consumed (back-end) | 576 (1 port/GPU rail) | 576 | 1,152 | 4,608 |
| Back-end optics (transceivers, 1:1) | ~1,152 (NIC+leaf ends) | ~1,152 | ~2,304 | ~9,216 (CPO shifts mix) |
| Storage BW attributable (~250 GB/s/1,024 GPU) | ~140 GB/s | ~140 GB/s | ~280 GB/s | ~1,125 GB/s |
Worked example: deriving one GB200 SU line-by-line
Take the GB200 column and walk it forward so the multipliers are explicit. GPUs: 8 racks × 72 GPUs = 576. IT power (TDP): 8 × 132 kW = 1.056 MW ≈ 1.06 MW. EDPp: 1.06 MW × 1.4 ≈ 1.48 MW provisioned on the rack power chain. Heat split: at ~115 kW liquid + ~17 kW air per rack, liquid carries 8 × 115 = 920 kW and air carries 8 × 17 = 136 kW. Flow: 920 kW × ~1.5 L/min·kW ≈ 1,380–1,580 L/min secondary loop (≈ 130–200 LPM/rack, matching NVL72 CDU sizing). NICs: 8 racks × 18 compute nodes... note NVL72 presents 72 GPUs across 18 trays; a rail-optimized back-end gives 8× 400G per GPU-pair node → ~144 NICs/SU at 3.2 Tb/s/node. Optics: the back-end optic count tracks GPU rails, not NIC bodies — each of the 576 GPUs drives one 1:1 rail link, and each link burns two transceivers (server/NIC end + leaf end), so 576 GPUs × 2 ≈ ~1,152 rail-side optics for the SU's share of the non-blocking fabric (before spine). These are the only numbers; everything in §3–§4 is integer multiples of them. → SU definition in Chapter 1.7; fabric sizing in Chapter 8.5.
3. Worked example: a 50 MW campus sized from the SU up
Now multiply. The brief: a 50 MW IT frontier-training campus on GB200/GB300 NVL72, built as integer SUs, with the power chain, cooling plant, and water loop derived. We size on TDP for the IT budget and EDPp for the electrical chain, and we reserve floor/water/busbar headroom for a GB300 → VR200 density step (the irreversible substrate from Chapter 1.1).
SU count. 50 MW IT ÷ ~1.06 MW/SU ≈ 47 SUs. Round to 48 SUs (a clean 3 × 16-SU SuperPOD-scale halls, or 6 × 8-SU halls). That is 48 × 8 = 384 NVL72 racks and 48 × 576 = 27,648 GPUs. Total IT at 132 kW/rack = 50.7 MW — within the 50 MW brief at design margin.
| Subsystem | Sizing basis | Quantity / value |
|---|---|---|
| Scalable units | 50 MW ÷ 1.06 MW/SU | 48 SUs |
| NVL72 racks | 48 × 8 | 384 racks |
| GPUs | 384 × 72 | 27,648 GPUs |
| IT power (TDP) | 384 × 132 kW | 50.7 MW |
| Facility power (PUE ≈ 1.2) | 50.7 MW × 1.2 | ~60.8 MW |
| Utility interconnect (N, +margin) | ~61 MW × 1.15 | ~70 MW POI / 2× 132 kV feeders |
| Main transformers | ≥2 × 75 MVA (N+1 at MV) | 2–3 × 75 MVA |
| MV distribution | 33/13.8 kV ring or radial | per-hall 13.8 kV → 415 V / 800 VDC |
| UPS / ride-through (EDPp) | 50.7 MW × 1.4 EDPp | ~71 MW transient basis; BESS + rack BBU |
| Backup generation | N (gas RICE / turbine for island) | ~65–75 MW prime/standby |
| DLC heat to facility water | 384 × 115 kW | ~44 MW thermal |
| CDUs (L2L, ~1.4 MW each, N+1) | 44 MW ÷ ~1.3 MW useful + N+1 | ~36–40 CDUs (≈ 1 per 8–10 racks + spares) |
| Secondary-loop flow | 384 × ~190 LPM | ~73,000 L/min aggregate |
| Heat rejection | ~61 MW total heat | towers/dry-coolers + adiabatic, economized |
| Water make-up (WUE ~0.5 L/kWh) | 60.8 MW × 0.5 × 8,760 h | ~266 ML/yr (≈ 700 m³/day peak) |
| Back-end fabric (1:1) | 27,648 GPUs, 8-rail fat-tree | ~3,456 leaf+spine switch ASICs (see §4) |
| Floor area (white space) | 384 racks @ ~30 m²/rack incl. aisles/CDU | ~11,500 m² + plant |
| Floor loading reserve | wet rack ~1.36 t + VR200 headroom | design slab to ≥1,500 kg/m² |
4. Reference BOM: a 100k-GPU GB200/GB300 cluster
The flagship worked example: a 100,000-GPU GB200/GB300-class training cluster, costed as a bill of materials. Built from the SU: 100,000 ÷ 576 ≈ 174 SUs; round to 176 SUs = 1,408 NVL72 racks = 101,376 GPUs (≈ 100k). At 132 kW/rack that is ~186 MW IT, ~223 MW facility at PUE 1.2 — a multi-campus build in practice (scale-across over DCI, Chapter 8.8), but costed here as one logical cluster.
The cost column carries the heaviest caveat: GPU/system pricing is 2025 street/list (SemiAnalysis), networking and storage are practitioner ranges, and all of it moves quarterly. Use the counts as gospel (they are arithmetic) and the dollars as an order-of-magnitude frame. The GPU/system line dominates so heavily (~75–80% of cluster capex) that errors elsewhere barely move the total.
| BOM line | Count | Basis | Unit cost (2025–26) | Line cost (rough) |
|---|---|---|---|---|
| GB200/GB300 GPUs | 101,376 | 176 SU × 576 | ~$60–70k effective | — |
| NVL72 racks (integrated, L11) | 1,408 | 176 SU × 8 | ~$3.0–3.5M / rack | ~$4.5–4.9B |
| — (rack line includes GPUs, Grace, NVSwitch, DLC) | — | GB200 ~$200k/GPU system-level | — | ~$70–80B system total |
| CDUs (L2L, N+1) | ~150 | ~1 per 8–10 racks + spares | ~$120–180k | ~$22–27M |
| Back-end leaf switches (Quantum-X800/Spectrum-X) | ~2,000 | 8-rail, ~72 GPU/leaf group | ~$120k (64×800G) | ~$240M |
| Back-end spine switches | ~1,000 | 2-tier fat-tree, 1:1 | ~$120k | ~$120M |
| Front-end / storage / mgmt switches | ~600 | in-band + OOB + storage net | ~$30–60k | ~$25M |
| Back-end optics / transceivers (800G) | ~405,000 | 101,376 GPU × ~4 (NIC+leaf+spine ends) | ~$1,000–1,500 | ~$450–600M |
| DAC/AEC copper (intra-rack scale-up) | in-rack | 5,184 NVLink cables/rack (copper, in rack price) | incl. in rack | — |
| High-perf storage (parallel FS) | ~25 PB usable | ~250 GB/s per 1,024 GPU → ~25 TB/s | ~$0.20–0.40/GB flash tier | ~$1.0–2.0B |
| Capacity / object tier | ~150–250 PB | data lake + checkpoints | ~$0.02–0.05/GB | ~$5–12M |
| Facility power chain (per MW) | ~223 MW facility | transformers, UPS/BESS, switchgear, gen | ~$10–15M / MW (AI-grade) | ~$2.2–3.3B |
| Cooling plant + water loop (per MW) | ~186 MW IT | CDUs counted above + rejection + piping | ~$3–5M / MW | ~$0.6–0.9B |
| Cluster total (compute + network + storage + facility) | — | GPU/system dominates ~75–80% | — | ~$80–95B |
Sensitivity: how the three builds move when you change one input
Generation step (GB200 → VR200). Same 8-rack SU footprint, but IT power per SU goes ~1.06 → ~1.60 MW and flow ~1,580 → ~2,400 L/min. The 50 MW campus at VR200 density needs only ~31 SUs for the same 50 MW — but each hall now dissipates ~1.6× the heat per rack, so the cooling plant, not the floor, becomes binding. Oversubscription (1:1 → 2:1) for an inference cluster. Cuts back-end switches and optics by ~31% — on the 100k BOM that is ~$0.3B saved on a ~$90B total (≈ 0.3%), which is why you never compromise training fabric to save fabric money, but always oversubscribe an inference fabric. Closed-loop dry cooling. Drives the 50 MW campus's ~266 ML/yr water make-up toward zero, at the cost of ~+0.05 PUE (~+2.5 MW facility power) and a larger heat-rejection footprint — the WUE↔PUE trade from Chapter 15.4. Effective GPU life (3 yr → 5 yr). Does not move any count or capex line, but nearly doubles the denominator in $/GPU-hr — the dominant TCO lever, quantified in Chapter 1.8 and Appendix C.