
Chapter 7.11

Accelerator Selection, TCO & Procurement Strategy

Accelerator selection is not a spec-sheet beauty contest — it is a constrained optimization against whichever resource actually binds you (power or capital), scored in cost-per-useful-token, and executed through a procurement playbook that treats allocation, depreciation, and fleet heterogeneity as first-class engineering variables.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Which metric governs the decision — $/GPU-hour (you are buying compute supply), $/M-tokens (you are selling a product), or tokens/MW (you are power-bound) — because the winner of the comparison changes depending on which denominator you pick.
Whether you are in the power-limited or the capex-limited regime, because that single fact flips the objective function from 'most tokens per dollar of silicon' to 'most tokens per megawatt' and re-ranks every accelerator on the shortlist.
Buy vs rent vs build for THIS workload at THIS utilization and THIS depreciation life — the crossover is a function of all three, not a fixed answer, and it moves when the depreciation assumption moves.
How heterogeneous your fleet should be — single-vendor for software velocity and operational simplicity, or multi-vendor/multi-silicon for cost-per-token leverage and supply resilience — and whether you can actually carry the second software stack.
How you construct the RFP and the acceptance bar so that cross-vendor bids are comparable on realized goodput, delivered against an allocation reality where HBM and packaging — not your purchase order — gate when the silicon actually arrives.

By the time you reach this chapter you have met the silicon: NVIDIA's annual cadence (Chapter 7.2), AMD's open challenge (Chapter 7.3), the hyperscaler XPUs and custom ASICs (Chapter 7.4, Chapter 7.5), the HBM and packaging constraints that gate all of it (Chapter 7.6, Chapter 7.7), and the software ecosystems that lock or liberate the choice (Chapter 7.9). This chapter is where those threads are pulled into a single decision: which accelerators do you actually buy, how many, from whom, on what financial basis, and on what schedule? It is the procurement-execution chapter — the place where a spec sheet becomes a purchase order, a power budget, and a depreciation schedule.

The chapter works in sequence. It starts by naming the governing metric, because the wrong denominator silently picks the wrong chip. It confronts the master fork, power-limited vs capex-limited, which decides whether you optimize tokens-per-dollar or tokens-per-megawatt. It builds the TCO model that turns a per-GPU price into a cost-per-useful-token. It lays out buy-vs-rent-vs-build as a crossover that moves with utilization and depreciation. It treats the heterogeneous fleet as an engineering choice with a carrying cost. And it closes on supply/allocation reality and the RFP that makes cross-vendor bids honest. The depreciation argument that underwrites all of it lives in Chapter 1.8; here we execute against it. The per-generation perf/watt and cost-per-token trajectory is consolidated in Chapter 16.2; here we use today's snapshot to decide.

The governing metric: name the denominator first

The single most common error in accelerator selection is comparing chips on the wrong axis. A GPU that wins on raw FLOPS can lose on delivered tokens; a chip that wins on $/GPU-hour can lose on $/M-tokens; a part that wins on $/M-tokens at the bench can lose on tokens/MW once the grid binds. The denominator you choose is the decision — so name it before you shortlist.

Raw FLOPS and HBM capacity are inputs, not metrics. They tell you what the silicon could do, not what it will do against your workload at your batch size and precision. The realized-MFU gap between paper FLOPS and delivered throughput is large and vendor-specific: a part can carry ~1.5x the paper FLOPS of a rival and still deliver only 37–66% of that rival's throughput on a real workload because the software stack is immature (SemiAnalysis MI300X benchmarks, 2025). Never select on the spec sheet; select on what the spec sheet delivers.

$/GPU-hour is the right metric when you are buying or selling raw compute supply — a neocloud renter, a training tenant. $/M-tokens is the right metric when you are selling a product to an end user, because tokens are what the customer buys; this is now the governing metric for most operators because inference is ~2/3 of AI compute in 2026 (Deloitte TMT Predictions 2026). tokens/MW (equivalently perf/watt) is the right metric the moment power is the binding constraint — and in 2026, for a growing share of operators, it is. The three are linked by utilization and tokens-per-GPU-second, and a part that wins on one can lose on another. The honest pro-forma states which denominator it optimizes and why. → metric definitions in Chapter 0.3; the financial denominators in Chapter 1.8.
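The link between the three denominators can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function names, the two parts, and every number in `PARTS` are hypothetical, and it assumes the hourly cost is paid for every provisioned hour while only the utilized fraction produces tokens.

```python
def usd_per_million_tokens(usd_per_gpu_hour, tokens_per_gpu_second, utilization=1.0):
    """Convert an hourly GPU cost into $/M-tokens.

    Assumption: the hourly cost accrues for every provisioned hour, and only
    the `utilization` fraction of those hours produces useful tokens.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    useful_tokens_per_hour = tokens_per_gpu_second * 3600 * utilization
    return usd_per_gpu_hour / useful_tokens_per_hour * 1_000_000


def tokens_per_second_per_mw(tokens_per_gpu_second, all_in_watts_per_gpu):
    """Tokens/s sustained by one megawatt, given all-in watts per accelerator."""
    gpus_per_mw = 1_000_000 / all_in_watts_per_gpu
    return tokens_per_gpu_second * gpus_per_mw


# Hypothetical parts: (usd_per_gpu_hour, tokens_per_gpu_second, all_in_watts)
PARTS = {
    "chip_a": (2.00, 1000.0, 1000.0),  # cheaper per GPU-hour
    "chip_b": (3.00, 2000.0, 1400.0),  # dearer per GPU-hour, faster
}

for name, (rate, tps, watts) in PARTS.items():
    print(
        name,
        f"$/GPU-hr={rate:.2f}",
        f"$/M-tok={usd_per_million_tokens(rate, tps, utilization=0.8):.3f}",
        f"tok/s/MW={tokens_per_second_per_mw(tps, watts):,.0f}",
    )
```

With these placeholder inputs, chip_a wins on $/GPU-hour while chip_b wins on both $/M-tokens and tokens/MW, which is the re-ranking the paragraph above describes.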

The master fork: are you power-limited or capex-limited?

Determine which resource binds you before you shortlist a single chip. In the capex-limited regime, megawatts are available and capital is the scarce input: you optimize tokens per dollar of silicon, and the cheapest-per-delivered-token part wins — even if it is less power-efficient — because you can simply energize more of it. In the power-limited regime, your interconnection is capped and capital is comparatively abundant: you optimize tokens per megawatt, and the most perf/watt-efficient part wins — even if it costs more per chip — because every watt spent on a less-efficient accelerator is a token you can never produce. The two regimes re-rank the same shortlist in opposite orders. A newer, pricier, more efficient generation looks expensive in the capex-limited world and looks cheap in the power-limited one, because it lets a fixed megawatt envelope emit far more tokens (Blackwell-class parts deliver on the order of 10x the tokens/watt of Hopper on MoE inference; NVIDIA / Signal65, 2026). In 2026 the binding constraint has broadly shifted from capital to power — so default to tokens/MW unless you can prove your megawatts are unconstrained. → the power-bound framing in Chapter 1.1; siting and interconnection in Chapter 3.2.
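One way to see the re-ranking is to size a fleet against both budgets and let the smaller count win. The sketch below assumes a simple model (fleet size is the lesser of what capital and power allow); the `Part` fields, the two parts, and all budgets are hypothetical values chosen only to show the flip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    capex_usd: float          # all-in per accelerator (share of server, fabric)
    watts: float              # all-in per accelerator (incl. cooling overhead)
    tokens_per_second: float  # realized at the SLO, not nameplate


def fleet_output(part, capex_budget_usd, power_budget_watts):
    """Return (fleet tokens/s, binding constraint) for one part."""
    by_capex = int(capex_budget_usd // part.capex_usd)
    by_power = int(power_budget_watts // part.watts)
    binding = "power" if by_power < by_capex else "capex"
    return min(by_capex, by_power) * part.tokens_per_second, binding


def best_part(parts, capex_budget_usd, power_budget_watts):
    """Pick the part that emits the most tokens/s inside both budgets."""
    return max(
        parts,
        key=lambda p: fleet_output(p, capex_budget_usd, power_budget_watts)[0],
    )


# Hypothetical shortlist: NEWER is pricier per chip but more efficient per watt.
OLDER = Part("older", capex_usd=30_000, watts=1_000, tokens_per_second=1_000)
NEWER = Part("newer", capex_usd=100_000, watts=1_500, tokens_per_second=3_000)

# Capex-limited: $30M of capital, 100 MW available.
print(best_part([OLDER, NEWER], 30e6, 100e6).name)
# Power-limited: $3B of capital, 10 MW available.
print(best_part([OLDER, NEWER], 3e9, 10e6).name)
```

The same two parts swap places when only the budgets change, so the regime has to be settled before the shortlist is scored.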

Building the TCO model

The TCO model is the machine that converts a per-GPU sticker price into the governing metric. It has a numerator (all-in cost) and a denominator (useful output), and most arguments about accelerator selection are really arguments about one term in this fraction.

The numerator is more than the chip. Start with the all-in server: an 8-GPU H100 node runs ~$283k–$318k excluding storage (SemiAnalysis AI Neocloud Playbook, 2025), and the GPU is only part of it — host CPUs, HBM (a top-3 BOM line, see Chapter 7.6), NICs, the NVSwitch/scale-up fabric, the chassis, and integration. Then layer the recurring stack: depreciation (the largest single line for an owner), power and cooling, networking amortization, facility/lease, staff, and a software/support envelope. The canonical self-operated build-up lands near ~$0.74/GPU-hr at 2048-GPU scale and 90% utilization, rising to ~$1.03/hr for small clusters (SemiAnalysis, 2025 — a single-source, contested figure) — note the scale penalty, which is itself a selection input.

The denominator is useful output, and 'useful' carries the weight. For training it is delivered model-FLOPS-utilization against the job, not nameplate FLOPS — a fabric that starves the all-reduce (see Chapter 8.5) collapses the denominator no matter how fast the chip. For inference it is tokens-per-GPU-second at your SLO, your precision, and your batch efficiency, which is where FP4/NVFP4 quantization (Chapter 7.10) and serving discipline (Chapter 10.11) move the number several-fold. The depreciation life you assume — the contested 2–3 year economic life vs the 5–6 year book life — is the dominant lever on the numerator and is argued in full in Chapter 1.8; here we simply note that the same hardware shortlist re-ranks if you change that one assumption, so model both.
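The effect of the depreciation assumption on ranking can be sketched as follows. This is an illustration under stated assumptions, not the Chapter 1.8 model: it uses straight-line depreciation with zero residual value, folds all recurring costs into a single hourly figure, and the two shortlist entries are placeholder values chosen to show the flip rather than figures from this chapter.

```python
HOURS_PER_YEAR = 8760


def hourly_cost(capex_usd, life_years, opex_usd_per_hour):
    """Numerator: straight-line depreciation plus recurring cost per provisioned GPU-hour."""
    return capex_usd / (life_years * HOURS_PER_YEAR) + opex_usd_per_hour


def usd_per_mtok(capex_usd, life_years, opex_usd_per_hour, tokens_per_second, utilization):
    """Numerator divided by useful output (tokens actually served)."""
    useful_tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_cost(capex_usd, life_years, opex_usd_per_hour) / useful_tokens_per_hour * 1e6


# Hypothetical shortlist: (capex_usd, opex_usd_per_hour, tokens_per_second)
SHORTLIST = {
    "efficient_pricey": (40_000, 0.30, 1_000),
    "cheap_hungry": (25_000, 0.70, 1_000),
}


def rank(life_years, utilization=0.9):
    """Order the shortlist from cheapest to dearest $/M-tokens."""
    def cost(name):
        capex, opex, tps = SHORTLIST[name]
        return usd_per_mtok(capex, life_years, opex, tps, utilization)
    return sorted(SHORTLIST, key=cost)


print("3-year economic life:", rank(3))
print("5-year book life:    ", rank(5))
```

Under the short life the low-capex part ranks first; under the long life the part with lower recurring cost does. Running both lives side by side is the "model both" instruction made concrete.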

The three governing metrics — when each one decides

Metric	You are…	Binding constraint it assumes	Wins the comparison	Failure mode if mis-chosen
Raw FLOPS / HBM GB	Reading a spec sheet	None; it is an input, not a metric	The biggest die (on paper)	Buys paper FLOPS the software never delivers (a part with ~1.5x the paper FLOPS can deliver only 37–66% of a rival's throughput)
$/GPU-hour	Buying or selling raw compute supply	Capital / unit cost of capacity	Cheapest all-in cost per delivered GPU-hour	Ignores tokens-per-second differences — a slow cheap chip looks good, serves badly
$/M-tokens	Selling a product to an end user	Cost of the thing the customer buys	Most useful tokens per dollar (realized, at SLO)	Ignores the power envelope — wins on paper, cannot be energized at scale
tokens/MW (perf/watt)	Power-limited (capped interconnection)	Megawatts you can energize	Most tokens per watt — efficiency over chip price	Over-pays per chip when megawatts are actually abundant

Which denominator to optimize, and the failure mode of using the wrong one. Inference-share and tokens/watt figures: Deloitte TMT 2026; NVIDIA/Signal65 2026. Self-op cost: SemiAnalysis 2025.

Buy vs rent vs build: a crossover that moves

Chapter 1.6 framed procurement archetypes qualitatively and Chapter 1.8 scored them as an NPV; this is the accelerator-level execution of that fork. The question 'should I own GPUs or rent them?' has no fixed answer — it is a crossover that moves with three variables: utilization, depreciation life, and resale liquidity.

Renting from a neocloud is right when utilization is spiky, the workload is short or experimental, or you need time-to-first-job in days. The 2026 rental ladder spans nearly an order of magnitude for the same H100 — a ~$1.03/hr spot floor, neocloud median ~$2.29–3.50/hr, AWS ~$6.88/hr, Azure ~$12.29/hr (SemiAnalysis H100 Index / AM Compute, 2026) — and that ladder is volatile, with the 1-year contract index up ~40% from October 2025 to March 2026 as supply tightened. Owning wins when you can hold utilization high against your own self-operated cost (~$0.74/GPU-hr at scale), because the spread between that cost and the rental rate is your saved margin — but only if the asset stays full. The crossover sits where your achievable utilization, against your assumed depreciation life, makes owned cost beat rented price. Below your breakeven utilization (~70% for a debt-financed cluster), renting wins even at a headline premium, because the owned asset bleeds against its debt service. And the crossover moves with depreciation: shift from a 5-year to a 3-year economic life and the owned-cost line rises sharply, pushing the crossover toward renting for all but the most durable, highest-utilization workloads.
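A breakeven utilization can be sketched directly from that logic. The assumptions are loud ones: owned cost accrues for every provisioned hour whether or not the GPU is busy, rented capacity is paid only for hours used (on-demand rather than a reserved contract), depreciation is straight-line with no residual, and the capex and recurring-cost inputs are hypothetical. Only the ~$2.29/hr rental rate is taken from the ladder above.

```python
HOURS_PER_YEAR = 8760


def owned_cost_per_provisioned_hour(capex_usd, life_years, other_usd_per_hour):
    """Owned cost per GPU-hour, accrued whether or not the GPU is busy."""
    return capex_usd / (life_years * HOURS_PER_YEAR) + other_usd_per_hour


def breakeven_utilization(owned_usd_per_provisioned_hour, rental_usd_per_hour):
    """Utilization above which owning beats renting per useful hour.

    Owned cost per useful hour is owned / utilization; renting costs the
    rental rate per useful hour. A result above 1.0 means owning never wins.
    """
    return owned_usd_per_provisioned_hour / rental_usd_per_hour


RENTAL_RATE = 2.29        # low end of the neocloud median quoted above
CAPEX_PER_GPU = 30_000    # hypothetical all-in capex per GPU
OTHER_PER_HOUR = 0.50     # hypothetical power, facility, staff, debt service

for life in (5, 3):
    owned = owned_cost_per_provisioned_hour(CAPEX_PER_GPU, life, OTHER_PER_HOUR)
    print(f"{life}-year life: breakeven utilization = "
          f"{breakeven_utilization(owned, RENTAL_RATE):.2f}")
# With these inputs: roughly 0.52 at a 5-year life and 0.72 at a 3-year life.
```

Shortening the assumed life raises the utilization an owner must hold before ownership pays, which is the sense in which the crossover moves with depreciation.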

Buy vs rent vs build — the accelerator-level decision

Path	Unit basis (2026)	Time-to-first-job	Best when	What moves the crossover
Rent (neocloud)	~$2.29–3.50/GPU-hr median	Days to weeks	Spiky / short / uncertain util; burst overflow	Higher util → owning wins; volatile rates re-open it
Rent (hyperscaler)	~$6.88–12.29/GPU-hr	Hours to days	Managed envelope, enterprise SLA, integration	Convenience premium rarely beats owning at high util
Own (self-operate hardware)	~$0.74/GPU-hr at scale	Weeks–months (if power and space are already in hand)	Durable workload, high util, you hold the slab	Shorter depreciation life / thin resale → rent wins
Build (own facility + hardware)	~$8.5M/MW-yr all-in	24–36 months	Largest, well-forecast, multi-year; control needed	Demand uncertainty → option value favors lease/rent

Rental rates: SemiAnalysis H100 Index / AM Compute 2026. Self-op cost & per-server: SemiAnalysis 2025. Lead times: practitioner ranges (Introl, SemiAnalysis). The crossover is a function of utilization × depreciation × resale, not a fixed line.

The heterogeneous fleet: one vendor or several?

The next fork is composition: a single-vendor fleet, or a heterogeneous mix of GPUs, hyperscaler XPUs, and custom ASICs. This is not a religious question — it is a trade between software velocity and cost-per-token leverage, and it carries a real, quantifiable carrying cost.

Single-vendor (in practice, NVIDIA + CUDA) buys software velocity, the broadest library and kernel coverage, and operational simplicity — one stack to staff, one toolchain to qualify, one supplier relationship to manage. The cost is allocation pain and price exposure: you inherit the supply queue and pay the merchant premium. Heterogeneous fleets chase cost-per-token leverage and supply resilience. The economics that justify the second stack are real: AMD parts price ~15–30% below comparable NVIDIA SXM and on inference can close the effective-TCO gap to near zero or favorable (SemiAnalysis, 2025–2026); custom ASICs are eating the inference workload precisely on a tokens/$/W basis. The structural pull is strong — ASIC-based AI servers are projected at ~27.8% of shipments in 2026, with ASIC share of the inference segment rising toward ~40% (TrendForce / industry synthesis, 2026), because stable-architecture inference is exactly where fixed-function silicon's efficiency beats general-purpose flexibility.

The discipline is to match silicon to sub-workload and to be honest about the carrying cost of the second stack. Custom ASICs and TPUs win on stable, high-volume inference where the architecture is frozen and cost-per-token dominates — but you surrender portability and inherit a less mature toolchain (XLA/JAX or Neuron, see Chapter 7.9). Merchant GPUs win where the workload is changing — frontier training, research, anything that needs CUDA's velocity and kernel coverage — because reprogrammable flexibility is worth the premium when the model architecture is still moving. The carrying cost of heterogeneity is a second software stack: engineers, qualification, CI, kernel ports, and the realized-MFU gap on the less-supported part. Only carry it if the cost-per-token saving on the workload you actually run exceeds that engineering tax. → the GPU-vs-ASIC framing in Chapter 7.5; lock-in quantification in Chapter 7.9.
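The "only carry it if" test reduces to simple arithmetic. The sketch below is one hedged way to write it down; the function names and all inputs are hypothetical, and the engineering tax is treated as a single annual figure covering staff, qualification, CI, and kernel ports.

```python
def second_stack_net_saving(baseline_usd_per_mtok, alt_usd_per_mtok,
                            annual_mtok_on_alt, annual_engineering_tax_usd):
    """Annual saving from the second stack, net of its carrying cost."""
    gross = (baseline_usd_per_mtok - alt_usd_per_mtok) * annual_mtok_on_alt
    return gross - annual_engineering_tax_usd


def breakeven_annual_mtok(baseline_usd_per_mtok, alt_usd_per_mtok,
                          annual_engineering_tax_usd):
    """Token volume (in millions) the second stack must serve to pay for itself."""
    saving_per_mtok = baseline_usd_per_mtok - alt_usd_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # no per-token saving: the stack never pays back
    return annual_engineering_tax_usd / saving_per_mtok


# Hypothetical inputs: $1.00 vs $0.75 per M-tokens, $5M/yr engineering tax.
print(breakeven_annual_mtok(1.00, 0.75, 5_000_000))
print(second_stack_net_saving(1.00, 0.75, 30_000_000, 5_000_000))
```

The alternative-stack cost per token should be the realized figure on the less-supported part, since the realized-MFU gap is part of the tax.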

Fixed-function efficiency vs reprogrammable flexibility — the durable axis

The heterogeneous-fleet decision reduces to one axis that will outlive any specific generation: how stable is the workload? A custom ASIC bakes a serving pattern into silicon and wins decisively on tokens/$/W for that pattern — and loses its entire advantage the moment the model architecture shifts under it, because you cannot reprogram a fixed-function datapath. A merchant GPU pays a flexibility tax in die area and power but absorbs an architecture change for free. The correct fleet, therefore, tracks the stability of your workload mix: dedicate ASIC/TPU capacity to the frozen, high-volume serving tiers where the next two years of architecture are predictable, and keep a merchant-GPU pool for everything still in motion. Mis-read the stability and the ASIC strands: you have bought the cheapest tokens for a model you no longer serve. This is the GPU-vs-ASIC bet stated as an engineering hedge rather than a vendor preference.

Figure	What it measures	Year	Source
~2/3	Inference share of AI compute in 2026 (½ in 2025), making $/M-tokens the governing metric	2026	Deloitte TMT Predictions 2026
~$283–318k	All-in cost per 8-GPU H100 server (excl. storage); GPU is only part of the numerator	2025	SemiAnalysis AI Neocloud Playbook
~$0.74/GPU-hr	Self-operated TCO at 2048-GPU scale, 90% util; ~$1.03 small clusters (scale penalty) (contested, single-source)	2025	SemiAnalysis, GPU cluster cost
~$1.03 → ~$12.29/hr	H100 rental ladder: spot floor → neocloud ~$2.29–3.50 → AWS ~$6.88 → Azure ~$12.29	2026	SemiAnalysis H100 Index / AM Compute
~10x	Tokens/watt advantage of Blackwell-class over Hopper on MoE inference; the power-limited lever	2026	NVIDIA / Signal65 (vendor; treat as optimistic)
~27.8%	ASIC share of AI server shipments in 2026; inference-segment ASIC share toward ~40%	2026	TrendForce / industry synthesis
~15–30%	AMD price discount vs comparable NVIDIA SXM; inference effective-TCO gap → ~0/favorable	2025-2026	SemiAnalysis (AMD/MI300X analyses)
~20–40%	GPU residual after 3 yr (CONTESTED); underwrites buy-vs-rent crossover & resale liquidity	2025	Hashrate Index / CNBC synthesis

Supply and allocation: the order is not the arrival

Selection assumes you can get the silicon. In 2026 you frequently cannot, at least not on your timeline, and a procurement strategy that ignores allocation is a wish list. The binding constraint upstream of every accelerator is HBM and advanced packaging: 2026 HBM production is sold out, HBM4 reaches volume only in H2 2026, and CoWoS-class packaging is the most-cited binding constraint on AI compute through 2030 (see Chapter 7.6, Chapter 7.7). Your purchase order does not set the arrival date; the supplier's HBM and CoWoS allocation does.

The consequences for procurement execution are concrete. First, lead time is a selection criterion, not a footnote: a part that is 15% cheaper per token but arrives two generations into your ramp may strand a power slot you fought years to energize (see Chapter 3.2). Second, allocation favors anchor commitments — take-or-pay volume, multi-generation agreements, and (increasingly) vendor-equity entanglements move you up the queue, which is why allocation strategy is inseparable from the financing structure (Chapter 2.5) and the end-to-end equipment supply chain (Chapter 2.3). Third, a second-source qualification is a hedge with option value: carrying a qualified AMD or ASIC path is not only a cost-per-token play, it is insurance against a single-supplier allocation squeeze. The heterogeneous fleet and the allocation hedge are the same decision viewed from two angles.
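The first point, lead time as a selection criterion, can be made concrete with a rough sketch. It assumes a fixed planning horizon, a constant monthly token volume once the part is running, and a monthly carrying cost for the energized but empty power slot; every number below is hypothetical.

```python
def effective_usd_per_mtok(running_usd_per_mtok, mtok_per_month,
                           delay_months, horizon_months,
                           idle_slot_usd_per_month):
    """Cost per M-tokens over the horizon, charging the idle slot to the late part."""
    productive_months = horizon_months - delay_months
    if productive_months <= 0:
        raise ValueError("part arrives after the planning horizon ends")
    tokens = mtok_per_month * productive_months
    cost = running_usd_per_mtok * tokens + idle_slot_usd_per_month * delay_months
    return cost / tokens


# Hypothetical: a 36-month horizon, 1M M-tokens/month, $1M/month idle-slot cost.
on_time = effective_usd_per_mtok(1.00, 1_000_000, 0, 36, 1_000_000)
late = effective_usd_per_mtok(0.85, 1_000_000, 6, 36, 1_000_000)  # 15% cheaper, 6 months late
print(f"on-time part: ${on_time:.2f}/M-tok, late part: ${late:.2f}/M-tok")
```

With these placeholder inputs the part that is 15% cheaper per token ends up dearer over the horizon once the idle slot is charged against it. The sketch ignores the revenue lost during the delay, which would widen the gap further.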

Deep dive: why HBM/CoWoS allocation, not your PO, sets your schedule

The instinct is to treat accelerator procurement like buying servers: choose the part, cut the PO, take delivery on the quoted lead time. For frontier AI silicon in 2026 this model is broken, and the break is upstream of the GPU vendor. Every leading-edge accelerator depends on two scarce inputs that the vendor does not fully control: HBM stacks (a three-supplier oligopoly, with 2026 capacity sold out and HBM4 ramping only in the second half of the year) and CoWoS-class advanced packaging (capacity-gated, with 16-Hi stacks limited by fewer than ~100 hybrid-bonding tools worldwide). A GPU is, in supply terms, a packaging slot wrapped around a die and a set of HBM stacks, and the slot, not the die, is the bottleneck.

The procurement consequence: the part you can buy and the part you can get on your ramp date are different questions, and the second one dominates. This is why anchor-tenant economics exist — a ~500k-Trainium2 commitment for a single customer, or a multi-generation NVIDIA agreement, buys allocation priority that a spot order cannot. It is why neoclouds with vendor relationships and pre-committed volume out-execute better-capitalized latecomers. And it is why the correct RFP asks not just 'what is your price?' but 'what is your allocated, contractually-committed delivery schedule, and what happens to it if HBM slips?' A selection made on price-per-token without a binding delivery commitment is a forecast, not a plan. → the upstream allocation gate in Chapter 7.6 and Chapter 2.3; the financing entanglement in Chapter 2.5.

Constructing the RFP for cross-vendor comparison

The RFP is where accelerator selection succeeds or fails, because it is the only instrument that makes heterogeneous bids comparable. The failure mode is universal and expensive: vendors bid on the spec sheet, you compare paper FLOPS, and the realized-MFU gap turns the cheapest bid into the most expensive cluster. A defensible RFP forecloses that by fixing the comparison on delivered goodput, not nameplate performance.

Specify the metric, not the part. State the workload, the precision, the SLO (TTFT/TPOT for inference; job-completion-time for training), and demand bids in your governing denominator — $/M-tokens at the SLO, or $/effective-PFLOP-hour at delivered MFU — so a cheap, slow part cannot win on sticker price.
Mandate a benchmark on your workload. Require a measured run on a representative model at your batch and sequence length, not a vendor-supplied number on a favorable case. The MI300X-vs-H100 benchmarking literature exists precisely because measured diverges from spec; bake the measurement into the bid.
Score on tokens/MW when power-limited. If your interconnection is capped, the bid that wins on $/M-tokens but loses on perf/watt cannot be energized at scale — require a tokens/MW line and weight it to the binding constraint.
Bind the delivery schedule. Demand a contractually-committed, HBM/CoWoS-allocated delivery curve with remedies for slip — because an unbacked lead time is the single most common reason a 'won' procurement misses its ramp.
Set a goodput acceptance bar. Tie acceptance and payment milestones to a sustained, measured goodput threshold (see Chapter 7.14), not to power-on — so a cluster that benches well but cannot hold MFU through burn-in is the vendor's problem, not yours.
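The screening logic in the list above can be sketched as a small filter-and-rank step. This is an illustrative structure, not a scoring standard: the `Bid` fields, vendor names, and figures are hypothetical, and a real RFP would weight several lines rather than sort on one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    vendor: str
    usd_per_mtok_at_slo: float          # quoted in the governing denominator
    tokens_per_second_per_mw: float     # the tokens/MW line
    measured_on_our_workload: bool      # benchmark run at our batch and sequence length
    delivery_contractually_bound: bool  # allocated schedule with remedies for slip


def screen_and_rank(bids, power_limited):
    """Drop bids without a measured benchmark or bound delivery, then rank the rest."""
    eligible = [
        b for b in bids
        if b.measured_on_our_workload and b.delivery_contractually_bound
    ]
    if power_limited:
        # Power binds: most tokens per megawatt first.
        return sorted(eligible, key=lambda b: b.tokens_per_second_per_mw, reverse=True)
    # Capital binds: cheapest realized $/M-tokens first.
    return sorted(eligible, key=lambda b: b.usd_per_mtok_at_slo)


# Hypothetical bids.
BIDS = [
    Bid("vendor_a", 0.40, 1_400_000, True, True),
    Bid("vendor_b", 0.35, 1_000_000, True, True),
    Bid("vendor_c", 0.25, 1_600_000, False, True),  # spec-sheet number only
]

print([b.vendor for b in screen_and_rank(BIDS, power_limited=False)])
print([b.vendor for b in screen_and_rank(BIDS, power_limited=True)])
```

The cheapest headline bid here never reaches the ranking because it was not measured on the buyer's workload, and the two remaining bids swap order depending on which constraint binds.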

The spec-sheet trap: never select on paper FLOPS

The recurring procurement disaster is selecting on nameplate FLOPS and HBM capacity, then discovering that the chosen part delivers a fraction of its paper performance on your actual workload because the software stack, the fabric, or the batch efficiency strangles it. A part carrying ~1.5x the paper FLOPS of a rival can deliver only 37–66% of that rival's realized throughput (SemiAnalysis, 2025) — meaning the 'faster, cheaper-per-FLOP' bid is, in delivered tokens, slower and dearer. The defense is to refuse to compare on inputs: force every bid into realized goodput on your workload, at your precision, at your SLO, measured before acceptance. The cost of skipping this is not a worse spreadsheet — it is a half-utilized cluster against a depreciation clock that started the day power was energized. Selection is decided on realized goodput, not on the spec sheet.
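The arithmetic behind "slower and dearer" is short enough to write out. In the sketch below the 1.5x paper-FLOPS ratio and the 37–66% realized range come from the paragraph above; the 20% price discount is a hypothetical input, not a quoted figure.

```python
def relative_cost_per_delivered_token(price_ratio, realized_throughput_ratio):
    """Cost per delivered token relative to the rival (1.0 means parity).

    price_ratio: this part's price divided by the rival's price.
    realized_throughput_ratio: this part's measured throughput divided by the rival's.
    """
    return price_ratio / realized_throughput_ratio


PAPER_FLOPS_RATIO = 1.5   # paper FLOPS relative to the rival (from the text)
PRICE_RATIO = 0.80        # hypothetical: the part is priced 20% below the rival

for realized in (0.37, 0.66):  # realized throughput relative to the rival (from the text)
    per_paper_flop = realized / PAPER_FLOPS_RATIO
    cost = relative_cost_per_delivered_token(PRICE_RATIO, realized)
    print(f"realized {realized:.0%} of rival: "
          f"{per_paper_flop:.0%} as much throughput per paper FLOP, "
          f"{cost:.2f}x the cost per delivered token")
```

Across the quoted range the discounted part remains more expensive per delivered token under this assumed discount, which is why the comparison has to be made on realized goodput rather than on paper FLOPS or sticker price.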

Putting it together: the selection sequence

The chapter resolves into an ordered procedure, because each step gates the next. (1) Name the denominator — $/GPU-hr, $/M-tokens, or tokens/MW — from what you sell and what binds you. (2) Determine the regime — power-limited or capex-limited — because it picks between tokens-per-dollar and tokens-per-megawatt. (3) Build the TCO model with an explicit, dual depreciation assumption (economic and book life), since that one term re-ranks the shortlist. (4) Decide buy vs rent vs build at your real utilization and resale liquidity, treating the answer as a crossover that moves, not a fixed policy. (5) Compose the fleet — single-vendor for velocity, heterogeneous for cost-per-token and supply resilience — pricing the carrying cost of the second stack against the workload's stability. (6) Execute the RFP on realized goodput with a bound, allocated delivery schedule. The output is not a chip; it is a purchase commitment, a power budget, and a depreciation schedule that the rest of Part 7 and Part 1 have to live with.

The silicon this chapter selects among is detailed in Chapter 7.1 (taxonomy), Chapter 7.2 (NVIDIA), Chapter 7.3 (AMD), Chapter 7.4 (hyperscaler XPUs) and Chapter 7.5 (custom ASICs). The allocation gate that decides when it arrives is Chapter 7.6 (HBM) and Chapter 7.7 (packaging), with the end-to-end supply chain in Chapter 2.3 and the financing entanglement in Chapter 2.5. Software lock-in and the realized-MFU gap are Chapter 7.9; precision and quantization are Chapter 7.10; system composition and GPU:CPU ratios are Chapter 7.8; acceptance and goodput gates are Chapter 7.14. The procurement archetypes are framed in Chapter 1.6 and the depreciation/economics that underwrite the whole TCO are the canonical argument of Chapter 1.8. Inference serving that sets the token denominator is Chapter 10.11; the per-generation perf/watt and cost-per-token trajectory is consolidated in Chapter 16.2.