
Chapter 7.8

Host CPUs, GPU:CPU Ratios & System Composition

The host CPU and the GPU:CPU ratio are not bookkeeping details bolted onto an accelerator purchase — they are the system-composition fork that decides whether your GPUs stay fed, and in the agentic era that fork has swung hard back toward the CPU.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Coherent Arm host-attach (Grace/Vera over NVLink-C2C) versus a discrete x86 host (EPYC/Xeon over PCIe) — and the memory model, blast radius, and supply-coupling each commits you to.
The GPU:CPU ratio your workload actually needs — the training-era 8:1 norm versus the agentic-inference rebalance toward 2:1 and below — because it sets host core count, host memory, and per-rack CPU spend.
The host memory architecture: LPDDR5X soldered into a coherent superchip versus socketed DDR5/MRDIMM you can size and service — capacity, bandwidth, RAS, and serviceability all trade here.
Whether your fleet is one composition or several — a dense coherent trainer SKU and a CPU-heavy agentic-inference SKU are different machines, and pretending one part serves both strands GPUs or strands cores.
Whether the system-composition decision is bundled (you take the vendor's rack-scale CPU:GPU ratio as shipped) or disaggregated (you compose host and accelerator yourself) — and what that costs you in flexibility versus integration risk.

An accelerator does not run a workload by itself. It runs as the muscle of a system — a host CPU that loads the model, drives the data pipeline, schedules kernels, services interrupts, runs the OS and the container runtime, terminates the storage and management network, and increasingly does real application work: tokenization, retrieval, sandboxed tool execution, agent orchestration, and the reward-evaluation loop of reinforcement learning. The question this chapter answers is not "which GPU" — that is Chapter 7.2 through Chapter 7.5. It is the question that sits one level up: what machine do you wrap around the accelerator, and in what ratio. Get it wrong in one direction and the host starves the GPUs — they idle waiting on data that the CPU pipeline cannot deliver, and you have paid for the most expensive silicon in the building to wait. Get it wrong in the other direction and you have bought host cores and host memory that no kernel ever touches — dead capex riding on a power budget you are otherwise fighting to conserve.

This was a quiet, settled question for the training era. The answer was "as little CPU as you can get away with": eight GPUs per host, the cheapest x86 that could keep the pipeline fed, and as much of the silicon budget as possible spent on accelerators. Two forces broke that settlement. First, coherent host-attach — NVIDIA's NVLink-C2C welding a Grace (and now Vera) Arm CPU to the GPU as a single memory-coherent superchip — turned the host from a discrete component into part of the accelerator's memory hierarchy. Second, and more disruptively, agentic inference and RL shifted real, sustained compute back onto the host. The orchestration layer that schedules sub-agents, routes tool calls, runs retrieval, and decides whether a task is done is CPU work, and it does not fit in the slivers of host left over from an 8:1 training box. The industry is re-deciding system composition in real time, and the fork has consequences that propagate into power, density, supply, and serviceability for the life of the fleet.

The two host-attach models

There are two physically and architecturally distinct ways to attach a host CPU to accelerators, and the choice is the first fork in system composition. They differ in the interconnect, the memory model the programmer sees, the failure blast radius, and — critically in 2026 — in how tightly they couple you to a single vendor's supply chain.

Coherent host-attach (the superchip model). NVIDIA's Grace-Blackwell and Vera-Rubin packages connect an Arm host CPU to the GPU(s) over NVLink-C2C, a chip-to-chip link delivering ~900 GB/s of coherent bandwidth on Grace-class parts (rising on Vera). "Coherent" is the load-bearing word: the CPU's LPDDR5X and the GPU's HBM live in one cache-coherent address space, so the GPU can reach host memory and the CPU can reach HBM with hardware coherence, no explicit copy, no staging through a bounce buffer. The host memory becomes a fast, large NUMA tier behind the GPU — the natural home for KV-cache spillover, large embedding tables, MoE expert weights that do not fit in HBM, and the working set of long-context and agentic inference. The cost: the CPU is welded to the GPU. You take the ratio the package ships (2 GPUs : 1 Grace in GB200; the Vera-Rubin pairing in 2026), the LPDDR5X capacity is soldered and fixed, and you are buying one vendor's CPU, GPU, and interconnect as an indivisible unit.

Discrete host-attach (the PCIe model). The classic HGX/8-GPU baseboard hangs accelerators off a pair of x86 sockets (AMD EPYC or Intel Xeon) over PCIe: Gen5 at ~64 GB/s per x16, Gen6 at ~128 GB/s. The CPU and GPU keep separate memory spaces; moving data between them is an explicit DMA over the PCIe link, while GPU-to-GPU traffic takes a direct NVLink path that bypasses the host entirely. You give up coherent host memory (host RAM is not a transparent extension of HBM, and nominal PCIe bandwidth is roughly 7–14x below NVLink-C2C) but you gain composability. Note that the PCIe figures are per direction while the ~900 GB/s NVLink-C2C figure is a bidirectional aggregate, so the like-for-like gap is about half the headline ratio. You choose the CPU vendor, the core count, the socketed DDR5 capacity (terabytes if you want them), and the GPU:CPU ratio independently, and you can service or upgrade the host without touching the accelerator. For workloads that do not need coherent host memory (most discrete-GPU inference, and any fleet that values dual-source supply over the last increment of host-GPU bandwidth) this is the pragmatic default.

Coherent Arm superchip vs discrete x86 host — the system-composition fork

Axis	Coherent Arm (Grace / Vera + NVLink-C2C)	Discrete x86 (EPYC / Xeon + PCIe)	What the choice costs you
Host-to-GPU link	NVLink-C2C ~900 GB/s coherent (rising on Vera)	PCIe Gen5 ~64 GB/s / Gen6 ~128 GB/s per x16	~7–14x host-GPU bandwidth, and coherence vs explicit copies
Memory model	One cache-coherent address space; HBM + host LPDDR5X unified	Separate CPU and GPU spaces; DMA across PCIe	Coherent host RAM as an HBM spill tier vs manual staging
Host memory	Soldered LPDDR5X — Grace 480 GB / Vera up to 1.5 TB	Socketed DDR5 / MRDIMM — multiple TB, field-serviceable	Bandwidth-per-watt vs capacity, RAS, and serviceability
GPU:CPU ratio	Fixed by package (2:1 in GB200; Vera-Rubin pairing)	Composed freely (8:1 down to 1:1)	Take the vendor's ratio vs size it to your workload
CPU ISA / vendor	Arm (NVIDIA Olympus/Neoverse cores), single-source	x86, dual-source (AMD + Intel)	CUDA-tight integration vs portability and supply hedging
Blast radius / service	CPU + GPU fail and refresh as one welded unit	Host serviceable independently of accelerators	Tighter integration vs independent repair and upgrade

2026-current reference points. Grace/Vera figures are NVIDIA platform specs; EPYC/Xeon figures are vendor datasheets. The right column is the consequence you inherit, not a verdict.
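To make the bandwidth row concrete, the following is a minimal Python sketch that turns the nominal link figures from the table into idealized transfer times for a spill of a given size. The function names, the `efficiency` derating knob, and the 100 GB example payload are illustrative assumptions, not measured values; real transfers also depend on direction, protocol overhead, and access pattern.

```python
# Nominal link bandwidths in GB/s, copied from the comparison table.
# Caveat: the PCIe figures are per direction for an x16 link, while the
# NVLink-C2C figure is quoted as a bidirectional total, so the ratios
# computed here are headline numbers rather than like-for-like ones.
LINK_GBPS = {
    "nvlink_c2c": 900.0,      # Grace-class coherent link
    "pcie_gen5_x16": 64.0,
    "pcie_gen6_x16": 128.0,
}


def transfer_seconds(size_gb: float, link: str, efficiency: float = 1.0) -> float:
    """Idealized time to move size_gb across the named link.

    efficiency is a hypothetical derating factor in (0, 1]; 1.0 assumes
    the nominal bandwidth is fully achieved.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return size_gb / (LINK_GBPS[link] * efficiency)


def bandwidth_ratio(fast: str, slow: str) -> float:
    """How many times more nominal bandwidth `fast` has than `slow`."""
    return LINK_GBPS[fast] / LINK_GBPS[slow]


# Hypothetical example: a 100 GB working set spilled out of HBM.
for name in LINK_GBPS:
    print(f"{name}: {transfer_seconds(100.0, name):.2f} s per full pass")
```

The point of the sketch is the shape of the comparison, not the absolute times: a workload that streams its spill tier on every step feels this gap on every step, while a workload that never leaves HBM never sees it.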

The fork: do you need coherent host memory, or do you need composability?

This is the decision that sets system composition for the life of the fleet. Choose coherent host-attach when the workload genuinely uses host memory as part of the accelerator's hierarchy — long-context inference with large KV-cache, wide-MoE serving where experts spill out of HBM, recommendation models with terabyte embedding tables, or any case where the ~900 GB/s coherent path and the unified address space buy real goodput. You pay for it in a fixed ratio, soldered memory, single-vendor supply, and a host you cannot service or upgrade apart from the GPU. Choose discrete x86 host-attach when you value composability — sizing the CPU, the memory, and the ratio to the workload — and dual-source supply over the last increment of host-GPU bandwidth. Most inference fleets that do not lean on coherent memory land here, and so does anyone whose procurement strategy refuses to single-source the entire node. The trap is choosing the superchip for its headline bandwidth and then running a workload that never touches host memory — you have bought coherence you do not use and surrendered the composability you did need.

The GPU:CPU ratio — and why agentic AI is rebalancing it

The GPU:CPU ratio is the single number that most concisely captures system composition. For the training era it was effectively a constant: 8:1. An HGX baseboard carried eight accelerators and a dual-socket host whose job was to keep the data pipeline full and stay out of the way. The host did almost no application compute — pre-training is GPU-bound on collectives and matmul, and the CPU's contribution is loading shards, augmenting data, and launching kernels. The rational move was to minimize host spend: the dollars and the watts belonged to the GPUs. Even host memory got cut — neocloud reference builds trimmed training-server RAM from 2 TB to 1 TB to save roughly $13.6k per server precisely because the trainer never used it.

Agentic inference and RL break the constant. The work that defines an AI agent (decomposing a request into sub-tasks, scheduling them, routing tool calls, passing state between sub-agents, running retrieval, executing code in sandboxes, and evaluating whether the original goal is met) is overwhelmingly host-side, control-heavy compute. It is branchy, latency-sensitive, and does not vectorize onto a GPU. RL training compounds this: every action an agent takes during a rollout has to be executed and scored, and that evaluation loop runs on the CPU. As agents proliferate, the host stops being a feeder and becomes a co-processor doing sustained work, and the ratio that kept GPUs the scarce resource per node no longer holds. TrendForce, quoting the ratio as CPU:GPU, projects a shift from the training-era 1:4–1:8 toward 1:1–1:2 in agentic deployments (4:1–8:1 toward 1:1–2:1 in this chapter's GPU:CPU terms); Intel frames the same move as 1:8 heading toward 1:1. Arm goes further on the aggregate: roughly 30 million CPU cores per gigawatt in a traditional AI data center rising to ~120 million per gigawatt in the agent era, a fourfold increase in host cores per unit of power, which Arm pairs with a >$100B data-center-CPU opportunity by 2030.
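Because analyst figures are usually quoted as CPU:GPU while this chapter speaks in GPU:CPU, a small normalizer helps keep the two conventions straight. This is a sketch only; the function name `gpus_per_cpu` and the `order` parameter are hypothetical, and the inputs are the ratios and core counts quoted in the paragraph above.

```python
from fractions import Fraction


def gpus_per_cpu(ratio: str, order: str = "gpu:cpu") -> Fraction:
    """Return GPUs per CPU for a ratio string such as '8:1'.

    order says which side of the colon is which: 'gpu:cpu' (this
    chapter's convention) or 'cpu:gpu' (how the analyst figures are quoted).
    """
    left_text, right_text = ratio.split(":")
    left, right = Fraction(left_text.strip()), Fraction(right_text.strip())
    if left <= 0 or right <= 0:
        raise ValueError("both sides of the ratio must be positive")
    if order == "gpu:cpu":
        return left / right
    if order == "cpu:gpu":
        return right / left
    raise ValueError(f"unknown order: {order!r}")


# Analyst-style CPU:GPU figures, normalized to GPUs per CPU.
training_era = [gpus_per_cpu(r, "cpu:gpu") for r in ("1:4", "1:8")]
agentic_era = [gpus_per_cpu(r, "cpu:gpu") for r in ("1:1", "1:2")]

# Arm's aggregate framing: host cores per gigawatt, agent era vs traditional.
cores_per_gw_growth = 120e6 / 30e6
```

Normalizing to a single quantity (GPUs per CPU) makes the direction of the shift unambiguous: the number falls as the host takes on more work.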

The consequence for system composition is concrete. A fleet scoped at 8:1 for training and then repurposed for agentic serving will be CPU-starved: the orchestration and tool-execution layer saturates the thin host, queues form ahead of the GPUs, tail latency blows past the SLO, and accelerators sit underutilized behind a host bottleneck. The fix is not more GPUs — it is more CPU per GPU, which means a different node, often a different rack SKU, and a host-side power and memory budget the training composition never planned for. This is why NVIDIA designed Vera explicitly "for agents": 88 custom Olympus cores, 176 threads via spatial multithreading, and up to 1.5 TB of LPDDR5X at ~1.2 TB/s — a host built to do real work, not just feed.

Why "the GPU:CPU ratio" is becoming the wrong question

Arm's leadership has argued that as host core counts climb toward 512 per socket and agentic workloads make the CPU a first-class compute tier, the GPU:CPU ratio stops being a meaningful design knob — what matters is whether you have enough host throughput, of the right kind, in the right place. The useful reframing: stop asking "how many GPUs per CPU" and start asking "is the host a feeder or a co-processor for this workload." A pure pre-training trainer wants a feeder — minimize host, maximize accelerator, 8:1 is fine. An agentic-inference or RL fleet wants a co-processor — size host cores and host memory to the orchestration and tool-execution load, and the ratio falls out of that sizing rather than driving it. Designing to a ratio you inherited from the training era is how you build a fleet that is simultaneously over-provisioned on GPUs and starved on CPUs.
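One way to let the ratio "fall out of the sizing" is to start from host load per GPU and work backwards. The sketch below is a rough first-order model, not a capacity planner: the function names, the 0.6 utilization target, and the request rates and CPU-seconds in the examples are all assumptions chosen for illustration. Only the 88-core figure comes from the Vera description earlier in the chapter.

```python
import math


def host_cores_per_gpu(requests_per_sec_per_gpu: float,
                       cpu_seconds_per_request: float,
                       target_utilization: float = 0.6) -> float:
    """Host cores that one GPU's request stream keeps busy, with headroom.

    First-order estimate: offered CPU load (requests/s times CPU-seconds
    per request) divided by the utilization you are willing to run at.
    """
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    return requests_per_sec_per_gpu * cpu_seconds_per_request / target_utilization


def gpus_one_cpu_can_serve(cores_per_cpu: int, cores_needed_per_gpu: float) -> int:
    """Whole GPUs one CPU can keep fed; 0 means one CPU per GPU is not enough."""
    if cores_needed_per_gpu <= 0:
        raise ValueError("cores_needed_per_gpu must be positive")
    return math.floor(cores_per_cpu / cores_needed_per_gpu)


# Hypothetical feeder: light per-request host work.
feeder_cores = host_cores_per_gpu(10.0, 0.05)
# Hypothetical co-processor: heavy orchestration and tool execution.
coprocessor_cores = host_cores_per_gpu(10.0, 4.0)

print(gpus_one_cpu_can_serve(88, feeder_cores))
print(gpus_one_cpu_can_serve(88, coprocessor_cores))
```

In the feeder case the host is nowhere near the limiting factor, so the ratio is set by packaging and cost. In the co-processor case the same socket supports far fewer GPUs, and the ratio is an output of the workload measurement rather than an input to the design.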

Host memory: LPDDR5X soldered vs DDR5/MRDIMM socketed

The memory the host carries is the third axis of composition, and it tracks the host-attach fork tightly. The coherent superchips use LPDDR5X — the low-power DRAM from the mobile world — soldered directly onto the package. The reasons are bandwidth-per-watt and density: Grace pairs 480 GB of LPDDR5X at ~546 GB/s with a ~250 W TDP, and NVIDIA's framing is roughly 2x the bandwidth at half the power of conventional socketed server memory. Vera pushes this to up to 1.5 TB at ~1.2 TB/s. In a power-bound rack where every watt spent on host memory is a watt not spent on accelerators, LPDDR5X's efficiency is the whole point. The price you pay is rigidity: the capacity is fixed at manufacture, you cannot add or upgrade it, and a memory fault is a package fault — there is no DIMM to pull and replace, and LPDDR's RAS posture is weaker than registered server DRAM with full ECC and post-package repair.

The discrete x86 hosts use socketed DDR5, and increasingly MRDIMM (multiplexed-rank DIMMs) on the Intel side to claw back the bandwidth gap — Granite Rapids reaches MRDIMM-8800 against AMD's DDR5-6000/6400 on EPYC Turin. Socketed memory gives you what the superchip cannot: you size capacity to the workload (multiple terabytes for embedding tables, large KV-cache, or in-memory retrieval indices), you get full server-grade RAS, and you field-service or upgrade memory without touching the accelerators. The cost is power and density — registered DDR5 burns more per gigabyte than LPDDR5X, and the DIMM slots consume board area and a slice of the rack's thermal and electrical budget. The decision mirrors the host-attach fork: take the superchip and you take its soldered LPDDR5X efficiency-and-rigidity bargain; take the discrete host and you can spend power to buy capacity, bandwidth, RAS, and serviceability on your own terms.

~900 GB/s

NVLink-C2C coherent host-GPU bandwidth (Grace–Blackwell); ~7–14x PCIe Gen5/Gen6 per x16

2025 · NVIDIA; arXiv 2411.02814 (coherent fabrics)

1:1–1:2

agentic-era CPU:GPU ratio (1:1–2:1 as GPU:CPU), down from the training-era 1:4–1:8 (Intel: 1:8 → 1:1)

2026 · TrendForce; Intel (via TrendForce)

~120M cores/GW

host CPU cores per gigawatt in the agent era, up ~4x from ~30M/GW traditional

2026 · Arm (via TrendForce)

88 cores / 1.5 TB

NVIDIA Vera CPU: 88 Olympus cores, 176 threads, up to 1.5 TB LPDDR5X @ ~1.2 TB/s, OEM avail. H2 2026

2026 · NVIDIA (Vera CPU)

72 cores / 480 GB

NVIDIA Grace: 72 Arm cores, 480 GB LPDDR5X @ ~546 GB/s, ~250 W; 36 per GB200 NVL72

2025 · NVIDIA; Introl

192 cores / 500 W

AMD EPYC 9965 (Turin) host CPU, DDR5-6000/6400; Intel Xeon 6 up to 128 P-cores w/ MRDIMM-8800

2025 · AMD; Intel; Phoronix

~$13.6k/server

saved by trimming training-server host RAM 2 TB → 1 TB — host memory the trainer never used

2025 · SemiAnalysis (AI Neocloud Playbook)

2 : 1

GPU:CPU ratio fixed by the GB200 NVL72 package (72 Blackwell GPUs : 36 Grace CPUs)

2025 · NVIDIA (GB200 NVL72)

System composition as a fleet decision, not a node decision

A common mistake is treating composition as a single answer. A real AI operation runs more than one workload shape, and the host-attach model, the GPU:CPU ratio, and the host memory that suit a pre-training trainer are the wrong answers for an agentic-inference fleet — and vice versa. The discipline is to recognize that system composition is plural, and to compose distinct node and rack SKUs for the workloads that actually dominate your draw.

A training SKU wants the coherent superchip ratio as shipped (2:1 in GB200), minimal host application compute, host memory sized to the data pipeline not to the model, and every spare watt routed to accelerators. An agentic-inference SKU wants the opposite: a fat host — many cores for orchestration and tool execution, large host memory for KV-cache and retrieval state, and a GPU:CPU ratio near 1:1–2:1 so the host never becomes the queue ahead of the GPUs. A batch-inference SKU sits between them, throughput-bound and tolerant of a thinner host. Forcing one composition across all three is how you end up simultaneously paying for unused host memory on the trainers and starving the agentic fleet of the CPU it needs. This is the same plurality the archetype framework demands at facility scale (Chapter 1.1) — here it lands inside the node.

Deep dive: why the host starves the GPU (and how to see it before you buy)

The failure mode that justifies this entire chapter is GPU starvation by the host, and it is subtle because the symptom — low GPU utilization — looks like a workload problem, not a composition problem. The mechanism: an accelerator can only do work that has been staged for it. If the host cannot decode the next batch, run the tokenizer, service the retrieval call, execute the agent's tool, and launch the kernel fast enough, the GPU drains its queue and idles. On a training box this rarely binds — pre-training is GPU-bound and the 8:1 host has cycles to spare. On an agentic-inference box it binds constantly, because every request fans out into host-side orchestration and tool calls that contend for the same thin host the training composition specced.

You can see this before you buy by profiling the host-side critical path, not just the GPU. Measure the CPU time per request spent in tokenization, retrieval, sandbox execution, and orchestration, and the host memory footprint of the KV-cache and retrieval state. If host-side time is a meaningful fraction of end-to-end latency, or if host memory pressure forces eviction, you are CPU-bound and an 8:1 ratio will strand accelerators. The fix is to move down the ratio (more CPU per GPU), widen the host (more cores), or fatten host memory — and to recognize that adding GPUs to a CPU-bound fleet buys nothing but idle silicon. The goodput discipline of Chapter 9.4 for training has a direct analogue here: utilization, not nameplate FLOPS, is the number that pays the bill.
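The host-side profiling described above can be approximated with very little tooling. The sketch below is one possible shape for it, under stated assumptions: `RequestProfile`, `looks_host_bound`, the stage names, and the 0.3 threshold are all hypothetical, and the timer measures wall-clock time with `time.perf_counter` rather than true CPU time. GPU work is typically launched asynchronously, so a real "gpu" stage would need to synchronize before the timer stops.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestProfile:
    """Accumulates wall-clock seconds per named stage of one request."""

    def __init__(self) -> None:
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add elapsed time even if the stage raised.
            self.seconds[name] += time.perf_counter() - start

    def host_fraction(self, gpu_stages=("gpu",)) -> float:
        """Share of measured time spent outside the GPU stages."""
        total = sum(self.seconds.values())
        if total == 0.0:
            return 0.0
        host = sum(t for name, t in self.seconds.items() if name not in gpu_stages)
        return host / total


def looks_host_bound(profile: RequestProfile, threshold: float = 0.3) -> bool:
    """Flag a request whose host share meets a hypothetical threshold."""
    return profile.host_fraction() >= threshold


profile = RequestProfile()
with profile.stage("tokenize"):
    sum(range(10_000))      # stand-in for real host-side work
with profile.stage("gpu"):
    time.sleep(0.001)       # stand-in for a synchronized GPU step
print(f"host share of latency: {profile.host_fraction():.0%}")
```

Aggregating `host_fraction` across a representative request mix, alongside host memory footprint, gives a first signal of whether a candidate ratio will strand accelerators before any hardware is ordered.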

Deep dive: coherent host memory as an HBM extension — what it actually buys

The headline case for the coherent superchip is that host LPDDR5X becomes a usable extension of HBM. It is worth being precise about when this is real value versus marketing. HBM is the scarce, expensive, capacity-constrained tier (Chapter 7.6) — a Blackwell GPU carries 192 GB, a Grace host carries 480 GB of LPDDR5X. Across NVLink-C2C's ~900 GB/s coherent link, that host memory is reachable from the GPU as a second NUMA tier: slower than HBM, but enormously larger and, crucially, addressable without an explicit copy. For workloads whose working set exceeds HBM — long-context inference with multi-gigabyte KV-cache, wide-MoE serving where the full expert set dwarfs HBM, embedding-heavy recommendation, or in-memory retrieval over large indices — this lets you spill gracefully into a coherent tier instead of falling off a cliff to PCIe-attached host memory or to storage.

The discrete PCIe host can do none of this transparently. Host memory there is a separate space at ~64–128 GB/s, reached by explicit DMA, and the programmer (or the framework) must manage the staging by hand. So the coherent model's payoff is real precisely for the memory-hungry inference workloads that are growing fastest — and that is the same pressure (CXL pooling and memory-semantic fabrics, Chapter 8.4) reshaping the memory hierarchy above the host. But the payoff is zero for a workload whose state fits in HBM and never reaches across the link. Buying coherence you do not exercise is the inverse error of starving a host you over-loaded — both come from specifying composition without profiling the workload.
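A quick capacity check shows where a given working set would land on a coherent node. This is a deliberately simple sketch: the function name and tier labels are hypothetical, and it assumes that host LPDDR5X is split evenly between the two GPUs of a GB200 package and that all of it is available for spill, which ignores the OS, the runtime, and any host-side application state. The capacities are the figures quoted above.

```python
HBM_GB = 192.0           # per Blackwell GPU, as quoted in the text
GRACE_LPDDR_GB = 480.0   # per Grace host, as quoted in the text
GPUS_PER_GRACE = 2       # GB200 package ratio

# Assumption: host memory is shared evenly and is fully usable for spill.
HOST_SHARE_GB = GRACE_LPDDR_GB / GPUS_PER_GRACE


def placement(working_set_gb: float,
              hbm_gb: float = HBM_GB,
              host_share_gb: float = HOST_SHARE_GB) -> str:
    """Classify a per-GPU working set by the tiers it would occupy."""
    if working_set_gb < 0:
        raise ValueError("working_set_gb must be non-negative")
    if working_set_gb <= hbm_gb:
        return "hbm"                    # coherence buys nothing here
    if working_set_gb <= hbm_gb + host_share_gb:
        return "hbm+coherent_host"      # spills into the coherent tier
    return "exceeds_node_tiers"         # needs sharding or a further tier


for size_gb in (100.0, 300.0, 500.0):   # hypothetical working sets
    print(size_gb, placement(size_gb))
```

The first case is the one the text warns about: a working set that fits in HBM never exercises the coherent link, so the superchip's headline bandwidth contributes nothing to goodput for that workload.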

The bundling fork: take the ratio, or compose it yourself

One last decision sits underneath all of this: how much of system composition do you let the vendor make for you. The rack-scale coherent platforms (GB200 NVL72, Vera-Rubin) ship as integrated units with the CPU:GPU ratio, the host CPU, the host memory, and the interconnect all fixed at the factory. You buy a rack, not a node, and the composition is a given. That is genuinely valuable — the integration risk of welding 72 GPUs, 36 CPUs, NVSwitch trays, and a liquid-cooling loop into one coherent domain is enormous, and the vendor has absorbed it. But it means the GPU:CPU ratio, the host ISA, and the host memory are not yours to tune. If your agentic workload wants a fatter host than the package ships, your only lever is a different SKU, not a different host.

The discrete path keeps composition in your hands. You build the node (choosing AMD or Intel, the core count, the socketed memory, the GPU count per host, the ratio) and you own both the flexibility and the integration risk. For a fleet with heterogeneous workloads and a procurement strategy that refuses to single-source, this is the path that lets composition track the workload. The 2026 reality is that most large operators run both: bundled coherent racks where the workload genuinely needs the coherent domain (training, memory-hungry inference), and composed discrete nodes where composability and dual-source supply matter more. The decision is not which path is correct in the abstract; it is which workloads belong on which path, and that lands you right back at the fleet-plurality discipline above. Procurement and fleet composition are treated in Chapter 7.11.

The training-era ratio is a liability you will carry for years

The single most expensive composition mistake of the 2026–2027 transition is buying agentic-inference capacity at the training-era 8:1 ratio because that is what the standard HGX node ships and what the org is used to specifying. The accelerators are the easy part of the mistake to spot: they show up as low utilization. The real damage is the host: a thin CPU and modest host memory that throttle the orchestration and tool-execution layer, cap your agent concurrency, and cannot be widened without replacing the node. You will carry that under-specced host for the depreciation life of the fleet, paying for idle GPUs behind a host bottleneck. Profile the host-side critical path before you commit to a ratio, and if the workload is agentic, spec the host as a co-processor, not a feeder.

System composition wraps around the accelerator chosen in Chapter 7.2 (NVIDIA, including Grace/Vera and the GB200/Vera-Rubin packages) and Chapter 7.3 (AMD). The coherent host-memory tier it enables is fed by the HBM economics of Chapter 7.6 and the memory-semantic/CXL fabrics of Chapter 8.4; the NVLink-C2C link itself is part of the scale-up domain treated in Chapter 8.5. The agentic and RL workloads driving the ratio rebalance are scoped in Chapter 1.3 (inference) and Chapter 1.4 (post-training/RL); the goodput-vs-utilization discipline behind GPU starvation mirrors Chapter 9.4. The fleet-plurality logic — composing distinct SKUs per workload — is the node-level form of the archetype framework in Chapter 1.1, and the buy-vs-compose economics close out in Chapter 7.11. Host memory power competes for the same rack watt budget engineered in Chapter 5.4 (DLC) and the precision/memory tradeoffs of Chapter 7.10.