Chapter 7.8
Host CPUs, GPU:CPU Ratios & System Composition
The host CPU and the GPU:CPU ratio are not bookkeeping details bolted onto an accelerator purchase — they are the system-composition fork that decides whether your GPUs stay fed, and in the agentic era that fork has swung hard back toward the CPU.
What you'll decide here
- Coherent Arm host-attach (Grace/Vera over NVLink-C2C) versus a discrete x86 host (EPYC/Xeon over PCIe) — and the memory model, blast radius, and supply-coupling each commits you to.
- The GPU:CPU ratio your workload actually needs — the training-era 8:1 norm versus the agentic-inference rebalance toward 2:1 and below — because it sets host core count, host memory, and per-rack CPU spend.
- The host memory architecture: LPDDR5X soldered into a coherent superchip versus socketed DDR5/MRDIMM you can size and service — capacity, bandwidth, RAS, and serviceability all trade here.
- Whether your fleet is one composition or several — a dense coherent trainer SKU and a CPU-heavy agentic-inference SKU are different machines, and pretending one part serves both strands GPUs or strands cores.
- Whether the system-composition decision is bundled (you take the vendor's rack-scale CPU:GPU ratio as shipped) or disaggregated (you compose host and accelerator yourself) — and what that costs you in flexibility versus integration risk.
An accelerator does not run a workload by itself. It runs as the muscle of a system — a host CPU that loads the model, drives the data pipeline, schedules kernels, services interrupts, runs the OS and the container runtime, terminates the storage and management network, and increasingly does real application work: tokenization, retrieval, sandboxed tool execution, agent orchestration, and the reward-evaluation loop of reinforcement learning. The question this chapter answers is not "which GPU" — that is Chapter 7.2 through Chapter 7.5. It is the question that sits one level up: what machine do you wrap around the accelerator, and in what ratio. Get it wrong in one direction and the host starves the GPUs — they idle waiting on data that the CPU pipeline cannot deliver, and you have paid for the most expensive silicon in the building to wait. Get it wrong in the other direction and you have bought host cores and host memory that no kernel ever touches — dead capex riding on a power budget you are otherwise fighting to conserve.
This was a quiet, settled question for the training era. The answer was "as little CPU as you can get away with": eight GPUs per host, the cheapest x86 that could keep the pipeline fed, and as much of the silicon budget as possible spent on accelerators. Two forces broke that settlement. First, coherent host-attach — NVIDIA's NVLink-C2C welding a Grace (and now Vera) Arm CPU to the GPU as a single memory-coherent superchip — turned the host from a discrete component into part of the accelerator's memory hierarchy. Second, and more disruptively, agentic inference and RL shifted real, sustained compute back onto the host. The orchestration layer that schedules sub-agents, routes tool calls, runs retrieval, and decides whether a task is done is CPU work, and it does not fit in the slivers of host left over from an 8:1 training box. The industry is re-deciding system composition in real time, and the fork has consequences that propagate into power, density, supply, and serviceability for the life of the fleet.
The two host-attach models
There are two physically and architecturally distinct ways to attach a host CPU to accelerators, and the choice is the first fork in system composition. They differ in the interconnect, the memory model the programmer sees, the failure blast radius, and — critically in 2026 — in how tightly they couple you to a single vendor's supply chain.
Coherent host-attach (the superchip model). NVIDIA's Grace-Blackwell and Vera-Rubin packages connect an Arm host CPU to the GPU(s) over NVLink-C2C, a chip-to-chip link delivering ~900 GB/s of coherent bandwidth on Grace-class parts (rising on Vera). "Coherent" is the load-bearing word: the CPU's LPDDR5X and the GPU's HBM live in one cache-coherent address space, so the GPU can reach host memory and the CPU can reach HBM with hardware coherence, no explicit copy, no staging through a bounce buffer. The host memory becomes a fast, large NUMA tier behind the GPU — the natural home for KV-cache spillover, large embedding tables, MoE expert weights that do not fit in HBM, and the working set of long-context and agentic inference. The cost: the CPU is welded to the GPU. You take the ratio the package ships (2 GPUs : 1 Grace in GB200; the Vera-Rubin pairing in 2026), the LPDDR5X capacity is soldered and fixed, and you are buying one vendor's CPU, GPU, and interconnect as an indivisible unit.
Discrete host-attach (the PCIe model). The classic HGX/8-GPU baseboard hangs accelerators off a pair of x86 sockets (AMD EPYC or Intel Xeon) over PCIe — Gen5 at ~64 GB/s per x16, Gen6 at ~128 GB/s. The CPU and GPU keep separate memory spaces; moving data between them is an explicit DMA over the PCIe link or, for GPU-to-GPU, a sideband over NVLink that bypasses the host entirely. You give up coherent host memory — host RAM is not a transparent extension of HBM, and PCIe bandwidth is an order of magnitude below NVLink-C2C — but you gain composability. You choose the CPU vendor, the core count, the socketed DDR5 capacity (terabytes if you want them), and the GPU:CPU ratio independently, and you can service or upgrade the host without touching the accelerator. For workloads that do not need coherent host memory — most discrete-GPU inference, and any fleet that values dual-source supply over the last increment of host-GPU bandwidth — this is the pragmatic default.
| Axis | Coherent Arm (Grace / Vera + NVLink-C2C) | Discrete x86 (EPYC / Xeon + PCIe) | What the choice costs you |
|---|---|---|---|
| Host-to-GPU link | NVLink-C2C ~900 GB/s coherent (rising on Vera) | PCIe Gen5 ~64 GB/s / Gen6 ~128 GB/s per x16 | The coherent link offers ~7–14x the host-GPU bandwidth (900/128 to 900/64), plus hardware coherence instead of explicit copies |
| Memory model | One cache-coherent address space; HBM + host LPDDR5X unified | Separate CPU and GPU spaces; DMA across PCIe | Coherent host RAM as an HBM spill tier vs manual staging |
| Host memory | Soldered LPDDR5X — Grace 480 GB / Vera up to 1.5 TB | Socketed DDR5 / MRDIMM — multiple TB, field-serviceable | Bandwidth-per-watt vs capacity, RAS, and serviceability |
| GPU:CPU ratio | Fixed by package (2:1 in GB200; Vera-Rubin pairing) | Composed freely (8:1 down to 1:1) | Take the vendor's ratio vs size it to your workload |
| CPU ISA / vendor | Arm (NVIDIA Olympus/Neoverse cores), single-source | x86, dual-source (AMD + Intel) | CUDA-tight integration vs portability and supply hedging |
| Blast radius / service | CPU + GPU fail and refresh as one welded unit | Host serviceable independently of accelerators | Tighter integration vs independent repair and upgrade |
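
The bandwidth row of the table can be checked with a few lines of arithmetic. The sketch below takes the headline figures quoted above at face value (vendor headline numbers do not always use the same per-direction convention, so treat the ratios as rough) and ignores latency, protocol overhead, and contention. The 100 GB payload is a hypothetical spill size chosen for illustration, not a figure from the text.

```python
# Headline link rates in GB/s, exactly as quoted in the table above.
LINK_GBPS = {
    "nvlink_c2c": 900.0,
    "pcie_gen5_x16": 64.0,
    "pcie_gen6_x16": 128.0,
}


def bandwidth_advantage(fast: str, slow: str) -> float:
    """Ratio of two headline link rates."""
    return LINK_GBPS[fast] / LINK_GBPS[slow]


def transfer_seconds(payload_gb: float, link: str) -> float:
    """Idealized time to move payload_gb over a link at its full headline rate."""
    return payload_gb / LINK_GBPS[link]


if __name__ == "__main__":
    for slow in ("pcie_gen5_x16", "pcie_gen6_x16"):
        print(f"C2C vs {slow}: {bandwidth_advantage('nvlink_c2c', slow):.1f}x")

    payload_gb = 100.0  # hypothetical spill size, not a figure from the text
    for link in LINK_GBPS:
        secs = transfer_seconds(payload_gb, link)
        print(f"{payload_gb:.0f} GB over {link}: {secs:.2f} s")
```

The ratios come out at roughly 14x against Gen5 and 7x against Gen6, which is where the table's range comes from. The transfer times make the same point in seconds: the payload that crosses the coherent link in about a tenth of a second takes well over a second on a Gen5 x16 slot.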
The GPU:CPU ratio — and why agentic AI is rebalancing it
The GPU:CPU ratio is the single number that most concisely captures system composition. For the training era it was effectively a constant: 8:1, counted per host. An HGX baseboard carried eight accelerators and one dual-socket host (4:1 if you count sockets rather than hosts) whose job was to keep the data pipeline full and stay out of the way. The host did almost no application compute: pre-training is GPU-bound on collectives and matmul, and the CPU's contribution is loading shards, augmenting data, and launching kernels. The rational move was to minimize host spend, because the dollars and the watts belonged to the GPUs. Even host memory got cut: neocloud reference builds trimmed training-server RAM from 2 TB to 1 TB to save roughly $13.6k per server, precisely because the trainer never used it.
Agentic inference and RL break the constant. The work that defines an AI agent (decomposing a request into sub-tasks, scheduling them, routing tool calls, passing state between sub-agents, running retrieval, executing code in sandboxes, and evaluating whether the original goal is met) is overwhelmingly host-side, control-heavy compute. It is branchy, latency-sensitive, and does not vectorize onto a GPU. RL training compounds this: every action an agent takes during a rollout has to be executed and scored, and that evaluation loop runs on the CPU. As agents proliferate, the host stops being a feeder and becomes a co-processor doing sustained work, and the 8:1 ratio that treated the GPU as the only scarce resource in the node no longer holds. Note that the industry projections quote the ratio the other way around, as CPU:GPU. TrendForce projects it shifting from the training-era 1:4–1:8 toward 1:1–1:2 in agentic deployments (in this chapter's GPU:CPU notation, from 4:1–8:1 down to 1:1–2:1); Intel frames the same move as 1:8 heading toward 1:1. Arm goes further on the aggregate: roughly 30 million CPU cores per gigawatt in a traditional AI data center rising to ~120 million per gigawatt in the agent era, a fourfold increase in host cores per unit of power, and a >$100B data-center-CPU opportunity by 2030.
The consequence for system composition is concrete. A fleet scoped at 8:1 for training and then repurposed for agentic serving will be CPU-starved: the orchestration and tool-execution layer saturates the thin host, queues form ahead of the GPUs, tail latency blows past the SLO, and accelerators sit underutilized behind a host bottleneck. The fix is not more GPUs — it is more CPU per GPU, which means a different node, often a different rack SKU, and a host-side power and memory budget the training composition never planned for. This is why NVIDIA designed Vera explicitly "for agents": 88 custom Olympus cores, 176 threads via spatial multithreading, and up to 1.5 TB of LPDDR5X at ~1.2 TB/s — a host built to do real work, not just feed.
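One way to turn "more CPU per GPU" into a number is a back-of-envelope sizing model. This is a minimal sketch, not a capacity-planning tool: it assumes host work per request is a fixed number of CPU-seconds, that the work parallelizes across cores, and that you want the host to run at or below a chosen utilization so queues do not form. The two workload profiles and the `target_util` default are hypothetical; only the 88-core figure comes from the text.

```python
def cores_needed_per_gpu(req_per_s_per_gpu: float,
                         cpu_s_per_req: float,
                         target_util: float = 0.5) -> float:
    """Host cores needed so one GPU's request stream never waits on the CPU.

    req_per_s_per_gpu * cpu_s_per_req is the number of cores kept busy on
    average; dividing by target_util leaves headroom against queueing.
    """
    if not 0.0 < target_util <= 1.0:
        raise ValueError("target_util must be in (0, 1]")
    return req_per_s_per_gpu * cpu_s_per_req / target_util


def gpus_per_cpu(cores_per_cpu: int,
                 req_per_s_per_gpu: float,
                 cpu_s_per_req: float,
                 target_util: float = 0.5) -> float:
    """Largest GPU:CPU ratio the host can feed (fractional; round down to buy)."""
    need = cores_needed_per_gpu(req_per_s_per_gpu, cpu_s_per_req, target_util)
    return cores_per_cpu / need


if __name__ == "__main__":
    CORES = 88  # Vera core count quoted in the text
    # Both profiles are hypothetical illustrations, not measurements.
    profiles = {
        "feeder (thin host work)": dict(req_per_s_per_gpu=16.0, cpu_s_per_req=0.03125),
        "agentic (tools + orchestration)": dict(req_per_s_per_gpu=16.0, cpu_s_per_req=2.0),
    }
    for name, p in profiles.items():
        print(f"{name}: {cores_needed_per_gpu(**p):.1f} cores/GPU, "
              f"up to {gpus_per_cpu(CORES, **p):.2f} GPUs per CPU")
```

With these made-up inputs, the feeder profile needs about one core per GPU, so the host is nowhere near the constraint. The agentic profile needs 64 cores per GPU, which pushes an 88-core host to roughly one GPU per CPU. The inputs worth measuring on your own workload are the two the model is most sensitive to: CPU-seconds per request and requests per second per GPU.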
Host memory: LPDDR5X soldered vs DDR5/MRDIMM socketed
The memory the host carries is the third axis of composition, and it tracks the host-attach fork tightly. The coherent superchips use LPDDR5X, the low-power DRAM from the mobile world, soldered directly onto the package. The reasons are bandwidth-per-watt and density: Grace pairs 480 GB of LPDDR5X at ~546 GB/s with a ~250 W TDP, and NVIDIA's framing is roughly 2x the bandwidth at half the power of conventional socketed server memory. Vera pushes this to up to 1.5 TB at ~1.2 TB/s. In a power-bound rack where every watt spent on host memory is a watt not spent on accelerators, LPDDR5X's efficiency is the whole point. The price you pay is rigidity: the capacity is fixed at manufacture, you cannot add or upgrade it, and a memory fault is a package fault. There is no DIMM to pull and replace, and LPDDR's RAS posture is weaker than that of registered server DRAM with full ECC and post-package repair.
The discrete x86 hosts use socketed DDR5, and increasingly MRDIMM (multiplexed-rank DIMMs) on the Intel side to claw back the bandwidth gap — Granite Rapids reaches MRDIMM-8800 against AMD's DDR5-6000/6400 on EPYC Turin. Socketed memory gives you what the superchip cannot: you size capacity to the workload (multiple terabytes for embedding tables, large KV-cache, or in-memory retrieval indices), you get full server-grade RAS, and you field-service or upgrade memory without touching the accelerators. The cost is power and density — registered DDR5 burns more per gigabyte than LPDDR5X, and the DIMM slots consume board area and a slice of the rack's thermal and electrical budget. The decision mirrors the host-attach fork: take the superchip and you take its soldered LPDDR5X efficiency-and-rigidity bargain; take the discrete host and you can spend power to buy capacity, bandwidth, RAS, and serviceability on your own terms.
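It can help to normalize host memory per GPU before comparing the two models, because the raw capacities are quoted per host and the hosts carry different GPU counts. The sketch below uses only capacities and ratios that appear in this chapter (Grace's 480 GB at 2 GPUs per Grace, and the 2 TB and 1 TB training-server builds at 8 GPUs per host). Treating 1 TB as 1024 GB is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostConfig:
    name: str
    host_mem_gb: float
    gpus_per_host: int

    @property
    def host_gb_per_gpu(self) -> float:
        """Host memory available behind each GPU."""
        return self.host_mem_gb / self.gpus_per_host


# Capacities and ratios quoted in the text; 1 TB is taken as 1024 GB here.
CONFIGS = [
    HostConfig("GB200: one Grace per 2 GPUs", 480, 2),
    HostConfig("8-GPU x86 host, 2 TB socketed", 2 * 1024, 8),
    HostConfig("8-GPU x86 host, 1 TB socketed", 1 * 1024, 8),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: {cfg.host_gb_per_gpu:.0f} GB of host memory per GPU")
```

On these figures the soldered superchip lands at 240 GB per GPU, close to the 256 GB per GPU of a 2 TB eight-GPU host and well above the trimmed 1 TB build at 128 GB. The socketed host's advantage is therefore less about the default capacity and more about the ability to raise it, or to change the GPU count behind it, after purchase.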
System composition as a fleet decision, not a node decision
A common mistake is treating composition as a single answer. A real AI operation runs more than one workload shape, and the host-attach model, the GPU:CPU ratio, and the host memory that suit a pre-training trainer are the wrong answers for an agentic-inference fleet — and vice versa. The discipline is to recognize that system composition is plural, and to compose distinct node and rack SKUs for the workloads that actually dominate your draw.
A training SKU wants the coherent superchip ratio as shipped (2:1 in GB200), minimal host application compute, host memory sized to the data pipeline not to the model, and every spare watt routed to accelerators. An agentic-inference SKU wants the opposite: a fat host — many cores for orchestration and tool execution, large host memory for KV-cache and retrieval state, and a GPU:CPU ratio near 1:1–2:1 so the host never becomes the queue ahead of the GPUs. A batch-inference SKU sits between them, throughput-bound and tolerant of a thinner host. Forcing one composition across all three is how you end up simultaneously paying for unused host memory on the trainers and starving the agentic fleet of the CPU it needs. This is the same plurality the archetype framework demands at facility scale (Chapter 1.1) — here it lands inside the node.
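The cost of forcing one composition across workloads can be sketched as a stranding calculation. Everything here is illustrative: the SKU shapes and the cores-per-GPU demands are hypothetical, and the model assumes GPU throughput scales linearly with the host cores available to feed it, which real systems only approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSku:
    name: str
    gpus: int
    host_cores: int

    @property
    def cores_per_gpu(self) -> float:
        return self.host_cores / self.gpus


@dataclass(frozen=True)
class Workload:
    name: str
    cores_per_gpu: float  # host cores needed to keep one GPU busy


def stranding(sku: NodeSku, wl: Workload) -> tuple[str, float]:
    """Return which resource is stranded and what fraction of it sits idle."""
    supply = sku.cores_per_gpu
    if supply < wl.cores_per_gpu:
        # Host is the bottleneck: only supply/demand of the GPUs can be fed.
        return ("gpus", 1.0 - supply / wl.cores_per_gpu)
    # Host has slack: the surplus cores do no work.
    return ("cores", 1.0 - wl.cores_per_gpu / supply)


if __name__ == "__main__":
    # Hypothetical SKUs and demands, chosen only to illustrate the mismatch.
    skus = [NodeSku("trainer SKU", gpus=8, host_cores=64),
            NodeSku("agentic SKU", gpus=2, host_cores=64)]
    workloads = [Workload("pre-training", 4.0),
                 Workload("agentic inference", 32.0)]
    for sku in skus:
        for wl in workloads:
            resource, frac = stranding(sku, wl)
            print(f"{wl.name} on {sku.name}: {frac:.0%} of {resource} stranded")
```

With these inputs, agentic inference on the trainer SKU strands three quarters of the GPUs, and pre-training on the agentic SKU strands most of the cores. Neither SKU is wrong in isolation; the waste appears only when a workload lands on the composition built for the other one.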
Deep dive: why the host starves the GPU (and how to see it before you buy)
The failure mode that justifies this entire chapter is GPU starvation by the host, and it is subtle because the symptom — low GPU utilization — looks like a workload problem, not a composition problem. The mechanism: an accelerator can only do work that has been staged for it. If the host cannot decode the next batch, run the tokenizer, service the retrieval call, execute the agent's tool, and launch the kernel fast enough, the GPU drains its queue and idles. On a training box this rarely binds — pre-training is GPU-bound and the 8:1 host has cycles to spare. On an agentic-inference box it binds constantly, because every request fans out into host-side orchestration and tool calls that contend for the same thin host the training composition specced.
You can see this before you buy by profiling the host-side critical path, not just the GPU. Measure the CPU time per request spent in tokenization, retrieval, sandbox execution, and orchestration, and the host memory footprint of the KV-cache and retrieval state. If host-side time is a large enough fraction of end-to-end latency that the GPU queue runs dry between requests, or if host memory pressure forces eviction, you are CPU-bound and an 8:1 ratio will strand accelerators. The fix is to lower the GPU:CPU ratio (more CPU per GPU), widen the host (more cores), or fatten host memory, and to recognize that adding GPUs to a CPU-bound fleet buys nothing but idle silicon. The goodput discipline that Chapter 9.4 applies to training has a direct analogue here: utilization, not nameplate FLOPS, is the number that pays the bill.
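A per-request stage timer is one simple way to take that measurement. The sketch below is an assumption-laden illustration rather than a profiler: the stage names are hypothetical, `time.sleep` stands in for real work, and it records wall-clock time with `time.perf_counter`. One caveat matters in practice: GPU launches are typically asynchronous, so the GPU stage must block until the result is actually ready, or its time will be undercounted and the host share overstated.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestProfile:
    """Accumulates wall-clock seconds per named stage of one request."""

    def __init__(self, gpu_stages=("gpu_generate",)):
        self.gpu_stages = set(gpu_stages)
        self.seconds = defaultdict(float)

    def record(self, stage: str, seconds: float) -> None:
        self.seconds[stage] += seconds

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def host_fraction(self) -> float:
        """Share of measured time spent outside the GPU stages."""
        total = sum(self.seconds.values())
        host = sum(s for k, s in self.seconds.items() if k not in self.gpu_stages)
        return host / total if total > 0 else 0.0


if __name__ == "__main__":
    prof = RequestProfile()
    # time.sleep stands in for real work; swap in your own calls.
    with prof.stage("tokenize"):
        time.sleep(0.01)
    with prof.stage("retrieval"):
        time.sleep(0.02)
    with prof.stage("gpu_generate"):
        time.sleep(0.03)  # must block until the GPU result is actually ready
    with prof.stage("tool_sandbox"):
        time.sleep(0.04)
    print(f"host-side share of latency: {prof.host_fraction():.0%}")
```

Aggregating `host_fraction()` across a representative request mix, and looking at the tail rather than the mean, gives a first-order read on whether the candidate host is the bottleneck before any hardware is ordered.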
Deep dive: coherent host memory as an HBM extension — what it actually buys
The headline case for the coherent superchip is that host LPDDR5X becomes a usable extension of HBM. It is worth being precise about when this is real value versus marketing. HBM is the scarce, expensive, capacity-constrained tier (Chapter 7.6) — a Blackwell GPU carries 192 GB, a Grace host carries 480 GB of LPDDR5X. Across NVLink-C2C's ~900 GB/s coherent link, that host memory is reachable from the GPU as a second NUMA tier: slower than HBM, but enormously larger and, crucially, addressable without an explicit copy. For workloads whose working set exceeds HBM — long-context inference with multi-gigabyte KV-cache, wide-MoE serving where the full expert set dwarfs HBM, embedding-heavy recommendation, or in-memory retrieval over large indices — this lets you spill gracefully into a coherent tier instead of falling off a cliff to PCIe-attached host memory or to storage.
The discrete PCIe host can do none of this transparently. Host memory there is a separate space at ~64–128 GB/s, reached by explicit DMA, and the programmer (or the framework) must manage the staging by hand. So the coherent model's payoff is real precisely for the memory-hungry inference workloads that are growing fastest. The same pressure is reshaping the memory hierarchy above the host, through CXL pooling and memory-semantic fabrics (Chapter 8.4). But the payoff is zero for a workload whose state fits in HBM and never reaches across the link. Buying coherence you do not exercise is the inverse of starving a host you have overloaded: both errors come from specifying composition without profiling the workload.
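A two-tier bandwidth model makes the "real value versus zero value" distinction concrete. This is a rough sketch under stated assumptions: the GPU reads its whole working set once, whatever fits in HBM streams at HBM speed, and the remainder crosses the host link at its headline rate. The 192 GB HBM capacity and the link rates come from the text; the `hbm_gbps` default of 8000 GB/s is an assumed placeholder, not a figure from this chapter. The model ignores latency, access skew, and the software cost of explicit staging on PCIe.

```python
def sweep_seconds(working_set_gb: float,
                  link_gbps: float,
                  hbm_gb: float = 192.0,
                  hbm_gbps: float = 8000.0) -> float:
    """Idealized time for the GPU to read the whole working set once.

    Data that fits in HBM streams at hbm_gbps; the remainder is read
    across the host link at link_gbps. hbm_gbps is an assumed value.
    """
    in_hbm = min(working_set_gb, hbm_gb)
    spilled = max(working_set_gb - hbm_gb, 0.0)
    return in_hbm / hbm_gbps + spilled / link_gbps


def effective_gbps(working_set_gb: float, link_gbps: float, **kw) -> float:
    """Working-set size divided by sweep time: the blended bandwidth."""
    return working_set_gb / sweep_seconds(working_set_gb, link_gbps, **kw)


if __name__ == "__main__":
    links = {"NVLink-C2C": 900.0, "PCIe Gen5 x16": 64.0}
    for ws in (150.0, 300.0):  # hypothetical working sets: fits / spills
        for name, gbps in links.items():
            print(f"{ws:.0f} GB via {name}: "
                  f"{sweep_seconds(ws, gbps) * 1000:.1f} ms per sweep, "
                  f"{effective_gbps(ws, gbps):.0f} GB/s effective")
```

For the working set that fits in HBM, the two links give identical results, which is the zero-payoff case. Once the working set spills, the host link dominates the sweep time, and the gap between the coherent link and PCIe shows up almost one-for-one in effective bandwidth.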
The bundling fork: take the ratio, or compose it yourself
One last decision sits underneath all of this: how much of system composition you let the vendor decide for you. The rack-scale coherent platforms (GB200 NVL72, Vera-Rubin) ship as integrated units with the GPU:CPU ratio, the host CPU, the host memory, and the interconnect all fixed at the factory. You buy a rack, not a node, and the composition is a given. That is genuinely valuable: the integration risk of welding 72 GPUs, 36 CPUs, NVSwitch trays, and a liquid-cooling loop into one coherent domain is enormous, and the vendor has absorbed it. But it means the GPU:CPU ratio, the host ISA, and the host memory are not yours to tune. If your agentic workload wants a fatter host than the package ships, your only lever is a different SKU, not a different host.
The discrete path keeps composition in your hands. You build the node — choosing AMD or Intel, the core count, the socketed memory, the GPU count per host, the ratio — and you own both the flexibility and the integration risk. For a fleet with heterogeneous workloads and a procurement strategy that refuses to single-source, this is the path that lets composition track the workload. The 2026 reality is that most large operators run both: bundled coherent racks where the workload genuinely needs the coherent domain (training, memory-hungry inference), and composed discrete nodes where composability and dual-source supply matter more. The decision is not which path is correct in the abstract — it is which workloads belong on which path, and that lands you right back at the fleet-plurality discipline above. → procurement and fleet composition in Chapter 7.11.