
Chapter 1.4

Post-Training, Fine-Tuning & RL: The Hybrid Middle

Post-training is not a smaller training cluster — it is a fleet of inference engines feeding a comparatively tiny trainer, and the operator who scopes it as either pure training or pure inference strands capital on the half they got wrong.

GOODPUT · DENSITY-RAMP

What you'll decide here

Whether your post-training workload is fine-tune-shaped (small, bursty, LoRA/QLoRA-friendly) or RL-shaped (rollout-dominated, inference-heavy) — because they want opposite hardware and the label "post-training" hides the difference.
Whether to run RL collocated (rollout and trainer time-sharing one GPU pool) or disaggregated (a dedicated rollout fleet feeding a separate trainer) — the fork that sets your fabric, your utilization, and your staleness budget.
The staleness bound you will tolerate between the policy that generates rollouts and the policy being updated — the single knob that trades wall-clock throughput against convergence stability.
Whether to host post-training on dedicated capacity or as a time-shared tenant on a training or inference fabric — because the demand is spiky, project-based, and notoriously hard to forecast.
For fine-tuning specifically: LoRA/QLoRA on shared inference-class GPUs versus a full-weight fine-tune on training-class hardware — a 10x-plus difference in the capacity you must reserve.

Of the five workload archetypes in Chapter 1.1, post-training is the one operators most reliably mis-scope, because the word hides a contradiction. "Post-training" sounds like training — a gradient-descent job that updates weights — and so it gets specced like training: dense racks, a non-blocking fabric, the works. But the dominant cost of modern post-training is not the gradient step. It is generating the data the gradient step learns from — and that generation is autoregressive inference. The result is a workload whose center of mass sits on the inference side of the master fork even though its purpose is to change the weights. Get the balance right and you build a cheap, flexible, high-goodput cluster. Get it wrong and you pay for a trainer's fabric to run an inference fleet, or you starve the learner on an inference cluster that has no trainer.

This chapter is the engineering of the hybrid middle. We map the post-training landscape — supervised fine-tuning (SFT), preference optimization (RLHF/RLAIF/DPO), and large-scale reinforcement learning for reasoning (PPO, GRPO and relatives). We derive why RL is inference-heavy training from the token economics of rollout generation, then walk the two infrastructure forks that follow: collocated vs disaggregated execution, and the staleness budget that governs asynchronous designs. We size the fine-tuning end of the spectrum — LoRA/QLoRA vs full fine-tune — and close on capacity planning for spiky, project-based demand, where the real decision is dedicated capacity versus a time-shared tenancy on someone else's fabric.

The post-training landscape

Post-training is everything that happens to a model after the pre-training run produces a base model. It spans a spectrum from cheap-and-bounded to expensive-and-open-ended, and the infrastructure profile changes radically across it. A facility scoped for one end serves the other badly.

Supervised fine-tuning (SFT) is the closest to classic training: a labeled dataset of prompt/response pairs drives ordinary gradient descent. It is small (often thousands to low-millions of examples), bursty (it starts and finishes inside hours to days), and tightly coupled only at the scale of the model being tuned — which for most fine-tunes is a single node or a small handful of them, not a thousand-GPU supercomputer. SFT is the cheapest post-training to host and the easiest to time-share.

Preference optimization, meaning RLHF (human feedback), RLAIF (AI feedback), and the simpler offline variants like DPO, aligns a model to preferences over outputs. Classic PPO-style RLHF is the heavyweight: it runs up to four models concurrently (a policy/actor, a frozen reference, a reward model, and a value/critic), so the weights alone for a 70B-class actor plus its companions can demand 8–16 H100-class GPUs before optimizer state and activations enter the budget. DPO collapses that to a contrastive loss over offline preference pairs, with no reward model, no critic, and no online sampling; it keeps only the policy and a frozen reference, which is why it looks and costs like SFT. The choice between PPO-style and DPO-style alignment is therefore as much an infrastructure decision as an algorithmic one.
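The 8–16 GPU figure can be sanity-checked with weights-only arithmetic. The sketch below is an illustration, not the source's derivation: it assumes all four PPO models are 70B parameters (in practice the reward model and critic are often smaller) and that each GPU has 80 GB of HBM. The function name and parameters are hypothetical.

```python
import math

def gpus_for_weights(n_models: int, params_billion: float,
                     bytes_per_param: int, gpu_mem_gb: float) -> int:
    """Minimum GPUs needed to hold weights only (no optimizer state, no activations)."""
    # Billions of parameters times bytes per parameter gives decimal gigabytes.
    total_gb = n_models * params_billion * bytes_per_param
    return math.ceil(total_gb / gpu_mem_gb)

# Assumptions (not from the source): four 70B models, 80 GB of HBM per GPU.
gpus_bf16 = gpus_for_weights(4, 70, 2, 80)  # 16-bit weights: 560 GB total
gpus_fp32 = gpus_for_weights(4, 70, 4, 80)  # 32-bit weights: 1120 GB total
print(gpus_bf16, gpus_fp32)
```

Under these assumptions the floor is 7 GPUs at 16-bit and 14 at 32-bit. Rounding up to whole 8-GPU nodes gives 8 and 16, which is one plausible reading of the quoted range.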

Large-scale RL for reasoning — the GRPO/PPO family that produced the 2025–2026 wave of reasoning models — is the archetype that breaks the "post-training = training" intuition entirely. It alternates two phases with opposite hardware profiles: a rollout phase that samples long trajectories from the current policy (pure inference), and a policy-update phase that runs a comparatively small synchronous gradient step. The rollout phase dominates. That single fact reorganizes the entire cluster, and the rest of this chapter is its consequences. → the algorithmic detail of GRPO/PPO and async off-policy RL lives in the training-frameworks treatment of Chapter 10.8.

The fork inside the fork: fine-tune-shaped vs RL-shaped

Before you size anything, classify the workload. Fine-tune-shaped post-training (SFT, DPO, LoRA/QLoRA adapters) is small, bursty, and gradient-dominated — it wants a few high-density nodes, or better, idle slices of an existing inference or training fabric, and it time-shares beautifully. RL-shaped post-training (PPO/GRPO for reasoning, agentic RL) is rollout-dominated and inference-heavy — it wants a large inference-class generation pool feeding a small trainer, coupled asynchronously, and it is the one you must design a disaggregated cluster for. Putting an RL-shaped job on a fine-tune-shaped allocation starves it; putting a fine-tune-shaped job on an RL cluster wastes the rollout fleet. The label "post-training" hides the difference; the shape is what you size to.
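The classification step can be written down as a one-line rule. The sketch below assumes the discriminating question is whether the job samples from the live policy inside the training loop; the class, field, and job names are hypothetical labels for illustration.

```python
from dataclasses import dataclass

@dataclass
class PostTrainingJob:
    name: str
    samples_from_live_policy: bool  # True if the job generates rollouts during training

def workload_shape(job: PostTrainingJob) -> str:
    """Classify by whether generation from the current policy is in the training loop."""
    return "rl-shaped" if job.samples_from_live_policy else "fine-tune-shaped"

jobs = [
    PostTrainingJob("sft", False),
    PostTrainingJob("dpo", False),
    PostTrainingJob("qlora-adapter", False),
    PostTrainingJob("ppo-rlhf", True),
    PostTrainingJob("grpo-reasoning", True),
]
for job in jobs:
    print(job.name, "->", workload_shape(job))
```

A real intake process would also record trajectory length and model size, but this single flag is enough to decide which of the two sizing paths a job goes down.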

Why RL is inference-heavy training

Here is the derivation, because the conclusion is counterintuitive and the capital at stake is large. In RL for reasoning, the model improves by trying things and being rewarded. Each training step requires the current policy to generate candidate responses — rollouts — that are then scored and used to compute a gradient. For reasoning and agentic tasks, those rollouts are long: chains of thought, tool calls, and multi-turn trajectories routinely run 10K–100K+ tokens each, and you generate many of them per prompt (GRPO samples a group of completions per prompt precisely to estimate a baseline). Generation is autoregressive decode — one token at a time, memory-bandwidth-bound, the same physics as online inference. The gradient update that follows consumes a fraction of the wall-clock.
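To see how quickly rollout volume grows, it helps to multiply the three factors out. The sketch below uses made-up inputs (256 prompts per step, a group of 8 completions, 16K tokens each, and a fleet decoding 50,000 tokens per second in aggregate); none of these values come from the source.

```python
def rollout_tokens_per_step(prompts: int, group_size: int, mean_tokens: int) -> int:
    """Tokens the rollout fleet must decode before one policy update can run."""
    return prompts * group_size * mean_tokens

def generation_seconds(tokens: int, fleet_tokens_per_s: float) -> float:
    """Wall-clock to decode `tokens` at a given aggregate fleet throughput."""
    return tokens / fleet_tokens_per_s

# Hypothetical step: 256 prompts x 8 completions x 16K tokens.
tokens = rollout_tokens_per_step(prompts=256, group_size=8, mean_tokens=16_000)
seconds = generation_seconds(tokens, fleet_tokens_per_s=50_000)
print(tokens, round(seconds, 1))
```

With these inputs a single training step needs about 32.8 million generated tokens, roughly eleven minutes of decode at the assumed throughput, before the gradient update even starts.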

The numbers are stark. Across modern RL post-training systems, rollout generation accounts for roughly 80% of wall-clock time in agentic and reasoning RL, and at a 16K-token generation length rollouts consume on the order of 70% of the compute (multiple 2025–2026 RL-systems papers and practitioner reports). The bottleneck is HBM bandwidth — the rollout engines repeatedly stream weights and KV-cache through memory exactly as an inference server does. The consequence is unambiguous: an RL cluster is, by load, mostly an inference cluster. The trainer is the minority partner. Any scoping that treats the whole cluster as a synchronous trainer is paying for a non-blocking fabric and uniform max-density racks across a pool that spends four-fifths of its life doing decode.
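An Amdahl's-law calculation makes the same point from the other direction. Taking the roughly 80% rollout share as given, the sketch below asks how much a whole step speeds up when only one phase is accelerated; the speedup factors are hypothetical.

```python
def overall_speedup(rollout_share: float,
                    rollout_speedup: float = 1.0,
                    trainer_speedup: float = 1.0) -> float:
    """Amdahl-style speedup of a step whose time splits into rollout and trainer phases."""
    trainer_share = 1.0 - rollout_share
    new_time = rollout_share / rollout_speedup + trainer_share / trainer_speedup
    return 1.0 / new_time

# A 10x faster trainer barely moves the step time...
trainer_only = overall_speedup(0.8, trainer_speedup=10.0)
# ...while a 2x faster rollout fleet moves it much more.
rollout_only = overall_speedup(0.8, rollout_speedup=2.0)
print(round(trainer_only, 3), round(rollout_only, 3))
```

Even an infinitely fast trainer caps out at 1.25x under an 80% rollout share, which is why money spent on the trainer side of the cluster has such a low ceiling.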

Post-training sub-archetypes → infrastructure profile

Sub-archetype	Dominant cost	Rollout share	Coupling	Hardware it wants	Hosting
SFT (supervised fine-tune)	Gradient step on labeled pairs	None (no sampling)	Tight, but small scale (1–few nodes)	A few training-class nodes; liquid by density	Time-shares well; bursty tenant
DPO / offline preference	Single contrastive gradient step	None (offline pairs)	Tight, small scale	SFT-class; no reward-model fleet	Time-shares well
RLHF (PPO, 4-model)	Sampling + reward + actor/critic update	High (sampling-dominated)	Mixed: inference sampling + sync update	Inference pool + trainer; 8–16 GPUs just for 70B weights	Dedicated or large shared slice
RL for reasoning (GRPO/PPO)	Long-trajectory rollout generation	~70–80% of wall-clock	Async: rollout fleet + small trainer	Large inference-class rollout pool + small dense trainer	Dedicated disaggregated cluster
LoRA / QLoRA adapter	Low-rank delta on frozen base	None	Loose; single GPU possible	Inference-class GPU; 65B on one 48 GB card (QLoRA)	Fits on idle inference capacity

How each post-training flavor maps to hardware. "Rollout share" is the fraction of wall-clock spent on autoregressive generation; figures are 2025–2026 RL-systems ranges, with sources listed in the key numbers later in this chapter.

Collocated vs disaggregated: the RL infrastructure fork

Once you accept that an RL cluster is two workloads — a rollout fleet and a trainer — the central architectural question is whether they live in the same GPU pool or separate ones. This is the fork that sets utilization, fabric, and the failure modes you will fight.

Collocated designs put rollout and training on one shared resource pool and alternate phases in time: the GPUs generate, then switch to updating, then generate again. The win is simplicity and high utilization when the phases are balanced — no GPU sits idle waiting for the other half. The frameworks that pioneered this (the verl family and relatives) make it the default for moderate scale. The cost is that the pool must be provisioned for the more demanding phase, and the phase switch serializes generation and update — while the trainer runs, the rollout engines are idle, and vice versa. At large scale that serialization is exactly the wall-clock the rollout share was warning you about.

Disaggregated designs run a dedicated rollout/generation fleet and a separate trainer continuously and in parallel, connected by a weight-sync path. This is the architecture the 2025–2026 systems literature has converged toward (StreamRL, AReaL, ROLL Flash, ROLLART and kin) because it lets each side use best-fit hardware: route decode-heavy rollouts to bandwidth-optimized inference GPUs, the policy update to compute-optimized trainer GPUs, and even push environment/reward computation to CPU clusters or serverless. It also overlaps the two phases so neither stalls. The price is that disaggregation is asynchronous by construction — the trainer consumes rollouts generated by a slightly older policy — and that introduces the staleness problem treated next. It also demands a deliberate weight-broadcast path and a fabric that carries fresh policy weights to the rollout fleet fast enough to keep them from drifting too far off-policy.

Collocated vs disaggregated RL — the decision

Axis	Collocated (time-shared pool)	Disaggregated (separate fleets)
GPU utilization	High when phases balance; phases serialize, so generation waits on the update and vice versa	Both fleets busy continuously; overlap hides stalls
Hardware fit	One GPU type for both phases (compromise)	Best-fit: bandwidth-optimized rollout + compute-optimized trainer
Coupling / sync	Synchronous by default; on-policy, stable	Asynchronous; off-policy, needs a staleness bound
Fabric demand	One pool; modest scale-out	Weight-sync broadcast path + tolerant rollout fabric
Failure blast radius	A trainer fault stalls the whole pool	Rollout faults isolated from the trainer; restartable
Best fit	Moderate scale; simplicity; on-policy algorithms	Large scale; throughput-critical; heterogeneous hardware

Practitioner framing as of 2026 (verl / StreamRL / AReaL / ROLLART and the async-RL systems literature). The right answer is scale-dependent: collocated at moderate scale, disaggregated at large scale.
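A toy step-time model shows what the fork does to wall-clock. It is a sketch under strong assumptions: both phases scale linearly with GPU count, all GPUs are identical, and staleness has no convergence cost. The function names and workload numbers are hypothetical.

```python
def collocated_step_s(rollout_gpu_s: float, train_gpu_s: float,
                      n_gpus: int, switch_s: float) -> float:
    """One pool runs the phases back to back, paying a switch cost each step."""
    return (rollout_gpu_s + train_gpu_s) / n_gpus + switch_s

def disaggregated_step_s(rollout_gpu_s: float, train_gpu_s: float,
                         n_rollout_gpus: int, n_train_gpus: int) -> float:
    """Two fleets overlap; steady-state step time is set by the slower fleet."""
    return max(rollout_gpu_s / n_rollout_gpus, train_gpu_s / n_train_gpus)

# Hypothetical step: 8000 GPU-seconds of rollout, 2000 of training, 100 GPUs total.
colo = collocated_step_s(8000, 2000, n_gpus=100, switch_s=10)
balanced = disaggregated_step_s(8000, 2000, n_rollout_gpus=80, n_train_gpus=20)
mis_split = disaggregated_step_s(8000, 2000, n_rollout_gpus=50, n_train_gpus=50)
print(colo, balanced, mis_split)
```

In this model a well-balanced split beats the collocated pool only by the switch overhead, and a badly balanced split is slower than collocation. The larger gains claimed for disaggregation come from effects the model leaves out, such as best-fit hardware per phase.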

The staleness budget

Asynchrony is where disaggregated RL buys its speed, and where it can lose its convergence. When the rollout fleet and the trainer run in parallel, the trajectories the trainer learns from were generated by a policy that is now a few updates old. That gap is staleness (or policy lag), and it is the single most important tuning knob in an async RL system. Allow zero staleness and you are back to synchronous, on-policy training — stable, but the trainer idles while rollouts generate and the rollout fleet idles while the trainer steps. Allow unbounded staleness and the trajectories become so off-policy that the gradient points in the wrong direction; training destabilizes, with the characteristic gradient-norm spikes and collapses the literature documents.

The engineering answer in 2026 is a bounded staleness budget — let the trainer consume rollouts up to k policy versions old, no more — paired with off-policy corrections (importance-sampling variants, variance control) that keep the stale gradient honest. Well-tuned, the payoff is large: recent variance-controlled async methods match the best synchronous accuracy on long-context agentic RL roughly 2.5x faster in wall-clock (about 42 hours versus 105 hours in one published comparison) and keep improving past the synchronous ceiling. The infrastructure consequence is that the staleness bound is not just an ML hyperparameter — it sets how often you must broadcast fresh weights across the fabric to the rollout fleet, and therefore how much weight-sync bandwidth you must provision. Tighter staleness means more frequent broadcasts means more east-west traffic on the sync path. This is the GOODPUT lens applied to RL: the cluster's useful output is gradient steps that actually improve the policy, and both an idle trainer (too synchronous) and a divergent one (too stale) are goodput you paid for and did not get.
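The bounded-staleness rule can be sketched as a queue that tags each rollout with the policy version that produced it and discards anything more than k versions behind the trainer. This is a minimal illustration, not the mechanism of any named system; the class and field names are hypothetical, and it omits the off-policy corrections a real system would apply to the rollouts it keeps.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Rollout:
    policy_version: int                       # version of the policy that generated it
    tokens: list = field(default_factory=list)

class StalenessBoundedBuffer:
    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness    # the bound k, in policy versions
        self._queue = deque()

    def put(self, rollout: Rollout) -> None:
        self._queue.append(rollout)

    def get_batch(self, trainer_version: int, batch_size: int):
        """Return up to batch_size rollouts within the bound, plus a count of dropped ones."""
        batch, dropped = [], 0
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            lag = trainer_version - rollout.policy_version
            if lag <= self.max_staleness:
                batch.append(rollout)
            else:
                dropped += 1                  # too stale: generated but never trained on
        return batch, dropped

buffer = StalenessBoundedBuffer(max_staleness=2)
for version in [0, 1, 2, 3]:
    buffer.put(Rollout(policy_version=version))
batch, dropped = buffer.get_batch(trainer_version=3, batch_size=10)
print([r.policy_version for r in batch], dropped)
```

The dropped count is the quantity to watch: every discarded rollout is decode time that bought no gradient, so a high drop rate signals that weight broadcasts are too slow for the chosen bound.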

LoRA / QLoRA vs full fine-tune: sizing the cheap end

At the fine-tuning end of the spectrum, the decisive fork is parameter-efficient fine-tuning (PEFT) versus full-weight fine-tuning, and the capacity difference is enormous. A full fine-tune updates every weight, so it must hold the full model, its gradients, and optimizer state (for Adam, roughly the model again, twice over) in GPU memory — the reason a serious full fine-tune of a frontier-scale model still needs a training-class multi-node allocation. LoRA freezes the base weights and trains a small low-rank delta, cutting the trainable parameter count by orders of magnitude (often to ~0.1% of the model) and the optimizer-state memory with it. QLoRA goes further by quantizing the frozen base to 4-bit, so the memory you must reserve is dominated by a quantized read-only model plus a tiny adapter.
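The parameter reduction follows directly from the shape of the adapter. For one weight matrix of size d_in by d_out, a rank-r LoRA trains two thin matrices with r * (d_in + d_out) entries in total. The sketch below computes that ratio for an illustrative 8192 by 8192 projection at rank 8; both the dimensions and the rank are assumptions chosen for the example.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of the full matrix they adapt."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)   # A is d_in x r, B is r x d_out
    return lora_params / full_params

# Illustrative: a square 8192-wide projection adapted at rank 8.
fraction = lora_trainable_fraction(d_in=8192, d_out=8192, rank=8)
print(f"{fraction:.4%}")
```

This gives about 0.2% for the adapted matrix itself. Because adapters are often attached to only some of a model's matrices, the share of the whole model is lower still, consistent with the model-dependent ~0.1% figure.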

The consequence for capacity planning is concrete. QLoRA's headline result — fine-tuning a 65B-parameter model on a single 48 GB GPU, collapsing a requirement from >780 GB of GPU memory to <48 GB without degrading task quality — means a workload that would otherwise demand a multi-node training reservation now fits on one idle inference card. That reframes fine-tuning from a capacity problem into a scheduling problem: LoRA/QLoRA jobs are exactly the kind of small, interruptible tenant you backfill onto an inference fleet's idle troughs or a training cluster's gaps. The decision rule: if the fine-tune can be expressed as a low-rank adapter and quality holds, never reserve dedicated training capacity for it — fit it onto capacity you already own. Reserve full-fine-tune-class allocations only when adapters demonstrably do not reach the quality bar.
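The two memory figures can be approximated with per-parameter byte counts. The sketch below is one accounting that happens to reproduce the 780 GB number: it assumes 12 bytes per parameter for a 16-bit full fine-tune (2 for weights, 2 for gradients, 8 for two 32-bit Adam moments). That split is an assumption, not the paper's stated derivation. The QLoRA line counts only the 4-bit frozen base and ignores the adapter, quantization constants, and activations.

```python
def full_finetune_gb(params_billion: float, bytes_per_param: float = 12.0) -> float:
    """Weights + gradients + Adam moments; 12 bytes/param is an assumed breakdown."""
    return params_billion * bytes_per_param

def quantized_base_gb(params_billion: float, bits: int = 4) -> float:
    """Memory for the frozen base model alone at the given bit width."""
    return params_billion * bits / 8

full_gb = full_finetune_gb(65)     # 65B-parameter model, full fine-tune
base_gb = quantized_base_gb(65)    # same model, 4-bit frozen base
print(full_gb, base_gb, round(full_gb / base_gb, 1))
```

Under these assumptions the full fine-tune needs about 780 GB against roughly 32.5 GB for the quantized base, which leaves headroom on a 48 GB card for the adapter and activations.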

~80%

of wall-clock spent on rollout generation in agentic/reasoning RL post-training

2026 · 2025–2026 RL-systems papers (ROLL Flash, ROLLART) & Introl RLHF infra report

~70%

of compute consumed by rollouts at 16K-token generation length (RLVR long-CoT)

2025 · RLVR / long-CoT RL-systems analyses (arXiv)

10K–100K+

tokens per RL trajectory for reasoning/agentic tasks — the rollout that dominates cost

2026 · domain-research keyNumbers; reasoning-model RL reports

2.5x

wall-clock speedup of variance-controlled async RL vs synchronous at equal accuracy (~42h vs ~105h)

2026 · Stable Asynchrony / VCPO (arXiv 2602.17616)

8–16 GPUs

just to hold weights for a 70B PPO-RLHF stack (actor + reference + reward + critic), pre-optimizer

2025 · Introl RLHF infrastructure report

65B on 48 GB

QLoRA fine-tune on a single 48 GB GPU; memory cut from >780 GB to <48 GB without quality loss

2023 · QLoRA (Dettmers et al., arXiv 2305.14314)

~0.1%

share of parameters trained by a LoRA adapter vs full fine-tune (model-dependent)

2026 · LoRA (Hu et al.) / 2026 PEFT practitioner guides

from 8:1

GPU:CPU norm rebalancing toward more CPU per node as agentic RL adds rollout/tool/env load

2026 · domain-research (System Composition); SemiAnalysis

Capacity planning for spiky, project-based demand

Post-training demand does not behave like training or inference, and that is the planning problem. A pre-training run is a forecastable monolith — you know the GPU-months before you start. An online-inference fleet has a diurnal, statistically stable load you can size to a percentile. Post-training is neither: it is spiky and project-based. An alignment project, a new reasoning RL run, a customer fine-tune — each spins up, consumes a burst of heterogeneous capacity for days or weeks, then ends. The aggregate demand is a sum of poorly-correlated bursts that no single reservation curve fits well. This is the open capacity-planning question of the archetype: dedicated clusters versus time-shared fabric.

Dedicated post-training capacity buys predictability and isolation — your RL run is not preempted by someone else's inference spike — at the cost of low average utilization, because the spiky demand leaves the dedicated pool idle between projects. It is the right call only when post-training is a continuous, first-class product line (a frontier lab iterating on reasoning models) rather than an occasional activity. Time-shared tenancy — running post-training on slack from a training or inference fabric — inverts the tradeoff: high utilization because you backfill troughs, at the cost of contention and the engineering to preempt cleanly. The natural fit is striking: RL rollouts are inference, so they backfill an inference fleet's idle capacity with the same engines; LoRA/QLoRA fine-tunes are small and interruptible, so they slot into either fabric's gaps. The heterogeneity that makes RL hard to scope on a single fabric is exactly what makes it a good time-sharing citizen across two.
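The utilization penalty of a dedicated pool is easy to quantify once projects are written down as bursts. The sketch below sums a hypothetical project calendar day by day and compares mean demand with the peak a dedicated pool would have to be sized for. The projects, GPU counts, and 90-day horizon are invented for illustration.

```python
def dedicated_pool_utilization(projects, horizon_days: int):
    """projects: list of (start_day, end_day_exclusive, gpus). Returns (peak, mean, utilization)."""
    demand = [0] * horizon_days
    for start, end, gpus in projects:
        for day in range(start, min(end, horizon_days)):
            demand[day] += gpus
    peak = max(demand)
    mean = sum(demand) / horizon_days
    utilization = mean / peak if peak else 0.0
    return peak, mean, utilization

# Hypothetical quarter: an alignment project, a customer fine-tune wave, one RL run.
projects = [(0, 10, 64), (5, 20, 128), (60, 75, 256)]
peak, mean, utilization = dedicated_pool_utilization(projects, horizon_days=90)
print(peak, round(mean, 1), f"{utilization:.0%}")
```

A pool sized for the 256-GPU peak averages about 71 GPUs of demand here, or roughly 28% utilization. The remaining capacity is what a time-shared tenancy would hand back to the host fabric between projects.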

The two ways to strand capital on the hybrid middle

Over-build: scoping an RL cluster as if it were pre-training (uniform max-density racks, a 1:1 non-blocking back-end fabric across the whole pool) pays for trainer-grade interconnect on a pool that spends roughly 80% of its wall-clock doing decode. That bisection bandwidth never carries the traffic it was sized for. Under-build: scoping post-training as pure inference (a fleet of serving nodes with no trainer fabric and no weight-sync path) leaves the policy update with nowhere to run and no way to broadcast fresh weights, starving the learner. The discipline that avoids both is the one from Chapter 1.1: read coupling and interruption tolerance separately for the rollout sub-workload and the trainer sub-workload, and provision each to its own profile rather than averaging them into one wrong number.

Deep dive: the weight-sync path is the hidden network requirement

The fabric mistake operators make on disaggregated RL is to size the rollout fabric (tolerant, oversubscribable — it is inference) and forget the weight-sync path (the channel that broadcasts updated policy weights from the trainer to every rollout engine). This path is small in aggregate bytes but punishing in frequency: every time the trainer produces a new policy version within your staleness budget, the full updated weights (or a delta) must reach the entire rollout fleet before those engines drift too far off-policy. Tighten the staleness bound to protect convergence and you increase broadcast frequency; widen the rollout fleet to increase generation throughput and you increase the fan-out of every broadcast. The two scaling pressures multiply.

The consequence is a fabric you must design deliberately, not inherit. A pure inference fleet has no equivalent of this traffic — its weights are static between deployments — so an inference cluster repurposed for RL rollouts will lack the broadcast headroom unless you add it. The disaggregated-RL systems of 2025–2026 treat staleness-bounded weight synchronization as a first-class subsystem precisely because it is the coupling that an otherwise-inference rollout fleet cannot avoid. Size it from the staleness budget and the rollout fan-out, and route it so a weight broadcast does not collide with rollout result traffic. → topology and oversubscription mechanics in Chapter 8.5; the scale-up domain that bounds trainer parallelism in Chapter 8.2.
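A first-pass sizing of the weight-sync path needs only three inputs: the payload, the time allowed for each broadcast, and the number of receivers. The sketch below is a naive estimate under stated assumptions: full weights are sent every time (no deltas or compression), and the aggregate figure is total receiver-side ingress, which a tree or collective broadcast would spread across links rather than push from one source. The model size, 10-second budget, and 32 replicas are hypothetical.

```python
def weight_sync_gbps(params_billion: float, bytes_per_param: int,
                     sync_budget_s: float, n_replicas: int):
    """Return (per-replica Gb/s, aggregate ingress Gb/s) to land one full broadcast in budget."""
    payload_gb = params_billion * bytes_per_param        # decimal gigabytes per broadcast
    per_replica_gbps = payload_gb * 8 / sync_budget_s    # gigabits per second per receiver
    return per_replica_gbps, per_replica_gbps * n_replicas

# Hypothetical: 70B policy in 16-bit, 10 s allowed per broadcast, 32 rollout replicas.
per_replica, aggregate = weight_sync_gbps(70, 2, sync_budget_s=10, n_replicas=32)
print(per_replica, aggregate)
```

Halving the sync budget doubles the per-replica rate and doubling the fleet doubles the aggregate, which is the multiplication of the two scaling pressures in concrete terms.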

Deep dive: resource heterogeneity as a feature, not a bug

Pre-training prizes homogeneity — every node identical so the slowest straggler is as fast as possible. RL post-training prizes the opposite. Its phases want different silicon: rollout generation is decode, which rewards HBM bandwidth and tolerates older or cheaper inference-class accelerators; the policy update is compute-bound and rewards the densest trainer GPUs; environment simulation and reward scoring are often CPU work that should never touch a GPU at all. The 2026 disaggregated-RL systems lean into this, mapping each pipeline stage to best-fit hardware and even offloading stateless reward computation to serverless CPU infrastructure.

This is why RL is a natural home for a mixed fleet — last generation's inference GPUs that are no longer competitive for frontier serving can feed rollouts, while a smaller pool of current trainers does the updates. It is also why the GPU:CPU ratio is rebalancing away from the old training-era ~8:1 toward more CPU per node: agentic RL adds tool execution, retrieval, and sandboxed environment steps that are host-CPU work. An operator scoping RL capacity should plan for a deliberately heterogeneous bill of materials and a host-CPU budget heavier than a pre-training cluster's. → GPU:CPU ratios and system composition in Chapter 7.8; accelerator selection by role in Chapter 7.11.

The density-ramp angle: post-training inherits the cliff

Post-training does not escape the density-and-cooling physics of the other archetypes — it inherits a split version of it. The trainer pool, being a dense synchronous cluster in miniature, lands on the same cooling cliff as pre-training: if it runs current-generation accelerators it is over the ~41 kW air ceiling and direct-to-chip liquid is mandatory for that pool. The rollout pool is more forgiving — it is inference, so it can live on high-density air or rear-door heat exchangers depending on the accelerator. The trap is the same one from Chapter 1.1: if you might run training-class trainers, you must plumb that portion of the hall for liquid at scoping time, because crossing the cliff in a retrofit strands capacity. A post-training facility is rarely uniform: it is two thermal zones, and the irreversible substrate (floor loading, water, electrical headroom) must accommodate the denser of the two. → the cooling cliff is engineered in Chapter 5.1 and Chapter 5.4.

Post-training sits between the two archetypes it borrows from: pre-training in Chapter 1.2 (the synchronous trainer it shrinks) and inference in Chapter 1.3 (the rollout fleet it leans on). The archetype framework that demands you read coupling per-sub-workload is Chapter 1.1; procurement of the spiky, project-based capacity this chapter describes is the build-vs-buy-vs-rent fork in Chapter 1.6, with its economics in Chapter 1.8. The serving engineering that an RL rollout fleet reuses is Chapter 10.11; the RL/PPO/GRPO algorithm and framework detail is Chapter 10.8; the weight-sync and oversubscription fabric is Chapter 8.5 and the scale-up domain is Chapter 8.2; GPU:CPU composition for the CPU-heavier RL node is Chapter 7.8; and the cooling cliff the trainer pool inherits is Chapter 5.1.