Chapter 1.4
Post-Training, Fine-Tuning & RL: The Hybrid Middle
Post-training is not a smaller training cluster — it is a fleet of inference engines feeding a comparatively tiny trainer, and the operator who scopes it as either pure training or pure inference strands capital on the half they got wrong.
What you'll decide here
- Whether your post-training workload is fine-tune-shaped (small, bursty, LoRA/QLoRA-friendly) or RL-shaped (rollout-dominated, inference-heavy) — because they want opposite hardware and the label "post-training" hides the difference.
- Whether to run RL collocated (rollout and trainer time-sharing one GPU pool) or disaggregated (a dedicated rollout fleet feeding a separate trainer) — the fork that sets your fabric, your utilization, and your staleness budget.
- The staleness bound you will tolerate between the policy that generates rollouts and the policy being updated — the single knob that trades wall-clock throughput against convergence stability.
- Whether to host post-training on dedicated capacity or as a time-shared tenant on a training or inference fabric — because the demand is spiky, project-based, and notoriously hard to forecast.
- For fine-tuning specifically: LoRA/QLoRA on shared inference-class GPUs versus a full-weight fine-tune on training-class hardware — a 10x-plus difference in the capacity you must reserve.
Of the five workload archetypes in Chapter 1.1, post-training is the one operators most reliably mis-scope, because the word hides a contradiction. "Post-training" sounds like training — a gradient-descent job that updates weights — and so it gets specced like training: dense racks, a non-blocking fabric, the works. But the dominant cost of modern post-training is not the gradient step. It is generating the data the gradient step learns from — and that generation is autoregressive inference. The result is a workload whose center of mass sits on the inference side of the master fork even though its purpose is to change the weights. Get the balance right and you build a cheap, flexible, high-goodput cluster. Get it wrong and you pay for a trainer's fabric to run an inference fleet, or you starve the learner on an inference cluster that has no trainer.
This chapter is the engineering of the hybrid middle. We map the post-training landscape — supervised fine-tuning (SFT), preference optimization (RLHF/RLAIF/DPO), and large-scale reinforcement learning for reasoning (PPO, GRPO and relatives). We derive why RL is inference-heavy training from the token economics of rollout generation, then walk the two infrastructure forks that follow: collocated vs disaggregated execution, and the staleness budget that governs asynchronous designs. We size the fine-tuning end of the spectrum — LoRA/QLoRA vs full fine-tune — and close on capacity planning for spiky, project-based demand, where the real decision is dedicated capacity versus a time-shared tenancy on someone else's fabric.
The post-training landscape
Post-training is everything that happens to a model after the pre-training run produces a base model. It spans a spectrum from cheap-and-bounded to expensive-and-open-ended, and the infrastructure profile changes radically across it. A facility scoped for one end serves the other badly.
Supervised fine-tuning (SFT) is the closest to classic training: a labeled dataset of prompt/response pairs drives ordinary gradient descent. It is small (often thousands to low-millions of examples), bursty (it starts and finishes inside hours to days), and tightly coupled only at the scale of the model being tuned — which for most fine-tunes is a single node or a small handful of them, not a thousand-GPU supercomputer. SFT is the cheapest post-training to host and the easiest to time-share.
Preference optimization aligns a model to preferences over outputs; it covers RLHF (human feedback), RLAIF (AI feedback), and the simpler offline variants like DPO. Classic PPO-style RLHF is the heavyweight: it runs up to four models concurrently (a policy/actor, a frozen reference, a reward model, and a value/critic), so the weights alone for a 70B-class actor plus its companions can demand 8–16 H100-class GPUs before optimizer state and activations enter the budget. DPO collapses that to a contrastive loss over offline preference pairs, with no reward model, no critic, and no sampling during training (only the policy and a frozen reference remain), which is why it looks and costs like SFT. The choice between PPO-style and DPO-style alignment is therefore as much an infrastructure decision as an algorithmic one.
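The weights-only arithmetic behind that GPU count can be sketched as follows. This is a rough estimate, not a sizing tool: it assumes four equally sized 70B models, bf16 weights (2 bytes per parameter), and 80 GB cards. The function name and the `usable_fraction` knob are illustrative.

```python
import math

def weight_gpus(params_b: float, n_models: int, bytes_per_param: int = 2,
                gpu_mem_gb: float = 80.0, usable_fraction: float = 1.0) -> int:
    """GPUs needed just to hold model weights (no optimizer state, no activations).

    params_b is in billions of parameters, so params_b * bytes_per_param is in GB.
    usable_fraction is the (assumed) share of each card available for weights.
    """
    weights_gb = params_b * bytes_per_param * n_models
    return math.ceil(weights_gb / (gpu_mem_gb * usable_fraction))

# Assumed: actor, reference, reward model, and critic are all 70B, stored in bf16.
print(weight_gpus(70, 4))                       # every byte of HBM spent on weights
print(weight_gpus(70, 4, usable_fraction=0.5))  # half of each card kept free for the rest
print(weight_gpus(70, 1))                       # a single 70B model, for comparison
```

Under these assumptions the four-model case needs 7 cards with no headroom and 14 with half of each card held back, which is the same neighborhood as the 8–16 range once allocations are rounded to whole nodes.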
Large-scale RL for reasoning — the GRPO/PPO family that produced the 2025–2026 wave of reasoning models — is the archetype that breaks the "post-training = training" intuition entirely. It alternates two phases with opposite hardware profiles: a rollout phase that samples long trajectories from the current policy (pure inference), and a policy-update phase that runs a comparatively small synchronous gradient step. The rollout phase dominates. That single fact reorganizes the entire cluster, and the rest of this chapter is its consequences. → the algorithmic detail of GRPO/PPO and async off-policy RL lives in the training-frameworks treatment of Chapter 10.8.
Why RL is inference-heavy training
Here is the derivation, because the conclusion is counterintuitive and the capital at stake is large. In RL for reasoning, the model improves by trying things and being rewarded. Each training step requires the current policy to generate candidate responses — rollouts — that are then scored and used to compute a gradient. For reasoning and agentic tasks, those rollouts are long: chains of thought, tool calls, and multi-turn trajectories routinely run 10K–100K+ tokens each, and you generate many of them per prompt (GRPO samples a group of completions per prompt precisely to estimate a baseline). Generation is autoregressive decode — one token at a time, memory-bandwidth-bound, the same physics as online inference. The gradient update that follows consumes a fraction of the wall-clock.
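To make the token economics concrete, the sketch below counts the tokens one RL step must decode and converts that into generation time. Every number in it is hypothetical (batch of prompts, group size, trajectory length, and per-GPU decode throughput), and the helper names are not from any framework.

```python
def rollout_tokens_per_step(prompts: int, group_size: int, tokens_per_rollout: int) -> int:
    """Tokens the rollout engines must decode before one gradient step can run."""
    return prompts * group_size * tokens_per_rollout

def decode_seconds(total_tokens: int, tokens_per_sec_per_gpu: float, gpus: int) -> float:
    """Generation time if decode throughput scaled perfectly across GPUs (an idealization)."""
    return total_tokens / (tokens_per_sec_per_gpu * gpus)

# Hypothetical step: 256 prompts, 8 sampled completions each, 16K tokens per trajectory.
tokens = rollout_tokens_per_step(prompts=256, group_size=8, tokens_per_rollout=16_384)
print(tokens)

# Hypothetical fleet: 64 GPUs sustaining 2,000 decoded tokens/s each.
print(decode_seconds(tokens, tokens_per_sec_per_gpu=2_000, gpus=64))
```

The product of the three factors is what matters: doubling the group size or the trajectory length doubles the decode work, while the gradient step that follows processes each trajectory once.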
The numbers are stark. Across modern RL post-training systems, rollout generation accounts for roughly 80% of wall-clock time in agentic and reasoning RL, and at a 16K-token generation length rollouts consume on the order of 70% of the compute (multiple 2025–2026 RL-systems papers and practitioner reports). The bottleneck is HBM bandwidth — the rollout engines repeatedly stream weights and KV-cache through memory exactly as an inference server does. The consequence is unambiguous: an RL cluster is, by load, mostly an inference cluster. The trainer is the minority partner. Any scoping that treats the whole cluster as a synchronous trainer is paying for a non-blocking fabric and uniform max-density racks across a pool that spends four-fifths of its life doing decode.
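An Amdahl-style calculation shows why the rollout share dictates where capital should go. The sketch takes the roughly 80% rollout share quoted above as its input; the function name and the speedup factors are illustrative.

```python
def step_speedup(rollout_share: float, rollout_speedup: float = 1.0,
                 trainer_speedup: float = 1.0) -> float:
    """Overall speedup of one serialized RL step when each phase is accelerated separately."""
    trainer_share = 1.0 - rollout_share
    return 1.0 / (rollout_share / rollout_speedup + trainer_share / trainer_speedup)

# With rollouts at 80% of wall-clock, even an infinitely fast trainer caps out at 1.25x.
print(step_speedup(0.8, trainer_speedup=float("inf")))

# Doubling rollout throughput alone does better than that ceiling.
print(step_speedup(0.8, rollout_speedup=2.0))
```

Spending on the trainer's fabric and density attacks the 20% term; spending on decode throughput attacks the 80% term.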
| Sub-archetype | Dominant cost | Rollout share | Coupling | Hardware it wants | Hosting |
|---|---|---|---|---|---|
| SFT (supervised fine-tune) | Gradient step on labeled pairs | None (no sampling) | Tight, but small scale (1–few nodes) | A few training-class nodes; liquid-cooled if rack density requires it | Time-shares well; bursty tenant |
| DPO / offline preference | Single contrastive gradient step | None (offline pairs) | Tight, small scale | SFT-class; no reward-model fleet | Time-shares well |
| RLHF (PPO, 4-model) | Sampling + reward + actor/critic update | High (sampling-dominated) | Mixed: inference sampling + sync update | Inference pool + trainer; 8–16 H100-class GPUs for 70B-class weights alone | Dedicated or large shared slice |
| RL for reasoning (GRPO/PPO) | Long-trajectory rollout generation | ~70–80% of wall-clock | Async: rollout fleet + small trainer | Large inference-class rollout pool + small dense trainer | Dedicated disaggregated cluster |
| LoRA / QLoRA adapter | Low-rank delta on frozen base | None | Loose; single GPU possible | Inference-class GPU; 65B on one 48 GB card (QLoRA) | Fits on idle inference capacity |
Collocated vs disaggregated: the RL infrastructure fork
Once you accept that an RL cluster is two workloads — a rollout fleet and a trainer — the central architectural question is whether they live in the same GPU pool or separate ones. This is the fork that sets utilization, fabric, and the failure modes you will fight.
Collocated designs put rollout and training on one shared resource pool and alternate phases in time: the GPUs generate, then switch to updating, then generate again. The win is simplicity and high utilization when the phases are balanced, because no GPU is reserved for a role that is currently waiting. The frameworks that pioneered this (the verl family and relatives) make it the default for moderate scale. The cost is that the pool must be provisioned for the more demanding phase, and the phase switch serializes generation and update: while the pool trains, no rollouts are generated, and while it generates, no gradient steps run. At large scale that serialization is exactly the wall-clock the rollout share was warning you about.
Disaggregated designs run a dedicated rollout/generation fleet and a separate trainer continuously and in parallel, connected by a weight-sync path. This is the architecture the 2025–2026 systems literature has converged toward (StreamRL, AReaL, ROLL Flash, ROLLART and kin) because it lets each side use best-fit hardware: route decode-heavy rollouts to bandwidth-optimized inference GPUs, the policy update to compute-optimized trainer GPUs, and even push environment/reward computation to CPU clusters or serverless. It also overlaps the two phases so neither stalls. The price is that disaggregation is asynchronous by construction — the trainer consumes rollouts generated by a slightly older policy — and that introduces the staleness problem treated next. It also demands a deliberate weight-broadcast path and a fabric that carries fresh policy weights to the rollout fleet fast enough to keep them from drifting too far off-policy.
| Axis | Collocated (time-shared pool) | Disaggregated (separate fleets) |
|---|---|---|
| GPU utilization | High when phases balance; generation and update serialize, so one role always waits | Both fleets busy continuously; overlap hides stalls |
| Hardware fit | One GPU type for both phases (compromise) | Best-fit: bandwidth-optimized rollout + compute-optimized trainer |
| Coupling / sync | Synchronous by default; on-policy, stable | Asynchronous; off-policy, needs a staleness bound |
| Fabric demand | One pool; modest scale-out | Weight-sync broadcast path + tolerant rollout fabric |
| Failure blast radius | A trainer fault stalls the whole pool | Rollout faults isolated from the trainer; restartable |
| Best fit | Moderate scale; simplicity; on-policy algorithms | Large scale; throughput-critical; heterogeneous hardware |
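A first-order model of the two designs helps show how a disaggregated fleet is split and where the serialization cost enters. It is a sketch under strong assumptions: identical GPUs in both pools, perfect scaling, and per-step work expressed in hypothetical GPU-seconds. All function names are illustrative.

```python
def split_fleet(total_gpus: int, rollout_work: float, trainer_work: float) -> tuple[int, int]:
    """Split GPUs in proportion to work so both sides finish a step at about the same time."""
    n_rollout = round(total_gpus * rollout_work / (rollout_work + trainer_work))
    n_rollout = min(max(n_rollout, 1), total_gpus - 1)  # keep at least one GPU per role
    return n_rollout, total_gpus - n_rollout

def collocated_step(total_gpus: int, rollout_work: float, trainer_work: float,
                    switch_overhead_s: float = 0.0) -> float:
    """Phases run back to back on the whole pool, plus any cost of switching roles."""
    return (rollout_work + trainer_work) / total_gpus + switch_overhead_s

def disaggregated_step(n_rollout: int, n_trainer: int,
                       rollout_work: float, trainer_work: float) -> float:
    """Phases overlap; in steady state the slower side sets the pace."""
    return max(rollout_work / n_rollout, trainer_work / n_trainer)

# Hypothetical step: 800 GPU-seconds of generation, 200 GPU-seconds of training, 64 GPUs.
n_r, n_t = split_fleet(64, 800, 200)
print(n_r, n_t)
print(collocated_step(64, 800, 200, switch_overhead_s=2.0))
print(disaggregated_step(n_r, n_t, 800, 200))
```

With identical hardware and no switch cost the two designs come out close in this model. The gap opens through the terms the model exposes or omits: the phase-switch overhead, a rollout pool on hardware better suited to decode, and stragglers that hold a serialized pool hostage.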
The staleness budget
Asynchrony is where disaggregated RL buys its speed, and where it can lose its convergence. When the rollout fleet and the trainer run in parallel, the trajectories the trainer learns from were generated by a policy that is now a few updates old. That gap is staleness (or policy lag), and it is the single most important tuning knob in an async RL system. Allow zero staleness and you are back to synchronous, on-policy training — stable, but the trainer idles while rollouts generate and the rollout fleet idles while the trainer steps. Allow unbounded staleness and the trajectories become so off-policy that the gradient points in the wrong direction; training destabilizes, with the characteristic gradient-norm spikes and collapses the literature documents.
The engineering answer in 2026 is a bounded staleness budget — let the trainer consume rollouts up to k policy versions old, no more — paired with off-policy corrections (importance-sampling variants, variance control) that keep the stale gradient honest. Well-tuned, the payoff is large: recent variance-controlled async methods match the best synchronous accuracy on long-context agentic RL roughly 2.5x faster in wall-clock (about 42 hours versus 105 hours in one published comparison) and keep improving past the synchronous ceiling. The infrastructure consequence is that the staleness bound is not just an ML hyperparameter — it sets how often you must broadcast fresh weights across the fabric to the rollout fleet, and therefore how much weight-sync bandwidth you must provision. Tighter staleness means more frequent broadcasts means more east-west traffic on the sync path. This is the GOODPUT lens applied to RL: the cluster's useful output is gradient steps that actually improve the policy, and both an idle trainer (too synchronous) and a divergent one (too stale) are goodput you paid for and did not get.
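The bounded-staleness rule itself is simple enough to sketch. The buffer below admits a rollout into a training batch only if it was generated at most k policy versions ago. It is a minimal illustration with hypothetical class and field names; it omits the off-policy corrections a real system would pair with it.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this trajectory
    tokens: int

class StalenessBoundedBuffer:
    """FIFO of rollouts that discards anything older than max_staleness versions."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._queue: deque[Rollout] = deque()
        self.dropped = 0  # rollouts generated and paid for, but never trained on

    def add(self, rollout: Rollout) -> None:
        self._queue.append(rollout)

    def take(self, trainer_version: int, n: int) -> list[Rollout]:
        """Return up to n rollouts whose lag behind the trainer is within budget."""
        batch: list[Rollout] = []
        while self._queue and len(batch) < n:
            rollout = self._queue.popleft()
            if trainer_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
            else:
                self.dropped += 1
        return batch

buf = StalenessBoundedBuffer(max_staleness=2)
for version in range(4):
    buf.add(Rollout(policy_version=version, tokens=16_384))
batch = buf.take(trainer_version=3, n=10)
print([r.policy_version for r in batch], buf.dropped)
```

The `dropped` counter is the goodput cost of a tight bound: every discarded trajectory is decode work that produced no gradient, which is why the bound and the weight-broadcast cadence have to be tuned together.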
LoRA / QLoRA vs full fine-tune: sizing the cheap end
At the fine-tuning end of the spectrum, the decisive fork is parameter-efficient fine-tuning (PEFT) versus full-weight fine-tuning, and the capacity difference is enormous. A full fine-tune updates every weight, so it must hold the full model, its gradients, and optimizer state in GPU memory (Adam keeps two moment estimates per parameter, so roughly two more copies of the model). That is why a serious full fine-tune of a frontier-scale model still needs a training-class multi-node allocation. LoRA freezes the base weights and trains a small low-rank delta, cutting the trainable parameter count by orders of magnitude (often to ~0.1% of the model) and the optimizer-state memory with it. QLoRA goes further by quantizing the frozen base to 4-bit, so the memory you must reserve is dominated by a quantized read-only model plus a tiny adapter.
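The parameter reduction follows directly from the low-rank factorization: for a weight matrix of shape d_out by d_in, a rank-r adapter trains r × (d_in + d_out) parameters instead of d_in × d_out. The sketch below works that ratio for one matrix; the 8192-wide layer and the ranks are hypothetical examples.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable share of one adapted matrix: two low-rank factors versus the full matrix."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return lora_params / full_params

# Hypothetical 8192 x 8192 projection matrix at two common ranks.
for rank in (4, 8):
    print(rank, f"{lora_trainable_fraction(8192, 8192, rank):.4%}")
```

At rank 4 the adapted matrix trains about a tenth of a percent of its original parameters. The model-wide fraction depends on which matrices receive adapters, so treat the per-matrix figure as a guide.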
The consequence for capacity planning is concrete. QLoRA's headline result — fine-tuning a 65B-parameter model on a single 48 GB GPU, collapsing a requirement from >780 GB of GPU memory to <48 GB without degrading task quality — means a workload that would otherwise demand a multi-node training reservation now fits on one idle inference card. That reframes fine-tuning from a capacity problem into a scheduling problem: LoRA/QLoRA jobs are exactly the kind of small, interruptible tenant you backfill onto an inference fleet's idle troughs or a training cluster's gaps. The decision rule: if the fine-tune can be expressed as a low-rank adapter and quality holds, never reserve dedicated training capacity for it — fit it onto capacity you already own. Reserve full-fine-tune-class allocations only when adapters demonstrably do not reach the quality bar.
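A back-of-envelope memory model reproduces the order of magnitude of both figures. The byte layout is an assumption, not the QLoRA authors' accounting: bf16 weights and gradients plus two fp32 Adam moments for the full fine-tune, and a 4-bit frozen base plus an adapter at 0.1% of the parameters for QLoRA. Activations, quantization constants, and framework overhead are left out.

```python
def full_finetune_gb(params_b: float, weight_bytes: int = 2, grad_bytes: int = 2,
                     adam_bytes: int = 8) -> float:
    """Weights + gradients + two fp32 Adam moments, in GB (params_b in billions)."""
    return params_b * (weight_bytes + grad_bytes + adam_bytes)

def qlora_gb(params_b: float, adapter_fraction: float = 0.001, base_bits: int = 4) -> float:
    """Quantized frozen base plus a fully trained adapter, in GB (activations excluded)."""
    base_gb = params_b * base_bits / 8
    adapter_gb = params_b * adapter_fraction * (2 + 2 + 8)  # same 12 bytes per trained param
    return base_gb + adapter_gb

print(full_finetune_gb(65))  # assumed layout lands at the ~780 GB mark
print(qlora_gb(65))          # leaves room under 48 GB for activations and overhead
```

The ratio between the two results is what turns a multi-node reservation into a single-card job.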
Capacity planning for spiky, project-based demand
Post-training demand does not behave like training or inference, and that is the planning problem. A pre-training run is a forecastable monolith — you know the GPU-months before you start. An online-inference fleet has a diurnal, statistically stable load you can size to a percentile. Post-training is neither: it is spiky and project-based. An alignment project, a new reasoning RL run, a customer fine-tune — each spins up, consumes a burst of heterogeneous capacity for days or weeks, then ends. The aggregate demand is a sum of poorly-correlated bursts that no single reservation curve fits well. This is the open capacity-planning question of the archetype: dedicated clusters versus time-shared fabric.
Dedicated post-training capacity buys predictability and isolation — your RL run is not preempted by someone else's inference spike — at the cost of low average utilization, because the spiky demand leaves the dedicated pool idle between projects. It is the right call only when post-training is a continuous, first-class product line (a frontier lab iterating on reasoning models) rather than an occasional activity. Time-shared tenancy — running post-training on slack from a training or inference fabric — inverts the tradeoff: high utilization because you backfill troughs, at the cost of contention and the engineering to preempt cleanly. The natural fit is striking: RL rollouts are inference, so they backfill an inference fleet's idle capacity with the same engines; LoRA/QLoRA fine-tunes are small and interruptible, so they slot into either fabric's gaps. The heterogeneity that makes RL hard to scope on a single fabric is exactly what makes it a good time-sharing citizen across two.
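The utilization penalty of a dedicated pool can be estimated from a project calendar. The sketch below sizes the pool to peak concurrent demand and reports the average utilization that results. The projects are invented for illustration, and the function name is hypothetical.

```python
def dedicated_utilization(projects: list[tuple[int, int, int]],
                          horizon_days: int) -> tuple[int, float]:
    """Peak GPU demand and mean utilization of a pool sized to that peak.

    Each project is (start_day, duration_days, gpus).
    """
    demand = [0] * horizon_days
    for start, duration, gpus in projects:
        for day in range(start, min(start + duration, horizon_days)):
            demand[day] += gpus
    peak = max(demand)
    mean = sum(demand) / horizon_days
    return peak, (mean / peak if peak else 0.0)

# Hypothetical quarter: an alignment project, an overlapping fine-tune campaign,
# one large reasoning-RL run, and a small customer fine-tune.
projects = [(0, 14, 64), (10, 7, 128), (40, 21, 256), (70, 5, 16)]
peak, utilization = dedicated_utilization(projects, horizon_days=90)
print(peak, f"{utilization:.0%}")
```

In this made-up calendar a pool sized for the largest run sits at roughly 31% average utilization. That idle remainder is the capacity a time-shared tenancy would hand back to the host fabric.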
Deep dive: the weight-sync path is the hidden network requirement
The fabric mistake operators make on disaggregated RL is to size the rollout fabric (tolerant, oversubscribable — it is inference) and forget the weight-sync path (the channel that broadcasts updated policy weights from the trainer to every rollout engine). This path is small in aggregate bytes but punishing in frequency: every time the trainer produces a new policy version within your staleness budget, the full updated weights (or a delta) must reach the entire rollout fleet before those engines drift too far off-policy. Tighten the staleness bound to protect convergence and you increase broadcast frequency; widen the rollout fleet to increase generation throughput and you increase the fan-out of every broadcast. The two scaling pressures multiply.
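The way the two pressures multiply can be written down directly. The sketch estimates the aggregate delivery rate the sync path must sustain if every rollout replica receives a full copy of the weights each sync interval. It is a naive upper bound: delta updates, tree or collective broadcasts, and compression would all reduce it. The model size, replica counts, and intervals are hypothetical.

```python
def weight_sync_gb_per_s(params_b: float, bytes_per_param: int,
                         replicas: int, sync_interval_s: float) -> float:
    """Aggregate GB/s delivered to the rollout fleet for full-weight broadcasts."""
    payload_gb = params_b * bytes_per_param  # one full copy of the policy
    return payload_gb * replicas / sync_interval_s

# Hypothetical: 70B policy in bf16, 32 rollout replicas, fresh weights every 10 minutes.
base = weight_sync_gb_per_s(70, 2, replicas=32, sync_interval_s=600)
print(round(base, 2))

# Halve the interval (tighter staleness) and double the fleet (more throughput).
tight_and_wide = weight_sync_gb_per_s(70, 2, replicas=64, sync_interval_s=300)
print(round(tight_and_wide / base, 2))
```

Tightening the bound and widening the fleet together quadruple the delivery rate in this model, which is why the sync path is sized from both numbers, not from either alone.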
The consequence is a fabric you must design deliberately, not inherit. A pure inference fleet has no equivalent of this traffic — its weights are static between deployments — so an inference cluster repurposed for RL rollouts will lack the broadcast headroom unless you add it. The disaggregated-RL systems of 2025–2026 treat staleness-bounded weight synchronization as a first-class subsystem precisely because it is the coupling that an otherwise-inference rollout fleet cannot avoid. Size it from the staleness budget and the rollout fan-out, and route it so a weight broadcast does not collide with rollout result traffic. → topology and oversubscription mechanics in Chapter 8.5; the scale-up domain that bounds trainer parallelism in Chapter 8.2.
Deep dive: resource heterogeneity as a feature, not a bug
Pre-training prizes homogeneity: every node is identical so that no slow node sets the pace of the synchronous step. RL post-training prizes the opposite. Its phases want different silicon: rollout generation is decode, which rewards HBM bandwidth and tolerates older or cheaper inference-class accelerators; the policy update is compute-bound and rewards the densest trainer GPUs; environment simulation and reward scoring are often CPU work that should never touch a GPU at all. The 2026 disaggregated-RL systems lean into this, mapping each pipeline stage to best-fit hardware and even offloading stateless reward computation to serverless CPU infrastructure.
This is why RL is a natural home for a mixed fleet — last generation's inference GPUs that are no longer competitive for frontier serving can feed rollouts, while a smaller pool of current trainers does the updates. It is also why the GPU:CPU ratio is rebalancing away from the old training-era ~8:1 toward more CPU per node: agentic RL adds tool execution, retrieval, and sandboxed environment steps that are host-CPU work. An operator scoping RL capacity should plan for a deliberately heterogeneous bill of materials and a host-CPU budget heavier than a pre-training cluster's. → GPU:CPU ratios and system composition in Chapter 7.8; accelerator selection by role in Chapter 7.11.
The density-ramp angle: post-training inherits the cliff
Post-training does not escape the density-and-cooling physics of the other archetypes — it inherits a split version of it. The trainer pool, being a dense synchronous cluster in miniature, lands on the same cooling cliff as pre-training: if it runs current-generation accelerators it is over the ~41 kW air ceiling and direct-to-chip liquid is mandatory for that pool. The rollout pool is more forgiving — it is inference, so it can live on high-density air or rear-door heat exchangers depending on the accelerator. The trap is the same one from Chapter 1.1: if you might run training-class trainers, you must plumb that portion of the hall for liquid at scoping time, because crossing the cliff in a retrofit strands capacity. A post-training facility is rarely uniform: it is two thermal zones, and the irreversible substrate (floor loading, water, electrical headroom) must accommodate the denser of the two. → the cooling cliff is engineered in Chapter 5.1 and Chapter 5.4.