Chapter 7.12

On-Package Power Delivery & Power Integrity

The last meter of the power chain takes 48V on the board down to ~0.7V at 2,000+ amps inside the package. It is where the AI accelerator either gets the clean, low-impedance, fast-responding current it needs to hit clock, or droops, throttles, and turns purchased megawatts into wasted goodput. The same di/dt event that defines this meter is also the seed of the facility-scale transient three layers up.

POWER-BOUND · GOODPUT · DENSITY-RAMP

What you'll decide here

Where you convert 48V to core voltage — board-edge multiphase VRM, lateral power-on-package, or vertical power delivery underneath the die — and therefore how much of your delivered watts you lose to I²R before they ever reach a transistor.
Whether to design for frontside or backside power delivery (BSPDN) at the silicon level — a process-node decision made years upstream that sets your achievable IR-drop floor and free routing budget.
Your PDN impedance target across frequency (the sub-milliohm flat-impedance line) and the decoupling hierarchy — bulk, package, on-die MIM/deep-trench — that hits it without anti-resonance.
How much on-package and on-shelf capacitance to budget against the worst-case di/dt load step — the joules-per-GPU number that buys you ride-through and grid-transient smoothing, and the volume/cost it consumes.
Whether the load step is mitigated at the chip (capacitance, load-line, adaptive clocking) or pushed upward as a synchronized transient the facility BBU/BESS spine has to absorb — and who owns which decibel of the attenuation budget.

Everything upstream of this chapter — the substation, the 800 VDC spine, the busbar, the power shelf — exists to deliver clean voltage to one place: the pins of the accelerator package. But the accelerator does not run on 48V. It runs on a core rail somewhere around 0.6–0.9V, and at the power levels of a 2026-class GPU that means a steady-state core current measured in the thousands of amps (2,000A is now routine; roadmap parts target 2,000–5,000A and beyond). Converting 48V to sub-1V at multi-kiloamp current, with the conversion happening as physically close to the die as manufacturing allows, while holding the rail flat through load steps that swing hundreds of amps in nanoseconds — that is on-package power delivery, and it is the least-visible, highest-stakes meter of the entire power chain.

This chapter works at the millimeter and microsecond scale. We walk the conversion from board 48V down to core (multiphase VRMs and smart power stages), the architectural fork between lateral and vertical power delivery and the silicon-level move to backside power (BSPDN), the PDN impedance target and the decoupling hierarchy that meets it, and the physics of the load-line, Vdroop, and the di/dt event. Then we read the spine end-to-end: the on-die di/dt event is the same physical phenomenon that, summed across 72 GPUs stepping in lockstep, becomes the synchronized facility transient the BBU/BESS spine must absorb. The cost of guessing wrong at any layer is either stranded silicon performance or a grid-disturbing load that gets you a worse interconnection deal.

Board 48V to core: multiphase VRMs and smart power stages

The conventional answer, inherited from the server world, is the multiphase buck VRM: a digital controller drives N interleaved phases, each phase a high-side/low-side MOSFET pair (now integrated as a smart power stage, or DrMOS, with current/temperature telemetry built in) feeding an inductor, all summing into a shared output rail. Interleaving the phases multiplies the effective switching frequency at the output and shrinks the output ripple, so more phases buy you a tighter rail and more current — at the cost of board area, gate-drive losses, and controller complexity. A modern GPU core rail can demand 16, 20, or more phases just to source the current without melting any single power stage.
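The interleaving arithmetic above can be sketched in a few lines. This is a minimal illustration, not a regulator design tool: the function names, the 20-phase / 500 kHz / 2,000 A operating point, and the use of the ideal textbook ripple-cancellation expression for an interleaved buck are all assumptions made here for illustration.

```python
import math


def multiphase_summary(total_current_a, phases, f_sw_hz):
    """Per-phase current and effective output ripple frequency for N interleaved phases."""
    if phases < 1:
        raise ValueError("phases must be >= 1")
    return {
        "per_phase_current_a": total_current_a / phases,
        "effective_ripple_hz": phases * f_sw_hz,
    }


def ripple_cancellation_factor(duty, phases):
    """Summed output ripple current divided by one phase's ripple (ideal interleaved buck).

    1.0 means no cancellation; 0.0 means the phase ripples cancel completely.
    """
    if not 0.0 < duty < 1.0:
        raise ValueError("duty must be strictly between 0 and 1")
    m = math.floor(phases * duty)
    return (phases * (duty - m / phases) * ((m + 1) / phases - duty)
            / (duty * (1.0 - duty)))


if __name__ == "__main__":
    # Hypothetical rail: 2,000 A shared by 20 phases, each switching at 500 kHz.
    print(multiphase_summary(2000.0, 20, 500e3))
    # Two phases at 25% duty: summed ripple is about two thirds of one phase's ripple.
    print(ripple_cancellation_factor(0.25, 2))
```

Under these assumptions each power stage carries 100 A and the output sees ripple at 10 MHz, which is the sense in which more phases buy both current capacity and a tighter rail.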

Two design moves dominate here. First, 48V direct conversion versus a two-stage 48V→12V→core path: the industry has decisively moved to higher distribution voltage (48V over the legacy 12V) because at constant power, quadrupling the voltage cuts distribution current by 4x, and therefore I²R loss in the board copper by 16x. Direct 48V-to-core conversion (single-stage) saves the intermediate-bus-converter losses but stresses the duty cycle (0.7V from 48V is a duty cycle of roughly 1.5% for a plain buck); two-stage is more forgiving but adds a conversion. Second, the load-line (also called adaptive voltage positioning): deliberately programming the regulator so the delivered voltage droops with load current along a defined slope. This is counterintuitive, since you are intentionally lowering voltage as current rises, but by pre-positioning the rail it can cut the worst-case transient excursion roughly in half, and it lets the silicon run at the lowest voltage that still meets timing, which is pure power savings. The load-line slope is a negotiated number between the silicon vendor (who owns the timing margin) and the power team (who owns the regulator), and getting it wrong either burns power or causes timing failures under transient.
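Both moves reduce to one-line formulas, sketched below. The 1.5 kW load, the 1 mΩ board path, the 0.80 V no-load setpoint, and the 50 µΩ load-line slope are hypothetical values chosen only to make the arithmetic visible; the function names are likewise illustrative.

```python
def distribution_loss_w(power_w, bus_v, path_resistance_ohm):
    """I^2 * R loss in the distribution copper when delivering power_w at bus_v."""
    current_a = power_w / bus_v
    return current_a ** 2 * path_resistance_ohm


def load_line_voltage(v_no_load, r_ll_ohm, i_load_a):
    """Rail voltage under adaptive voltage positioning: V = V_no_load - R_LL * I."""
    return v_no_load - r_ll_ohm * i_load_a


if __name__ == "__main__":
    # Same power, same copper, two bus voltages (all values hypothetical).
    loss_12 = distribution_loss_w(1500.0, 12.0, 1e-3)
    loss_48 = distribution_loss_w(1500.0, 48.0, 1e-3)
    print(f"12 V bus: {loss_12:.3f} W, 48 V bus: {loss_48:.3f} W, "
          f"ratio {loss_12 / loss_48:.1f}x")

    # Hypothetical load-line: the rail sits lower as load current rises.
    for amps in (0.0, 1000.0, 2000.0):
        print(f"{amps:6.0f} A -> {load_line_voltage(0.80, 50e-6, amps):.3f} V")
```

With these numbers the 48V bus carries a quarter of the current and dissipates one sixteenth of the copper loss, and the load-line walks the rail from 0.80 V at idle to about 0.70 V at 2,000 A.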

Lateral, vertical, and the move to power-from-below

Where the final 48V-to-core conversion physically sits is the central architectural fork of on-package power, and it is a continuum of three positions, each cutting PDN resistance harder than the last.

Board-edge multiphase is the legacy position: the VRM lives on the motherboard, current travels through the socket/connector and across the package to the die. Cheap, serviceable, well-understood — and increasingly untenable as core current climbs, because the PDN resistance from board edge to die dominates the loss budget. Lateral power-on-package moves current-multiplier modules onto the package substrate itself, flanking the die on the north/south or east/west edges, so the final conversion happens millimeters away. Vertical power delivery (VPD) is the endgame: the conversion modules sit directly underneath the processor, feeding current straight up into the die with the shortest possible path. Vendor data puts VPD at roughly a 66% reduction in path ESR and ~43% in ESL versus lateral, translating to a few points of total-system efficiency recovered — points that, at 1.5kW/GPU across tens of thousands of GPUs, are real megawatts.

The silicon-level analog is backside power delivery network (BSPDN) — routing the power rails on the back of the wafer, beneath the transistors, instead of competing for the same congested metal stack as signal routing on top. Intel ships it as PowerVia on the 18A node (in Panther Lake client parts as of early 2026); TSMC's Super Power Rail arrives with the A16 node targeting volume in late 2026; Samsung's SF2Z follows. The payoff is twofold: BSPDN demonstrably cuts on-die IR droop by ~30% (Intel test-chip data) and frees frontside routing tracks for signal, improving both power integrity and logic density. The catch is thermal — with power rails on the back, the heat path to the cold plate is now through the power-delivery layer, complicating the already-tight thermal budget covered in Chapter 5.4.

Power-delivery position → loss, density, and consequence

Position	Where conversion sits	PDN resistance vs board-edge	Loss recovered (system)	Downstream consequence
Board-edge multiphase	Motherboard, across socket + package	Baseline (highest)	—	Serviceable and cheap; loss budget caps achievable core current
Lateral power-on-package	Package substrate, flanking the die	Large reduction (modules at die edge)	~1–2%	Consumes package edge area; substrate routing complexity rises
Vertical power delivery (VPD)	Directly under the die	~50x lower (Vicor current-multiplier claim); ~66% ESR / ~43% ESL vs lateral	~2–5.5%	Highest density; couples power and thermal under the die; serviceability hardest
Backside power (BSPDN)	On-wafer, beneath the transistors	Cuts on-die IR droop ~30%	Enables higher clock at iso-power (~6% freq, Intel)	Process-node decision (18A / A16 / SF2Z); heat path now through the power layer

PDN resistance/ESR/ESL reductions are vendor-reported directional figures (Vicor for lateral/vertical; Intel/TSMC for BSPDN). 'Loss recovered' is approximate system-level efficiency, not a single-rail figure.

The table is a one-way ratchet driven by current. As core current climbs generation over generation, the loss in any given position grows as I², so each generation pushes the industry one step down the table. The strategic point: VPD and BSPDN are not optional refinements, they are the enabling technologies for the next density step. A 600 kW Kyber-class rack of GPUs each pulling thousands of amps is not buildable on board-edge multiphase — the loss budget alone forecloses it. The density ramp covered in Chapter 7.1 is, at the package level, a forced march down this table.
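A short sketch of why current drives the ratchet: for a fixed path resistance, the I²R loss as a fraction of delivered power (V·I) is I·R/V, so it rises linearly with core current. The 50 µΩ path resistance and the 5% loss budget below are hypothetical placeholders, not figures from any vendor.

```python
def path_loss_fraction(core_current_a, path_resistance_ohm, core_v):
    """I^2*R path loss divided by power delivered to the die (V*I), i.e. I*R/V."""
    return core_current_a * path_resistance_ohm / core_v


def max_path_resistance_ohm(core_current_a, core_v, loss_budget_fraction):
    """Largest path resistance that keeps the loss fraction inside the budget."""
    return loss_budget_fraction * core_v / core_current_a


if __name__ == "__main__":
    # Hypothetical 50 micro-ohm path feeding a 0.7 V rail.
    for amps in (1000.0, 2000.0, 4000.0):
        frac = path_loss_fraction(amps, 50e-6, 0.7)
        r_max = max_path_resistance_ohm(amps, 0.7, 0.05)
        print(f"{amps:6.0f} A: loss {frac:6.1%}, "
              f"R allowed for 5% budget = {r_max * 1e6:.2f} uOhm")
```

Doubling the current doubles the loss fraction and halves the resistance the path is allowed to have, which is why each current generation forces the conversion physically closer to the die.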

The PDN impedance target and the decoupling hierarchy

Power integrity reduces to one specification: hold the core rail inside its noise window across every frequency at which the load draws current. The load is a digital engine that switches at GHz and gates entire tiles on and off at MHz, so its current demand has spectral content from DC up past 100 MHz. The governing design rule is the target impedance: Z_target = ΔV_allowed / ΔI_max. For a sub-1V rail with a tight noise budget and a multi-hundred-amp transient, that target lands in the sub-milliohm range — often well under 1 mΩ out to 100 MHz, and into the microohm range at DC. The job of the PDN is to present a flat impedance below that line across the whole band. Any frequency where impedance pokes above the line is a frequency where a load step produces a droop that violates the noise budget — and droop, in a clocked machine, means a timing failure, which means you either lower the clock (lost performance) or raise the voltage (lost power).
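The target-impedance rule is a single division, shown here with assumed inputs. The 0.7 V rail, the 5% noise window, and the 500 A step are illustrative values, not a published spec for any part.

```python
def target_impedance_ohm(v_rail, allowed_noise_fraction, delta_i_max_a):
    """Z_target = dV_allowed / dI_max, with dV_allowed given as a fraction of the rail."""
    if delta_i_max_a <= 0:
        raise ValueError("delta_i_max_a must be positive")
    return v_rail * allowed_noise_fraction / delta_i_max_a


if __name__ == "__main__":
    # Hypothetical: 0.7 V rail, +/-5% noise window, 500 A worst-case load step.
    z = target_impedance_ohm(0.7, 0.05, 500.0)
    print(f"Z_target = {z * 1e6:.0f} micro-ohm")
```

With these assumptions the line sits at 70 µΩ, well inside the sub-milliohm range the text describes; a looser noise window or a smaller step relaxes it proportionally.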

No single component holds sub-milliohm flat across nine decades of frequency, so the PDN is a hierarchy of decoupling, each tier owning a frequency band:

Bulk capacitance (board, µs–ms band): large electrolytic/polymer caps near the VRM that hold the rail through the regulator's own control-loop response time. This is also where the load-step ride-through energy lives.
Mid-frequency ceramics (package + board, ~MHz band): MLCCs spread across the package and board backside, bridging the gap between bulk caps and on-die caps. This band is where anti-resonance bites: the parasitic inductance of one tier resonates against the capacitance of the next, producing an impedance peak exactly where you need a valley.
On-die decoupling (>100 MHz band): MIM and deep-trench capacitors integrated into the silicon, the only thing fast enough to respond to the highest-frequency current steps. As package and board parasitic inductance increasingly dominate, the burden shifts on-die — and the newest accelerators add on-die voltage telemetry to report rail quality in real time so the system can adapt.
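The anti-resonance between two tiers can be reproduced with a two-branch model: each tier is a series R-L-C, and the two are placed in parallel. This is a sketch under loudly simplified assumptions; the component values are hypothetical, chosen to make the peak easy to see rather than to represent a real accelerator PDN, and the helper names are illustrative.

```python
import math


def branch_impedance(f_hz, c_f, esr_ohm, esl_h):
    """Series R-L-C impedance of one decoupling tier at frequency f_hz."""
    w = 2.0 * math.pi * f_hz
    return complex(esr_ohm, w * esl_h - 1.0 / (w * c_f))


def parallel_impedance(f_hz, branches):
    """Impedance of several (C, ESR, ESL) tiers connected in parallel."""
    return 1.0 / sum(1.0 / branch_impedance(f_hz, *b) for b in branches)


def self_resonant_hz(c_f, esl_h):
    """Frequency where a tier's own ESL cancels its capacitance."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_h * c_f))


def find_peak(branches, f_lo_hz, f_hi_hz, points=400):
    """Log-spaced sweep returning (frequency, |Z|) at the highest impedance found."""
    best_f, best_z = f_lo_hz, 0.0
    for k in range(points + 1):
        f = f_lo_hz * (f_hi_hz / f_lo_hz) ** (k / points)
        z = abs(parallel_impedance(f, branches))
        if z > best_z:
            best_f, best_z = f, z
    return best_f, best_z


# Hypothetical tiers: (capacitance F, ESR ohm, ESL H).
BULK = (100e-6, 0.5e-3, 1e-9)
CERAMIC = (1e-6, 2e-3, 0.1e-9)

if __name__ == "__main__":
    f_lo = self_resonant_hz(BULK[0], BULK[2])
    f_hi = self_resonant_hz(CERAMIC[0], CERAMIC[2])
    peak_f, peak_z = find_peak([BULK, CERAMIC], f_lo, f_hi)
    print(f"tier resonances: {f_lo / 1e6:.2f} MHz and {f_hi / 1e6:.2f} MHz")
    print(f"anti-resonance peak: {peak_z * 1e3:.0f} mOhm at {peak_f / 1e6:.2f} MHz")
```

Between the two self-resonant frequencies the bulk tier looks inductive and the ceramic tier looks capacitive, so they form a parallel tank whose impedance peaks far above either tier's own minimum. Adding ESR damping or an intermediate capacitor value are the usual ways to flatten such a peak.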

The decision here is the on-package capacitance budget: how much die area and package real estate you spend on decoupling versus logic and signal. Spend too little and you fail target impedance and droop; spend too much and you sacrifice die area (the most expensive real estate on the package) or package complexity. This budget is co-designed with the load-line slope and the VRM phase count — three knobs that together must close the impedance target.

Load-line, Vdroop, and the di/dt event

Now the physics that ties it all together. A GPU running a synchronous training step does not draw constant current. It draws current that tracks the workload: when thousands of cores fire together at the start of a matrix-multiply tile, current can step hundreds of amps in nanoseconds to microseconds. That rate of change of current is di/dt, and across the inductance of the PDN it generates a voltage transient: V = L·di/dt. A fast load step pulls the rail down (Vdroop); the sudden release at the end of a compute burst lets it overshoot up. Either excursion outside the noise window is a fault.
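The V = L·di/dt relation can be turned around to ask how much loop inductance a given droop budget tolerates. The 1 pH effective inductance, 500 A step, 10 ns edge, and 35 mV budget below are assumed values for illustration only.

```python
def inductive_droop_v(loop_inductance_h, delta_i_a, edge_time_s):
    """Voltage excursion from a linear current ramp: V = L * (dI / dt)."""
    return loop_inductance_h * delta_i_a / edge_time_s


def max_loop_inductance_h(droop_budget_v, delta_i_a, edge_time_s):
    """Largest effective inductance that keeps the excursion inside the budget."""
    return droop_budget_v * edge_time_s / delta_i_a


if __name__ == "__main__":
    # Hypothetical: 500 A step in 10 ns across 1 pH of effective inductance.
    v = inductive_droop_v(1e-12, 500.0, 10e-9)
    print(f"droop = {v * 1e3:.0f} mV")
    # Inductance allowed if the budget is 35 mV for the same step.
    l_max = max_loop_inductance_h(0.035, 500.0, 10e-9)
    print(f"max inductance = {l_max * 1e12:.2f} pH")
```

Even one picohenry produces tens of millivolts at this slew rate, which is why the fastest edge can only be served by charge stored on the die itself.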

Three lines of defense, in order of speed. On-die capacitance handles the fastest edge — it is the only charge reservoir close enough to respond in nanoseconds. The load-line pre-positions the rail so the droop lands inside budget rather than below it. Adaptive clocking / droop detection is the last resort: dedicated on-die circuits sense an impending droop and momentarily stretch the clock to ride it out, trading a sliver of performance for guaranteed correctness rather than crashing. The accelerator vendor co-designs all three; the data-center engineer inherits the result as a power profile — a characteristic spike-and-idle waveform whose amplitude and slew are now a published spec, because they have consequences three layers up.

The synchronized load step is the seed of the facility transient

In a synchronous training job, all 72 GPUs in an NVL72 rack, and every rack in the job, hit the same compute phase at the same instant, because they run in lockstep on the same collective. Their individual di/dt events add coherently. A few hundred amps of swing per GPU becomes tens of megawatts of swing per cluster, in milliseconds, all in phase. This is no longer a chip problem; it is a grid-disturbance problem. NERC logged events of ~1,500 MW of large load lost on a single fault and 1.5 GW dropped in 82 seconds in Virginia, severe enough to trigger a rare Level 3 alert. The on-die di/dt event you mitigate with on-die capacitance is the same phenomenon the utility sees as a gigawatt-scale step. Whoever does not attenuate it at their layer exports it to the next one.
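The difference between coherent and incoherent summation is what makes lockstep training special, and it can be sketched numerically. The 500 A per-GPU swing, the 0.7 V rail, and the 1,000-rack job are hypothetical; the figures are core-rail power and ignore conversion losses upstream, so the swing seen at the rack input would be somewhat larger. The square-root scaling for the incoherent case is the standard result for independent, uncorrelated swings.

```python
import math


def coherent_swing_w(n_gpus, delta_i_a, core_v):
    """In-phase load steps add linearly: N * dI * V."""
    return n_gpus * delta_i_a * core_v


def incoherent_rms_swing_w(n_gpus, delta_i_a, core_v):
    """Uncorrelated steps add in an RMS sense: sqrt(N) * dI * V."""
    return math.sqrt(n_gpus) * delta_i_a * core_v


if __name__ == "__main__":
    # Hypothetical: 500 A swing per GPU on a 0.7 V rail, 72 GPUs per rack.
    rack = coherent_swing_w(72, 500.0, 0.7)
    print(f"one rack, in lockstep:   {rack / 1e3:.1f} kW")
    print(f"one rack, uncorrelated:  {incoherent_rms_swing_w(72, 500.0, 0.7) / 1e3:.1f} kW")
    # Hypothetical 1,000-rack job stepping together.
    print(f"1,000 racks, in lockstep: {coherent_swing_w(72 * 1000, 500.0, 0.7) / 1e6:.1f} MW")
```

Under these assumptions a single rack swings about 25 kW and a thousand racks swing about 25 MW, whereas the same GPUs running uncorrelated work would present a far smaller aggregate step.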

2,000A+

core current for a 2026-class AI accelerator at ~0.7–0.9V; roadmap parts target 2,000–5,000A

2025 · Vicor power-on-package; AI-PDN design literature

<1 mΩ

PDN target impedance for an AI core rail, held flat to ~100 MHz (µΩ-range at DC)

2025 · AI-accelerator PDN / power-integrity guides

65 J/GPU

energy storage in GB300 NVL72 power shelves for di/dt smoothing and ride-through

2025 · NVIDIA Developer (GB300 steady power)

30%

reduction in peak grid power demand on Megatron training (GB300 w/ storage vs GB200 without)

2025 · NVIDIA Developer (GB300 steady power)

~30%

on-die IR/voltage-droop reduction from backside power (Intel PowerVia test chip), ~6% freq at iso-power

2025 · Intel PowerVia / SemiEngineering

~66% / ~43%

PDN ESR / ESL reduction from vertical vs lateral power delivery; ~2–5.5% system efficiency recovered

2025 · Vicor vertical-power-delivery analysis

~1,500 MW

large load lost on a single 230 kV fault; 1.5 GW dropped in 82 s (VA, 2024) — synchronized-transient risk

2026 · NERC Level 3 Alert / Utility Dive

Late 2026

TSMC A16 Super Power Rail (BSPDN) volume target; Intel 18A PowerVia shipping (Panther Lake)

2026 · TSMC / Intel roadmaps; SemiEngineering

Deep dive: the di/dt attenuation budget as a layered allocation problem

Think of the synchronized transient as a single quantity of 'swing' that must be attenuated by some total factor between the transistor and the utility meter; the engineering question is which layer absorbs which decibel. The budget is allocated from the die upward across five reservoirs, each with a characteristic timescale, and the design discipline is to make each layer absorb the band it is fast enough to handle and leave slower bands to the layer above.

On-die MIM/deep-trench caps (ns): the fastest edge. Picofarads-to-nanofarads, but the only thing in the loop fast enough for the leading edge of a tile firing.
Package + board decoupling (ns–µs): MLCCs and bulk caps that hold the rail through the VRM's control-loop latency.
On-shelf energy storage (µs–ms): the GB300's 65 J/GPU of electrolytic capacitance in the power shelf — charges during idle, discharges into spikes, smoothing the rack-level waveform and cutting peak grid draw 30%.
Rack/row BBU (ms–s): battery backup units that ride through brownouts and shave the slower envelope of the load swing.
Facility BESS + software ramp-rate caps (s+): grid-scale storage plus firmware that deliberately limits how fast the cluster is allowed to ramp, so the utility never sees a step it cannot tolerate.

The consequence is an allocation rule: every joule you do not store low in the stack you must store higher, where it is bulkier and more expensive per joule, or export to the grid, where it costs you in interconnection terms and ride-through compliance. Conversely, over-provisioning on-package capacitance to absorb a slow transient that the BBU handles better is wasted die area. The cheapest design allocates each band to the smallest, fastest reservoir that can hold it. This is the canonical statement of the chip→BBU→BESS spine; the facility end of it — ride-through, ramp-rate compliance, and transient absorption — is engineered in Chapter 4.5, which back-refs here for the physical origin.
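The joule accounting behind this rule is simple enough to sketch. Only the 65 J/GPU figure comes from the text; the 1 kW step and the 0.1 F / 50 V / 40 V capacitor bank are hypothetical values used to show how stored energy translates into ride-through time.

```python
def capacitor_energy_j(capacitance_f, v_start, v_end):
    """Usable energy when a capacitor discharges from v_start to v_end: C/2 * (V1^2 - V2^2)."""
    return 0.5 * capacitance_f * (v_start ** 2 - v_end ** 2)


def ride_through_s(energy_j, power_step_w):
    """How long a reservoir can source a constant power step before it is empty."""
    if power_step_w <= 0:
        raise ValueError("power_step_w must be positive")
    return energy_j / power_step_w


if __name__ == "__main__":
    # 65 J per GPU is the shelf figure quoted in the text; the 1 kW step is an assumption.
    print(f"65 J against a 1 kW step: {ride_through_s(65.0, 1000.0) * 1e3:.0f} ms")
    # Hypothetical bank: 0.1 F allowed to sag from 50 V to 40 V.
    e = capacitor_energy_j(0.1, 50.0, 40.0)
    print(f"0.1 F from 50 V to 40 V: {e:.0f} J usable")
```

Because usable energy depends on the allowed voltage sag as well as the capacitance, the same joules cost very different volume at different layers, which is what makes the allocation a design decision rather than a lookup.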

Deep dive: why ramp smoothing is a goodput decision, not just a grid courtesy

It is tempting to file power-transient smoothing under 'being a good grid citizen' — soften the spikes so the utility is happy. That framing undersells it. The smoothing has a direct goodput consequence. Two reasons. First, an operator whose load presents a violent, synchronized di/dt to the grid is a worse interconnection counterparty: utilities increasingly require ride-through capability and may impose ramp-rate limits or worse terms (or deny capacity) to loads that destabilize the local grid — and interconnection capacity is the scarcest asset in the project, per the queue dynamics in Chapter 7.11's procurement framing. Smoothing the transient literally buys you faster, cheaper power.

Second, the on-die end of the same problem caps clock. If the PDN cannot hold the rail through the worst-case di/dt, the silicon must either run at a higher guard-band voltage (burning power, lowering perf/watt) or drop clock under transient via adaptive clocking (lowering throughput directly). A GPU that droops is a GPU that is not computing at its rated rate — pure goodput loss. So the same capacitance and load-line investment that smooths the grid-facing waveform also lets the silicon run at its lowest reliable voltage and highest reliable clock. The chip-side power integrity and the facility-side transient mitigation are two ends of one optimization, and treating them separately is how operators end up over-spending on facility storage to paper over a chip-rail that was under-decoupled — or vice versa.
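The cost of a guard band can be estimated with the standard CMOS dynamic-power relation, P proportional to V²·f. This sketch ignores leakage and assumes throughput scales with clock; the 0.70 V nominal rail, the 30 mV guard band, and the 5% clock stretch are hypothetical.

```python
def dynamic_power_ratio(v_new, v_ref, f_new=1.0, f_ref=1.0):
    """Relative dynamic power under P ~ V^2 * f (leakage ignored)."""
    return (v_new / v_ref) ** 2 * (f_new / f_ref)


def perf_per_watt_ratio(v_new, v_ref, f_new=1.0, f_ref=1.0):
    """Relative perf/W, assuming throughput scales linearly with clock."""
    return (f_new / f_ref) / dynamic_power_ratio(v_new, v_ref, f_new, f_ref)


if __name__ == "__main__":
    # Option A (hypothetical): add a 30 mV guard band to a 0.70 V rail, same clock.
    p = dynamic_power_ratio(0.73, 0.70)
    print(f"guard band: power x{p:.3f}, perf/W x{perf_per_watt_ratio(0.73, 0.70):.3f}")
    # Option B (hypothetical): keep 0.70 V but stretch the clock 5% during droops.
    print(f"clock stretch: throughput x{0.95:.2f} while stretched")
```

Either branch is a goodput loss: the guard band spends roughly 9% more dynamic power for the same work, and the clock stretch gives up throughput directly, so a PDN that needs neither recovers both.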

The fork at every layer: attenuate here or pass it up

From transistor to substation, the recurring decision is identical: do you absorb the transient at this layer, or export it to the next? Absorbing it costs local capacitance/storage/silicon area; exporting it costs someone upstream more, in bulkier reservoirs or worse grid terms. The disciplined answer is the fastest, smallest reservoir that can hold each frequency band — on-die for the nanosecond edge, package decoupling for microseconds, shelf capacitance for milliseconds, BBU/BESS and ramp-rate firmware for the slow envelope. Mis-allocate and you either strand silicon performance (under-decoupled rail, throttling) or strand capital (over-built facility storage compensating for a chip problem). The on-package layer is where this budget starts, and getting it right is the precondition for every layer above it being affordable.
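The "fastest, smallest reservoir that can hold each band" rule can be written as a lookup by event duration. The five tiers are the ones named in this chapter; the numeric boundaries between them (10 ns, 1 µs, 1 ms, 1 s) are assumptions introduced here, since the text gives only order-of-magnitude bands.

```python
# (upper bound on event duration in seconds, reservoir) ordered fastest first.
# Boundary values are illustrative assumptions, not figures from the text.
RESERVOIRS = [
    (1e-8, "on-die MIM/deep-trench caps"),
    (1e-6, "package + board decoupling"),
    (1e-3, "on-shelf energy storage"),
    (1.0, "rack/row BBU"),
]
SLOWEST = "facility BESS + ramp-rate caps"


def fastest_reservoir(event_duration_s):
    """Return the smallest, fastest tier whose band covers an event of this duration."""
    if event_duration_s <= 0:
        raise ValueError("event_duration_s must be positive")
    for upper_bound_s, name in RESERVOIRS:
        if event_duration_s < upper_bound_s:
            return name
    return SLOWEST


if __name__ == "__main__":
    for duration in (2e-9, 200e-9, 500e-6, 0.2, 10.0):
        print(f"{duration:8.1e} s -> {fastest_reservoir(duration)}")
```

In a real allocation the boundaries overlap and are set by each tier's impedance and energy capacity rather than by fixed cutoffs, so this is best read as a way to state the rule, not to size anything.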

Where this sits in the power chain

On-package power delivery is the terminal node of a chain that this guide reads end-to-end. Upstream of the package, the 800 VDC spine, the disaggregated sidecar power, and the busbar that feed the shelf are covered in Chapter 4.7; the LV distribution, PDUs, and rack power that deliver 48V to the board are in Chapter 4.6; and the UPS / energy-storage layer that absorbs the slow envelope of the transient — the facility end of the di/dt spine that starts on the die in this chapter — is in Chapter 4.5. The accelerator whose rail we are feeding is taxonomized in Chapter 7.1; the thermal path that now runs through the backside power layer is engineered in Chapter 5.4 against the density wall of Chapter 5.1. The forward roadmap — where core current, VPD adoption, and BSPDN go through 2030 — is consolidated in Chapter 16.2.

The chip→BBU→BESS spine is read from both ends: the on-die di/dt origin lives here; facility transient absorption, ride-through, and ramp-rate compliance are engineered in Chapter 4.5 (which back-refs here). The 48V/800VDC distribution that feeds the package is in Chapter 4.6 and Chapter 4.7. The accelerator itself is in Chapter 7.1; selection and procurement economics — including why grid-friendly load shapes earn better interconnection terms — in Chapter 7.11. The thermal coupling that BSPDN and VPD force is in Chapter 5.4 and Chapter 5.1; the scale-up fabric whose collectives drive the synchronized load step is in Chapter 8.2; checkpoint-tolerant training behavior that the redundancy posture leans on is in Chapter 9.4; and the consolidated 2026→2030 subsystem roadmap is in Chapter 16.2.