Chapter 7.7

Advanced Packaging & the Integration Substrate

The accelerator you can buy is not set by how fast a fab can print logic — it is set by how large an interposer a packaging house can yield, because the package is the substrate on which compute, memory, and bandwidth are physically integrated, and in 2026 it is the most-cited binding constraint on AI compute through the end of the decade.

GOODPUT · DENSITY-RAMP

What you'll decide here

Which 2.5D packaging family (CoWoS-S silicon interposer, CoWoS-R RDL, or CoWoS-L stitched silicon bridge) your accelerator targets — because that single choice sets the reticle-multiple ceiling, the HBM-stack count per package, the yield curve, and which OSAT can build it.
Whether to stay monolithic or disaggregate into chiplets across a UCIe die-to-die fabric — trading the yield and reuse upside of small dies against the area, power, and latency tax of every die crossing.
How many HBM stacks the program needs (8, 12, or 16+) and therefore which reticle-multiple and interposer technology that demands — the package-area math, not the memory contract, is the real gate on capacity-per-package.
Whether to design around silicon, RDL, or glass-core interposers, and how that reach/cost/yield triad propagates into per-die cost, large-package warpage, and the thermal envelope your cooling plant must clear.
Where in the supply chain your real lead-time risk lives — almost never at final assembly, almost always at the CoWoS allocation gate that sits upstream of it (and is treated as a procurement problem in Chapter 2.3, not here).

"How fast is this chip?" used to be answered at the transistor: a smaller node, more transistors, a faster clock. That era is functionally over for AI silicon. A single reticle field — the largest area a lithography scanner can pattern in one shot — is about 858 mm² (roughly 26 × 33 mm), and the largest useful monolithic die is already pressed against that limit. An H100-class GPU is essentially a full-reticle die. You cannot make the logic meaningfully bigger by printing it; you have hit the reticle wall. Everything that has happened to AI accelerator performance since is, at the physical level, a packaging story: how to stitch multiple reticle-sized fields together, how to bolt a dozen HBM stacks beside the logic, and how to wire it all with enough bandwidth that the assembled system behaves like one chip. The package is the integration substrate, and the firm that controls the largest, highest-yielding package controls how much compute the world can ship.

This chapter is the engineering of that substrate. We lay out the 2.5D/3D packaging taxonomy and why it is the most-cited binding constraint on AI compute through 2030; the CoWoS-S/R/L families and SoIC / hybrid bonding, and the reticle-stitching and per-package-area ceilings each imposes; the interposer fork (silicon vs RDL vs glass) and its reach/cost/yield triad; the way HBM-stack-count-per-package falls directly out of interposer area — the engineering driver behind the allocation view of Chapter 7.6; chiplet disaggregation and UCIe and the economics of every die-to-die crossing; and the thermal, warpage, and yield consequences of building packages the size of a coaster. Procurement and allocation of CoWoS capacity stay where they belong — in Chapter 7.6 and Chapter 2.3 — and we point there rather than re-litigate them. Here we own the physics.

Why packaging became the binding constraint

The constraint moved upstream the same way power did in the facility. The industry assumed the gate on accelerator supply was wafer-out logic — how many GPUs a leading-edge fab could print. It is not. The gate is advanced packaging capacity, specifically TSMC's CoWoS lines, which through 2026 are fully booked across both the CoWoS-S and CoWoS-L families. Total 2026 CoWoS demand is estimated near 1 million wafers, against a capacity ramping toward roughly 110–130k wafers/month by year-end, up from ~13k/month at the end of 2023 (TrendForce; Silicon Analysts; TSMC reporting, 2026). A logic wafer that cannot be packaged is not an accelerator. That is why a chip company can have ample N3/N5 logic allocation and still ship late: the bottleneck is the CoWoS slot, and the CoWoS slot is set by interposer area and stack count, which is the engineering this chapter governs.
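The capacity figures above can be put side by side with a few lines of arithmetic. This is a minimal sketch that uses only the estimates quoted in the paragraph; the function names are illustrative, and the shape of the ramp within 2026 is not given in the text, so no month-by-month curve is modeled.

```python
# Figures quoted in the text (third-party estimates, not measurements).
CAPACITY_END_2023_WPM = 13_000              # wafers per month, end of 2023
CAPACITY_END_2026_WPM = (110_000, 130_000)  # low/high year-end 2026 range
DEMAND_2026_WAFERS = 1_000_000              # estimated full-year 2026 demand


def ramp_multiple(start_wpm: float, end_wpm: float) -> float:
    """How many times larger the monthly capacity becomes."""
    return end_wpm / start_wpm


def average_wpm_needed(annual_demand_wafers: float) -> float:
    """Average monthly output needed to serve a full year of demand."""
    return annual_demand_wafers / 12


needed = average_wpm_needed(DEMAND_2026_WAFERS)
print(f"Average output needed in 2026: {needed:,.0f} wafers/month")
for end_wpm in CAPACITY_END_2026_WPM:
    multiple = ramp_multiple(CAPACITY_END_2023_WPM, end_wpm)
    print(f"Year-end {end_wpm:,} wpm is {multiple:.1f}x end-2023 capacity")
```

The year-end run rate sits above the average the demand estimate implies, but that rate is only reached at year-end. How much of the year runs below it depends on the ramp shape, which is why the lines can be fully booked even as capacity grows.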

The tradeoff is sharp. Choosing a more aggressive package — more reticles stitched together, more HBM stacks — buys performance, but it consumes disproportionately more CoWoS capacity per accelerator (a 5.5×-reticle part eats far more interposer area, and yield risk, than a 2× part) and pushes the program onto the newest, least-mature, most-allocation-starved process. Choosing a conservative package preserves yield and supply but caps memory bandwidth and compute area. There is no free axis: the package is where the GOODPUT ambition of the silicon team collides with the DENSITY-RAMP reality of the packaging line.

The 2.5D / 3D taxonomy

Advanced packaging splits cleanly into two regimes, and the fork between them is the first decision an architecture team makes. 2.5D places multiple dies side-by-side on a shared horizontal carrier — an interposer — that routes thousands of fine wires between them. This is how logic meets HBM today: the GPU die and its HBM stacks sit beside each other on a silicon (or RDL, or glass) interposer, with HBM's wide thousand-plus-bit bus fanned out across that carrier. CoWoS is the canonical 2.5D flow. 3D stacks active dies vertically, face-to-face, bonded directly so that a die sits on top of another die rather than beside it. TSMC's SoIC and the hybrid-bonding processes underneath it are the 3D flow; AMD's MI300 stacks compute chiplets on base dies this way, and the HBM stack itself is a 3D structure internally.

The consequence of the fork is geometric. 2.5D buys you area — you spread components across a large flat carrier, and the limit is how large an interposer you can yield. 3D buys you proximity and density — wires between stacked dies are microns long instead of millimeters, slashing interconnect energy and latency, but you inherit a severe thermal problem (the die on top has to dump its heat through the die below) and a yield-multiplication problem (a bad die anywhere in the stack can scrap the whole assembly). Real 2026 accelerators are hybrids: a 2.5D interposer carrying 3D-stacked logic and 3D-stacked HBM. The package is two integration technologies at once.

CoWoS family fork: which 2.5D flow, and what ceiling it sets

Family	Interposer / carrier	Reticle-multiple ceiling (2026)	HBM stacks	Best fit	Cost & yield posture
CoWoS-S	Full silicon interposer (TSV-bearing)	~3.3× reticle (mature); pushing higher	8–12	Today's mainstream GPU + 8 HBM (H100/H200-class)	Highest interposer cost; best signal integrity; mature yield
CoWoS-R	RDL interposer (organic, no large Si)	Lower than -S; smaller spans	Up to ~8	Cost-sensitive parts; fewer stacks; warpage-tolerant	Cheaper carrier; relaxed routing; warpage easier to manage
CoWoS-L	Stitched local-silicon-interconnect (LSI) bridges in RDL	~5.5× now (100×100 mm), ~9.5× on roadmap (2027)	12 now; 16+ on roadmap	Reticle-stitched multi-die GPUs + many HBM (B200/Rubin-class)	Most area for the money; stitching adds new yield-loss modes

TSMC CoWoS variants are the dominant 2.5D flows for AI accelerators in 2026. Reticle-multiple and HBM-stack ceilings are TSMC roadmap/symposium figures; OSAT alternatives (ASE, Amkor, Samsung) exist but trail on the largest reticle multiples. "Reach" is max routable die-to-die span.

The table is the central engineering fork of the chapter. CoWoS-S uses one big slab of TSV-bearing silicon as the interposer: the cleanest option electrically, but a large silicon interposer is expensive and its own reticle-stitching limits cap how big it can get. CoWoS-R drops the silicon for an organic RDL carrier, which is cheaper and more warpage-tolerant but routes more coarsely and carries fewer stacks; it is a deliberate down-spec for cost-sensitive parts. CoWoS-L is the 2026 frontier: it abandons the single monolithic silicon interposer for an RDL carrier with small local silicon interconnect bridges embedded only where dense die-to-die routing is needed. That lets the package span far more than a single reticle (~5.5× in volume now, with a ~9.5× node on the 2027 roadmap aimed at 16+ HBM stacks) while spending silicon only where it earns its cost. The consequence of choosing -L is that you inherit a new family of yield-loss mechanisms (bridge-to-RDL alignment, stitching seams, larger-package warpage) in exchange for the area no single silicon interposer can give you.
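To make the reticle multiples concrete, the sketch below converts each one into an approximate interposer area using the ~26 × 33 mm field quoted earlier. It assumes that "N× reticle" means N times the single-field area; the dictionary labels are illustrative names, not TSMC product designations.

```python
RETICLE_W_MM = 26.0
RETICLE_H_MM = 33.0
RETICLE_AREA_MM2 = RETICLE_W_MM * RETICLE_H_MM  # about 858 mm^2

# Reticle multiples taken from the table above.
RETICLE_MULTIPLE = {
    "CoWoS-S (mature)": 3.3,
    "CoWoS-L (2026 volume)": 5.5,
    "CoWoS-L (2027 roadmap)": 9.5,
}


def interposer_area_mm2(reticle_multiple: float) -> float:
    """Approximate interposer area, assuming area scales with the multiple."""
    return reticle_multiple * RETICLE_AREA_MM2


for name, multiple in RETICLE_MULTIPLE.items():
    area = interposer_area_mm2(multiple)
    print(f"{name}: {multiple}x reticle is about {area:,.0f} mm^2")
```

The 100 × 100 mm figure quoted beside the 5.5× entry is a 10,000 mm² outline, which is larger than this estimate. It presumably describes the package substrate rather than the interposer itself, though that reading is an assumption.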

SoIC, hybrid bonding, and the 3D ceiling

Where 2.5D spreads outward, SoIC (System-on-Integrated-Chips) stacks upward using hybrid bonding: a bumpless, copper-to-copper, dielectric-to-dielectric direct bond that eliminates the solder microbumps of older 3D stacking. The payoff is interconnect density. Classic microbump pitch is ~40–50 µm; hybrid bonding takes the bond pitch into the single-digit-micron range and, on the roadmap, sub-micron. Because connection density scales with the inverse square of pitch, that is one to two orders of magnitude more vertical connections per unit area at today's pitches, and more again at sub-micron. This density is what lets two dies behave electrically as if they were one: picojoule-per-bit interconnect, full-bandwidth die-to-die, latency measured in fractions of a clock. AMD's MI300-class parts stack compute chiplets on a base die this way, and HBM itself is becoming a hybrid-bonding story internally as stacks climb to 12-Hi and 16-Hi.
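The density claim follows from geometry: on a square grid, connections per unit area scale with the inverse square of the pitch. The sketch below assumes a uniform square grid, which real bond maps only approximate, and the pitches chosen are sample points from the ranges quoted above.

```python
def bonds_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for a uniform square grid at the given pitch."""
    bonds_per_mm = 1000.0 / pitch_um
    return bonds_per_mm ** 2


def density_gain(old_pitch_um: float, new_pitch_um: float) -> float:
    """Ratio of connection density when pitch shrinks from old to new."""
    return (old_pitch_um / new_pitch_um) ** 2


# Sample pitches: microbump, single-digit-micron hybrid bond, roadmap pitch.
for pitch_um in (45.0, 9.0, 1.0):
    print(f"{pitch_um:>5.1f} um pitch: {bonds_per_mm2(pitch_um):,.0f} bonds/mm^2")

print(f"45 um to 9 um gain: {density_gain(45.0, 9.0):.0f}x")
```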

The 3D ceiling is thermal, and it is a hard one. When you put logic on top of logic, the upper die's heat must traverse the lower die to reach the cold plate, and power density at the bond interface can exceed what any external cooling can extract without throttling. This is why 3D-on-logic is selective — you stack the parts that benefit most from proximity (cache, certain compute tiles) and keep the hottest, highest-power logic where it can see the cold plate directly. The decision to go 3D is a decision to make the cooling problem of Part 5 harder in exchange for interconnect you cannot get any other way. The forward pointer is explicit: large-package hotspot and heat-flux behavior is engineered in Chapter 5.1 (the density wall) and removed by the cold plate in Chapter 5.4.

The interposer fork: silicon vs RDL vs glass

Underneath the CoWoS branding sits the real material decision: what is the carrier made of? Three answers, three reach/cost/yield triads. Silicon interposers route the finest lines and carry TSVs for power and signal straight through — best signal integrity, the obvious choice for HBM's wide bus — but a large silicon interposer is costly, is itself reticle-limited (you cannot pattern one bigger than a scanner field without stitching), and its CTE mismatch against the organic substrate drives warpage as it grows. RDL (organic redistribution-layer) carriers are far cheaper and more warpage-forgiving and can be made large, but route coarser and carry less bandwidth density — hence CoWoS-L's compromise of an RDL carrier with silicon bridges only where density is mandatory. Glass is the emerging third option: glass-core substrates and interposers promise the dimensional stability and flatness of silicon at lower cost and at panel scale, with excellent high-frequency loss characteristics — but the ecosystem is immature, with handling, via-formation, and crack-propagation risks still being industrialized through 2026.

Interposer material triad: reach, cost, yield, and what it costs you downstream

Carrier	Routing density / reach	Relative cost	Warpage / yield risk	2026 status	Downstream consequence
Silicon interposer	Highest density; best SI; reticle-stitch limited	Highest	CTE-mismatch warpage grows with area	Mainstream (CoWoS-S)	Largest CoWoS-capacity draw per part
RDL (organic)	Coarser routing; large carriers feasible	Lowest	Most warpage-tolerant; relaxed	Mainstream for cost parts (CoWoS-R) & as CoWoS-L base	Caps stacks/bandwidth; cheaper supply
Glass core	Silicon-like density; panel-scale, low loss	Mid (promised); ecosystem premium today	Flat & stable, but crack/handling risk	Pre-volume / ramping 2026–2027	Could relieve the area ceiling if it yields

Qualitative practitioner ranking, 2026. Glass is pre-volume for high-end AI accelerators; figures reflect the technology's promise and current maturity, not a shipped commodity.

HBM-stack-count is an interposer-area problem

This is the link that ties this chapter to the memory chapter, and it is worth stating as a near-identity: the number of HBM stacks per package is a function of interposer area, not of how much HBM you can buy. Each HBM stack occupies a fixed footprint beside the logic die and must be reached by a thousand-plus-wire bus fanned across the interposer. Eight stacks fit on a ~3×-reticle silicon interposer; getting to 12 stacks requires the ~5.5×-reticle area that only CoWoS-L delivers in volume in 2026; 16+ stacks need the next reticle-multiple node still ramping. So when Chapter 7.6 describes HBM as a top-3 BOM line and the binding allocation gate, the engineering reason a given accelerator carries the stack count it does lives here: the package area the program chose set the stack count, which set the memory capacity and bandwidth, which set where it lands on the memory roadmap.

The decision consequence is that memory ambition and packaging ambition are one decision. A team that commits to a 12-stack, ~288 GB HBM4 accelerator (Rubin-class, ~2 TB/s/stack) has by that act committed to a 5.5×-reticle CoWoS-L package, its yield curve, and its slice of the scarcest packaging capacity on earth. You cannot buy your way to more bandwidth without buying your way to more interposer area, and interposer area is exactly the thing in shortest supply. The procurement reflex — "order more HBM" — is necessary but not sufficient; without the CoWoS slot to mount it on, the HBM is inventory, not bandwidth. That allocation logic is owned by Chapter 2.3; the engineering driver is owned here.
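The coupling between stack count and memory can be written as two multiplications. This sketch takes the 12-stack, ~288 GB, ~2 TB/s/stack figures from the paragraph above; applying the same per-stack numbers to 8 and 16 stacks is an assumption made only for comparison, not a product specification.

```python
def package_memory(stacks: int, gb_per_stack: float, tbps_per_stack: float):
    """Return (capacity in GB, aggregate bandwidth in TB/s) for a package."""
    return stacks * gb_per_stack, stacks * tbps_per_stack


# Per-stack capacity implied by the text: ~288 GB across 12 stacks.
GB_PER_STACK = 288 / 12    # 24 GB
TBPS_PER_STACK = 2.0       # ~2 TB/s per stack, as quoted

for stacks in (8, 12, 16):
    capacity_gb, bandwidth_tbps = package_memory(
        stacks, GB_PER_STACK, TBPS_PER_STACK
    )
    print(f"{stacks} stacks: {capacity_gb:.0f} GB, {bandwidth_tbps:.0f} TB/s")
```

Each step in stack count is also a step in required interposer area, so the capacity and bandwidth columns cannot be raised without moving to a larger reticle multiple.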

Figure	What it measures	As of	Source
~858 mm²	Single reticle field (≈26×33 mm); the area limit every advanced-packaging technique exists to defeat	2026	Lithography scanner field limit (ASML/industry standard)
~5.5× → ~9.5×	CoWoS-L reticle multiple: 5.5× (100×100 mm, >98% yield) in volume now; 9.5× on the 2027 roadmap	2026	Tom's Hardware; TrendForce; 3DInCites (TSMC symposium)
12 → 16+	HBM stacks per package: 12 on 5.5×-reticle CoWoS-L (HBM3E/HBM4); 16+ on the next node	2026	TrendForce; Tom's Hardware (TSMC roadmap)
~110–130k wpm	TSMC CoWoS capacity ramping to year-end 2026, from ~13k/mo end-2023; the true accelerator-supply gate	2026	TrendForce; Silicon Analysts; FinancialContent
~1.0M wafers	Estimated 2026 CoWoS demand (vs ~370k in 2024); CoWoS-S and -L fully booked	2026	Silicon Analysts; TrendForce
<10 µm	Hybrid-bonding bond pitch (vs ~40–50 µm microbump), heading sub-micron; the density behind SoIC 3D stacking	2026	UCIe Consortium; Synopsys; TSMC SoIC
>20 Tbps/mm	UCIe die-to-die edge bandwidth density at 64 Gbps; up to ~300 TB/s/mm² areal in 3D hybrid-bonded configs	2026	Alphawave Semi; UCIe 2.0 spec
~288 GB	HBM4 capacity per Rubin-class package at ~2 TB/s/stack; the memory the package area enables	2026 (roadmap)	NVIDIA Developer (Rubin platform)

Chiplet disaggregation and UCIe

Once the reticle wall makes one big die impossible, the question becomes how to build a big system from small dies, and that is the chiplet decision. Disaggregation splits a would-be monolithic SoC into multiple dies ("chiplets") that are packaged together: compute tiles, I/O dies, cache dies, each potentially on a different process node tuned to its job. The upside is real and quantifiable. Yield falls off roughly exponentially with die area, so defect density punishes large dies brutally. A defect scraps only the small die it lands on, and bad dies are screened out before assembly, so four small dies deliver far more good silicon per wafer than one die of the same total area. You reuse a chiplet across many products. You put I/O on a cheap mature node and spend leading-edge wafers only on the logic that needs them. AMD productized this years ago; it is now the default architecture for the largest accelerators.
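The yield argument can be sketched with the simple Poisson die-yield model, Y = exp(-A·D0). The model choice, the 800 mm² total area, and the defect density of 0.1 defects/cm² are all illustrative assumptions; none of them come from the text or from any foundry.

```python
import math


def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: probability that a die has zero defects."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


D0 = 0.1               # defects per cm^2 (hypothetical)
TOTAL_AREA_MM2 = 800.0 # hypothetical near-reticle design
N_CHIPLETS = 4

monolithic = die_yield(TOTAL_AREA_MM2, D0)
chiplet = die_yield(TOTAL_AREA_MM2 / N_CHIPLETS, D0)

# With known-good-die screening, the usable fraction of silicon is the
# per-chiplet yield, because only the bad small dies are discarded.
print(f"Monolithic die yield:        {monolithic:.3f}")
print(f"Per-chiplet yield (screened): {chiplet:.3f}")

# Without screening, all four chiplets must be good at once, and the
# advantage disappears under this model.
print(f"Four unscreened chiplets:     {chiplet ** N_CHIPLETS:.3f}")
```

Under this model the gain comes entirely from testing dies before assembly: four untested chiplets are exactly as likely to be all good as the single large die.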

The cost of disaggregation is paid at every die-to-die crossing, and this is where UCIe (Universal Chiplet Interconnect Express) enters. A monolithic SoC routes between blocks for free, on-die. A chiplet system must serialize, drive across a physical die boundary, and deserialize, spending area on PHYs, power on the link (picojoules per bit that add up across terabytes per second), and latency on the crossing. UCIe standardizes that interface so chiplets from different vendors and nodes interoperate. It defines a "standard package" profile for conventional organic substrates, an "advanced package" profile for 2.5D interposers and bridges, and a UCIe-3D profile for hybrid-bonded stacks, with edge bandwidth density reaching >20 Tbps/mm and, in 3D, areal densities into the hundreds of TB/s/mm². The strategic consequence of UCIe is an open chiplet market: a buyer can, in principle, assemble an accelerator from best-of-breed dies rather than one vendor's monolith. This is the packaging-era analog of the open-system disaggregation Part 8 describes for the network.
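Edge bandwidth density turns directly into a beachfront budget: aggregate bandwidth is density times the die-edge length given to the link. The sketch below uses the >20 Tbps/mm figure quoted above as a round 20; the 10 mm edge and the 10 TB/s target are hypothetical values chosen for illustration.

```python
BITS_PER_BYTE = 8


def edge_bandwidth_TBps(density_tbps_per_mm: float, edge_mm: float) -> float:
    """Aggregate die-to-die bandwidth (terabytes/s) for a given edge length."""
    return density_tbps_per_mm * edge_mm / BITS_PER_BYTE


def edge_needed_mm(target_TBps: float, density_tbps_per_mm: float) -> float:
    """Die-edge length (mm) needed to carry a target bandwidth."""
    return target_TBps * BITS_PER_BYTE / density_tbps_per_mm


DENSITY_TBPS_PER_MM = 20.0  # terabits/s per mm of die edge, from the text

print(f"10 mm of edge carries {edge_bandwidth_TBps(DENSITY_TBPS_PER_MM, 10.0)} TB/s")
print(f"10 TB/s needs {edge_needed_mm(10.0, DENSITY_TBPS_PER_MM)} mm of edge")
```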

Deep dive: the die-to-die tax, quantified — when disaggregation stops paying

Disaggregation is not free; the break-even is an engineering calculation. Every signal that would have stayed on-die in a monolith now crosses a die boundary, and that crossing costs three things. Area: the D2D PHY (the SerDes or parallel interface) consumes silicon on both dies — beachfront along the die edge that could have been compute. Power: on-die wires cost a fraction of a picojoule per bit; even an excellent D2D link costs more, and at the multi-terabyte-per-second bandwidths a GPU needs internally, the aggregate link power is a real fraction of the package budget. Latency: serialize-cross-deserialize adds cycles that on-die routing never pays.

Hybrid bonding changes the arithmetic. Because it pushes bond pitch into the single-digit-micron range, the UCIe-3D profile gets the per-bit energy and latency close enough to on-die that the crossing nearly disappears, which is precisely why the most aggressive disaggregation (many small dies) pairs with hybrid bonding rather than microbump 2.5D. The decision rule that falls out: disaggregate when the yield-and-reuse savings on the dies exceed the area/power/latency tax of the links between them, and reach for hybrid bonding when the chiplet count is high enough that microbump D2D would eat the savings. A two-die split over a coarse interface can be a net loss; a many-chiplet design over hybrid bonding is the architecture of the frontier accelerator. The standards-war analog for the scale-up fabric (NVLink Fusion vs UALink) lives in Part 8; here the fork is purely about the package.
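The power leg of the die-to-die tax and the decision rule can be sketched in a few lines. Link power is bandwidth times energy per bit; the 10 TB/s bandwidth and the 0.5 and 0.05 pJ/bit energies below are hypothetical values chosen to contrast a coarse link with a near-on-die one, and the comparison function assumes both sides have already been expressed in one common unit such as dollars per good package.

```python
def link_power_watts(bandwidth_TBps: float, energy_pj_per_bit: float) -> float:
    """Power spent moving data across a die boundary."""
    bits_per_second = bandwidth_TBps * 1e12 * 8
    return bits_per_second * energy_pj_per_bit * 1e-12


def disaggregation_pays(yield_and_reuse_savings: float, link_tax: float) -> bool:
    """Decision rule from the text; both inputs must share one unit."""
    return yield_and_reuse_savings > link_tax


BANDWIDTH_TBPS = 10.0  # hypothetical internal die-to-die traffic

for label, pj_per_bit in (("coarse link", 0.5), ("near-on-die link", 0.05)):
    watts = link_power_watts(BANDWIDTH_TBPS, pj_per_bit)
    print(f"{label}: {watts:.0f} W at {pj_per_bit} pJ/bit")
```

The order-of-magnitude gap between the two results is the reason chiplet count and bond technology have to be chosen together.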

Thermal, warpage, and yield: the consequences of a coaster-sized package

Large packages punish you three ways, and each is a downstream cost of the area you bought to defeat the reticle wall. Warpage is the first: a large silicon interposer on an organic substrate has a coefficient-of-thermal-expansion mismatch, and as the assembly heats and cools it bows. Past a certain span the bow threatens the solder joints to the board and the bond integrity across the package — which is a major reason CoWoS-L's RDL-plus-bridge construction exists (the organic carrier is more compliant) and why glass cores are attractive (dimensionally stable and flat). Warpage is not a yield footnote; it is a first-order limit on how large a package you can build at all.
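The scaling behind the warpage problem is linear in span: unconstrained differential expansion is ΔL = Δα · L · ΔT. The sketch below assumes about 2.6 ppm/K for silicon, 15 ppm/K for an organic substrate, and a 200 K temperature excursion; the last two are illustrative assumptions that vary by material and process. The result is the mismatch the assembly must absorb, not a warpage prediction, since actual bow depends on layer stiffness and thickness.

```python
def differential_expansion_um(
    span_mm: float,
    cte_a_ppm_per_k: float,
    cte_b_ppm_per_k: float,
    delta_t_k: float,
) -> float:
    """Unconstrained length mismatch (micrometres) between two materials."""
    mismatch_per_k = abs(cte_a_ppm_per_k - cte_b_ppm_per_k) * 1e-6
    return mismatch_per_k * (span_mm * 1000.0) * delta_t_k


CTE_SILICON = 2.6    # ppm/K, approximate
CTE_ORGANIC = 15.0   # ppm/K, assumed; laminates vary
DELTA_T_K = 200.0    # assumed assembly temperature excursion

for span_mm in (50.0, 100.0):
    mismatch = differential_expansion_um(
        span_mm, CTE_SILICON, CTE_ORGANIC, DELTA_T_K
    )
    print(f"{span_mm:.0f} mm span: about {mismatch:.0f} um of mismatch")
```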

Yield is the second, and large packages multiply it sharply. A 12-HBM, multi-reticle, stitched package is the product of many independent yields (each die, each HBM stack, each bond, each stitch seam), and a defect anywhere late in the flow scraps an assembly carrying enormous accumulated value (a dozen HBM stacks and multiple known-good logic dies). This is why known-good-die testing before assembly is non-negotiable and why the largest packages carry the largest scrap-cost exposure: you are gambling the most expensive components on the last and least-reversible step. Thermal is the third: a large, dense package concentrates hundreds of watts into a small footprint with internal hotspots over the 3D-stacked regions, producing heat fluxes that air cannot remove and that even direct-to-chip liquid must be engineered for. The package's heat-flux map is the boundary condition the cold plate inherits — the explicit hand-off to Chapter 5.1 and Chapter 5.4.
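The yield multiplication can be sketched as a product of per-step yields, with scrap exposure as the failure probability times the value already committed to the assembly. Every yield and the value at risk below are hypothetical, and the steps are treated as independent, which the text states but real flows only approximate.

```python
from math import prod


def package_yield(step_yields) -> float:
    """Compound yield when every step must succeed independently."""
    return prod(step_yields)


def expected_scrap_cost(compound_yield: float, value_at_risk: float) -> float:
    """Expected loss per package started, in the units of value_at_risk."""
    return (1.0 - compound_yield) * value_at_risk


# Hypothetical flow: 2 logic-die attaches, 12 HBM attaches,
# one stitch/bridge step, one final assembly step.
steps = [0.99] * 2 + [0.995] * 12 + [0.98] + [0.99]
VALUE_AT_RISK = 10_000.0  # hypothetical value of parts on the package

compound = package_yield(steps)
print(f"Compound package yield: {compound:.3f}")
print(f"Expected scrap per start: {expected_scrap_cost(compound, VALUE_AT_RISK):,.0f}")
```

Even with every individual step at 98% or better, the compound figure lands well below any single step, which is the arithmetic behind known-good-die testing.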

Bigger package vs more packages: the goodput fork

When you need more compute and memory in a scale-up domain, you can build a bigger package (more reticles, more stacks, more 3D) or more packages wired together by the scale-up fabric. The bigger package wins on interconnect — on-interposer and hybrid-bonded links are faster and cheaper-per-bit than any board- or cable-level NVLink — but loses on yield, warpage, thermal density, and CoWoS-capacity draw, and every increment pushes onto the newest, most-allocation-starved process node. More packages win on yield and supply and let you spend capacity you can actually get, but pay the goodput tax of crossing the slower scale-up fabric between them. In 2026 the frontier answer is "both, to the limit of yield": push the package as large as it yields, then tie packages together with the densest scale-up fabric available. Where that fabric boundary sits — and the copper-vs-optics decision at the rack — is owned by Chapter 7.13 and Part 8.

How the package decision propagates

Walk the chain forward and the package sits at the center of it. The reticle-multiple you choose sets the interposer area; interposer area sets the HBM-stack count, which sets memory capacity and bandwidth (→ Chapter 7.6); the reticle-multiple also sets which CoWoS family and OSAT can build it, which sets yield and lead time (→ Chapter 2.3); the package's power and heat-flux map sets the cooling boundary condition (→ Chapter 5.1, Chapter 5.4); and the on-package power-delivery and di/dt behavior that a big, dense package demands is its own engineering problem (→ Chapter 7.12). The package is not a step in the flow; it is the substrate every other accelerator decision is mounted on. Get its reticle-multiple and family wrong and you have either an accelerator you cannot get capacity to build or one whose memory and compute ambition you under-shot — and unlike a board respin, a package family change is a multi-quarter, multi-million-dollar reset.

Deep dive: why a CoWoS-L respin is a worse mistake than a node respin

It is tempting to treat the package as a back-end detail you can adjust late. The opposite is true: in 2026 the package family is one of the least-reversible decisions in an accelerator program, for two reasons. First, allocation. CoWoS-S and CoWoS-L slots are booked quarters ahead and fully subscribed; changing reticle-multiple or family mid-program does not just mean re-engineering — it means re-queuing for capacity that is already someone else's, against a roadmap clock that does not stop. A node change at least keeps you in a wafer queue you may already hold; a package-family change can put you at the back of a line for the single scarcest manufacturing resource in AI hardware.

Second, co-design depth. The HBM stack count, the interposer routing, the chiplet floorplan, the power-delivery network, and the thermal solution are co-designed around the chosen package; pulling the reticle-multiple unwinds all of them. A team that scoped for 8 stacks on CoWoS-S and discovers it needs 12 is not editing a parameter — it is changing interposer family, re-floorplanning, re-doing power and thermal, and re-entering the allocation queue. The discipline is the same one Part 1 preaches for the facility: identify the irreversible decision (here, reticle-multiple and CoWoS family) and over-specify or hedge it at scoping time, because re-deciding it costs a generation. The procurement-side mitigation — deposits, allocation locks, design-for-substitution — is owned by Chapter 2.3; the engineering reason it is irreversible is owned here.

The package is the substrate the rest of the silicon stack mounts on. HBM as a top-3 BOM line and the allocation gate it forms is owned by Chapter 7.6; this chapter is the engineering driver behind its stack-count-per-package view. CoWoS/HBM long-lead procurement, deposits, and design-for-substitution live in Chapter 2.3. The hyperscaler XPUs and custom ASICs that make these packaging choices are profiled in Chapter 7.4 and Chapter 7.5. The on-package power delivery and di/dt physics a large dense package demands are in Chapter 7.12; the host-attach and system composition around the package in Chapter 7.8. The large-package heat-flux problem this chapter hands off is engineered in Chapter 5.1 and removed by direct-to-chip liquid in Chapter 5.4. The package-vs-fabric boundary — when to build a bigger package vs wire more packages together — is set at the rack in Chapter 7.13. The consolidated packaging-and-memory roadmap through 2030 is in Chapter 16.2.