Chapter 11.3

Supply-Chain Security & Hardware Provenance

A GPU is the most valuable, most counterfeited, most tamperable industrial component on the planet right now — so the security boundary of an AI data center does not begin at the cage door, it begins at the fab and ends at the shredder, and every link you cannot prove is a link an adversary can substitute.

DENSITY-RAMP · GOODPUT

What you'll decide here

Where you draw the trust boundary in your supply chain — at the loading dock (cheap, blind to in-transit and upstream tampering) or at the silicon (provenance-verified from fab to rack, the only posture that survives a nation-state adversary) — because that choice sets every control below it.
Whether you treat provenance as paperwork or as cryptographic evidence — a signed platform certificate and a measured manifest you can re-verify on the floor, versus a PDF certificate of conformance you cannot — because only the former detects a swapped component after delivery.
Which firmware-assurance bar you require of suppliers — an OCP S.A.F.E. audit and a Reference Integrity Manifest you can attest against, or a vendor's word — because firmware is the implant surface that survives every disk wipe and OS reinstall.
How you sanitize media at end-of-life (the IEEE 2883 / NIST SP 800-88 Rev 2 Purge or Destroy that survives the wear-leveling and over-provisioning of modern SSDs, versus an overwrite that does not), because a single recoverable weights drive in the resale stream undoes every upstream control.
Which links you instrument with tamper-evidence and chain-of-custody now (irreversible to retrofit after a breach) versus which you can defer, and what your country-of-origin and export posture forces on both.

Every other chapter in Part 11 defends the data center as it runs. This one defends the data center before it runs — and after it stops. The threat model here is not the attacker who breaches a live system but the one who arrives inside the hardware you bought, having compromised a component, a firmware image, or a shipment somewhere along a supply chain that now spans a fab in Taiwan, HBM stacks from Korea, substrate and CoWoS packaging, a board house, an integrator, two freight forwarders, and a customs broker — any one of which is a place to substitute a counterfeit, solder in an implant, or flash a malicious firmware that survives every wipe you will ever perform.

The reason this matters more for AI infrastructure than for any prior generation of IT is asset-value density. A single GB200 NVL72 rack is roughly $3–4M of silicon in 1.36 tonnes; a frontier training cluster concentrates billions of dollars of the most allocation-constrained, most export-controlled, most counterfeited components in the world into a few halls. That makes the supply chain both the highest-value target and the longest, least-observable attack surface in the building. This chapter covers that surface: supplier vetting and country-of-origin risk; tamper-evident logistics and chain-of-custody; cryptographic device provenance (NIST SP 1800-34), HBOM/SBOM and the Reference Integrity Manifest, and OCP S.A.F.E.; and secure decommissioning, media sanitization, and the data-remanence trap on the way out. The through-line: a provenance claim you cannot independently re-verify is not provenance; it is hope with a logo on it.

The master fork: where does your trust boundary start?

Every supply-chain control you will or won't buy is downstream of a single question: where do you begin to trust the hardware? There are three defensible answers, and they are not equally expensive or equally strong.

Trust at the dock is the legacy default: you accept the integrator's certificate of conformance, inspect the pallet for obvious damage, and rack it. It is cheap and fast, and it is blind to everything that happened upstream: a counterfeit memory module that passes power-on, an implant added at a transshipment point, a firmware image flashed before the truck arrived.

Trust at acceptance test adds a provenance gate at receiving: you scan each platform, build a hardware manifest, and compare it cryptographically against a signed platform certificate the vendor stored in the device (the NIST SP 1800-34 model). This catches component substitution and in-transit tampering, but only if the vendor populated provenance at manufacture, which is a procurement term you must win at the contract, not the dock.

Trust at the silicon is the strongest and the most demanding: a hardware root of trust on each device measures its own firmware at boot and attests it against a Reference Integrity Manifest, so provenance is not a one-time receiving check but a property you can re-verify continuously over the asset's life (the bridge into Chapter 11.4).

The consequence of choosing wrong is asymmetric. Pick dock-trust to save money and a nation-state-tier adversary — RAND's highest attacker tier, the relevant one for frontier weights — walks an implant past you that no perimeter, no network segmentation, and no confidential-computing TEE downstream can detect, because the compromise is beneath all of them. Pick silicon-trust and you pay an upfront premium in procurement terms, receiving labor, and tooling, but you have moved the security boundary to the only place that survives a sophisticated supply-chain attack. The fork is irreversible in practice: provenance you did not capture at manufacture cannot be reconstructed after delivery. → asset taxonomy and attacker tiers in Chapter 11.1.

Provenance is a procurement term, not a receiving task

The single most common and most expensive mistake in this domain is treating provenance as something the security team does at the loading dock. It is not — by the time the pallet arrives, the evidence either exists or it never will. Cryptographic provenance requires the vendor to generate a signed platform certificate, populate a hardware bill of materials (HBOM), and embed measurable identity at manufacture, and to maintain a chain of custody you can audit through every integrator and forwarder. None of that is a thing you can add at receiving; all of it is a thing you must win in the purchase contract as an acceptance criterion, with a right-to-reject for unverifiable lots. Decide your provenance bar before the PO, not after the truck. The same logic governs firmware assurance and tamper-evidence: they are specified upstream or they are absent.

The threat surface: counterfeits, implants, and tampering-in-transit

Three distinct threats live on the supply chain, and conflating them produces the wrong controls. Counterfeiting is the substitution of a fake or repurposed part for a genuine one — re-marked older silicon, recycled e-waste components, cloned modules, factory rejects sold as good. It is overwhelmingly an open-market-procurement risk: when allocation is tight and you buy GPUs, HBM, or power components from brokers and gray-market distributors to hit a ramp date, you walk straight into the channel where counterfeits concentrate. ERAI's 2025 report logged 748 suspect-part submissions, with active (in-production) components at ~36% of reports — and the finding that matters most for AI builders is that ~24% of suspect parts passed electrical test and would have evaded detection if electrical testing were the only screen. A counterfeit that boots is the dangerous kind.

Hardware implants are deliberate malicious additions — an extra component, a modified board, a substituted firmware chip — inserted to create a covert channel or a kill switch. Implants are the nation-state threat, and they are the reason dock-trust fails against the top attacker tier: an implant added at a board house or a transshipment point is invisible to functional test and to every downstream software defense. Tampering-in-transit is the same idea applied to the logistics leg specifically — a shipment opened, modified, and re-sealed between the integrator's dock and yours, exploiting the long, multi-party, low-observability freight path that AI hardware now traverses across borders. The defense against the second and third is the same instrument: tamper-evidence and an unbroken chain of custody, so that any opening of the package between two trusted endpoints leaves a signature you detect on receipt.

The three supply-chain threats and the controls that actually address them

Threat	Primary entry point	What detects it	What does NOT detect it	Residual risk if ignored
Counterfeit component	Open-market / broker procurement under allocation pressure	Authorized-distributor sourcing; incoming inspection (visual + X-ray + decap); device provenance	Electrical test alone (~24% pass it); a PDF certificate of conformance	Field failures, fleet reliability collapse, latent backdoor in re-marked silicon
Hardware implant	Board house, integrator, transshipment point (nation-state)	HBOM vs as-built comparison; X-ray/CT inspection; hardware root-of-trust measurement	Functional test; perimeter and network controls (implant sits beneath them)	Persistent covert channel or kill switch under every downstream defense
Tampering-in-transit	The multi-party cross-border freight leg	Tamper-evident seals/packaging; sealed chain-of-custody; GPS/temperature telemetry	Visual damage check; trusting the carrier's manifest	Re-sealed shipment with a swapped firmware chip or added component
Malicious / outdated firmware	Pre-delivery flashing; un-audited supplier firmware	OCP S.A.F.E. audit; Reference Integrity Manifest + attestation; secure/measured boot	Disk wipe, OS reinstall, antivirus (firmware survives all of them)	Implant that persists across every re-image and ownership change

Controls are not interchangeable. A counterfeit-detection program does not detect an implant; tamper-evidence does not detect a firmware swap performed before sealing. Map control to threat, not threat to convenience.
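The mapping in the table can be checked mechanically. The sketch below is a minimal illustration, assuming a hypothetical set of control labels distilled from the "What detects it" column; the labels and the function name are illustrative and do not come from any standard or tool.

```python
# Illustrative labels distilled from the table above. They are assumptions,
# not identifiers from any standard or product.
DETECTS = {
    "counterfeit": {"authorized_sourcing", "incoming_inspection", "device_provenance"},
    "implant": {"hbom_comparison", "xray_ct_inspection", "root_of_trust_measurement"},
    "transit_tampering": {"tamper_evident_seals", "chain_of_custody", "transit_telemetry"},
    "malicious_firmware": {"safe_audit", "rim_attestation", "measured_boot"},
}


def uncovered_threats(deployed_controls):
    """Return the threats for which none of the deployed controls is a detector."""
    deployed = set(deployed_controls)
    return sorted(
        threat for threat, controls in DETECTS.items() if not controls & deployed
    )


if __name__ == "__main__":
    # A program with inspection and seals still has two blind spots.
    print(uncovered_threats({"incoming_inspection", "tamper_evident_seals"}))
```

Controls that appear in no row, such as electrical test alone, cover nothing in this model, which is the point the table makes.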

Supplier vetting and country-of-origin risk

Vetting is the cheapest high-leverage control in this chapter, because it shrinks the attack surface before any tamper-evidence or provenance tooling has to work. The discipline is NIST SP 800-161 cyber supply-chain risk management (C-SCRM): tier your suppliers by criticality, require security attestations proportional to tier, and — critically — extend the requirements to sub-tier suppliers, because the counterfeit or the implant rarely enters at your direct vendor; it enters two or three tiers down, at the component maker, the board house, or the gray-market broker your integrator quietly used to hit allocation. A vetting program that stops at tier-1 misses exactly where the threat enters.
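One way to see why a tier-1-only program falls short is to walk the supplier graph and list every reachable supplier that has not been vetted. This is a minimal sketch, assuming a hypothetical adjacency-list representation of who supplies whom; the supplier names and the function are illustrative only.

```python
from collections import deque


def unvetted_suppliers(supply_graph, vetted, root):
    """Breadth-first walk from root; return {supplier: tier} for each reachable
    supplier that is not in the vetted set. Tier 1 is a direct supplier."""
    findings = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        name, tier = queue.popleft()
        for sub in supply_graph.get(name, []):
            if sub in seen:
                continue
            seen.add(sub)
            if sub not in vetted:
                findings[sub] = tier + 1
            queue.append((sub, tier + 1))
    return findings


if __name__ == "__main__":
    # Hypothetical chain: the integrator quietly used a broker to hit allocation.
    graph = {
        "operator": ["integrator"],
        "integrator": ["board_house", "broker"],
        "board_house": ["component_maker"],
    }
    print(unvetted_suppliers(graph, vetted={"integrator", "board_house"}, root="operator"))
```

In this example the direct vendor is vetted, and the gaps sit at tiers 2 and 3, which is where the paragraph above says the threat usually enters.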

Country-of-origin risk is the axis that turned from a compliance footnote into a board-level constraint over 2024–2026. Two forces collide. First, concentration: leading-edge logic is effectively single-sourced at TSMC, HBM comes from three suppliers (two of them Korean), and CoWoS packaging runs on a handful of lines, so a geopolitical shock to one node is a fleet-wide supply shock, and the pressure to buy from less-trusted channels to compensate is exactly the pressure that lets counterfeits in. Second, export controls: US restrictions gate where the highest-end accelerators can legally sit, which both fragments the legitimate channel and inflates a smuggling and gray market for controlled GPUs, a market that is, by construction, a counterfeit-and-tamper vector with no provenance guarantees whatsoever. The decision this forces: do you accept gray-market or unauthorized-channel parts to make a ramp date? The honest answer for any facility holding frontier weights is no, and the cost of that answer is a slower ramp, which is precisely the kind of trade the strategist must name explicitly rather than discover after a counterfeit bricks a tray. → procurement framing in Chapter 1.6; the depreciation/refresh side in Chapter 14.9.

The allocation-pressure trap

The strongest practical predictor of a counterfeit entering an AI build is schedule pressure under allocation scarcity. When HBM is sold out, CoWoS is the gate, and the GPUs you need are months out through authorized channels, the procurement organization is under enormous pressure to source the gap from brokers and the gray market, the exact channel where ERAI's data shows counterfeits concentrate. The reliability cost compounds the security cost: a fleet seeded with re-marked or recycled silicon does not fail cleanly at install; it degrades goodput for the life of the cluster, dragging down the effective training time the whole economic model depends on. The discipline is to make 'authorized-channel-only' a non-negotiable for anything that holds or trains weights, and to absorb the schedule hit as a known cost, rather than let an unsupervised buyer trade it away one PO at a time. → reliability/goodput economics in Chapter 11.1.

Tamper-evident logistics and chain of custody

Between the integrator's verified dock and yours lies the least-observable leg of the entire lifecycle, and it is the leg an implant or tampering attack most cheaply exploits. The defense is to make the freight path evidentiary: tamper-evident seals and packaging that cannot be opened and re-closed without leaving a detectable signature; a documented, signed chain of custody that names every party who touched the shipment; and, for the highest-value lots, in-transit telemetry — GPS, shock, and temperature loggers that detect an unscheduled stop or an opening. The receiving process then becomes a verification step, not an unboxing: seal intact, custody log complete, telemetry clean — or the lot is quarantined and re-verified, not racked.
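The receiving gate described above can be sketched as a three-condition check. The example below is a minimal illustration: it stands in for a signed custody log with an unsigned SHA-256 hash chain, which is an assumption made for brevity. A real log would carry each party's signature and anchor the head hash somewhere the carrier cannot rewrite. All record fields and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # starting value for the chain (an assumption of this sketch)


def entry_hash(prev_hash, record):
    """Hash a custody record together with the hash of the previous entry."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def build_log(records):
    """Chain the hand-off records in order, as each custodian would append them."""
    log, prev = [], GENESIS
    for record in records:
        prev = entry_hash(prev, record)
        log.append({"record": record, "hash": prev})
    return log


def custody_intact(log):
    """Recompute the chain; any edited, dropped, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry_hash(prev, entry["record"]) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


def receiving_decision(seal_intact, log, telemetry_clean):
    """Rack only if seal, custody log, and telemetry all check out."""
    if seal_intact and log and custody_intact(log) and telemetry_clean:
        return "rack"
    return "quarantine"


if __name__ == "__main__":
    log = build_log([
        {"party": "integrator", "action": "sealed"},
        {"party": "forwarder", "action": "received"},
    ])
    print(receiving_decision(True, log, True))
```

The design choice worth noting is the default: anything short of a full pass returns "quarantine", so a missing log is treated the same as a broken one.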

The decision here is which shipments get the full treatment, because tamper-evident logistics with custody attestation and telemetry is not free and does not scale to every cable reel. The rational policy tiers it by what the shipment can compromise: full chain-of-custody and tamper-evidence for accelerators, server boards, BMCs, and anything firmware-bearing; lighter handling for commodity passive and mechanical components. Skip the tiering and you either overspend on tamper-evidence for power cabling or — far worse — under-protect the one shipment whose compromise matters: the trays that hold the silicon that holds the weights.
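The tiering rule can be written down as a small policy function. This is a sketch under stated assumptions: the component-class names and the two tier labels are hypothetical, chosen to mirror the examples in the paragraph above.

```python
# Classes the paragraph above names explicitly for full treatment (labels are illustrative).
FULL_CUSTODY_CLASSES = {"accelerator", "server_board", "bmc"}


def handling_tier(component_class, firmware_bearing):
    """Full chain-of-custody for named high-value classes and anything that
    carries firmware; lighter handling for passive and mechanical parts."""
    if firmware_bearing or component_class in FULL_CUSTODY_CLASSES:
        return "full"
    return "light"


if __name__ == "__main__":
    print(handling_tier("accelerator", True))
    print(handling_tier("power_cable", False))
```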

Device provenance: from PDF certificate to cryptographic proof

Provenance is, in NIST's framing, the comprehensive history of a device across its entire lifecycle — creation, ownership, and every authorized change — and a delivered device has integrity only if it is genuine and all changes to it were expected. The whole discipline turns on one distinction: provenance as a document versus provenance as evidence. A certificate of conformance is a document; it asserts genuineness but cannot be checked against the device in front of you. Cryptographic provenance is evidence: the NIST SP 1800-34 model has the vendor store signed information inside each device at manufacture — a platform certificate enumerating the expected components — so that at acceptance testing your provisioner scans the machine, builds a hardware manifest of what is actually present, and compares it against the signed certificate. A mismatch is a swapped or added component, detected at receiving rather than after a breach.
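The acceptance-test comparison reduces to a set difference between what the certificate says should be present and what the scan finds. The sketch below assumes the expected component list has already been extracted from a platform certificate whose signature was verified elsewhere; it does not parse or verify certificates, and the tuple layout (class, manufacturer, model, serial) is a hypothetical simplification.

```python
def compare_manifest(expected_components, as_built_components):
    """Compare the certificate's component list with the scanned hardware.

    Each component is a (class, manufacturer, model, serial) tuple.
    """
    expected = set(expected_components)
    actual = set(as_built_components)
    return {
        "missing": sorted(expected - actual),      # listed in the certificate, not found
        "unexpected": sorted(actual - expected),   # found on the machine, not listed
    }


def lot_accepted(diff):
    """Accept only when the as-built scan matches the certificate exactly."""
    return not diff["missing"] and not diff["unexpected"]


if __name__ == "__main__":
    expected = [("gpu", "VendorA", "X1", "SN001"), ("nic", "VendorB", "N2", "SN100")]
    scanned = [("gpu", "VendorA", "X1", "SN001"), ("nic", "VendorB", "N2", "SN999")]
    diff = compare_manifest(expected, scanned)
    print(diff, lot_accepted(diff))
```

A swapped part shows up as one "missing" entry and one "unexpected" entry; an added component shows up as "unexpected" alone.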

The building blocks of an evidentiary supply chain are three layered manifests, each answering a different question:

HBOM (Hardware Bill of Materials) — what physical components is this device supposed to contain? The reference an as-built scan is compared against to detect substitution or an added implant.
SBOM (Software Bill of Materials) — what software/firmware components and versions are present, so a known-vulnerable or unexpected element is visible rather than buried.
RIM (Reference Integrity Manifest) — the signed, expected measurements of firmware/boot components, so a hardware root of trust can measure what actually ran at boot and attest it against the expected values. The RIM is what converts a one-time receiving check into a property you can re-verify for the asset's whole life.

The fork: an HBOM/SBOM-only posture gives you a strong receiving gate but goes quiet after the machine is racked; an HBOM/SBOM + RIM-and-attestation posture stays live, catching a firmware swap that happens in year two. The latter is the bridge into the hardware-root-of-trust and attestation machinery of Chapter 11.4, and it is what a high Weights Security Level (Chapter 11.1) effectively mandates.
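What separates the RIM posture from a one-time check is that the comparison can be repeated at every boot. The sketch below is a simplified illustration: it treats a RIM as a plain mapping from component name to expected SHA-256 digest and skips both RIM signature verification and the hardware root of trust that would produce the measurements in practice. Component names and image bytes are made up.

```python
import hashlib


def measure(image_bytes):
    """Digest of a firmware image, standing in for a root-of-trust measurement."""
    return hashlib.sha256(image_bytes).hexdigest()


def attest(rim, measured):
    """Return the components whose measurement does not match the RIM,
    plus any measured component the RIM does not list."""
    failures = [c for c, expected in rim.items() if measured.get(c) != expected]
    failures.extend(c for c in measured if c not in rim)
    return sorted(failures)


if __name__ == "__main__":
    good = {"bmc": b"bmc-fw-1.2", "nic": b"nic-fw-7.0"}
    rim = {name: measure(image) for name, image in good.items()}

    # Year two: the BMC image has been replaced.
    running = {"bmc": measure(b"bmc-fw-1.2-modified"), "nic": measure(b"nic-fw-7.0")}
    print(attest(rim, running))
```

Running the same check at receiving and at every later boot is what keeps the provenance claim live after the machine is racked.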

Provenance posture ladder — what each level detects

Posture	Evidence	Detects substitution at receiving	Detects post-rack firmware swap	What it costs
Certificate of conformance	PDF assertion of genuineness	No (paper only)	No	Near-zero; near-zero assurance
Incoming inspection	Visual + X-ray/CT + sample decap	Partially (physical anomalies)	No	Lab/tooling + per-lot labor
NIST 1800-34 platform certificate	Signed HBOM verified at acceptance	Yes (component mismatch)	No (one-time check)	Vendor term + receiving tooling
RIM + hardware root of trust	Signed measurements, attested at boot and continuously	Yes	Yes (re-verifiable for asset life)	Silicon RoT + attestation infra (→ 11.4)

Levels are cumulative. The honest question is not 'do we have provenance' but 'which of these can we re-verify on the floor today.'

Firmware assurance: OCP S.A.F.E. and the audit you can inherit

Firmware is the implant surface that survives everything else. A malicious or vulnerable image in a BMC, NIC, SSD controller, or power component persists across every disk wipe, OS reinstall, and ownership transfer, because none of those touch it — which is exactly why it is the favored persistence mechanism for a sophisticated adversary and the reason firmware assurance belongs in the supply-chain chapter, not just the operations one. The problem at fleet scale is that you cannot audit every supplier's firmware yourself, and asking each operator to do so independently is enormous duplicated effort for the same images.

The OCP S.A.F.E. (Security Appraisal Framework and Enablement) program solves this by centralizing the audit: a device or firmware vendor engages an approved, independent Security Review Provider (SRP) to perform a standardized security review against a common checklist, and the resulting endorsement is something you can inherit instead of re-running — with a gap analysis required for each new firmware release so the assurance does not silently expire. The decision this hands the operator is clean: require an OCP S.A.F.E. (or equivalent independently-audited) attestation as a procurement term for firmware-bearing devices, and require a RIM you can attest against — or accept the vendor's unverified word and own the residual. For a facility at a high Weights Security Level, the former is effectively mandatory; the audited-firmware requirement is the supply-chain half of the firmware-integrity story that Chapter 11.4 completes on the platform side.
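Written as an acceptance criterion, the procurement term might look like the check below. This is a sketch only: the record fields are hypothetical and do not reflect any OCP S.A.F.E. data format; they simply encode the three conditions the paragraph names (an independent review exists, it covers the shipping firmware either directly or through a gap analysis, and a RIM is supplied).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FirmwareRecord:
    device: str
    shipping_version: str
    audited_version: Optional[str]        # version covered by the independent review
    gap_analysis_version: Optional[str]   # newest version covered by a gap analysis
    rim_available: bool


def rejection_reasons(rec: FirmwareRecord) -> List[str]:
    """Return the reasons a firmware-bearing device fails the procurement bar.
    An empty list means it passes."""
    reasons = []
    if rec.audited_version is None:
        reasons.append("no independent audit")
    elif rec.shipping_version not in (rec.audited_version, rec.gap_analysis_version):
        reasons.append("audit does not cover shipping version")
    if not rec.rim_available:
        reasons.append("no RIM to attest against")
    return reasons


if __name__ == "__main__":
    stale = FirmwareRecord("bmc", "2.1", audited_version="2.0",
                           gap_analysis_version=None, rim_available=True)
    print(rejection_reasons(stale))
```

The stale-endorsement case is the one the gap-analysis requirement exists to catch: an audit of version 2.0 says nothing about the 2.1 image that actually shipped.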

748: suspect counterfeit-part submissions logged in 2025 (down from 1,055 in 2024, partly a one-off batch); active components ~36% of reports. Source: ERAI 2025 Annual Counterfeit Report (2025).

~24%: share of suspect counterfeit parts that passed electrical test and would evade detection if electrical test were the only screen. Source: ERAI 2025 report (2025).

Dec 2022: NIST SP 1800-34 'Validating the Integrity of Computing Devices' finalized; the platform-certificate / provenance reference architecture. Source: NIST / NCCoE SP 1800-34 (2022).

Sept 2025: NIST SP 800-88 Rev 2 released; media sanitization modernized for encrypted/virtual/cloud media (Clear / Purge / Destroy). Source: NIST SP 800-88 Rev 2 (2025).

Purge on SSD = verified erase or destroy: per IEEE 2883-2022, no overwrite-based method meets the Purge threshold for SSD/NVMe; only a verified drive-native erase (cryptographic erase, or block erase where the firmware implements it correctly) or physical destruction qualifies. Source: IEEE 2883-2022 / NIST 800-88 r2 (2025).

42%: share of used drives resold on the secondary market found to contain residual recoverable data (PII, financial, IP); the data-remanence base rate. Source: Blancco Technology Group study (2019).

$3–4M: approximate silicon value concentrated in a single GB200 NVL72 rack (1.36 t); the asset-value density driving target priority. Source: NVIDIA / SemiAnalysis, derived (2025).

1st Thu/mo: OCP S.A.F.E. project cadence; AMI the first approved independent firmware vendor SRP; the centralized, inheritable firmware-audit framework. Source: Open Compute Project S.A.F.E. (2025).

Secure decommissioning, media sanitization, and the remanence trap

The supply chain has a tail, and it is where the most catastrophic, most avoidable breach lives: the moment a drive that held weights, checkpoints, or customer data leaves your control. Decommissioning is too often treated as a cost-recovery exercise (pull the racks, recover residual value, recycle the rest), with security bolted on as an afterthought. That inversion is how a frontier-model weights shard ends up forensically recoverable on a drive sold into the secondary market. The remanence base rate is not hypothetical: a widely cited 2019 Blancco Technology Group study found that 42% of used drives resold on the secondary market still contained recoverable data, including PII, financial records, and IP. On an AI fleet, the analogous payload is the crown jewels of Chapter 11.8.

The standard is NIST SP 800-88, and its September 2025 Revision 2 is the current edition as of 2026. It matters because it modernizes the guidance for the encrypted, virtualized, cloud-backed media an AI fleet actually runs. It defines three sanitization levels: Clear (logical overwrite of user-addressable storage, which laboratory techniques may still defeat), Purge (renders recovery infeasible even with laboratory techniques), and Destroy (renders the media physically incapable of storing data). The load-bearing engineering fact for AI builders is the SSD trap: modern solid-state media uses wear-leveling and over-provisioning, meaning spare cells that never appear in the user-addressable space and that an overwrite routine cannot reach. As a result, per IEEE 2883-2022 and NIST 800-88 r2, no overwrite-based method satisfies the Purge threshold for SSD/NVMe. Purge requires a verified drive-native erase: cryptographic erase (destroying the encryption key so the ciphertext is unrecoverable) or, where the firmware implements it correctly, block erase. Where neither can be verified, the remaining option is physical destruction. An operator who 'wipes' NVMe drives with a multi-pass overwrite and resells them has met no recognized Purge standard and is shipping recoverable data to strangers.

Sanitization decision — modality vs assurance vs residual value

Method	NIST 800-88 level	Valid for modern SSD/NVMe?	Preserves resale value?	When to use
Single/multi-pass overwrite	Clear	No (cannot reach over-provisioned cells)	Yes	HDDs / low-sensitivity media only
Cryptographic erase (key destruction)	Purge	Yes (if drive was encrypted from deployment)	Yes	Default for encrypted SSDs at scale
Verified block-erase (drive-native)	Purge	Yes (if firmware implements it correctly)	Yes	Where crypto-erase unavailable; verify it worked
Degauss	Purge	No (ineffective on flash; HDD/tape only)	No (destroys drive)	Magnetic media only
Physical destruction (shred/disintegrate)	Destroy	Yes (definitive)	No	Weights-bearing media; classified; unverifiable drives

Crypto-erase is the most dependable method that both preserves resale value and meets Purge for SSDs (verified block-erase also qualifies, but only as far as the drive firmware can be trusted and checked). It is contingent on the drive having been encrypted with a properly managed key from day one, and that contingency is set at deployment, not at decommission.

The decision tree at end-of-life is therefore a two-axis trade between assurance and residual value, gated by a choice you made years earlier. If every drive in the fleet was encrypted at deployment with a key in a managed KMS/HSM, then crypto-erase gives you a Purge-level sanitization that preserves resale value, the rare case where the secure option is also the economical one. If drives were not encrypted from day one and no verifiable drive-native block erase is available, you are forced to choose between an overwrite that does not actually meet Purge for SSDs (insecure) and physical destruction that meets Destroy but forfeits residual value (expensive). This is why fleet-wide encryption-at-rest is a decommissioning decision disguised as a deployment decision, and why anything that held weights should default to Destroy regardless, accepting the lost residual as cheap insurance against the one breach that ends a company. The whole flow must be wrapped in a chain-of-custody and a certificate of destruction per asset, executed by a vetted ITAD provider (for example, R2v3-certified), because the decommissioning vendor is itself a supply-chain link with its own insider and substitution risk. → key management and weight-protection in Chapter 11.8; the financial/ITAD mechanics in Chapter 14.9.
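The end-of-life decision tree can be sketched as a single function that follows the sanitization table above. It is an illustration under stated assumptions, not a compliance tool: the parameter names are hypothetical, and the branch ordering (weights first, then media type, then encryption history) is one reasonable reading of the text.

```python
def sanitization_method(media, encrypted_from_deployment, held_weights,
                        verified_block_erase=False, low_sensitivity=False):
    """Return (method, NIST 800-88 level) for a drive at end-of-life.

    media: "ssd" or "hdd". The other arguments are booleans describing the
    drive's history and what can be verified on it.
    """
    # Anything that held weights defaults to Destroy, regardless of resale value.
    if held_weights:
        return ("physical_destruction", "Destroy")

    # Crypto-erase meets Purge and keeps resale value, but only if the drive
    # was encrypted with a managed key from deployment.
    if encrypted_from_deployment:
        return ("cryptographic_erase", "Purge")

    if media == "ssd":
        # Overwrite cannot reach over-provisioned cells, so it is never offered here.
        if verified_block_erase:
            return ("verified_block_erase", "Purge")
        return ("physical_destruction", "Destroy")

    if media == "hdd":
        if low_sensitivity:
            return ("overwrite", "Clear")
        return ("degauss", "Purge")  # magnetic media only; destroys the drive

    raise ValueError(f"unknown media type: {media!r}")


if __name__ == "__main__":
    print(sanitization_method("ssd", encrypted_from_deployment=True, held_weights=False))
    print(sanitization_method("ssd", encrypted_from_deployment=False, held_weights=False))
```

The only branches that preserve resale value for an SSD are the ones gated by `encrypted_from_deployment` or `verified_block_erase`, both of which depend on facts established long before decommissioning.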

Deep dive: why firmware is the implant that outlives the wipe

The instinct is to treat 'sanitization' as a storage problem — wipe the data drives, recover the value, done. That instinct misses the persistence mechanism a sophisticated adversary actually uses. Every server is a federation of small computers running their own firmware: the BMC (a full Linux SoC with out-of-band access to everything), the NIC, the SSD controller, the GPU's own management microcontrollers, the power-conversion devices. Each holds firmware in non-volatile storage that no disk wipe, OS reinstall, or antivirus ever touches. A malicious image planted there — at an un-audited supplier, or by someone with brief physical access in a transshipment point — survives every conventional sanitization and every re-imaging, and re-establishes itself on the next boot.

This is why firmware assurance is a supply-chain control and not merely an operations one, and why it spans both ends of the lifecycle. On the way in, you want an OCP S.A.F.E. audit and a RIM you can attest against, so an unexpected firmware image is visible at acceptance. Across life, you want continuous attestation against that RIM, so a firmware swap in year two is caught (the machinery of Chapter 11.4). On the way out, true sanitization of a weights-bearing host means addressing firmware-resident state too — which in practice means physical destruction of the firmware-bearing components for the highest-sensitivity assets, because re-flashing to a known-good image still trusts the very update path an implant may have subverted. The clean mental model: data sanitization protects the storage; firmware assurance protects the computer that the storage plugs into — and the second is the one an adversary actually hides in.

Deep dive: the inheritable-audit economics of OCP S.A.F.E.

Firmware auditing has a punishing cost structure if every operator does it alone. A serious firmware security review — reverse-engineering a BMC image, auditing the secure-boot chain, probing the update mechanism — is weeks of specialist time per device family per release. Multiply by every NIC, SSD, BMC, and power-controller image in a heterogeneous fleet, then by every firmware revision, then by every operator who buys the same hardware, and the industry is paying for the same audit thousands of times over while most operators, lacking the specialists, simply skip it.

OCP S.A.F.E. restructures that into a once-audited, many-times-inherited model: the vendor pays an approved independent Security Review Provider to review against a common framework, the endorsement is published, and every downstream operator inherits the assurance instead of re-deriving it, with a required gap analysis on each new firmware release so the endorsement tracks the shipping image rather than a stale one. The strategic consequence for a buyer is that firmware assurance becomes a specification rather than a project: you write 'OCP S.A.F.E.-endorsed firmware with current gap analysis' into the procurement contract and shift the audit burden onto the supplier who is best-placed to bear it. The residual you still own is the trust in the SRP and the framework itself, which is why the program's independence and the publication of its findings (CVSS-scored, like any vulnerability) determine whether the inherited assurance is worth anything.
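The cost argument is simple multiplication, and a small worked example makes the scale visible. Every number below is hypothetical, chosen only to show the shape of the comparison; the sketch also assumes a gap analysis costs as much as a full review, which likely overstates the inherited-model cost.

```python
def audit_weeks(weeks_per_review, device_families, releases, operators, inherited):
    """Total specialist-weeks the industry spends on firmware review.

    inherited=False: every operator audits every image independently.
    inherited=True: each image is reviewed once and the result is shared.
    """
    per_image_set = weeks_per_review * device_families * releases
    return per_image_set if inherited else per_image_set * operators


if __name__ == "__main__":
    # Hypothetical inputs: 4 weeks per review, 10 device families,
    # 3 releases each, 1,000 operators buying the same hardware.
    alone = audit_weeks(4, 10, 3, 1000, inherited=False)
    shared = audit_weeks(4, 10, 3, 1000, inherited=True)
    print(alone, shared, alone // shared)
```

Under these assumptions the saving factor equals the number of operators, which is why the model matters more as the same hardware spreads across more buyers.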

This chapter sets the trust boundary; the rest of Part 11 builds on it. The asset taxonomy, attacker tiers, and the Weights Security Levels that dictate how high your provenance bar must go are in Chapter 11.1; the physical-security and transshipment threat model that tamper-evident logistics defends against is in Chapter 11.2. The hardware-root-of-trust, RIM attestation, and firmware-integrity machinery this chapter points to is engineered in Chapter 11.4; the multi-tenant isolation that assumes a trustworthy substrate is in Chapter 11.6; the weights-as-crown-jewels and key-management discipline that decommissioning protects is in Chapter 11.8; and the insider risk that runs through every supply-chain and ITAD link is in Chapter 11.9. The procurement fork that creates allocation-pressure counterfeiting risk is in Chapter 1.6; the depreciation, refresh, and ITAD economics of decommissioning live in Chapter 14.9; and where supply-chain controls become a certification requirement is in Chapter 11.11.