Chapter 11.3
Supply-Chain Security & Hardware Provenance
A GPU is the most valuable, most counterfeited, most tamperable industrial component on the planet right now — so the security boundary of an AI data center does not begin at the cage door, it begins at the fab and ends at the shredder, and every link you cannot prove is a link an adversary can substitute.
What you'll decide here
- Where you draw the trust boundary in your supply chain — at the loading dock (cheap, blind to in-transit and upstream tampering) or at the silicon (provenance-verified from fab to rack, the only posture that survives a nation-state adversary) — because that choice sets every control below it.
- Whether you treat provenance as paperwork or as cryptographic evidence — a signed platform certificate and a measured manifest you can re-verify on the floor, versus a PDF certificate of conformance you cannot — because only the former detects a swapped component after delivery.
- Which firmware-assurance bar you require of suppliers — an OCP S.A.F.E. audit and a Reference Integrity Manifest you can attest against, or a vendor's word — because firmware is the implant surface that survives every disk wipe and OS reinstall.
- How you sanitize media at end-of-life — the IEEE-2883 / NIST 800-88 Rev 2 Purge or Destroy that survives the wear-leveling and over-provisioning of modern SSDs, versus an overwrite that does not — because a single recoverable weights drive in the resale stream undoes every upstream control.
- Which links you instrument with tamper-evidence and chain-of-custody now (irreversible to retrofit after a breach) versus which you can defer, and what your country-of-origin and export posture forces on both.
Every other chapter in Part 11 defends the data center as it runs. This one defends the data center before it runs — and after it stops. The threat model here is not the attacker who breaches a live system but the one who arrives inside the hardware you bought, having compromised a component, a firmware image, or a shipment somewhere along a supply chain that now spans a fab in Taiwan, HBM stacks from Korea, substrate and CoWoS packaging, a board house, an integrator, two freight forwarders, and a customs broker — any one of which is a place to substitute a counterfeit, solder in an implant, or flash a malicious firmware that survives every wipe you will ever perform.
The reason this matters more for AI infrastructure than for any prior generation of IT is asset-value density. A single GB200 NVL72 rack is roughly $3–4M of silicon in 1.36 tonnes; a frontier training cluster concentrates billions of dollars of the most allocation-constrained, most export-controlled, most counterfeited components in the world into a few halls. That makes the supply chain both the highest-value target and the longest, least-observable attack surface in the building. This chapter covers that surface: supplier vetting and country-of-origin risk; tamper-evident logistics and chain-of-custody; cryptographic device provenance (NIST SP 1800-34), HBOM/SBOM and the Reference Integrity Manifest, and OCP S.A.F.E. firmware assurance; and secure decommissioning, media sanitization, and the data-remanence trap on the way out. The through-line: a provenance claim you cannot independently re-verify is not provenance; it is hope with a logo on it.
The master fork: where does your trust boundary start?
Every supply-chain control you will or won't buy is downstream of a single question: where do you begin to trust the hardware? There are three defensible answers, and they are not equally expensive or equally strong.
- Trust at the dock is the legacy default: you accept the integrator's certificate of conformance, inspect the pallet for obvious damage, and rack it. It is cheap and fast, and it is blind to everything that happened upstream: a counterfeit memory module that passes power-on, an implant added at a transshipment point, a firmware image flashed before the truck arrived.
- Trust at acceptance test adds a provenance gate at receiving: you scan each platform, build a hardware manifest, and compare it cryptographically against a signed platform certificate the vendor stored in the device (the NIST SP 1800-34 model). This catches component substitution and in-transit tampering, but only if the vendor populated provenance at manufacture, which is a procurement term you must win at the contract, not the dock.
- Trust at the silicon is the strongest and the most demanding: a hardware root of trust on each device measures its own firmware at boot and attests it against a Reference Integrity Manifest, so provenance is not a one-time receiving check but a property you can re-verify continuously over the asset's life (the bridge into Chapter 11.4).
The consequence of choosing wrong is asymmetric. Pick dock-trust to save money and a nation-state-tier adversary — RAND's highest attacker tier, the relevant one for frontier weights — walks an implant past you that no perimeter, no network segmentation, and no confidential-computing TEE downstream can detect, because the compromise is beneath all of them. Pick silicon-trust and you pay an upfront premium in procurement terms, receiving labor, and tooling, but you have moved the security boundary to the only place that survives a sophisticated supply-chain attack. The fork is irreversible in practice: provenance you did not capture at manufacture cannot be reconstructed after delivery. → asset taxonomy and attacker tiers in Chapter 11.1.
The threat surface: counterfeits, implants, and tampering-in-transit
Three distinct threats live on the supply chain, and conflating them produces the wrong controls. Counterfeiting is the substitution of a fake or repurposed part for a genuine one — re-marked older silicon, recycled e-waste components, cloned modules, factory rejects sold as good. It is overwhelmingly an open-market-procurement risk: when allocation is tight and you buy GPUs, HBM, or power components from brokers and gray-market distributors to hit a ramp date, you walk straight into the channel where counterfeits concentrate. ERAI's 2025 report logged 748 suspect-part submissions, with active (in-production) components at ~36% of reports — and the finding that matters most for AI builders is that ~24% of suspect parts passed electrical test and would have evaded detection if electrical testing were the only screen. A counterfeit that boots is the dangerous kind.
Hardware implants are deliberate malicious additions — an extra component, a modified board, a substituted firmware chip — inserted to create a covert channel or a kill switch. Implants are the nation-state threat, and they are the reason dock-trust fails against the top attacker tier: an implant added at a board house or a transshipment point is invisible to functional test and to every downstream software defense. Tampering-in-transit is the same idea applied to the logistics leg specifically — a shipment opened, modified, and re-sealed between the integrator's dock and yours, exploiting the long, multi-party, low-observability freight path that AI hardware now traverses across borders. The defense against the second and third is the same instrument: tamper-evidence and an unbroken chain of custody, so that any opening of the package between two trusted endpoints leaves a signature you detect on receipt.
| Threat | Primary entry point | What detects it | What does NOT detect it | Residual risk if ignored |
|---|---|---|---|---|
| Counterfeit component | Open-market / broker procurement under allocation pressure | Authorized-distributor sourcing; incoming inspection (visual + X-ray + decap); device provenance | Electrical test alone (~24% pass it); a PDF certificate of conformance | Field failures, fleet reliability collapse, latent backdoor in re-marked silicon |
| Hardware implant | Board house, integrator, transshipment point (nation-state) | HBOM vs as-built comparison; X-ray/CT inspection; hardware root-of-trust measurement | Functional test; perimeter and network controls (implant sits beneath them) | Persistent covert channel or kill switch under every downstream defense |
| Tampering-in-transit | The multi-party cross-border freight leg | Tamper-evident seals/packaging; sealed chain-of-custody; GPS/temperature telemetry | Visual damage check; trusting the carrier's manifest | Re-sealed shipment with a swapped firmware chip or added component |
| Malicious / outdated firmware | Pre-delivery flashing; un-audited supplier firmware | OCP S.A.F.E. audit; Reference Integrity Manifest + attestation; secure/measured boot | Disk wipe, OS reinstall, antivirus (firmware survives all of them) | Implant that persists across every re-image and ownership change |
Supplier vetting and country-of-origin risk
Vetting is the cheapest high-leverage control in this chapter, because it shrinks the attack surface before any tamper-evidence or provenance tooling has to work. The discipline is NIST SP 800-161 cyber supply-chain risk management (C-SCRM): tier your suppliers by criticality, require security attestations proportional to tier, and — critically — extend the requirements to sub-tier suppliers, because the counterfeit or the implant rarely enters at your direct vendor; it enters two or three tiers down, at the component maker, the board house, or the gray-market broker your integrator quietly used to hit allocation. A vetting program that stops at tier-1 misses exactly where the threat enters.
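The sub-tier point can be made concrete with a small traversal. The sketch below is illustrative only: the `Supplier` record, its `attested` flag, and the example supplier names are hypothetical and are not taken from NIST SP 800-161. It walks the full supplier tree and reports every supplier without an attestation, then shows what a tier-1-only review would have seen.

```python
from dataclasses import dataclass, field


@dataclass
class Supplier:
    name: str
    attested: bool  # hypothetical flag: a security attestation is on file
    sub_suppliers: list["Supplier"] = field(default_factory=list)


def attestation_gaps(supplier, tier=1, path=()):
    """Yield (tier, path) for every supplier, at any depth, with no attestation."""
    here = path + (supplier.name,)
    if not supplier.attested:
        yield tier, " > ".join(here)
    for sub in supplier.sub_suppliers:
        yield from attestation_gaps(sub, tier + 1, here)


# Hypothetical chain: the direct vendor is clean, the gap sits two tiers down.
broker = Supplier("gray-market broker", attested=False)
board_house = Supplier("board house", attested=True, sub_suppliers=[broker])
integrator = Supplier("integrator", attested=True, sub_suppliers=[board_house])

gaps = list(attestation_gaps(integrator))
tier1_only = [gap for gap in gaps if gap[0] == 1]

print(gaps)        # the full-tree view finds the tier-3 broker
print(tier1_only)  # a tier-1-only review finds nothing
```

In this toy chain the only gap is at tier 3, which is exactly the case a tier-1-only program cannot see.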
Country-of-origin risk is the axis that turned from a compliance footnote into a board-level constraint over 2024–2026. Two forces collide. First, concentration: leading-edge logic is effectively single-sourced at TSMC, HBM comes from three suppliers (two of them Korean), and CoWoS packaging runs on a handful of lines, so a geopolitical shock to one node is a fleet-wide supply shock, and the pressure to buy from less-trusted channels to compensate is exactly the pressure that lets counterfeits in. Second, export controls: US restrictions gate where the highest-end accelerators can legally sit, which both fragments the legitimate channel and inflates a smuggling and gray market for controlled GPUs, a market that is, by construction, a counterfeit-and-tamper vector with no provenance guarantees whatsoever. The decision this forces: do you accept gray-market or unauthorized-channel parts to make a ramp date? The honest answer for any facility holding frontier weights is no, and the cost of that answer is a slower ramp, which is precisely the kind of trade the strategist must name explicitly rather than discover after a counterfeit bricks a tray. → procurement framing in Chapter 1.6; the depreciation/refresh side in Chapter 14.9.
Tamper-evident logistics and chain of custody
Between the integrator's verified dock and yours lies the least-observable leg of the entire lifecycle, and it is the leg an implant or tampering attack most cheaply exploits. The defense is to make the freight path evidentiary: tamper-evident seals and packaging that cannot be opened and re-closed without leaving a detectable signature; a documented, signed chain of custody that names every party who touched the shipment; and, for the highest-value lots, in-transit telemetry — GPS, shock, and temperature loggers that detect an unscheduled stop or an opening. The receiving process then becomes a verification step, not an unboxing: seal intact, custody log complete, telemetry clean — or the lot is quarantined and re-verified, not racked.
The decision here is which shipments get the full treatment, because tamper-evident logistics with custody attestation and telemetry is not free and does not scale to every cable reel. The rational policy tiers it by what the shipment can compromise: full chain-of-custody and tamper-evidence for accelerators, server boards, BMCs, and anything firmware-bearing; lighter handling for commodity passive and mechanical components. Skip the tiering and you either overspend on tamper-evidence for power cabling or — far worse — under-protect the one shipment whose compromise matters: the trays that hold the silicon that holds the weights.
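As a rough illustration of the two rules in this section, the sketch below encodes the tiering policy and the receiving gate as plain functions. The category names, the `Receipt` fields, and the "full"/"light" labels are assumptions made for the example, not an industry schema.

```python
from dataclasses import dataclass, field

# Hypothetical category names; the rule itself follows the policy in the text.
FULL_CUSTODY_CATEGORIES = {"accelerator", "server_board", "bmc"}


def handling_tier(category: str, firmware_bearing: bool) -> str:
    """Full treatment for anything that can carry an implant, light otherwise."""
    if firmware_bearing or category in FULL_CUSTODY_CATEGORIES:
        return "full"
    return "light"


@dataclass
class Receipt:
    seal_intact: bool
    expected_custodians: list[str]   # parties named in the custody plan
    signed_custodians: list[str]     # parties who actually signed, in order
    telemetry_alerts: list[str] = field(default_factory=list)


def receiving_decision(receipt: Receipt) -> tuple[str, list[str]]:
    """Return ('rack', []) only if every check is clean; otherwise quarantine."""
    reasons = []
    if not receipt.seal_intact:
        reasons.append("seal broken or mismatched")
    if receipt.signed_custodians != receipt.expected_custodians:
        reasons.append("custody log incomplete or out of order")
    if receipt.telemetry_alerts:
        reasons.append("telemetry: " + ", ".join(receipt.telemetry_alerts))
    return ("quarantine", reasons) if reasons else ("rack", [])


clean = Receipt(True, ["integrator", "forwarder", "dc-dock"],
                ["integrator", "forwarder", "dc-dock"])
suspect = Receipt(True, ["integrator", "forwarder", "dc-dock"],
                  ["integrator", "dc-dock"], ["unscheduled stop"])

print(handling_tier("cable_reel", firmware_bearing=False))
print(receiving_decision(clean))
print(receiving_decision(suspect))
```

The design choice worth noting is that the gate fails closed: any single unexplained signal quarantines the lot, and the reasons are kept so the re-verification step knows what to look at.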
Device provenance: from PDF certificate to cryptographic proof
Provenance is, in NIST's framing, the comprehensive history of a device across its entire lifecycle — creation, ownership, and every authorized change — and a delivered device has integrity only if it is genuine and all changes to it were expected. The whole discipline turns on one distinction: provenance as a document versus provenance as evidence. A certificate of conformance is a document; it asserts genuineness but cannot be checked against the device in front of you. Cryptographic provenance is evidence: the NIST SP 1800-34 model has the vendor store signed information inside each device at manufacture — a platform certificate enumerating the expected components — so that at acceptance testing your provisioner scans the machine, builds a hardware manifest of what is actually present, and compares it against the signed certificate. A mismatch is a swapped or added component, detected at receiving rather than after a breach.
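The acceptance-test comparison reduces to a multiset difference between what the certificate says should be present and what the scan finds. The following is a minimal sketch under stated assumptions: component records are simplified to hypothetical `(class, model, serial)` tuples, and the signature on the platform certificate is assumed to have been verified already by separate tooling. It does not reproduce the actual certificate format used in NIST SP 1800-34.

```python
from collections import Counter


def diff_manifest(certified, observed):
    """Compare the certified component list with the as-scanned one.

    Both arguments are iterables of (component_class, model, serial) tuples.
    Returns (missing, unexpected): parts the certificate lists but the scan
    did not find, and parts the scan found that the certificate does not list.
    """
    certified_count, observed_count = Counter(certified), Counter(observed)
    missing = sorted((certified_count - observed_count).elements())
    unexpected = sorted((observed_count - certified_count).elements())
    return missing, unexpected


# Hypothetical data: one NIC swapped in transit, one extra flash part added.
certified = [
    ("gpu", "GPU-X", "SN100"),
    ("nic", "NIC-Y", "SN200"),
    ("bmc", "BMC-Z", "SN300"),
]
observed = [
    ("gpu", "GPU-X", "SN100"),
    ("nic", "NIC-Y", "SN999"),
    ("bmc", "BMC-Z", "SN300"),
    ("spi_flash", "UNKNOWN", "SN000"),
]

missing, unexpected = diff_manifest(certified, observed)
print("missing:", missing)
print("unexpected:", unexpected)
```

A swap shows up on both sides (the certified serial is missing and an uncertified one is present), while an added part shows up only as unexpected; either result should block racking.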
The building blocks of an evidentiary supply chain are three layered manifests, each answering a different question:
- HBOM (Hardware Bill of Materials) — what physical components is this device supposed to contain? The reference an as-built scan is compared against to detect substitution or an added implant.
- SBOM (Software Bill of Materials) — what software/firmware components and versions are present, so a known-vulnerable or unexpected element is visible rather than buried.
- RIM (Reference Integrity Manifest) — the signed, expected measurements of firmware/boot components, so a hardware root of trust can measure what actually ran at boot and attest it against the expected values. The RIM is what converts a one-time receiving check into a property you can re-verify for the asset's whole life.
The fork: an HBOM/SBOM-only posture gives you a strong receiving gate but goes quiet after the machine is racked; an HBOM/SBOM + RIM-and-attestation posture stays live, catching a firmware swap that happens in year two. The latter is the bridge into the hardware-root-of-trust and attestation machinery of Chapter 11.4, and it is what a high Weights Security Level (Chapter 11.1) effectively mandates.
| Posture | Evidence | Detects substitution at receiving | Detects post-rack firmware swap | What it costs |
|---|---|---|---|---|
| Certificate of conformance | PDF assertion of genuineness | No (paper only) | No | Near-zero; near-zero assurance |
| Incoming inspection | Visual + X-ray/CT + sample decap | Partially (physical anomalies) | No | Lab/tooling + per-lot labor |
| NIST 1800-34 platform certificate | Signed HBOM verified at acceptance | Yes (component mismatch) | No (one-time check) | Vendor term + receiving tooling |
| RIM + hardware root of trust | Signed measurements, attested at boot and continuously | Yes | Yes (re-verifiable for asset life) | Silicon RoT + attestation infra (→ 11.4) |
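
To show why the RIM row stays live after racking, here is a toy version of the comparison it enables. This is a sketch, not an attestation implementation: in a real platform the measurement is taken by the hardware root of trust and the RIM is signed, whereas here the "measurement" is simply a SHA-256 over hypothetical image bytes and the RIM is an unsigned dictionary.

```python
import hashlib
import hmac


def measure(image: bytes) -> str:
    """Stand-in for a root-of-trust measurement: SHA-256 of the image."""
    return hashlib.sha256(image).hexdigest()


def check_against_rim(rim: dict, images: dict) -> dict:
    """Compare each measured image with its expected value in the RIM."""
    verdicts = {}
    for component, expected in rim.items():
        image = images.get(component)
        if image is None:
            verdicts[component] = "missing"
        elif hmac.compare_digest(measure(image), expected):
            verdicts[component] = "match"
        else:
            verdicts[component] = "mismatch"
    for component in images.keys() - rim.keys():
        verdicts[component] = "unexpected"  # firmware the RIM never listed
    return verdicts


# Hypothetical images; the RIM is built from the known-good set.
golden = {"bmc": b"bmc-fw-1.2", "nic": b"nic-fw-3.4"}
rim = {name: measure(image) for name, image in golden.items()}

at_receiving = check_against_rim(rim, golden)
year_two = check_against_rim(rim, {"bmc": b"bmc-fw-1.2+implant",
                                   "nic": b"nic-fw-3.4"})
print(at_receiving)
print(year_two)
```

The same function runs at receiving and at every later boot, which is the practical difference between a one-time gate and a re-verifiable property.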
Firmware assurance: OCP S.A.F.E. and the audit you can inherit
Firmware is the implant surface that survives everything else. A malicious or vulnerable image in a BMC, NIC, SSD controller, or power component persists across every disk wipe, OS reinstall, and ownership transfer, because none of those touch it — which is exactly why it is the favored persistence mechanism for a sophisticated adversary and the reason firmware assurance belongs in the supply-chain chapter, not just the operations one. The problem at fleet scale is that you cannot audit every supplier's firmware yourself, and asking each operator to do so independently is enormous duplicated effort for the same images.
The OCP S.A.F.E. (Security Appraisal Framework and Enablement) program solves this by centralizing the audit: a device or firmware vendor engages an approved, independent Security Review Provider (SRP) to perform a standardized security review against a common checklist, and the resulting endorsement is something you can inherit instead of re-running — with a gap analysis required for each new firmware release so the assurance does not silently expire. The decision this hands the operator is clean: require an OCP S.A.F.E. (or equivalent independently-audited) attestation as a procurement term for firmware-bearing devices, and require a RIM you can attest against — or accept the vendor's unverified word and own the residual. For a facility at a high Weights Security Level, the former is effectively mandatory; the audited-firmware requirement is the supply-chain half of the firmware-integrity story that Chapter 11.4 completes on the platform side.
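Written as a procurement check, the requirement might look like the sketch below. Every field name here is hypothetical; this is not the OCP S.A.F.E. report schema, only an illustration of the three conditions the text asks for: an independent review exists, the shipping firmware version is covered by that review or by a later gap analysis, and a RIM is supplied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FirmwareOffer:
    device: str
    shipping_version: str
    reviewed_version: Optional[str] = None        # version the SRP reviewed
    gap_analyzed_versions: frozenset = frozenset()  # later releases covered
    rim_supplied: bool = False


def procurement_findings(offer: FirmwareOffer) -> list[str]:
    """Return the list of unmet terms; an empty list means the offer passes."""
    findings = []
    if offer.reviewed_version is None:
        findings.append("no independent security review")
    else:
        covered = {offer.reviewed_version} | set(offer.gap_analyzed_versions)
        if offer.shipping_version not in covered:
            findings.append("shipping firmware not covered by review or gap analysis")
    if not offer.rim_supplied:
        findings.append("no RIM to attest against")
    return findings


current = FirmwareOffer("bmc", "2.1", reviewed_version="2.0",
                        gap_analyzed_versions=frozenset({"2.1"}),
                        rim_supplied=True)
stale = FirmwareOffer("nic", "3.5", reviewed_version="3.4", rim_supplied=True)
bare = FirmwareOffer("ssd", "1.0")

for offer in (current, stale, bare):
    print(offer.device, procurement_findings(offer))
```

The `stale` case is the one the gap-analysis requirement exists to catch: an endorsement that is real but describes an image the vendor no longer ships.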
Secure decommissioning, media sanitization, and the remanence trap
The supply chain has a tail, and it is where the most catastrophic, most avoidable breach lives: the moment a drive that held weights, checkpoints, or customer data leaves your control. Decommissioning is too often treated as a cost-recovery exercise (pull the racks, recover residual value, recycle the rest), with security bolted on as an afterthought. That inversion is how a frontier-model weights shard ends up forensically recoverable on a drive sold into the secondary market. The remanence base rate is not hypothetical: a widely cited study found that 42% of resold used drives still contained recoverable data, including PII, financial records, and IP. On an AI fleet, the analogous payload is the crown jewels of Chapter 11.8.
The standard is NIST SP 800-88. Its Revision 2, published in September 2025, is the current edition, and it matters because it modernizes the guidance for the encrypted, virtualized, cloud-backed media an AI fleet actually runs. It defines three sanitization levels: Clear (logical overwrite, recoverable by lab techniques), Purge (renders recovery infeasible even with lab techniques), and Destroy (renders the media physically incapable of storing data). The load-bearing engineering fact for AI builders is the SSD trap: modern solid-state media uses wear-leveling and over-provisioning, meaning spare cells that never appear in the user-addressable space and that an overwrite routine cannot reach. As a result, per IEEE 2883-2022 and NIST SP 800-88 Rev. 2, no overwrite-based method satisfies the Purge threshold for SSD/NVMe. Purge on flash requires a verified drive-native operation: cryptographic erase (destroying the encryption key so the ciphertext is unrecoverable) or a block erase that the firmware implements correctly. Where neither can be verified, the only remaining option is Destroy. An operator who 'wipes' NVMe drives with a multi-pass overwrite and resells them has met no recognized Purge standard and is shipping recoverable data to strangers.
| Method | NIST 800-88 level | Valid for modern SSD/NVMe? | Preserves resale value? | When to use |
|---|---|---|---|---|
| Single/multi-pass overwrite | Clear | No (cannot reach over-provisioned cells) | Yes | HDDs / low-sensitivity media only |
| Cryptographic erase (key destruction) | Purge | Yes (if drive was encrypted from deployment) | Yes | Default for encrypted SSDs at scale |
| Verified block-erase (drive-native) | Purge | Yes (if firmware implements it correctly) | Yes | Where crypto-erase unavailable; verify it worked |
| Degauss | Purge | No (ineffective on flash; HDD/tape only) | No (destroys drive) | Magnetic media only |
| Physical destruction (shred/disintegrate) | Destroy | Yes (definitive) | No | Weights-bearing media; classified; unverifiable drives |
The decision tree at end-of-life is therefore a two-axis trade between assurance and residual value, gated by a choice you made years earlier. If every drive in the fleet was encrypted at deployment with a key in a managed KMS/HSM, then crypto-erase gives you a Purge-level sanitization that preserves resale value: the rare case where the secure option is also the economical one. If drives were not encrypted from day one, your options narrow to a drive-native block erase that you must verify on every drive, an overwrite that does not meet Purge for SSDs (insecure), or physical destruction that meets Destroy but forfeits residual value (expensive). This is why fleet-wide encryption-at-rest is a decommissioning decision disguised as a deployment decision, and why anything that held weights should default to Destroy regardless, accepting the lost residual as cheap insurance against the one breach that ends a company. The whole flow must be wrapped in a chain of custody and a per-asset certificate of destruction, executed by a vetted, certified ITAD provider (R2v3 or equivalent), because the decommissioning vendor is itself a supply-chain link with its own insider and substitution risk. → key management and weight-protection in Chapter 11.8; the financial/ITAD mechanics in Chapter 14.9.
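The end-of-life logic, together with the method table, can be written down as a small decision function. This is one possible encoding under explicit assumptions: the parameter names and returned labels are invented for the example, unknown media types are treated as unverifiable and sent to Destroy, and the function does not replace per-drive verification or the certificate of destruction.

```python
FLASH = {"ssd", "nvme"}
MAGNETIC = {"hdd"}


def sanitization_method(media: str,
                        held_weights: bool,
                        encrypted_from_deployment: bool,
                        native_erase_verified: bool = False,
                        low_sensitivity: bool = False) -> str:
    """Pick a sanitization method following the chapter's table and defaults."""
    # Weights-bearing or unrecognized media: Destroy, regardless of resale value.
    if held_weights or media not in FLASH | MAGNETIC:
        return "destroy"
    # Encrypted since deployment: key destruction reaches Purge and keeps value.
    if encrypted_from_deployment:
        return "purge: cryptographic erase"
    if media in FLASH:
        # Overwrite cannot reach over-provisioned cells, so it is never offered.
        if native_erase_verified:
            return "purge: verified block erase"
        return "destroy"
    # Magnetic media: overwrite only reaches Clear; degauss reaches Purge.
    return "clear: overwrite" if low_sensitivity else "purge: degauss"


print(sanitization_method("nvme", held_weights=True, encrypted_from_deployment=True))
print(sanitization_method("nvme", held_weights=False, encrypted_from_deployment=True))
print(sanitization_method("nvme", held_weights=False, encrypted_from_deployment=False))
print(sanitization_method("hdd", held_weights=False, encrypted_from_deployment=False,
                          low_sensitivity=True))
```

Note that the unencrypted NVMe case with no verified native erase lands on Destroy, which is where the forfeited residual value shows up as the price of skipping encryption at deployment.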
Deep dive: why firmware is the implant that outlives the wipe
The instinct is to treat 'sanitization' as a storage problem — wipe the data drives, recover the value, done. That instinct misses the persistence mechanism a sophisticated adversary actually uses. Every server is a federation of small computers running their own firmware: the BMC (a full Linux SoC with out-of-band access to everything), the NIC, the SSD controller, the GPU's own management microcontrollers, the power-conversion devices. Each holds firmware in non-volatile storage that no disk wipe, OS reinstall, or antivirus ever touches. A malicious image planted there — at an un-audited supplier, or by someone with brief physical access in a transshipment point — survives every conventional sanitization and every re-imaging, and re-establishes itself on the next boot.
This is why firmware assurance is a supply-chain control and not merely an operations one, and why it spans both ends of the lifecycle. On the way in, you want an OCP S.A.F.E. audit and a RIM you can attest against, so an unexpected firmware image is visible at acceptance. Across life, you want continuous attestation against that RIM, so a firmware swap in year two is caught (the machinery of Chapter 11.4). On the way out, true sanitization of a weights-bearing host means addressing firmware-resident state too — which in practice means physical destruction of the firmware-bearing components for the highest-sensitivity assets, because re-flashing to a known-good image still trusts the very update path an implant may have subverted. The clean mental model: data sanitization protects the storage; firmware assurance protects the computer that the storage plugs into — and the second is the one an adversary actually hides in.
Deep dive: the inheritable-audit economics of OCP S.A.F.E.
Firmware auditing has a punishing cost structure if every operator does it alone. A serious firmware security review — reverse-engineering a BMC image, auditing the secure-boot chain, probing the update mechanism — is weeks of specialist time per device family per release. Multiply by every NIC, SSD, BMC, and power-controller image in a heterogeneous fleet, then by every firmware revision, then by every operator who buys the same hardware, and the industry is paying for the same audit thousands of times over while most operators, lacking the specialists, simply skip it.
OCP S.A.F.E. restructures that into a once-audited, many-times-inherited model: the vendor pays an approved independent Security Review Provider to review against a common framework, the endorsement is published, and every downstream operator inherits the assurance instead of re-deriving it, with a required gap analysis on each new firmware release so the endorsement tracks the shipping image rather than a stale one. The strategic consequence for a buyer is that firmware assurance becomes a specification rather than a project: you write 'OCP S.A.F.E.-endorsed firmware with current gap analysis' into the procurement contract and shift the audit burden onto the supplier who is best placed to bear it. The residual you still own is the trust in the SRP and the framework itself, which is why the program's independence and the public availability of its findings (CVSS-scored, like any vulnerability) determine whether the inherited assurance is worth anything.