Chapter 11.12

Security Operations, Detection & Incident Response

Every control in Part 11 is a hypothesis about an adversary; security operations is the function that tests those hypotheses in production — and on an AI campus the test is harder because the highest-value asset is a file you cannot watch leave, the highest-consequence attack is one that turns the cooling plant into a weapon, and the detection surface spans a converged cyber-physical estate that no off-the-shelf SOC was designed to watch.

POWER-BOUND · GOODPUT

What you'll decide here

Whether you run a converged SOC that watches IT, OT, and physical security under one incident commander, or isolated towers — because the cyber-physical attacks that matter most on an AI campus (CDU disablement, forced load step, weight exfiltration) are precisely the ones that fall between isolated towers.
Which telemetry you collect and retain at cluster scale — GPU/BMC/firmware attestation, fabric flow records, egress byte-budgets, OT controller logs — and which you knowingly do not, because you cannot detect what you never ingested and you cannot afford to keep everything.
Where the converged escalation trigger lives (the rule that promotes a 'cyber alert' to a 'physical-safety incident' the moment a control-plane anomaly touches a power-cap, CDU, or BESS plane), and who has the authority to trip hardwired interlocks without waiting for forensic certainty.
Which incidents you have actually rehearsed — weight theft, firmware implant, kinetic/drone strike, isolation breach, OT/cyber-physical attack — versus the ones you will be improvising on at 3 a.m., because the playbook you have not run is the playbook you do not have.
What you measure the SOC against (facility availability, MTTD/MTTC, or goodput preserved per incident), because optimizing for the wrong number buys nines the workload does not value while badput from slow detection quietly eats the return.

The preceding eleven chapters of Part 11 build a fortress: a threat model and a target Weights Security Level (Chapter 11.1), concentric physical zones (Chapter 11.2), a hardware root of trust and firmware integrity (Chapter 11.4), GPU confidential computing (Chapter 11.5), tenant isolation (Chapter 11.6), zero-trust segmentation (Chapter 11.7), weight protection (Chapter 11.8), insider controls (Chapter 11.9), OT hardening (Chapter 11.10), and a compliance program that audits all of it (Chapter 11.11). Every one of those controls is a hypothesis: that the adversary will be stopped here, slowed there, made noisy somewhere. Security operations is the function that runs the experiment. It is where prevention's assumptions meet a live adversary, where the residual risk that every control chapter explicitly leaves on the table gets detected, contained, and recovered — or does not.

This chapter is the operational close of Part 11. It treats four decisions and their consequences: the SOC architecture for a campus where IT, OT, and physical security have converged into one attack surface; detection coverage and telemetry at scale, where the volume of fabric and GPU signal defeats a naive SIEM and the most valuable asset emits the least telemetry; the incident-response playbooks for the AI-specific scenarios that generic IR plans never anticipated — weight theft, firmware implant, kinetic strike, isolation breach, cyber-physical OT attack — including forensics on confidential systems that are designed to be opaque even to their operator; and the resilience, red-teaming, threat-intelligence, and metrics program that keeps the whole apparatus honest. The unifying theme is convergence: the AI campus is the first environment where a cyber intrusion can become a kinetic, life-safety, and grid-stability event in the same incident, and the SOC must be built to see that transition before it happens.

The SOC fork: converged vs isolated

The first and most consequential decision is organizational, not technical: does one incident-command structure watch IT security, OT/facility security, and physical security together, or do three separate towers each watch their slice? The legacy answer — separate, because the skills, the tooling, and the reporting lines are different — is exactly wrong for an AI campus, and the reason is structural. The attacks that define this environment do not respect the tower boundaries. A firmware implant on a CDU controller (an OT event) is the prelude to thermal runaway that throttles or destroys a $3M+ rack of allocation-constrained silicon (an IT-asset and availability event). A drone strike on a substation (a physical event) is increasingly paired with a cyber intrusion timed to the chaos. A badge-reader compromise (a physical-access-control event that is really an OT/IT event, because the readers are IP-connected) is an insider's path to a cage. The highest-consequence incidents on an AI campus are precisely the ones that fall into the seams between isolated towers — and a seam is where detection dies.

The converged model puts IT, OT, and physical telemetry into a shared detection and incident-command plane, with one duty officer who can correlate a control-plane anomaly with a power-cap deviation and a door-forced alarm and recognize them as one attack. The cost is real: OT analysts and IT analysts have genuinely different instincts (an OT engineer's reflex is 'do not touch the running process'; an IT analyst's reflex is 'isolate and reimage'), and forcing them into one room without a deliberate operating model produces friction, not coverage. The honest middle path, and the one most large operators are settling on, is converged detection with federated response: shared visibility and a single incident commander, but OT actions still gated through OT-qualified operators who understand that the wrong containment move on a cooling loop is itself a destructive act. The canonical incident-command structure that this SOC plugs into (the roles, the authority ladder, the unified command for cyber-physical events) is built out in Chapter 14.11; this chapter is the security-detection feed into it.

SOC architecture fork: converged vs isolated towers

Axis	Isolated towers (IT / OT / physical separate)	Fully converged SOC	Converged detection, federated response
Seam coverage	Poor — cyber-physical attacks fall between towers	Strong — one analyst correlates across domains	Strong — shared detection plane, no telemetry silos
Response correctness on OT	OT-correct (OT operators act)	Risk — IT reflexes mis-applied to running OT	OT-correct — OT actions gated to OT operators
Incident command	Three commanders, slow hand-off	One commander, fast	One commander, OT-qualified deputies
Skills / staffing cost	Lowest per-tower, highest total	High — cross-trained scarce talent	Moderate — shared analysts + OT specialists
Tooling overlap	Duplicated SIEM/SOAR per tower	Single pane, integration burden	Shared SIEM, OT-native sensors federated in
Best fit	Legacy enterprise DC, low OT-attack exposure	Small, single-tenant frontier campus	Large multi-tenant AI campus (the default)

The trade is coverage-of-the-seams against operational friction and tooling overlap. Most large 2026 AI operators land on 'converged detection, federated response.' Rows are decision axes, not a maturity ladder.

The converged cyber-physical escalation trigger

Define — in advance, in code, with a named owner — the escalation trigger that promotes a cyber alert to a physical-safety incident the instant a control-plane anomaly touches a safety-relevant plane: the power-cap/firmware plane, a CDU or pump controller, the BESS management system, or the EPMS. When that trigger fires, the incident is no longer 'investigate and confirm before acting' — it is 'assume cyber-physical attack, protect life and the grid first, forensics second.' This inverts the normal IR instinct. A standard cyber playbook preserves evidence and avoids tipping off the adversary; a cyber-physical playbook accepts evidence loss to trip a hardwired interlock before a forced synchronized load step trips the grid (the NERC-flagged failure mode in Chapter 4.5) or a disabled CDU cooks a hall. The safety-instrumented systems must be independent of the compromised control plane by design — see Chapter 11.10 — so that the SOC's escalation can reach an interlock the attacker cannot countermand. Operators who have not pre-wired this trigger discover, mid-incident, that no one has the authority to act and the controls engineer is asleep.
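What 'in code, with a named owner' might look like can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the plane labels, the `Alert` and `Escalation` types, and the owner role names are all hypothetical, and a real trigger would sit inside the SOC's detection pipeline and page a human with authority over the interlock.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CYBER_ALERT = "cyber_alert"
    PHYSICAL_SAFETY_INCIDENT = "physical_safety_incident"


# Safety-relevant planes named in the text; the string labels are hypothetical.
SAFETY_PLANES = frozenset(
    {"power_cap_firmware", "cdu_pump_controller", "bess_bms", "epms"}
)


@dataclass(frozen=True)
class Alert:
    alert_id: str
    source: str                 # e.g. "control_plane_anomaly"
    touched_planes: frozenset   # planes the anomalous activity reached


@dataclass(frozen=True)
class Escalation:
    severity: Severity
    owner: str                  # named role with authority to act
    first_action: str


def classify(alert: Alert, owner: str = "ot_duty_officer") -> Escalation:
    """Promote to a physical-safety incident as soon as any safety plane is touched."""
    if alert.touched_planes & SAFETY_PLANES:
        return Escalation(
            Severity.PHYSICAL_SAFETY_INCIDENT,
            owner,
            "protect life and grid first; forensics second",
        )
    return Escalation(Severity.CYBER_ALERT, "soc_analyst", "investigate and confirm")
```

The design point is that the promotion is a set intersection, not a judgment call: no confidence score or analyst confirmation sits between the anomaly and the escalation, which is the inversion of the normal IR instinct described above.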

Detection coverage and telemetry at scale

A SOC detects only what it can see, and an AI campus presents a detection surface that is simultaneously enormous and blind in the worst places. Enormous, because a 100k-GPU cluster generates fabric telemetry, GPU and BMC health streams, firmware-attestation records, scheduler events, and OT controller logs at a volume that overwhelms a naive 'log everything into the SIEM' approach — the ingest and retention bill alone forces hard choices about what you keep and for how long. Blind in the worst places, because the single most valuable asset on the campus — the model weights — is, by design, the asset that emits the least useful telemetry: it lives encrypted at rest, decrypts only inside an attested TEE, and the confidential-computing boundary that protects it (Chapter 11.5) is opaque to the very monitoring the SOC would use to watch it. You are asked to detect theft of the thing you are contractually forbidden from observing. That paradox is the defining feature of AI-campus detection engineering.

The resolution is to move detection to the perimeters of the opaque core, not into it. You cannot watch a weight decrypt inside a TEE, but you can watch the egress choke point the weight must cross to leave (the anti-exfiltration linchpin of Chapter 11.7 and Chapter 11.8): a frontier model is hundreds of gigabytes to terabytes, and that mass has to traverse a network boundary, which is detectable as a byte-budget anomaly even when the payload is encrypted. You cannot trust a GPU's self-report if its firmware is compromised, but you can demand continuous attestation against NRAS/RIM golden measurements and alert on a measurement that drifts from baseline (the firmware-integrity discipline of Chapter 11.4). You cannot see inside a confidential VM, but you can watch the scheduler and the access plane for the human and machine behaviors that precede an exfiltration: a privileged account touching a weight store it has never touched, a job submitted to read every shard, a bulk-copy pattern against the storage fabric. The detection strategy is to ring the opaque asset with observable behavior and treat any anomaly at the ring as the signal.
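The byte-budget idea can be made concrete with a sliding-window counter per principal at the egress choke point. The sketch below is illustrative only: the class name, the notion of a 'principal' (tenant, job, or credential), and the budget and window values are assumptions, and it expects timestamps to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class EgressByteBudget:
    """Sliding-window byte counter per principal at the egress choke point."""

    def __init__(self, budget_bytes: int, window_s: float):
        self.budget_bytes = budget_bytes
        self.window_s = window_s
        self._events = defaultdict(deque)   # principal -> deque of (timestamp, nbytes)
        self._totals = defaultdict(int)     # principal -> bytes currently inside the window

    def record(self, principal: str, ts: float, nbytes: int) -> bool:
        """Record one flow; return True if the principal is now over budget."""
        q = self._events[principal]
        q.append((ts, nbytes))
        self._totals[principal] += nbytes
        # Expire flows that have aged out of the window.
        while q and q[0][0] <= ts - self.window_s:
            _, old = q.popleft()
            self._totals[principal] -= old
        return self._totals[principal] > self.budget_bytes
```

Because the counter only needs flow sizes and timing, it works on encrypted payloads. A sender who stays under the budget in every window is not caught by this check alone, so the threshold is a trade between false alarms on legitimate transfers and the size of leak that can pass unnoticed.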

Detection coverage map: signal source → what it catches → blind spot

Telemetry source	Detects	Primary blind spot	Engineered in
Egress byte-budget / DLP at the choke point	Bulk weight/data exfiltration; insider copy-out	Slow low-and-slow leak under the threshold	11.7, 11.8
GPU/firmware attestation (NRAS/RIM drift)	Firmware implant; GPU-bricking; rogue measurement	Implant that forges a valid measurement chain	11.4, 11.5
Fabric flow records (east-west)	Lateral movement; abnormal collectives; recon	In-TEE traffic; encrypted intra-job payloads	11.7
Scheduler / access-plane events	Privilege misuse; anomalous weight-store access	Legitimate-credentials abuse that looks normal	11.6, 11.9
OT controller / BMS / EPMS logs	Power-cap tampering; CDU disablement; load-step setup	Air-gapped or unlogged legacy controllers	11.10
Physical / badge / camera (as OT)	Door-forced; tailgating; cage access; drone telemetry	Aerial threats below sensor coverage	11.2

The recurring pattern: the highest-value asset (weights) emits the least direct telemetry, so detection moves to the observable perimeters around it. Cross-references point to where each control is engineered.
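The attestation row of the map reduces, on the operator side, to comparing reported measurements against a golden baseline and alerting on any difference. The sketch below shows only that comparison. It is not the NRAS or RIM interface: the function name, the use of plain dictionaries keyed by record index, and the hex-string digests are assumptions made for illustration.

```python
import hmac


def measurement_drift(golden: dict, reported: dict) -> dict:
    """Return {record_index: reason} for every record that differs from the baseline.

    Both arguments map a measurement-record index to an ASCII hex digest string.
    """
    drift = {}
    for idx, want in golden.items():
        got = reported.get(idx)
        if got is None:
            drift[idx] = "missing"
        elif not hmac.compare_digest(want, got):
            drift[idx] = "mismatch"
    # Records the baseline does not know about are also treated as drift.
    for idx in reported.keys() - golden.keys():
        drift[idx] = "unexpected"
    return drift
```

An empty result means the node matches baseline. This check inherits the blind spot listed in the table: an implant that forges a valid measurement chain produces no drift, which is why the attestation signal is one ring among several and not the whole detection strategy.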

Two scale realities shape the telemetry budget. First, retention is a security control with a price tag. Forensics on a nation-state intrusion routinely requires reconstructing activity from months ago — Volt Typhoon-class actors pre-position in critical infrastructure and dwell quietly, with the explicit goal of remaining undetected for as long as possible (CISA/Microsoft, 2023–2026). The industry's measured breach lifecycle — mean time to identify and contain — sat at 241 days in 2025, the lowest in nine years but still eight months of adversary dwell (IBM Cost of a Data Breach, 2025). If your fabric-flow and access-plane logs roll off at 90 days to save money, you have guaranteed that the most serious intrusions will be un-investigable by the time you notice them. Second, detection at cluster scale must be tiered: high-cardinality, high-volume streams (per-flow fabric records, per-GPU health) are summarized and sampled at the edge and only escalated to full retention on anomaly, while low-volume high-value streams (attestation failures, egress threshold breaches, OT control changes) are kept verbatim and long. Getting this tiering wrong in either direction — keeping everything and going bankrupt, or keeping nothing and going blind — is the most common SOC-scaling failure on an AI campus.
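The retention argument is simple arithmetic: any forensic stream kept for fewer days than the adversary can dwell is gone by the time it is needed. A small check makes that visible in a policy review. The stream names, tiers, and retention values below are hypothetical; only the 241-day figure comes from the text.

```python
from dataclasses import dataclass

DWELL_DAYS = 241  # breach lifecycle figure cited in the text (IBM, 2025)


@dataclass(frozen=True)
class StreamPolicy:
    name: str
    tier: str             # "verbatim" (kept whole) or "sampled" (summarized at the edge)
    retention_days: int
    forensic: bool        # needed to reconstruct an intrusion after the fact


# Hypothetical policy values, for illustration only.
POLICIES = [
    StreamPolicy("attestation_failures", "verbatim", 730, True),
    StreamPolicy("egress_threshold_breaches", "verbatim", 730, True),
    StreamPolicy("ot_control_changes", "verbatim", 730, True),
    StreamPolicy("access_plane_events", "verbatim", 90, True),
    StreamPolicy("fabric_flow_records", "sampled", 90, True),
    StreamPolicy("per_gpu_health_raw", "sampled", 30, False),
]


def retention_gaps(policies, dwell_days=DWELL_DAYS):
    """Return forensic streams whose retention rolls off before the dwell window ends."""
    return [p.name for p in policies if p.forensic and p.retention_days < dwell_days]
```

With these illustrative values the check flags the access-plane and fabric-flow streams, the same two the 90-day example above warns about. The dwell figure is an industry mean, so a policy that merely equals it still leaves the longer intrusions uncovered.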

Incident-response playbooks for AI-specific scenarios

A generic enterprise IR plan — built around the NIST incident-handling lifecycle and now restructured by NIST SP 800-61r3 (April 2025) to map onto the CSF 2.0 functions of Govern, Identify, Protect, Detect, Respond, and Recover — is necessary but radically insufficient here. The AI campus has incident classes that no commodity playbook anticipates, and each one inverts a default IR assumption. The five that matter, with the fork each forces:

Weight theft. The crown-jewel scenario. The default IR move (isolate the affected host and preserve it for forensics) is too slow when the asset is a file that may be leaving in real time. The playbook fork is detect-then-contain at the egress, not the host: throttle or sever the egress path the moment a byte-budget anomaly fires, accept the goodput loss of cutting legitimate traffic, and reconstruct the access chain from the access-plane and scheduler logs you (hopefully) retained. The reference framework for what 'theft' means and which adversary tier you are defending against is RAND's Weights Security Level model (Chapter 11.1, Chapter 11.8).
Firmware implant. An implant in GPU, BMC, or NIC firmware survives reimaging and can forge its own health reports. The default 'wipe and restore from backup' does not help if the backup or the supply chain is the vector. The fork is to treat any attestation-chain anomaly as a compromise of the hardware root of trust itself: quarantine the node from the fabric, re-attest from the silicon RoT (Caliptra/DICE, Chapter 11.4), and — if the measurement cannot be re-established — physically retire the unit, because you can no longer trust anything it reports.
Kinetic / drone strike. The March 2026 IRGC drone strikes on AWS facilities in the UAE and Bahrain moved aerial attack from tail-risk to design case (domain research, 2026); most legacy physical controls are built for ground intruders, not aerial ones. The fork here is that the security incident and the facility incident are the same incident — structural and power damage demand the unified cyber-physical command (Chapter 14.11), and the SOC's job is to determine whether the kinetic event is cover for a simultaneous cyber intrusion.
Isolation breach. A documented multi-tenant escape — the 2025 vGPU and uncore-side-channel CVE class (Chapter 11.6) — means tenant A read tenant B's data or denied them service. The fork is blast-radius first: identify every tenant who shared the affected isolation boundary, not just the two named in the alert, because a side-channel does not respect the alert's scope.
Cyber-physical / OT attack. The most dangerous class, treated at length in Chapter 11.10: a compromised control plane forcing a synchronized load step to trip the grid, disabling CDUs to cook a hall, or inducing BESS runaway. This is the one incident where the playbook must accept evidence destruction to protect life and the grid — the escalation trigger above.
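The five forks above can be captured as a lookup table that a SOAR workflow or a drill script consults, so the counter-intuitive first move is written down and not left to memory. This is a sketch of the structure only: the dictionary keys, the `Playbook` type, and the wording of each entry are paraphrases of the scenarios above, not an established schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    inverted_default: str   # the generic IR instinct this scenario overrides
    first_move: str         # the AI-campus containment move described in the text


PLAYBOOKS = {
    "weight_theft": Playbook(
        "isolate the affected host and preserve it for forensics",
        "throttle or sever egress at the choke point on byte-budget anomaly",
    ),
    "firmware_implant": Playbook(
        "wipe and restore from backup",
        "quarantine node from fabric, re-attest from silicon RoT, retire if unverifiable",
    ),
    "kinetic_strike": Playbook(
        "treat the physical event as separate from cyber",
        "unified cyber-physical command; check for a simultaneous cyber intrusion",
    ),
    "isolation_breach": Playbook(
        "scope the incident to the tenants named in the alert",
        "enumerate every tenant that shared the affected isolation boundary",
    ),
    "ot_cyber_physical": Playbook(
        "preserve evidence and confirm before acting",
        "trip the hardwired interlock; protect life and grid first, forensics second",
    ),
}


def first_move(incident_class: str) -> str:
    """Look up the first containment move; unknown classes fail loudly."""
    try:
        return PLAYBOOKS[incident_class].first_move
    except KeyError:
        raise ValueError(f"no rehearsed playbook for {incident_class!r}") from None
```

Failing loudly on an unknown class is deliberate: an incident type with no entry is one the team would be improvising on, and that is better discovered in a drill than in production.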

The playbook you have not rehearsed is the playbook you do not have

Each of the five scenarios above contains a counter-intuitive move that an analyst will not make under pressure unless they have made it before in a drill: cutting legitimate egress to stop a leak, physically retiring a node over an attestation anomaly, tripping a hardwired interlock and losing evidence, treating a two-tenant alert as an N-tenant blast radius. These instincts are unnatural — they trade availability, goodput, or evidence for containment in ways the daily SOC routine actively trains against. The only way to install them is the tabletop and the live-fire exercise, run against these specific AI scenarios, with the actual incident commander in the chair. An IR plan that has only ever been read is a document; an IR plan that has been run is a capability. Operators consistently discover their gaps not in the breach but in the rehearsal — which is the cheap place to discover them.

Forensics on confidential systems

Confidential computing creates a forensics paradox that is unique to this era. The whole point of a GPU TEE is that no one — not the cloud operator, not the SOC, not a privileged insider — can observe the workload's memory: NVIDIA's Compute Protected Region encrypts roughly 90% of GPU memory and the BAR0 decoupler hides ~99.78% of registers in confidential mode (vs ~7.94% in normal mode) (NVIDIA / arXiv 2507.02770, 2025). That is exactly the property a tenant pays for, and exactly the property that blinds the responder. When the incident is inside the confidential boundary, the operator's normal forensic toolkit — memory capture, process inspection, packet capture of the workload — is structurally unavailable. You cannot dump the memory of a TEE you do not hold the keys to.

The forensic strategy must therefore be designed before the incident, around the boundary rather than across it. Three principles.

Attestation logs are the forensic record. The 5-certificate device-identity chain and the 64 structured measurement records validated against NRAS/RIM (domain research, 2026) are, in a confidential incident, often the only operator-side evidence of what ran and whether the platform was intact, so they must be collected, signed, and retained as first-class forensic artifacts, not transient health checks.
Metadata leakage is your friend here. The same residual metadata that is a confidentiality weakness (plaintext queue headers and physical-address tables, the bounce-buffer staging across the untrusted PCIe boundary) is also the operator's only observability into an otherwise opaque workload, and the SOC should be instrumented to capture it.
The contract is the constraint. Multi-tenant confidential workloads come with contractual and regulatory limits on what the operator may inspect; the IR plan must pre-negotiate, with tenants and with legal, exactly what evidence the operator is permitted to collect and under what incident conditions the TEE may be torn down, because the moment to discover you are not allowed to image a tenant's enclave is not during the breach.

Forensic readiness on confidential systems is a design decision made at onboarding (Chapter 10.10 on data governance; isolation and CC internals in Chapter 11.5 and Chapter 11.6), not a capability you can improvise.
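Treating attestation logs as signed, retained artifacts implies a tamper-evident record. One common way to get that property is a hash chain in which each entry's authentication tag covers both its own content and the previous entry's tag. The sketch below uses an HMAC from the Python standard library purely to stay self-contained. That is a simplification: a symmetric key held by the operator does not prove the operator left the log untouched, so a production design would more plausibly use asymmetric signatures with the key in an HSM, plus write-once storage or an external witness. The class and function names are hypothetical.

```python
import hashlib
import hmac
import json


class CustodyLog:
    """Append-only, hash-chained log; each entry is MACed over its content and predecessor."""

    GENESIS = "0" * 64

    def __init__(self, key: bytes):
        self._key = key
        self._entries = []

    def append(self, record: dict) -> dict:
        prev = self._entries[-1]["mac"] if self._entries else self.GENESIS
        # Canonical JSON so the same record always serializes to the same bytes.
        payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
        mac = hmac.new(self._key, (prev + payload).encode(), hashlib.sha256).hexdigest()
        entry = {"prev": prev, "payload": payload, "mac": mac}
        self._entries.append(entry)
        return entry

    def entries(self):
        """Return copies so callers cannot mutate the stored chain."""
        return [dict(e) for e in self._entries]


def verify(entries, key: bytes) -> bool:
    """Recompute the chain; any edited, reordered, or dropped entry breaks it."""
    prev = CustodyLog.GENESIS
    for e in entries:
        expected = hmac.new(key, (prev + e["payload"]).encode(), hashlib.sha256).hexdigest()
        if e["prev"] != prev or not hmac.compare_digest(expected, e["mac"]):
            return False
        prev = e["mac"]
    return True
```

Verification walks the chain from the first entry, so altering or removing any record invalidates every tag after it. Truncating the tail is not detected by the chain alone; catching that requires anchoring the latest tag somewhere the log writer cannot reach.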

Deep dive: chain of custody and forensics in a multi-tenant confidential cluster

Consider the concrete case: a confidential multi-tenant cluster, tenant A's workload flagged for anomalous egress, weights potentially compromised. In a non-confidential environment the responder would snapshot the VM, image GPU memory, and pcap the egress. None of that works cleanly here, and the responder must work the perimeters instead.

What you can collect: the egress flow records and byte-counts at the choke point (encrypted payload, but the volume and timing are evidence); the attestation history for every GPU in tenant A's allocation (did a measurement drift, when, against which RIM baseline); the scheduler and storage-access logs (which shards were read, by which job, under which credential); the host-side metadata that crosses the PCIe boundary in plaintext (queue headers, address tables); the badge/physical-access record correlated to the credential.

What you cannot collect without breaking the TEE and the contract: the in-enclave memory, the decrypted weights, the plaintext of the workload's computation.

The chain-of-custody discipline: attestation logs and flow records must be cryptographically signed and write-once at generation, because in a confidential incident they are the evidence a court or an insurer will weigh, and they are useless if the operator could have altered them. The blast-radius question: because uncore side-channels (NVENC/NVDEC/NVJPEG, DRAM-frequency scaling) bypass both MIG and MPS isolation, a 'tenant A' incident may have exposed every tenant who shared the physical GPU's uncore — so the custody and notification scope is the co-residency set, not the named tenant. The tear-down decision: destroying the enclave to fully contain also destroys volatile evidence and the tenant's running work; the playbook must specify who authorizes that trade and on what threshold. This is why forensic readiness for confidential systems is negotiated at onboarding and rehearsed in advance — see Chapter 11.5 for the CC internals and Chapter 11.6 for the isolation-failure modes the responder is reasoning about.
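The blast-radius question is, mechanically, a join over scheduler allocation records: every tenant whose time on a physical GPU overlapped the flagged tenant's time on that same GPU during the incident window. A minimal sketch follows, assuming allocation records are available as (tenant, physical GPU, start, end) tuples. The `Allocation` type and field names are hypothetical, and real records would also need to account for MIG slices and MPS clients mapping back to a physical device.

```python
from typing import NamedTuple


class Allocation(NamedTuple):
    tenant: str
    gpu_id: str     # physical GPU, not a MIG slice or MPS client
    start: float    # allocation start time
    end: float      # allocation end time


def co_residency_set(allocations, flagged_tenant, window_start, window_end):
    """Tenants that shared a physical GPU with the flagged tenant during the window."""
    allocations = list(allocations)
    # The flagged tenant's own allocations that fall inside the incident window.
    flagged = [
        a for a in allocations
        if a.tenant == flagged_tenant and a.start < window_end and window_start < a.end
    ]
    exposed = {flagged_tenant}
    for f in flagged:
        for a in allocations:
            # Same physical GPU and overlapping in time with the flagged allocation.
            if a.gpu_id == f.gpu_id and a.start < f.end and f.start < a.end:
                exposed.add(a.tenant)
    return exposed
```

The returned set, not the tenant named in the alert, is the custody and notification scope. The quadratic scan is fine for a single incident; a production version would index allocations by GPU.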

241 days

mean time to identify + contain a breach in 2025 (lowest in 9 years) — the dwell window your retention must outlast

2025 · IBM Cost of a Data Breach 2025

$4.44M / $10.22M

average breach cost: global down 9% to $4.44M; US at an all-time high of $10.22M

2025 · IBM Cost of a Data Breach 2025

SP 800-61r3

NIST IR guidance restructured onto CSF 2.0 (Govern/Identify/Protect/Detect/Respond/Recover); first revision since 2012

Apr 2025 · NIST SP 800-61 Rev. 3

~99.78%

GPU registers hidden by the BAR0 decoupler in confidential mode (vs ~7.94% normal) — the forensic opacity the SOC works around

2025 · NVIDIA WP-12554 / arXiv 2507.02770

5 / 64

certificate device-identity chain and structured measurement records (NRAS/RIM) that become the forensic record on confidential systems

2026 · NVIDIA Secure AI whitepaper (domain synthesis)

CVE-2025-23290 / -23285

documented multi-tenant GPU escape (cross-VM disclosure) and cross-tenant DoS — the isolation-breach playbook's design case

2025 · NVIDIA security bulletins (domain research)

Mar 1, 2026

IRGC drone strikes on AWS facilities (UAE/Bahrain) — aerial/kinetic attack now an IR design case, not tail-risk

2026 · Domain research / open reporting

~$4B

projected data-center physical-security spend by 2030 (~2x), reflecting the converged cyber-physical posture

2026 · Security-domain research synthesis

Resilience, red-teaming, threat intelligence, and metrics

Detection and response are the reactive half of security operations; the proactive half is the program that finds the gaps before the adversary does and measures whether the whole apparatus is improving. Four functions.

Resilience and continuity. The SOC is itself an asset an adversary will target — a campaign frequently begins by blinding or degrading the defenders. The operational-security plan must assume the SOC's primary tooling, its telemetry pipeline, or its incident-command channel could be the thing under attack, with out-of-band communications and a documented fallback to manual operation. This dovetails with the facility-resilience program of Part 12 (Chapter 12.3 on disaster recovery and continuity): a security incident and a reliability incident increasingly share a root cause and a recovery path, and the runbooks should be co-designed rather than maintained in separate binders.

Red-teaming. The only honest test of the Part 11 fortress is an adversary actually attacking it. AI-campus red-teaming must exercise the seams the converged SOC was built to cover: a purple-team exercise that starts with a simulated firmware anomaly and walks it through to a cyber-physical escalation tests the detection plane, the playbook, the escalation trigger, and the incident commander in one run. The output is not a vulnerability list but a set of detection gaps and a measured time-to-detect for each attack path, fed back into the telemetry-coverage map above.

Threat intelligence. Generic threat feeds under-serve this environment; the relevant intelligence is AI-infrastructure-specific. The dominant nation-state pattern is pre-positioning: living-off-the-land actors using legitimate built-in tools to evade detection and dwell for the long term in critical infrastructure, with the goal of disruption during a future conflict rather than immediate theft (CISA/Microsoft on Volt Typhoon, 2023–2026). That threat model — a patient, stealthy, OT-aware adversary already inside the walls — is what justifies the long retention, the attestation-drift alerting, and the OT telemetry that a theft-focused program would skip. Threat intelligence about which frontier labs and which silicon are being targeted feeds directly into the target Weights Security Level (Chapter 11.1).

Metrics. What you measure the SOC against determines what it optimizes, and the wrong metric is expensive. Borrowing the facility's availability nines as the SOC's target rewards uptime and ignores the slow, silent exfiltration that never causes an outage. The right metric set is detection-and-response-centric — MTTD and MTTC per attack class, detection coverage as a fraction of the mapped attack surface, percentage of playbooks rehearsed in the last quarter — and, in keeping with this guide's recurring lens, goodput preserved per incident: a fast, surgical containment that severs one tenant's egress and keeps the cluster running preserves far more return than a slow, blunt response that halts the campus. The SOC that measures itself on goodput-preserved rather than incidents-closed makes the right trade between containment and continuity, which is the trade this entire chapter has been about.
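The metric set above can be computed from a plain incident log. The sketch below is one possible formulation, not a standard: MTTC is taken here as detection-to-containment time, and 'goodput preserved' is defined, as an assumption for illustration, as the fraction of GPU-hours in an accounting window that the response did not take out of service. All names and fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    attack_class: str
    t_start: float      # hours: adversary activity begins
    t_detect: float     # hours: SOC detects
    t_contain: float    # hours: containment complete
    gpus_halted: int    # GPUs taken out of service by the response
    halt_hours: float   # how long they stayed out of service


def mttd_mttc(incidents):
    """Return {attack_class: (mean time to detect, mean time to contain)} in hours."""
    by_class = defaultdict(list)
    for inc in incidents:
        by_class[inc.attack_class].append(inc)
    return {
        cls: (
            mean([i.t_detect - i.t_start for i in group]),
            mean([i.t_contain - i.t_detect for i in group]),
        )
        for cls, group in by_class.items()
    }


def detection_coverage(mapped_paths, detected_paths):
    """Fraction of the mapped attack surface with a working detection (mapped must be non-empty)."""
    mapped = set(mapped_paths)
    return len(mapped & set(detected_paths)) / len(mapped)


def goodput_preserved(inc, total_gpus, window_hours):
    """Share of GPU-hours in the window that the response left running (assumed definition)."""
    lost = inc.gpus_halted * inc.halt_hours
    return 1.0 - lost / (total_gpus * window_hours)
```

Under this definition, a response that halts 512 GPUs for two hours on a 100,000-GPU cluster preserves more than 99.9% of a 24-hour window, while halting the whole cluster for six hours preserves 75%. That gap is the containment-versus-continuity trade the paragraph above describes.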

Detection is the falsification test for the whole security program

Every control in Part 11 is a claim that an attack will be stopped or slowed. Security operations is the only function that can tell you whether the claim is true in production — and the way it tells you is by detecting the attacks that got through. A SOC that never fires is not a sign the fortress is impregnable; on a campus this targeted, it is far more likely a sign the SOC is blind. The maturity signal is a measured, improving time-to-detect across a deliberately mapped attack surface, validated by a red team that keeps finding (and closing) the seams. Treat your detection rate as the experiment that falsifies your prevention hypotheses, and you will spend the next security dollar where the evidence — not the vendor — says the gap actually is.

This chapter is the operational close of Part 11 and feeds the campus incident-command model. The threat model and Weights Security Levels it detects against are in Chapter 11.1; physical and aerial threats in Chapter 11.2; firmware/RoT integrity (the attestation signal) in Chapter 11.4; GPU confidential computing (the forensic opacity) in Chapter 11.5; multi-tenant isolation failures (the isolation-breach playbook) in Chapter 11.6; segmentation and egress control (the exfiltration choke point) in Chapter 11.7; weight protection in Chapter 11.8; insider threat in Chapter 11.9; and cyber-physical/OT attacks (the escalation trigger's design case) in Chapter 11.10, with compliance and audit logging in Chapter 11.11. The unified incident-command interface and converged escalation belong to Chapter 14.11; the grid-trip transient this SOC races to prevent is the physics of Chapter 4.5; continuity and disaster recovery co-designed with the security runbooks live in Chapter 12.3.