Chapter 10.3

Multi-Tenancy, Isolation & Resource Sharing

Sharing a GPU is two decisions, not one — a performance-isolation choice (whole-GPU, MIG, MPS, time-slicing, fractional) and a security-isolation choice (process, container, VM, confidential VM) — and conflating them is how operators sell a 'partition' as a 'boundary' it was never built to be.

What you'll decide here

Where on the sharing spectrum each tenant class sits — whole-GPU for tightly-coupled training, hardware-partitioned MIG for SLO-bound serving, MPS or time-slicing for dev/notebook fleets — and therefore how much of your fleet's idle silicon you can actually reclaim.
Whether MIG/MPS partitioning is being relied on as a performance boundary (defensible) or as a security boundary between mutually-distrusting tenants (it is not — that is a VM/CVM job).
The quota and fairness model — hard partitions vs hierarchical fair-share with preemption vs deadline/priority QoS — and which one your scheduler (Slurm Fair Tree, Kubernetes + KAI/Run:ai) can actually enforce.
The isolation tier per tenant trust level: shared-kernel containers for one trust domain, VMs across trust domains, confidential VMs + GPU TEE when the operator itself must be untrusted.
How you bound the noisy neighbor — at the hardware (MIG), the scheduler (limits, priorities, gang admission), or the fabric (per-tenant QoS) — before a single greedy job silently taxes everyone else's goodput.

Multi-tenancy is the lever that turns a depreciating pile of accelerators into a utility. A frontier pre-training run wants the whole machine to itself; almost everything else — fine-tuning, batch jobs, interactive serving, notebooks, CI — under-fills a modern GPU. A single B200 carries 180–192 GB of HBM and petaFLOP-class throughput; a 7B-parameter serving replica or a data-science notebook touches a fraction of it. Left whole, that GPU runs at single-digit utilization while its 2–3 year economic life burns down (Chapter 1.8). The entire discipline of this chapter is reclaiming that stranded silicon without letting one tenant's work corrupt, starve, or spy on another's.

The trap here is specific and recurring: operators pick a sharing mechanism for a utilization reason, then quietly assume it also delivers isolation. It does not. Performance and security are different axes that happen to share the word "partition." MIG gives you hardware-enforced performance isolation but is not a confidentiality boundary against a determined co-tenant. A container gives you a namespace but shares a kernel and a GPU driver with everyone else on the node. This chapter takes three questions in turn: the sharing spectrum (how the silicon is divided), quota and fairness (who gets how much, and when), and isolation (what one tenant can do to another). Get the mapping wrong and you either leave money on the floor or you put two adversarial tenants behind a wall that was only ever a speed bump.

There is no single "GPU sharing" feature. There are five mechanisms that sit at different points on a curve trading isolation strength against packing flexibility, and they compose — you can run MIG instances, each with MPS inside, scheduled by a fractional-GPU plugin. Read them as a spectrum from hardest partition to softest.

Whole-GPU is the degenerate case and the correct default for tightly-coupled training: one job owns the device, the NVLink domain, and the memory. No sharing overhead, no noisy neighbor, full bandwidth for collectives. The cost is utilization — anything that does not saturate the GPU wastes it.

MIG (Multi-Instance GPU) is the only mechanism that partitions the silicon in hardware: it carves the GPU into up to seven instances, each with a dedicated slice of SM compute, L2 cache, memory controllers, and a fenced region of HBM. On a B200 or GB200 you can cut two instances of ~93 GB, four of ~46 GB, or seven of ~23 GB. Because the partition is physical, one instance cannot consume another's memory bandwidth or compute. That is the strongest performance isolation short of separate GPUs, with a predictable QoS that makes it the default for SLO-bound inference. The cost is rigidity: you must drain and reconfigure the GPU to change the geometry, slices come in fixed sizes, and a job that needs 30 GB does not fit a ~23 GB slice, so it has to take a ~46 GB slice and waste roughly 16 GB of it.
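The fixed-slice waste described above can be made concrete with a few lines of arithmetic. This is a minimal sketch that uses the approximate slice sizes quoted in this chapter; the function name and the size list are illustrative assumptions, and real MIG profile names and exact capacities vary by SKU and driver release.

```python
# Approximate slice sizes (GB) quoted in this chapter for B200/GB200-class GPUs.
# Assumption: real profiles and exact capacities differ by SKU and driver.
SLICE_SIZES_GB = [23, 46, 93]


def smallest_fitting_slice(request_gb, sizes=SLICE_SIZES_GB):
    """Return (slice_gb, wasted_gb) for the smallest slice that fits the request.

    Returns None when no slice is large enough, i.e. the job needs the whole GPU.
    """
    for size in sorted(sizes):
        if request_gb <= size:
            return size, size - request_gb
    return None


print(smallest_fitting_slice(30))   # (46, 16)
print(smallest_fitting_slice(20))   # (23, 3)
print(smallest_fitting_slice(120))  # None
```

The 30 GB request strands about a third of its slice, which is the rigidity cost in miniature.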

MPS (Multi-Process Service) lets multiple processes submit kernels to one GPU context concurrently, which is true spatial sharing of the SMs, not just interleaving. It is the right tool when several small, cooperative, same-trust-domain workloads (e.g. a fleet of tiny inference replicas) can fill a GPU together. The cost is the absence of a hard wall: MPS provides only soft, optional compute and memory limits, and there is no fault containment between clients, so an out-of-memory condition or a fault in one client can take down the shared context for all of them. It buys throughput by assuming the co-tenants trust each other.

Time-slicing is the GPU equivalent of a context switch: the scheduler round-robins the whole device between processes, each getting the full GPU for a slice of time. It needs no special hardware and works on any GPU, which is why it is the lowest-common-denominator sharing mode for dev, notebooks, and bursty low-priority work. The cost is latency jitter and zero memory isolation — every tenant sees the full device, can over-allocate HBM, and pays context-switch overhead; it is unusable for anything with a tight tail-latency SLO.

Fractional GPU is the scheduler-level abstraction (Run:ai, KAI, and similar) that lets you request "0.5 of a GPU" or "10 GB of a GPU" and have the platform place the workload via MPS or memory limits under the hood. It is the most flexible for bin-packing a mixed fleet and the friendliest developer experience, but its isolation is only as strong as the primitive it lands on — usually a soft MPS-class limit, occasionally MIG. Treat the fraction as a billing and packing construct, not a security boundary.

The GPU sharing spectrum — mechanism vs guarantee

Mechanism	Layer	Memory isolation	Compute isolation	Reconfig cost	Best fit
Whole-GPU	Device	Total (sole owner)	Total	None — it is the whole device	Tightly-coupled training; max-throughput serving
MIG	Hardware partition	Hard — fenced HBM per instance	Hard — dedicated SMs + L2	High — drain + reconfigure geometry	SLO-bound multi-tenant inference; predictable QoS
MPS	Process / CUDA context	Soft — optional limits, shared context	Concurrent SMs; soft caps only	Low — per-process	Small same-trust co-tenants filling one GPU
Time-slicing	Scheduler (temporal)	None — full device per slice	None — round-robin whole GPU	None	Dev, notebooks, bursty low-priority batch
Fractional GPU	Cluster scheduler	Depends on backing primitive	Depends on backing primitive	Low — logical request	Bin-packing a mixed dev/serving fleet

Hardware partition = MIG only. MPS/time-slicing/fractional give utilization, not a trust boundary. Instance counts are 2026 Blackwell-class (B200/GB200, up to 7 MIG instances); see the key numbers for sources.

The fork: hardware partition vs scheduler fraction

The cleanest sharing decision is whether you need a hardware partition (MIG) or a scheduler fraction (MPS/time-slicing/fractional). Choose MIG when you owe a tenant a guaranteed SLO and must prevent a neighbor from stealing bandwidth or compute — the partition is physical and the QoS is deterministic, at the price of fixed slice sizes and a drain-to-reconfigure penalty. Choose a scheduler fraction when you are packing many cooperative, same-trust-domain, bursty small jobs and want elastic bin-packing — you reclaim more silicon, but the isolation is soft and one greedy client can tax the others. The downstream cost of getting this backwards: MIG on a bursty dev fleet strands slices that sit idle inside fixed geometry; soft fractions under a latency SLO blow your p99 the first time a neighbor spikes.

Quota and fairness: who gets how much, and when

Dividing a single GPU is the easy half. The hard half is governing a 10,000-GPU fleet so that twenty teams, each convinced their work is most urgent, share it without a tragedy of the commons. Three models dominate, and the right one depends on whether your tenants distrust each other and whether your workloads are elastic.

Hard partitions (static quota). Each tenant gets a fixed slice of the cluster — N GPUs, full stop — enforced by namespace or partition. Dead simple, perfectly predictable, and the right model when tenants are external customers paying for guaranteed capacity (the neocloud and colo posture; Chapter 10.9). The cost is stranded capacity. When Tenant A is idle and Tenant B is queued, the idle GPUs sit dark because the partition forbids lending. Fleet utilization caps out well below what the silicon could deliver.
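A toy calculation shows how much capacity a static quota can strand compared with a model that lends idle GPUs. The tenant names, quotas, and demand figures below are invented for illustration, and the function is a sketch of the arithmetic only, not of any scheduler's behavior.

```python
def gpus_in_use(quota, demand, lend_idle):
    """Count busy GPUs given per-tenant quota and demand.

    lend_idle=False models hard partitions: a tenant can never exceed its quota.
    lend_idle=True models an idealized lending scheduler: any idle GPU can serve
    any queued demand.
    """
    total = sum(quota.values())
    if lend_idle:
        return min(total, sum(demand.values()))
    return sum(min(quota[t], demand[t]) for t in quota)


# Hypothetical fleet: tenant A is nearly idle, tenant B is oversubscribed.
quota = {"A": 64, "B": 64}
demand = {"A": 8, "B": 120}

hard = gpus_in_use(quota, demand, lend_idle=False)
shared = gpus_in_use(quota, demand, lend_idle=True)
print(hard, shared)                # 72 128
print(hard / sum(quota.values()))  # 0.5625
```

In this example 56 GPUs sit dark under hard partitions while tenant B still has 56 GPUs of demand queued.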

Hierarchical fair-share with preemption. The HPC-and-internal-platform model: tenants get a target share, not a hard cap, and the scheduler lets anyone borrow idle capacity, then preempts the borrower when the rightful owner returns. Slurm implements this as multifactor priority and Fair Tree — a tenant that has under-consumed its share recently floats to the top of the queue; an over-consumer sinks. Kubernetes-native schedulers (KAI, Run:ai, Volcano) implement the same idea as hierarchical queues with reclaim. This is the model that actually drives high utilization, because idle silicon is always lendable. The cost is complexity and the preemption tax: a borrowed, preempted job must checkpoint and resume, so it is only sane for interruption-tolerant work (Chapter 9.4).
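The core idea of fair-share ordering, that under-consumers float up and over-consumers sink, can be sketched by ranking tenants on recent usage relative to entitled share. This is a deliberately simplified illustration and not Slurm's Fair Tree algorithm, which walks an account hierarchy and applies usage decay; the `Tenant` class, its fields, and the sample numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    share: float         # target fraction of the fleet
    recent_usage: float  # fraction of recent GPU-hours actually consumed


def fair_share_order(tenants):
    """Order tenants so the one furthest below its share is served first."""
    return sorted(tenants, key=lambda t: t.recent_usage / t.share)


tenants = [
    Tenant("vision", share=0.5, recent_usage=0.7),  # ratio 1.4, over-consumer
    Tenant("speech", share=0.3, recent_usage=0.1),  # ratio ~0.33, under-consumer
    Tenant("nlp", share=0.2, recent_usage=0.2),     # ratio 1.0, on target
]
order = [t.name for t in fair_share_order(tenants)]
print(order)  # ['speech', 'nlp', 'vision']
```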

Priority / deadline QoS. Layered on top of either: jobs carry a priority or a deadline, and the scheduler admits, preempts, and orders accordingly. Production-serving jobs outrank experiments, and a batch sweep with a deadline gets escalated as its window closes. This is where you encode that an SLO-bound inference tenant must never be preempted by a speculative training run. The cost is that priority is only meaningful if it is scarce and policed; the moment every team sets priority=high, you are back to FIFO.
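The collapse to FIFO is easy to demonstrate. The sketch below orders a queue by priority and breaks ties by submission time, which is an assumed (though common) tie-break rule; the job names and priority values are made up.

```python
def queue_order(jobs):
    """Highest priority first; ties are broken by earliest submission time."""
    ranked = sorted(jobs, key=lambda j: (-j["priority"], j["submitted"]))
    return [j["name"] for j in ranked]


jobs = [
    {"name": "exp-a", "priority": 1, "submitted": 1},
    {"name": "serve", "priority": 10, "submitted": 2},
    {"name": "exp-b", "priority": 1, "submitted": 3},
]
print(queue_order(jobs))      # ['serve', 'exp-a', 'exp-b']

# Every team marks its job high priority: only submission order is left.
inflated = [dict(j, priority=10) for j in jobs]
print(queue_order(inflated))  # ['exp-a', 'serve', 'exp-b']
```

Once priorities are uniform the production job waits behind whatever was submitted first, which is why priority has to be scarce and policed.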

One non-negotiable couples all three to the prior chapter: GPU jobs are usually gang-scheduled — a 64-GPU job needs all 64 at once or none, or it deadlocks holding resources it cannot use. Quota and fairness must therefore admit and preempt at the granularity of the whole gang, topology-aware, or the fairness model fights the placement model (Chapter 10.2).
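All-or-nothing admission can be sketched as follows. This is an illustrative toy, assuming a flat pool of interchangeable GPUs; it ignores topology, quota, and preemption, and it lets smaller gangs backfill past a waiting one, which a real scheduler may or may not allow.

```python
def admit_gangs(free_gpus, requests):
    """Admit each gang whole or not at all.

    requests is a list of (job_name, gang_size) in queue order.
    Returns (admitted, waiting, free_left). A gang never receives a partial
    allocation, so no job holds GPUs it cannot use.
    """
    admitted, waiting = {}, []
    for job, size in requests:
        if size <= free_gpus:
            admitted[job] = size
            free_gpus -= size
        else:
            waiting.append(job)  # waits for a full allocation
    return admitted, waiting, free_gpus


admitted, waiting, left = admit_gangs(
    100, [("train-a", 64), ("train-b", 64), ("eval", 16)]
)
print(admitted)  # {'train-a': 64, 'eval': 16}
print(waiting)   # ['train-b']
print(left)      # 20
```

Without the all-or-nothing rule, the two 64-GPU jobs could each grab part of the 100 free GPUs and then wait on each other indefinitely.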

Quota / fairness models — the utilization-vs-predictability fork

Model	Utilization	Predictability	Preemption	Enforced by	Best fit
Hard partition (static quota)	Low — idle capacity stranded	Highest — guaranteed N GPUs	None	Namespace / Slurm partition	External paying tenants; guaranteed capacity
Hierarchical fair-share	High — idle GPUs lendable	Soft — target share, not a cap	Required (reclaim)	Slurm Fair Tree; KAI / Run:ai queues	Internal multi-team platforms; elastic work
Priority / deadline QoS	Tunable	Conditional on priority discipline	Priority-driven	Job priority + scheduler policy	Mixed prod-serving + research on one fleet

Choose by tenant trust and workload elasticity. Fair-share drives utilization; hard partitions strand it. Enforcement examples are 2026-current.

up to 7

MIG instances per GPU (B200/GB200): 2×~93GB, 4×~46GB, or 7×~23GB profiles

2025 · NVIDIA MIG User Guide (r580); MIG supported-profiles docs

180–192 GB

HBM per Blackwell GPU available to partition across tenants (B200/GB200 class)

2026 · NVIDIA Blackwell datasheets; provenance.js HBM trajectory

CVSS 9.0

NVIDIAScape (CVE-2025-23266) Container Toolkit escape — container-to-host on shared GPU nodes

2025 · Wiz Research; NVIDIA security bulletin

CVSS 2.5

CVE-2025-23290 — first acknowledged cross-VM GPU-metric leak via vGPU Manager (co-tenant side channel)

2025 · NVIDIA security bulletin; Tenable

~70% / ~20%

Slurm vs Kubernetes share of AI clusters — the two quota/fairness enforcement planes operators must master

2026 · HPCwire, 'Slurm vs Kubernetes in the Age of AI'

~2/3

inference share of AI compute in 2026 — the workload class that most rewards fractional/MIG sharing

2026 · Deloitte TMT Predictions 2026; McKinsey

2–3 yr

accelerated GPU economic life — the depreciation clock that makes reclaiming idle silicon urgent

2025 · Goldman Sachs; secondary-market analyses

The costliest error, and the one that shows up in real CVEs, is treating a sharing mechanism as an isolation boundary. Sharing answers "how is the silicon divided?" Isolation answers "what can one tenant do to another?" — corrupt their data, starve their compute, read their memory, or escape onto the host. The two axes are orthogonal, and the strength you need on the isolation axis is set by tenant trust, not by how you happen to be packing the GPU.

Stack the isolation tiers from weakest to strongest.

Process isolation (bare processes, or MPS clients) shares a kernel, a driver, and frequently a CUDA context: fine for one team's own jobs, useless across trust domains.

Container isolation adds namespaces and cgroups but still shares the host kernel and the GPU driver/runtime. The attack surface is the container runtime and the NVIDIA stack itself, exactly the surface the 2025 NVIDIAScape vulnerability (CVE-2025-23266, CVSS 9.0) punctured, letting a crafted container escape to the host on shared GPU nodes.

VM isolation gives each tenant its own kernel behind a hypervisor; the GPU is passed through (full device) or virtualized (vGPU/MIG-backed). This is stronger, but the vGPU manager is itself a shared component, and CVE-2025-23290 showed a guest reading global GPU metrics influenced by co-tenants, the first acknowledged cross-VM leakage through that layer.

Confidential VM + GPU TEE is the top tier: the workload runs in a CPU trusted execution environment, the GPU runs in confidential-compute mode with encrypted HBM and an attested, encrypted PCIe/NVLink channel, so even the operator (hypervisor, host OS, cloud admin) cannot read tenant data or weights. This is the canonical home of Chapter 11.5 (GPU confidential computing & attestation); the multi-tenant security architecture that wraps it is Chapter 11.6, and the network half (per-tenant microsegmentation, east-west zero-trust on the storage and management planes) is Chapter 11.7.

The decision rule is blunt: match the isolation tier to the trust boundary, then choose any sharing mechanism that fits inside it. Two jobs from the same team can share an MPS context behind a single container — the trust domain is one, so weak isolation is fine and you maximize packing. Two mutually-distrusting external customers must never share a kernel: that is a VM boundary at minimum, and a confidential VM when your own platform must be untrusted (the regulated, sovereign-AI, and model-weight-protection cases). MIG and vGPU sit awkwardly in between — strong performance isolation, but documented side channels (uncore counters, shared metrics, microarchitectural leakage) mean you do not sell them as a confidentiality boundary between adversaries. They are a QoS boundary that happens to live in hardware.
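The decision rule can be written down as a small lookup. The sketch below is one reading of the rule under two boolean inputs; the `Tier` enum, the function names, and the choice to express trust as two flags are assumptions made for illustration, and a real policy would carry more nuance.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Isolation tiers from this chapter, ordered weakest to strongest."""
    PROCESS = 1
    CONTAINER = 2
    VM = 3
    CONFIDENTIAL_VM = 4


def minimum_tier(same_trust_domain: bool, operator_trusted: bool) -> Tier:
    """Weakest tier that still matches the trust boundary."""
    if not operator_trusted:
        return Tier.CONFIDENTIAL_VM  # the platform itself is in the threat model
    if not same_trust_domain:
        return Tier.VM               # distrusting tenants never share a kernel
    return Tier.PROCESS              # one team: weak isolation, maximum packing


def is_sufficient(tier: Tier, same_trust_domain: bool, operator_trusted: bool) -> bool:
    """True when the deployed tier meets or exceeds the required minimum."""
    return tier >= minimum_tier(same_trust_domain, operator_trusted)


print(minimum_tier(True, True).name)               # PROCESS
print(minimum_tier(False, True).name)              # VM
print(is_sufficient(Tier.CONTAINER, False, True))  # False
```

Note that the sharing mechanism does not appear as an input: MIG, MPS, or a fraction is chosen afterwards, inside whatever tier this returns.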

MIG is a performance boundary, not a confidentiality boundary

It is tempting to market MIG or vGPU partitioning as tenant isolation — it carves the GPU in hardware, so surely it walls tenants off. It walls off bandwidth and compute. It does not, by itself, defeat a determined co-tenant trying to infer or exfiltrate: published work demonstrates covert and side channels (shared performance counters, uncore contention, microarchitectural timing) that cross MPS and MIG partitions, and 2025's vGPU-manager CVEs (cross-VM metric leakage; multiple driver vulnerabilities) show the shared virtualization layer is a real attack surface. The consequence of overselling it: an auditor or a breach discovers that two adversarial tenants were separated by a QoS feature, not a trust boundary. If the threat model includes mutually-distrusting tenants or the operator-as-adversary, the boundary is a VM or a confidential VM with GPU TEE and attestation (Chapter 11.5) — MIG can live inside that, but never instead of it.

Deep dive: the three real 2025–26 multi-tenant isolation failures and what each one teaches

The "is partitioning a security boundary?" debate stopped being theoretical in 2025. Three classes of failure, each pointing at a different layer of the stack:

1. Container-escape (NVIDIAScape, CVE-2025-23266, CVSS 9.0). A flaw in the NVIDIA Container Toolkit let a malicious container break out to the host on shared GPU nodes — the worst outcome in multi-tenancy, because host compromise means access to every other tenant on the box. The lesson: the GPU runtime and toolkit are part of your trust boundary, not just the kernel. Container isolation across trust domains is only as strong as the GPU plumbing underneath it, and that plumbing gets CVEs like any other privileged software. Patch cadence on the GPU Operator and Container Toolkit is a multi-tenancy security control, not a hygiene nicety (Chapter 10.4).

2. Cross-VM metric leakage (CVE-2025-23290). A guest VM could read global GPU metrics influenced by other VMs — low severity (CVSS 2.5) but conceptually important: it was the first acknowledged leak of co-tenant activity through the vGPU manager. The lesson: even a VM boundary leaks signal through shared observability surfaces. A side channel does not need to read your data to hurt you; reading your load can be enough in some threat models.

3. Microarchitectural side channels across MPS/MIG. Academic work ('Spy in the GPU-box' and uncore side-channel studies) demonstrated covert and timing channels that survive MPS and MIG partitioning. The lesson, and the through-line of all three: performance partitioning is not confidentiality. If your tenants are adversarial, you climb to VM or confidential-VM isolation; if they are one trust domain, all of these are acceptable risks and you optimize for packing. Know which world you are in, and never let a sales sheet blur the two.

Noisy neighbors and QoS guarantees

Even inside a single trust domain, where security is not the concern, multi-tenancy has a performance pathology: the noisy neighbor. One tenant's job saturates a shared resource and silently taxes everyone else's goodput — and on a shared GPU node the contended resource is rarely the one people watch. It is memory bandwidth (HBM is the bottleneck for most inference and the noisiest shared resource under MPS/time-slicing), PCIe/NVLink (a greedy host-to-device copy starves a neighbor's transfers), the shared fabric (a tenant's all-to-all floods the back-end and inflates everyone's collective time — congestion engineering in Chapter 8.6), and shared storage and the loader (one tenant's checkpoint write or dataset scan blows the cache for the rest).

You bound the noisy neighbor at one of three layers, hardest to softest. At the hardware: MIG, full stop — physical partitions make the noisy neighbor impossible by construction, which is exactly why it is the default for tenants you owe a hard SLO. At the scheduler: resource limits and requests, priority and preemption so a low-priority neighbor yields, and gang-aware admission so a job either gets its full topology-clean allocation or waits rather than half-landing and contending (Chapter 10.2). At the fabric and storage: per-tenant QoS classes, rate limits, and bandwidth reservations so no one tenant monopolizes the network or the loader. The consequence of skipping all three is the failure mode that is hardest to debug because nothing errors: every tenant's p99 latency quietly degrades, the SLO dashboard goes amber, and the cause is invisible unless your telemetry attributes contention per tenant — which is why per-tenant goodput and contention metrics belong in the observability plane from day one (Chapter 10.6).

The honest QoS guarantee follows from the mechanism. A hard guarantee ("this tenant gets exactly this SLO regardless of neighbors") requires a hard partition — MIG, or whole-GPU. A soft guarantee ("this tenant gets a target share, best-effort, degrading gracefully under contention") is what MPS, time-slicing, and fractional GPUs can deliver. Selling a hard SLO on a soft mechanism is the noisy-neighbor trap dressed up as a contract: it works in the demo, when the GPU is half-empty, and breaches the first time the cluster fills.
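A pre-sales or admission check for this rule could look like the sketch below. The mechanism names and the hard/soft split follow this chapter; the function names are hypothetical, and treating fractional GPUs as soft is a conservative assumption, since a fraction can occasionally be backed by MIG.

```python
# Mechanisms that can back a hard guarantee versus a soft one, per this chapter.
HARD_MECHANISMS = {"whole-gpu", "mig"}
SOFT_MECHANISMS = {"mps", "time-slicing", "fractional"}


def strongest_guarantee(mechanism: str) -> str:
    """Return 'hard' or 'soft' for a known sharing mechanism."""
    m = mechanism.lower()
    if m in HARD_MECHANISMS:
        return "hard"
    if m in SOFT_MECHANISMS:
        return "soft"
    raise ValueError(f"unknown mechanism: {mechanism}")


def offer_is_honest(mechanism: str, promised: str) -> bool:
    """A hard SLO is only honest on a hard partition; a soft one fits anywhere."""
    return promised == "soft" or strongest_guarantee(mechanism) == "hard"


print(offer_is_honest("MIG", "hard"))           # True
print(offer_is_honest("MPS", "hard"))           # False
print(offer_is_honest("time-slicing", "soft"))  # True
```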

The reclaim ladder: harvest idle silicon without breaking the SLO

The economically optimal multi-tenant fleet is rarely one mechanism — it is a ladder. Run SLO-bound serving on MIG instances with hard QoS (the noisy neighbor cannot touch them). Backfill the gaps — idle MIG slices, drained training nodes, off-peak capacity — with preemptible, interruption-tolerant work (batch inference, fine-tuning, eval sweeps) via hierarchical fair-share, so the reclaimed silicon turns into goodput instead of waste heat. The preemptible tier yields instantly when a guaranteed tenant needs the capacity back, checkpointing on the way out (Chapter 9.4). This is how a fleet hits high utilization and honors hard SLOs at once: the hard guarantees ride hardware partitions, the reclaim rides preemptible fair-share, and the depreciation clock (2–3 yr economic life) stops running against dark GPUs.
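The reclaim step of the ladder can be sketched as a loop that preempts borrowers until the guaranteed tenant's demand fits. This is a toy under stated assumptions: the borrower list is already in the order the operator wants jobs evicted, each preempted job checkpoints before releasing its GPUs, and the job names and sizes are invented.

```python
def reclaim(free_gpus, needed, borrowers):
    """Pick preemptible jobs to evict so a guaranteed tenant gets `needed` GPUs.

    borrowers is a list of (job_name, gpus_held) in eviction order.
    Returns (jobs_to_preempt, free_gpus_after). Raises if even evicting every
    borrower cannot satisfy the request.
    """
    to_preempt = []
    for job, gpus in borrowers:
        if free_gpus >= needed:
            break
        to_preempt.append(job)  # job checkpoints, then releases its GPUs
        free_gpus += gpus
    if free_gpus < needed:
        raise RuntimeError("guaranteed demand exceeds reclaimable capacity")
    return to_preempt, free_gpus


borrowers = [("eval-sweep", 4), ("finetune", 4), ("batch-infer", 8)]
print(reclaim(2, 8, borrowers))  # (['eval-sweep', 'finetune'], 10)
print(reclaim(8, 8, borrowers))  # ([], 8)
```

Evicting only as many borrowers as needed keeps the rest of the reclaimed silicon producing goodput.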

Putting it together: a per-tenant-class decision

The mistake is to pick one sharing mechanism, one quota model, and one isolation tier for the whole fleet. The defensible design classifies tenants and maps each class down all three axes at once.

A frontier training tenant: whole-GPU, hard partition or a top-priority gang reservation, container isolation (one trust domain), noisy neighbor bounded by sole ownership.
An external SLO-bound serving tenant: MIG for hard QoS, static quota for guaranteed capacity, VM-or-CVM isolation if mutually distrusting, noisy neighbor impossible by hardware partition.
An internal research/dev fleet: fractional GPU and time-slicing for packing, hierarchical fair-share with preemption for utilization, container isolation, noisy neighbor bounded by scheduler limits.
A preemptible batch tier: whatever packs tightest, lowest priority with instant reclaim, container isolation, explicitly soft QoS.

The same physical fleet runs all four. The art is keeping each class's mechanism, quota, and isolation tier internally consistent so the guarantees you sell are the guarantees the silicon actually enforces.

This chapter is the sharing-and-isolation framework; the pieces are engineered elsewhere. The scheduling plane that enforces quota and gang admission is Chapter 10.1; the topology-aware placement that fairness must respect is Chapter 10.2; the node software stack (driver, GPU Operator, Container Toolkit) whose CVEs define the container trust boundary is Chapter 10.4; per-tenant contention and goodput telemetry live in Chapter 10.6. The security axis is canonical downstream: GPU confidential computing, encrypted HBM, and attestation in Chapter 11.5; the multi-tenant workload-isolation security architecture in Chapter 11.6; network microsegmentation and zero-trust in Chapter 11.7. Fabric congestion that creates cross-tenant noise is Chapter 8.6; the checkpoint math that makes preemption affordable is Chapter 9.4; and the commercial terms that turn these guarantees into a product are Chapter 10.9.