Chapter 10.3
Multi-Tenancy, Isolation & Resource Sharing
Sharing a GPU is two decisions, not one — a performance-isolation choice (whole-GPU, MIG, MPS, time-slicing, fractional) and a security-isolation choice (process, container, VM, confidential VM) — and conflating them is how operators sell a 'partition' as a 'boundary' it was never built to be.
What you'll decide here
- Where on the sharing spectrum each tenant class sits — whole-GPU for tightly-coupled training, hardware-partitioned MIG for SLO-bound serving, MPS or time-slicing for dev/notebook fleets — and therefore how much of your fleet's idle silicon you can actually reclaim.
- Whether MIG/MPS partitioning is being relied on as a performance boundary (defensible) or as a security boundary between mutually-distrusting tenants (it is not — that is a VM/CVM job).
- The quota and fairness model — hard partitions vs hierarchical fair-share with preemption vs deadline/priority QoS — and which one your scheduler (Slurm Fair Tree, Kubernetes + KAI/Run:ai) can actually enforce.
- The isolation tier per tenant trust level: shared-kernel containers for one trust domain, VMs across trust domains, confidential VMs + GPU TEE when the operator itself must be untrusted.
- How you bound the noisy neighbor — at the hardware (MIG), the scheduler (limits, priorities, gang admission), or the fabric (per-tenant QoS) — before a single greedy job silently taxes everyone else's goodput.
Multi-tenancy is the lever that turns a depreciating pile of accelerators into a utility. A frontier pre-training run wants the whole machine to itself; almost everything else — fine-tuning, batch jobs, interactive serving, notebooks, CI — under-fills a modern GPU. A single B200 carries 180–192 GB of HBM and petaFLOP-class throughput; a 7B-parameter serving replica or a data-science notebook touches a fraction of it. Left whole, that GPU runs at single-digit utilization while its 2–3 year economic life burns down (Chapter 1.8). The entire discipline of this chapter is reclaiming that stranded silicon without letting one tenant's work corrupt, starve, or spy on another's.
The trap here is specific and recurring: operators pick a sharing mechanism for a utilization reason, then quietly assume it also delivers isolation. The two are different axes that happen to share the word "partition." MIG gives you hardware-enforced performance isolation but is not a confidentiality boundary against a determined co-tenant. A container gives you a namespace but shares a kernel and a GPU driver with everyone else on the node. This chapter keeps the two axes apart and treats the governance question that sits between them separately: the sharing spectrum (how the silicon is divided), quota and fairness (who gets how much, and when), and isolation (what one tenant can do to another). Get the mapping wrong and you either leave money on the floor or you put two adversarial tenants behind a wall that was only ever a speed bump.
The sharing spectrum: five ways to divide a GPU
There is no single "GPU sharing" feature. There are five mechanisms that sit at different points on a curve trading isolation strength against packing flexibility, and they compose — you can run MIG instances, each with MPS inside, scheduled by a fractional-GPU plugin. Read them as a spectrum from hardest partition to softest.
Whole-GPU is the degenerate case and the correct default for tightly-coupled training: one job owns the device, the NVLink domain, and the memory. No sharing overhead, no noisy neighbor, full bandwidth for collectives. The cost is utilization — anything that does not saturate the GPU wastes it.
MIG (Multi-Instance GPU) is the only mechanism that partitions the silicon in hardware: it carves the GPU into up to seven instances, each with a dedicated slice of SM compute, L2 cache, memory controllers, and a fenced region of HBM. On a B200 or GB200 you can cut two instances of ~93 GB, four of ~46 GB, or seven of ~23 GB. Because the partition is physical, one instance cannot consume another's memory bandwidth or compute. That is the strongest performance isolation short of separate GPUs, with a predictable QoS that makes it the default for SLO-bound inference. The cost is rigidity: you must drain and reconfigure the GPU to change the geometry, slices come in fixed sizes, and a job that needs 30 GB does not fit a ~23 GB slice from the 7-way split, so it has to take a ~46 GB slice and strand the remaining ~16 GB.
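The slice-size arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming only the three homogeneous geometries and approximate sizes quoted in the text; the names `SLICE_GB` and `smallest_fitting_slice` are hypothetical, and real MIG profiles may differ in exact capacity and allow other layouts.

```python
# Approximate per-instance HBM (GB) for the geometries quoted in the text.
# Assumption: homogeneous splits only; this is not a list of real MIG profile names.
SLICE_GB = {2: 93, 4: 46, 7: 23}  # instances per GPU -> GB per instance


def smallest_fitting_slice(need_gb):
    """Return (instances_per_gpu, slice_gb, stranded_gb) for the tightest fit.

    Returns None when no slice is large enough, i.e. the job needs a whole GPU.
    """
    fits = [(n, gb) for n, gb in SLICE_GB.items() if gb >= need_gb]
    if not fits:
        return None
    n, gb = min(fits, key=lambda f: f[1])  # smallest slice that still fits
    return n, gb, gb - need_gb


print(smallest_fitting_slice(30))   # the 30 GB job from the text
print(smallest_fitting_slice(20))   # fits the 7-way split
print(smallest_fitting_slice(120))  # too large for any slice
```

The stranded figure is the quantity to watch when choosing a geometry: a fleet of 30 GB jobs on 46 GB slices leaves roughly a third of each slice unused.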
MPS (Multi-Process Service) lets multiple processes submit kernels to one GPU context concurrently, which is true spatial sharing of the SMs, not just interleaving. It is the right tool when several small, cooperative, same-trust-domain workloads (e.g. a fleet of tiny inference replicas) can fill a GPU together. The cost is the absence of a hard wall: MPS provides only soft, optional compute/memory limits, and there is no fault containment between clients, so an out-of-memory condition or a fault in one client can take down the shared context. It buys throughput by assuming the co-tenants trust each other.
Time-slicing is the GPU equivalent of a context switch: the scheduler round-robins the whole device between processes, each getting the full GPU for a slice of time. It needs no special hardware and works on any GPU, which is why it is the lowest-common-denominator sharing mode for dev, notebooks, and bursty low-priority work. The cost is latency jitter and zero memory isolation — every tenant sees the full device, can over-allocate HBM, and pays context-switch overhead; it is unusable for anything with a tight tail-latency SLO.
Fractional GPU is the scheduler-level abstraction (Run:ai, KAI, and similar) that lets you request "0.5 of a GPU" or "10 GB of a GPU" and have the platform place the workload via MPS or memory limits under the hood. It is the most flexible for bin-packing a mixed fleet and the friendliest developer experience, but its isolation is only as strong as the primitive it lands on — usually a soft MPS-class limit, occasionally MIG. Treat the fraction as a billing and packing construct, not a security boundary.
| Mechanism | Layer | Memory isolation | Compute isolation | Reconfig cost | Best fit |
|---|---|---|---|---|---|
| Whole-GPU | Device | Total (sole owner) | Total | None — it is the whole device | Tightly-coupled training; max-throughput serving |
| MIG | Hardware partition | Hard — fenced HBM per instance | Hard — dedicated SMs + L2 | High — drain + reconfigure geometry | SLO-bound multi-tenant inference; predictable QoS |
| MPS | Process / CUDA context | Soft — optional limits, shared context | Concurrent SMs; soft caps only | Low — per-process | Small same-trust co-tenants filling one GPU |
| Time-slicing | Scheduler (temporal) | None — full device per slice | None — round-robin whole GPU | None | Dev, notebooks, bursty low-priority batch |
| Fractional GPU | Cluster scheduler | Depends on backing primitive | Depends on backing primitive | Low — logical request | Bin-packing a mixed dev/serving fleet |
Quota and fairness: who gets how much, and when
Dividing a single GPU is the easy half. The hard half is governing a 10,000-GPU fleet so that twenty teams, each convinced their work is most urgent, share it without a tragedy of the commons. Three models dominate, and the right one depends on whether your tenants distrust each other and whether your workloads are elastic.
Hard partitions (static quota). Each tenant gets a fixed slice of the cluster — N GPUs, full stop — enforced by namespace or partition. Dead simple, perfectly predictable, and the right model when tenants are external customers paying for guaranteed capacity (the neocloud and colo posture; Chapter 10.9). The cost is stranded capacity. When Tenant A is idle and Tenant B is queued, the idle GPUs sit dark because the partition forbids lending. Fleet utilization caps out well below what the silicon could deliver.
Hierarchical fair-share with preemption. The HPC-and-internal-platform model: tenants get a target share, not a hard cap, and the scheduler lets anyone borrow idle capacity, then preempts the borrower when the rightful owner returns. Slurm implements this as multifactor priority and Fair Tree — a tenant that has under-consumed its share recently floats to the top of the queue; an over-consumer sinks. Kubernetes-native schedulers (KAI, Run:ai, Volcano) implement the same idea as hierarchical queues with reclaim. This is the model that actually drives high utilization, because idle silicon is always lendable. The cost is complexity and the preemption tax: a borrowed, preempted job must checkpoint and resume, so it is only sane for interruption-tolerant work (Chapter 9.4).
Priority / deadline QoS. Layered on top of either: jobs carry a priority or a deadline, and the scheduler admits, preempts, and orders accordingly. Production-serving jobs outrank experiments, and a batch sweep with a deadline is escalated as its window closes. This is where you encode that an SLO-bound inference tenant must never be preempted by a speculative training run. The cost is that priority is only meaningful if it is scarce and policed; the moment every team sets priority=high, you are back to FIFO.
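How priority layers on top of fair-share can be sketched as a two-level sort. This is a simplified illustration, not Slurm's Fair Tree algorithm or the KAI/Run:ai queue logic: it assumes an exponential factor of the form 2^(-usage/share), and the `Job` type, tenant names, and priority scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    tenant: str
    priority: int  # higher runs first; hypothetical scale


def fairshare_factor(usage_fraction, share_fraction):
    """1.0 for an idle tenant, 0.5 at exactly its share, toward 0 when over-consuming."""
    return 2.0 ** (-(usage_fraction / share_fraction))


def order_queue(jobs, usage, shares):
    """Order by priority first, then by fair-share factor (under-consumers first)."""
    return sorted(
        jobs,
        key=lambda j: (j.priority, fairshare_factor(usage[j.tenant], shares[j.tenant])),
        reverse=True,
    )


shares = {"team-a": 0.5, "team-b": 0.5}   # target share of the fleet
usage = {"team-a": 0.8, "team-b": 0.1}    # recent consumption of the fleet
jobs = [
    Job("a-train", "team-a", 1),
    Job("b-sweep", "team-b", 1),
    Job("a-serve", "team-a", 9),
]
ordered = [j.name for j in order_queue(jobs, usage, shares)]
print(ordered)
```

In this toy queue the production-serving job goes first despite its tenant having over-consumed, and among equal-priority jobs the under-consuming tenant floats ahead. If every job carried the same priority, the ordering would collapse to fair-share alone.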
One non-negotiable couples all three to the prior chapter: GPU jobs are usually gang-scheduled — a 64-GPU job needs all 64 at once or none, or it deadlocks holding resources it cannot use. Quota and fairness must therefore admit and preempt at the granularity of the whole gang, topology-aware, or the fairness model fights the placement model (Chapter 10.2).
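The all-or-nothing rule can be sketched as an admission check that either returns a complete placement or reserves nothing. This is a minimal illustration under stated assumptions: `admit_gang` is a hypothetical helper, and preferring the nodes with the most free GPUs is only a crude stand-in for real topology awareness.

```python
def admit_gang(free_gpus_by_node, gpus_needed):
    """Return {node: gpu_count} covering the whole gang, or None.

    Nothing is reserved on failure, so a partial gang never holds GPUs
    it cannot use while waiting for the rest.
    """
    plan = {}
    remaining = gpus_needed
    # Fill the emptiest nodes first so the gang spans as few nodes as possible.
    for node, free in sorted(free_gpus_by_node.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
    return plan if remaining == 0 else None


print(admit_gang({"n1": 8, "n2": 8, "n3": 4}, 16))  # fits on two full nodes
print(admit_gang({"n1": 8, "n2": 4}, 16))           # cannot fit: admit nothing
```

A fairness model that preempts individual GPUs rather than whole gangs would break this invariant from the other side, which is why reclaim has to operate at the same granularity as admission.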
| Model | Utilization | Predictability | Preemption | Enforced by | Best fit |
|---|---|---|---|---|---|
| Hard partition (static quota) | Low — idle capacity stranded | Highest — guaranteed N GPUs | None | Namespace / Slurm partition | External paying tenants; guaranteed capacity |
| Hierarchical fair-share | High — idle GPUs lendable | Soft — target share, not a cap | Required (reclaim) | Slurm Fair Tree; KAI / Run:ai queues | Internal multi-team platforms; elastic work |
| Priority / deadline QoS | Tunable | Conditional on priority discipline | Priority-driven | Job priority + scheduler policy | Mixed prod-serving + research on one fleet |
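The utilization gap between the first two rows can be made concrete with a toy calculation. The tenant names, quotas, and demand figures below are invented for illustration, and the model ignores preemption overhead and placement constraints.

```python
def utilization_static(demand, quota):
    """Hard partitions: each tenant is capped at its own quota; idle quota is not lent."""
    total = sum(quota.values())
    used = sum(min(demand[t], quota[t]) for t in quota)
    return used / total


def utilization_fair_share(demand, quota):
    """Fair-share with borrowing: any idle GPU can serve any tenant's queued demand."""
    total = sum(quota.values())
    return min(sum(demand.values()), total) / total


quota = {"tenant-a": 512, "tenant-b": 512}   # GPUs entitled
demand = {"tenant-a": 64, "tenant-b": 900}   # GPUs wanted right now

print(f"static quota: {utilization_static(demand, quota):.0%}")
print(f"fair-share:   {utilization_fair_share(demand, quota):.0%}")
```

With one tenant nearly idle and the other oversubscribed, the static model leaves 448 of tenant-a's GPUs dark while tenant-b queues; lending recovers most of them at the price of reclaim when tenant-a returns.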
Isolation models: the axis everyone conflates with sharing
The costliest error, and the one that shows up in real CVEs, is treating a sharing mechanism as an isolation boundary. Sharing answers "how is the silicon divided?" Isolation answers "what can one tenant do to another?" — corrupt their data, starve their compute, read their memory, or escape onto the host. The two axes are orthogonal, and the strength you need on the isolation axis is set by tenant trust, not by how you happen to be packing the GPU.
Stack the isolation tiers from weakest to strongest:
- Process isolation (bare processes, or MPS clients) shares a kernel, a driver, and frequently a CUDA context. It is fine for one team's own jobs and useless across trust domains.
- Container isolation adds namespaces and cgroups but still shares the host kernel and the GPU driver/runtime. The attack surface is the container runtime and the NVIDIA stack itself, exactly the surface the 2025 NVIDIAScape vulnerability (CVE-2025-23266, CVSS 9.0) punctured, letting a crafted container escape to the host on shared GPU nodes.
- VM isolation gives each tenant its own kernel behind a hypervisor; the GPU is passed through (full device) or virtualized (vGPU/MIG-backed). This is stronger, but the vGPU manager is itself a shared component, and CVE-2025-23290 showed a guest reading global GPU metrics influenced by co-tenants, the first acknowledged cross-VM leakage through that layer.
- Confidential VM + GPU TEE is the top tier: the workload runs in a CPU trusted execution environment, and the GPU runs in confidential-compute mode with encrypted HBM and an attested, encrypted PCIe/NVLink channel, so even the operator (hypervisor, host OS, cloud admin) cannot read tenant data or weights.
GPU confidential computing and attestation are covered in depth in Chapter 11.5; the multi-tenant security architecture that wraps them is Chapter 11.6, and the network half (per-tenant microsegmentation, east-west zero-trust on the storage and management planes) is Chapter 11.7.
The decision rule is blunt: match the isolation tier to the trust boundary, then choose any sharing mechanism that fits inside it. Two jobs from the same team can share an MPS context behind a single container — the trust domain is one, so weak isolation is fine and you maximize packing. Two mutually-distrusting external customers must never share a kernel: that is a VM boundary at minimum, and a confidential VM when your own platform must be untrusted (the regulated, sovereign-AI, and model-weight-protection cases). MIG and vGPU sit awkwardly in between — strong performance isolation, but documented side channels (uncore counters, shared metrics, microarchitectural leakage) mean you do not sell them as a confidentiality boundary between adversaries. They are a QoS boundary that happens to live in hardware.
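The decision rule can be written down as a small function that takes the trust relationship as input and returns the minimum acceptable tier. This is a sketch of the rule as stated in this section, not a policy engine; the `Tier` enum and function names are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Isolation tiers, ordered weakest to strongest."""
    PROCESS = 0
    CONTAINER = 1
    VM = 2
    CONFIDENTIAL_VM = 3


def minimum_tier(same_trust_domain, operator_trusted):
    """Minimum isolation tier for a pair of co-located tenants."""
    if not operator_trusted:
        return Tier.CONFIDENTIAL_VM  # the platform itself is outside the trust boundary
    if not same_trust_domain:
        return Tier.VM               # distrusting tenants must never share a kernel
    return Tier.PROCESS              # one trust domain: weak isolation is acceptable


def is_acceptable(actual, same_trust_domain, operator_trusted):
    """The sharing mechanism (MIG, MPS, ...) is deliberately not an input here."""
    return actual >= minimum_tier(same_trust_domain, operator_trusted)


print(minimum_tier(same_trust_domain=True, operator_trusted=True).name)
print(is_acceptable(Tier.CONTAINER, same_trust_domain=False, operator_trusted=True))
```

Note what the signature leaves out: nothing about MIG, MPS, or fractions appears as an argument, because the sharing mechanism never raises the isolation tier.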
Deep dive: the three real 2025–26 multi-tenant isolation failures and what each one teaches
The "is partitioning a security boundary?" debate stopped being theoretical in 2025. Three classes of failure, each pointing at a different layer of the stack:
1. Container-escape (NVIDIAScape, CVE-2025-23266, CVSS 9.0). A flaw in the NVIDIA Container Toolkit let a malicious container break out to the host on shared GPU nodes — the worst outcome in multi-tenancy, because host compromise means access to every other tenant on the box. The lesson: the GPU runtime and toolkit are part of your trust boundary, not just the kernel. Container isolation across trust domains is only as strong as the GPU plumbing underneath it, and that plumbing gets CVEs like any other privileged software. Patch cadence on the GPU Operator and Container Toolkit is a multi-tenancy security control, not a hygiene nicety (Chapter 10.4).
2. Cross-VM metric leakage (CVE-2025-23290). A guest VM could read global GPU metrics influenced by other VMs — low severity (CVSS 2.5) but conceptually important: it was the first acknowledged leak of co-tenant activity through the vGPU manager. The lesson: even a VM boundary leaks signal through shared observability surfaces. A side channel does not need to read your data to hurt you; reading your load can be enough in some threat models.
3. Microarchitectural side channels across MPS/MIG. Academic work ('Spy in the GPU-box' and uncore side-channel studies) demonstrated covert and timing channels that survive MPS and MIG partitioning. The lesson, and the through-line of all three: performance partitioning is not confidentiality. If your tenants are adversarial, you climb to VM or confidential-VM isolation; if they are one trust domain, all of these are acceptable risks and you optimize for packing. Know which world you are in, and never let a sales sheet blur the two.
Noisy neighbors and QoS guarantees
Even inside a single trust domain, where security is not the concern, multi-tenancy has a performance pathology: the noisy neighbor. One tenant's job saturates a shared resource and silently taxes everyone else's goodput — and on a shared GPU node the contended resource is rarely the one people watch. It is memory bandwidth (HBM is the bottleneck for most inference and the noisiest shared resource under MPS/time-slicing), PCIe/NVLink (a greedy host-to-device copy starves a neighbor's transfers), the shared fabric (a tenant's all-to-all floods the back-end and inflates everyone's collective time — congestion engineering in Chapter 8.6), and shared storage and the loader (one tenant's checkpoint write or dataset scan blows the cache for the rest).
You bound the noisy neighbor at one of three layers, hardest to softest. At the hardware: MIG, full stop — physical partitions make the noisy neighbor impossible by construction, which is exactly why it is the default for tenants you owe a hard SLO. At the scheduler: resource limits and requests, priority and preemption so a low-priority neighbor yields, and gang-aware admission so a job either gets its full topology-clean allocation or waits rather than half-landing and contending (Chapter 10.2). At the fabric and storage: per-tenant QoS classes, rate limits, and bandwidth reservations so no one tenant monopolizes the network or the loader. The consequence of skipping all three is the failure mode that is hardest to debug because nothing errors: every tenant's p99 latency quietly degrades, the SLO dashboard goes amber, and the cause is invisible unless your telemetry attributes contention per tenant — which is why per-tenant goodput and contention metrics belong in the observability plane from day one (Chapter 10.6).
The honest QoS guarantee follows from the mechanism. A hard guarantee ("this tenant gets exactly this SLO regardless of neighbors") requires a hard partition — MIG, or whole-GPU. A soft guarantee ("this tenant gets a target share, best-effort, degrading gracefully under contention") is what MPS, time-slicing, and fractional GPUs can deliver. Selling a hard SLO on a soft mechanism is the noisy-neighbor trap dressed up as a contract: it works in the demo, when the GPU is half-empty, and breaches the first time the cluster fills.
Putting it together: a per-tenant-class decision
The mistake is to pick one sharing mechanism, one quota model, and one isolation tier for the whole fleet. The defensible design classifies tenants and maps each class down all three axes at once:
- A frontier training tenant: whole-GPU, hard partition or a top-priority gang reservation, container isolation (one trust domain), noisy neighbor bounded by sole ownership.
- An external SLO-bound serving tenant: MIG for hard QoS, static quota for guaranteed capacity, VM-or-CVM isolation if mutually distrusting, noisy neighbor impossible by hardware partition.
- An internal research/dev fleet: fractional GPU and time-slicing for packing, hierarchical fair-share with preemption for utilization, container isolation, noisy neighbor bounded by scheduler limits.
- A preemptible batch tier: whatever packs tightest, lowest priority with instant reclaim, container isolation, explicitly soft QoS.
The same physical fleet runs all four. The art is keeping each class's mechanism, quota, and isolation tier internally consistent so the guarantees you sell are the guarantees the silicon actually enforces.
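One way to keep the classes internally consistent is to record each one as data and check it against the chapter's two rules: a hard SLO needs a hard partition, and mutually-distrusting tenants need at least a VM. The sketch below is illustrative only; the field names and string labels are hypothetical, and the batch tier's "time-slicing" entry is a placeholder for "whatever packs tightest."

```python
from dataclasses import dataclass

HARD_SHARING = {"whole-gpu", "mig"}  # mechanisms that can back a hard SLO
ISOLATION_RANK = {"process": 0, "container": 1, "vm": 2, "confidential-vm": 3}


@dataclass(frozen=True)
class TenantClass:
    name: str
    sharing: str
    quota: str
    isolation: str
    hard_slo: bool
    mutually_distrusting: bool


def inconsistencies(tc):
    """Return the list of guarantees this class sells but the mechanism cannot enforce."""
    problems = []
    if tc.hard_slo and tc.sharing not in HARD_SHARING:
        problems.append("hard SLO sold on a soft sharing mechanism")
    if tc.mutually_distrusting and ISOLATION_RANK[tc.isolation] < ISOLATION_RANK["vm"]:
        problems.append("mutually-distrusting tenants share a kernel")
    return problems


FLEET = [
    TenantClass("frontier-training", "whole-gpu", "gang-reservation", "container", True, False),
    TenantClass("external-serving", "mig", "static", "vm", True, True),
    TenantClass("research-dev", "fractional", "fair-share", "container", False, False),
    TenantClass("preemptible-batch", "time-slicing", "lowest-priority", "container", False, False),
]

for tc in FLEET:
    print(tc.name, inconsistencies(tc) or "consistent")

# A misconfigured class: external customers on MPS inside containers with a hard SLO.
bad = TenantClass("external-on-mps", "mps", "static", "container", True, True)
print(bad.name, inconsistencies(bad))
```

Running a check like this against every tenant class before it is offered catches the case where a contract promises more than the chosen mechanism delivers.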