Part 10

Software, Orchestration & Service Delivery

11 chapters

Orchestration Architecture & the Scheduling Plane

The scheduler is the layer that decides whether your GPUs run jobs or sit idle, and the choice between an HPC batch scheduler and a cloud-native one is a bet on what your fleet actually does, how much it costs to run two control planes, and whether you can place work on the fabric without leaving bandwidth on the floor.

Topology-Aware & Rack-Scale Scheduling

On a rack-scale machine the placement of a job is a hard performance contract: land a tightly coupled job inside one NVLink domain and it runs at full bandwidth; split it across the domain boundary and you fall off an order-of-magnitude bandwidth cliff that no amount of tuning recovers. The scheduler's real job is to defend topology, not to pack nodes.

Multi-Tenancy, Isolation & Resource Sharing

Sharing a GPU is two decisions, not one — a performance-isolation choice (whole-GPU, MIG, MPS, time-slicing, fractional) and a security-isolation choice (process, container, VM, confidential VM) — and conflating them is how operators sell a 'partition' as a 'boundary' it was never built to be.

Node Software Stack: Drivers, CUDA/ROCm, NCCL & Firmware

The node software stack is a single versioned organism — driver, CUDA/ROCm runtime, NCCL/RCCL, and firmware must move together as one pinned, attested artifact across every node, because a one-line version skew in a synchronous cluster does not slow the job, it hangs it.

Provisioning, Bring-Up & Infrastructure as Code

Provisioning is where a rack of stranded silicon becomes a revenue-earning node, and the choice between treating physical hardware as artisanal pets or as programmable, declaratively described cattle decides how many GPU-hours you burn between 'powered on' and 'production', every refresh, forever.

Observability, Telemetry & GPU Health

Observability for a GPU fleet is the closed loop that converts a noisy hardware-failure stream into goodput, and the decisions you make about what to detect, how fast, and what to store determine whether your cluster spends its life training or restarting.

Fleet Reliability, Fault Tolerance & Autonomous Recovery

At fleet scale a synchronous training job fails every few hours, and a 100k-GPU run fails more than once an hour, so reliability stops being a facility-availability number and becomes a software-and-control-plane problem: the operator that detects a fault, ejects the node, and restarts from a recent checkpoint in minutes keeps its goodput, and the one that waits for a human loses it.
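
The scaling behind those failure intervals can be sketched with simple arithmetic. This is a minimal illustration, assuming failures are independent, occur at a constant rate, and that any single GPU failure interrupts the whole synchronous job. The per-GPU figure `ASSUMED_PER_GPU_MTBF_HOURS` and the function name `job_mtbf_hours` are hypothetical, chosen only to show the shape of the curve, not measured values.

```python
def job_mtbf_hours(per_gpu_mtbf_hours: float, num_gpus: int) -> float:
    """Mean time between job-interrupting failures.

    Assumes independent failures at a constant rate, and that one failed
    GPU stops the whole synchronous job, so failure rates simply add up.
    """
    if per_gpu_mtbf_hours <= 0 or num_gpus <= 0:
        raise ValueError("MTBF and GPU count must be positive")
    return per_gpu_mtbf_hours / num_gpus


# Hypothetical per-GPU mean time between failures; not a measured figure.
ASSUMED_PER_GPU_MTBF_HOURS = 50_000.0

if __name__ == "__main__":
    for n in (1_000, 16_000, 100_000):
        mtbf = job_mtbf_hours(ASSUMED_PER_GPU_MTBF_HOURS, n)
        print(f"{n:>7} GPUs: one failure every {mtbf:.2f} h "
              f"({1 / mtbf:.2f} per hour)")
```

Under that assumed per-GPU figure, the interval between interruptions shrinks in direct proportion to job size, which is why recovery time, not facility uptime, ends up dominating goodput.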

MLOps & Training Frameworks

The training framework and the control plane around it convert raw FLOPS into trained weights — pick the parallelism strategy and the orchestrator wrong and you do not get a slower run, you get a fleet that bills full-price for half its arithmetic.

Customer Onboarding, Delivery & Productization

A GPU cluster is not a product until someone can buy it, get a job running on it, be metered fairly for it, and leave it — and where you sit on the value-stack ladder (bare-metal up to serverless) deterministically sets your isolation model, your SLA exposure, your billing engine, and the gross margin you keep on every GPU-hour you sell.

Data Governance, Privacy & the Training-Data Legal Regime

The legal and privacy posture of the data flowing through an AI data center is not a paperwork problem bolted on at the end — it is an architecture decision that, made wrong, can require you to retrain a model, geo-fence a hall, or hand twenty million customer conversations to opposing counsel.

Inference Serving Engineering: SLOs, Batching, Disaggregation & Goodput-Optimal Scheduling

Inference serving is the constrained optimization of serving the most tokens that meet your SLO — not a throughput problem and not a latency problem on its own — and every lever (batching, chunking, disaggregation, speculation, routing) is a different bet on where that goodput-optimal point sits for your model, your traffic, and your fleet.
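
The distinction between throughput and goodput can be made concrete with a small sketch. It assumes the SLO is expressed as two per-request limits, time to first token and mean time per output token; the summary above does not specify the SLO's form, so `Request`, `ttft_slo_s`, `tpot_slo_s`, and both function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    ttft_s: float        # time to first token, seconds
    tpot_s: float        # mean time per output token, seconds
    output_tokens: int   # tokens generated for this request


def throughput_tokens_per_s(requests: Iterable[Request], window_s: float) -> float:
    """All generated tokens per second, regardless of SLO."""
    return sum(r.output_tokens for r in requests) / window_s


def goodput_tokens_per_s(requests: Iterable[Request], window_s: float,
                         ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Tokens per second counting only requests that met both SLO limits."""
    good = sum(r.output_tokens for r in requests
               if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s)
    return good / window_s


if __name__ == "__main__":
    # Made-up sample: one request meets the SLO, two miss it.
    sample = [
        Request(ttft_s=0.2, tpot_s=0.03, output_tokens=100),
        Request(ttft_s=1.5, tpot_s=0.03, output_tokens=200),  # slow first token
        Request(ttft_s=0.3, tpot_s=0.08, output_tokens=50),   # slow decode
    ]
    print(throughput_tokens_per_s(sample, window_s=10.0))
    print(goodput_tokens_per_s(sample, window_s=10.0,
                               ttft_slo_s=0.5, tpot_slo_s=0.05))
```

In the made-up sample, most generated tokens belong to requests that missed a limit, so raw throughput overstates what the fleet actually delivered within the SLO.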