Chapter 12.4
SLAs, Goodput Contracts & Availability Commitments
The SLA is where the reliability rethink becomes a number on a contract — and the operator who promises facility availability for a workload that is actually paying for goodput has signed a contract that is simultaneously unenforceable by the customer and unprofitable for the provider.
What you'll decide here
- Whether you are committing to availability (the facility is energized and reachable) or to goodput (the customer's job makes effective forward progress) — the two diverge sharply for AI workloads and govern entirely different penalty mechanics.
- The measurement basis and attribution rules for any goodput or job-success commitment: what counts as badput, who owns each badput class (provider vs tenant vs force majeure), and from what baseline the shortfall is computed.
- The shape of the service-credit ladder — linear vs stepped, capped vs uncapped, credits-only vs termination rights — and the maximum monthly exposure you are willing to underwrite against your own failure environment.
- Which acceptance/commissioning gate establishes the contractual goodput baseline, so the SLA is measured against a number both parties signed at go-live rather than against marketing.
- How the customer-facing commitment is reconciled against the redundancy design-basis and the modeled availability — never promise a tier of continuity the physical plant and the failure environment cannot deliver at a profit.
An SLA is a reliability model with money attached. Everything in Part 12 up to this point — the goodput-vs-availability rethink (Chapter 12.2), the redundancy primer (Chapter 0.5), the facility tier standards (Chapter 12.1) — exists in the engineering domain, where being wrong costs an outage. The SLA drags all of it into the commercial domain, where being wrong costs a service credit, a churned tenant, or a take-or-pay dispute. This chapter is about the translation: how a failure environment becomes a promise, how that promise is measured and attributed, and how the penalty structure is shaped so that it disciplines the provider without bankrupting them on a bad month.
The recurring fork is the one Part 12 has been building toward. Availability asks: was the facility energized, cooled, and reachable? Goodput asks: did the customer's job make effective forward progress? For a web service those two questions have nearly the same answer. For an AI workload they diverge violently — a cluster can be 100% available at the facility meter and delivering 70% goodput because a single GPU's silent data corruption is poisoning every synchronous step, or be down for ninety seconds of cooling-pump ride-through and lose nothing because the job checkpointed cleanly. The SLA that measures the wrong one of these is worse than no SLA: it gives the customer a number that does not track their pain and gives the provider an exposure that does not track their control.
What is actually being promised: availability vs goodput
The legacy data-center SLA promises availability — a Monthly Uptime Percentage measured at the facility or the instance, with a service-credit ladder if it falls short. This is the world of the Uptime tiers: Tier III at 99.982% (~1.6 hr/yr of downtime), Tier IV at 99.995% (~26 min/yr), each "nine" a downtime budget the operator commits not to exceed. It is a clean, well-understood, auditable promise, and for online inference it is still the right primary metric: an inference endpoint that is unreachable is lost revenue and a breached latency SLO, and the customer experiences facility downtime directly. → serving-engineering SLOs in Chapter 10.11.
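The downtime budgets quoted above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year and, for the monthly case, a 30-day billing period (both are assumptions, not contract terms from this chapter); the function name is hypothetical:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 h; leap years ignored (assumption)


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed over a period at a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * period_hours


tier_iii_hours = downtime_budget_hours(99.982)        # roughly 1.58 h per year
tier_iv_minutes = downtime_budget_hours(99.995) * 60  # roughly 26.3 min per year

# A 99.9% commitment over an assumed 30-day (720 h) billing month.
node_monthly_minutes = downtime_budget_hours(99.9, period_hours=720) * 60  # roughly 43 min
```

The same function makes it easy to see how little room a monthly measurement window leaves: a single incident of under an hour can consume the whole budget at three nines.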
The training SLA is a different animal, because the customer is not buying uptime — they are buying effective compute. Goodput is the fraction of wall-clock GPU time that produces useful, retained forward progress, after subtracting every form of badput: failed steps that get rolled back, time lost to checkpoint/restore, stragglers throttling a synchronous collective, idle GPUs waiting on a re-scheduled node, and work discarded because it ran on a silently-corrupting device. Google's formalization is the cleanest reference: goodput is productive time divided by total time, with badput enumerated by category so each loss has an owner. The industry runs around 90% goodput on average; the best-in-class operators market ~96%; the reliability overhead that the gap represents is 6–21% of TCO. A 6-point goodput improvement on a 20,000-GPU cluster is worth more than several nines of facility availability the training job would never have noticed.
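The definition above (productive time divided by total time, with badput itemized by category) can be sketched as a small calculation. The category names and GPU-hour figures below are illustrative assumptions, not measurements from any real cluster:

```python
from typing import Mapping


def goodput_fraction(total_gpu_hours: float,
                     badput_gpu_hours: Mapping[str, float]) -> float:
    """Goodput = productive time / total time, with badput itemized by category."""
    if total_gpu_hours <= 0:
        raise ValueError("total_gpu_hours must be positive")
    badput = sum(badput_gpu_hours.values())
    if badput > total_gpu_hours:
        raise ValueError("badput cannot exceed total GPU-hours")
    return (total_gpu_hours - badput) / total_gpu_hours


# Hypothetical monthly ledger, in GPU-hours, using the badput forms named in the text.
example_badput = {
    "rolled_back_steps": 300.0,
    "checkpoint_restore": 250.0,
    "straggler_throttling": 200.0,
    "idle_on_reschedule": 150.0,
    "discarded_sdc_work": 100.0,
}
example_goodput = goodput_fraction(10_000.0, example_badput)  # 0.9
```

Keeping the badput itemized, rather than storing only the final ratio, is what later allows each loss to be assigned an owner.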
| Dimension | Availability commitment | Goodput contract |
|---|---|---|
| What is promised | Facility/node/instance energized & reachable | Effective forward progress of the customer's job |
| Right for | Online inference, hosting, control plane | Pre-training, large-scale RL, long batch |
| Unit of measure | Monthly Uptime % (downtime minutes) | Goodput % = productive time / total time |
| Typical 2026 target | 99.9% node / 99% rack (ClusterMAX baseline); 99.99% region (hyperscaler) | 90% industry avg; ~96% best-in-class; negotiated floor often 90–95% |
| Measurement point | Meter, hypervisor, or endpoint probe — auditable | Inside the job: telemetry, checkpoint logs, step counters |
| Attribution difficulty | Low — outage is observable at the boundary | High — badput must be classed by owner (provider/tenant/force majeure) |
| Failure that breaches it | Power/cooling loss, network partition, node-down | SDC, straggler throttling, slow recovery, fabric BER, checkpoint stalls |
| Provider's lever to defend it | Redundancy (2N power, N+1 cooling) | Health-checking, hot spares, fast checkpoint/restore, node drain |
The two columns reward different capital. An availability commitment is defended with redundancy — 2N power, N+1 cooling, dual fabric — and its cost lands in the MEP construction budget (2N can swing MEP cost 30–50% over N+1 and leaves ~50% of capacity idle). A goodput contract is defended with operational reliability — passive health-checks every few seconds, automatic node drain on a ~15% degradation-vs-golden-reference trigger, 3–5% hot spares standing by for a ~90-second node swap, and multi-tier checkpointing that cuts mean-time-to-recover from 15–30 minutes down to under 2. The mistake operators make is buying the first kind of insurance for a workload that needs the second. A training tenant does not value the 26-minute-per-year facility downtime budget of Tier IV; they value never losing a 40-minute checkpoint interval to a slow restart. → the goodput-vs-availability tradeoff curve in Chapter 12.2, quantified by the model in Chapter 12.5.
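To see why the recovery-time figures above matter more to a training tenant than a facility downtime budget, a rough sketch helps. It assumes the whole job stalls for the full recovery time on every interruption, and it borrows the roughly three-hour interruption interval this chapter cites for a 16,384-GPU run as an illustrative input; the function name and the simplification are assumptions:

```python
def recovery_badput_fraction(mttr_minutes: float,
                             interruption_interval_hours: float) -> float:
    """Share of wall-clock time spent recovering, if the whole job stalls per event."""
    if mttr_minutes < 0 or interruption_interval_hours <= 0:
        raise ValueError("mttr must be >= 0 and interval must be > 0")
    return min(1.0, (mttr_minutes / 60.0) / interruption_interval_hours)


ASSUMED_INTERVAL_HOURS = 3.0  # illustrative input, not a contract term

slow_recovery = recovery_badput_fraction(30, ASSUMED_INTERVAL_HOURS)    # about 16.7%
medium_recovery = recovery_badput_fraction(15, ASSUMED_INTERVAL_HOURS)  # about 8.3%
fast_recovery = recovery_badput_fraction(2, ASSUMED_INTERVAL_HOURS)     # about 1.1%
```

The model ignores rework since the last checkpoint, so it understates the loss; even so, the gap between the slow and fast cases is several goodput points under these assumptions.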
Penalty and credit structures: the service-credit ladder
A commitment without a penalty is marketing. The penalty in a standard SLA is the service credit: a percentage of the monthly bill refunded (as credit, almost never cash) when the measured metric falls below a threshold. The standard ladder is a step function: miss the target band and you owe a credit; miss it badly and you owe more. The reference shape, from the hyperscaler compute SLAs that anchor the market, is a three-rung ladder: a small credit (≈10%) for slipping just under the target, a larger one (≈25%) for a material breach, and a full credit (≈100%) for a catastrophic month. The GPU-neocloud ladders mirror this, scaled to node and rack uptime: the ClusterMAX baseline is 99.9% node / 99% rack uptime with penalties, and the dual-SLA pattern on NVL72-class racks typically pairs a node-level commitment with a separate rack-level one.
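The three-rung shape can be sketched as a lookup. The credit percentages (10/25/100) come from the text; the uptime thresholds for each rung are hypothetical placeholders, since real contracts set their own bands:

```python
from typing import List, Tuple

# (uptime threshold %, credit %) pairs, ascending by threshold.
# Thresholds are assumed for illustration; only the credit sizes follow the text.
THREE_RUNG_LADDER: List[Tuple[float, float]] = [
    (95.0, 100.0),  # catastrophic month
    (99.0, 25.0),   # material breach
    (99.9, 10.0),   # just under target
]


def stepped_credit_pct(uptime_pct: float,
                       ladder: List[Tuple[float, float]] = THREE_RUNG_LADDER) -> float:
    """Return the credit % for the lowest band the measured uptime falls under."""
    for threshold, credit in ladder:  # ascending, so the worst band matches first
        if uptime_pct < threshold:
            return credit
    return 0.0  # target met, no credit owed


credit_met = stepped_credit_pct(99.95)     # 0.0
credit_slip = stepped_credit_pct(99.5)     # 10.0
credit_material = stepped_credit_pct(97.0)  # 25.0
credit_catastrophic = stepped_credit_pct(90.0)  # 100.0
```

Note that 98.99% and 95.01% uptime land on the same rung here, which is the flat region between steps that a stepped ladder always produces.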
The design decisions inside the ladder are where the money and the disputes live. Stepped vs linear: a stepped ladder is simple to administer but creates cliff incentives — a provider one minute the wrong side of a threshold owes the same as one an hour past it, which can perversely make them stop fighting an outage once a rung is lost. A linear (or finely-stepped) ladder tracks pain more honestly at the cost of administrative complexity. Capped vs uncapped: nearly every provider caps total monthly credits (commonly at 100% of the monthly fee for the affected service), because an uncapped credit converts a bad-luck month into an existential loss and is uninsurable. Credits vs termination: the customer's real remedy for chronic underperformance is not the credit — which rarely exceeds a month's fee — but a termination-for-repeated-breach right (e.g., three breaches in a rolling quarter), which is the clause that actually disciplines a provider, and the one to negotiate hardest on either side.
| Ladder design | How a shortfall is paid | Who it favors | Failure mode to watch |
|---|---|---|---|
| Stepped (3-rung: ~10% / ~25% / ~100%) | Fixed credit % per uptime band missed | Provider — simple, predictable exposure | Cliff incentive: no marginal reason to recover once a rung is lost |
| Finely-stepped / linear | Credit scales with downtime minutes or goodput gap | Customer — tracks actual harm | Administratively heavy; needs trusted measurement |
| Capped at 100% of monthly fee | Total credits never exceed the period's bill | Provider — bounds catastrophic months | Under-compensates a customer whose loss dwarfs the fee |
| Uncapped / multiplied credits | Penalty can exceed the fee | Customer — real teeth | Uninsurable for provider; rare outside bespoke take-or-pay |
| Credits-only | Refund as future credit, no cash, no exit | Provider — retains revenue & tenant | Toothless against chronic underperformance |
| Credits + termination-for-repeated-breach | Credit ladder plus exit right after N breaches/quarter | Customer — escape from a bad provider | Provider churn risk; the clause both sides fight over |
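Two of the design choices in the table, a linear credit with a cap and a termination-for-repeated-breach right, can be sketched together. The credit rate per point of shortfall and the 90-day window are hypothetical parameters; the cap at 100% of the monthly fee and the three-breach example follow the text:

```python
from datetime import date, timedelta
from typing import Iterable


def linear_capped_credit(monthly_fee: float, measured_pct: float, target_pct: float,
                         credit_pct_per_point: float = 20.0,  # assumed rate
                         cap_pct: float = 100.0) -> float:
    """Credit scales with the shortfall in points, capped at cap_pct of the fee."""
    shortfall_points = max(0.0, target_pct - measured_pct)
    credit_pct = min(cap_pct, shortfall_points * credit_pct_per_point)
    return monthly_fee * credit_pct / 100.0


def termination_right_triggered(breach_dates: Iterable[date], as_of: date,
                                window_days: int = 90,  # assumed "rolling quarter"
                                breaches_required: int = 3) -> bool:
    """True when enough breaches fall inside the rolling window ending at as_of."""
    window_start = as_of - timedelta(days=window_days)
    recent = [d for d in breach_dates if window_start < d <= as_of]
    return len(recent) >= breaches_required


credit_small = linear_capped_credit(100_000.0, measured_pct=93.0, target_pct=95.0)   # 40000.0
credit_capped = linear_capped_credit(100_000.0, measured_pct=85.0, target_pct=95.0)  # 100000.0

breaches = [date(2026, 1, 10), date(2026, 2, 14), date(2026, 3, 20)]
exit_now = termination_right_triggered(breaches, as_of=date(2026, 3, 31))    # True
exit_later = termination_right_triggered(breaches, as_of=date(2026, 4, 30))  # False
```

The second termination check returns False because the January breach has aged out of the window, which is why the window length is worth negotiating as carefully as the breach count.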
Measuring and attributing the shortfall: goodput accounting and badput
An availability metric is observable at a boundary — the meter trips, the endpoint stops answering, and the downtime minutes are not in serious dispute. A goodput metric is the opposite: it is measured inside the customer's job, where the provider and the tenant share the failure surface, and the entire commercial value of the contract turns on attribution — deciding, for every minute of badput, whose fault it was. This is the hardest part of an AI SLA and the part most contracts get dangerously vague on.
The contractual measurement basis is goodput accounting: instrument the job to produce a defensible ledger of total GPU-time, productive GPU-time, and each class of badput. The badput taxonomy is the heart of it, and each class has a natural owner. Hardware badput — a failed GPU, an HBM error, an SDC event auto-drained by the health-checker — is the provider's: it is their silicon and their fleet management. Recovery badput — checkpoint/restore latency, node-swap time, re-scheduling delay — is shared, and the split depends on whether the provider supplied the checkpointing stack or the tenant did. Workload badput — a tenant's inefficient parallelism, a bad hyperparameter that diverges, a job that simply ran slow — is the tenant's, and the provider must be able to fence it out or they are underwriting the customer's ML engineering. The contract must name these classes, name the attribution method (whose telemetry, what arbitration if the logs disagree), and name the baseline.
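A minimal sketch of the ledger described above, assuming the three badput classes named in the text and a contract-defined split for the shared recovery class. The 50/50 default split, the class keys, and the GPU-hour figures are hypothetical; force majeure is left out for brevity:

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Natural owner of each badput class, as described in the text.
OWNER_BY_CLASS = {"hardware": "provider", "recovery": "shared", "workload": "tenant"}


@dataclass(frozen=True)
class BadputEntry:
    badput_class: str  # one of the keys in OWNER_BY_CLASS
    gpu_hours: float


def attribute_badput(entries: Iterable[BadputEntry],
                     provider_share_of_recovery: float = 0.5) -> Dict[str, float]:
    """Sum badput GPU-hours per owner, splitting the shared class by contract."""
    if not 0.0 <= provider_share_of_recovery <= 1.0:
        raise ValueError("provider_share_of_recovery must be in [0, 1]")
    totals = {"provider": 0.0, "tenant": 0.0}
    for entry in entries:
        owner = OWNER_BY_CLASS[entry.badput_class]  # KeyError on an unnamed class
        if owner == "shared":
            totals["provider"] += entry.gpu_hours * provider_share_of_recovery
            totals["tenant"] += entry.gpu_hours * (1.0 - provider_share_of_recovery)
        else:
            totals[owner] += entry.gpu_hours
    return totals


ledger = [
    BadputEntry("hardware", 400.0),
    BadputEntry("recovery", 200.0),
    BadputEntry("workload", 300.0),
]
attributed = attribute_badput(ledger)  # {'provider': 500.0, 'tenant': 400.0}
```

Raising on an unknown class is deliberate in this sketch: a badput minute that fits no contractual class is exactly the dispute the contract should have closed in advance.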
Deep dive: badput attribution and the silent-data-corruption problem
The clean cases are easy. A node hard-fails, the health-checker drains it, the job restarts from the last checkpoint — that is provider hardware badput, the minutes are logged, and the credit is owed. The pathological case, and the one that makes goodput contracts genuinely hard to write, is silent data corruption: a GPU that produces wrong results without erroring, poisoning gradients across a synchronous run until someone notices the loss curve has gone strange. Two attribution problems collide here. First, detection lag: the corruption may have been retained into checkpoints for hours before discovery, so the badput is not the few minutes to swap the bad device but the entire window of poisoned work that must be rolled back. Second, blame ambiguity: a diverging loss curve looks identical whether the cause is the provider's faulty silicon or the tenant's unstable training recipe, and the only thing that disambiguates it is reference-comparison telemetry the provider must run continuously and the tenant must agree to trust.
This is why the operational practices and the contract are inseparable. The SLA's goodput floor is only defensible if the provider runs the machinery that produces clean attribution: passive health-checks every few seconds, periodic deep node diagnostics (DCGM-class), SDC-detection via golden-reference comparison, and the auto-drain trigger at ~15% degradation. Without that instrumentation, every badput dispute degenerates into a finger-pointing exercise the provider loses (because they cannot prove it was the tenant) or the tenant loses (because they cannot prove it was the provider). The acceptance-gate fingerprint — the all-reduce busbw at ~92% of theoretical, the per-node nvbandwidth numbers, the fabric BER floor — is what both sides point back to when they disagree. → commissioning fingerprint capture in Chapter 13.2; the failure-rate inputs that set the badput baseline in Chapter 14.3.
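The auto-drain trigger mentioned above can be sketched as a comparison against a golden reference. The ~15% threshold comes from the text; the metric, its units, the example readings, and the choice to drain at or above the threshold are assumptions for illustration:

```python
def degradation_fraction(measured: float, golden_reference: float) -> float:
    """Fractional shortfall of a measured metric versus its golden reference."""
    if golden_reference <= 0:
        raise ValueError("golden_reference must be positive")
    return max(0.0, (golden_reference - measured) / golden_reference)


def should_drain(measured: float, golden_reference: float,
                 threshold: float = 0.15) -> bool:
    """Drain the node when degradation reaches the threshold (assumed inclusive)."""
    return degradation_fraction(measured, golden_reference) >= threshold


# Hypothetical per-node benchmark readings against a reference of 100 units.
drain_degraded_node = should_drain(measured=84.0, golden_reference=100.0)  # True
drain_healthy_node = should_drain(measured=90.0, golden_reference=100.0)   # False
```

Logging the measured value, the reference, and the decision for every check is what turns this from a fleet-management heuristic into attribution evidence.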
Tying the SLA to the commissioning baseline
A goodput contract measured against "the industry says ~90%" is a contract measured against nothing — it invites a dispute the day the first shortfall is claimed. The discipline that makes it enforceable is to anchor the SLA to a baseline captured at go-live: the commissioning process produces a quantitative fingerprint of the as-built cluster, and that fingerprint becomes the contractual reference the SLA is measured against. The acceptance gate is not just a build checkpoint — it is the moment the SLA's denominator is fixed.
Concretely, the go-live fingerprint that the SLA should cite includes the NCCL all-reduce busbw the fabric actually achieved (~92% of theoretical scaling from two nodes to the full cluster is the acceptance norm), the per-node intra-node bandwidth, the fabric bit-error-rate floor, the burn-in soak result (72–168 hours, designed to remove ~98% of infant-mortality failures before tenant handoff), and the measured goodput on a representative reference job. Writing the SLA against this number — "goodput shall not fall below 95% of the commissioned baseline reference" — converts an unfalsifiable marketing claim into an auditable commitment with an agreed starting point. It also protects the provider: a tenant who later runs a pathological workload cannot claim the cluster regressed, because the baseline was established on a known-good reference job both parties signed. → acceptance scripts and pass/fail gates in Chapter 13.2.
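The clause quoted above ("goodput shall not fall below 95% of the commissioned baseline reference") reduces to a short check. The baseline and measured values below are hypothetical; only the 95% ratio comes from the text:

```python
def goodput_floor(baseline_goodput: float, floor_ratio: float = 0.95) -> float:
    """Contractual floor expressed as a ratio of the commissioned baseline."""
    return baseline_goodput * floor_ratio


def shortfall_points(measured_goodput: float, baseline_goodput: float,
                     floor_ratio: float = 0.95) -> float:
    """Percentage points by which measured goodput misses the floor (0 if met)."""
    floor = goodput_floor(baseline_goodput, floor_ratio)
    return max(0.0, floor - measured_goodput) * 100.0


COMMISSIONED_BASELINE = 0.94  # hypothetical goodput on the signed reference job

floor = goodput_floor(COMMISSIONED_BASELINE)                # about 0.893
miss = shortfall_points(0.88, COMMISSIONED_BASELINE)        # about 1.3 points
no_miss = shortfall_points(0.90, COMMISSIONED_BASELINE)     # 0.0
```

Because the floor is derived from the baseline rather than stated as an absolute number, a cluster that commissions below expectations gets a lower floor, which is one reason the acceptance gate itself needs a pass threshold.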
Mapping commitments to productization and serving SLOs
The SLA does not live in a vacuum — it is the contractual face of two things engineered elsewhere. Upstream of the customer, it is the productization of capacity: the service tiers, the pricing, the reserved-vs-on-demand structure, and the onboarding commitments that turn a cluster into a sellable product. A reserved or take-or-pay commitment justifies a stronger SLA (the customer is locked in, so the provider can underwrite more); an on-demand spot tier carries little or no availability promise by design. The SLA tier and the commercial tier must be co-designed or they contradict each other. → customer delivery and productization in Chapter 10.9.
Downstream, for inference, the availability SLA is only the outer envelope; the metric the customer actually experiences is the serving SLO — time-to-first-token, time-per-output-token, p99 latency under load. A 99.99% availability commitment is worthless to an inference tenant if the endpoint is technically "up" but blowing its latency budget during every traffic peak. The contract must therefore reconcile the facility-availability layer with the serving-engineering layer: availability is necessary but not sufficient, and a mature inference SLA commits to both an uptime floor and a latency-SLO floor, with separate credits. → serving-engineering SLOs and latency budgets in Chapter 10.11.
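A sketch of the two-layer inference commitment described above, with the uptime floor and the latency SLO evaluated separately and credited separately. The credit sizes, the 500 ms p99 time-to-first-token budget, and the sample data are all hypothetical; the percentile uses the simple nearest-rank method:

```python
import math
from typing import Dict, Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def inference_sla_credits(uptime_pct: float, ttft_ms_samples: Sequence[float],
                          uptime_floor_pct: float = 99.99,
                          p99_ttft_budget_ms: float = 500.0,  # assumed SLO
                          uptime_credit_pct: float = 10.0,    # assumed credit
                          latency_credit_pct: float = 10.0) -> Dict[str, float]:
    """Evaluate the uptime floor and the latency SLO independently."""
    p99 = percentile_nearest_rank(ttft_ms_samples, 99)
    return {
        "p99_ttft_ms": p99,
        "uptime_credit_pct": uptime_credit_pct if uptime_pct < uptime_floor_pct else 0.0,
        "latency_credit_pct": latency_credit_pct if p99 > p99_ttft_budget_ms else 0.0,
    }


# Endpoint never went down, but 3% of requests were slow during traffic peaks.
samples_ms = [200.0] * 97 + [900.0] * 3
credits = inference_sla_credits(uptime_pct=100.0, ttft_ms_samples=samples_ms)
```

In this example the uptime credit is zero while the latency credit is owed, which is the "technically up but blowing its latency budget" case an availability-only SLA cannot see.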
Negotiating realistic commitments against the failure environment
The final discipline is the one that protects the provider from their own sales team: never promise a tier of reliability the physical plant and the failure environment cannot deliver at a profit. The SLA is the output of the reliability model. You start from the design basis (the redundancy topology in Chapter 0.5, the facility tier in Chapter 12.1, the measured component AFRs in Chapter 14.3), run the availability-and-goodput model (Chapter 12.5), and only then write a commitment with margin between the modeled number and the promised number. Promise the modeled number with no margin, and a single bad-luck month, well within the variance the Monte Carlo model predicts, turns into a service-credit hit you did not price.
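The effect of margin can be illustrated with a toy Monte Carlo. This is not the Chapter 12.5 model: it assumes exponentially distributed times between interruptions and a fixed loss per interruption, and the MTBF, loss, and margin values are hypothetical inputs chosen only to show the shape of the result:

```python
import random
from typing import List


def simulate_monthly_goodput(rng: random.Random, mtbf_hours: float,
                             loss_hours_per_event: float,
                             month_hours: float = 720.0) -> float:
    """One simulated month: exponential inter-arrival times, fixed loss per event."""
    elapsed, lost = 0.0, 0.0
    while True:
        elapsed += rng.expovariate(1.0 / mtbf_hours)
        if elapsed >= month_hours:
            break
        lost += loss_hours_per_event
    return max(0.0, 1.0 - lost / month_hours)


def breach_probability(promised_goodput: float, samples: List[float]) -> float:
    """Fraction of simulated months that fall below the promised goodput."""
    return sum(g < promised_goodput for g in samples) / len(samples)


rng = random.Random(12)  # fixed seed so the sketch is repeatable
months = [simulate_monthly_goodput(rng, mtbf_hours=5.0, loss_hours_per_event=0.3)
          for _ in range(2000)]

modeled_mean = sum(months) / len(months)
p_breach_no_margin = breach_probability(modeled_mean, months)
p_breach_with_margin = breach_probability(modeled_mean - 0.02, months)  # 2-point margin
```

Promising the modeled mean means breaching in a large share of months by construction, since roughly half of the simulated months land below their own average; the margin is what buys the breach rate down.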
The failure environment is harsher than the marketing instinct assumes. At the cluster scale where these SLAs apply, interruptions are routine, not exceptional: a 16,384-GPU run logged an interruption roughly every three hours, 78% hardware-attributed; best-in-class MTBF is ~7 days per 512 GPUs, which at 16,000 GPUs works out to a hardware event roughly every five hours by construction. A goodput contract that promises 96% to a tenant is promising to absorb that entire failure environment inside a 4-point badput budget, which is achievable only if the operational reliability machinery (health-checks, hot spares, fast checkpointing) is genuinely best-in-class, and suicidal if it is not. The negotiation move for the provider is to set the goodput floor at the number the model supports with margin (often 90–93%, not 96%), reserve the 96% for a premium tier backed by extra hot spares and dedicated checkpointing, and price the difference. The move for the customer is to demand the measurement and attribution regime, because a high number with weak attribution is worth less than a modest number with airtight badput accounting.
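The scaling behind the MTBF figure above is simple enough to check by hand. A minimal sketch, assuming failures are independent so the interruption rate grows linearly with GPU count (the function name is hypothetical):

```python
def cluster_interruption_interval_hours(mtbf_days_per_block: float,
                                        gpus_per_block: int,
                                        cluster_gpus: int) -> float:
    """Expected hours between hardware events, assuming rate scales with GPU count."""
    if gpus_per_block <= 0 or cluster_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return mtbf_days_per_block * 24.0 * gpus_per_block / cluster_gpus


interval_16k = cluster_interruption_interval_hours(7, 512, 16_000)  # about 5.4 h
events_per_month = 720.0 / interval_16k  # about 134 events in a 30-day month
```

Under these assumptions even a best-in-class fleet faces well over a hundred hardware events a month at this scale, so the goodput floor is determined far more by the cost of each recovery than by the hope of avoiding events.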