
Chapter 12.4

SLAs, Goodput Contracts & Availability Commitments

The SLA is where the reliability rethink becomes a number on a contract — and the operator who promises facility availability for a workload that is actually paying for goodput has signed a contract that is simultaneously unenforceable by the customer and unprofitable for the provider.

GOODPUT · POWER-BOUND · DENSITY-RAMP

What you'll decide here

Whether you are committing to availability (the facility is energized and reachable) or to goodput (the customer's job makes effective forward progress) — the two diverge sharply for AI workloads and govern entirely different penalty mechanics.
The measurement basis and attribution rules for any goodput or job-success commitment: what counts as badput, who owns each badput class (provider vs tenant vs force majeure), and from what baseline the shortfall is computed.
The shape of the service-credit ladder — linear vs stepped, capped vs uncapped, credits-only vs termination rights — and the maximum monthly exposure you are willing to underwrite against your own failure environment.
Which acceptance/commissioning gate establishes the contractual goodput baseline, so the SLA is measured against a number both parties signed at go-live rather than against marketing.
How the customer-facing commitment is reconciled against the redundancy design-basis and the modeled availability — never promise a tier of continuity the physical plant and the failure environment cannot deliver at a profit.

An SLA is a reliability model with money attached. Everything in Part 12 up to this point — the goodput-vs-availability rethink (Chapter 12.2), the redundancy primer (Chapter 0.5), the facility tier standards (Chapter 12.1) — exists in the engineering domain, where being wrong costs an outage. The SLA drags all of it into the commercial domain, where being wrong costs a service credit, a churned tenant, or a take-or-pay dispute. This chapter is about the translation: how a failure environment becomes a promise, how that promise is measured and attributed, and how the penalty structure is shaped so that it disciplines the provider without bankrupting them on a bad month.

The recurring fork is the one Part 12 has been building toward. Availability asks: was the facility energized, cooled, and reachable? Goodput asks: did the customer's job make effective forward progress? For a web service those two questions have nearly the same answer. For an AI workload they diverge violently — a cluster can be 100% available at the facility meter and delivering 70% goodput because a single GPU's silent data corruption is poisoning every synchronous step, or be down for ninety seconds of cooling-pump ride-through and lose nothing because the job checkpointed cleanly. The SLA that measures the wrong one of these is worse than no SLA: it gives the customer a number that does not track their pain and gives the provider an exposure that does not track their control.

What is actually being promised: availability vs goodput

The legacy data-center SLA promises availability — a Monthly Uptime Percentage measured at the facility or the instance, with a service-credit ladder if it falls short. This is the world of the Uptime tiers: Tier III at 99.982% (~1.6 hr/yr of downtime), Tier IV at 99.995% (~26 min/yr), each "nine" a downtime budget the operator commits not to exceed. It is a clean, well-understood, auditable promise, and for online inference it is still the right primary metric: an inference endpoint that is unreachable is lost revenue and a breached latency SLO, and the customer experiences facility downtime directly. → serving-engineering SLOs in Chapter 10.11.
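The downtime budgets quoted above follow directly from the percentages. A minimal sketch of the conversion, assuming a 365-day year (the function name is illustrative, not taken from any standard):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at a given availability percentage."""
    return (1.0 - availability_pct / 100.0) * period_hours


tier_iii_hours = downtime_budget_hours(99.982)        # about 1.58 hours per year
tier_iv_minutes = downtime_budget_hours(99.995) * 60  # about 26.3 minutes per year
```

Passing `period_hours=720` gives the monthly budget that a Monthly Uptime Percentage is actually measured against.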

The training SLA is a different animal, because the customer is not buying uptime — they are buying effective compute. Goodput is the fraction of wall-clock GPU time that produces useful, retained forward progress, after subtracting every form of badput: failed steps that get rolled back, time lost to checkpoint/restore, stragglers throttling a synchronous collective, idle GPUs waiting on a re-scheduled node, and work discarded because it ran on a silently-corrupting device. Google's formalization is the cleanest reference: goodput is productive time divided by total time, with badput enumerated by category so each loss has an owner. The industry runs around 90% goodput on average; the best-in-class operators market ~96%; the reliability overhead that the gap represents is 6–21% of TCO. A 6-point goodput improvement on a 20,000-GPU cluster is worth more than several nines of facility availability the training job would never have noticed.
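As a rough illustration of the definition, goodput can be computed from a ledger of GPU-hours in which every badput category is a separate line. The category names and the monthly figures below are hypothetical, chosen only so the arithmetic lands on the 90% industry average cited above:

```python
def goodput(total_gpu_hours: float, badput_gpu_hours: dict[str, float]) -> float:
    """Productive GPU-time divided by total GPU-time."""
    productive = total_gpu_hours - sum(badput_gpu_hours.values())
    if total_gpu_hours <= 0 or productive < 0:
        raise ValueError("ledger is inconsistent with total GPU-hours")
    return productive / total_gpu_hours


def gpu_equivalents_gained(cluster_gpus: int, goodput_gain: float) -> float:
    """Effective GPUs recovered by raising goodput by `goodput_gain` (a fraction)."""
    return cluster_gpus * goodput_gain


# Hypothetical 30-day month on a 20,000-GPU cluster.
total = 20_000 * 720
ledger = {
    "failed_steps_rolled_back": 400_000,
    "checkpoint_restore": 500_000,
    "straggler_throttling": 300_000,
    "idle_waiting_on_reschedule": 150_000,
    "sdc_discarded_work": 90_000,
}
monthly_goodput = goodput(total, ledger)           # 0.90 with these inputs
recovered = gpu_equivalents_gained(20_000, 0.06)   # 1,200 GPU-equivalents
```

On these assumptions, a 6-point improvement is equivalent to adding about 1,200 GPUs of effective compute without buying any.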

The master SLA fork: availability commitment vs goodput contract

Decide this before you draft a single clause. An availability commitment is measured at the meter or the endpoint, is trivially auditable, and is the correct primary metric for online inference, because the customer feels facility downtime directly. A goodput contract is measured inside the job, requires an agreed instrumentation and attribution regime, and is the correct primary metric for training and large-scale RL, because the customer feels lost effective compute, not facility minutes. Promise availability to a training tenant and you have given them a number that stays green while their model FLOPs utilization (MFU) collapses; promise goodput to an inference tenant and you have signed up to underwrite their model-serving code. The most common contract in 2026 is a hybrid: a hard availability floor at the facility/node layer, plus a softer goodput or job-success target at the cluster layer, with distinct ladders and distinct attribution. Get the primary metric wrong and every downstream clause inherits the error.

Availability SLA vs goodput contract — what each actually commits

Dimension	Availability commitment	Goodput contract
What is promised	Facility/node/instance energized & reachable	Effective forward progress of the customer's job
Right for	Online inference, hosting, control plane	Pre-training, large-scale RL, long batch
Unit of measure	Monthly Uptime % (downtime minutes)	Goodput % = productive time / total time
Typical 2026 target	99.9% node / 99% rack (ClusterMAX baseline); 99.99% region (hyperscaler)	90% industry avg; ~96% best-in-class; negotiated floor often 90–95%
Measurement point	Meter, hypervisor, or endpoint probe — auditable	Inside the job: telemetry, checkpoint logs, step counters
Attribution difficulty	Low — outage is observable at the boundary	High — badput must be classed by owner (provider/tenant/force majeure)
Failure that breaches it	Power/cooling loss, network partition, node-down	SDC, straggler throttling, slow recovery, fabric BER, checkpoint stalls
Provider's lever to defend it	Redundancy (2N power, N+1 cooling)	Health-checking, hot spares, fast checkpoint/restore, node drain

Two genuinely different instruments. Most 2026 GPU-cloud contracts layer them: availability floor at the node/rack, goodput or job-success target at the cluster. Figures are 2026-current practitioner reference points; see the key numbers later in this chapter for sources.

The two columns reward different capital. An availability commitment is defended with redundancy (2N power, N+1 cooling, dual fabric), and its cost lands in the mechanical, electrical, and plumbing (MEP) construction budget: 2N can swing MEP cost 30–50% over N+1 and leaves ~50% of capacity idle. A goodput contract is defended with operational reliability: passive health-checks every few seconds, automatic node drain on a ~15% degradation-vs-golden-reference trigger, 3–5% hot spares standing by for a ~90-second node swap, and multi-tier checkpointing that cuts mean-time-to-recover from 15–30 minutes down to under 2 minutes. The mistake operators make is buying the first kind of insurance for a workload that needs the second. A training tenant does not value the 26-minute-per-year facility downtime budget of Tier IV; they value never losing a 40-minute checkpoint interval to a slow restart. → the goodput-vs-availability tradeoff curve in Chapter 12.2, quantified by the model in Chapter 12.5.
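To see why recovery time matters more to a training tenant than facility nines, consider a deliberately crude steady-state model: every interruption costs the restart time plus, on average, half a checkpoint interval of lost work. This is a back-of-envelope assumption, not the model from Chapter 12.5, and the three-hour interruption interval is borrowed from the Llama-3 figure quoted elsewhere in this chapter.

```python
def recovery_badput_fraction(minutes_between_interruptions: float,
                             restart_minutes: float,
                             checkpoint_interval_minutes: float = 0.0) -> float:
    """Approximate share of wall-clock time lost to interruptions.

    Assumes each interruption loses the restart time plus, on average,
    half a checkpoint interval of work that must be recomputed.
    """
    lost_per_event = restart_minutes + checkpoint_interval_minutes / 2.0
    return min(1.0, lost_per_event / minutes_between_interruptions)


# One interruption every 3 hours, restart time only.
slow_restart = recovery_badput_fraction(180, 30)  # about 16.7% of time lost
fast_restart = recovery_badput_fraction(180, 2)   # about 1.1% of time lost

# Same fast restart, but with a 40-minute checkpoint interval at risk.
fast_with_rollback = recovery_badput_fraction(180, 2, 40)
```

Under these assumptions the restart path alone moves goodput by roughly 15 points, which is the kind of difference no amount of facility redundancy buys back.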

Penalty and credit structures: the service-credit ladder

A commitment without a penalty is marketing. The penalty in a standard SLA is the service credit: a percentage of the monthly bill refunded (as credit, almost never cash) when the measured metric falls below a threshold. The ladder is a step function — miss the target band and you owe a credit; miss it badly and you owe more. The reference shape, from the hyperscaler compute SLAs that anchor the market, is a three-rung ladder: a small credit (≈10%) for slipping just under the target, a larger one (≈25%) for a material breach, and a near-total credit (≈100%) for a catastrophic month. The GPU-neocloud ladders mirror this, scaled to node and rack uptime: ClusterMAX baseline is 99.9% node / 99% rack uptime with penalties, and the dual-SLA pattern on NVL72-class racks typically pairs a node-level commitment with a separate rack-level one.

The design decisions inside the ladder are where the money and the disputes live.

Stepped vs linear: a stepped ladder is simple to administer but creates cliff incentives. A provider one minute the wrong side of a threshold owes the same as one an hour past it, which can perversely make them stop fighting an outage once a rung is lost. A linear (or finely-stepped) ladder tracks pain more honestly at the cost of administrative complexity.

Capped vs uncapped: nearly every provider caps total monthly credits (commonly at 100% of the monthly fee for the affected service), because an uncapped credit converts a bad-luck month into an existential loss and is uninsurable.

Credits vs termination: the customer's real remedy for chronic underperformance is not the credit, which rarely exceeds a month's fee, but a termination-for-repeated-breach right (e.g., three breaches in a rolling quarter). That is the clause that actually disciplines a provider, and the one to negotiate hardest on either side.

Service-credit ladder — reference structures and their consequences

Ladder design	How a shortfall is paid	Who it favors	Failure mode to watch
Stepped (3-rung: ~10% / ~25% / ~100%)	Fixed credit % per uptime band missed	Provider — simple, predictable exposure	Cliff incentive: no marginal reason to recover once a rung is lost
Finely-stepped / linear	Credit scales with downtime minutes or goodput gap	Customer — tracks actual harm	Administratively heavy; needs trusted measurement
Capped at 100% of monthly fee	Total credits never exceed the period's bill	Provider — bounds catastrophic months	Under-compensates a customer whose loss dwarfs the fee
Uncapped / multiplied credits	Penalty can exceed the fee	Customer — real teeth	Uninsurable for provider; rare outside bespoke take-or-pay
Credits-only	Refund as future credit, no cash, no exit	Provider — retains revenue & tenant	Toothless against chronic underperformance
Credits + termination-for-repeated-breach	Credit ladder plus exit right after N breaches/quarter	Customer — escape from a bad provider	Provider churn risk; the clause both sides fight over

Illustrative ladders synthesized from hyperscaler compute SLAs and GPU-neocloud terms (2026). Exact thresholds and percentages are negotiated; the structural fork is what matters.
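The stepped and linear shapes can be compared side by side in a few lines. In this sketch the band thresholds (99.9%, 99.0%, 95.0%) and the linear slope are hypothetical; only the rung sizes (roughly 10%, 25%, 100%) come from the reference ladder described above.

```python
# (uptime floor %, credit % of monthly fee), checked from the top band down.
LADDER = [(99.9, 0.0), (99.0, 10.0), (95.0, 25.0)]
CATASTROPHIC_CREDIT_PCT = 100.0


def stepped_credit_pct(uptime_pct: float) -> float:
    """Credit for the first band whose floor the measured uptime still meets."""
    for floor, credit in LADDER:
        if uptime_pct >= floor:
            return credit
    return CATASTROPHIC_CREDIT_PCT


def linear_credit_pct(uptime_pct: float, target_pct: float = 99.9,
                      credit_per_point: float = 20.0,
                      cap_pct: float = 100.0) -> float:
    """Credit proportional to the uptime shortfall, capped at `cap_pct`."""
    shortfall = max(0.0, target_pct - uptime_pct)
    return min(cap_pct, shortfall * credit_per_point)


def credit_amount(monthly_fee: float, credit_pct: float) -> float:
    """Convert a credit percentage into a currency amount."""
    return monthly_fee * credit_pct / 100.0
```

With these numbers, `stepped_credit_pct(99.89)` and `stepped_credit_pct(99.01)` return the same credit, which is the cliff incentive in miniature; the linear variant keeps charging for every further minute of downtime until the cap is reached.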

The exclusions are the SLA

Read the exclusions before the headline number. Every availability SLA carves out scheduled maintenance windows, force majeure, customer-caused outages, problems in the customer's own code or configuration, and — critically for AI — anything the provider can attribute to the tenant's job rather than the infrastructure. A 99.99% commitment with a generous maintenance-window carve-out and a broad "customer environment" exclusion can deliver materially less than a 99.9% commitment with tight exclusions. For goodput contracts this is sharper still: if the provider can class a straggler as "customer model code" rather than "hardware degradation," the badput leaves their side of the ledger entirely. The negotiation that matters is not the percentage on the cover — it is the attribution and exclusion language three pages in.
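A quick way to test a headline number is to add the excluded hours back in and ask what the customer could experience in the worst case while the SLA is still technically met. The sketch below assumes excluded time is simply not counted as downtime; the 8-hour maintenance carve-out is a hypothetical figure, and real contracts differ on whether excluded time is also removed from the denominator.

```python
def worst_case_experienced_availability(headline_pct: float,
                                        excluded_hours: float,
                                        period_hours: float = 720.0) -> float:
    """Lowest availability a customer can see without the SLA being breached.

    Permitted downtime = the headline's own budget plus every excluded hour.
    """
    budget_hours = (1.0 - headline_pct / 100.0) * period_hours
    permitted = budget_hours + excluded_hours
    return 100.0 * (1.0 - permitted / period_hours)


generous = worst_case_experienced_availability(99.99, excluded_hours=8.0)
tight = worst_case_experienced_availability(99.9, excluded_hours=0.0)
```

On these inputs the 99.99% contract with the carve-out permits a worse month than the 99.9% contract without one.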

Figure	What it measures	Year	Source
90% / ~96%	training goodput: industry average vs best-in-class marketed (CoreWeave); the gap the contract prices	2025	SemiAnalysis ClusterMAX 2.0 / CoreWeave
99.9% / 99%	GPU-cloud SLA baseline: node uptime / rack uptime, with penalties (ClusterMAX baseline)	2025	SemiAnalysis ClusterMAX
99.99% / 99.5%	hyperscaler compute SLA: multi-AZ region-level vs single-instance Monthly Uptime	2026	Amazon EC2 / Compute SLA
~10% / 25% / 100%	reference service-credit ladder rungs (% of monthly bill) as uptime falls through bands	2026	Amazon EC2 / Compute SLA
99.982% / 99.995%	Uptime Tier III vs Tier IV availability (~1.6 hr vs ~26 min downtime/yr); Uptime now disavows the %	2025	Uptime Institute Tier Standard
~7 days	best-in-class H100 MTBF per 512 GPUs; the failure environment any cluster SLA is written against	2025	SemiAnalysis (100k H100 clusters)
~1 / 3 hr	Llama-3 405B interruption rate (16,384 H100, 54 days): 466 interruptions, 78% hardware	2024	Meta Llama 3 Herd of Models
6–21%	reliability overhead as a share of cluster TCO; the cost of closing the goodput gap	2025	SemiAnalysis ClusterMAX

Measuring and attributing the shortfall: goodput accounting and badput

An availability metric is observable at a boundary — the meter trips, the endpoint stops answering, and the downtime minutes are not in serious dispute. A goodput metric is the opposite: it is measured inside the customer's job, where the provider and the tenant share the failure surface, and the entire commercial value of the contract turns on attribution — deciding, for every minute of badput, whose fault it was. This is the hardest part of an AI SLA and the part most contracts get dangerously vague on.

The contractual measurement basis is goodput accounting: instrument the job to produce a defensible ledger of total GPU-time, productive GPU-time, and each class of badput. The badput taxonomy is the heart of it, and each class has a natural owner. Hardware badput — a failed GPU, an HBM error, an SDC event auto-drained by the health-checker — is the provider's: it is their silicon and their fleet management. Recovery badput — checkpoint/restore latency, node-swap time, re-scheduling delay — is shared, and the split depends on whether the provider supplied the checkpointing stack or the tenant did. Workload badput — a tenant's inefficient parallelism, a bad hyperparameter that diverges, a job that simply ran slow — is the tenant's, and the provider must be able to fence it out or they are underwriting the customer's ML engineering. The contract must name these classes, name the attribution method (whose telemetry, what arbitration if the logs disagree), and name the baseline.
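One way to picture the ledger is as a list of badput events, each tagged with a class, rolled up by owner. This is a minimal sketch: the class names mirror the taxonomy above, while the `BadputEvent` type, the owner labels, and the recovery split are hypothetical and would be fixed by the contract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Natural owner for each badput class; "shared" is split by contract.
OWNER_BY_CLASS = {
    "hardware": "provider",
    "recovery": "shared",
    "workload": "tenant",
    "force_majeure": "excluded",
}


@dataclass(frozen=True)
class BadputEvent:
    badput_class: str
    gpu_hours: float


def attribute(events: list[BadputEvent],
              provider_share_of_recovery: float = 0.5) -> dict[str, float]:
    """Total badput GPU-hours per owner, splitting shared classes by the agreed share."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        owner = OWNER_BY_CLASS[event.badput_class]  # unknown class raises KeyError
        if owner == "shared":
            totals["provider"] += event.gpu_hours * provider_share_of_recovery
            totals["tenant"] += event.gpu_hours * (1.0 - provider_share_of_recovery)
        else:
            totals[owner] += event.gpu_hours
    return dict(totals)


events = [
    BadputEvent("hardware", 1000.0),
    BadputEvent("recovery", 400.0),
    BadputEvent("workload", 250.0),
    BadputEvent("force_majeure", 50.0),
]
# Provider supplied the checkpointing stack, so it carries most of recovery.
by_owner = attribute(events, provider_share_of_recovery=0.75)
```

Only the provider's total would feed the credit calculation; raising an error on an unclassified event is deliberate, since an unnamed badput class is exactly where disputes start.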

Deep dive: badput attribution and the silent-data-corruption problem

The clean cases are easy. A node hard-fails, the health-checker drains it, the job restarts from the last checkpoint — that is provider hardware badput, the minutes are logged, and the credit is owed. The pathological case, and the one that makes goodput contracts genuinely hard to write, is silent data corruption: a GPU that produces wrong results without erroring, poisoning gradients across a synchronous run until someone notices the loss curve has gone strange. Two attribution problems collide here. First, detection lag: the corruption may have been retained into checkpoints for hours before discovery, so the badput is not the few minutes to swap the bad device but the entire window of poisoned work that must be rolled back. Second, blame ambiguity: a diverging loss curve looks identical whether the cause is the provider's faulty silicon or the tenant's unstable training recipe, and the only thing that disambiguates it is reference-comparison telemetry the provider must run continuously and the tenant must agree to trust.

This is why the operational practices and the contract are inseparable. The SLA's goodput floor is only defensible if the provider runs the machinery that produces clean attribution: passive health-checks every few seconds, periodic deep node diagnostics (DCGM-class), SDC-detection via golden-reference comparison, and the auto-drain trigger at ~15% degradation. Without that instrumentation, every badput dispute degenerates into a finger-pointing exercise the provider loses (because they cannot prove it was the tenant) or the tenant loses (because they cannot prove it was the provider). The acceptance-gate fingerprint — the all-reduce busbw at ~92% of theoretical, the per-node nvbandwidth numbers, the fabric BER floor — is what both sides point back to when they disagree. → commissioning fingerprint capture in Chapter 13.2; the failure-rate inputs that set the badput baseline in Chapter 14.3.
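The auto-drain trigger reduces to a comparison against the golden reference captured at acceptance. A minimal sketch, assuming a single scalar metric per node (for example a bandwidth figure); the 400 GB/s reference value is hypothetical, and a production health-checker would combine several signals rather than one.

```python
def degradation(measured: float, golden_reference: float) -> float:
    """Fractional shortfall of a measured metric against its golden reference."""
    if golden_reference <= 0:
        raise ValueError("golden reference must be positive")
    return (golden_reference - measured) / golden_reference


def should_drain(measured: float, golden_reference: float,
                 max_degradation: float = 0.15) -> bool:
    """True when the node has degraded by at least the drain threshold."""
    return degradation(measured, golden_reference) >= max_degradation


golden_busbw_gbps = 400.0  # hypothetical per-node reference from commissioning
drain_slow_node = should_drain(330.0, golden_busbw_gbps)     # 17.5% below reference
drain_healthy_node = should_drain(360.0, golden_busbw_gbps)  # 10% below reference
```

Logging the measured value and the reference alongside every drain decision is what later turns the event into attributable provider badput rather than an argument.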

Tying the SLA to the commissioning baseline

A goodput contract measured against "the industry says ~90%" is a contract measured against nothing — it invites a dispute the day the first shortfall is claimed. The discipline that makes it enforceable is to anchor the SLA to a baseline captured at go-live: the commissioning process produces a quantitative fingerprint of the as-built cluster, and that fingerprint becomes the contractual reference the SLA is measured against. The acceptance gate is not just a build checkpoint — it is the moment the SLA's denominator is fixed.

Concretely, the go-live fingerprint that the SLA should cite includes the NCCL all-reduce busbw the fabric actually achieved (~92% of theoretical scaling from two nodes to the full cluster is the acceptance norm), the per-node intra-node bandwidth, the fabric bit-error-rate floor, the burn-in soak result (72–168 hours, designed to remove ~98% of infant-mortality failures before tenant handoff), and the measured goodput on a representative reference job. Writing the SLA against this number — "goodput shall not fall below 95% of the commissioned baseline reference" — converts an unfalsifiable marketing claim into an auditable commitment with an agreed starting point. It also protects the provider: a tenant who later runs a pathological workload cannot claim the cluster regressed, because the baseline was established on a known-good reference job both parties signed. → acceptance scripts and pass/fail gates in Chapter 13.2.
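A baseline-relative clause of this kind is straightforward to evaluate once both numbers exist. In the sketch below the 95% ratio comes from the example clause above, while the commissioned baseline of 0.94 and the function name are hypothetical.

```python
def goodput_shortfall(measured_goodput: float,
                      commissioned_baseline: float,
                      floor_ratio: float = 0.95) -> float:
    """Goodput shortfall (as a fraction) below the contractual floor.

    The floor is `floor_ratio` times the baseline both parties signed at go-live.
    Returns 0.0 when the measured goodput meets the floor.
    """
    floor = commissioned_baseline * floor_ratio
    return max(0.0, floor - measured_goodput)


baseline = 0.94                                    # hypothetical go-live reference
breach = goodput_shortfall(0.88, baseline)         # floor is 0.893, so 0.013 short
compliant = goodput_shortfall(0.90, baseline)      # above the floor, no shortfall
```

The shortfall, not the raw goodput, is what a finely-stepped credit ladder would then be applied to.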

Mapping commitments to productization and serving SLOs

The SLA does not live in a vacuum — it is the contractual face of two things engineered elsewhere. Upstream of the customer, it is the productization of capacity: the service tiers, the pricing, the reserved-vs-on-demand structure, and the onboarding commitments that turn a cluster into a sellable product. A reserved or take-or-pay commitment justifies a stronger SLA (the customer is locked in, so the provider can underwrite more); an on-demand spot tier carries little or no availability promise by design. The SLA tier and the commercial tier must be co-designed or they contradict each other. → customer delivery and productization in Chapter 10.9.

Downstream, for inference, the availability SLA is only the outer envelope; the metric the customer actually experiences is the serving SLO — time-to-first-token, time-per-output-token, p99 latency under load. A 99.99% availability commitment is worthless to an inference tenant if the endpoint is technically "up" but blowing its latency budget during every traffic peak. The contract must therefore reconcile the facility-availability layer with the serving-engineering layer: availability is necessary but not sufficient, and a mature inference SLA commits to both an uptime floor and a latency-SLO floor, with separate credits. → serving-engineering SLOs and latency budgets in Chapter 10.11.
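The two floors can be assessed independently, so that an endpoint that is up but slow still triggers a credit. This is a sketch under stated assumptions: the 99.9% uptime floor, the 500 ms p99 time-to-first-token budget, and the flat 10% credits are hypothetical placeholders, and the percentile uses the simple nearest-rank method.

```python
import math


def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latency samples."""
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def inference_credits(uptime_pct: float, ttft_samples_ms: list[float],
                      uptime_floor_pct: float = 99.9,
                      ttft_p99_budget_ms: float = 500.0,
                      uptime_credit_pct: float = 10.0,
                      latency_credit_pct: float = 10.0) -> tuple[float, float]:
    """Return (uptime credit, latency credit) as % of the monthly fee."""
    uptime_credit = uptime_credit_pct if uptime_pct < uptime_floor_pct else 0.0
    breached = p99(ttft_samples_ms) > ttft_p99_budget_ms
    latency_credit = latency_credit_pct if breached else 0.0
    return uptime_credit, latency_credit


# Endpoint was "up" all month, but 5% of requests blew the latency budget.
samples = [120.0] * 95 + [900.0] * 5
credits = inference_credits(99.99, samples)
```

Here the uptime floor is met and only the latency credit is owed, which is the case an availability-only SLA would miss entirely.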

Negotiating realistic commitments against the failure environment

The final discipline is the one that protects the provider from their own sales team: never promise a tier of reliability the physical plant and the failure environment cannot deliver at a profit. The SLA is the output of the reliability model. You start from the design-basis: the redundancy topology (Chapter 0.5), the facility tier (Chapter 12.1), and the measured component annualized failure rates, or AFRs (Chapter 14.3). You then run the availability-and-goodput model (Chapter 12.5), and only then write a commitment with margin between the modeled number and the promised number. Promise the modeled number with no margin and a single bad-luck month, well within the variance the Monte Carlo model predicts, turns into a service-credit hit you did not price.
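The gap between the modeled number and the promised number can be illustrated with a toy Monte Carlo run. This is not the Chapter 12.5 model: it assumes exponentially distributed gaps between interruptions, a fixed restart time, and uniformly distributed lost work within a checkpoint interval, and every default parameter is a hypothetical input. The promise is then set at a low percentile of the simulated months instead of at the mean.

```python
import random


def simulate_monthly_goodput(rng: random.Random, mtbf_hours: float = 3.0,
                             restart_hours: float = 2 / 60,
                             checkpoint_interval_hours: float = 40 / 60,
                             month_hours: float = 720.0) -> float:
    """One simulated month: goodput after restart time and rolled-back work."""
    lost = 0.0
    t = rng.expovariate(1.0 / mtbf_hours)  # time of first interruption
    while t < month_hours:
        # Each interruption costs a restart plus work since the last checkpoint.
        lost += restart_hours + rng.uniform(0.0, checkpoint_interval_hours)
        t += rng.expovariate(1.0 / mtbf_hours)
    return max(0.0, 1.0 - lost / month_hours)


def promise_with_margin(n_months: int = 2000, percentile: int = 5,
                        seed: int = 0) -> tuple[float, float]:
    """Return (mean modeled goodput, goodput at the given low percentile)."""
    rng = random.Random(seed)
    runs = sorted(simulate_monthly_goodput(rng) for _ in range(n_months))
    modeled_mean = sum(runs) / len(runs)
    promised = runs[(percentile * len(runs)) // 100]
    return modeled_mean, promised


modeled_mean, promised = promise_with_margin()
margin_points = (modeled_mean - promised) * 100
```

Promising the 5th-percentile figure means roughly one month in twenty still breaches under this toy model; the credit ladder is what should be sized to absorb those months.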

The failure environment is harsher than the marketing instinct assumes. At the cluster scale where these SLAs apply, interruptions are routine, not exceptional: a 16,384-GPU run logged an interruption roughly every three hours, 78% hardware-attributed; best-in-class MTBF is ~7 days per 512 GPUs, which, if failures scale linearly with GPU count, works out to a hardware event roughly every five hours at 16,000 GPUs. A goodput contract that promises 96% to a tenant is promising to absorb that entire failure environment inside a 4-point badput budget, which is achievable only if the operational reliability machinery (health-checks, hot spares, fast checkpointing) is genuinely best-in-class, and suicidal if it is not. The negotiation move for the provider is to set the goodput floor at the number the model supports with margin (often 90–93%, not 96%), reserve the 96% for a premium tier backed by extra hot spares and dedicated checkpointing, and price the difference. The move for the customer is to demand the measurement and attribution regime, because a high number with weak attribution is worth less than a modest number with airtight badput accounting.
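The scaling arithmetic behind those intervals is short enough to check by hand. The sketch assumes failures are independent and the failure rate grows linearly with GPU count, which is a simplification; the function names are illustrative.

```python
def cluster_mtbf_hours(mtbf_days_per_block: float, block_gpus: int,
                       cluster_gpus: int) -> float:
    """Expected hours between hardware events when MTBF is quoted per GPU block."""
    return mtbf_days_per_block * 24 * block_gpus / cluster_gpus


def observed_interval_hours(run_days: float, interruptions: int) -> float:
    """Mean hours between interruptions over an observed run."""
    return run_days * 24 / interruptions


best_in_class = cluster_mtbf_hours(7, 512, 16_000)  # about 5.4 hours
llama3_run = observed_interval_hours(54, 466)       # about 2.8 hours
```

Even the best-in-class figure implies more than a hundred hardware events in a 30-day month at this scale, which is the environment the goodput floor has to survive.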

The SLA is downstream of the model, never upstream of it

The costliest SLA mistake is writing the commitment first and hoping the plant delivers it. The correct order is: design-basis → failure environment → availability/goodput model → modeled number → promised number with margin → credit ladder sized to the variance. An SLA written this way is a number you can defend on a bad month; an SLA written by working backward from what the customer wants to hear is a structural short position on your own reliability. When the modeled number and the number sales wants to print diverge, the answer is to spend on redundancy or operational reliability until the model supports the promise — or to lower the promise. It is never to print the number and hope. → the quantitative engine in Chapter 12.5.

This chapter is the commercial endpoint of the Part 12 reliability arc. The availability-vs-goodput reframing it commits to contract is argued in Chapter 12.2; the redundancy design-basis the commitments are written against is the primer in Chapter 0.5 and the facility standards in Chapter 12.1; the geographic-failover obligations that feed DR-tier SLA terms are in Chapter 12.3; and the quantitative availability-and-goodput model that produces the number you are allowed to promise is in Chapter 12.5. The commissioning fingerprint that sets the contractual baseline is captured in Chapter 13.2; the component failure rates that define the badput environment come from Chapter 14.3; productization and customer delivery are in Chapter 10.9; and the serving-engineering SLOs that sit beneath an inference availability commitment are in Chapter 10.11.