Chapter 1.5

Edge Inference & Distributed Micro-Datacenters

Edge inference inverts every default of a centralized AI build — you stop chasing the cheapest megawatt and start chasing the closest one to the user — and the single decision that governs whether the inversion pays for itself is the latency budget, not the GPU.

POWER-BOUND · GOODPUT

What you'll decide here

Whether your workload has a latency budget that a centralized region physically cannot meet — because if a sub-50 ms SLO is not real, the edge is a more expensive way to do something a region does better.
Which edge tier (on-prem appliance, telco/MEC node, Tier-2 metro colo, CDN-adjacent grid) matches your latency budget, data-gravity, and operational reach — they are not interchangeable.
The power, thermal, and form-factor envelope each micro-site can actually accept — a few kW to ~30 kW, air- or sealed-loop-cooled, with no on-site staff — and therefore which models can even run there.
Whether you can operate hundreds of lights-out sites with zero-touch provisioning and remote remediation — because the edge's cost is operational, not capital, and an un-automatable fleet is a stranded one.
When the edge is simply wrong: when the latency SLO is soft, utilization is thin, and a centralized region with a CDN front-end beats a distributed fleet on every line of the TCO.

Every other archetype in Part 1 chases power. Pre-training chases the cheapest stranded megawatt in the coldest climate; batch inference chases curtailable off-peak load; even online inference, for all its latency talk, will happily sit in a regional hyperscale campus an hour from the nearest user. Edge inference is the one archetype that inverts the siting hierarchy. Its binding constraint is not the price of a megawatt but the distance to a human being. That single inversion — proximity over cost — cascades into a completely different building: small, distributed, power-starved, thermally constrained, and operated with no one in the room. Get the inversion right and you unlock workloads a region cannot serve at any price. Get it wrong — invert when the latency budget was never real — and you have built the most expensive possible way to do something a centralized region does better and cheaper.

This chapter is about that fork and its consequences. We define the edge as a tiered topology, not a single thing — on-prem appliance, telco/MEC node, Tier-2 metro colo, CDN-adjacent grid — because the tiers trade latency, cost, and operational reach against each other and choosing the wrong tier is its own mis-scope. We anchor everything to the latency budget and the 30/50/100 ms perceptibility thresholds, because that budget is the master variable here exactly as the archetype was in Chapter 1.1. We then trace the consequences the inversion forces: a tight power/thermal/form-factor envelope, lights-out fleet operations with zero-touch provisioning, and an economics question — edge versus centralized regional inference — whose answer is "centralized" far more often than edge enthusiasts admit.

Defining the edge: a topology, not a place

"The edge" is the most abused word in this guide. It is a gradient running from the user's premises back toward the centralized region, and at each step you trade lower latency and better data-gravity for less power, worse economies of scale, and harder operations. Four tiers cover almost every real deployment, and they are genuinely different design problems.

On-prem appliance. A GPU box or sealed micro-rack inside the user's own building — a factory floor, a hospital, a retail back room, a trading desk, a forward military node. Latency is essentially zero (no WAN hop), data never leaves the premises (the data-sovereignty and air-gap case), but power is whatever the building's existing electrical service spares, cooling is ambient or a sealed loop, and there is no IT staff. This is the tier for hard real-time control and for data that legally or contractually cannot traverse a network.

Telco / MEC node. Compute placed inside the mobile operator's network — at the cell-site aggregation point, the central office, or a regional breakout — under the ETSI Multi-access Edge Computing (MEC) framework. This is the tier that 5G URLLC was built to feed: round-trip times under ~10 ms to the device when compute sits at the access edge, under ~50 ms from a regional breakout. The catch is that telco real estate is space-, power-, and cooling-constrained by design (these rooms were sized for switching gear, not GPUs), and the operator owns the landlord relationship.

Tier-2 metro colo. A conventional colocation hall in a second-tier metro — not Ashburn or Santa Clara, but the dozens of mid-size cities where a region does not exist but users do. This is the workhorse tier for latency-sensitive inference at meaningful scale: real power (hundreds of kW to a few MW), real cooling, real operations staff within driving distance, and 10–30 ms reach to a metro's users. It is "edge" only relative to the hyperscale region; physically it is a normal data center placed where the users are.

CDN-adjacent grid. Inference injected into the existing content-delivery footprint: hundreds or thousands of points of presence (PoPs) already positioned for sub-30 ms reach to nearly every populated area. The CDN tier inherits a ready-made distribution network and an anycast routing layer, but each PoP is tiny (a few racks at most), so only small, heavily quantized models fit, and the operator is renting someone else's footprint with someone else's power and cooling limits.

The edge tiers — latency, power, and the operational reality of each

Edge tier	Typical RT latency	Power / footprint envelope	Cooling	Operations model	Best-fit workload
On-prem appliance	Sub-5 ms (no WAN hop)	A few kW; single box to a sealed micro-rack	Ambient air or sealed liquid loop; no facility water	Lights-out; vendor-managed or zero-touch	Hard real-time control; air-gapped / sovereign data
Telco / MEC node	Sub-10 ms (access edge) to ~50 ms (regional breakout)	Severely constrained: a few kW to ~30 kW per site	Air; retrofit of switching-room cooling	Operator-managed; zero-touch at scale	5G URLLC; AR/VR; agentic / interactive on mobile
Tier-2 metro colo	~10–30 ms (intra-metro)	Real: hundreds of kW to a few MW	Air, rear-door, or DLC by density	Staffed within driving distance; remote-first	Latency-sensitive inference at metro scale
CDN-adjacent grid	Sub-30 ms (anycast to most users)	Tiny per PoP: a few racks; aggregate is large	Whatever the PoP already has; air	Lights-out; managed by the CDN operator	Small / quantized models; cache-and-serve, RAG front-ends

Latency figures are typical round-trip to the user, dominated by physical distance (≈0.82 ms per 100 mi each way in fiber) plus access-network and processing overhead. Power and footprint are practitioner ranges, 2026. MEC = ETSI Multi-access Edge Computing.

The master fork: is the latency budget physical, or is it a preference?

There are two kinds of latency requirement, and only one of them justifies the edge. A physical budget is one a centralized region cannot meet because the speed of light forbids it: a 5 ms control loop, a sub-50 ms p99 for an AR overlay, a robot that cannot wait 80 ms for a region 600 miles away. A preference budget is one where "faster is nicer" but the region already clears the SLO — most chat, most retrieval, most batch-shaped work. Edge pays for itself only against a physical budget. Against a preference budget you are paying the edge's full penalty — thin utilization, tiny footprints, fleet operations overhead — to shave milliseconds nobody's SLA requires. Decide which kind of budget you actually have before you distribute a single GPU. It is the difference between unlocking a workload and stranding capital across a hundred sites.

Latency budgets and the 30/50/100 ms thresholds

The latency budget is the edge's master variable, so it is worth being precise about what is in it and where the thresholds come from. End-to-end user-perceived latency is a sum: network round-trip (dominated by physical distance), access-network overhead (the last mile — 5G, Wi-Fi, fixed broadband), queuing and processing at each hop, and the inference time itself (time-to-first-token plus enough decode to be useful). The edge can only attack the first term. If inference time alone blows the budget, moving the GPU closer to the user does nothing — you needed a smaller model or a faster accelerator, not a different site.
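The sum described above can be written down directly. The following is a minimal sketch, not a measurement tool: the class name `LatencyBudget`, its field names, and every example number are hypothetical and chosen only to illustrate that siting attacks one term of the sum.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """End-to-end budget and the terms that siting cannot change (all in ms)."""
    slo_ms: float        # user-perceived latency target
    access_ms: float     # last-mile overhead (5G, Wi-Fi, fixed broadband)
    queuing_ms: float    # queuing and processing across hops
    inference_ms: float  # time-to-first-token plus enough decode to be useful

    def network_headroom_ms(self) -> float:
        """Round-trip time left for distance after the fixed terms are paid."""
        return self.slo_ms - (self.access_ms + self.queuing_ms + self.inference_ms)

    def siting_can_help(self) -> bool:
        """Moving the GPU closer only helps if some headroom remains."""
        return self.network_headroom_ms() > 0


# Hypothetical interactive workload: 50 ms SLO with 25 ms of inference.
interactive = LatencyBudget(slo_ms=50, access_ms=10, queuing_ms=5, inference_ms=25)
print(interactive.network_headroom_ms(), interactive.siting_can_help())

# Hypothetical case where inference alone exceeds a 30 ms budget.
too_slow = LatencyBudget(slo_ms=30, access_ms=5, queuing_ms=2, inference_ms=40)
print(too_slow.network_headroom_ms(), too_slow.siting_can_help())
```

In the second case the headroom is negative before any distance is counted, which is the situation where a smaller model or a faster accelerator is the fix and a different site is not.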

The physical floor is unforgiving and worth memorizing: light in fiber travels at roughly two-thirds of c, about 0.82 ms per 100 miles each way, so a user 600 miles from the nearest region pays ~10 ms round-trip on distance alone, before a single packet is queued or a single token generated. That is the number that makes or breaks the edge case. The three perceptibility thresholds that recur in the literature map to three classes of experience:

~100 ms — the ceiling for an interaction to feel "instant" and for total user-facing response to preserve natural conversational flow. Above it, the system feels laggy but usable. This is the loosest threshold and the one most regions already meet for nearby users.
~50 ms — the working budget for genuinely interactive experiences: AR/VR overlays, live translation, responsive agentic loops, cloud-rendered interfaces. A 50 ms p99 is the SLO most often cited as the line where edge placement starts to earn its keep, because it leaves little room once access-network and inference time are subtracted.
~30 ms and below — hard real-time: industrial control, autonomous-vehicle perception assist, haptic/tele-operation, safety interlocks. At this budget the compute must be on-prem or at the access edge; a regional round-trip is physically excluded for any user not sitting next to the region.

The consequence chain is direct: the threshold you commit to sets the maximum tolerable distance to the user, which sets the minimum number of sites, which sets the whole economics. A 100 ms budget might be served from a handful of regions plus a CDN front-end. A 30 ms budget for a national user base can force dozens to hundreds of micro-sites — and that site count, not the GPU bill, is what dominates edge TCO.
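The chain from budget to distance to site count can be sketched numerically. This example uses the 0.82 ms per 100 miles figure quoted in this chapter; the function names, the disc-coverage approximation, and the service-area value are assumptions for illustration, and the resulting site count is a lower bound only.

```python
import math

FIBER_MS_PER_100_MI_ONE_WAY = 0.82  # one-way fiber figure quoted in this chapter


def round_trip_ms(distance_mi: float) -> float:
    """Distance-only round-trip latency over fiber, before any processing."""
    return 2 * FIBER_MS_PER_100_MI_ONE_WAY * distance_mi / 100


def max_radius_mi(network_rtt_budget_ms: float) -> float:
    """Farthest a site can sit from a user for a distance-only RTT budget."""
    return network_rtt_budget_ms / (2 * FIBER_MS_PER_100_MI_ONE_WAY) * 100


def min_sites(service_area_sq_mi: float, radius_mi: float) -> int:
    """Lower bound on site count: area divided by one site's coverage disc.

    Assumption: ignores fiber route detours, overlap between discs, and
    uneven population, all of which raise the real count.
    """
    return math.ceil(service_area_sq_mi / (math.pi * radius_mi ** 2))


print(f"{round_trip_ms(600):.1f} ms round-trip at 600 mi")
print(f"{max_radius_mi(5):.0f} mi reach for a 5 ms distance budget")
# Hypothetical 3,000,000 sq mi service area covered at a 100 mi radius.
print(min_sites(3_000_000, 100), "sites at minimum")
```

Shrinking the radius by half quadruples the lower bound, which is why a tighter threshold moves the site count much faster than it moves the latency.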

~0.82 ms / 100 mi

one-way fiber latency from distance alone (~5 ms per 1,000 km); ~1.64 ms RT per 100 mi before any processing

2025 · M2 Optics fiber-latency analysis (≈2/3 c in glass)

sub-10 ms

MEC round-trip at the access edge; under ~50 ms from a regional 5G URLLC breakout

2025 · ETSI ISG MEC; arXiv 2504.03708 (telco-LLM latency)

~30 / 50 / 100 ms

perceptibility thresholds: hard real-time / interactive (AR-VR, agentic) / 'instant' conversational

2026 · Spheron hybrid edge guide; AR/VR latency literature

~$40B → ~$106B

edge data center market, 2026 to 2033, ~14.9% CAGR; AI/ML inference the fastest-growing segment

2026 · Grand View Research; Coherent Market Insights

~35% / ~54%

micro data centers' share of the edge market (global 2025) / of US edge by 2026

2026 · Grand View Research; Coherent Market Insights (US)

~2/3

inference share of AI compute in 2026 (½ in 2025); the growth pool the edge competes for

2026 · Deloitte TMT Predictions 2026

~1 hr / 90%+

edge-site deploy time and install-time reduction under zero-touch provisioning (Vapor IO; ZTP fleet tooling)

2026 · Vapor IO; Scale Computing / VMware VCF Edge

a few kW – ~30 kW

practical power envelope per edge micro-site (vs ~132 kW for a centralized NVL72 rack)

2026 · research/domain-research.json; practitioner ranges

Power, thermal, and form factor: the edge's hard ceiling

The inversion that defines the edge — proximity over cost — has a cruel corollary: the places closest to users are the worst places to put power and heat. A hyperscale campus is sited because it has cheap, abundant megawatts and room for a cooling plant. A cell-site cabinet, a retail back room, a CDN PoP, or a Tier-2 colo suite has none of those. The edge is therefore power-bound at the micro scale in exactly the way the industry is power-bound at the macro scale — but the binding number is kilowatts, not megawatts, and it is set by a building someone else designed for something else.

Power sets the model, not the other way around. This is the cascade inverted. In a centralized build you pick the model and provision power to match; at the edge the available power is fixed by the host site, and it dictates which models can run there at all. A few kW supports a single small or heavily quantized model on one or two accelerators; ~30 kW at a generous micro-site supports a modest cluster. The frontier-scale model that needs an NVL72's ~132 kW simply cannot live at the edge, which is why edge serving is overwhelmingly the domain of distilled, quantized, sub-10B-parameter models and small mixture-of-experts variants. The decision to go edge is implicitly a decision to serve a smaller model, and if your product needs the big model the edge case collapses on the first line. → quantization and model-sizing for serving in Chapter 1.3.
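The inverted cascade (site power first, model second) can be expressed as a simple fit check. This is a rough sketch under stated assumptions: the function names, the 30% overhead default, and the accelerator wattage and memory figures in the example are all hypothetical placeholders, not vendor specifications.

```python
def accelerators_that_fit(site_kw: float, accel_watts: float,
                          overhead_fraction: float = 0.3) -> int:
    """How many accelerators a site's power envelope can carry.

    overhead_fraction is the share of site power assumed to go to host CPUs,
    networking, and cooling rather than accelerators (hypothetical default).
    """
    usable_watts = site_kw * 1000 * (1 - overhead_fraction)
    return int(usable_watts // accel_watts)


def model_fits(site_kw: float, accel_watts: float, accel_memory_gb: float,
               model_memory_gb: float, overhead_fraction: float = 0.3) -> bool:
    """True if the accelerators the site can power hold the model in memory."""
    count = accelerators_that_fit(site_kw, accel_watts, overhead_fraction)
    return count * accel_memory_gb >= model_memory_gb


# Hypothetical 3 kW cabinet with 400 W, 24 GB accelerators.
print(accelerators_that_fit(3, 400))
print(model_fits(3, 400, 24, model_memory_gb=16))   # small quantized model
print(model_fits(3, 400, 24, model_memory_gb=700))  # far larger model
```

The check runs from the site to the model, which is the order the edge forces: the envelope is an input and the model is what has to give.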

Thermal envelope is ambient-limited and water-free. The cooling-cliff logic of Chapter 5.1 still applies, but the edge has neither the density to need direct-to-chip liquid nor the facility water to supply it. The dominant answer is sealed/closed-loop air or self-contained liquid that rejects to ambient — a modular cabinet that needs no plumbing to the building. That keeps the form factor deployable but caps the heat you can reject, which loops straight back to the power ceiling. The form factor itself is the third constraint: micro/modular containerized units are now the largest edge segment precisely because a prefabricated, sealed, ship-and-drop enclosure is the only thing that deploys economically across hundreds of sites with no construction crew.

Lights-out fleet operations and zero-touch provisioning

A centralized region is one operations problem with staff on site. An edge deployment is the opposite: hundreds of sites, none with staff, scattered across a geography. This is the decision that quietly kills more edge projects than latency or power — because the edge's cost is operational, not capital, and a fleet you cannot operate without sending a technician to each site is a fleet that loses money on every truck roll. The whole model only works if a site can be deployed, configured, monitored, updated, and recovered with no one in the room.

Zero-touch provisioning (ZTP) is the enabling discipline: hardware is shipped to a site, plugged in, and powered on, and it then enrolls itself, pulls its image and configuration from a central control plane, joins the fleet, and begins serving, with no local expertise and no manual steps. Mature tooling cuts install time by 90% or more and collapses site bring-up from days of skilled labor to roughly one hour of unskilled plug-in per site. The corollary is that everything downstream of provisioning must also be remote: over-the-air model and firmware updates, remote attestation and secure boot (the physical-security perimeter at an unmanned site is weak, so the root of trust must be in silicon; see Chapter 1.1's irreversible-decisions logic applied to hardware), telemetry-driven health monitoring, and automated remediation that can quarantine, reimage, or fail a site over without dispatch.
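The bring-up sequence described above can be modeled as a small state machine. This is an illustrative sketch only: the state names, the position of the attestation step, and the quarantine-on-failure rule are assumptions, and it does not represent any particular vendor's ZTP tooling.

```python
from enum import Enum, auto


class SiteState(Enum):
    SHIPPED = auto()
    POWERED_ON = auto()
    ENROLLED = auto()
    ATTESTED = auto()
    IMAGED = auto()
    CONFIGURED = auto()
    SERVING = auto()
    QUARANTINED = auto()


# Happy-path order from the text; where attestation sits is an assumption.
BRING_UP = [SiteState.SHIPPED, SiteState.POWERED_ON, SiteState.ENROLLED,
            SiteState.ATTESTED, SiteState.IMAGED, SiteState.CONFIGURED,
            SiteState.SERVING]


def advance(state: SiteState, step_ok: bool) -> SiteState:
    """Move one step along bring-up, or quarantine the site on any failure."""
    if state in (SiteState.SERVING, SiteState.QUARANTINED):
        return state  # terminal for this sketch; remediation is out of scope
    if not step_ok:
        return SiteState.QUARANTINED
    return BRING_UP[BRING_UP.index(state) + 1]


def provision(step_results: list[bool]) -> SiteState:
    """Run bring-up from SHIPPED using one pass/fail result per step."""
    state = SiteState.SHIPPED
    for ok in step_results:
        state = advance(state, ok)
    return state


print(provision([True] * 6))
print(provision([True, True, False, True]))
```

The point of the sketch is that no transition requires a person on site: a failed step lands the site in a state the control plane can act on remotely.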

The consequence of getting this wrong is not a worse SLA — it is a different business. Without ZTP and remote remediation, every firmware bug, every wedged node, every certificate rotation becomes a truck roll, and the fleet's operating cost scales with site count instead of being amortized across it. The edge's promise of distribution becomes its curse. The reversible/irreversible framing applies cleanly: the automation platform is reversible (you can re-tool), but the decision to operate lights-out is irreversible at scale — you cannot retrofit hands-on operations onto a hundred unstaffed sites without rebuilding the cost model from scratch. → fleet reliability and goodput-at-the-edge connect to Chapter 1.3 on inference operations.
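The way truck rolls make operating cost scale with site count can be shown with a toy model. Every number in the example is a made-up placeholder, and the function name and parameters are hypothetical; the only claim is the shape of the relationship.

```python
def annual_remediation_cost(sites: int, incidents_per_site: float,
                            dispatch_fraction: float,
                            truck_roll_cost: float,
                            remote_fix_cost: float) -> float:
    """Yearly remediation spend for a fleet.

    dispatch_fraction is the share of incidents that need a technician on
    site; the rest are assumed to be fixed remotely at remote_fix_cost each.
    """
    incidents = sites * incidents_per_site
    dispatched = incidents * dispatch_fraction
    return dispatched * truck_roll_cost + (incidents - dispatched) * remote_fix_cost


# Placeholder inputs: 100 sites, 10 incidents per site per year.
hands_on = annual_remediation_cost(100, 10, 1.0, truck_roll_cost=500, remote_fix_cost=5)
lights_out = annual_remediation_cost(100, 10, 0.0, truck_roll_cost=500, remote_fix_cost=5)
print(hands_on, lights_out)
```

Both totals grow linearly with site count, but the per-incident cost differs by whatever ratio separates a dispatch from a remote fix, and that ratio is what the automation platform buys.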

Deep dive: geo-redundancy replaces facility redundancy at the edge

The redundancy logic of a centralized inference site — 2N power, N+1 cooling on standby, Tier-IV-class uptime — does not transfer to the edge, and trying to force it is an anti-pattern. A telco cabinet or a CDN PoP cannot host 2N power; there is no room and no budget. So the edge resilience model inverts the same way the siting model does: instead of making each site bulletproof, you make the fleet resilient and let individual sites fail. Most edge sites are therefore N — a single power feed, a single cooling path — and availability comes from geo-redundancy: anycast or latency-aware routing that steers a user's request to the next-nearest healthy site when the closest one is down, degraded, or saturated.
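Latency-aware steering to the next-nearest healthy site can be sketched in a few lines. This is a simplified model of the selection rule only, not of anycast itself: the `Site` fields, the 0.9 saturation cutoff, and the example values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    name: str
    rtt_ms: float   # measured round-trip from this user to the site
    healthy: bool
    load: float     # fraction of capacity in use, 0.0 to 1.0


def pick_site(sites: list[Site], budget_ms: float,
              max_load: float = 0.9) -> Optional[Site]:
    """Nearest site that is healthy, not saturated, and inside the budget."""
    candidates = [s for s in sites
                  if s.healthy and s.load < max_load and s.rtt_ms <= budget_ms]
    return min(candidates, key=lambda s: s.rtt_ms, default=None)


fleet = [
    Site("closest", rtt_ms=5, healthy=False, load=0.1),   # down
    Site("next-nearest", rtt_ms=12, healthy=True, load=0.5),
    Site("far", rtt_ms=40, healthy=True, load=0.2),
]
chosen = pick_site(fleet, budget_ms=30)
print(chosen.name if chosen else "no site inside budget")
```

A `None` result is the interesting case: it means the fleet is healthy enough to answer but not close enough to answer in time, which is a coverage problem and not a redundancy problem.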

The consequence is a design constraint most teams discover late: the fleet must be over-provisioned in coverage, not in per-site nines. If losing one site pushes its users past their latency threshold because the next-nearest site is 200 miles away, you have a coverage gap that no per-site redundancy would have fixed — you needed another site, not another power feed. This is why edge capacity planning is a map problem first and a rack problem second: you place sites so that the failure of any one keeps every user inside budget against the next-nearest. The redundancy spend that would have bought 2N at one site buys an extra site in the gap instead, and that extra site is worth more. → the redundancy-vs-goodput reframing for centralized sites is in Chapter 1.1 and deepened in the reliability chapters.
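The "map problem" can be checked mechanically: a user survives any single site failure only if their second-nearest site is still inside the latency radius. The sketch below assumes great-circle distance as a stand-in for fiber route length (real routes are longer), and the function names and coordinates are hypothetical.

```python
import math


def distance_mi(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 mi = mean Earth radius


def coverage_gaps(users: dict, sites: dict, max_radius_mi: float) -> list:
    """Users left outside max_radius_mi when their nearest site fails.

    Each user must be within range of their second-nearest site, because
    that is the site that serves them once the nearest one is down.
    """
    gaps = []
    for name, location in users.items():
        distances = sorted(distance_mi(location, s) for s in sites.values())
        if len(distances) < 2 or distances[1] > max_radius_mi:
            gaps.append(name)
    return gaps


# Hypothetical points along the equator (1 degree of longitude is ~69 mi).
users = {"metro-a": (0.0, 0.0)}
sites = {"s1": (0.0, 0.5), "s2": (0.0, 1.0), "s3": (0.0, 5.0)}
print(coverage_gaps(users, sites, max_radius_mi=100))
print(coverage_gaps(users, sites, max_radius_mi=50))
```

When the second call reports a gap, the remedy it points to is another site placed within range, which no amount of per-site power redundancy would supply.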

Edge economics vs centralized regional inference

Now the decision that the rest of the chapter has been building toward: against a real workload, does the edge beat a centralized region with a CDN front-end? The honest answer, more often than the edge's advocates concede, is no. The edge trades away the three things that make centralized inference cheap — economies of scale, high utilization, and cheap power — in exchange for one thing: latency it can deliver and a region cannot. If you do not need that one thing badly enough, every other line of the comparison favors the region.

Utilization is the edge's structural weakness. A region pools demand from a continent and runs its fleet hot; an edge site serves only the users in its latency radius, so its demand is thinner, spikier, and harder to fill. A centralized cluster at 80–90% utilization and a Tier-2 edge node at 30–40% are serving the same model at wildly different unit costs, because the fixed costs — the GPU depreciation, the lease, the power baseline — are spread over far fewer tokens at the edge. The breakeven-utilization math that governs any GPU build (~70% for a debt-financed cluster, a contested single-source figure) is much harder to clear when your addressable demand is a single metro's worth of latency-sensitive traffic. Power is the edge's second penalty: metro and on-prem power costs more than stranded rural megawatts — often 2–4x — and the edge cannot chase the cheap power without giving up the proximity that was its entire reason to exist.
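The unit-cost effect of utilization is simple arithmetic and worth seeing once. This sketch assumes identical hardware and identical fixed costs at both sites, which understates the edge's disadvantage because it leaves out the power premium; the function name and the dollar and throughput inputs are placeholders.

```python
def cost_per_million_tokens(fixed_cost_per_hour: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Fixed cost spread over the tokens actually served in an hour."""
    served = peak_tokens_per_hour * utilization
    return fixed_cost_per_hour * 1_000_000 / served


# Placeholder inputs: same node, same hourly fixed cost, different fill.
central = cost_per_million_tokens(10.0, 1_000_000, utilization=0.85)
edge = cost_per_million_tokens(10.0, 1_000_000, utilization=0.35)
print(f"edge unit cost is {edge / central:.2f}x centralized")
```

The ratio depends only on the two utilization figures (0.85 / 0.35), so the gap holds whatever the actual hourly cost turns out to be.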

So the comparison is a genuine fork with a clear shape. Centralized + CDN wins when the latency budget is soft (≥100 ms), demand is too thin to fill edge sites, the model is large, or the workload is batch-shaped — which describes most inference today. Edge wins when the budget is physically binding (≤50 ms, often ≤30 ms), the model is small enough to fit the power envelope, demand is dense enough to fill the sites, and data-gravity or sovereignty forbids the round-trip. The mistake is treating the edge as a default modernization step rather than a targeted answer to a physical constraint. When in doubt, the region is the cheaper bet — and you can always push selected workloads outward later, which is itself the reversible move.

Edge vs centralized regional inference — when each wins

Decision driver	Favors centralized region (+ CDN)	Favors edge fleet	Downstream cost of choosing wrong
Latency budget	Soft: ≥100 ms; region already clears the SLO	Physical: ≤50 ms (often ≤30 ms) a region cannot meet	Edge-when-soft: pay full edge penalty to shave milliseconds no SLA requires
Demand density	Thin or pooled across a continent; fill a few hot regions	Dense within each latency radius; sites fill	Edge-when-thin: sub-40% utilization, breakeven never clears
Model size	Large / frontier; needs centralized power and scale-up	Small / quantized; fits a few-kW to ~30 kW envelope	Edge-with-big-model: the model does not fit the site, case collapses
Data gravity / sovereignty	Data may traverse the network; no residency constraint	Air-gapped, sovereign, or on-prem-only data	Region-when-forbidden: a compliance or contractual breach, not a cost
Operations	Staffed sites; conventional ops	Lights-out, zero-touch, geo-redundant fleet in place	Edge-without-ZTP: every fix is a truck roll, opex scales with site count

A decision table, not a ranking. The right answer is whichever column your workload's latency budget, demand density, and model size land in. 'CDN front-end' = caching, routing, and small cache-and-serve models at PoPs in front of a centralized fleet.
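The table can be read as a short decision procedure. The sketch below is one possible encoding, not the guide's official rule: the order in which the drivers are checked, the treatment of budgets between 50 and 100 ms as centralized by default, and all field names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    latency_budget_ms: float
    model_fits_edge_power: bool   # fits a few-kW to ~30 kW site
    demand_fills_sites: bool      # dense demand inside each latency radius
    data_must_stay_local: bool    # air-gapped, sovereign, or on-prem-only
    lights_out_ops_ready: bool    # ZTP and remote remediation in place


def recommend(w: Workload) -> str:
    """Return 'edge' or 'centralized' with the driver that decided it."""
    if w.data_must_stay_local:
        # A region here is a compliance breach, not a cost trade-off.
        return "edge: data cannot traverse the network"
    if not w.model_fits_edge_power:
        return "centralized: model does not fit the edge power envelope"
    if not w.lights_out_ops_ready:
        return "centralized: build lights-out operations before the fleet"
    if w.latency_budget_ms <= 50 and w.demand_fills_sites:
        return "edge: physical latency budget with dense demand"
    return "centralized: region plus CDN front-end"


print(recommend(Workload(30, True, True, False, True)))
print(recommend(Workload(100, True, True, False, True)))
```

Defaulting to the region whenever a condition is unmet mirrors the chapter's advice that the region is the cheaper bet when in doubt, since workloads can be pushed outward later.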

When the edge is wrong (the anti-patterns)

The same edge mis-scopes recur, each one from inverting the siting hierarchy when the inversion was not earned:

Edge for a preference budget. Distributing inference to shave latency the SLA never required. The region already cleared 100 ms for nearby users; the edge buys 40 ms of headroom nobody asked for at the price of thin utilization and a fleet to operate. The same money would have cut more latency if spent on a faster centralized accelerator.
Edge with a model that does not fit. Committing to edge serving and then discovering the product needs the frontier model that an NVL72 hosts and a 30 kW cabinet cannot. The power envelope is the first line of the spec, not the last — decide the model and the site together.
Edge without lights-out operations. Standing up a hundred sites with no ZTP, no remote remediation, no in-silicon root of trust — so every firmware bug becomes a truck roll and opex scales with site count instead of being amortized across it. The automation must exist before the fleet does, not after.
Per-site redundancy instead of geo-redundancy. Trying to make each unmanned cabinet bulletproof with 2N power it has no room for, instead of placing an extra site in the coverage gap. At the edge, resilience is a map problem; the redundancy budget buys coverage, not nines.

Edge inference is one of the five workload archetypes framed in Chapter 1.1; its serving-engineering siblings are online inference in Chapter 1.3 (model sizing, quantization, decode pressure) and the procurement question of building versus renting the underlying capacity in Chapter 1.6 (where 'build-core-rent-edge' is named as a hybrid). The per-subsystem requirements mapping — including the power-first vs latency-first siting split that this chapter inverts — is tabulated in Chapter 1.7, and the economics that score an edge fleet against a centralized region live in Chapter 1.8. The cooling-cliff physics the edge sidesteps with sealed-loop cabinets is engineered in Chapter 5.1, and the siting hierarchy the edge inverts is laid out in Chapter 3.1.