Chapter 9.6

Object Storage, Data Lakes & the Capacity Tier

Object storage is the gravitational floor of the AI data center — the one tier that holds the whole corpus, every checkpoint lineage, and every shipped model — and the decision that determines its cost is not which vendor you pick but whether you treat it as a cheap archive or a first-class, flash-fronted serving layer that the GPUs actually read from.

GOODPUT · POWER-BOUND

What you'll decide here

Whether your capacity tier is on-prem object, cloud object, or a hybrid — and therefore who owns the egress bill, the durability SLA, and the 2–7 year data-gravity lock-in that follows the petabytes once they land.
Whether the bucket is HDD-backed nearline (cheap per TB, slow, dense-but-cold) or flash-fronted / all-QLC (10x throughput, single-digit-ms, the price premium that buys GPU goodput) — the fork that decides whether object storage feeds training directly or only stages it.
Which lakehouse table format (Iceberg, Delta, Hudi) governs the data lake, given that the ecosystem has converged on Iceberg as the interop standard and the wrong bet strands your catalog and multi-engine access.
The durability and protection scheme — replication vs erasure coding, single-AZ vs multi-AZ, and the immutability/object-lock posture — because the same eleven-nines number can be bought three ways at three very different costs.
Whether object is also your inference cold tier and model-distribution backbone — because if it is, its read latency and fan-out throughput stop being archive concerns and become cold-start and time-to-first-token concerns.

Every tier above this one — node-local NVMe scratch, the parallel file system, the checkpoint store — is sized to a working set. Object storage is sized to everything. It is the only layer in the building that holds the full training corpus, the complete checkpoint lineage across every run, the model registry, the eval artifacts, the inference logs, and the cold copies of weights waiting to be paged onto a GPU. Because it holds everything, it is the largest tier by capacity and usually the cheapest per terabyte — and that combination breeds a dangerous instinct: to treat it as a passive archive that sits behind the "real" storage and never touches the GPUs. In the 2026 AI data center that instinct is wrong, and acting on it is expensive.

The shift this chapter is built around is that object storage stopped being the cold tier and became a serving tier. Modern data loaders stream Parquet and WebDataset shards directly from S3-compatible buckets into the training loop; model servers page multi-hundred-gigabyte weight files out of object storage on cold start; lakehouse engines query Iceberg tables in place without a copy. The moment a GPU is waiting on an object GET, the bucket's read latency and aggregate throughput are no longer archive metrics — they are goodput metrics, and an under-provisioned capacity tier idles the most expensive hardware in the building. The forks ahead are on-prem vs cloud, HDD vs flash, replication vs erasure, and archive vs serving, each with a downstream cost. The data-loader path that consumes this tier is Chapter 9.5; the file-system tier that sits above it is Chapter 9.2; the sizing math that ties them together is Chapter 9.8.

What object storage is for in an AI fleet

Object storage is defined by three properties that make it the capacity tier and disqualify it from being the performance tier: a flat key-value namespace (no directory tree, no POSIX semantics, no in-place updates — you PUT and GET whole objects by key), massive horizontal scale (a single bucket can hold trillions of objects and exabytes without a metadata wall), and an HTTP/REST access path (the S3 API is the de-facto wire protocol the entire ecosystem speaks). The cost of those properties is latency and small-object overhead: a GET carries request setup, and a workload of millions of tiny objects pays per-request cost that crushes throughput. That is precisely why the data-loader layer packs many small samples into large sequential shards (WebDataset tar, Parquet, MDS) before they ever reach the bucket — the format choice in Chapter 9.5 exists largely to make object storage fast.
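The per-request penalty can be put in rough numbers. The sketch below models a single GET stream as a fixed setup time plus a transfer time. The 10 ms setup cost and the 1 GB/s stream rate are illustrative assumptions, not measurements of any particular object store, and the function name is hypothetical.

```python
def effective_throughput_mb_s(object_mb: float,
                              setup_ms: float = 10.0,
                              stream_mb_s: float = 1000.0) -> float:
    """Effective MB/s of one GET stream: size / (setup time + transfer time)."""
    seconds = setup_ms / 1000.0 + object_mb / stream_mb_s
    return object_mb / seconds


# Assumed object sizes: a 100 KB sample, a 10 MB file, a 1 GB packed shard.
for size_mb in (0.1, 10.0, 1000.0):
    rate = effective_throughput_mb_s(size_mb)
    print(f"{size_mb:8.1f} MB object -> {rate:7.1f} MB/s per stream")
```

Under these assumptions a 100 KB object reaches about 1% of the stream rate while a 1 GB shard reaches about 99%, which is the arithmetic behind packing samples into large shards.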

In the AI lifecycle, object storage carries five distinct payloads, and they have different access patterns even though they share a bucket. The training corpus is read-heavy, large-sequential, fan-out across thousands of GPUs. Checkpoints are write-burst (incast) then mostly cold, until a restart makes one of them suddenly latency-critical (Chapter 9.4). The model registry is small in object count but each object is huge and read on cold start. Eval and inference logs are append-heavy, queried analytically. And the data lake is the governed, queryable view over the raw corpus. Sizing the bucket to the average of these patterns guarantees you under-serve the peak of each one.

The master fork: on-prem object vs cloud object

The first and least-reversible decision is where the petabytes physically live, because data has gravity: once a multi-petabyte corpus lands somewhere, moving it costs money, time, and egress fees, and every workload that touches it is pulled toward it. The choice is not really "AWS vs MinIO" — it is a choice about who owns the durability SLA, who pays for reads, and how locked-in the data gravity makes you over the asset's life. The data-gravity economics are developed quantitatively in Chapter 9.8.

Cloud object (S3, GCS, Azure Blob, and the regional sovereign clouds) buys you a managed eleven-nines durability SLA, infinite elastic capacity, and zero capex, at the price of a per-GB-month storage rate, per-request charges, and — the line that quietly dominates AI economics — egress fees on every byte that leaves the cloud. If your GPUs are also in that cloud, egress is internal and cheap; if your GPUs are in a colo or self-build and your data is in the cloud, you are paying to feed your own training loop, and the data gravity makes it progressively harder to leave. On-prem object (MinIO, Ceph RADOS Gateway, VAST, Pure, Cloudian, Dell ECS, and the parallel-FS vendors' native S3 endpoints) inverts the trade: you own the capex and the operational burden, you set your own durability scheme, and — critically — there is no egress fee to feed a co-located GPU fleet. For a durable, large, self-hosted training operation the on-prem bucket is usually the lower-TCO floor; for spiky, multi-region, or cloud-native inference the managed bucket wins on elasticity and reach.

On-prem object vs cloud object vs hybrid — the capacity-tier fork

Dimension	On-prem object	Cloud object	Hybrid (cloud-DR / burst)
Capex / opex	High capex, low marginal opex	Zero capex, per-GB-month + per-request opex	Mixed; pay cloud only for the overflow/DR copy
Egress to GPUs	None if GPUs co-located	Internal-cheap in-cloud; expensive cross-cloud/colo	Egress only on the spillover path
Durability SLA	You design it (EC + replication)	Managed ~11 nines, multi-AZ by default	Cloud SLA on the off-site copy
Elasticity	Bounded by what you racked	Effectively infinite, instant	Cloud absorbs the burst
Data-gravity lock-in	Locks workloads to your site	Locks workloads to that cloud + egress moat	Splits gravity; highest operational complexity
Best-fit	Durable, large, co-located self-build	Spiky, multi-region, cloud-native serving	On-prem primary + cloud archive/DR/overflow

S3-API compatibility is now near-universal, so the decision is economics and control, not protocol. Figures are 2026 practitioner ranges; see the key numbers for sources.

Egress is the data-gravity tax — and it is a one-way ratchet

The single number that breaks more AI-storage architectures than any throughput figure is cloud egress. Storing a corpus in one cloud and training it in another — or in your own colo — means paying to read your own data, every epoch, forever. Worse, the cost is asymmetric by design: it is cheap to put data in and expensive to take it out, which is exactly the shape of a lock-in moat. The consequence for siting: co-locate the capacity tier with the compute that reads it, or accept that the egress line will quietly dominate your storage TCO and that leaving will require a deliberate, budgeted migration rather than a config change. This is the storage face of the data-gravity argument in Chapter 9.8 and the move-compute-to-data principle.
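A back-of-envelope sketch of that recurring bill follows. It assumes every epoch re-reads the full corpus across the cloud boundary with no local cache, and it takes the per-GB rate as a parameter: the $0.05/GB used in the example is a placeholder, not a quoted price, so substitute your provider's current rate. The function name is hypothetical.

```python
def egress_cost_usd(corpus_tb: float, epochs: int, usd_per_gb: float) -> float:
    """Cost of reading the whole corpus out of the cloud once per epoch."""
    corpus_gb = corpus_tb * 1000.0  # decimal TB to GB
    return corpus_gb * epochs * usd_per_gb


# Illustrative inputs: a 2 PB corpus, 10 epochs, placeholder rate of $0.05/GB.
cost = egress_cost_usd(corpus_tb=2000, epochs=10, usd_per_gb=0.05)
print(f"egress for the run: ${cost:,.0f}")
```

A cache on the compute side changes the picture, because only the first epoch crosses the boundary. That is one reason to co-locate or to front the remote bucket with local flash.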

S3-over-flash: the capacity tier learns to serve

The second fork is the one that has changed most since 2024: what media sits behind the bucket. Classic object storage is HDD-backed: cheap per terabyte, dense, and slow, with ~4 ms seek latency and roughly 300 MB/s of sequential throughput per spindle. That is fine for a true cold archive and fatal for a serving tier. The new pattern is S3-over-flash: an S3-compatible namespace whose hot data lives on NVMe SSD, delivering single-digit-millisecond GETs and an order-of-magnitude higher throughput, so the data loader can stream training shards directly from object storage at line rate instead of staging them through a separate file system first.

The cloud made this concrete with S3 Express One Zone: up to ~10x faster data access and up to ~80% lower request cost than S3 Standard, at single-digit-ms latency — but bought by collapsing to a single availability zone, which trades away the multi-AZ durability that defines S3 Standard. That is the flash fork in miniature: you pay for speed in dollars (storage roughly $0.11/GB-month, several times the Standard rate) and in durability blast-radius (one AZ, not three). On-prem, the same shift is driven by high-capacity QLC SSDs — 122 TB drives shipping in 2025 with public roadmaps to ~245 TB in 2026, at ~7 GB/s sequential read versus ~300 MB/s for an HDD and 20–100 µs latency versus ~4 ms — which let an all-flash object tier match HDD density per rack while serving like a performance tier and slashing watts-per-terabyte. The QLC vs HDD crossover is the storage-economics story of the 2026 data center, and it is reshaping how cold the cold tier really has to be.

Capacity-tier media fork — HDD nearline vs flash-fronted / all-QLC

Property	HDD nearline object	Flash-fronted / all-QLC object
Per-TB cost	Lowest ($/TB floor)	2–4x HDD, falling fast as QLC scales
Read latency	~4 ms (seek-bound)	20–100 µs device; single-digit-ms at the bucket
Sequential throughput	~300 MB/s per spindle	~7 GB/s per drive (≈23x)
Density per rack	High (30+ TB HDDs)	Equal or higher (122–245 TB QLC)
Watts per TB	High (many spindles)	Low (fewer, denser devices; lower power and cooling load)
Feeds GPUs directly?	No — must stage through a faster tier	Yes — stream shards into the loop at line rate

Per-device figures are 2025–2026 vendor/market data (Solidigm 122 TB QLC; HDD nearline). Latency/throughput are typical, not worst-case.
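The per-device figures in the table decide whether a tier is capacity-bound or bandwidth-bound. The sketch below counts raw devices for the chapter's reference target of ~2 PB and 250 GB/s per ~1,024 GPUs, using 30 TB at 0.3 GB/s for an HDD and 122 TB at 7 GB/s for a QLC SSD. It ignores protection overhead, controller and network limits, and random-access behavior, so treat it as a first-order estimate only. The function name is hypothetical.

```python
import math


def drives_needed(capacity_tb: float, bandwidth_gb_s: float,
                  drive_tb: float, drive_gb_s: float) -> tuple:
    """Return (drives for capacity, drives for bandwidth, binding count)."""
    for_capacity = math.ceil(capacity_tb / drive_tb)
    for_bandwidth = math.ceil(bandwidth_gb_s / drive_gb_s)
    return for_capacity, for_bandwidth, max(for_capacity, for_bandwidth)


print("HDD:", drives_needed(2000, 250, drive_tb=30, drive_gb_s=0.3))
print("QLC:", drives_needed(2000, 250, drive_tb=122, drive_gb_s=7))
```

With these inputs the HDD tier needs 67 drives for capacity but 834 for bandwidth, while the QLC tier needs 17 and 36. The HDD bucket is sized by throughput long before it is sized by terabytes, which is the practical meaning of "must stage through a faster tier".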

Lifecycle and tiering: paying archive prices for archive data

Not every object earns flash. The corpus shard read every epoch does; the checkpoint from a run that finished four months ago does not; the inference log from last quarter belongs in deep archive. Lifecycle and tiering policies are the mechanism that keeps the average cost low while the hot path stays fast — they automatically demote objects down a temperature ladder (hot flash → warm standard → cool infrequent-access → cold/archive) as access frequency drops, and the cloud's intelligent-tiering variants do this without an explicit policy by watching access patterns.

The consequence here is subtle. Retrieval from a deep-archive class is cheap to store but slow and sometimes expensive to read back — which is fine for a compliance copy and disastrous for a checkpoint you might restart from. The anti-pattern is tiering by age alone: a six-month-old checkpoint looks archival until a model regression makes it the one you must restore now, and a multi-hour glacier retrieval becomes the long pole in your recovery. Tier by access likelihood and recovery-time requirement, not by calendar age — and keep anything on a recovery path in an instant-retrieval class even if it is cold.
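One way to express "tier by access likelihood and recovery-time requirement" is a small policy function. The sketch below is illustrative only: the tier labels, the field names, and the thresholds (30 reads per month, a one-hour restore limit) are assumptions chosen for the example, not values from any provider's lifecycle API.

```python
from dataclasses import dataclass


@dataclass
class ObjectProfile:
    on_recovery_path: bool      # could a restart or rollback need this object?
    reads_per_month: float      # expected access frequency
    max_restore_seconds: float  # recovery-time requirement for this object


def choose_tier(p: ObjectProfile) -> str:
    """Pick a temperature tier from access likelihood and recovery time, not age."""
    if p.reads_per_month >= 30:   # assumed threshold: read about daily
        return "hot-flash"
    if p.reads_per_month >= 1:
        return "warm-standard"
    if p.on_recovery_path or p.max_restore_seconds < 3600:
        return "cool-instant-retrieval"  # cold, but restorable immediately
    return "deep-archive"


old_checkpoint = ObjectProfile(on_recovery_path=True, reads_per_month=0,
                               max_restore_seconds=600)
print(choose_tier(old_checkpoint))
```

Age never appears as an input, so the six-month-old checkpoint stays in an instant-retrieval class for as long as it remains on a recovery path.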

Data lakes and the lakehouse: Iceberg won the format war

A bucket full of Parquet files is a data lake; a bucket full of Parquet files with a transactional metadata layer that gives you schema evolution, time travel, ACID commits, and partition evolution is a lakehouse. The metadata layer is the open table format, and for AI fleets it matters because the same governed corpus must be queryable by the data-prep pipeline, the training loader, the eval harness, and the analytics engine — without copying it four times. The format is the contract that lets many engines read one copy of the data in place.
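The idea of a transactional metadata layer can be shown with a toy model. The sketch below is not Iceberg's (or Delta's, or Hudi's) actual metadata layout, and all class and method names are hypothetical. It shows only the core mechanism: data files are immutable, each commit publishes a new snapshot that lists the visible files, and readers pinned to an older snapshot keep seeing a consistent view.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # immutable listing of data-file keys in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, new_files) -> Snapshot:
        """Publish a new snapshot; earlier snapshots stay readable unchanged."""
        base = self.current()
        snap = Snapshot(base.snapshot_id + 1, base.data_files + tuple(new_files))
        self.snapshots.append(snap)  # the single step that makes the commit visible
        return snap

    def files_as_of(self, snapshot_id: int) -> tuple:
        """Time travel: the files a reader pinned to an older snapshot sees."""
        return self.snapshots[snapshot_id].data_files


table = ToyTable()
table.commit_append(["part-000.parquet"])
table.commit_append(["part-001.parquet"])
print(table.files_as_of(1), table.current().data_files)
```

Real table formats track far more (schemas, partition specs, per-file statistics) and rely on a catalog to make the commit step atomic across concurrent writers. The point here is only that many engines can share one copy of the data because they agree on how to read the snapshot list.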

The format war is effectively over: the ecosystem has converged on Apache Iceberg as the interoperability standard. AWS made Iceberg the default for Athena, Glue, and EMR and shipped S3 Tables (Iceberg-native buckets with a managed REST catalog); Snowflake added native Iceberg tables; Databricks acquired Tabular (the company founded by Iceberg's original creators) and uses Delta Lake UniForm to expose Iceberg-compatible metadata so a single Delta table reads as either format. Iceberg v3 deliberately aligns deletion semantics, file layout, and row tracking with Delta so one copy of data serves both. The practical decision: pick Iceberg unless you are all-in on the Databricks/Delta stack, in which case Delta-with-UniForm gives you interop anyway; Hudi remains the choice only for write-heavy, streaming-upsert CDC workloads. Betting on a proprietary or orphaned format is the expensive mistake: it strands your catalog and forecloses multi-engine access, which is the entire reason to run a lakehouse. The governance and lineage regime that sits on top of the lake is Chapter 10.10.

Key numbers

Figure	What it measures	Year	Source
99.999999999%	S3 Standard designed durability (11 nines); data spread across ≥3 AZs; ~350T objects stored	2025	AWS S3 documentation / FAQs
~10x faster	S3 Express One Zone data access vs S3 Standard; up to 80% lower request cost; single-AZ	2025	AWS S3 Express One Zone
~$0.11/GB-mo	S3 Express One Zone storage rate (several x Standard); the flash-tier price premium	2025	AWS S3 pricing; Vantage analysis
122 TB	Shipping QLC SSD capacity (Solidigm); public roadmap to ~245 TB in 2026	2025	Solidigm; Blocks & Files; IDTechEx
~7 GB/s vs ~300 MB/s	QLC SSD vs HDD sequential read; 20–100 µs vs ~4 ms latency	2025	Solidigm / Meta Engineering QLC case
Iceberg = default	Open table format for AWS Athena/Glue/EMR + S3 Tables; Snowflake + Databricks interop	2025	AWS; Dremio; Capital One Tech
~2 PB + 250–400 GB/s	Reference capacity + aggregate bandwidth per ~1,024 GPUs for the corpus tier	2025	Domain synthesis; NVIDIA DGX SuperPOD storage RA
46 PB	Meta RSC flash cache fronting object/cold storage; scale of a serving capacity tier	2025	Introl (Meta RSC)

Durability and protection: the same eleven nines, bought three ways

"Eleven nines" is a marketing number until you ask how it is achieved, because the same durability target carries very different costs depending on the protection scheme. The two primitives are replication (store N full copies; simple, fast to repair, but N times the raw capacity) and erasure coding (split each object into k data + m parity shards spread across failure domains; survives m losses at a fraction of the overhead of replication). At multi-petabyte scale and beyond, erasure coding is effectively mandatory: triple replication of a 10 PB corpus needs 30 PB of raw media, 20 PB of it redundant copies. The price is more CPU on write and on read-reconstruct, and repairs that touch many nodes. The failure domain the shards span is the real durability lever: spread across drives only and you survive drive failures; across nodes and you survive node failures; across availability zones and you survive a whole-AZ loss. S3 Standard's eleven nines come from multi-AZ erasure coding plus continuous background integrity scanning (checksums, auditors, automated re-replication); S3 Express One Zone deliberately gives up the multi-AZ span for latency, which is why its durability blast-radius is a single AZ.

The two failure modes durability marketing hides are silent data corruption and operator error. Bit rot is caught by end-to-end checksums and background scrubbing — verify your object store does both, because a corrupted training shard poisons a run silently. And eleven nines of durability is no protection against an rm, a buggy lifecycle rule, or ransomware: that is what object lock / immutability (WORM) and versioning are for. For checkpoints and the model registry, an immutability window plus versioning is the difference between a recoverable incident and a destroyed lineage; the SDC-detection discipline connects to fleet health in Chapter 10.6.
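The checksum-plus-scrub discipline is simple to state in code. The sketch below uses an in-memory dict as a stand-in for a bucket and hypothetical function names. Real object stores do this internally; the point is the pattern of recording a digest at write time and re-verifying it in the background.

```python
import hashlib


def put_object(store: dict, key: str, data: bytes) -> None:
    """Store the payload together with a SHA-256 digest computed at write time."""
    store[key] = (data, hashlib.sha256(data).hexdigest())


def scrub(store: dict) -> list:
    """Background scrub: return keys whose bytes no longer match their digest."""
    return [key for key, (data, digest) in store.items()
            if hashlib.sha256(data).hexdigest() != digest]


bucket = {}
put_object(bucket, "shard-0000.tar", b"training samples")
_, digest = bucket["shard-0000.tar"]
bucket["shard-0000.tar"] = (b"training sampleZ", digest)  # simulate bit rot
print(scrub(bucket))
```

A loader that verifies the same digest on read closes the loop end to end, so a corrupted shard fails loudly instead of feeding bad samples into a run.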

Deep dive: erasure coding math and why the failure domain is the real knob

An erasure code is written as (k, m): each object is divided into k data shards and m parity shards, any k of the k+m suffice to reconstruct, and the storage overhead is (k+m)/k. A common (8,3) scheme stores 11 shards for 8 of data, a 1.375x overhead that tolerates any 3 simultaneous losses, versus a 3.0x overhead for triple replication, which tolerates only 2. At 10 PB of corpus that is the difference between ~13.75 PB and ~30 PB of raw media, a multi-million-dollar line on flash. So erasure coding wins decisively on capacity efficiency at scale.
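The overhead arithmetic is short enough to check directly. This sketch applies the (k+m)/k formula to the 10 PB example; the function names are hypothetical.

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw-to-usable capacity ratio of a (k, m) erasure code: (k + m) / k."""
    return (k + m) / k


def raw_pb(usable_pb: float, overhead: float) -> float:
    """Raw media needed to hold usable_pb of data at a given overhead."""
    return usable_pb * overhead


ec = raw_pb(10, ec_overhead(8, 3))  # (8,3) erasure coding
rep = raw_pb(10, 3.0)               # triple replication
print(f"(8,3) EC: {ec} PB raw, 3x replication: {rep} PB raw, saved: {rep - ec} PB")
```

Narrower codes such as (4,2) cost more capacity (1.5x) but touch fewer nodes per read and repair, which is one of the trade-offs behind the failure-domain choice.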

The catch, and the design decision, is where the k+m shards land. If all 11 shards of an (8,3) object sit on drives within one node, a node failure takes the object: the code protected you against drives, not nodes. Spread the shards across 11 nodes and you survive node failures; spread them across availability zones so that no single zone holds more than m shards and you survive an AZ loss, which is exactly how cloud object storage buys its multi-AZ eleven nines. The cost of widening the failure domain is write amplification and repair traffic across the network: a wider spread means every write and every reconstruct touches more nodes and more fabric, which is why erasure-coded object stores must be co-designed with the network and why rebuild traffic is isolated from the training fabric (the incast-isolation argument in Chapter 9.8). The right scheme is the narrowest failure domain that meets your real loss-tolerance target, not the widest one the vendor will sell you.
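Whether a placement survives the loss of one whole failure domain reduces to a counting check: the domain holding the most shards must hold no more than m of them. The sketch below assumes shards are spread as evenly as possible across domains, which is a simplification of real placement policies, and the function name is hypothetical.

```python
def survives_domain_loss(k: int, m: int, domains: int) -> bool:
    """True if losing any one failure domain still leaves at least k shards.

    Assumes the k + m shards are spread as evenly as possible across domains.
    """
    worst_case_lost = (k + m + domains - 1) // domains  # integer ceiling
    return worst_case_lost <= m


for label, k, m, d in [("(8,3) on 1 node", 8, 3, 1),
                       ("(8,3) on 11 nodes", 8, 3, 11),
                       ("(8,3) on 3 zones", 8, 3, 3),
                       ("(6,3) on 3 zones", 6, 3, 3)]:
    print(f"{label}: {survives_domain_loss(k, m, d)}")
```

Under this even-spread assumption an (8,3) code needs at least four domains to ride out the loss of one, while a (6,3) code fits three zones at three shards each. The code parameters and the failure-domain count have to be chosen together.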

Object as the inference cold tier and model-distribution backbone

The last role is the one that retroactively justifies treating object storage as a serving tier rather than an archive: object storage is where models live between requests and how they reach every GPU in the fleet. A model server on cold start pages a multi-hundred-gigabyte weight file out of the bucket onto the GPU; an autoscaler spinning up a new replica reads the same weights; a fleet-wide model rollout fans the same objects out to thousands of nodes at once. When that happens, the bucket's read latency becomes cold-start latency and its aggregate fan-out throughput becomes the ceiling on how fast you can scale a model up to meet a traffic spike — both of which are user-facing inference SLO concerns, not archive concerns.

This is why the inference-era capacity tier is increasingly flash-fronted and why the new inference memory hierarchy (Chapter 9.7) has an object/Ethernet-flash layer explicitly at its base: the KV-cache and weight tiers above it page to and from object storage, and a slow bucket throttles cold-start and cache rehydration. If object storage is your model-distribution backbone, you size it for the fan-out burst of a fleet-wide rollout — thousands of simultaneous reads of the same large objects — not for steady-state archive throughput. Under-provision it and your cold-start time-to-first-token and your ability to absorb a traffic surge both degrade. The model-serving and cold-start mechanics live in Chapter 9.7; the read-amplification of a synchronized fan-out is a fabric co-design problem shared with Chapter 8.5.
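The fan-out burst can be bounded with two floors: the time one node needs to pull the weights over its own link, and the time the bucket needs to serve every replica at its aggregate rate. The sketch below takes the larger of the two. The inputs in the example (1,000 replicas, 200 GB of weights, 400 GB/s aggregate, 25 GB/s per node) are illustrative assumptions. It also ignores caching and peer-to-peer distribution, both of which would lower the result. The function name is hypothetical.

```python
def rollout_seconds(replicas: int, weights_gb: float,
                    bucket_gb_s: float, node_gb_s: float) -> float:
    """Lower bound on the time for every replica to pull the same weights once."""
    per_node_floor = weights_gb / node_gb_s                # one node's link limit
    aggregate_floor = replicas * weights_gb / bucket_gb_s  # bucket fan-out limit
    return max(per_node_floor, aggregate_floor)


print(rollout_seconds(replicas=1000, weights_gb=200, bucket_gb_s=400, node_gb_s=25))
print(rollout_seconds(replicas=10, weights_gb=200, bucket_gb_s=400, node_gb_s=25))
```

With ten replicas the node link is the limit (8 s); with a thousand the bucket's aggregate throughput is (500 s). That crossover is why a model-distribution backbone is sized for the rollout burst rather than for steady state.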

Deep dive: when to stage through a file system vs read object directly

The recurring architecture question is whether the training loop reads directly from the object bucket or whether object storage feeds a faster file-system or NVMe cache tier that the GPUs actually read. The answer turns on three variables. First, object media: an HDD-nearline bucket cannot feed a training loop at line rate, so it must stage through a faster tier; a flash-fronted / S3-over-flash bucket can often be read directly. Second, reuse: if the same shards are read every epoch for weeks, a one-time copy into a node-local NVMe or parallel-FS cache amortizes cheaply and removes the bucket from the steady-state hot path; if data is touched once (streaming, single-pass), staging is wasted copy and direct read wins. Third, working-set fit: if the corpus fits in the cache tier you stage once and forget the bucket; if it dwarfs the cache you stream, and the bucket's sustained throughput is the binding constraint.

The 2026 default for large pre-training is a caching architecture: object storage as the durable, capacity-priced source of truth, fronted by a flash cache (parallel FS or distributed cache like the Meta RSC's 46 PB flash cache) that absorbs the per-epoch reads, with the loader prefetching ahead of the GPU. For single-pass streaming and for inference cold-start, the flash-fronted bucket is read directly. The wrong call — staging a single-pass stream, or reading an HDD bucket directly into a training loop — shows up immediately as data-loader stall and collapsed GPU utilization. The loader-and-cache mechanics are Chapter 9.5; the NVMe and GPUDirect path is Chapter 9.3.
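The three variables from this deep dive (media, reuse, and working-set fit) can be folded into a small decision helper. This is a sketch of the reasoning, not a sizing tool: the function name and return labels are hypothetical, and a real decision would also weigh measured bucket throughput against loader demand.

```python
def read_path(flash_fronted: bool, epochs: int,
              corpus_tb: float, cache_tb: float) -> str:
    """Choose how the training loop should reach the object tier."""
    if epochs > 1 and corpus_tb <= cache_tb:
        return "stage-once"        # the copy amortizes; bucket leaves the hot path
    if flash_fronted and epochs <= 1:
        return "direct-read"       # single pass: a staging copy is wasted
    return "stream-through-cache"  # bucket throughput is the binding constraint


print(read_path(flash_fronted=True, epochs=1, corpus_tb=5000, cache_tb=500))
print(read_path(flash_fronted=False, epochs=30, corpus_tb=200, cache_tb=500))
print(read_path(flash_fronted=True, epochs=30, corpus_tb=5000, cache_tb=500))
```

The third case, a multi-epoch corpus that dwarfs the cache, corresponds to the caching architecture described above: the bucket stays the source of truth and the flash cache absorbs as much of the per-epoch read traffic as it can hold.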

Object storage is the floor of the Part 9 hierarchy: the lifecycle framing and four I/O personalities are in Chapter 9.1; the parallel/distributed file systems that cache and front it (including their native object endpoints) in Chapter 9.2; the NVMe and GPUDirect path that the flash tier rides on in Chapter 9.3; checkpoints, which land here as the largest write-burst payload, in Chapter 9.4; the data-loader and format choices that make object storage fast in Chapter 9.5; the inference KV-cache and model-serving hierarchy that pages to and from object in Chapter 9.7; and the sizing, data-gravity, egress, and resilience economics that score the whole tier in Chapter 9.8. The data lake's governance, lineage, and legal regime is Chapter 10.10; SDC detection and integrity telemetry connect to Chapter 10.6; the fan-out/rebuild-traffic fabric co-design to Chapter 8.5; and the 2026→2030 trajectory of all-flash-everywhere and file/object convergence to Chapter 16.2.