Chapter 9.6
Object Storage, Data Lakes & the Capacity Tier
Object storage is the gravitational floor of the AI data center — the one tier that holds the whole corpus, every checkpoint lineage, and every shipped model — and the decision that determines its cost is not which vendor you pick but whether you treat it as a cheap archive or a first-class, flash-fronted serving layer that the GPUs actually read from.
What you'll decide here
- Whether your capacity tier is on-prem object, cloud object, or a hybrid — and therefore who owns the egress bill, the durability SLA, and the 2–7 year data-gravity lock-in that follows the petabytes once they land.
- Whether the bucket is HDD-backed nearline (cheap per TB, slow, dense-but-cold) or flash-fronted / all-QLC (10x throughput, single-digit-ms, the price premium that buys GPU goodput) — the fork that decides whether object storage feeds training directly or only stages it.
- Which lakehouse table format (Iceberg, Delta, Hudi) governs the data lake, given that the ecosystem has converged on Iceberg as the interop standard and the wrong bet strands your catalog and multi-engine access.
- The durability and protection scheme — replication vs erasure coding, single-AZ vs multi-AZ, and the immutability/object-lock posture — because the same eleven-nines number can be bought three ways at three very different costs.
- Whether object is also your inference cold tier and model-distribution backbone — because if it is, its read latency and fan-out throughput stop being archive concerns and become cold-start and time-to-first-token concerns.
Every tier above this one — node-local NVMe scratch, the parallel file system, the checkpoint store — is sized to a working set. Object storage is sized to everything. It is the only layer in the building that holds the full training corpus, the complete checkpoint lineage across every run, the model registry, the eval artifacts, the inference logs, and the cold copies of weights waiting to be paged onto a GPU. Because it holds everything, it is the largest tier by capacity and usually the cheapest per terabyte — and that combination breeds a dangerous instinct: to treat it as a passive archive that sits behind the "real" storage and never touches the GPUs. In the 2026 AI data center that instinct is wrong, and acting on it is expensive.
The shift this chapter is built around is that object storage stopped being the cold tier and became a serving tier. Modern data loaders stream Parquet and WebDataset shards directly from S3-compatible buckets into the training loop; model servers page multi-hundred-gigabyte weight files out of object storage on cold start; lakehouse engines query Iceberg tables in place without a copy. The moment a GPU is waiting on an object GET, the bucket's read latency and aggregate throughput are no longer archive metrics; they are goodput metrics, and an under-provisioned capacity tier idles the most expensive hardware in the building. The forks ahead are on-prem vs cloud, HDD vs flash, replication vs erasure coding, and archive vs serving, each with a downstream cost. The data-loader path that consumes this tier is Chapter 9.5; the file-system tier that sits above it is Chapter 9.2; the sizing math that ties them together is Chapter 9.8.
What object storage is for in an AI fleet
Object storage is defined by three properties that make it the capacity tier and disqualify it from being the performance tier: a flat key-value namespace (no directory tree, no POSIX semantics, no in-place updates — you PUT and GET whole objects by key), massive horizontal scale (a single bucket can hold trillions of objects and exabytes without a metadata wall), and an HTTP/REST access path (the S3 API is the de-facto wire protocol the entire ecosystem speaks). The cost of those properties is latency and small-object overhead: a GET carries request setup, and a workload of millions of tiny objects pays per-request cost that crushes throughput. That is precisely why the data-loader layer packs many small samples into large sequential shards (WebDataset tar, Parquet, MDS) before they ever reach the bucket — the format choice in Chapter 9.5 exists largely to make object storage fast.
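The small-object penalty can be made concrete with a back-of-envelope model. The sketch below assumes each GET pays a fixed setup cost before streaming at a fixed rate; the function name and the 10 ms / 1000 MB/s inputs are hypothetical placeholders for illustration, not measurements of any particular object store.

```python
def effective_throughput_mb_s(object_size_mb: float,
                              request_overhead_ms: float,
                              stream_bandwidth_mb_s: float) -> float:
    """Per-connection throughput when every GET pays a fixed setup cost."""
    transfer_s = object_size_mb / stream_bandwidth_mb_s
    total_s = request_overhead_ms / 1000.0 + transfer_s
    return object_size_mb / total_s


# Hypothetical inputs: 10 ms of per-request overhead, 1000 MB/s per stream.
for size_mb in (0.1, 1.0, 256.0):
    rate = effective_throughput_mb_s(size_mb, 10.0, 1000.0)
    print(f"{size_mb:7.1f} MB objects -> {rate:7.1f} MB/s per connection")
```

Under these assumed inputs, tiny objects spend almost all of their time in request setup, while shard-sized objects approach the streaming rate, which is the argument for packing samples into large shards.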
In the AI lifecycle, object storage carries five distinct payloads, and they have different access patterns even though they share a bucket. The training corpus is read-heavy, large-sequential, fan-out across thousands of GPUs. Checkpoints are write-burst (incast) then mostly cold, until a restart makes one of them suddenly latency-critical (Chapter 9.4). The model registry is small in object count but each object is huge and read on cold start. Eval and inference logs are append-heavy, queried analytically. And the data lake is the governed, queryable view over the raw corpus. Sizing the bucket to the average of these patterns guarantees you under-serve the peak of each one.
The master fork: on-prem object vs cloud object
The first and least-reversible decision is where the petabytes physically live, because data has gravity: once a multi-petabyte corpus lands somewhere, moving it costs money, time, and egress fees, and every workload that touches it is pulled toward it. The choice is not really "AWS vs MinIO" — it is a choice about who owns the durability SLA, who pays for reads, and how locked-in the data gravity makes you over the asset's life. The data-gravity economics are developed quantitatively in Chapter 9.8.
Cloud object (S3, GCS, Azure Blob, and the regional sovereign clouds) buys you a managed eleven-nines durability SLA, infinite elastic capacity, and zero capex, at the price of a per-GB-month storage rate, per-request charges, and — the line that quietly dominates AI economics — egress fees on every byte that leaves the cloud. If your GPUs are also in that cloud, egress is internal and cheap; if your GPUs are in a colo or self-build and your data is in the cloud, you are paying to feed your own training loop, and the data gravity makes it progressively harder to leave. On-prem object (MinIO, Ceph RADOS Gateway, VAST, Pure, Cloudian, Dell ECS, and the parallel-FS vendors' native S3 endpoints) inverts the trade: you own the capex and the operational burden, you set your own durability scheme, and — critically — there is no egress fee to feed a co-located GPU fleet. For a durable, large, self-hosted training operation the on-prem bucket is usually the lower-TCO floor; for spiky, multi-region, or cloud-native inference the managed bucket wins on elasticity and reach.
| Dimension | On-prem object | Cloud object | Hybrid (cloud-DR / burst) |
|---|---|---|---|
| Capex / opex | High capex, low marginal opex | Zero capex, per-GB-month + per-request opex | Mixed; pay cloud only for the overflow/DR copy |
| Egress to GPUs | None if GPUs co-located | Internal-cheap in-cloud; expensive cross-cloud/colo | Egress only on the spillover path |
| Durability SLA | You design it (EC + replication) | Managed ~11 nines, multi-AZ by default | Cloud SLA on the off-site copy |
| Elasticity | Bounded by what you racked | Effectively infinite, instant | Cloud absorbs the burst |
| Data-gravity lock-in | Locks workloads to your site | Locks workloads to that cloud + egress moat | Splits gravity; highest operational complexity |
| Best-fit | Durable, large, co-located self-build | Spiky, multi-region, cloud-native serving | On-prem primary + cloud archive/DR/overflow |
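To see why the egress line dominates when the GPUs sit outside the cloud, a rough monthly cost model helps. This is a minimal sketch: the function name and both rates are hypothetical placeholders (not any provider's price list), per-request charges are ignored, and in-cloud reads are simplified to zero egress.

```python
def monthly_cloud_cost_usd(stored_tb: float,
                           read_tb_per_month: float,
                           storage_usd_per_gb_month: float,
                           egress_usd_per_gb: float,
                           gpus_in_same_cloud: bool) -> float:
    """Storage rent plus egress; egress is charged only when GPUs are elsewhere."""
    storage = stored_tb * 1000 * storage_usd_per_gb_month
    egress = 0.0 if gpus_in_same_cloud else read_tb_per_month * 1000 * egress_usd_per_gb
    return storage + egress


# Hypothetical: 5 PB corpus, read four times a month, $0.02/GB-month, $0.05/GB egress.
for colocated in (True, False):
    cost = monthly_cloud_cost_usd(5000, 20000, 0.02, 0.05, colocated)
    print(f"GPUs in same cloud={colocated}: ${cost:,.0f} per month")
```

With these assumed rates the egress charge is an order of magnitude larger than the storage rent, so the read volume of the training loop, not the size of the corpus, sets the bill.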
S3-over-flash: the capacity tier learns to serve
The second fork is the one that has changed most since 2024: what media sits behind the bucket. Classic object storage is HDD-backed — cheap per terabyte, dense, and slow, with ~4 ms seek latency and ~hundreds of MB/s per spindle. That is fine for a true cold archive and fatal for a serving tier. The new pattern is S3-over-flash: an S3-compatible namespace whose hot data lives on NVMe SSD, delivering single-digit-millisecond GETs and an order-of-magnitude higher throughput, so the data loader can stream training shards directly from object storage at line rate instead of staging them through a separate file system first.
The cloud made this concrete with S3 Express One Zone: up to ~10x faster data access and up to ~80% lower request cost than S3 Standard, at single-digit-ms latency — but bought by collapsing to a single availability zone, which trades away the multi-AZ durability that defines S3 Standard. That is the flash fork in miniature: you pay for speed in dollars (storage roughly $0.11/GB-month, several times the Standard rate) and in durability blast-radius (one AZ, not three). On-prem, the same shift is driven by high-capacity QLC SSDs — 122 TB drives shipping in 2025 with public roadmaps to ~245 TB in 2026, at ~7 GB/s sequential read versus ~300 MB/s for an HDD and 20–100 µs latency versus ~4 ms — which let an all-flash object tier match HDD density per rack while serving like a performance tier and slashing watts-per-terabyte. The QLC vs HDD crossover is the storage-economics story of the 2026 data center, and it is reshaping how cold the cold tier really has to be.
| Property | HDD nearline object | Flash-fronted / all-QLC object |
|---|---|---|
| Per-TB cost | Lowest ($/TB floor) | 2–4x HDD, falling fast as QLC scales |
| Read latency | ~4 ms (seek-bound) | 20–100 µs device; single-digit-ms at the bucket |
| Sequential throughput | ~300 MB/s per spindle | ~7 GB/s per drive (≈20x) |
| Density per rack | High (30+ TB HDDs) | Equal or higher (122–245 TB QLC) |
| Watts per TB | High (many spindles) | Low (fewer, denser devices; less heat to reject) |
| Feeds GPUs directly? | No — must stage through a faster tier | Yes — stream shards into the loop at line rate |
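The throughput row translates directly into device counts. The sketch below uses the per-device sequential figures from the table and a hypothetical 500 GB/s aggregate read target; it counts raw devices only and ignores network, controller, erasure-coding, and non-sequential-access effects, so treat the result as a lower bound.

```python
import math

HDD_MB_S = 300    # ~300 MB/s per spindle, from the table
QLC_MB_S = 7000   # ~7 GB/s per QLC drive, from the table


def drives_for_throughput(target_gb_s: float, per_drive_mb_s: float) -> int:
    """Minimum device count whose summed sequential rate meets the target."""
    return math.ceil(target_gb_s * 1000 / per_drive_mb_s)


target = 500  # hypothetical aggregate read target in GB/s
print("HDD spindles:", drives_for_throughput(target, HDD_MB_S))
print("QLC drives:  ", drives_for_throughput(target, QLC_MB_S))
```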
Lifecycle and tiering: paying archive prices for archive data
Not every object earns flash. The corpus shard read every epoch does; the checkpoint from a run that finished four months ago does not; the inference log from last quarter belongs in deep archive. Lifecycle and tiering policies are the mechanism that keeps the average cost low while the hot path stays fast — they automatically demote objects down a temperature ladder (hot flash → warm standard → cool infrequent-access → cold/archive) as access frequency drops, and the cloud's intelligent-tiering variants do this without an explicit policy by watching access patterns.
The consequence here is subtle. A deep-archive class is cheap to store in but slow, and sometimes expensive, to read back, which is fine for a compliance copy and disastrous for a checkpoint you might restart from. The anti-pattern is tiering by age alone: a six-month-old checkpoint looks archival until a model regression makes it the one you must restore now, and a multi-hour archive retrieval becomes the long pole in your recovery. Tier by access likelihood and recovery-time requirement, not by calendar age, and keep anything on a recovery path in an instant-retrieval class even if it is cold.
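One way to express that rule is a placement function that looks at recovery requirements before age. This is an illustrative sketch only: the tier labels, the 7-day and 90-day thresholds, and the four-hour archive restore time are all hypothetical, and a real deployment would map the result onto its own store's storage classes.

```python
from dataclasses import dataclass


@dataclass
class ObjectInfo:
    days_since_last_read: int
    on_recovery_path: bool        # e.g. a checkpoint you might restart from
    max_restore_seconds: float    # recovery-time requirement for this object


def choose_tier(obj: ObjectInfo, archive_restore_seconds: float = 4 * 3600) -> str:
    """Pick a tier by access recency first, then by recovery need, never by age alone."""
    if obj.days_since_last_read <= 7:
        return "hot"
    if obj.days_since_last_read <= 90:
        return "infrequent-access"
    # Old and cold: only objects that can tolerate a slow restore go to deep archive.
    if obj.on_recovery_path or obj.max_restore_seconds < archive_restore_seconds:
        return "cold-instant-retrieval"
    return "deep-archive"


old_checkpoint = ObjectInfo(days_since_last_read=180, on_recovery_path=True,
                            max_restore_seconds=600)
old_log = ObjectInfo(days_since_last_read=180, on_recovery_path=False,
                     max_restore_seconds=7 * 24 * 3600)
print(choose_tier(old_checkpoint), choose_tier(old_log))
```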
Data lakes and the lakehouse: Iceberg won the format war
A bucket full of Parquet files is a data lake; a bucket full of Parquet files with a transactional metadata layer that gives you schema evolution, time travel, ACID commits, and partition evolution is a lakehouse. The metadata layer is the open table format, and for AI fleets it matters because the same governed corpus must be queryable by the data-prep pipeline, the training loader, the eval harness, and the analytics engine — without copying it four times. The format is the contract that lets many engines read one copy of the data in place.
The format war is effectively over: the ecosystem has converged on Apache Iceberg as the interoperability standard. AWS made Iceberg the default for Athena, Glue, and EMR and shipped S3 Tables (Iceberg-native buckets with a managed REST catalog); Snowflake added native Iceberg tables; Databricks acquired Tabular (Iceberg's creators) and uses Delta Lake UniForm to expose Iceberg-compatible metadata so a single Delta table reads as either format. Iceberg v3 deliberately aligns deletion semantics, file layout, and row tracking with Delta so one copy of data serves both. The practical decision: pick Iceberg unless you are all-in on the Databricks/Delta stack, in which case Delta-with-UniForm gives you interop anyway; Hudi remains the choice only for write-heavy, streaming-upsert CDC workloads. Betting on a proprietary or orphaned format is the expensive mistake — it strands your catalog and forecloses multi-engine access, which is the entire reason to run a lakehouse. The governance and lineage regime that sits on top of the lake is Chapter 10.10.
Durability and protection: the same eleven nines, bought three ways
"Eleven nines" is a marketing number until you ask how it is achieved, because the same durability target costs very differently depending on the protection scheme. The two primitives are replication (store N full copies; simple, fast to repair, but N-x the raw capacity) and erasure coding (split each object into k data + m parity shards spread across failure domains; survives m losses at a fraction of the overhead of replication). At exabyte scale, erasure coding is mandatory — paying 3x raw capacity for triple replication on a 10 PB corpus is 20 PB of wasted media — but it costs more CPU on write and read-reconstruct, and repairs touch many nodes. The failure domain the shards span is the real durability lever: spread across drives only and you survive drive failures; across nodes and you survive node failures; across availability zones and you survive a whole-AZ loss. S3 Standard's eleven nines come from multi-AZ erasure coding plus continuous background integrity scanning (checksums, auditors, automated re-replication); S3 Express One Zone deliberately gives up the multi-AZ span for latency, which is why its durability blast-radius is a single AZ.
The two failure modes durability marketing hides are silent data corruption (SDC) and operator error. Bit rot is caught by end-to-end checksums and background scrubbing; verify that your object store does both, because a corrupted training shard poisons a run silently. And eleven nines of durability is no protection against an errant rm, a buggy lifecycle rule, or ransomware: that is what object lock / immutability (WORM) and versioning are for. For checkpoints and the model registry, an immutability window plus versioning is the difference between a recoverable incident and a destroyed lineage; the SDC-detection discipline connects to fleet health in Chapter 10.6.
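The end-to-end checksum idea is simple enough to sketch. In the example below an in-memory dict stands in for the bucket, and the function names are hypothetical; the point is only that the digest is computed by the writer, stored with the object, and re-verified by the reader, so corruption anywhere in between is detected rather than trained on.

```python
import hashlib


def put_with_checksum(store: dict, key: str, data: bytes) -> str:
    """Store the payload together with a SHA-256 digest computed by the writer."""
    digest = hashlib.sha256(data).hexdigest()
    store[key] = (data, digest)
    return digest


def get_verified(store: dict, key: str) -> bytes:
    """Return the payload only if it still matches the digest recorded at write time."""
    data, expected = store[key]
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {key!r}")
    return data


bucket = {}  # stand-in for an object store
put_with_checksum(bucket, "shard-000.tar", b"training samples")
assert get_verified(bucket, "shard-000.tar") == b"training samples"

# Simulate a flipped bit at rest: the read now fails loudly instead of silently.
data, digest = bucket["shard-000.tar"]
bucket["shard-000.tar"] = (b"trainjng samples", digest)
try:
    get_verified(bucket, "shard-000.tar")
except ValueError as err:
    print("detected:", err)
```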
Deep dive: erasure coding math and why the failure domain is the real knob
An erasure code is written as (k, m): each object is divided into k data shards and m parity shards, any k of the k+m suffice to reconstruct, and the storage overhead is (k+m)/k. A common (8,3) scheme stores 11 shards for 8 of data, a 1.375x overhead that tolerates any 3 simultaneous losses, versus a 3.0x overhead for triple replication, which tolerates only 2. At 10 PB of corpus that is the difference between ~13.75 PB and ~30 PB of raw media, a multi-million-dollar line on flash. So erasure coding wins decisively on capacity efficiency at scale.
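The overhead arithmetic above can be checked in a few lines. This is a minimal sketch; the function names are illustrative, and the inputs are the (8,3) scheme, triple replication, and 10 PB corpus used in the text.

```python
def ec_overhead(k: int, m: int) -> float:
    """Raw-to-logical capacity ratio for a (k, m) erasure code."""
    return (k + m) / k


def replication_overhead(copies: int) -> float:
    """Raw-to-logical capacity ratio for N full copies."""
    return float(copies)


def raw_capacity_pb(logical_pb: float, overhead: float) -> float:
    return logical_pb * overhead


corpus_pb = 10
print(raw_capacity_pb(corpus_pb, ec_overhead(8, 3)))         # 13.75
print(raw_capacity_pb(corpus_pb, replication_overhead(3)))   # 30.0
```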
The catch, and the design decision, is where the k+m shards land. If all 11 shards of an (8,3) object sit on drives within one node, a node failure takes the object — the code protected you against drives, not nodes. Spread the shards across 11 nodes and you survive node failures; spread them across availability zones and you survive an AZ loss, which is exactly how cloud object storage buys its multi-AZ eleven nines. The cost of widening the failure domain is write amplification and repair traffic across the network: a wider spread means every write and every reconstruct touches more nodes and more fabric, which is why erasure-coded object stores must be co-designed with the network and why rebuild traffic is isolated from the training fabric (the incast-isolation argument in Chapter 9.8). The right scheme is the narrowest failure domain that meets your real loss-tolerance target — not the widest one the vendor will sell you.
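Whether a given placement actually survives the loss of one failure domain is a counting exercise: after removing the most heavily loaded domain, at least k shards must remain. The sketch below assumes a placement is given as a list of domain labels, one per shard; the function name and the node and zone labels are hypothetical.

```python
from collections import Counter


def survives_single_domain_loss(shard_domains: list[str], k: int) -> bool:
    """True if losing any one failure domain still leaves at least k shards."""
    worst_case_lost = max(Counter(shard_domains).values())
    return len(shard_domains) - worst_case_lost >= k


# (8,3): 11 shards, any 8 reconstruct.
one_node = ["node-0"] * 11
eleven_nodes = [f"node-{i}" for i in range(11)]
three_zones = ["az-a"] * 4 + ["az-b"] * 4 + ["az-c"] * 3

print(survives_single_domain_loss(one_node, k=8))      # False
print(survives_single_domain_loss(eleven_nodes, k=8))  # True
print(survives_single_domain_loss(three_zones, k=8))   # False
```

The last case is worth noticing: an (8,3) layout split 4/4/3 across three zones loses four shards with one zone and can no longer reconstruct, so spanning zones requires parity at least as large as the number of shards placed in any single zone.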
Object as the inference cold tier and model-distribution backbone
The last role is the one that retroactively justifies treating object storage as a serving tier rather than an archive: object storage is where models live between requests and how they reach every GPU in the fleet. A model server on cold start pages a multi-hundred-gigabyte weight file out of the bucket onto the GPU; an autoscaler spinning up a new replica reads the same weights; a fleet-wide model rollout fans the same objects out to thousands of nodes at once. When that happens, the bucket's read latency becomes cold-start latency and its aggregate fan-out throughput becomes the ceiling on how fast you can scale a model up to meet a traffic spike — both of which are user-facing inference SLO concerns, not archive concerns.
This is why the inference-era capacity tier is increasingly flash-fronted and why the new inference memory hierarchy (Chapter 9.7) has an object/Ethernet-flash layer explicitly at its base: the KV-cache and weight tiers above it page to and from object storage, and a slow bucket throttles cold-start and cache rehydration. If object storage is your model-distribution backbone, you size it for the fan-out burst of a fleet-wide rollout — thousands of simultaneous reads of the same large objects — not for steady-state archive throughput. Under-provision it and your cold-start time-to-first-token and your ability to absorb a traffic surge both degrade. The model-serving and cold-start mechanics live in Chapter 9.7; the read-amplification of a synchronized fan-out is a fabric co-design problem shared with Chapter 8.5.
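The fan-out sizing argument reduces to a simple bound. The sketch below assumes every node pulls the full weight set straight from the bucket with no peer-to-peer distribution or intermediate cache; the function name and all input figures are hypothetical.

```python
def rollout_seconds(nodes: int,
                    weights_gb: float,
                    bucket_gb_s: float,
                    per_node_gb_s: float) -> float:
    """Lower bound on rollout time: the slower of the bucket and the node link."""
    aggregate_bound = nodes * weights_gb / bucket_gb_s
    per_node_bound = weights_gb / per_node_gb_s
    return max(aggregate_bound, per_node_bound)


# Hypothetical fleet: 2000 nodes, 400 GB of weights, 10 GB/s per node.
for bucket_gb_s in (500, 5000, 50000):
    t = rollout_seconds(2000, 400, bucket_gb_s, 10)
    print(f"bucket at {bucket_gb_s:>5} GB/s -> rollout >= {t:6.0f} s")
```

Under these assumptions the bucket's aggregate throughput is the binding term until it exceeds the sum of the node links, at which point the per-node link sets the floor.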
Deep dive: when to stage through a file system vs read object directly
The recurring architecture question is whether the training loop reads directly from the object bucket or whether object storage feeds a faster file-system or NVMe cache tier that the GPUs actually read. The answer turns on three variables. First, object media: an HDD-nearline bucket cannot feed a training loop at line rate, so it must stage through a faster tier; a flash-fronted / S3-over-flash bucket can often be read directly. Second, reuse: if the same shards are read every epoch for weeks, a one-time copy into a node-local NVMe or parallel-FS cache amortizes cheaply and removes the bucket from the steady-state hot path; if data is touched once (streaming, single-pass), staging is wasted copy and direct read wins. Third, working-set fit: if the corpus fits in the cache tier you stage once and forget the bucket; if it dwarfs the cache you stream, and the bucket's sustained throughput is the binding constraint.
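The three variables can be read as a small decision procedure. The sketch below is one interpretation of the paragraph above, not a definitive policy; the function name and the returned labels are hypothetical.

```python
def read_path(bucket_is_flash: bool,
              passes_over_data: int,
              corpus_tb: float,
              cache_tb: float) -> str:
    """Choose how the training loop should reach data held in object storage."""
    fits_in_cache = corpus_tb <= cache_tb
    if not bucket_is_flash:
        # An HDD-nearline bucket cannot feed the loop at line rate: always stage.
        return "stage-once" if fits_in_cache else "stream-through-cache"
    if passes_over_data <= 1:
        # Single-pass data gains nothing from a copy.
        return "direct-read"
    # Multi-epoch reuse: a one-time copy amortizes if the corpus fits the cache;
    # otherwise stream, and the bucket's sustained throughput is the constraint.
    return "stage-once" if fits_in_cache else "direct-read"


print(read_path(bucket_is_flash=False, passes_over_data=10, corpus_tb=500, cache_tb=1000))
print(read_path(bucket_is_flash=True, passes_over_data=1, corpus_tb=500, cache_tb=1000))
print(read_path(bucket_is_flash=True, passes_over_data=10, corpus_tb=5000, cache_tb=1000))
```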
The 2026 default for large pre-training is a caching architecture: object storage as the durable, capacity-priced source of truth, fronted by a flash cache (a parallel FS or a distributed cache such as the 46 PB flash cache in Meta's RSC) that absorbs the per-epoch reads, with the loader prefetching ahead of the GPU. For single-pass streaming and for inference cold-start, the flash-fronted bucket is read directly. The wrong call, whether staging a single-pass stream or reading an HDD bucket directly into a training loop, shows up immediately as data-loader stall and collapsed GPU utilization. The loader-and-cache mechanics are Chapter 9.5; the NVMe and GPUDirect path is Chapter 9.3.