One-line hook
A client’s S3 bucket had quietly grown past a petabyte of operational media — overwhelmingly cold, occasionally hot, never deleted — and the line item was starting to embarrass the CFO. Here’s how we cut it without a migration window, a data-loss panic, or a single ticket from a confused user.
Who this is for
- Teams sitting on multi-hundred-TB to multi-PB buckets that grew organically and were never tiered
- Engineering leaders being asked “can we just delete it?” by finance and “absolutely not” by legal/compliance
- Anyone who has stared at `s3:ListObjectsV2` pagination and wondered if there’s a cheaper way
The setup (vague-client framing)
- Client: a B2B SaaS platform in the operations/asset-management space. Every customer transaction generates artifacts: photos, signed PDFs, generated documents, exported reports. Retention is effectively “forever” because of contract/legal requirements.
- Volume: ~1.2 PB at the time of engagement, growing ~30 TB/month
- Access pattern: classic long-tail. ~92% of bytes hadn’t been read in 90+ days. <1% of bytes were read in any given week. But which objects would be read was unpredictable — driven by disputes, audits, support requests, customer self-service.
- The “easy” answers (delete old stuff, just turn on Intelligent-Tiering) were both wrong for non-obvious reasons.
Why this was harder than it looked
- You can’t just enable Intelligent-Tiering across a bucket like this. The monitoring fee is charged per object (~$0.0025 per 1,000 objects per month). For a bucket with hundreds of millions of small objects, the monitoring fee alone can eat the savings, and objects under 128 KB are never auto-tiered at all.
- Lifecycle transitions cost money per object. Each transition is billed as a request, priced per 1,000 like a PUT, with the rate depending on the destination class. Moving 400M objects to Glacier Instant Retrieval is a non-trivial line item by itself.
- Glacier Deep Archive saves the most, but retrieval is measured in hours. Customer support pulling a 3-year-old signed agreement during a dispute cannot wait 12 hours.
- Retrieval costs can ambush you. A misconfigured “rehydrate everything for a customer export” job can run up a five-figure bill in a single afternoon.
- Object metadata is the only thing telling you what’s safe to tier. And in a bucket that grew organically, that metadata is inconsistent.
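The per-object fees above are easy to sanity-check with arithmetic before touching a bucket. The sketch below is illustrative only: every price constant is an assumed list price (check the current S3 pricing page for your region), the 400M object count comes from the text, and the average object size is a hypothetical input.

```python
from dataclasses import dataclass

# ASSUMED list prices in USD; verify against current S3 pricing for your region.
STANDARD_PER_GB_MONTH = 0.023
GLACIER_IR_PER_GB_MONTH = 0.004
TRANSITION_TO_GIR_PER_1K = 0.02
INT_MONITORING_PER_1K_MONTH = 0.0025  # the figure quoted in the text
GIR_MIN_BILLABLE_KB = 128  # Glacier Instant Retrieval bills small objects as 128 KB


@dataclass
class TieringEstimate:
    one_time_transition_usd: float
    monthly_storage_saving_usd: float
    monthly_monitoring_fee_usd: float  # only relevant if Intelligent-Tiering is used


def estimate(object_count: int, avg_object_kb: float) -> TieringEstimate:
    """Rough cost of moving every object from Standard to Glacier Instant Retrieval."""
    standard_gb = object_count * avg_object_kb / 1_000_000  # KB -> GB (decimal)
    # Small objects are billed at the minimum size after the move.
    gir_gb = object_count * max(avg_object_kb, GIR_MIN_BILLABLE_KB) / 1_000_000
    saving = standard_gb * STANDARD_PER_GB_MONTH - gir_gb * GLACIER_IR_PER_GB_MONTH
    return TieringEstimate(
        one_time_transition_usd=object_count / 1000 * TRANSITION_TO_GIR_PER_1K,
        monthly_storage_saving_usd=saving,
        monthly_monitoring_fee_usd=object_count / 1000 * INT_MONITORING_PER_1K_MONTH,
    )


if __name__ == "__main__":
    # 400M objects as in the text; 200 KB average size is a made-up example value.
    print(estimate(object_count=400_000_000, avg_object_kb=200))
```

Under these assumed prices, 400M objects work out to roughly $8,000 in one-time transition requests, or about $1,000 per month in Intelligent-Tiering monitoring fees. Whether that pays back depends almost entirely on average object size, which is why the inventory step comes first.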
The diagnostic phase
- Step 1: Don’t trust the console’s bucket size. Use S3 Storage Lens (or roll your own inventory pipeline). The console number lags by days and aggregates across versions in ways that mislead you.
- Step 2: Generate an S3 Inventory report. This is the unsung hero. CSV/Parquet manifest of every object: key, size, last-modified, storage class, encryption status. Drop it in Athena.
- Step 3: Query the inventory like it’s a data warehouse. Group by prefix, by age bucket, by size bucket. Find the distributions.
- Snippet: Python + boto3 + Athena query to produce a “cost-by-prefix-by-age” matrix
- Step 4: Map prefixes back to product features. This is where engineering judgment lives: a database query won’t tell you that `/exports/` is regenerable but `/signed-agreements/` is not.
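A minimal sketch of the cost-by-prefix-by-age query from Steps 2 and 3 might look like the following. It assumes the S3 Inventory table is already registered in Athena with columns `key`, `size`, `last_modified_date` (as a timestamp), and `storage_class`, partitioned by a `dt` string; the table name, partition value, and price map are all hypothetical. The Athena client is passed in (for example `boto3.client("athena")`) so the query builder and the pivot can be exercised without AWS access.

```python
import time
from collections import defaultdict

# ASSUMED storage prices in USD per GB-month; replace with current pricing.
PRICE_PER_GB = {"STANDARD": 0.023, "STANDARD_IA": 0.0125,
                "GLACIER_IR": 0.004, "DEEP_ARCHIVE": 0.00099}
AGE_BUCKETS = [(30, "0-30d"), (90, "30-90d"), (365, "90-365d")]


def build_query(table: str, dt: str) -> str:
    """SQL that groups inventory rows by top-level prefix, age bucket, and class."""
    cases = " ".join(
        f"WHEN date_diff('day', last_modified_date, current_timestamp) < {days} "
        f"THEN '{label}'"
        for days, label in AGE_BUCKETS
    )
    return f"""
        SELECT split_part(key, '/', 1) AS prefix,
               CASE {cases} ELSE '365d+' END AS age_bucket,
               storage_class,
               count(*) AS objects,
               sum(size) AS bytes
        FROM {table}
        WHERE dt = '{dt}'
        GROUP BY 1, 2, 3
    """


def run_query(athena, sql, database, output_location, poll_seconds=2.0):
    """Run the query on an Athena client and return data rows as lists of strings."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {qid} ended in state {state}")
    rows = []
    for page in athena.get_paginator("get_query_results").paginate(QueryExecutionId=qid):
        for row in page["ResultSet"]["Rows"]:
            rows.append([col.get("VarCharValue") for col in row["Data"]])
    return rows[1:]  # the first row of a SELECT result is the column header


def cost_matrix(rows):
    """Pivot (prefix, age, class, objects, bytes) rows into {prefix: {age: usd/month}}."""
    matrix = defaultdict(lambda: defaultdict(float))
    for prefix, age, storage_class, _objects, size_bytes in rows:
        gb = int(size_bytes) / 1e9
        # Unknown classes fall back to the Standard price (a deliberate overestimate).
        price = PRICE_PER_GB.get(storage_class, PRICE_PER_GB["STANDARD"])
        matrix[prefix][age] += gb * price
    return {prefix: dict(ages) for prefix, ages in matrix.items()}
```

Grouping on only the first path segment is a simplification; a bucket with deeper tenant or feature prefixes would need a different `split_part` expression. Note too that inventory carries last-modified time, not last-read time, so age here is a proxy for access.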
The tiering strategy we actually shipped
A tiered strategy, not a single-class strategy. Different prefixes got different lifecycle policies based on access pattern and regenerability.
| Prefix family | Storage class plan | Rationale |
|---|---|---|
| Signed legal artifacts | Standard → Glacier Instant Retrieval at 90 days | Read rarely, but when read, it’s during a dispute and latency matters |
| User-uploaded media | Standard → Standard-IA at 30d → Glacier Instant at 180d | Strong “recent is hot” pattern |
| Generated exports/reports | Standard → expire at 30d | Regenerable from source data — delete, don’t tier |
| Thumbnails / derivatives | Standard → expire at 7d | Cheap to recreate, never worth storing cold |
| Internal logs / audit trails | Standard → Glacier Deep Archive at 365d | Almost never read; retrieval latency acceptable |
Key insight: the cheapest byte is the one you don’t store. Roughly 18% of bucket size turned out to be regenerable derivatives that nobody had thought to expire.
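Expressed as lifecycle rules, the table above could be generated and diffed against the bucket's current configuration along these lines. This is a sketch, not the client's actual module: the rule IDs and most prefixes are hypothetical (only `exports/` and `signed-agreements/` are named in the text), and the `s3` argument is assumed to be a `boto3.client("s3")`.

```python
def rule(rule_id, prefix, transitions=(), expire_days=None):
    """Build one lifecycle rule dict in the shape the S3 API expects."""
    r = {"ID": rule_id, "Filter": {"Prefix": prefix}, "Status": "Enabled"}
    if transitions:
        r["Transitions"] = [{"Days": d, "StorageClass": c} for d, c in transitions]
    if expire_days is not None:
        r["Expiration"] = {"Days": expire_days}
    return r


# Prefixes other than exports/ and signed-agreements/ are HYPOTHETICAL placeholders.
DESIRED_RULES = [
    rule("signed-legal", "signed-agreements/", [(90, "GLACIER_IR")]),
    rule("user-media", "uploads/", [(30, "STANDARD_IA"), (180, "GLACIER_IR")]),
    rule("exports", "exports/", expire_days=30),
    rule("thumbnails", "thumbnails/", expire_days=7),
    rule("audit-logs", "audit-logs/", [(365, "DEEP_ARCHIVE")]),
]


def diff_rules(current, desired):
    """Compare two rule lists by ID and report what would be added/removed/changed."""
    cur = {r["ID"]: r for r in current}
    des = {r["ID"]: r for r in desired}
    return {
        "add": sorted(des.keys() - cur.keys()),
        "remove": sorted(cur.keys() - des.keys()),
        "change": sorted(i for i in des.keys() & cur.keys() if des[i] != cur[i]),
    }


def apply_rules(s3, bucket, desired, dry_run=True):
    """Show the diff and, unless dry_run, replace the bucket's whole lifecycle config."""
    try:
        current = s3.get_bucket_lifecycle_configuration(Bucket=bucket).get("Rules", [])
    except Exception as exc:  # botocore ClientError exposes the code on .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code != "NoSuchLifecycleConfiguration":
            raise
        current = []
    changes = diff_rules(current, desired)
    if not dry_run and any(changes.values()):
        # This call REPLACES every existing rule on the bucket, not just changed ones.
        s3.put_bucket_lifecycle_configuration(
            Bucket=bucket, LifecycleConfiguration={"Rules": desired}
        )
    return changes
```

Defaulting to `dry_run=True` is a deliberate choice: the put call overwrites the full configuration, so reviewing the diff first is cheap insurance. Plain dict equality may also flag a rule as changed if S3 returns it in a normalized form, so treat the `change` list as a prompt to look, not as proof.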
Implementation, the boring-but-critical parts
- Lifecycle rules in code, not the console. Terraform module + tested rule generation. (Code snippet in Python via `boto3` showing rule application + diffing the current vs. desired state.)
- A “tiering canary” before applying to the whole bucket. Apply the rule to a 0.1% sampled prefix first. Wait 30 days. Measure retrieval volume against baseline.
- A retrieval-cost guardrail. Wrap any code path that issues `RestoreObject` or reads from IA/Glacier classes with a budgeting middleware that emits a metric per request, and an alert at $X/hour.
- Code snippet: a small Python decorator that wraps the read path, tags requests with a “tier-cost-class,” and emits to CloudWatch/Prometheus
- Prefix-aware retrieval logic in the app. When code asks for an object, it should know (or be able to discover) the storage class and either: (a) serve it transparently if Instant, (b) queue an async restore with user-facing messaging if Deep Archive, (c) return a regenerate-from-source path if the object was an expired derivative.
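The three-way decision in the last bullet can be sketched as a small function that inspects the object before reading it. This is an illustrative outline under assumptions: `s3` is a `boto3.client("s3")`, the regenerable prefixes and the 7-day restore window are hypothetical values, and the returned strings are placeholders for whatever the application does next.

```python
ARCHIVE_CLASSES = ("GLACIER", "DEEP_ARCHIVE")  # classes that need a restore first


def plan_read(s3, bucket, key, regenerable_prefixes=("exports/", "thumbnails/")):
    """Return 'serve', 'restore_pending', or 'regenerate' for the given object."""
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except Exception as exc:  # botocore ClientError exposes the code on .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        missing = code in ("404", "NoSuchKey", "NotFound")
        if missing and key.startswith(tuple(regenerable_prefixes)):
            return "regenerate"  # expired derivative: rebuild from source data
        raise

    # S3 omits the storage class header for Standard objects.
    storage_class = head.get("StorageClass", "STANDARD")
    if storage_class not in ARCHIVE_CLASSES:
        return "serve"  # Standard, IA, and Glacier Instant Retrieval read directly

    restore = head.get("Restore", "")
    if 'ongoing-request="false"' in restore:
        return "serve"  # a temporary restored copy is currently available
    if 'ongoing-request="true"' not in restore:
        # No restore in flight yet, so request one. Days and Tier are ASSUMED values.
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
        )
    return "restore_pending"  # caller shows the "we'll notify you" message
```

Returning a plan instead of the bytes keeps the user-facing messaging in the caller, where it belongs, and makes the function straightforward to test with a stub client.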
What we’d do differently
- We under-estimated the first-month transition request bill. Plan for it; don’t get surprised.
- We over-engineered the canary. The S3 Inventory data was good enough to predict access patterns within ~5% — the canary mostly confirmed what we already knew.
- We should have built the retrieval-cost guardrail first, before any tiering went live. We had one near-miss where a bulk export job almost read 200k objects straight out of IA-class storage.
- “Just turn on Intelligent-Tiering everywhere” is the right answer for some prefixes (large-object, unpredictable-access) and the wrong answer for others (small-object, regular-access). Don’t bucket-wide it.
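A guardrail of the kind described above can be quite small, which is part of why it is worth building first. The sketch below is one possible shape, not the version that shipped: the per-GB retrieval prices are assumed values, `emit` is a stand-in for a CloudWatch or Prometheus client, and it blocks reads once the hourly budget is exceeded, where the text only calls for an alert.

```python
import functools
import time
from collections import deque

# ASSUMED retrieval prices in USD per GB; verify against current S3 pricing.
RETRIEVAL_USD_PER_GB = {"STANDARD": 0.0, "STANDARD_IA": 0.01,
                        "GLACIER_IR": 0.03, "DEEP_ARCHIVE": 0.02}


class BudgetExceeded(RuntimeError):
    """Raised when a read would push spend past the rolling hourly budget."""


class RetrievalBudget:
    def __init__(self, usd_per_hour, emit=print, clock=time.monotonic):
        self.usd_per_hour = usd_per_hour
        self.emit = emit    # hypothetical metrics sink; any callable taking a dict
        self.clock = clock  # injectable so tests can control time
        self._events = deque()  # (timestamp, usd) for the rolling one-hour window

    def spent_last_hour(self):
        cutoff = self.clock() - 3600
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        return sum(usd for _, usd in self._events)

    def charge(self, storage_class, size_bytes):
        usd = size_bytes / 1e9 * RETRIEVAL_USD_PER_GB.get(storage_class, 0.0)
        tags = {"tier_cost_class": storage_class, "usd": usd}
        if self.spent_last_hour() + usd > self.usd_per_hour:
            self.emit({"metric": "retrieval_blocked", **tags})
            raise BudgetExceeded(f"hourly retrieval budget of ${self.usd_per_hour} hit")
        self._events.append((self.clock(), usd))
        self.emit({"metric": "retrieval_usd", **tags})


def guarded_read(budget):
    """Decorator for read functions taking (key, storage_class, size_bytes, ...)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(key, storage_class, size_bytes, *args, **kwargs):
            budget.charge(storage_class, size_bytes)  # raises before any bytes move
            return fn(key, storage_class, size_bytes, *args, **kwargs)
        return wrapper
    return decorator
```

The wrapped function needs the storage class and size up front, which in practice means reading them from the inventory or a `head_object` call before the read. Swapping the `raise` for an alert-only path is a one-line change if blocking is too aggressive for a given code path.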
Results
- ~60% reduction in monthly S3 line item, sustained over 12+ months
- Zero customer-visible regressions; <5 internal support tickets related to tiered-object access, all resolved by the async-restore flow we built
- Bill growth flattened: it used to track total data stored, and now tracks mostly newly written data (because old data tiers down)
- Recovered budget redirected into… (vague client outcome: scaling team, new product line, whatever)
The takeaway
You don’t have a storage problem. You have an access pattern problem dressed up as a storage problem. Build the inventory pipeline, map prefixes to product features, tier by access pattern not by age, and put a guardrail on the retrieval side before you flip a single lifecycle rule. The bill takes care of itself.