One-line hook

A client’s S3 bucket had quietly grown past a petabyte of operational media — overwhelmingly cold, occasionally hot, never deleted — and the line item was starting to embarrass the CFO. Here’s how we cut it without a migration window, a data-loss panic, or a single ticket from a confused user.

Who this is for

Teams sitting on multi-hundred-TB to multi-PB buckets that grew organically and were never tiered
Engineering leaders being asked “can we just delete it?” by finance and “absolutely not” by legal/compliance
Anyone who has stared at ListObjectsV2 pagination and wondered if there’s a cheaper way

The setup (vague-client framing)

Client: a B2B SaaS platform in the operations/asset-management space. Every customer transaction generates artifacts: photos, signed PDFs, generated documents, exported reports. Retention is effectively “forever” because of contract/legal requirements.
Volume: ~1.2 PB at the time of engagement, growing ~30 TB/month
Access pattern: classic long-tail. ~92% of bytes hadn’t been read in 90+ days. <1% of bytes were read in any given week. But which objects would be read was unpredictable — driven by disputes, audits, support requests, customer self-service.
The “easy” answers (delete old stuff, just turn on Intelligent-Tiering) were both wrong for non-obvious reasons.

Why this was harder than it looked

You can’t just enable Intelligent-Tiering bucket-wide on a bucket like this. The monitoring fee is real (~$0.0025 per 1,000 monitored objects per month). For a bucket with hundreds of millions of small objects, the monitoring fee alone can eat the savings, and objects under 128 KB are never auto-tiered at all, so they save nothing.
Lifecycle transitions cost money per object. Each transition is billed as a request, priced per 1,000 objects by destination class. Moving 400M objects to Glacier Instant Retrieval is a non-trivial line item on its own.
Glacier Deep Archive saves the most, but retrieval is measured in hours. Customer support pulling a 3-year-old signed agreement during a dispute cannot wait 12 hours.
Retrieval costs can ambush you. A misconfigured “rehydrate everything for a customer export” job can run up a five-figure bill in a single afternoon.
Object metadata is the only thing telling you what’s safe to tier. And in a bucket that grew organically, that metadata is inconsistent.
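A back-of-envelope sketch makes the first two points concrete. The monitoring fee is the figure quoted above; the transition price and the per-GB saving in the usage note are assumed, illustrative numbers (check current pricing for your region), and the function names are ours.

```python
# Back-of-envelope request and monitoring costs for tiering many objects.
# ASSUMPTION: prices are illustrative placeholders, not quoted from a price list.

MONITORING_USD_PER_1K_OBJECTS = 0.0025   # per month, Intelligent-Tiering (figure from the text)
TRANSITION_USD_PER_1K_REQUESTS = 0.02    # assumed price for transitions into Glacier Instant Retrieval


def monthly_monitoring_fee(object_count: int) -> float:
    """Recurring monthly monitoring fee for `object_count` monitored objects."""
    return object_count / 1000 * MONITORING_USD_PER_1K_OBJECTS


def one_time_transition_cost(object_count: int) -> float:
    """One-off request cost of transitioning `object_count` objects once."""
    return object_count / 1000 * TRANSITION_USD_PER_1K_REQUESTS


def breakeven_object_size_bytes(saving_usd_per_gb_month: float) -> float:
    """Object size at which the per-object monitoring fee equals the storage saving.

    Objects smaller than this cost more to monitor than they save by tiering.
    """
    fee_per_object = MONITORING_USD_PER_1K_OBJECTS / 1000
    return fee_per_object / saving_usd_per_gb_month * 1024**3


if __name__ == "__main__":
    n = 400_000_000
    print(f"monitoring: ${monthly_monitoring_fee(n):,.0f}/month")
    print(f"transition: ${one_time_transition_cost(n):,.0f} one-time")
```

Under these assumptions, 400M objects cost about $1,000 per month to monitor and about $8,000 to transition once. Passing an assumed saving of $0.0105 per GB-month to `breakeven_object_size_bytes` puts the break-even object size in the low hundreds of kilobytes, which is why object-size distribution matters as much as total bytes.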

The diagnostic phase

Step 1: Don’t trust the console’s bucket size. Use S3 Storage Lens (or roll your own inventory pipeline). The console number lags by days and aggregates across versions in ways that mislead you.
Step 2: Generate an S3 Inventory report. This is the unsung hero. CSV/Parquet manifest of every object: key, size, last-modified, storage class, encryption status. Drop it in Athena.
Step 3: Query the inventory like it’s a data warehouse. Group by prefix, by age bucket, by size bucket. Find the distributions.
Snippet: Python + boto3 + Athena query to produce a “cost-by-prefix-by-age” matrix
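One way this could look, as a minimal sketch rather than the pipeline we ran. The table name, the `dt` partition, and the column names (`key`, `size`, `last_modified_date`, `storage_class`) are assumptions that depend on how the inventory table was defined in Athena; the per-GB prices are illustrative placeholders. The Athena client is passed in so the logic can be exercised without AWS credentials.

```python
import time
from collections import defaultdict

# ASSUMPTION: illustrative storage prices in USD per GB-month; substitute your own.
PRICE_USD_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

# "Age" here is time since last modification. S3 Inventory has no last-read
# timestamp, so this approximates coldness, it does not measure access.
QUERY_TEMPLATE = """
SELECT split_part(key, '/', 1) AS prefix,
       CASE
         WHEN date_diff('day', date(last_modified_date), current_date) < 30  THEN '000-029d'
         WHEN date_diff('day', date(last_modified_date), current_date) < 90  THEN '030-089d'
         WHEN date_diff('day', date(last_modified_date), current_date) < 365 THEN '090-364d'
         ELSE '365d+'
       END AS age_bucket,
       storage_class,
       sum(size) AS total_bytes
FROM {table}
WHERE dt = '{dt}'
GROUP BY 1, 2, 3
"""


def run_query(athena, sql, database, output_location, poll_seconds=2.0):
    """Run a query on an Athena client and return data rows as lists of strings."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {qid} ended in state {state}")
        time.sleep(poll_seconds)
    rows = []
    pages = athena.get_paginator("get_query_results").paginate(QueryExecutionId=qid)
    for page in pages:
        for row in page["ResultSet"]["Rows"]:
            rows.append([col.get("VarCharValue") for col in row["Data"]])
    return rows[1:]  # the first row of the first page is the column header


def cost_matrix(rows):
    """Fold (prefix, age_bucket, storage_class, bytes) rows into {prefix: {age_bucket: usd}}."""
    matrix = defaultdict(dict)
    for prefix, age_bucket, storage_class, total_bytes in rows:
        gb = int(total_bytes) / 1024**3
        # Unknown classes are priced as STANDARD, which overestimates rather than hides cost.
        price = PRICE_USD_PER_GB_MONTH.get(storage_class, PRICE_USD_PER_GB_MONTH["STANDARD"])
        matrix[prefix][age_bucket] = matrix[prefix].get(age_bucket, 0.0) + gb * price
    return dict(matrix)
```

In use, `athena` would be `boto3.client("athena")` and the SQL would come from `QUERY_TEMPLATE.format(table=..., dt=...)`. Grouping on the first key segment is only a starting point; deeper prefixes usually need a second pass once the big families are known.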
Step 4: Map prefixes back to product features. This is where engineering judgment lives — a database query won’t tell you that /exports/ is regenerable but /signed-agreements/ is not.

The tiering strategy we actually shipped

A tiered strategy, not a single-class strategy. Different prefixes got different lifecycle policies based on access pattern and regenerability.

Prefix family	Storage class plan	Rationale
Signed legal artifacts	Standard → Glacier Instant Retrieval at 90 days	Read rarely, but when read, it’s during a dispute and latency matters
User-uploaded media	Standard → Standard-IA at 30 days → Glacier Instant Retrieval at 180 days	Strong “recent is hot” pattern
Generated exports/reports	Standard → expire at 30 days	Regenerable from source data — delete, don’t tier
Thumbnails / derivatives	Standard → expire at 7 days	Cheap to recreate, never worth storing cold
Internal logs / audit trails	Standard → Glacier Deep Archive at 365 days	Almost never read; retrieval latency acceptable
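The table above can be written down as data and turned into lifecycle rules. This is a sketch under assumptions: the rule IDs and most prefix names (`media/`, `thumbnails/`, `logs/`) are hypothetical stand-ins, and the output is shaped like the `Rules` list that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
# Each entry: (rule id, key prefix, [(days, storage class), ...], expire-after-days or None)
# ASSUMPTION: IDs and prefixes are illustrative, not the client's real layout.
PLAN = [
    ("signed-legal", "signed-agreements/", [(90, "GLACIER_IR")], None),
    ("user-media", "media/", [(30, "STANDARD_IA"), (180, "GLACIER_IR")], None),
    ("exports", "exports/", [], 30),
    ("thumbnails", "thumbnails/", [], 7),
    ("audit-logs", "logs/", [(365, "DEEP_ARCHIVE")], None),
]


def build_rules(plan):
    """Translate the plan into S3 lifecycle rule dicts, one rule per prefix family."""
    rules = []
    for rule_id, prefix, transitions, expire_days in plan:
        rule = {
            "ID": rule_id,
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
        }
        if transitions:
            rule["Transitions"] = [
                {"Days": days, "StorageClass": storage_class}
                for days, storage_class in transitions
            ]
        if expire_days is not None:
            rule["Expiration"] = {"Days": expire_days}
        rules.append(rule)
    return rules


if __name__ == "__main__":
    for rule in build_rules(PLAN):
        print(rule["ID"], rule.get("Transitions", []), rule.get("Expiration"))
```

Keeping the plan as a flat list makes it easy to review in a pull request next to the table it came from.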

Key insight: the cheapest byte is the one you don’t store. Roughly 18% of bucket size turned out to be regenerable derivatives that nobody had thought to expire.

Implementation, the boring-but-critical parts

Lifecycle rules in code, not the console. Terraform module + tested rule generation. (Code snippet in Python via boto3 showing rule application + diffing the current vs. desired state.)
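A minimal sketch of the diff-then-apply step, assuming rules are matched by their `ID`. The function names are ours. One thing to keep in mind: `put_bucket_lifecycle_configuration` replaces the whole configuration, so `desired` has to be the complete rule set, not a patch.

```python
def current_rules(s3, bucket):
    """Fetch the live lifecycle rules, treating 'no configuration' as an empty list."""
    try:
        return s3.get_bucket_lifecycle_configuration(Bucket=bucket)["Rules"]
    except s3.exceptions.ClientError as exc:
        if exc.response["Error"]["Code"] == "NoSuchLifecycleConfiguration":
            return []
        raise


def diff_rules(current, desired):
    """Compare two rule lists by ID and report which IDs are added, removed, or changed."""
    cur = {rule["ID"]: rule for rule in current}
    des = {rule["ID"]: rule for rule in desired}
    return {
        "add": sorted(des.keys() - cur.keys()),
        "remove": sorted(cur.keys() - des.keys()),
        # S3 may normalise fields it returns, so a "change" can be cosmetic;
        # review the diff before trusting it blindly.
        "change": sorted(i for i in cur.keys() & des.keys() if cur[i] != des[i]),
    }


def apply_rules(s3, bucket, desired, dry_run=True):
    """Print-and-return the diff; only write when dry_run is False and something differs."""
    changes = diff_rules(current_rules(s3, bucket), desired)
    if not dry_run and any(changes.values()):
        s3.put_bucket_lifecycle_configuration(
            Bucket=bucket,
            LifecycleConfiguration={"Rules": desired},
        )
    return changes
```

Here `s3` would be `boto3.client("s3")`. Defaulting to `dry_run=True` means a CI job can post the diff for review and a separate, deliberate step does the write.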
A “tiering canary” before applying to the whole bucket. Apply the rule to a 0.1% sampled prefix first. Wait 30 days. Measure retrieval volume against baseline.
A retrieval-cost guardrail. Wrap any code path that issues RestoreObject or reads from IA/Glacier classes with a budgeting middleware that emits a metric per request — and an alert at $X/hour.
Code snippet: a small Python decorator that wraps the read path, tags requests with a “tier-cost-class,” and emits to CloudWatch/Prometheus
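A sketch of what such a decorator could look like. It assumes the wrapped function returns a `get_object`-style response dict, where `StorageClass` is omitted for Standard objects and `ContentLength` is the size in bytes. The `emit(name, value, tags)` callable and the metric name are hypothetical, and the per-GB retrieval prices are illustrative placeholders.

```python
import functools

# ASSUMPTION: illustrative retrieval prices in USD per GB; substitute your own.
RETRIEVAL_USD_PER_GB = {
    "STANDARD": 0.0,
    "STANDARD_IA": 0.01,
    "GLACIER_IR": 0.03,
}


def tier_cost_guard(emit):
    """Decorate a read function so every call emits an estimated retrieval cost.

    `emit` is any callable taking (name, value, tags); it is the seam where a
    CloudWatch or Prometheus client would be plugged in.
    """
    def decorator(read_fn):
        @functools.wraps(read_fn)
        def wrapper(*args, **kwargs):
            response = read_fn(*args, **kwargs)
            # S3 omits StorageClass in the response for Standard objects.
            storage_class = response.get("StorageClass", "STANDARD")
            size_gb = response.get("ContentLength", 0) / 1024**3
            # Classes missing from the price table are emitted with zero cost
            # but still tagged, so they show up in dashboards.
            estimated_usd = size_gb * RETRIEVAL_USD_PER_GB.get(storage_class, 0.0)
            emit(
                name="s3_retrieval_estimated_usd",
                value=estimated_usd,
                tags={"tier_cost_class": storage_class},
            )
            return response
        return wrapper
    return decorator
```

The decorator only measures. The alert itself lives in the metrics system, as a threshold on the hourly sum of this metric. A stricter variant could check the storage class with a `head_object` call before reading and refuse once a budget is spent, at the price of an extra request per read.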
Prefix-aware retrieval logic in the app. When code asks for an object, it should know (or be able to discover) the storage class and either: (a) serve it transparently if Instant, (b) queue an async restore with user-facing messaging if Deep Archive, (c) return a regenerate-from-source path if the object was an expired derivative.
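The three branches above could be sketched as a small planning function. This is an illustration, not the client's code: the `Action` names, the regenerable prefixes, and the 7-day restore window are assumptions. It relies on `head_object` reporting `StorageClass` (omitted for Standard) and a `Restore` string for archived objects, and it only handles the Glacier storage classes, not Intelligent-Tiering archive tiers.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"                    # readable right now
    RESTORE_QUEUED = "restore_queued"  # archived; restore requested or in flight
    REGENERATE = "regenerate"          # expired derivative; rebuild from source


ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}
# ASSUMPTION: hypothetical prefixes whose objects can be rebuilt from source data.
REGENERABLE_PREFIXES = ("exports/", "thumbnails/")


def plan_read(s3, bucket, key, restore_days=7):
    """Decide how the app should satisfy a read, issuing a restore if one is needed."""
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except s3.exceptions.ClientError as exc:
        missing = exc.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound")
        if missing and key.startswith(REGENERABLE_PREFIXES):
            return Action.REGENERATE
        raise

    storage_class = head.get("StorageClass", "STANDARD")
    if storage_class not in ARCHIVE_CLASSES:
        return Action.SERVE  # Standard, IA, and Glacier Instant Retrieval read directly

    restore = head.get("Restore", "")
    if 'ongoing-request="false"' in restore:
        return Action.SERVE  # a temporary restored copy is available
    if not restore:
        # No restore requested yet: start one. Standard tier is an assumed default.
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={
                "Days": restore_days,
                "GlacierJobParameters": {"Tier": "Standard"},
            },
        )
    return Action.RESTORE_QUEUED
```

The caller maps `RESTORE_QUEUED` to user-facing messaging and `REGENERATE` to whatever job rebuilds the derivative. Keeping the decision in one function means the retrieval-cost guardrail has a single place to hook into.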

What we’d do differently

We under-estimated the first-month transition request bill. Plan for it; don’t get surprised.
We over-engineered the canary. The S3 Inventory data was good enough to predict access patterns within ~5% — the canary mostly confirmed what we already knew.
We should have built the retrieval-cost guardrail first, before any tiering went live. We had one near-miss where a bulk export job almost read 200k objects straight out of an IA-class tier.
“Just turn on Intelligent-Tiering everywhere” is the right answer for some prefixes (large-object, unpredictable-access) and the wrong answer for others (small-object, regular-access). Don’t bucket-wide it.

Results

~60% reduction in monthly S3 line item, sustained over 12+ months
Zero customer-visible regressions; <5 internal support tickets related to tiered-object access, all resolved by the async-restore flow we built
Bill growth flattened: it used to scale with total bytes stored at Standard rates, and now scales mostly with new data (because old data tiers down)
Recovered budget redirected into… (vague client outcome: scaling team, new product line, whatever)

The takeaway

You don’t have a storage problem. You have an access pattern problem dressed up as a storage problem. Build the inventory pipeline, map prefixes to product features, tier by access pattern not by age, and put a guardrail on the retrieval side before you flip a single lifecycle rule. The bill takes care of itself.