Fail-Open or Fail-Closed: The One Decision Every Piece of Edge Middleware Has to Make

Every rate limiter, auth gate, feature flag check, and A/B assignment that runs at the edge shares a dependency you don’t think about until 3 a.m.: a state store. A KV namespace, a Durable Object, a config store, a Redis instance somewhere. Your middleware reads a counter, a token, a flag — and then decides what to do with the request.

The question nobody asks in the happy-path demo is: what happens when that read fails?

Not “returns the wrong value.” Fails. Times out. The store is being redeployed, a POP is partitioned, the counter service is having its own bad day. Your await store.get(key) throws. Now what?
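
One way to make “fails” concrete is to put a deadline on the read, so a hung store becomes an explicit failure instead of a stalled request. What follows is a minimal sketch, assuming a store object with an async get(key) method. The helper name readWithDeadline, the 50 ms default, and the tagged result shape are all illustrative assumptions, not part of any platform API.

```javascript
// Hypothetical helper: bound a store read with a deadline and return a
// tagged result instead of throwing, so the caller has to handle the
// "store unavailable" case explicitly.
async function readWithDeadline(store, key, timeoutMs = 50) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("store read timed out")), timeoutMs);
  });
  try {
    const value = await Promise.race([store.get(key), deadline]);
    return { ok: true, value };
  } catch (error) {
    // Timeouts and errors thrown by the store both land here: the read failed.
    return { ok: false, error };
  } finally {
    clearTimeout(timer);
  }
}
```

A caller that receives { ok: false } has not learned that the caller is over the limit or that the token is bad. It has learned nothing, and that is the situation the rest of this post is about.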

There are exactly two answers, and picking the wrong one is how a small dependency outage becomes a large customer-facing one.

The two doors

Fail open means: when you can’t check, allow the request. The middleware gets out of the way. Traffic flows as if the check had passed.

Fail closed means: when you can’t check, deny the request. The middleware holds the line. Traffic is blocked as if the check had failed.

These are not “optimistic” and “pessimistic” personality traits. They are a deliberate trade between availability and safety, and which one is correct depends entirely on what your middleware is protecting.
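
The two doors can be written down as a small pure function. This is a sketch, not a prescribed API: the name resolveDecision and the three outcome strings are assumptions made for illustration. The point it shows is that the fail mode only matters in the third case, when the check could not run at all.

```javascript
// A check has three possible outcomes: "pass", "fail", or "unavailable".
// failMode is consulted only for "unavailable".
function resolveDecision(checkOutcome, failMode) {
  if (checkOutcome === "pass") return { allow: true, degraded: false };
  if (checkOutcome === "fail") return { allow: false, degraded: false };
  if (checkOutcome !== "unavailable") {
    throw new Error(`unknown check outcome: ${checkOutcome}`);
  }
  if (failMode === "open") return { allow: true, degraded: true };
  if (failMode === "closed") return { allow: false, degraded: true };
  // Refuse to guess when nobody chose a mode.
  throw new Error(`failMode must be "open" or "closed", got: ${failMode}`);
}
```

The degraded field is there so that a decision made without a check can be told apart from one made with it.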

A rate limiter should fail open

I’ll make the case with the thing I most recently built: a small edge rate limiter, written to the fetch-handler shape that Fastly Compute and Cloudflare Workers both use. Its job is to stop any single caller from sending more than N requests per window. It leans on an atomic counter in an edge store.

When that store is unavailable, the limiter fails open. Here’s the reasoning, and it’s worth being explicit because the instinct — “it’s a limiter, its whole job is to say no” — points the wrong way.

A rate limiter is a guardrail, not the reason the service exists. Nobody’s users are there for the rate limiter. They’re there for the API behind it. The limiter exists to protect that API from abuse and overload — a real and important job, but a secondary one.

Now play out fail-closed during a store outage. The counter store blips for ninety seconds. No request anywhere can be verified, so every request is denied. You have just taken a counter outage and amplified it into a total outage: a global storm of 429s, every legitimate customer locked out, because a bookkeeping service you use for a guardrail went down. The limiter, whose entire purpose was to keep the service available, has become the single thing making it unavailable.

Fail-open during that same blip? Callers who were at or near their limit get extra requests through, and for those ninety seconds nobody is being limited at all. That’s the worst case. It’s recoverable: the store comes back, the counters resume, and the over-sends were bounded by how long the outage lasted. A global false-429 is not recoverable in the same way; you can’t un-ring that bell for the customer who got locked out during your incident.

So: availability wins. The limiter fails open, and — this part matters — it flags every degraded response so the outage is visible in your telemetry instead of silently swallowed. Fail-open is not “ignore the problem.” It’s “don’t let my problem become everyone’s problem, but do make sure someone sees it.”

// The whole decision, made explicit and per-route:
failMode: "open"   // guardrail — availability > protection (default)
failMode: "closed" // gate      — safety > availability (opt-in)
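
To show where that setting takes effect, here is a sketch of a limiter in fetch-handler style. It is not the author's implementation. It assumes a hypothetical counter store whose increment(key) resolves to the new count or rejects when the store is down, identifies callers by an x-api-key header, and uses an invented x-ratelimit-degraded header as the flag. The next argument stands in for whatever serves the request behind the limiter.

```javascript
const DEGRADED_HEADER = "x-ratelimit-degraded"; // illustrative header name

// Hypothetical store interface: increment(key) resolves to the new count
// for that key, or rejects when the store is unavailable.
async function rateLimit(request, store, options, next) {
  const { limit, windowMs, failMode = "open" } = options;
  const caller = request.headers.get("x-api-key") ?? "anonymous";
  const windowId = Math.floor(Date.now() / windowMs); // fixed-window bucket
  let count;
  try {
    count = await store.increment(`rl:${caller}:${windowId}`);
  } catch (error) {
    if (failMode === "closed") {
      // Blocked as if the check had failed, but labeled as degraded.
      return new Response("rate limiter unavailable", {
        status: 429,
        headers: { [DEGRADED_HEADER]: "store-unavailable" },
      });
    }
    // Fail open: serve the request, and flag it so the outage is visible.
    const response = await next(request);
    const headers = new Headers(response.headers);
    headers.set(DEGRADED_HEADER, "store-unavailable");
    return new Response(response.body, { status: response.status, headers });
  }
  if (count > limit) {
    return new Response("too many requests", { status: 429 });
  }
  return next(request);
}
```

Only the store call sits inside the try block. An error thrown by next is a different failure and is left to propagate rather than being mistaken for a store outage.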

An auth gate should fail closed

Now take the same middleware shape and change the job. Instead of a rate limiter, it’s an authorization check at the edge: does this request carry a valid token for the resource it’s asking for?

When the token store or the verification service is unavailable, this one fails closed. Deny.

The reasoning is the mirror image. For an auth gate, letting an unverified request through is the hazard itself. The thing you’re protecting against isn’t “too much traffic,” it’s “the wrong person seeing the data.” If you can’t verify that a request is authorized, allowing it isn’t a minor over-send — it’s a potential data exposure. Availability is no longer the priority; not leaking is.

So the auth gate accepts the availability hit. During a store outage it returns 401/403 and some legitimate users are temporarily blocked. That’s bad. It is much less bad than serving protected data to someone who might not be allowed to see it. Safety wins.
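
The mirror-image behavior, sketched under similar assumptions: a hypothetical verifier whose verify(token, resource) resolves to true or false, or rejects when the token store or verification service is down. The function name authorize and the choice of 403 for the could-not-verify case are illustrative, not a recommendation about which status code to use.

```javascript
// Hypothetical verifier interface: verify(token, resource) resolves to a
// boolean, or rejects when verification cannot be performed.
async function authorize(token, resource, verifier) {
  if (!token) {
    return { allow: false, status: 401, degraded: false };
  }
  try {
    const valid = await verifier.verify(token, resource);
    return valid
      ? { allow: true, status: 200, degraded: false }
      : { allow: false, status: 403, degraded: false };
  } catch (error) {
    // Fail closed: "could not verify" is treated as "not authorized".
    return { allow: false, status: 403, degraded: true };
  }
}
```

There is no failMode parameter in this sketch on purpose: for this job the catch branch has only one acceptable outcome.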

Same code shape. Opposite default. The difference is entirely in what failure costs on each side of the door.

How to actually decide

The trap is treating this as a global switch — “our stack fails open” or “our stack fails closed” — set once and forgotten. It’s not global. It’s per piece of middleware, and you decide it by asking one question:

When I can’t check, which mistake is cheaper to make — letting a bad request through, or blocking a good one?

If letting a bad request through is cheap and recoverable (rate limits, feature flags for non-critical features, A/B bucketing, soft quotas) → fail open. Don’t let a guardrail’s outage become an outage.
If letting a bad request through is expensive and irreversible (authn/authz, payment gating, anything fronting a paid or rate-capped downstream you must not overrun, compliance boundaries) → fail closed. Eat the availability hit; it’s the cheaper mistake.
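
The rule above fits in a few lines. This is only a sketch: chooseFailMode and its two input names are invented for illustration, and the inputs are judgments a person makes per piece of middleware, not something the code can work out.

```javascript
// Hypothetical helper that encodes the question above. A wrongly allowed
// request must be both cheap and recoverable to justify failing open.
function chooseFailMode({ wrongAllowIsCheap, wrongAllowIsRecoverable }) {
  return wrongAllowIsCheap && wrongAllowIsRecoverable ? "open" : "closed";
}

// The examples from the text, classified with that rule.
const failModes = {
  rateLimit: chooseFailMode({ wrongAllowIsCheap: true, wrongAllowIsRecoverable: true }),
  abBucketing: chooseFailMode({ wrongAllowIsCheap: true, wrongAllowIsRecoverable: true }),
  authz: chooseFailMode({ wrongAllowIsCheap: false, wrongAllowIsRecoverable: false }),
  paymentGate: chooseFailMode({ wrongAllowIsCheap: false, wrongAllowIsRecoverable: false }),
};
```

Anything that is not clearly both cheap and recoverable lands on "closed" in this sketch, which treats an unclear case as the more expensive mistake.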

Two more rules that make this real in production:

Make the choice explicit in the code and the config, not implicit in a try/catch that someone wrote at 2 a.m. A reviewer should be able to read failMode: "open" and know a real decision was made. The most dangerous version of this is the one where nobody chose — the behavior is just whatever the framework does when an exception propagates.
Always instrument the degraded path. Fail-open that hides the outage is worse than fail-closed, because you never find out the store was down until the abuse bill arrives. Emit a metric, set a header, log it. “Allowed because I couldn’t check” and “allowed because you were under the limit” are different events and must be countable separately.
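
Both rules can be enforced mechanically. The sketch below assumes nothing about a particular framework: requireFailMode and createDecisionCounters are hypothetical names, and the in-memory counters stand in for whatever metrics pipeline a real deployment would emit to.

```javascript
// Rule one: refuse to build middleware whose fail mode was never chosen.
function requireFailMode(name, config) {
  if (config.failMode !== "open" && config.failMode !== "closed") {
    throw new Error(`${name}: failMode must be set explicitly to "open" or "closed"`);
  }
  return config.failMode;
}

// Rule two: count checked and degraded decisions under separate names.
function createDecisionCounters() {
  const counts = {
    allowed_checked: 0,
    denied_checked: 0,
    allowed_degraded: 0,
    denied_degraded: 0,
  };
  return {
    record(allow, degraded) {
      const verdict = allow ? "allowed" : "denied";
      const path = degraded ? "degraded" : "checked";
      counts[`${verdict}_${path}`] += 1;
    },
    snapshot() {
      return { ...counts };
    },
  };
}
```

With counters shaped like this, a rising allowed_degraded value is the signal that the store is down, even while every customer request is still succeeding.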

Why the edge makes this sharper

This decision exists in any middleware, but the edge raises the stakes. Edge code runs in front of everything, across every POP, on every request. A fail-closed rate limiter in a single backend service degrades one service. A fail-closed rate limiter at the edge, during a state-store incident, degrades your entire property, globally, at once — which is exactly the blast-radius lesson the whole industry learned from the big CDN outages of the last few years. When your code sits at the edge, “what happens when my dependency fails” isn’t an edge case. It’s the case.

Get it right and it’s invisible: the store blips, a dashboard lights up, an on-call gets paged, and no customer ever knows. Get it wrong and a bookkeeping outage becomes your incident review.