
Understanding and Solving Cache Stampede: The Invisible Threat to Databases


Imagine this:

Your high-traffic app is humming along smoothly. Most data requests are being served instantly from Redis or another blazing-fast cache layer. But then — a cache key expires, and within milliseconds, 10,000 clients hit your backend at once trying to fetch the same data from the database.

Boom — your production database is slammed, response times skyrocket, and the system becomes unresponsive.

You’ve just been trampled by a cache stampede.

What is a Cache Stampede?

A cache stampede (also called a cache miss storm) occurs when:

  1. A popular cache key expires

  2. Many clients attempt to read the same key

  3. All of them get a cache miss

  4. All of them bypass the cache simultaneously

  5. They hit the backend/database at once

  6. Overload happens — especially in high-concurrency environments

Even if your cache hit rate is 99%, a stampede on 1% of traffic can collapse the system.

Why Cache Stampedes Happen

Caches are typically built with a cache-aside pattern, where:

  • You check the cache first

  • If the key is missing, you recompute or fetch from the DB

  • Then store it back into the cache

This works great for individual requests… but under heavy concurrency, when many requests miss the same key, all of them go through this pattern at the same time:

def get_data(key):
    data = redis.get(key)            # 1. check the cache first
    if data is None:                 # 2. cache miss ("is None", so empty values still count as hits)
        data = query_postgres()      #    recompute or fetch from the DB
        redis.set(key, data, ex=60)  # 3. write it back with a 60s TTL
    return data

Without protection, this causes thousands of concurrent query_postgres() calls.
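The effect is easy to reproduce locally. The toy simulation below (a plain dict stands in for Redis, and `query_postgres_fake` stands in for a slow Postgres query) counts how many concurrent readers actually reach the "database" when they all miss the same key:

```python
import threading
import time

cache = {}        # a dict stands in for Redis
db_calls = 0      # how many requests actually reached the "database"
counter_lock = threading.Lock()

def query_postgres_fake():
    """Stand-in for an expensive Postgres query (~200 ms)."""
    global db_calls
    with counter_lock:
        db_calls += 1
    time.sleep(0.2)
    return "expensive result"

def get_data(key):
    # Unprotected cache-aside: every concurrent miss goes to the DB.
    data = cache.get(key)
    if data is None:
        data = query_postgres_fake()
        cache[key] = data
    return data

# 100 clients arrive just after the key expired
threads = [threading.Thread(target=get_data, args=("hot",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db_calls)   # close to 100: one DB hit per client, not one in total
```

Every client sees the miss before the first rebuild finishes, so each one independently queries the database — exactly the stampede described above.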

Real-world Impact

A cache stampede causes:

  • Sudden spikes in DB traffic (often 10x to 100x)

  • Connection pool exhaustion

  • Slow queries, timeouts, or even crashes

  • Denial of service for downstream services

  • Resource contention across your stack

When Does This Happen?

  • After a cache TTL expires

  • After a cache eviction (due to memory pressure)

  • During cold starts or deployment rollouts

  • When there’s only one shared cache key for a popular item (e.g., homepage data)

How to Mitigate Cache Stampedes

There are a few ways to mitigate cache stampedes.

1. Request Coalescing / Single-flight Pattern

Let only one request rebuild the cache, while others wait for it to finish.

Concept:

  • First request takes a lock and fetches the data from DB

  • Others wait for that request to populate the cache
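The concept above can be sketched in-process with a `threading.Lock` plus a double-checked cache read. (This is a minimal single-machine sketch; across multiple servers you would typically use a distributed lock instead, e.g. Redis `SET key value NX EX ttl`.)

```python
import threading
import time

cache = {}
db_calls = 0
rebuild_lock = threading.Lock()   # per-process; swap for a distributed lock across servers

def query_db():
    """Stand-in for the real Postgres query."""
    global db_calls
    db_calls += 1                  # only ever called while holding rebuild_lock
    time.sleep(0.2)                # simulate a slow query
    return "fresh value"

def get_data(key):
    data = cache.get(key)
    if data is not None:
        return data
    with rebuild_lock:             # only one request rebuilds at a time...
        data = cache.get(key)      # ...everyone else re-checks after waiting
        if data is None:
            data = query_db()
            cache[key] = data
    return data

threads = [threading.Thread(target=get_data, args=("hot",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db_calls)   # 1: a single rebuild served all 100 waiting clients
```

The second `cache.get` inside the lock is the key detail: threads that waited for the lock find the freshly written value and return it instead of querying the database again.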

2. Add Jitter (Randomized Expiry)

Avoid simultaneous cache expiry by randomizing TTLs.

ttl = random.randint(50, 70)   # 60s base TTL with up to ±10s of jitter (needs "import random")
redis.set(key, data, ex=ttl)

  • Prevents a "herd" of keys expiring together

  • Works best in high-concurrency apps with shared TTLs

3. Serve Stale Data While Rebuilding

Don’t block the user when the cache expires.

Instead:

  • Return the stale value

  • Trigger a background refresh

This is known as “stale-while-revalidate”.

Implementation Approach:

  • Store TTL metadata separately

  • If TTL is expired, serve stale data and refresh in background thread

entry = redis.hgetall(key)                        # value + soft expiry stored together
if float(entry["soft_expiry"]) < time.time():     # soft TTL passed, but key still in Redis
    async_refresh(key)                            # rebuild in a background worker
return entry["value"]                             # serve the (possibly stale) value now

Note: the key's real Redis TTL must be longer than the soft expiry, otherwise the stale value is gone before you can serve it.
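The steps above can be made concrete with a small runnable sketch. Here a dict stands in for Redis, the soft expiry is stored alongside the value, and a background thread plays the role of the refresh worker:

```python
import threading
import time

cache = {}            # key -> (value, soft_expiry); a dict stands in for Redis
refreshing = set()    # keys that already have a refresh in flight
SOFT_TTL = 60

def rebuild(key):
    # Stand-in for the real DB query; in production this runs in a worker.
    cache[key] = (f"fresh:{key}", time.time() + SOFT_TTL)
    refreshing.discard(key)

def get_data(key):
    value, soft_expiry = cache[key]
    if time.time() >= soft_expiry and key not in refreshing:
        refreshing.add(key)                                    # dedupe refreshes
        threading.Thread(target=rebuild, args=(key,)).start()  # refresh in background
    return value                                               # never block the reader

# Seed an already-stale entry, then read it
cache["home"] = ("stale:home", time.time() - 1)
print(get_data("home"))   # returns "stale:home" immediately; the refresh runs behind it
```

The reader always gets an answer instantly; the `refreshing` set ensures that a burst of stale reads triggers only one rebuild, combining this technique with the single-flight idea from above.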

4. Proactive Cache Warming / Preload

Use background jobs to preload popular cache keys before they expire.

Example:

hot_keys = ['homepage', 'top_products', 'user_analytics']

for key in hot_keys:
    data = query_postgres(key)    # recompute each hot key before it expires
    redis.set(key, data, ex=60)

  • Keeps your most important data hot

  • Ideal for dashboards, trending items, etc.

Use schedulers like cron, Celery beat, Sidekiq, etc.

5. Use Multi-Level Cache (L1 + L2)

Layered cache architecture:

  • L1 cache: In-process or in-memory (e.g. Python LRUCache)

  • L2 cache: Redis / Memcached

  • L3: Database

Each layer reduces pressure on the next.
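A minimal sketch of the fall-through logic, assuming a short-TTL in-process dict as L1 and another dict standing in for Redis as L2 (in production you would swap L2 for a real Redis client):

```python
import time

L1 = {}          # in-process cache: key -> (value, expiry)
L2 = {}          # stands in for Redis; swap for a real client in production
L1_TTL = 5       # keep L1 short so processes don't serve old data for long

def query_db(key):
    """Stand-in for the Postgres query (L3, the last resort)."""
    return f"db:{key}"

def get_data(key):
    hit = L1.get(key)
    if hit and hit[1] > time.time():     # L1 hit: no network call at all
        return hit[0]
    value = L2.get(key)                  # L2 hit: one Redis round trip
    if value is None:
        value = query_db(key)            # L3 miss: fall through to the database
        L2[key] = value
    L1[key] = (value, time.time() + L1_TTL)
    return value

print(get_data("homepage"))  # first call falls through to the DB
print(get_data("homepage"))  # second call is served from L1
```

Because most repeat reads are absorbed by L1, even an L2 (Redis) expiry only exposes the database to one miss per process rather than one per request.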

6. Batch Writes (if cache miss triggers DB writes)

If the cache miss causes multiple writes, use queues like Kafka or Redis Streams to batch and smooth load.
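A sketch of the batching idea using Python's standard `queue.Queue` as a stand-in for Kafka or Redis Streams, with `bulk_insert` playing the role of a single multi-row INSERT:

```python
import queue

write_queue = queue.Queue()     # stands in for Kafka or Redis Streams

def record_miss(key):
    # Producers enqueue the write instead of hitting Postgres directly.
    write_queue.put(key)

def flush_batch(max_size=100):
    """Drain up to max_size queued writes into one bulk DB operation."""
    batch = []
    while len(batch) < max_size and not write_queue.empty():
        batch.append(write_queue.get())
    if batch:
        bulk_insert(batch)      # one multi-row INSERT instead of N single ones

writes = []
bulk_insert = writes.append     # stand-in for the real bulk write

for i in range(250):            # 250 misses arrive in a burst...
    record_miss(f"key:{i}")
for _ in range(3):              # ...a scheduler (cron, Celery beat) drains them
    flush_batch()

print([len(b) for b in writes])  # [100, 100, 50] -- three batched writes
```

The database sees three smooth, bounded writes instead of 250 individual ones, which is exactly the load-smoothing effect a stream or queue provides.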

PostgreSQL Tips for Surviving Stampedes

If it still happens:

  1. Use PgBouncer: Reduces connection overhead

  2. Add read replicas: Distribute SELECT load

  3. Materialized Views: Precompute expensive queries

  4. Analyze slow queries: Use EXPLAIN ANALYZE + indexes

  5. Use rate-limiting middleware: Prevent DoS

Conclusion

Cache stampedes are easy to miss in staging, but devastating in production. You don’t need a DDoS to bring down your system — just a single cache key expiring under high load.

The fix isn’t just “increase cache TTL” — it’s about smart architecture:

  • Let only one request rebuild

  • Serve stale when you can

  • Add randomness to TTL

  • Use background warming
