Understanding and Solving Cache Stampede: The Invisible Threat to Databases
Imagine this:
Your high-traffic app is humming along smoothly. Most data requests are served instantly from Redis or another blazing-fast cache layer. But then a cache key expires, and within milliseconds, 10,000 clients hit your backend at once, all trying to fetch the same data from the database.
Boom — your production database is slammed, response times skyrocket, and the system becomes unresponsive.
You’ve just been trampled by a cache stampede.
What is a Cache Stampede?
A cache stampede (also called a cache miss storm) occurs when:
A popular cache key expires
Many clients attempt to read the same key
All of them get a cache miss
All of them bypass the cache simultaneously
They hit the backend/database at once
Overload happens — especially in high-concurrency environments
Even if your cache hit rate is 99%, a stampede on 1% of traffic can collapse the system.
Why Cache Stampedes Happen
Caches are typically built with a cache-aside pattern, where:
You check the cache first
If the key is missing, you recompute or fetch from the DB
Then store it back into the cache
This works great for individual requests… but under heavy concurrency, when many requests miss the same key, all of them go through this pattern at the same time:
```python
def get_data(key):
    data = redis.get(key)
    if not data:
        data = query_postgres()      # cache miss: hit the database
        redis.set(key, data, ex=60)  # repopulate with a 60s TTL
    return data
```
Without protection, this causes 1000s of concurrent query_postgres() calls.
Real-world Impact
A cache stampede causes:
Sudden spikes in DB traffic (often 10x to 100x)
Connection pool exhaustion
Slow queries, timeouts, or even crashes
Denial of service for downstream services
Resource contention across your stack
When Does This Happen?
After a cache TTL expires
After a cache eviction (due to memory pressure)
During cold starts or deployment rollouts
When there’s only one shared cache key for a popular item (e.g., homepage data)
How to Mitigate Cache Stampedes
There are several ways to mitigate cache stampedes.
1. Request Coalescing / Single-flight Pattern
Let only one request rebuild the cache, while others wait for it to finish.
Concept:
First request takes a lock and fetches the data from DB
Others wait for that request to populate the cache
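The idea above can be sketched in-process with a small single-flight helper. This is an illustrative implementation, not a library API: with a shared cache like Redis, the equivalent is an atomic `SET key NX EX` lock where the winner rebuilds and the other callers poll or subscribe. (Error propagation to waiters is omitted for brevity.)

```python
import threading

class SingleFlight:
    """Coalesce concurrent rebuilds of the same key into one execution."""

    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fn):
        with self._mu:
            entry = self._inflight.get(key)
            if entry is None:
                # First caller for this key becomes the leader and runs fn().
                entry = (threading.Event(), {})
                self._inflight[key] = entry
                leader = True
            else:
                leader = False
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()
            finally:
                with self._mu:
                    self._inflight.pop(key, None)
                done.set()
            return holder["value"]
        # Followers block until the leader has populated the result.
        done.wait()
        return holder["value"]
```

With this in place, a thousand concurrent misses on the same key trigger one `query_postgres()` call instead of a thousand.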
2. Add Jitter (Randomized Expiry)
Avoid simultaneous cache expiry by randomizing TTLs.
```python
import random

ttl = random.randint(50, 70)  # 60s base TTL with +/-10s of jitter
redis.set(key, data, ex=ttl)
```
Prevents a "herd" of keys expiring together
Works best in high-concurrency apps with shared TTLs
3. Serve Stale Data While Rebuilding
Don’t block the user when the cache expires.
Instead:
Return the stale value
Trigger a background refresh
This is known as “stale-while-revalidate”.
Implementation Approach:
Store TTL metadata separately
If TTL is expired, serve stale data and refresh in background thread
```python
# The value key never expires; a logical expiry timestamp lives beside it
if time.time() > float(redis.get(f"{key}:expires_at") or 0):
    async_refresh(key)   # rebuild in the background; don't block this request
return redis.get(key)    # serve the (possibly stale) value immediately
```
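A self-contained sketch of the whole pattern, with a plain dict standing in for Redis and `rebuild` standing in for the database query (both names are illustrative); a set guards against launching duplicate refreshes for the same key:

```python
import threading
import time

_store = {}          # stand-in for Redis: key -> (value, logical_expiry_ts)
_refreshing = set()  # keys with a background rebuild already in flight
_mu = threading.Lock()

def get_swr(key, rebuild, ttl=60):
    """Stale-while-revalidate: never block a request on a cache rebuild."""
    value, expires_at = _store.get(key, (None, 0.0))
    if value is None:
        # Cold miss: nothing stale to serve, so fetch synchronously once.
        value = rebuild()
        _store[key] = (value, time.time() + ttl)
        return value
    if time.time() > expires_at:
        with _mu:
            refresh_needed = key not in _refreshing
            _refreshing.add(key)
        if refresh_needed:
            def _refresh():
                try:
                    _store[key] = (rebuild(), time.time() + ttl)
                finally:
                    with _mu:
                        _refreshing.discard(key)
            # Kick off the rebuild without blocking the caller.
            threading.Thread(target=_refresh, daemon=True).start()
    return value  # possibly stale, but the user gets an instant response
```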
4. Proactive Cache Warming / Preload
Use background jobs to preload popular cache keys before they expire.
Example:
```python
hot_keys = ['homepage', 'top_products', 'user_analytics']
for key in hot_keys:
    data = query_postgres(key)   # recompute before the cached copy expires
    redis.set(key, data, ex=60)
```
Keeps your most important data hot
Ideal for dashboards, trending items, etc.
Use schedulers like cron, Celery beat, Sidekiq, etc.
5. Use Multi-Level Cache (L1 + L2)
Layered cache architecture:
L1 cache: In-process or in-memory (e.g. a Python LRUCache)
L2 cache: Redis / Memcached
L3: Database
Each layer reduces pressure on the next.
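A minimal sketch of that lookup path. The L2 is stubbed as a plain dict (it would be Redis in production) and `loader` stands in for the database query; the class and parameter names are illustrative:

```python
from collections import OrderedDict

class TwoLevelCache:
    """L1: small in-process LRU. L2: shared cache (dict here, Redis in prod).
    Misses fall through to `loader`, the stand-in for the database."""

    def __init__(self, loader, l1_size=128):
        self.l1 = OrderedDict()
        self.l1_size = l1_size
        self.l2 = {}          # stand-in for Redis / Memcached
        self.loader = loader  # stand-in for query_postgres

    def get(self, key):
        if key in self.l1:                # L1 hit: no network round trip
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:                # L2 hit: shields the database
            value = self.l2[key]
        else:                             # miss everywhere: hit the DB
            value = self.loader(key)
            self.l2[key] = value
        self.l1[key] = value
        if len(self.l1) > self.l1_size:   # evict the oldest L1 entry
            self.l1.popitem(last=False)
        return value
```

Even a tiny L1 absorbs repeated reads of hot keys within a single process, so a stampede that evicts or expires the L2 entry no longer translates one-to-one into database queries.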
6. Batch Writes (if cache miss triggers DB writes)
If the cache miss causes multiple writes, use queues like Kafka or Redis Streams to batch and smooth load.
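As a rough sketch of the batching idea, using an in-process queue in place of Kafka or a Redis Stream (the function and parameter names here are illustrative, and `flush(batch)` stands in for one bulk INSERT):

```python
import queue
import threading

def start_batcher(flush, batch_size=100, idle_timeout=0.25):
    """Background worker that coalesces individual writes into bulk flushes.

    Producers call q.put(item); putting None shuts the worker down.
    """
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()                      # block for the first write
            if item is None:
                return
            batch = [item]
            try:
                while len(batch) < batch_size:  # gather more writes briefly
                    item = q.get(timeout=idle_timeout)
                    if item is None:
                        flush(batch)            # flush what we have, then stop
                        return
                    batch.append(item)
            except queue.Empty:
                pass
            flush(batch)                        # one bulk write for N requests

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t
```

The database then sees a handful of bulk writes per second instead of one write per request, which smooths the spike a stampede would otherwise produce.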
PostgreSQL Tips for Surviving Stampedes
If it still happens:
Use PgBouncer: Reduces connection overhead
Add read replicas: Distribute SELECT load
Materialized Views: Precompute expensive queries
Analyze slow queries: Use EXPLAIN ANALYZE + indexes
Use rate-limiting middleware: Prevent DoS
Conclusion
Cache stampedes are easy to miss in staging, but devastating in production. You don’t need a DDoS to bring down your system — just a single cache key expiring under high load.
The fix isn’t just “increase cache TTL” — it’s about smart architecture:
Let only one request rebuild
Serve stale when you can
Add randomness to TTL
Use background warming