Caching interview questions

Table of Contents

Thundering herd problem

The thundering herd problem occurs when many clients or servers react to the same event simultaneously, causing a sudden spike in load on a downstream dependency. One of the most common causes is cache expiration.

Consider a cache entry with a fixed TTL of one hour. While the cache is valid, requests are served directly from memory and the database receives little or no traffic for that key. However, once the TTL expires, every server that needs the value will observe a cache miss at roughly the same time.

Imagine a fleet of 50 application servers. If the cached value expires, each server may independently issue a query to rebuild the cache. Instead of one database query, the system suddenly generates 50 identical queries. If the cached data is used by hundreds or thousands of incoming requests, the situation becomes even worse. Before any server manages to rebuild and repopulate the cache, thousands of requests may reach the database for the exact same piece of data.

Databases are typically provisioned for steady-state traffic rather than large synchronized bursts. Although the system may comfortably handle 10,000 requests per second under normal conditions, a sudden spike of identical requests can exhaust database resources, increase latency, and trigger request timeouts.

Timeouts often make the situation worse. Many retry mechanisms use the same retry interval for every client. If thousands of requests fail simultaneously and all retry after the same delay, they create a second synchronized wave of traffic. This feedback loop can prolong outages and significantly increase recovery time.

This phenomenon is known as the thundering herd problem, and in caching systems it is often referred to as a cache stampede. The core issue is not the overall request volume, but rather the fact that many requests become correlated and arrive at the same moment because they were all triggered by a single event, such as cache expiration.

Request coalescing (singleflight)

Request coalescing ensures that only one request performs the expensive operation required to rebuild a missing cache entry. When multiple requests encounter the same cache miss, the first request fetches the data while the remaining requests wait for the result. Once the cache has been populated, all waiting requests use the newly cached value. This prevents duplicate work and dramatically reduces load on the database.

Distributed locking

In a distributed environment, request coalescing must work across multiple servers. A common solution is distributed locking. One server acquires a lock, rebuilds the cache entry, and then releases the lock when the update is complete. Other servers either wait for the lock to be released or temporarily serve stale data while the cache refresh is in progress.

Facebook introduced a lease mechanism in Memcache to prevent large numbers of servers from simultaneously regenerating the same cache entry. The idea is similar to distributed locking, where only one server is allowed to repopulate the cache while others wait or continue serving stale data.

Jittered retries

Retry logic can unintentionally create another thundering herd. If 100 clients all fail at the same time and each retries exactly two seconds later, they will generate another synchronized burst of traffic.

To avoid this, systems introduce random jitter into the retry delay. Instead of retrying at a fixed interval, each client waits for a randomly chosen amount of time. This spreads retries over a larger time window and reduces the likelihood of another traffic spike.

1delay = random(0, min(cap, base * 2^attempt))

In this formula, base * 2^attempt implements exponential backoff, capdefines the maximum retry delay, and random(...)ensures that retries are distributed across time rather than occurring at the same instant.

Modern distributed systems often combine request coalescing, distributed locking, stale cache serving, and jittered retries to mitigate thundering herd scenarios. Together, these techniques help maintain system stability during cache expirations, traffic spikes, and partial service failures.