
Rate Limiting and Circuit Breakers

How rate limiting algorithms like token bucket and sliding window protect services from overload, and how circuit breakers prevent cascading failures by failing fast when downstream services are unhealthy.

2021-02-17
System Design · rate-limiting · circuit-breakers

Terminology

  • Rate limiting: controlling the number of requests a client or service can make within a given time window to prevent overload and ensure fair usage
  • Token bucket: a rate limiting algorithm that maintains a bucket of tokens refilled at a steady rate; each request consumes a token, and requests are rejected when the bucket is empty
  • Leaky bucket: a rate limiting algorithm that processes requests at a fixed rate regardless of burst patterns; excess requests queue up and overflow when the queue is full
  • Fixed window counter: a rate limiting algorithm that counts requests in fixed time intervals (e.g., per minute) and rejects requests once the count exceeds the limit
  • Sliding window log: a rate limiting algorithm that tracks the timestamp of each request and counts how many fall within a rolling time window
  • Sliding window counter: a hybrid algorithm that combines fixed window counts with a weighted overlap calculation to approximate a true sliding window
  • Circuit breaker: a stability pattern that monitors failures to a downstream service and stops sending requests when failures exceed a threshold, allowing the downstream to recover
  • Closed state: the normal operating state of a circuit breaker where requests flow through to the downstream service
  • Open state: the tripped state where the circuit breaker rejects all requests immediately without calling the downstream service
  • Half-open state: a probe state where the circuit breaker allows a limited number of test requests through to check if the downstream has recovered
  • Backoff: a strategy where retry intervals increase after each failure, typically exponentially, to avoid overwhelming a recovering service
  • Throttling: reducing the rate of operations rather than rejecting them outright, often by queuing or delaying requests
  • Bulkhead: an isolation pattern that limits the resources (threads, connections) allocated to a particular dependency so that a failure in one does not exhaust resources for others

What & Why

Every service has a capacity limit. A web server can handle a certain number of concurrent connections. A database can process a certain number of queries per second. When traffic exceeds these limits, performance degrades for everyone: response times spike, errors increase, and the service may crash entirely.

Rate limiting protects services by capping how many requests a client can make in a given time period. An API that allows 100 requests per minute per user ensures that one aggressive client cannot monopolize server resources. Rate limiting also defends against denial-of-service attacks, prevents runaway scripts from overwhelming backends, and enforces usage tiers in paid APIs.

The choice of algorithm matters. A fixed window counter is simple but allows bursts at window boundaries. A sliding window provides smoother enforcement. A token bucket allows controlled bursts while maintaining a long-term average rate. Each algorithm trades off between implementation complexity, memory usage, and burst tolerance.

Circuit breakers solve a different but related problem: cascading failures. When Service A calls Service B and Service B is slow or down, Service A's threads block waiting for responses. If Service A has limited threads, they all get consumed waiting for the failing Service B, and Service A itself becomes unresponsive. Now Service C, which depends on Service A, also fails. The failure cascades through the system.

A circuit breaker monitors the error rate of calls to a downstream service. When failures exceed a threshold, the circuit "opens" and all subsequent requests fail immediately without attempting the call. This frees up threads, gives the downstream time to recover, and returns fast errors to callers instead of slow timeouts. After a cooldown period, the circuit enters a "half-open" state and allows a few test requests through. If they succeed, the circuit closes and normal traffic resumes.

How It Works

Token Bucket

[Figure: a token bucket with capacity 5, refilled at r tokens/sec. Each request consumes one token: allowed while tokens > 0, rejected when tokens = 0. The bucket allows bursts up to capacity, then enforces the refill rate as the sustained limit.]

The token bucket starts full with b tokens (the burst capacity). Tokens are added at a rate of r tokens per second. Each request consumes one token. If the bucket has tokens, the request is allowed. If the bucket is empty, the request is rejected (or queued). The bucket never holds more than b tokens, so excess refills are discarded.

This allows short bursts of up to b requests while enforcing a long-term average rate of r requests per second. It is the most widely used rate limiting algorithm because it handles bursty traffic gracefully.
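As a minimal sketch of the refill-then-consume logic (class and parameter names are my own; the clock is injected so the behavior is deterministic in tests):

```python
import time

class TokenBucket:
    """Token bucket rate limiter: bursts up to `capacity`, sustained `rate` tokens/sec."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # bucket starts full
        self.clock = clock
        self.last_refill = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Refilling lazily on each call, rather than with a background timer, keeps the limiter O(1) per request and per client.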

Sliding Window Counter

The fixed window counter divides time into fixed intervals (e.g., 1-minute windows) and counts requests per window. The problem: a client can send the full limit at the end of one window and the full limit at the start of the next, effectively doubling the rate at the boundary.

The sliding window counter fixes this by weighting the previous window's count by the fraction of it that still overlaps the rolling window. If the current window is 30% elapsed, the previous window contributes 70% of its count:

$\text{count} = \text{current\_window\_count} + \text{previous\_window\_count} \times (1 - \text{elapsed\_fraction})$

This approximates a true sliding window with only two counters instead of storing every request timestamp.
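A compact sketch of this scheme (illustrative names; in practice the counter store would be something like Redis rather than an in-process dict):

```python
import math
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Approximates a sliding window using two counters per client."""

    def __init__(self, limit, window_size, clock=time.time):
        self.limit = limit              # max requests per window
        self.window_size = window_size  # window duration in seconds
        self.clock = clock
        self.counters = defaultdict(int)  # (client_id, window_index) -> count

    def allow(self, client_id):
        now = self.clock()
        current = math.floor(now / self.window_size)
        elapsed_fraction = (now % self.window_size) / self.window_size
        # Weight the previous window by how much of it still overlaps
        weighted = (self.counters[(client_id, current)]
                    + self.counters[(client_id, current - 1)] * (1 - elapsed_fraction))
        if weighted >= self.limit:
            return False
        self.counters[(client_id, current)] += 1
        return True
```

Only the current and previous window indices are ever read, so counters older than two windows can be evicted.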

Circuit Breaker States

The circuit breaker operates as a state machine with three states:

Closed (normal): requests flow through. The breaker tracks the failure rate over a rolling window. If the failure rate exceeds the threshold (e.g., 50% of the last 20 calls), the circuit opens.

Open (tripped): all requests fail immediately with a predefined error. No calls are made to the downstream service. After a timeout period (e.g., 30 seconds), the circuit transitions to half-open.

Half-open (probing): a limited number of test requests are allowed through. If they succeed, the circuit closes. If any fail, the circuit reopens and the timeout resets.
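The three-state machine can be sketched as follows (a simplified count-based breaker rather than a rolling error-rate window; names and defaults are illustrative):

```python
import time

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout=30.0, clock=time.monotonic):
        self.state = CLOSED
        self.failures = 0
        self.successes = 0
        self.failure_threshold = failure_threshold  # failures before opening
        self.success_threshold = success_threshold  # probes needed to close
        self.timeout = timeout                      # cooldown before half-open
        self.clock = clock
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = HALF_OPEN   # cooldown elapsed, allow a probe
            self.successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open on threshold
        if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        if self.state == HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = CLOSED
                self.failures = 0
        else:
            self.failures = 0   # reset on success while closed
```

Production implementations typically track a failure *rate* over a rolling window instead of a raw count, but the state transitions are the same.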

Complexity Analysis

Let $n$ be the number of requests tracked, $w$ be the window size, and $k$ be the number of clients.

| Algorithm | Time per Request | Space per Client |
| --- | --- | --- |
| Token Bucket | $O(1)$ | $O(1)$ |
| Leaky Bucket | $O(1)$ | $O(1)$ |
| Fixed Window Counter | $O(1)$ | $O(1)$ |
| Sliding Window Log | $O(\log n)$ | $O(n)$ |
| Sliding Window Counter | $O(1)$ | $O(1)$ |

For a distributed rate limiter serving $k$ clients, the total memory is:

$\text{Memory}_{\text{token bucket}} = k \times O(1) = O(k)$
$\text{Memory}_{\text{sliding log}} = k \times O(n) = O(k \cdot n)$

With millions of clients, the sliding window log becomes impractical. Token bucket and sliding window counter are preferred for their constant per-client memory.
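A back-of-envelope calculation makes the gap concrete (the per-entry sizes are assumptions: ~16 bytes of bucket state, 8-byte timestamps, a 100-request window):

```python
clients = 10_000_000

# Token bucket: one token count plus one last-refill timestamp (~16 bytes)
token_bucket_bytes = clients * 16

# Sliding window log: up to `limit` 8-byte timestamps per client
limit = 100
sliding_log_bytes = clients * limit * 8

print(token_bucket_bytes / 2**20, "MiB")  # roughly 150 MiB
print(sliding_log_bytes / 2**30, "GiB")   # roughly 7.5 GiB
```

Two orders of magnitude of memory for the same client population, before accounting for per-key overhead in the store.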

Implementation

ALGORITHM TokenBucket(clientId, buckets, rate, capacity)
INPUT: clientId, map of client buckets, refill rate, max capacity
OUTPUT: ALLOW or REJECT
BEGIN
  IF clientId NOT IN buckets THEN
    buckets[clientId] <- { tokens: capacity, lastRefill: NOW() }
  END IF

  bucket <- buckets[clientId]

  // Refill tokens based on elapsed time
  elapsed <- NOW() - bucket.lastRefill
  newTokens <- elapsed * rate
  bucket.tokens <- MIN(bucket.tokens + newTokens, capacity)
  bucket.lastRefill <- NOW()

  IF bucket.tokens >= 1 THEN
    bucket.tokens <- bucket.tokens - 1
    RETURN ALLOW
  ELSE
    RETURN REJECT
  END IF
END

ALGORITHM SlidingWindowCounter(clientId, counters, limit, windowSize)
INPUT: clientId, counter store, max requests per window, window duration
OUTPUT: ALLOW or REJECT
BEGIN
  currentWindow <- FLOOR(NOW() / windowSize)
  previousWindow <- currentWindow - 1
  elapsedFraction <- (NOW() MOD windowSize) / windowSize

  currentCount <- counters.GET(clientId, currentWindow) OR 0
  previousCount <- counters.GET(clientId, previousWindow) OR 0

  // Weight the previous window by how much of it still overlaps the rolling window
  weightedCount <- currentCount + previousCount * (1 - elapsedFraction)

  IF weightedCount >= limit THEN
    RETURN REJECT
  END IF

  counters.INCREMENT(clientId, currentWindow)
  RETURN ALLOW
END

ALGORITHM CircuitBreaker
STRUCTURE:
  state: CLOSED | OPEN | HALF_OPEN
  failureCount: integer
  successCount: integer
  failureThreshold: integer
  successThreshold: integer  // required successes in half-open to close
  timeout: duration
  lastFailureTime: timestamp

FUNCTION EXECUTE(request, downstreamService)
INPUT: outgoing request, target service
OUTPUT: response or circuit-open error
BEGIN
  IF state = OPEN THEN
    IF NOW() - lastFailureTime >= timeout THEN
      state <- HALF_OPEN
      successCount <- 0
    ELSE
      RETURN error "circuit open: failing fast"
    END IF
  END IF

  response <- CALL(downstreamService, request)

  IF response IS failure THEN
    RECORD_FAILURE()
    RETURN error "downstream call failed"
  ELSE
    RECORD_SUCCESS()
    RETURN response
  END IF
END

FUNCTION RECORD_FAILURE()
BEGIN
  failureCount <- failureCount + 1
  lastFailureTime <- NOW()

  IF state = HALF_OPEN THEN
    state <- OPEN  // probe failed, reopen
  ELSE IF failureCount >= failureThreshold THEN
    state <- OPEN
  END IF
END

FUNCTION RECORD_SUCCESS()
BEGIN
  IF state = HALF_OPEN THEN
    successCount <- successCount + 1
    IF successCount >= successThreshold THEN
      state <- CLOSED
      failureCount <- 0
    END IF
  ELSE
    failureCount <- 0  // reset on success in closed state
  END IF
END

ALGORITHM ExponentialBackoffRetry(request, service, maxRetries)
INPUT: request, target service, maximum retry attempts
OUTPUT: response or final error
BEGIN
  FOR attempt <- 0 TO maxRetries - 1 DO
    response <- CALL(service, request)

    IF response IS success THEN
      RETURN response
    END IF

    IF attempt < maxRetries - 1 THEN
      baseDelay <- 2^attempt * 100  // milliseconds
      jitter <- RANDOM(0, baseDelay)
      WAIT(baseDelay + jitter)
    END IF
  END FOR

  RETURN error "all retries exhausted"
END
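The same retry loop in runnable form (a sketch with full jitter; the `sleep` and `rng` hooks are my own additions so the delays can be observed in tests):

```python
import random
import time

def retry_with_backoff(call, max_retries=5, base_ms=100,
                       sleep=time.sleep, rng=random.uniform):
    """Retry `call` with exponential backoff plus full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # all retries exhausted: surface the last error
            delay_ms = (2 ** attempt) * base_ms  # 100, 200, 400, ... ms
            jitter_ms = rng(0, delay_ms)
            sleep((delay_ms + jitter_ms) / 1000.0)
```

In a real system this loop would sit behind the circuit breaker, so that retries stop entirely once the circuit opens.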


## Real-World Applications

<ul>
<li><strong>Public APIs</strong>: services like GitHub, Twitter, and Stripe enforce per-client rate limits (e.g., 5000 requests per hour) using token bucket algorithms, returning HTTP 429 responses when limits are exceeded</li>
<li><strong>API gateways</strong>: gateways (Kong, AWS API Gateway) apply rate limiting at the edge before requests reach backend services, protecting internal infrastructure from traffic spikes</li>
<li><strong>Microservice resilience</strong>: service meshes and libraries (Hystrix, resilience4j, Envoy) implement circuit breakers on every service-to-service call to prevent cascading failures across the dependency graph</li>
<li><strong>Login and authentication</strong>: rate limiting on login endpoints prevents brute-force password attacks by limiting the number of attempts per account or IP address within a time window</li>
<li><strong>Payment processing</strong>: payment gateways use circuit breakers to detect when a bank's API is degraded, failing fast and routing to backup processors instead of accumulating timeouts</li>
<li><strong>Cloud resource protection</strong>: cloud providers rate limit API calls (e.g., AWS throttling) to prevent any single tenant from consuming disproportionate control plane resources</li>
</ul>

## Key Takeaways

<ul>
<li>Token bucket is the most versatile rate limiting algorithm: it allows controlled bursts up to the bucket capacity while enforcing a steady long-term rate, with $O(1)$ time and space per client</li>
<li>Sliding window counter approximates a true sliding window using only two counters per client, avoiding the boundary-burst problem of fixed windows without the memory cost of logging every request</li>
<li>Circuit breakers prevent cascading failures by monitoring downstream error rates and failing fast when a service is unhealthy, giving it time to recover</li>
<li>The three circuit breaker states (closed, open, half-open) form a state machine that automatically recovers: closed handles normal traffic, open rejects everything, half-open probes for recovery</li>
<li>Exponential backoff with jitter prevents retry storms: each retry waits longer than the last, and random jitter spreads retries across time so recovering services are not hit by synchronized waves</li>
</ul>