Load Balancers: Distributing Traffic Across Servers
How load balancers distribute incoming requests across multiple servers using strategies like round robin, least connections, and consistent hashing, and why the choice between L4 and L7 matters.
Terminology
- Load balancer: a device or software component that distributes incoming network requests across a pool of backend servers to prevent any single server from becoming overwhelmed
- Backend server (upstream): one of the servers in the pool that actually processes client requests; also called an origin or upstream server
- Health check: a periodic probe (HTTP request, TCP connection, or ping) sent by the load balancer to each backend to verify it is alive and able to serve traffic
- Round robin: a scheduling algorithm that cycles through servers in order, sending each new request to the next server in the list
- Weighted round robin: a variant of round robin where servers with higher capacity receive a proportionally larger share of requests
- Least connections: a scheduling algorithm that routes each new request to the server currently handling the fewest active connections
- Consistent hashing: a hashing technique that maps both servers and requests onto a ring, minimizing remapping when servers are added or removed
- Hash ring: a circular number space (typically $0$ to $2^{32}-1$) used in consistent hashing where both server identifiers and request keys are placed
- Virtual node: a replica of a physical server placed at a different position on the hash ring to improve load distribution uniformity
- L4 load balancer: a load balancer operating at the transport layer (TCP/UDP); it routes based on IP addresses and port numbers without inspecting application data
- L7 load balancer: a load balancer operating at the application layer (HTTP/HTTPS); it can inspect headers, URLs, cookies, and message bodies to make routing decisions
- Sticky session (session affinity): a configuration where the load balancer routes all requests from the same client to the same backend server, typically using cookies or IP hashing
- TLS termination: the process of decrypting HTTPS traffic at the load balancer so that backend servers receive plain HTTP, offloading cryptographic work
- Failover: the automatic redirection of traffic away from a failed server to healthy servers in the pool
- Throughput: the number of requests a system can handle per unit of time, often measured in requests per second (RPS)
What & Why
Every popular web service eventually outgrows a single server. When one machine cannot handle all incoming requests, you add more servers. But clients only know one address. Something needs to sit between clients and servers, accept every incoming request, and decide which backend should handle it. That something is a load balancer.
A load balancer solves three problems at once. First, it increases capacity by spreading requests across multiple servers so no single machine is a bottleneck. Second, it improves availability: if one server crashes, the load balancer detects the failure through health checks and stops sending traffic to it, so new requests never reach the dead machine (requests already in flight to it may still fail). Third, it enables maintenance without downtime. You can take a server offline, update it, and bring it back while the others continue serving traffic.
The core question in load balancing is the routing strategy: how does the balancer choose which server gets the next request? Simple strategies like round robin work well when all servers are identical and requests are roughly equal in cost. More sophisticated strategies like least connections or consistent hashing handle uneven workloads and stateful services. The choice depends on your traffic patterns, whether requests carry session state, and how your backend is architected.
Load balancers also differ in the network layer at which they operate. An L4 balancer works at the TCP/UDP level and is extremely fast but blind to application content. An L7 balancer understands HTTP and can route based on URL paths, headers, or cookies, but it does more work per request. Many large deployments use both: an L4 balancer at the edge for raw throughput, and L7 balancers behind it for intelligent routing.
How It Works
Round Robin
The simplest strategy. The load balancer maintains a list of healthy servers and a counter. Each new request goes to the next server in the list, wrapping around to the beginning after the last server.
Weighted round robin extends this by assigning each server a weight proportional to its capacity. A server with weight 3 receives three times as many requests as a server with weight 1.
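Both variants can be expressed in a few lines of Python. This is an illustrative sketch, not tied to any particular balancer; the server names and weights are made up:

```python
from itertools import cycle

# Plain round robin: cycle through the server list in order.
servers = ["s1", "s2", "s3"]
rr = cycle(servers)
picks = [next(rr) for _ in range(6)]
# picks cycles s1, s2, s3, s1, s2, s3

# Weighted round robin via flat expansion: a server with weight w
# appears w times in the rotation.
weights = {"s1": 3, "s2": 1}
expanded = [s for s, w in weights.items() for _ in range(w)]
wrr = cycle(expanded)
window = [next(wrr) for _ in range(8)]
# over each full cycle, s1 receives 3x as many requests as s2
```

Note that flat expansion sends a server its whole share in a burst; the smooth weighted variant shown in the Implementation section's pseudocode interleaves picks more evenly.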
Least Connections
The load balancer tracks how many active connections each server currently holds. Each new request goes to the server with the fewest active connections. This naturally adapts to servers with different processing speeds: a fast server finishes requests quickly, its connection count drops, and it receives more work.
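A minimal Python sketch of the selection step follows; the connection counts are invented for illustration, and a real balancer would also decrement a server's count when a request completes:

```python
# Least-connections selection: route each new request to the server
# with the fewest in-flight requests.
active = {"s1": 4, "s2": 1, "s3": 2}

def pick_least_connections(active):
    server = min(active, key=active.get)  # fewest active connections
    active[server] += 1                   # the new request is now in flight
    return server

first = pick_least_connections(active)   # "s2": it had only 1 connection
```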
Consistent Hashing
Both servers and request keys are hashed onto a circular space (the hash ring) of size $2^{32}$. To route a request, hash its key and walk clockwise around the ring until you hit a server. When a server is added or removed, only the requests that map to the arc between the new/removed server and its predecessor are affected. With $n$ servers, adding or removing one remaps only $\approx 1/n$ of the keys.
Virtual nodes improve uniformity. Each physical server is placed at $v$ positions on the ring. With $v = 150$ virtual nodes per server, the load variance drops to roughly 5-10% even with a small number of physical servers.
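Putting the ring, virtual nodes, and clockwise lookup together, here is a sketch in Python. The `HashRing` class and the MD5-based 32-bit hash are illustrative choices, not a prescribed design:

```python
import bisect
import hashlib

def _hash32(key: str) -> int:
    # Stable 32-bit hash; Python's built-in hash() is salted per process,
    # so it cannot be used for a ring shared across restarts or machines.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, servers, vnodes=150):
        self._ring = []  # sorted list of (position, server)
        for s in servers:
            self.add(s, vnodes)

    def add(self, server, vnodes=150):
        # Place vnodes replicas of the server on the ring.
        for i in range(vnodes):
            pos = _hash32(f"{server}:{i}")
            bisect.insort(self._ring, (pos, server))

    def remove(self, server):
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def lookup(self, key):
        # Walk clockwise: first ring position >= the key's hash.
        pos = _hash32(key)
        i = bisect.bisect_left(self._ring, (pos, ""))
        if i == len(self._ring):  # past the largest position: wrap to first
            i = 0
        return self._ring[i][1]
```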
L4 vs L7 Load Balancing
An L4 load balancer sees only the TCP or UDP packet headers: source/destination IP and port. It makes a routing decision based on these four values, then forwards the raw packets (or rewrites addresses via NAT). Because it never parses the application payload, it is extremely fast and can handle millions of connections per second. The trade-off is that it cannot distinguish between an API call and a static file request.
An L7 load balancer terminates the TCP connection, parses the HTTP request, and can inspect the URL path, Host header, cookies, query parameters, and even the request body. This enables content-based routing: send /api/* requests to the API server pool, /static/* to a CDN origin, and WebSocket upgrades to a dedicated pool. L7 balancers can also perform TLS termination, request rewriting, rate limiting, and authentication. The cost is higher latency per request and lower maximum throughput compared to L4.
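To make these capabilities concrete, here is an illustrative NGINX configuration sketch combining content-based routing, least-connections balancing, weights, and TLS termination. Pool names, addresses, and certificate paths are hypothetical:

```nginx
# Illustrative L7 config; upstream names and hosts are hypothetical.
upstream api_pool {
    least_conn;                        # least-connections balancing
    server 10.0.0.11:8080 weight=3;    # larger instance gets 3x share
    server 10.0.0.12:8080;
}

upstream static_pool {
    server 10.0.1.21:8080;
    server 10.0.1.22:8080;
}

server {
    listen 443 ssl;                    # TLS terminates at the balancer
    ssl_certificate     /etc/ssl/example.crt;
    ssl_certificate_key /etc/ssl/example.key;

    location /api/ {
        proxy_pass http://api_pool;    # backends receive plain HTTP
    }
    location /static/ {
        proxy_pass http://static_pool;
    }
}
```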
Complexity Analysis
Let $n$ be the number of backend servers and $v$ be the number of virtual nodes per server in consistent hashing.
| Algorithm | Selection Time | Add/Remove Server | Space |
|---|---|---|---|
| Round Robin | $O(1)$ | $O(1)$ | $O(n)$ |
| Weighted Round Robin | $O(1)$ | $O(n)$ rebuild | $O(n)$ |
| Least Connections | $O(n)$ or $O(\log n)$ with heap | $O(\log n)$ | $O(n)$ |
| Consistent Hashing | $O(\log(n \cdot v))$ | $O(v \cdot \log(n \cdot v))$ | $O(n \cdot v)$ |
| IP Hash | $O(1)$ | $O(1)$ but remaps all keys | $O(n)$ |
When a server is added or removed with simple modular hashing ($\text{server} = \text{hash}(\text{key}) \bmod n$), the fraction of keys that must be remapped is:

$$\text{fraction remapped} \approx \frac{n-1}{n}$$

With consistent hashing, only the keys in the affected arc move:

$$\text{fraction remapped} \approx \frac{1}{n}$$
This is the key advantage of consistent hashing for stateful services like caches. Adding a server to a 10-node cluster remaps ~10% of keys instead of ~90%.
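The difference can be checked empirically. This sketch (hash choice, virtual-node count, and key count are arbitrary) measures the fraction of keys that change owner under each scheme when an 11th server joins a 10-server pool:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Stable 32-bit hash for both keys and ring positions.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def build_ring(servers, vnodes=100):
    # Each server appears at vnodes positions on the ring.
    return sorted((h(f"{s}:{i}"), s) for s in servers for i in range(vnodes))

def lookup(ring, key):
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]  # i == len(ring) wraps to the first node

keys = [f"key:{i}" for i in range(10_000)]
servers = [f"srv{i}" for i in range(10)]

# Modular hashing, 10 -> 11 servers: almost every key changes owner.
mod_moved = sum(h(k) % 10 != h(k) % 11 for k in keys)

# Consistent hashing: only keys in the new server's arcs move.
ring10 = build_ring(servers)
ring11 = build_ring(servers + ["srv10"])
ch_moved = sum(lookup(ring10, k) != lookup(ring11, k) for k in keys)

# mod_moved / len(keys) comes out near 0.9; ch_moved / len(keys) near 0.09
```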
Implementation
ALGORITHM RoundRobin(servers)
INPUT: servers: list of healthy backend servers
STATE: counter: integer, initialized to 0
OUTPUT: selected server
BEGIN
selected <- servers[counter MOD LENGTH(servers)]
counter <- counter + 1
RETURN selected
END
ALGORITHM WeightedRoundRobin(servers, weights)
INPUT: servers: list of servers, weights: list of integer weights
STATE: currentWeights: list of integers, initialized to copy of weights
OUTPUT: selected server
BEGIN
totalWeight <- SUM(weights)
bestIndex <- -1
bestWeight <- -INFINITY
FOR i <- 0 TO LENGTH(servers) - 1 DO
currentWeights[i] <- currentWeights[i] + weights[i]
IF currentWeights[i] > bestWeight THEN
bestWeight <- currentWeights[i]
bestIndex <- i
END IF
END FOR
currentWeights[bestIndex] <- currentWeights[bestIndex] - totalWeight
RETURN servers[bestIndex]
END
ALGORITHM LeastConnections(servers, activeConnections)
INPUT: servers: list of servers, activeConnections: map of server to connection count
OUTPUT: selected server
BEGIN
minConnections <- INFINITY
selected <- NIL
FOR EACH server IN servers DO
IF activeConnections[server] < minConnections THEN
minConnections <- activeConnections[server]
selected <- server
END IF
END FOR
activeConnections[selected] <- activeConnections[selected] + 1
RETURN selected
END
ALGORITHM ConsistentHashLookup(ring, key)
INPUT: ring: sorted list of (hashValue, serverID) pairs, key: request key
OUTPUT: selected server
BEGIN
keyHash <- HASH(key) MOD 2^32
// Binary search for the first ring position >= keyHash
low <- 0
high <- LENGTH(ring) - 1
result <- 0
WHILE low <= high DO
    mid <- FLOOR((low + high) / 2)
IF ring[mid].hashValue >= keyHash THEN
result <- mid
high <- mid - 1
ELSE
low <- mid + 1
END IF
END WHILE
// If keyHash is larger than all ring positions, wrap to first
IF low > LENGTH(ring) - 1 THEN
result <- 0
END IF
RETURN ring[result].serverID
END
ALGORITHM AddServerToRing(ring, serverID, virtualNodes)
INPUT: ring: sorted list of (hashValue, serverID), serverID: new server, virtualNodes: count
BEGIN
FOR i <- 0 TO virtualNodes - 1 DO
virtualKey <- serverID + ":" + TO_STRING(i)
hashValue <- HASH(virtualKey) MOD 2^32
INSERT (hashValue, serverID) INTO ring maintaining sorted order
END FOR
END
ALGORITHM HealthCheck(servers, healthySet, interval)
INPUT: servers: all known servers, healthySet: set of currently healthy servers, interval: check period
BEGIN
LOOP EVERY interval seconds DO
FOR EACH server IN servers DO
response <- SEND_PROBE(server)
IF response is successful THEN
ADD server TO healthySet
ELSE
REMOVE server FROM healthySet
END IF
END FOR
END LOOP
END
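Production balancers usually require several consecutive failures before evicting a server (HAProxy exposes this as its fall and rise counts), so a single dropped probe does not cause flapping. A sketch of that refinement in Python; the thresholds and function names are illustrative:

```python
# Require FALL consecutive failures before eviction and RISE consecutive
# successes before restoring a server to the healthy set.
FALL = 3
RISE = 2

def update_health(server, probe_ok, state, healthy):
    """Update one server's failure/success streaks and the healthy set.

    state maps server -> (consecutive_failures, consecutive_successes);
    healthy is the set of servers currently receiving traffic.
    """
    fails, oks = state.get(server, (0, 0))
    if probe_ok:
        fails, oks = 0, oks + 1
        if oks >= RISE:
            healthy.add(server)
    else:
        fails, oks = fails + 1, 0
        if fails >= FALL:
            healthy.discard(server)
    state[server] = (fails, oks)
```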
Real-World Applications
- Web applications: nearly every production web service places a load balancer (such as NGINX, HAProxy, or a cloud-managed balancer) in front of its application servers to distribute HTTP traffic and provide failover
- Distributed caches: systems like Memcached and Redis clusters use consistent hashing to route cache keys to specific nodes, ensuring that adding or removing a node only invalidates a small fraction of cached data
- Microservices: service meshes (Envoy, Linkerd) embed L7 load balancing into every service-to-service call, using least connections or weighted strategies to handle heterogeneous instance sizes
- DNS-based balancing: DNS servers return different IP addresses for the same domain name in round-robin order, providing a coarse form of global load distribution across data centers
- Database read replicas: applications route read queries across multiple database replicas using round robin or least connections, while directing all writes to the primary node
- CDN edge routing: content delivery networks use geographic and latency-based load balancing to route users to the nearest edge server, combining L4 and L7 techniques
Key Takeaways
- Load balancers distribute traffic across servers to increase capacity, improve availability, and enable zero-downtime maintenance
- Round robin is simple and effective when servers are identical and requests are uniform; weighted round robin handles heterogeneous server capacities
- Least connections adapts naturally to varying request costs and server speeds by always choosing the least-loaded server
- Consistent hashing minimizes key remapping when servers join or leave ($\approx 1/n$ keys affected vs. $\approx (n-1)/n$ with modular hashing), making it ideal for caches and stateful services
- L4 balancers are fast but blind to application content; L7 balancers enable content-based routing, TLS termination, and request manipulation at the cost of higher per-request overhead