Sharding and Partitioning: Splitting Data Across Nodes
How distributed systems split data across multiple nodes using hash-based and range-based partitioning, consistent hashing, and rebalancing strategies.
Terminology
- Sharding (partitioning): dividing a dataset into smaller, independent subsets called shards (or partitions), each stored on a different node, so that no single node must hold the entire dataset
- Shard (partition): a subset of the total data that lives on one node; each record belongs to exactly one shard
- Partition key (shard key): the field or combination of fields used to determine which shard a record belongs to
- Hash-based partitioning: assigning records to shards by computing a hash of the partition key and mapping the hash value to a shard
- Range-based partitioning: assigning records to shards based on contiguous ranges of the partition key (e.g., A-M on shard 1, N-Z on shard 2)
- Consistent hashing: a hashing scheme that maps both keys and nodes onto a ring, minimizing the number of keys that must move when nodes are added or removed
- Virtual node (vnode): a logical partition point on the hash ring; each physical node owns multiple vnodes to improve load distribution
- Hotspot: a shard that receives disproportionately more traffic than others, causing a performance bottleneck
- Rebalancing: the process of redistributing data across shards when nodes are added, removed, or when load becomes uneven
- Scatter-gather: a query pattern where a request is sent to all shards in parallel and the results are merged; necessary when the query cannot be routed to a single shard
- Cross-shard join: a join operation that requires data from multiple shards, which is expensive because it involves network communication between nodes
- Routing layer: a component (proxy, client library, or coordinator node) that determines which shard holds a given key and directs requests accordingly
- Skew: uneven distribution of data or load across shards, where some shards hold significantly more data or receive more requests than others
What & Why
Replication copies the same data to multiple nodes. Sharding does the opposite: it splits data so each node holds a different subset. While replication improves read throughput and fault tolerance, sharding is how you scale writes and total storage capacity beyond what a single machine can handle.
Consider a database with 10 TB of data. No single server has enough RAM to keep it all in memory, and a single disk cannot serve enough I/O operations per second. By splitting the data across 10 nodes (1 TB each), each node handles only a fraction of the total load. Reads and writes for a given key go to exactly one node, so throughput scales roughly linearly with the number of nodes.
The core challenge is choosing how to split the data. A bad partitioning strategy creates hotspots where one shard gets most of the traffic while others sit idle. The partition key must distribute data and load evenly. You also need a plan for what happens when you add or remove nodes: how much data moves, and can you do it without downtime?
How It Works
Hash-Based Partitioning
Compute a hash of the partition key and use modular arithmetic to assign it to a shard:
$\text{shard} = \text{hash}(\text{key}) \mod N$
This distributes keys uniformly across N shards (assuming a good hash function). The problem is that when N changes (adding or removing a node), nearly every key maps to a different shard. For N changing to N+1, approximately $\frac{N-1}{N}$ of all keys must be moved.
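This disruption is easy to measure empirically. A minimal sketch in Python (key names and shard counts are invented for illustration):

```python
# Measure how many keys change shards when growing from N to N+1
# shards under modular hashing. Key names and counts are arbitrary.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Python's built-in hash() is salted per process, so use a
    # stable hash instead.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % num_shards

keys = [f"user:{i}" for i in range(100_000)]
n = 10
moved = sum(1 for k in keys if shard_for(k, n) != shard_for(k, n + 1))
print(f"{moved / len(keys):.1%} of keys moved")  # roughly 90%
```

Almost every key lands on a different shard after the resize, which is why plain modular hashing is rarely used in systems that must grow.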
Range-Based Partitioning
Assign contiguous ranges of the key space to each shard. For example, with a timestamp key, shard 1 holds January-March, shard 2 holds April-June, and so on. This preserves key ordering, making range queries efficient (scan all records between two dates by querying a single shard or a small number of adjacent shards).
The downside is that ranges can become unbalanced. If most writes have recent timestamps, the shard holding the latest range becomes a hotspot. Range boundaries must be chosen carefully or adjusted dynamically.
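Routing a key under range partitioning reduces to a binary search over the sorted partition boundaries. A minimal sketch in Python (the date boundaries and shard names are illustrative):

```python
# Route a key to the shard whose range contains it, via binary search
# over sorted lower bounds. Boundaries and shard names are illustrative.
import bisect

boundaries = ["2024-01", "2024-04", "2024-07", "2024-10"]  # lower bounds
shard_ids = ["shard-1", "shard-2", "shard-3", "shard-4"]

def range_shard(key: str) -> str:
    # The owning shard is the last one whose lower bound is <= key;
    # keys below the first bound are clamped to the first shard.
    idx = bisect.bisect_right(boundaries, key) - 1
    return shard_ids[max(idx, 0)]

print(range_shard("2024-05-17"))  # shard-2 owns 2024-04 .. 2024-06
```

Because the boundaries are sorted, a range query between two dates touches only the shards whose ranges overlap the interval.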
Consistent Hashing
Consistent hashing maps both keys and nodes onto a circular hash space (ring) of size $2^{32}$. Each key is assigned to the first node encountered when walking clockwise from the key's position on the ring. When a node is added, only the keys in the arc between the new node and its predecessor need to move. When a node is removed, its keys move to the next node clockwise.
With N nodes, adding or removing one affects only $\approx 1/N$ of the keys, compared to $\approx (N-1)/N$ with modular hashing.
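The movement claim can be checked with a bare-bones ring. A minimal sketch in Python, without virtual nodes (node and key names are invented):

```python
# A bare-bones consistent-hash ring (no virtual nodes) showing that
# adding a node moves only a small fraction of keys. Names are arbitrary.
import bisect
import hashlib

def h32(s: str) -> int:
    # Stable 32-bit position on the ring.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(nodes):
    return sorted((h32(n), n) for n in nodes)

def lookup(ring, key):
    # First node clockwise from the key's position, wrapping at the end.
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, h32(key)) % len(ring)
    return ring[idx][1]

nodes = [f"node-{i}" for i in range(10)]
keys = [f"key-{i}" for i in range(50_000)]
before = build_ring(nodes)
after = build_ring(nodes + ["node-new"])
moved = sum(1 for k in keys if lookup(before, k) != lookup(after, k))
print(f"{moved / len(keys):.1%} of keys moved")
```

Without vnodes, the exact fraction depends on where the new node happens to land: its arc can be much larger or smaller than $1/N$, which is precisely the imbalance virtual nodes address.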
Virtual nodes improve balance. Each physical node is placed at $v$ positions on the ring. With $v = 256$ vnodes per node, the standard deviation of load per node drops significantly, even with a small number of physical nodes. Cassandra and DynamoDB both use vnodes.
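The effect of vnodes on balance can be simulated directly. A sketch in Python, assuming 10 physical nodes with 256 vnodes each (all names invented):

```python
# Simulate load balance on a consistent-hash ring with virtual nodes:
# 10 physical nodes x 256 vnodes. Node and key names are arbitrary.
import bisect
import hashlib
from collections import Counter

def h32(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

NODES, VNODES, KEYS = 10, 256, 100_000
ring = sorted((h32(f"node-{n}#vn-{v}"), f"node-{n}")
              for n in range(NODES) for v in range(VNODES))
positions = [pos for pos, _ in ring]

counts = Counter()
for i in range(KEYS):
    idx = bisect.bisect_right(positions, h32(f"key-{i}")) % len(ring)
    counts[ring[idx][1]] += 1

shares = [c / KEYS for c in counts.values()]
print(f"min {min(shares):.1%}, max {max(shares):.1%}")  # both near 10%
```

With 2560 points on the ring, each physical node's aggregate arc length averages out, so every node's share stays close to the ideal 10%.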
Rebalancing Strategies
Fixed number of partitions: Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes). Each node owns ~100 partitions. When a node is added, it takes partitions from existing nodes. When removed, its partitions are distributed to remaining nodes. The partition boundaries never change, only the assignment of partitions to nodes.
Dynamic partitioning: Start with a small number of partitions and split them when they grow too large (like B-tree page splits). Merge partitions that shrink below a threshold. This adapts to data volume automatically but adds complexity.
Proportional to nodes: Each node gets a fixed number of partitions. When a new node joins, it splits some existing partitions and takes half. This keeps the number of partitions proportional to the cluster size.
Complexity Analysis
Let $N$ be the number of nodes, $K$ the total number of keys, and $v$ the number of virtual nodes per physical node.
| Strategy | Key Lookup | Keys Moved on Rebalance | Space |
|---|---|---|---|
| Modular hash ($\text{key} \mod N$) | $O(1)$ | $\approx \frac{N-1}{N} \cdot K$ | $O(N)$ |
| Consistent hashing | $O(\log(N \cdot v))$ | $\approx \frac{K}{N}$ | $O(N \cdot v)$ |
| Range-based | $O(\log P)$ where $P$ = partitions | 1 partition per split/merge | $O(P)$ |
| Fixed partitions | $O(1)$ with partition map | $\approx P/N$ partitions per node change | $O(P)$ |
The load imbalance with consistent hashing shrinks as the number of virtual nodes grows: with $v$ vnodes per physical node, each node's share of the $K$ keys is approximately
$\frac{K}{N} \left(1 \pm O\!\left(\frac{1}{\sqrt{v}}\right)\right)$
With $v = 256$ and $N = 10$: each node handles approximately $10\% \pm 2\%$ of the total load, which is acceptable for most systems.
Scatter-gather query cost across all $N$ shards, where $T_i$ is the response time of shard $i$ and $T_{\text{merge}}$ the cost of combining results:
$T_{\text{total}} = \max_{1 \le i \le N} T_i + T_{\text{merge}}$
The total latency is dominated by the slowest shard, making scatter-gather queries sensitive to tail latency.
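A scatter-gather round can be sketched with a thread pool, using simulated shards with invented latencies and data:

```python
# Scatter-gather sketch: query every shard in parallel, merge the
# sorted partial results, apply the limit. Shards are simulated with
# invented latencies and data.
import heapq
import time
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "shard-1": {"latency": 0.05, "rows": [1, 4, 7]},
    "shard-2": {"latency": 0.20, "rows": [2, 5, 8]},
    "shard-3": {"latency": 0.10, "rows": [3, 6, 9]},
}

def query_shard(name):
    time.sleep(SHARDS[name]["latency"])  # simulate network + execution
    return SHARDS[name]["rows"]

def scatter_gather(limit):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = list(pool.map(query_shard, SHARDS))  # fan out in parallel
    # Each shard returns sorted rows, so a k-way merge keeps global order.
    return list(heapq.merge(*results))[:limit]

start = time.perf_counter()
top = scatter_gather(limit=4)
elapsed = time.perf_counter() - start
print(top)  # [1, 2, 3, 4]
print(f"{elapsed:.2f}s")  # close to the slowest shard (0.2s), not the 0.35s sum
```

The wall-clock time tracks the slowest shard rather than the sum of latencies, which is why one slow replica can drag down every scatter-gather query.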
Implementation
ALGORITHM HashPartitionLookup(key, numShards)
INPUT: key: partition key, numShards: total number of shards
OUTPUT: shard index
BEGIN
hashValue <- HASH(key)
shardIndex <- hashValue MOD numShards
RETURN shardIndex
END
ALGORITHM ConsistentHashLookup(ring, key)
INPUT: ring: sorted list of (position, nodeID) pairs, key: partition key
OUTPUT: responsible node
BEGIN
keyHash <- HASH(key) MOD 2^32
// Binary search for first ring position >= keyHash
low <- 0
high <- LENGTH(ring) - 1
result <- 0
WHILE low <= high DO
mid <- FLOOR((low + high) / 2)
IF ring[mid].position >= keyHash THEN
result <- mid
high <- mid - 1
ELSE
low <- mid + 1
END IF
END WHILE
// Wrap around if key is past the last node
IF low > LENGTH(ring) - 1 THEN
result <- 0
END IF
RETURN ring[result].nodeID
END
ALGORITHM RangePartitionLookup(key, boundaries)
INPUT: key: partition key, boundaries: sorted list of (lowerBound, shardID) pairs
OUTPUT: shard ID
BEGIN
// Binary search for the partition whose range contains the key
low <- 0
high <- LENGTH(boundaries) - 1
result <- 0
WHILE low <= high DO
mid <- FLOOR((low + high) / 2)
IF boundaries[mid].lowerBound <= key THEN
result <- mid
low <- mid + 1
ELSE
high <- mid - 1
END IF
END WHILE
RETURN boundaries[result].shardID
END
ALGORITHM RebalanceFixedPartitions(partitionMap, newNode, existingNodes)
INPUT: partitionMap: map of partitionID to nodeID, newNode: joining node, existingNodes: current nodes
OUTPUT: updated partitionMap
BEGIN
totalPartitions <- LENGTH(partitionMap)
targetPerNode <- FLOOR(totalPartitions / (LENGTH(existingNodes) + 1))
// Count partitions per node
nodeCounts <- empty map  // missing keys default to 0
FOR EACH (partID, nodeID) IN partitionMap DO
nodeCounts[nodeID] <- nodeCounts[nodeID] + 1
END FOR
// Move partitions from overloaded nodes to the new node
moved <- 0
FOR EACH (partID, nodeID) IN partitionMap DO
IF moved >= targetPerNode THEN
BREAK
END IF
IF nodeCounts[nodeID] > targetPerNode THEN
partitionMap[partID] <- newNode
nodeCounts[nodeID] <- nodeCounts[nodeID] - 1
moved <- moved + 1
END IF
END FOR
RETURN partitionMap
END
ALGORITHM ScatterGatherQuery(shards, query)
INPUT: shards: list of shard nodes, query: the query to execute
OUTPUT: merged results
BEGIN
responses <- empty list
// Send query to all shards in parallel
FOR EACH shard IN shards DO
ASYNC_SEND_QUERY(shard, query)
END FOR
// Collect responses
FOR EACH shard IN shards DO
response <- AWAIT_RESPONSE(shard)
APPEND response.results TO responses
END FOR
// Merge and sort results
merged <- MERGE_SORT(responses, query.orderBy)
IF query.hasLimit THEN
merged <- TAKE(merged, query.limit)
END IF
RETURN merged
END
Real-World Applications
- Apache Cassandra: uses consistent hashing with virtual nodes to distribute data across a cluster; the partition key determines which node owns each row, and vnodes simplify rebalancing when nodes join or leave
- Amazon DynamoDB: automatically partitions tables based on throughput and storage; uses consistent hashing internally and splits partitions when they exceed size or throughput limits
- MongoDB: supports both hash-based and range-based sharding; the shard key is chosen by the application developer, and the balancer process automatically migrates chunks between shards to maintain even distribution
- Google Spanner: uses range-based partitioning with automatic splitting and merging; ranges are organized hierarchically, and the system moves ranges between nodes to balance load
- Elasticsearch: divides each index into a fixed number of shards at creation time; each shard is a self-contained Lucene index that can be placed on any node in the cluster
- Vitess (YouTube): a sharding middleware for MySQL that adds horizontal scaling by routing queries to the correct shard based on a sharding key, without modifying the application's SQL
Key Takeaways
- Sharding splits data across nodes so that each node holds a subset; this is how you scale writes and total storage beyond a single machine
- Hash-based partitioning distributes keys uniformly but loses key ordering; range-based partitioning preserves ordering but risks hotspots on skewed data
- Consistent hashing minimizes data movement when nodes are added or removed ($\approx 1/N$ keys move vs. $\approx (N-1)/N$ with modular hashing)
- Virtual nodes improve load balance in consistent hashing by giving each physical node multiple positions on the ring
- Rebalancing strategies (fixed partitions, dynamic splitting, proportional to nodes) trade simplicity against adaptability
- Scatter-gather queries across all shards are expensive and latency-sensitive; choosing a good partition key that aligns with query patterns is critical for performance