Sharding and Partitioning: Splitting Data Across Nodes
How distributed systems split data across multiple nodes using hash-based and range-based partitioning, consistent hashing, and rebalancing strategies.
Terminology
- Sharding (partitioning): dividing a dataset into smaller, independent subsets called shards (or partitions), each stored on a different node, so that no single node must hold the entire dataset
- Shard (partition): a subset of the total data that lives on one node; each record belongs to exactly one shard
- Partition key (shard key): the field or combination of fields used to determine which shard a record belongs to
- Hash-based partitioning: assigning records to shards by computing a hash of the partition key and mapping the hash value to a shard
- Range-based partitioning: assigning records to shards based on contiguous ranges of the partition key (e.g., A-M on shard 1, N-Z on shard 2)
- Consistent hashing: a hashing scheme that maps both keys and nodes onto a ring, minimizing the number of keys that must move when nodes are added or removed
- Virtual node (vnode): a logical partition point on the hash ring; each physical node owns multiple vnodes to improve load distribution
- Hotspot: a shard that receives disproportionately more traffic than others, causing a performance bottleneck
- Rebalancing: the process of redistributing data across shards when nodes are added, removed, or when load becomes uneven
- Scatter-gather: a query pattern where a request is sent to all shards in parallel and the results are merged; necessary when the query cannot be routed to a single shard
- Cross-shard join: a join operation that requires data from multiple shards, which is expensive because it involves network communication between nodes
- Routing layer: a component (proxy, client library, or coordinator node) that determines which shard holds a given key and directs requests accordingly
- Skew: uneven distribution of data or load across shards, where some shards hold significantly more data or receive more requests than others
What & Why
Replication copies the same data to multiple nodes. Sharding does the opposite: it splits data so each node holds a different subset. While replication improves read throughput and fault tolerance, sharding is how you scale writes and total storage capacity beyond what a single machine can handle.
Consider a database with 10 TB of data. No single server has enough RAM to keep it all in memory, and a single disk cannot serve enough I/O operations per second. By splitting the data across 10 nodes (1 TB each), each node handles only a fraction of the total load. Reads and writes for a given key go to exactly one node, so throughput scales roughly linearly with the number of nodes.
The core challenge is choosing how to split the data. A bad partitioning strategy creates hotspots where one shard gets most of the traffic while others sit idle. The partition key must distribute data and load evenly. You also need a plan for what happens when you add or remove nodes: how much data moves, and can you do it without downtime?
How It Works
Hash-Based Partitioning
Compute a hash of the partition key and use modular arithmetic to assign it to a shard:
$\text{shard} = \text{hash}(\text{key}) \mod N$
This distributes keys uniformly across N shards (assuming a good hash function). The problem is that when N changes (adding or removing a node), nearly every key maps to a different shard. For N changing to N+1, approximately $\frac{N-1}{N}$ of all keys must be moved.
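This disruption is easy to measure empirically. A minimal sketch in Python (key names and shard counts are invented for illustration):

```python
# Measure how many keys change shards when growing from N to N+1
# shards under modular hashing. Key names and counts are arbitrary.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Python's built-in hash() is salted per process, so use a
    # stable hash instead.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % num_shards

keys = [f"user:{i}" for i in range(100_000)]
n = 10
moved = sum(1 for k in keys if shard_for(k, n) != shard_for(k, n + 1))
print(f"{moved / len(keys):.1%} of keys moved")  # roughly 90%
```

Almost every key lands on a different shard after the resize, which is why plain modular hashing is rarely used in systems that must grow.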
Range-Based Partitioning
Assign contiguous ranges of the key space to each shard. For example, with a timestamp key, shard 1 holds January-March, shard 2 holds April-June, and so on. This preserves key ordering, making range queries efficient (scan all records between two dates by querying a single shard or a small number of adjacent shards).
The downside is that ranges can become unbalanced. If most writes have recent timestamps, the shard holding the latest range becomes a hotspot. Range boundaries must be chosen carefully or adjusted dynamically.
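Routing a key under range partitioning reduces to a binary search over the sorted partition boundaries. A minimal sketch in Python (the date boundaries and shard names are illustrative):

```python
# Route a key to the shard whose range contains it, via binary search
# over sorted lower bounds. Boundaries and shard names are illustrative.
import bisect

boundaries = ["2024-01", "2024-04", "2024-07", "2024-10"]  # lower bounds
shard_ids = ["shard-1", "shard-2", "shard-3", "shard-4"]

def range_shard(key: str) -> str:
    # The owning shard is the last one whose lower bound is <= key;
    # keys below the first bound are clamped to the first shard.
    idx = bisect.bisect_right(boundaries, key) - 1
    return shard_ids[max(idx, 0)]

print(range_shard("2024-05-17"))  # shard-2 owns 2024-04 .. 2024-06
```

Because the boundaries are sorted, a range query between two dates touches only the shards whose ranges overlap the interval.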
Consistent Hashing
Consistent hashing maps both keys and nodes onto a circular hash space (ring) of size $2^{32}$. Each key is assigned to the first node encountered when walking clockwise from the key's position on the ring. When a node is added, only the keys in the arc between the new node and its predecessor need to move. When a node is removed, its keys move to the next node clockwise.
With N nodes, adding or removing one affects only $\approx 1/N$ of the keys, compared to $\approx (N-1)/N$ with modular hashing.
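The movement claim can be checked with a bare-bones ring. A minimal sketch in Python, without virtual nodes (node and key names are invented):

```python
# A bare-bones consistent-hash ring (no virtual nodes) showing that
# adding a node moves only a small fraction of keys. Names are arbitrary.
import bisect
import hashlib

def h32(s: str) -> int:
    # Stable 32-bit position on the ring.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def build_ring(nodes):
    return sorted((h32(n), n) for n in nodes)

def lookup(ring, key):
    # First node clockwise from the key's position, wrapping at the end.
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, h32(key)) % len(ring)
    return ring[idx][1]

nodes = [f"node-{i}" for i in range(10)]
keys = [f"key-{i}" for i in range(50_000)]
before = build_ring(nodes)
after = build_ring(nodes + ["node-new"])
moved = sum(1 for k in keys if lookup(before, k) != lookup(after, k))
print(f"{moved / len(keys):.1%} of keys moved")
```

Without vnodes, the exact fraction depends on where the new node happens to land: its arc can be much larger or smaller than $1/N$, which is precisely the imbalance virtual nodes address.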
Virtual nodes improve balance. Each physical node is placed at $v$ positions on the ring. With $v = 256$ vnodes per node, the standard deviation of load per node drops significantly, even with a small number of physical nodes. Cassandra and DynamoDB both use vnodes.
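The effect of vnodes on balance can be simulated directly. A sketch in Python, assuming 10 physical nodes with 256 vnodes each (all names invented):

```python
# Simulate load balance on a consistent-hash ring with virtual nodes:
# 10 physical nodes x 256 vnodes. Node and key names are arbitrary.
import bisect
import hashlib
from collections import Counter

def h32(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

NODES, VNODES, KEYS = 10, 256, 100_000
ring = sorted((h32(f"node-{n}#vn-{v}"), f"node-{n}")
              for n in range(NODES) for v in range(VNODES))
positions = [pos for pos, _ in ring]

counts = Counter()
for i in range(KEYS):
    idx = bisect.bisect_right(positions, h32(f"key-{i}")) % len(ring)
    counts[ring[idx][1]] += 1

shares = [c / KEYS for c in counts.values()]
print(f"min {min(shares):.1%}, max {max(shares):.1%}")  # both near 10%
```

With 2560 points on the ring, each physical node's aggregate arc length averages out, so every node's share stays close to the ideal 10%.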
Rebalancing Strategies
Fixed number of partitions: Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes). Each node owns ~100 partitions. When a node is added, it takes partitions from existing nodes. When removed, its partitions are distributed to remaining nodes. The partition boundaries never change, only the assignment of partitions to nodes.
Dynamic partitioning: Start with a small number of partitions and split them when they grow too large (like B-tree page splits). Merge partitions that shrink below a threshold. This adapts to data volume automatically but adds complexity.
Proportional to nodes: Each node gets a fixed number of partitions. When a new node joins, it splits some existing partitions and takes half. This keeps the number of partitions proportional to the cluster size.
Complexity Analysis
Let $N$ be the number of nodes, $K$ the total number of keys, and $v$ the number of virtual nodes per physical node.
| Strategy | Key Lookup | Keys Moved on Rebalance | Space |
|---|---|---|---|
| Modular hash ($\text{key} \mod N$) | $O(1)$ | $\approx \frac{N-1}{N} \cdot K$ | $O(N)$ |
| Consistent hashing | $O(\log(N \cdot v))$ | $\approx \frac{K}{N}$ | $O(N \cdot v)$ |
| Range-based | $O(\log P)$ where $P$ = partitions | 1 partition per split/merge | $O(P)$ |
| Fixed partitions | $O(1)$ with partition map | $\approx P/N$ partitions per node change | $O(P)$ |
The load imbalance with consistent hashing shrinks as the number of virtual nodes grows: with $v$ vnodes per physical node, each node's share of the $K$ keys is approximately
$\frac{K}{N} \left(1 \pm O\!\left(\frac{1}{\sqrt{v}}\right)\right)$
With $v = 256$ and $N = 10$: each node handles approximately $10\% \pm 2\%$ of the total load, which is acceptable for most systems.
Scatter-gather query cost across all $N$ shards, where $T_i$ is the response time of shard $i$ and $T_{\text{merge}}$ the cost of combining results:
$T_{\text{total}} = \max_{1 \le i \le N} T_i + T_{\text{merge}}$
The total latency is dominated by the slowest shard, making scatter-gather queries sensitive to tail latency.
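A scatter-gather round can be sketched with a thread pool, using simulated shards with invented latencies and data:

```python
# Scatter-gather sketch: query every shard in parallel, merge the
# sorted partial results, apply the limit. Shards are simulated with
# invented latencies and data.
import heapq
import time
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "shard-1": {"latency": 0.05, "rows": [1, 4, 7]},
    "shard-2": {"latency": 0.20, "rows": [2, 5, 8]},
    "shard-3": {"latency": 0.10, "rows": [3, 6, 9]},
}

def query_shard(name):
    time.sleep(SHARDS[name]["latency"])  # simulate network + execution
    return SHARDS[name]["rows"]

def scatter_gather(limit):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = list(pool.map(query_shard, SHARDS))  # fan out in parallel
    # Each shard returns sorted rows, so a k-way merge keeps global order.
    return list(heapq.merge(*results))[:limit]

start = time.perf_counter()
top = scatter_gather(limit=4)
elapsed = time.perf_counter() - start
print(top)  # [1, 2, 3, 4]
print(f"{elapsed:.2f}s")  # close to the slowest shard (0.2s), not the 0.35s sum
```

The wall-clock time tracks the slowest shard rather than the sum of latencies, which is why one slow replica can drag down every scatter-gather query.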
Implementation
ALGORITHM HashPartitionLookup(key, numShards)
INPUT: key: partition key, numShards: total number of shards
OUTPUT: shard index
BEGIN
hashValue <- HASH(key)
shardIndex <- hashValue MOD numShards
RETURN shardIndex
END
ALGORITHM ConsistentHashLookup(ring, key)
INPUT: ring: sorted list of (position, nodeID) pairs, key: partition key
OUTPUT: responsible node
BEGIN
keyHash <- HASH(key) MOD 2^32
// Binary search for first ring position >= keyHash
low <- 0
high <- LENGTH(ring) - 1
result <- 0
WHILE low <= high DO
mid <- FLOOR((low + high) / 2)
IF ring[mid].position >= keyHash THEN
result <- mid
high <- mid - 1
ELSE
low <- mid + 1
END IF
END WHILE
// Wrap around if key is past the last node
IF low > LENGTH(ring) - 1 THEN
result <- 0
END IF
RETURN ring[result].nodeID
END
ALGORITHM RangePartitionLookup(key, boundaries)
INPUT: key: partition key, boundaries: sorted list of (lowerBound, shardID) pairs
OUTPUT: shard ID
BEGIN
// Binary search for the partition whose range contains the key
low <- 0
high <- LENGTH(boundaries) - 1
result <- 0
WHILE low <= high DO
mid <- FLOOR((low + high) / 2)
IF boundaries[mid].lowerBound <= key THEN
result <- mid
low <- mid + 1
ELSE
high <- mid - 1
END IF
END WHILE
RETURN boundaries[result].shardID
END
ALGORITHM RebalanceFixedPartitions(partitionMap, newNode, existingNodes)
INPUT: partitionMap: map of partitionID to nodeID, newNode: joining node, existingNodes: current nodes
OUTPUT: updated partitionMap
BEGIN
totalPartitions <- LENGTH(partitionMap)
targetPerNode <- FLOOR(totalPartitions / (LENGTH(existingNodes) + 1))
// Count partitions per node
nodeCounts <- empty map  // missing keys default to 0
FOR EACH (partID, nodeID) IN partitionMap DO
nodeCounts[nodeID] <- nodeCounts[nodeID] + 1
END FOR
// Move partitions from overloaded nodes to the new node
moved <- 0
FOR EACH (partID, nodeID) IN partitionMap DO
IF moved >= targetPerNode THEN
BREAK
END IF
IF nodeCounts[nodeID] > targetPerNode THEN
partitionMap[partID] <- newNode
nodeCounts[nodeID] <- nodeCounts[nodeID] - 1
moved <- moved + 1
END IF
END FOR
RETURN partitionMap
END
ALGORITHM ScatterGatherQuery(shards, query)
INPUT: shards: list of shard nodes, query: the query to execute
OUTPUT: merged results
BEGIN
responses <- empty list
// Send query to all shards in parallel
FOR EACH shard IN shards DO
ASYNC_SEND_QUERY(shard, query)
END FOR
// Collect responses
FOR EACH shard IN shards DO
response <- AWAIT_RESPONSE(shard)
APPEND response.results TO responses
END FOR
// Merge and sort results
merged <- MERGE_SORT(responses, query.orderBy)
IF query.hasLimit THEN
merged <- TAKE(merged, query.limit)
END IF
RETURN merged
END
Real-World Applications
- Apache Cassandra: uses consistent hashing with virtual nodes to distribute data across a cluster; the partition key determines which node owns each row, and vnodes simplify rebalancing when nodes join or leave
- Amazon DynamoDB: automatically partitions tables based on throughput and storage; uses consistent hashing internally and splits partitions when they exceed size or throughput limits
- MongoDB: supports both hash-based and range-based sharding; the shard key is chosen by the application developer, and the balancer process automatically migrates chunks between shards to maintain even distribution
- Google Spanner: uses range-based partitioning with automatic splitting and merging; ranges are organized hierarchically, and the system moves ranges between nodes to balance load
- Elasticsearch: divides each index into a fixed number of shards at creation time; each shard is a self-contained Lucene index that can be placed on any node in the cluster
- Vitess (YouTube): a sharding middleware for MySQL that adds horizontal scaling by routing queries to the correct shard based on a sharding key, without modifying the application's SQL
Key Takeaways
- Sharding splits data across nodes so that each node holds a subset; this is how you scale writes and total storage beyond a single machine
- Hash-based partitioning distributes keys uniformly but loses key ordering; range-based partitioning preserves ordering but risks hotspots on skewed data
- Consistent hashing minimizes data movement when nodes are added or removed ($\approx 1/N$ keys move vs. $\approx (N-1)/N$ with modular hashing)
- Virtual nodes improve load balance in consistent hashing by giving each physical node multiple positions on the ring
- Rebalancing strategies (fixed partitions, dynamic splitting, proportional to nodes) trade simplicity against adaptability
- Scatter-gather queries across all shards are expensive and latency-sensitive; choosing a good partition key that aligns with query patterns is critical for performance