A backend engineer's journey of learning and growth.
by kan01234
To achieve horizontal scaling, it’s crucial to distribute requests and data efficiently and evenly across servers. A key challenge arises when adding or removing servers, which can disrupt the balance and lead to costly data movement. Consistent hashing is a powerful technique designed to address this issue. But before diving into how it works, let’s first take a closer look at the problem.
In computing, hashing is a technique used to efficiently map data (such as a key) to a fixed location in memory or on disk. This process is carried out using a hash function, which takes an input (the key) and produces a fixed-size hash value, often represented as an integer. The goal of hashing is to distribute data uniformly across a set of available storage locations.
In a traditional system, hashing helps quickly locate and retrieve data. For instance, a simple hash function could map a user ID to a particular server in a system of five servers (numbered 0 through 4). Using a modulo operation like hash(user_id) % 5, we can easily determine which server should handle the request.
User ID | Hash Value (Simulated) | Hash Value % 5 (Server) |
---|---|---|
1001 | 5643 | 5643 % 5 = 3 (Server 3) |
2005 | 7932 | 7932 % 5 = 2 (Server 2) |
3021 | 10123 | 10123 % 5 = 3 (Server 3) |
4987 | 16489 | 16489 % 5 = 4 (Server 4) |
5763 | 18642 | 18642 % 5 = 2 (Server 2) |
This means each user ID always maps to the same server, as long as the number of servers stays fixed.
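As a quick sketch, here is what that modulo-based routing looks like in Java. The hash values are the simulated ones from the table, so we plug them in directly rather than computing a real hash:

```java
import java.util.Map;

public class ModuloRouting {
    public static void main(String[] args) {
        int servers = 5;
        // Simulated hash values from the table above: userId -> hash
        Map<Integer, Integer> hashes = Map.of(
                1001, 5643,
                2005, 7932,
                3021, 10123,
                4987, 16489,
                5763, 18642);
        // Each request is routed by taking the hash modulo the server count
        hashes.forEach((userId, hash) ->
                System.out.printf("user %d -> server %d%n", userId, hash % servers));
    }
}
```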
There’s a fundamental problem with traditional hashing when applied to distributed systems: scalability. Distributed systems often involve multiple servers or nodes that may need to be dynamically added or removed due to scaling requirements or failures.
Node Addition/Removal: When nodes are added or removed, the entire hash distribution changes. For example, if we add a new server, the result of hash(key) % number_of_nodes changes for most keys, and almost every piece of data must be reassigned to a different node.
Data Movement: As a result, significant amounts of data must be redistributed across nodes, which can lead to performance degradation, system downtime, and increased operational costs.
User ID | Hash Value (Simulated) | Hash Value % 5 (Server) | Hash Value % 6 (Server) |
---|---|---|---|
1001 | 5643 | 5643 % 5 = 3 (Server 3) | 5643 % 6 = 3 (Server 3) |
2005 | 7932 | 7932 % 5 = 2 (Server 2) | 7932 % 6 = 0 (Server 0) |
3021 | 10123 | 10123 % 5 = 3 (Server 3) | 10123 % 6 = 1 (Server 1) |
4987 | 16489 | 16489 % 5 = 4 (Server 4) | 16489 % 6 = 1 (Server 1) |
5763 | 18642 | 18642 % 5 = 2 (Server 2) | 18642 % 6 = 0 (Server 0) |
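With the simulated hashes above, four of the five keys land on a different server once the sixth server joins, so 80% of the data would have to move. A tiny sketch to count the moved keys:

```java
import java.util.List;

public class ReshardingCost {
    public static void main(String[] args) {
        // Simulated hash values from the tables above
        List<Integer> hashes = List.of(5643, 7932, 10123, 16489, 18642);
        long moved = hashes.stream()
                .filter(h -> h % 5 != h % 6) // assignment changes when scaling from 5 to 6 servers
                .count();
        System.out.printf("%d of %d keys must move%n", moved, hashes.size());
    }
}
```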
Consistent hashing is a distributed hashing technique that addresses the key challenges faced by traditional hashing when nodes are added or removed. It achieves this by using a ring (or circle) structure where both data (keys) and nodes (servers) are placed based on their hash values.
The Hash Ring: Imagine a circle (or ring) where all possible hash values are arranged in a clockwise direction. Each node in the system is assigned a position on this ring, determined by applying a hash function to the node’s identifier (like its IP address or server ID).
Key Placement: Each data key is also assigned a position on the ring using the same hash function. A key is mapped to the first node that appears in a clockwise direction from its position. This ensures that each node is responsible for a specific range of keys.
The term “consistent hashing” was introduced by David Karger et al. at MIT. The basic idea is exactly the ring described above: map both keys and servers into the same hash space, so that adding or removing a server only affects the keys that fall between it and its neighbor on the ring.
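Here is a minimal sketch of that idea in Java, using a TreeMap as the sorted ring and the first 8 bytes of an MD5 digest as the hash. The class and method names (HashRing, addNode, nodeFor) are just illustrative, not from any particular library:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    // Sorted map of ring position -> node identifier
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // A key belongs to the first node found clockwise from its position,
    // wrapping around to the start of the ring if necessary.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Map a string to a ring position using the first 8 bytes of its MD5 digest
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashRing cluster = new HashRing();
        cluster.addNode("server-1");
        cluster.addNode("server-2");
        cluster.addNode("server-3");
        System.out.println("user:1001 -> " + cluster.nodeFor("user:1001"));
        // Removing a node only reassigns the keys that node owned
        cluster.removeNode("server-2");
        System.out.println("user:1001 -> " + cluster.nodeFor("user:1001"));
    }
}
```

The TreeMap keeps ring positions sorted, so a lookup is simply “find the first position greater than or equal to the key’s hash”, which is the clockwise walk described above.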
However, two main problems arise with this approach:
1. Imbalanced Partitions:
In a real system, it’s impossible to ensure that each server gets an equal-sized partition of the ring (a partition being the hash space between two adjacent servers). As servers are added or removed, the size of these partitions can become highly uneven: some servers end up responsible for a much larger slice of the hash space than others.
This imbalance can lead to inefficiency, as some servers may be overburdened while others are underutilized. For example, if node 3 is removed, node 4 inherits its entire partition and ends up holding much more data than the other nodes.
2. Hotspots and Load Imbalance:
Because partitions vary in size, it’s likely that some servers will experience much heavier loads than others. For instance, a server with a larger partition will receive more requests, which could cause performance bottlenecks. This can become particularly problematic as new servers are added or removed, leading to frequent rebalancing.
To address these issues, one common solution is to use virtual nodes. By assigning multiple virtual nodes to each server, the partitions can be split more evenly across the ring. This ensures a more balanced distribution of keys and prevents any single server from being overloaded.
Assume we have 3 physical nodes, and for each node, we create 3 virtual nodes. In a real-world scenario, the number of virtual nodes would typically be much larger, but for simplicity, we’ll use 3 here.
Instead of directly mapping to physical nodes like N1, N2, and N3, we now divide the hash ring into 9 partitions represented by the virtual nodes: N1_0, N1_1, N1_2, N2_0, N2_1, N2_2, N3_0, N3_1, N3_2. Each virtual node corresponds to a different partition on the hash ring.
With this setup, each physical node is responsible for multiple smaller partitions (represented by its virtual nodes), which helps distribute the load more evenly across all nodes in the system. This ensures that if a node joins or leaves the system, only the keys associated with its virtual nodes need to be reassigned, minimizing disruption and improving scalability.
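Building on the HashRing sketch above, virtual nodes only change how servers are registered on the ring: each physical node is hashed several times under different labels. A minimal self-contained variant (3 virtual nodes per node to mirror the example, though real systems use far more):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class VirtualNodeRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodes;

    public VirtualNodeRing(int vnodes) {
        this.vnodes = vnodes;
    }

    // Register one physical node as several virtual nodes, e.g. "N1_0", "N1_1", "N1_2";
    // each label lands at a different ring position but points back to the same node.
    public void addNode(String node) {
        for (int i = 0; i < vnodes; i++) {
            ring.put(hash(node + "_" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) {
            ring.remove(hash(node + "_" + i));
        }
    }

    // Lookup is unchanged: first ring position clockwise from the key's hash
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Same MD5-based hash as the previous sketch
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VirtualNodeRing cluster = new VirtualNodeRing(3); // 3 vnodes per node, as in the example
        cluster.addNode("N1");
        cluster.addNode("N2");
        cluster.addNode("N3");
        System.out.println("user:1001 -> " + cluster.nodeFor("user:1001"));
    }
}
```

Because each physical node now owns several small arcs scattered around the ring, a node that joins or leaves exchanges keys with many neighbors instead of dumping its whole range onto a single one.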
Limitations of Consistent Hashing
1. Imperfect Load Balancing:
Even with virtual nodes, consistent hashing doesn’t always guarantee a perfectly balanced load. Some servers may still end up handling more data or requests than others, especially if certain keys are accessed more frequently (hotspots). For instance, in a system where some resources (e.g., popular videos or viral posts) are much more in demand, consistent hashing may not fully mitigate the issue.
2. Increased System Complexity:
Managing virtual nodes introduces overhead, particularly in large-scale systems with numerous servers. Adding more virtual nodes to improve balance increases operational complexity and makes it more difficult to debug and manage the system.
3. Cold Start Problem:
When new servers are added (e.g., during scaling events), they may initially have no data assigned, leading to a cold start issue. These servers need to pick up load gradually, which can create inefficiencies or delays. Over time, consistent hashing rebalances the ring, but the initial load distribution can be less efficient.
4. Snowball Effect:
If a server failure or removal occurs in an already uneven system, the redistribution of load can cause other servers to become overloaded. This “snowball effect” can lead to cascading failures as more and more load is shifted to fewer servers, potentially disrupting the entire system.
5. Hash Function Dependency:
The effectiveness of consistent hashing depends heavily on the quality of the chosen hash function. Poorly designed or non-uniform hash functions can lead to uneven distribution of data, which defeats the purpose of using consistent hashing in the first place.
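As a concrete illustration of this last point, Java’s default String.hashCode() produces nearly adjacent values for sequential server names, which would place the nodes right next to each other on the ring and leave one node owning almost the entire key space. A digest-based hash, like the MD5 scheme in the sketches above, scatters them instead:

```java
public class HashQualityDemo {
    public static void main(String[] args) {
        // Sequential names hash to consecutive integers with String.hashCode(),
        // i.e. they would cluster at adjacent positions on the ring.
        for (String node : new String[]{"server-1", "server-2", "server-3"}) {
            System.out.printf("%s -> hashCode %d%n", node, node.hashCode());
        }
    }
}
```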
The key benefits of consistent hashing include:
Efficiency: when a node joins or leaves, only the keys in the affected partitions need to be remapped, instead of nearly all keys as with modulo-based hashing.
Uniformity: with a good hash function and enough virtual nodes, keys and load are spread evenly across servers.
Consistent hashing is a core part of the infrastructure for many large-scale distributed systems, including Amazon’s Dynamo (the design behind DynamoDB), Apache Cassandra, and content delivery networks such as Akamai.
By leveraging consistent hashing, these systems can scale effectively, distribute data evenly, and maintain high availability, even in the face of frequent changes to the system topology.
tags: system-design