atom feed4 messages in com.basho.lists.riak-usersRiak doesn't use consistent hashing
FromSent OnAttachments
Greg NelsonMay 25, 2011 10:54 pm 
Jonathan LangevinMay 26, 2011 6:57 am 
Ben TillyMay 26, 2011 7:20 am 
Greg NelsonMay 26, 2011 11:49 am 
Subject:Riak doesn't use consistent hashing
From:Greg Nelson (gro@dropcam.com)
Date:May 25, 2011 10:54:25 pm
List:com.basho.lists.riak-users

I've been doing some digging through the details of how a node joins a cluster.
When you hear that Riak uses consistent hashing, you'd expect it to distribute
keys to nodes by hashing keys onto the ring AND hashing nodes onto the ring.
Keys belong to the closest node on the ring, in the clockwise direction. Add a
node, it hashes onto the ring and takes over some keys. Ordinarily the node
would hash onto the ring in several places, to achieve better spread. Some data
(roughly 1 / #nodes) moves to the new node from each of the other nodes, and
everything else stays the same.

In what Amazon describes as operationally simpler (strategy 3 in the Dynamo
paper), the ring is instead divided into equally-sized partitions. Nodes are
hashed onto the ring, and preflists are calculated by walking clockwise from a
partition, skipping partitions on already visited nodes. Riak does something
similar: it divides the ring into equally-sized partitions, then nodes
"randomly" claim partitions. However, the skipping bit isn't part of Riak's
preflist calculation. Instead, nodes claim partitions in such a way as to be
spaced out by target_n_val, to obviate the need for skipping.

Now, getting back to what happens when a node joins. The new node calculates a
new ring state that maintains the target_n_val invariant, as well as trying to
keep even spread of partitions per node. The algorithm (default_choose_claim) is
heuristic and greedy in nature, and recursively transfers partitions to the new
node until optimal spread is achieved, maintaining target_n_val along the way.
But if -- during one of those recursive calls -- it can't meet the target_n_val,
it will throw up its hands and completely re-do the whole ring (by calling
claim_rebalance_n). Striping the partitions across nodes, in a round-robin
fashion. When that happens, most of the data needs to be handed off between
nodes.

This happens a lot, with many ring sizes. With ring_creation_size=128 (i.e., 128
partitions), it will happen when adding node 9 (87.5% of data moves), adding
node 12 (82%), adding node 15 (80%), adding node 19 (94%). It happens with all
ring sizes >= 128 (256, 512, 1024, ...). It appears that any ring_creation_size
(64 by default) is safe for growing to 8 nodes or so. But if you want to go
beyond that... A ring size of >= 128 with more than 8 nodes doesn't seem all
that unusual, surely someone has hit this before? I've filed a bug report here:
https://issues.basho.com/show_bug.cgi?id=1111

Anyway, this feels like a bit of a departure from consistent hashing. In fact,
could this not be replaced by normal hashing + a lookup table mapping intervals
of the hash space to nodes? And isn't that simply sharding?

At any rate, I believe the claim algorithm can be improved to avoid those "throw
up hands and stripe everything" scenarios. In fact, here is such an
implementation: https://github.com/basho/riak_core/pull/55. It is still
heuristic and greedy, but it seems to do a better job of avoiding re-stripe.
Test results are attached in a zip on the bug linked above. I'd love to get the
riak_core gurus at Basho to look at this and help validate it. It probably could
use some cleaning up, but I want to make sure there aren't other invariants or
considerations I'm leaving out -- besides maintaining target_n_val, keeping
optimal partition spread, and minimizing handoff between ring states.

-Greg