Break the algorithm: Distributed Lock
The scenario for this is to create a locking mechanism in a Distributed Hash Table where nodes are allowed to fail without taking the entire DHT down.
Now, don’t expect too much out of this; I thought it up at around 2 AM and just sat down to hurriedly write it before it escaped my mind.
The environment in which it runs is a DHT, where a key may reside on several nodes (usually 1 or 3). Taking a lock means placing a lock item on over half of those nodes. A lock expires after a set amount of time (because we can’t trust the client to clear it). We assume a system that shares a clock (or synchronizes clocks).
The annoying thing is that we need to recover from situations in which some of the nodes holding the key are down or inaccessible.
Here is the pseudo code:
def LockKey(key, recursionDepth) as bool:
    topology = dht.GetTopologyFor(key)
    successfulLocks = 0
    lockExpiry = DateTime.Now.AddMinutes(1)
    lockKey = key + "_lock"
    for server in topology:
        try:
            server.WriteIfDoesNotExistsOrSameServer(lockKey, currentServerName, lockExpiry)
            successfulLocks += 1
        except ServerDown:
            pass // server is down, ignore it
        except KeyAlreadyExists:
            if ScavengeExpiredLocks(lockKey):
                return LockKey(key, recursionDepth + 1) if recursionDepth < 3
            return false
    return (successfulLocks/2) >= (topology.Count/2) //at least half the servers have the lock

def ScavengeExpiredLocks(lockKey) as bool:
    topology = dht.GetTopologyFor(lockKey)
    for server in topology:
        try:
            val = server.ReadKey(lockKey)
            if HasExpired(val):
                server.RemoveKey(lockKey, val.Version)
            else:
                return false
        except ServerDown:
            pass // server is down, ignore it
        except KeyVersionChanged:
            return false
    return true

def ClearLock(key):
    topology = dht.GetTopologyFor(key)
    lockKey = key + "_lock"
    for server in topology:
        try:
            val = server.ReadKey(lockKey)
            if BelongsToCurrentServer(val):
                server.RemoveKey(lockKey, val.Version)
        except ServerDown:
            pass // server is down, ignore it
        except KeyVersionChanged:
            pass // someone else touched the lock first, ignore it
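For context, usage would look something like this (a minimal sketch; UpdateWithLock and UpdateKey are just placeholders for whatever operation the lock is protecting):

def UpdateWithLock(key, newValue) as bool:
    if not LockKey(key, 0): // 0 = initial recursion depth
        return false // could not get the lock, let the caller decide whether to retry
    try:
        dht.UpdateKey(key, newValue) // stand-in for the operation being protected
    ensure:
        ClearLock(key) // best effort, the expiry covers a client that dies here
    return true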
So, how many critical bugs do I have here?
Comments
Don't at least n/2+1 servers need to return success in order for the lock to be considered 'entered'? Otherwise two nodes could both enter the lock, each with half of the nodes.
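(Worked example: with 4 replicas, topology.Count/2 is 2, so client A can lock servers 1 and 2 while client B locks servers 3 and 4, and both pass the check and believe they hold the lock. Requiring n/2+1 = 3 makes that impossible, since two clients cannot each hold 3 out of 4.)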
This is sort of like the Google Chubby protocol. They use a lock revision # that is incremented 'atomically' across nodes to ensure that two nodes can't lock on the same key.
What do you mean by 'atomically' incrementing the lock revision?
Can you explain?
Jason,
You are probably right.
As I understand it, the Chubby service associates a number that is incremented for locking requests. It's a variant of the Paxos algorithm and the number is needed to form a consensus on who owns the lock.
That may be overkill for this use, but I think the idea is if you need to change something in the DHT from one state to the next, this gives you some 'consensus' of the 'start state' before you begin.
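Roughly, the shape of the idea is something like this (just a sketch with made-up names — IncrementLockRevision, LockGrant, highestRevisionSeen — not Chubby's actual API):

def AcquireLock(key) as LockGrant:
    revision = IncrementLockRevision(key) // agreed between the lock servers, this is where Paxos comes in
    return LockGrant(key, revision)

def ApplyProtectedChange(grant as LockGrant, change):
    // the protected resource remembers the highest revision it has seen for this key
    if grant.Revision < highestRevisionSeen[grant.Key]:
        raise StaleLockException() // a holder whose lock already expired woke up late
    highestRevisionSeen[grant.Key] = grant.Revision
    Apply(change)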
What interests me is that your pseudo code is in Boo/Python (or really close to it) :)
That looks more like Boo than pseudo code :-P
Uriel,
I love Boo for its clean syntax.
It is almost pseudo code
Oren, I'd recommend reading about Chubby & distributed consensus algorithms (Paxos, etc.). You'll see the principal issues, rather than just the technical ones.
Btw, we evaluated the DHT approach as distributed storage for DO databases, and finally decided it won't work for the storages we typically need: index seeks can't be implemented well there, and they are essential in quite a few cases.
Note that e.g. BigTable is not DHT.
P.S. For me the worst issues here are:
- There must be issues related to differences in time between the nodes.
- It's completely unclear what will happen when a server wakes up again after a temporary failure (e.g. a network outage).
- It is unclear how servers are classified as down / working. What invariants are guaranteed to be maintained?
P.S. One more good article to read is the Microsoft Boxwood project description.
As far as I can judge from the above, there is no code related to distributed consensus, so there are no "global" state guarantees. Thus it's difficult to judge whether this will work at all: no one can predict how such a system will behave after a failure, because the initial state of a recovered node can be completely unexpected.
Ah, I see... That's the most complex problem. Check out the links ;)
Ayende:
I'm confused by this line:
return (successfulLocks/2) >= (topology.Count/2)
Why is it not
return successfulLocks >= (topology.Count/2)
Are you somehow trying to deal with an even number of servers?
Howard,
I would like to deal with an even number of servers, yes.
But you are right, your code is simpler, much simpler.
Actually, now that I think about it, since you want a majority for quorum, it probably should be:
return successfulLocks >= (topology.Count / 2) + 1
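(Sanity check, assuming integer division: with 3 replicas that requires 3/2 + 1 = 2 locks, and with 4 replicas it requires 4/2 + 1 = 3, so two clients can never both reach the threshold.)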