Break the algorithm: Distributed Lock

time to read 4 min | 677 words

The scenario for this is to create a locking mechanism in a Distributed Hash Table where nodes are allowed to fail without taking the entire DHT down.

Now, don’t expect too much out of this, I thought this out at 2 AM or so, and just sat down to hurriedly write it before it escape my mind.

The environment in which it runs is a DHT, where a key may reside on several nodes (usually 1 or 3).  Taking a look means placing a lock item in over half of the nodes. Lock expires after a set amount of time (because we can’t trust the client to clear them). We assume a system that share a clock (or synchronize clocks).

The annoying thing is that we need to recover from situations in which some of the nodes holding the key are down or inaccessible.

Here is the pseudo code:

def LockKey(key, recursionDepth) as bool:
topology = dht.GetTopologyFor(key)
successfulLocks = 0
lockExpiry = DateTime.Now.AddMinutes(1)
lockKey = key+"_lock"
for server in topology:
try:
server.WriteIfDoesNotExistsOrSameServer(lockKey, currentServerName, lockExpiry)
successfulLocks += 1
except ServerDown:
ignore error
except KeyAlreadyExists:
if ScavengeExpiredLocks(lockKey):
return LockKey(key, recursionDepth+1) if recursionDepth < 3
return false

return (successfulLocks/2) >= (topology.Count/2) //at least half the servers have the lock



def ScavengeExpiredLocks(key):
topology = dht.GetTopologyFor(key)
for server in topology:
try:
val = server.ReadKey(lockKey)
if HasExpired(val):
server.RemoveKey(lock, val.Version)
else:
return false
except ServerDown:
ignore error
except KeyVersionChanged:
return false

return true

def ClearLock(key):
topology = dht.GetTopologyFor(key)
for server in topology:
try:
val = server.ReadKey(lockKey)
if BelongsToCurrentServer(val):
server.RemoveKey(lock, val.Version)
except ServerDown:
ignore error
except KeyVersionChanged:
ignore error

So, how many critical bugs do I have here?