Break the algorithm: Distributed Lock
The scenario for this is to create a locking mechanism in a Distributed Hash Table where nodes are allowed to fail without taking the entire DHT down.
Now, don’t expect too much out of this; I thought it up at around 2 AM and just sat down to hurriedly write it before it escaped my mind.
The environment in which it runs is a DHT, where a key may reside on several nodes (usually 1 or 3). Taking a lock means placing a lock item on over half of those nodes. A lock expires after a set amount of time (because we can’t trust the client to clear it). We assume a system that shares a clock (or synchronizes clocks).
The annoying thing is that we need to recover from situations in which some of the nodes holding the key are down or inaccessible.
Here is the pseudo code:
def LockKey(key, recursionDepth) as bool:
    topology = dht.GetTopologyFor(key)
    successfulLocks = 0
    lockExpiry = DateTime.Now.AddMinutes(1)
    lockKey = key + "_lock"
    for server in topology:
        try:
            server.WriteIfDoesNotExistsOrSameServer(lockKey, currentServerName, lockExpiry)
            successfulLocks += 1
        except ServerDown:
            pass // server is down, ignore it
        except KeyAlreadyExists:
            if ScavengeExpiredLocks(lockKey):
                return LockKey(key, recursionDepth + 1) if recursionDepth < 3
            return false
    return (successfulLocks/2) >= (topology.Count/2) //at least half the servers have the lock

def ScavengeExpiredLocks(lockKey) as bool:
    topology = dht.GetTopologyFor(lockKey)
    for server in topology:
        try:
            val = server.ReadKey(lockKey)
            if HasExpired(val):
                server.RemoveKey(lockKey, val.Version)
            else:
                return false
        except ServerDown:
            pass // server is down, ignore it
        except KeyVersionChanged:
            return false
    return true

def ClearLock(key):
    topology = dht.GetTopologyFor(key)
    lockKey = key + "_lock"
    for server in topology:
        try:
            val = server.ReadKey(lockKey)
            if BelongsToCurrentServer(val):
                server.RemoveKey(lockKey, val.Version)
        except ServerDown:
            pass // server is down, ignore it
        except KeyVersionChanged:
            pass // someone else touched the lock first, ignore it
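For context, usage would look something like this (a minimal sketch; UpdateWithLock and UpdateKey are just placeholders for whatever operation the lock is protecting):

def UpdateWithLock(key, newValue) as bool:
    if not LockKey(key, 0): // 0 = initial recursion depth
        return false // could not get the lock, let the caller decide whether to retry
    try:
        dht.UpdateKey(key, newValue) // stand-in for the operation being protected
    ensure:
        ClearLock(key) // best effort, the expiry covers a client that dies here
    return true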
So, how many critical bugs do I have here?
Comments
Don't at least n/2+1 servers need to return success in order for the lock to be considered 'entered'? Otherwise two nodes could both enter the lock, each with half of the nodes.
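(Worked example: with 4 replicas, topology.Count/2 is 2, so client A can lock servers 1 and 2 while client B locks servers 3 and 4, and both pass the check and believe they hold the lock. Requiring n/2+1 = 3 makes that impossible, since two clients cannot each hold 3 out of 4.)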
This is sort of like the Google Chubby protocol. They use a lock revision # that is incremented 'atomically' across nodes to ensure that two nodes can't lock on the same key.
What do you mean by 'atomically' incrementing the lock revision?
Can you explain?
Jason,
You are probably right.
As I understand it, the Chubby service associates a number that is incremented for locking requests. It's a variant of the Paxos algorithm and the number is needed to form a consensus on who owns the lock.
That may be overkill for this use, but I think the idea is if you need to change something in the DHT from one state to the next, this gives you some 'consensus' of the 'start state' before you begin.
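Roughly, the shape of the idea is something like this (just a sketch with made-up names — IncrementLockRevision, LockGrant, highestRevisionSeen — not Chubby's actual API):

def AcquireLock(key) as LockGrant:
    revision = IncrementLockRevision(key) // agreed between the lock servers, this is where Paxos comes in
    return LockGrant(key, revision)

def ApplyProtectedChange(grant as LockGrant, change):
    // the protected resource remembers the highest revision it has seen for this key
    if grant.Revision < highestRevisionSeen[grant.Key]:
        raise StaleLockException() // a holder whose lock already expired woke up late
    highestRevisionSeen[grant.Key] = grant.Revision
    Apply(change)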
What interests me is that your pseudo code is in Boo/Python (or really close to it) :)
That looks more like Boo than pseudo code :-P
Uriel,
I love Boo for its clean syntax.
It is almost pseudo code
Oren, I'd recommend reading about Chubby & distributed consensus algorithms (Paxos, etc.). You'll see the principal issues, rather than just the technical ones.
Btw, we evaluated the DHT approach as distributed storage for DO databases, and finally decided it won't work for the storages we typically need: index seeks can't be implemented well there, and they are essential in quite a few cases.
Note that e.g. BigTable is not DHT.
P.S. For me the worst issues here are:
- There must be issues related to differences in time between the nodes.
- It's completely unclear what will happen when a server wakes up again after a temporary failure (e.g. a network outage).
- It is unclear how servers are classified as down / working. What invariants are guaranteed to be maintained?
P.S. One more good article to read is the Microsoft Boxwood project description.
As far as I can judge from the above, there is no code related to distributed consensus, so there are no "global" state guarantees. Thus it's difficult to judge whether this will work at all: no one can predict how such a system will behave after a failure, because the initial state of a recovered node can be completely unexpected.
Ah, I see... That's the most complex problem. Check out the links ;)
Ayende:
I'm confused by this line:
return (successfulLocks/2) >= (topology.Count/2)
Why is it not
return successfulLocks >= (topology.Count/2)
Are you somehow trying to deal with an even number of servers?
Howard,
I would like to deal with an even number of servers, yes.
But you are right, your code is simpler, much simpler.
Actually, now that I think about it, since you want a majority for quorum, it probably should be:
return successfulLocks >= (topology.Count / 2) + 1
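(Sanity check, assuming integer division: with 3 replicas that requires 3/2 + 1 = 2 locks, and with 4 replicas it requires 4/2 + 1 = 3, so two clients can never both reach the threshold.)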