More CouchDB readingbtree

time to read 4 min | 789 words

I want to dive deep into the way CouchDB's file format, which is interesting, because it maintain ACID guarantees and the code is small enough to make it possible to read.

The file starts with a header, with the following structure:

"gmk\0"
db_header:
    writer_version
    update_seq
    summary_stream_state,
    fulldocinfo_by_id_btree_state
    docinfo_by_seq_btree_state
    local_docs_btree_state
    purge_seq
    purged_docs
// zero padding to 2048
md5 (including zero padding)

At the moment, I am not sure what some of the fields are (all the state and the purged_docs), and there is indication that this header can get larger than 2Kb. I'll ignore it for now and go study the way CouchDB retrieve a node. The internal data structure for the CouchDB file is a btree.

Here is the root lookup/2, which takes the btree and a list of keys:

image

It makes a call to lookup/3, the first of which is error handling for null btree, and the second on is the really interesting one.

get_node will perform a read from the file at the specified location. As you can see, the Pointer we get pass is the btree root at this stage. So this basically reads a term (an Erlang data structure) from the file. It is important to note that Erlang has stable serialization in the face of versioning, unlike .Net.

So, we get the node, and it is either a kp or a kv node (I think that kv is key/value and kp is key/pointer). Since key value seems to be easier to understand, I am going to look it up first.

image

As usual, we are actually interested in the lower one, with the first two for error handling. The first is to search for an empty list of keys, which return the reversed output. This is a standard pattern in Erlang, where you accumulate output until you run out of input to process in a recursive manner, at which point you simple return the reversed accumulator, since you need to return the data in the same order that you got it.

Indeed, we can see if from the signature of the third function that this is how this works. We split LookupKey from RestLookupKeys. We will ignore the second function for now, since I don't understand its purpose.

find_first_gteq seemed strange at first, until I realized that gteq stands for greater or equals to. This perform a simple binary search on NodeTuple. Assuming that it contains a list of ordered tuples (key,value), it will give you the index of the first item that is equal or greater to the LookupKey. The rest of the function is a simple recursive search for all the other items in the RestLookupKeys.

That is interesting, since it means that all the items are in memory. The only file access that we had was in the get_node call in lookup. That is interesting, since I would assume that it is entirely possible to get a keys across wide distribution of nodes. I am going to assume that this is there for the best case scenario, that the root node is also a value node that has no branches. Let us look at the handling of kp_node and see what is going on.

And indeed, this seems to be the case:

image

As usual, the first two are stop conditions, and the last one is what we are really interested in.

We get the first key that interest us, as before. But now, instead of having the actual value, we have a PointerInfo. This is the pointer to where this resides on the disk, I assume.

We then split the LookupKeys list on those who are greater then what FirstLookupKey is. For all those who are less then or equal to us, we call back to lookup, to recursively find us the results of them. Then we call again to all those who are greater than FirstLookupKey, passing it all those items.

In short, this is a implementation of reading a binary tree from a file, I started to call it simple, but then I realized that it isn't, really. It is elegant, though.

Of course, this is still missing a critical piece, the part in which we actually write the data, but for now I am actually able to understand how the file format works. Now I have to wonder about attachments, and how the actual document content is stored, beyond just the key.

Considering that just yesterday I gave up on reading lookup for being too complex, I am pretty happy with this.

More posts in "More CouchDB reading" series:

  1. (24 Sep 2008) btree
  2. (24 Sep 2008) btree