Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,162
Privacy Policy · Terms
filter by tags archive
time to read 9 min | 1692 words

Okay!

After diving into CouchDB source code and doing an in depth review of the btree:lookup, I am ready for the real challenge, figuring out how CouchDB writes to the btree. This is the really interesting part, from my point of view.

The heart of this functionality is the query_modify method. This method allow us to perform a batch of operations on the tree in one go. Following my own ideal that all remote APIs (and this is talking to disk, hence, remote) should support batching, I really like the idea.

image

The first part is dedicated to building the actions list. I am not going to go over that in detail, but I will say that this first sort things by key, and then by action.

So, for example, let us say that I want to perform the following actions:

query_modify(tree, {'c', 'a'},'a','a')

What will happen is that we will first query the current value of 'a', then we will remove it, and then we will insert it. Finally, we will query the value of 'c'. This is done in order to ensure good traversal performance of the tree, as well as to ensure consistency when performing multiple operations.

The part that really interest us is modify_node, which perform the actual work:

image

Now we are starting to see how things are really happening. We can see that if the root is null we create a value node, or we find the appropriate node if it is not. This follows what we seen in the lookup.

Moving on, we forward the call to modify_kpnode or modify_kvnode. I am going to touch kpnode first, since it is bound to be simpler (kvnode needs to handle splitting, I would assume).

image

This is far from trivial code, and I don't think that I understand it fully yet, but what it basically does is fairly simple. First the first node that match the current FirstActionKey, if this is the last node that we have to deal with, we call modify_node on it, accumulate the result and return it. If it is not the last node, we split the work, sending all that are less than or equal to the current FirstActionKey to be handled using modify_node (which is likely to be a key/value node and thus handled directly) and continue to handle the rest using modify_kpnode.

In a way, this is the exact reverse of how lookup_kvnode is handled.

modify_kvnode is complex. I can't fit the functions on a single screen in a 1280x1024 resolution. I am going to split it into two sections. It is not ideal, since they are supposed to go together, but I'm sure you'll manage.

image

The first part is to handle the case where we have more actions to perform. In this case, we can simple return the results. The second is there to handle the case where we run out of the items in the node. Note how insert works, for example. You can see that the way that the btree works is the same in which Erlang does. That is, we will always rewrite the entire node, rather than modify it.  remove is also interesting. If we got to this point and haven't found the node, it doesn't exist, so we can move on. Same for fetching.

Now let us see the more complex part, what happen if we do find the items in the value node?

image

Note that we have AccNode, which is our accumulator. We find the first node that match ActionKey, and then we take the NodeTuple and the AccNode and turn them into a reverse list. This copies all the items that are less than the current one to the ResultNode, those are the ones that we are not interested in, so we can just copy them as is.

The next part handles the actual adding/removing/fetching from the node. It is pretty self explanatory, I think, so I'll leave it at that.

So now we understand what modify_kvnode and modify_kpnode works. But there is nothing here about splitting nodes, which I find suspicious. Let us go back to modify_node and look what else is going on there:

image

Ah, note the handling of NewNodeList, there is probably where the money is.

We make a couple of tests. The first is to see if there are any nodes left, the second to see if we changed the node list (by comparing it to the old value). We don't care for any of those at the moment, so we will focus on the last one. write_node is called, and this is likely where we will see what we are looking for.

image

Well, this is simple enough, and chunkify seems like what we were looking for. However, It bothers me that we write all the nodes to disk. Is seems... wrong somehow. More specifically, since we are always appending, aren't we going to break the binary tree? There is also the reduce_node call that is hiding there, which we also need to check. It is being called after the node was written to disk, so I am not sure what is going on here.

Let us read chunkify first, and then deal with reduce_node.

image

Well... it looks like the chunkify function sucks. But anyway, what is going on there is fairly simple. We check if the list that we were passed is greater than CHUNK_THRESHOLD. This is set to 1279, for some strange reason that I can't follow. I assume that the reason for that is to ensure blocks of less than 1280, but that no sector size that I have heard of came in this size.

The second part is more interesting (and complex). OutputChunks is a list of lists of the elements that started in the InList. This piece of code is very short, but it does a tremendous amount of work.

And now we can move back to reduce_node, like I promised.

image

This is short, to the point, and interesting. The idea of rereduce is that when something has changed, you don't have to go and recalculate everything from scratch, you can take partial reduced results and the new values and combine them to produce the same result that you would have if you had reduced over the entire data set.

As you can see, calling reduce_node on a key pointer node will cause a re reduction, while on a value node, it just reduce after a map. I am assuming that the thought was that value nodes are small enough that there is no point in optimizing this.

There are a few interesting points that need to be raised, which I don't have answers for at the moment.

  • Why is this happening after we write to file?
  • How does this relate to CouchDB's ability to shell out to java script in order to perform maps and reductions?
  • What ensure that the ordering of the node reduction match the tree hierarchy?

At any rate, this completes our examination of write_node and modify_node, we can now go back to where we started, query_modify:

image

We are nearly at the end. We have seen that the nodes are written to disk after being chunkified. Note that currently nothing actually have a reference to them at this point. We do have KeyPointers, but they aren't attached to anything. If we crashed right now (directly after modify_node), there is no state change as far as we are concerned, we just lost some disk space that we will recover in the next compaction.

I am pretty sure that complete_root is the one responsible for hooking everything together, so let us see what it does...

image

Yes, I expected complete_root to be complicated as well :-)

What is going on here is the actual missing piece. This is what takes all the value nodes and turn them into pointer nodes, and does so recursively until we finally get to the point where we only have a single value returned, which is the root node. There is also handling for no nodes, in which case the tree is empty.

Basically, what this means that that the way CouchDB is able to achieve ACID using a btree is by saving the modified tree structure to disk on each commit. Since it is only the modified part that is saved, and since btree structure are really small, it has no real storage penalty. Now I want to go and find what actually save the new root tree to disk, since query_modify does not modify the actual header (which means that in the case of a crash, nothing will change from the point of view of the system).

Right now I suspect that this is intentional, that this allows to combine even more actions into a single transaction, even beyond what you can do in a single query_modify. This is especially true since the common interface for those would usually be add, remove, etc.

As an interesting side effect, this is also how CouchDB is able to handle MVCC. Since everything is immutable, it can just hand a tree reference to the calling code and forget about it. No changes ever occur, so you get serializable isolation level by default. And, from what I can see, basically for free.

Going over the couch_httpd file, is seems that the main interface is couch_db, so I am heading that way... and I run into some very complex functions that I don't really feel like reading right now.

Hope this post doesn't contain too many mistakes...

time to read 4 min | 789 words

I want to dive deep into the way CouchDB's file format, which is interesting, because it maintain ACID guarantees and the code is small enough to make it possible to read.

The file starts with a header, with the following structure:

"gmk\0"
db_header:
    writer_version
    update_seq
    summary_stream_state,
    fulldocinfo_by_id_btree_state
    docinfo_by_seq_btree_state
    local_docs_btree_state
    purge_seq
    purged_docs
// zero padding to 2048
md5 (including zero padding)

At the moment, I am not sure what some of the fields are (all the state and the purged_docs), and there is indication that this header can get larger than 2Kb. I'll ignore it for now and go study the way CouchDB retrieve a node. The internal data structure for the CouchDB file is a btree.

Here is the root lookup/2, which takes the btree and a list of keys:

image

It makes a call to lookup/3, the first of which is error handling for null btree, and the second on is the really interesting one.

get_node will perform a read from the file at the specified location. As you can see, the Pointer we get pass is the btree root at this stage. So this basically reads a term (an Erlang data structure) from the file. It is important to note that Erlang has stable serialization in the face of versioning, unlike .Net.

So, we get the node, and it is either a kp or a kv node (I think that kv is key/value and kp is key/pointer). Since key value seems to be easier to understand, I am going to look it up first.

image

As usual, we are actually interested in the lower one, with the first two for error handling. The first is to search for an empty list of keys, which return the reversed output. This is a standard pattern in Erlang, where you accumulate output until you run out of input to process in a recursive manner, at which point you simple return the reversed accumulator, since you need to return the data in the same order that you got it.

Indeed, we can see if from the signature of the third function that this is how this works. We split LookupKey from RestLookupKeys. We will ignore the second function for now, since I don't understand its purpose.

find_first_gteq seemed strange at first, until I realized that gteq stands for greater or equals to. This perform a simple binary search on NodeTuple. Assuming that it contains a list of ordered tuples (key,value), it will give you the index of the first item that is equal or greater to the LookupKey. The rest of the function is a simple recursive search for all the other items in the RestLookupKeys.

That is interesting, since it means that all the items are in memory. The only file access that we had was in the get_node call in lookup. That is interesting, since I would assume that it is entirely possible to get a keys across wide distribution of nodes. I am going to assume that this is there for the best case scenario, that the root node is also a value node that has no branches. Let us look at the handling of kp_node and see what is going on.

And indeed, this seems to be the case:

image

As usual, the first two are stop conditions, and the last one is what we are really interested in.

We get the first key that interest us, as before. But now, instead of having the actual value, we have a PointerInfo. This is the pointer to where this resides on the disk, I assume.

We then split the LookupKeys list on those who are greater then what FirstLookupKey is. For all those who are less then or equal to us, we call back to lookup, to recursively find us the results of them. Then we call again to all those who are greater than FirstLookupKey, passing it all those items.

In short, this is a implementation of reading a binary tree from a file, I started to call it simple, but then I realized that it isn't, really. It is elegant, though.

Of course, this is still missing a critical piece, the part in which we actually write the data, but for now I am actually able to understand how the file format works. Now I have to wonder about attachments, and how the actual document content is stored, beyond just the key.

Considering that just yesterday I gave up on reading lookup for being too complex, I am pretty happy with this.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}