Voron’s Journal: Design Decisions
I’ve spoken before about the idea behind Voron’s Journal. It allows us to do some pretty sweet things:
- Sequential writes.
- Incremental backups.
- Multi Version Concurrency Control (MVCC).
- Mutable trees.
We haven’t really done perf testing yet (we have some early benchmarks, but they aren’t ready to share), but things are looking up.
In particular, because we write to the log sequentially, we get much better write speed. I’ll discuss the mechanics of flushing the journal in a bit, and I have already talked about incremental backups.
What I really want to talk about now is how we implement MVCC and a major optimization: Mutable trees.
In LMDB, the engine uses a copy-on-write method to ensure that you can modify the data while other transactions are reading from it. This leads to a lot of writes (and free space). We can avoid that entirely by using the journal file, which ends up reducing the amount of writes we have to do by a factor of 10 or so in our random writes scenario. You can read more about this here.
The downside of having a journal is that you can’t really just keep the data there. You need to move it to the data file at some point. (CouchDB’s engine uses what is effectively only a journal file, but it needs to compact to reclaim space.)
And as it turns out, there are a lot of issues that we need to consider here. To start with, we have a minor write amplification. We are (at worst) writing twice as much data as we get. The reason I say at most twice as much is that there are optimizations in place to stop us from writing the same page multiple times. So if we wrote five values in five transactions, we would have to write the five transactions to the journal, but only update a single page in the data file.
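To make that deduplication concrete, here is a minimal sketch in Python (the structures and names are purely illustrative, not Voron’s actual code): every committed page goes to the journal, but at flush time only the latest version of each distinct page is written to the data file.

```python
journal = []          # append-only list of (page_number, page_bytes)
data_file = {}        # page_number -> page_bytes, stands in for the data file

def commit(tx_pages):
    """Append every page a transaction modified to the journal (sequential write)."""
    for page_no, page_bytes in tx_pages.items():
        journal.append((page_no, page_bytes))

def flush():
    """Apply the journal to the data file, writing each distinct page only once."""
    latest = {}
    for page_no, page_bytes in journal:
        latest[page_no] = page_bytes          # later journal entries win
    for page_no, page_bytes in latest.items():
        data_file[page_no] = page_bytes       # one write per distinct page
    journal.clear()

# five transactions, all touching the same page
for i in range(5):
    commit({7: f"value {i}".encode()})

print(len(journal))     # 5 entries written sequentially to the journal
flush()
print(len(data_file))   # but only 1 page ends up updated in the data file
```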
More interesting is how we are going to actually handle flushing the journal, or rather, when we are going to do that. We need to balance system speed vs. number of journal files vs. recovery time vs. performance spikes.
My current thinking is that we’ll start with a 1MB log file (see side note) and flush to disk every 1MB. When a log file is full, we create a new one, and if there is more than a single log file, we double the size of the new log file (up to 64 – 256MB, I guess). This allows us to very quickly size the log file up automatically without needing any explicit configuration. It also means that under load, we will get bigger log files, and hence better write performance. Once we are beyond the 1MB range (a size that I don’t think requires any optimizations at all), we need to start thinking about how to do work in parallel, both flushing to the data file and writing to the current log file.
Side note: optimizing small and growing
One of the things we noticed is that a lot of users try out the database with very small data sets. With RavenDB’s default configuration, we are set up to take about 64MB of disk space, and we got some complaints about that. It might be better to start small and increase the size than to start with a more reasonable default that hurts small dbs.
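Here is a rough sketch of that growth policy; the constants, the function name, and the exact growth condition are my reading of the post, not Voron’s actual code.

```python
MIN_LOG_SIZE = 1 * 1024 * 1024       # start with a 1MB log file
MAX_LOG_SIZE = 64 * 1024 * 1024      # the post says "64 - 256MB, I guess"

def next_log_size(current_size, pending_log_files):
    """Pick the size of the next journal file: double it while we are falling
    behind (more than one log file waiting to be flushed), up to the cap."""
    if pending_log_files > 1:
        return min(current_size * 2, MAX_LOG_SIZE)
    return current_size

# under sustained load the log grows quickly: 2MB -> 4MB -> 8MB -> 16MB
size = MIN_LOG_SIZE
for pending in (2, 2, 2, 2):
    size = next_log_size(size, pending)
    print(size // (1024 * 1024), "MB")
```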
What I think we’ll do is allow only one concurrent data flush, which is how we limit the amount of work that flushing can take. After every write, we will check how much space we have to flush. If the space is about 50% of the current log file size, we will initiate a flush. We will also initiate a flush if there have been no writes for 5 seconds. What we are trying to achieve with this logic is that, ideally, under load, we will always be both writing to the log file and flushing from it. The reason we want to do that is to ensure that we get consistent performance. We want to avoid spikes and stalls in performance. It is more important to be consistent than to allow high burst speeds.
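A sketch of that flush heuristic, assuming a single background flush, a ~50% threshold, and a 5 second idle timer (class and method names here are hypothetical, not Voron’s API):

```python
import threading
import time

class FlushScheduler:
    IDLE_FLUSH_SECONDS = 5

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn
        self._flush_lock = threading.Lock()     # allow only one concurrent flush
        self._last_write = time.monotonic()

    def after_write(self, unflushed_bytes, current_log_size):
        # called after every write: flush once ~50% of the log is waiting
        self._last_write = time.monotonic()
        if unflushed_bytes >= current_log_size * 0.5:
            self._try_flush()

    def on_timer_tick(self):
        # called periodically: flush if there were no writes for 5 seconds
        if time.monotonic() - self._last_write >= self.IDLE_FLUSH_SECONDS:
            self._try_flush()

    def _try_flush(self):
        # if a flush is already running, skip rather than queue another one
        if self._flush_lock.acquire(blocking=False):
            try:
                self._flush_fn()
            finally:
                self._flush_lock.release()
```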
Comments
So you moved away from COW to a log-based model? The COW model always seemed like an odd trade-off.
Couldn't you just switch over to the stock InnoDB engine or some other battle-hardened engine?
Tobi, No, I couldn't do that. See the full talk for details: http://skillsmatter.com/podcast/scala/the-db-disassembly-kit
"reason we want to do that is to ensure that we get consistent performance"
From what you wrote it looks like you actually want consistent throughput. Consistent performance can also mean consistent latency (which does not seem to be a goal).
Catalin, I am not sure that I understand the difference between your consistent perf, latency and throughput. Ideally, I want to have consistent response times overall. The three terms seem interchangeable in this context.
When in this flow does indexing happen? Does the indexer work with both the journal and stored data?
Brian, This is several levels down from indexing. The indexer doesn't know about any of this.
Sorry, I should have been more clear. When is the data available for reading by the higher level code such as the indexer? Does it have to be flushed first or is data in the journal read as well?
Brian, The data is immediately available, it can be read from the journal.
LMDB's first basic concept was to be a memory-mapped database. Every decision after that was driven by this choice. The use of COW was mandatory, because if you are using a writable mmap, and you make changes to the database, you cannot guarantee the order in which the OS will persist those changes to disk. The only viable approach was to ensure that when you write to the database, you never write to a page that might currently be in use. A side benefit of this approach is that a system crash can never corrupt the database, so restarts after a crash are instantaneous; no recovery procedure is needed. This approach also naturally leads you to MVCC; you get it basically for free.
The strategy of using the journal to turn all writes into sequential writes is generally a good one. There's no free lunch though - eventually you have to merge the journaled ops back into the main DB file, and that means you're still going to have to do a lot of random I/Os at that point. Of course you can consolidate a lot of the I/Os by batching the journal updates, but there will still be a large number of random I/Os left. (Look at the performance delta between batched and non-batched random writes in the LMDB microbenchmarks - that's how much performance headroom you have to work with.)
Howard, The journal solves another problem: it drastically simplifies free space management (I can modify the "same" page, not a different page), even though that page actually resides in the journal file. That means that for a lot of updates, I don't have to go up the tree to modify things. That saves me a not insignificant amount of I/O.
And yes, we do need to flush the journal file. But it won't be random I/O. We actually batch the writes from the journal file to the data file, and we do so in a sequential manner. So yes, we might actually be skipping sectors (and might jump ahead a lot), but it ain't random.
We also get the benefit that we need to modify any page only once, so for hot pages, we save a lot of I/O.
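If it helps to picture it, a flush along those lines might look roughly like this on a POSIX system (my own sketch, not Voron's code): keep the newest copy of each journaled page, then write the batch to the data file in ascending page order, so the writes move forward through the file instead of seeking randomly.

```python
import os

PAGE_SIZE = 4096   # hypothetical page size

def flush_journal_to_data_file(journal_entries, data_fd):
    """journal_entries: iterable of (page_number, page_bytes), oldest first."""
    latest = {}
    for page_no, page_bytes in journal_entries:
        latest[page_no] = page_bytes                   # the newest copy wins

    for page_no in sorted(latest):                     # ascending page order
        os.pwrite(data_fd, latest[page_no], page_no * PAGE_SIZE)

    os.fsync(data_fd)                                  # make the merge durable
```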
If I've understood what you're saying, that means that you write modified pages into your journal (as opposed to just logging operations/deltas). At merge time, you write the most up-to-date version of a page from the journal back to the main DB file. This is an update-in-place operation, so if the system crashes during this merge, the DB file is corrupted.
But I guess that's OK, because the merge is idempotent, so the next time you start up you can redo the merge. It sounds promising. The only hassle I see is in tracking which pages are live in the journal, to satisfy various read requests.
Howard, No, if the system crashes midway, that is not going to do anything bad; you can just restart it.
We do have to track pages in the journal; we keep a page translation table for that. It tends to be pretty small, since we flush quite often.
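One way to picture that page translation table (purely illustrative, not Voron's implementation): reads consult the table first and fall back to the data file, and entries are dropped once their pages have been flushed.

```python
class PageTranslationTable:
    """Maps page numbers to their latest location inside the journal."""

    def __init__(self):
        self._table = {}   # page_number -> (journal_file_id, offset_in_journal)

    def record(self, page_no, journal_id, offset):
        # called when a transaction commits a modified page to the journal
        self._table[page_no] = (journal_id, offset)

    def lookup(self, page_no):
        # None means the page is not in the journal and must be read from the data file
        return self._table.get(page_no)

    def evict_flushed(self, flushed_page_numbers):
        # after a flush, the data file has these pages, so drop them from the table
        for page_no in flushed_page_numbers:
            self._table.pop(page_no, None)
```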