Voron, LMDB and the external APIs, oh my!
One of the things that I really don’t like in LMDB is the API that is exposed to the user. Well, it is C, so I guess there isn’t much that can be done about it. But let’s look at the abstractions that are actually exposed to the user by looking at how you usually work with Voron.
using (var tx = Env.NewTransaction(TransactionFlags.ReadWrite))
{
    Env.Root.Add(tx, "key/1", new MemoryStream(Encoding.UTF8.GetBytes("123")));

    tx.Commit();
}

using (var tx = Env.NewTransaction(TransactionFlags.Read))
{
    using (var stream = Env.Root.Read(tx, "key/1"))
    using (var reader = new StreamReader(stream))
    {
        var result = reader.ReadToEnd();
        Assert.Equal("123", result);
    }
    tx.Commit();
}
This is a perfectly nice API. It is quite explicit about what is going on, and it gives you a lot of options with regards to how to actually make things happen. It also gives the underlying library about zero chance to do interesting things. Worse, it means that you have to know, upfront, if you want to do a read only or a read/write operation. And since there can be only one write transaction at any given point in time… well, I think you get the point. If your code doesn’t respond well to explicit demarcation between read/write, you have to create a lot of write transactions, essentially serializing pretty much your entire codebase.
Now, sure, you might have good command / query separation, right? So you have queries for reads and commands for writes, problem solved. Except that the real world doesn’t operate in this manner. Let us consider the trivial case of a user logging in. When a user logs in, we need to check the credentials, and if they are wrong, we need to record that so we can lock the account after 5 failed tries. That means we either always do the login in a write transaction (meaning only one user can log in at any time), or we start with a read transaction and then switch to a write transaction when we need to write.
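To make that concrete, here is a rough sketch of the login flow written against the explicit-transaction API above. The "users/..." keys and the PasswordMatches / ReadFailureCount helpers are invented for the example, and it assumes Read returns null when a key is missing; the point is only that the whole operation ends up inside the single write transaction:

// Sketch only: the login check is forced into a write transaction because we
// *might* need to record a failed attempt, so every login serializes behind
// every other login.
public bool TryLogin(string user, string password)
{
    using (var tx = Env.NewTransaction(TransactionFlags.ReadWrite))
    {
        using (var stored = Env.Root.Read(tx, "users/" + user + "/password-hash"))
        {
            if (stored != null && PasswordMatches(stored, password)) // helper, assumed
            {
                tx.Commit();
                return true;
            }
        }

        // Wrong credentials: bump the failure counter inside the same transaction.
        var failures = ReadFailureCount(tx, user) + 1; // helper, assumed
        Env.Root.Add(tx, "users/" + user + "/failures",
            new MemoryStream(Encoding.UTF8.GetBytes(failures.ToString())));
        tx.Commit();
        return false;
    }
}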
Neither option is really nice as far as I am concerned. Therefore, I came up with a different API (which is internally based on the one above). It now looks like this:
var batch = new WriteBatch();
batch.Add("key/1", new MemoryStream(Encoding.UTF8.GetBytes("123")), null);

Env.Writer.Write(batch);

using (var snapshot = Env.CreateSnapshot())
{
    using (var stream = snapshot.Read(null, "key/1"))
    using (var reader = new StreamReader(stream))
    {
        var result = reader.ReadToEnd();
        Assert.Equal("123", result);
    }
}
As you can see, we make use of snapshots & write batches. Those are actually ideas taken from LevelDB. A write batch is a set of changes that we want to apply to the database. We can add any number of changes to the write batch, and it requires no synchronization. When we want to actually write those changes, we call Writer.Write(). This will take the entire batch and apply it as a single transactional unit.
However, while it will do so as a single unit, it will also be able to merge concurrent calls to Writer.Write() into a single write transaction, increasing the actual concurrency we gain by quite a bit. The expected usage pattern is that you create a snapshot, do whatever you need to do when reading the data, including maybe adding/removing stuff via a WriteBatch, and finally you write it all out.
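For comparison, here is roughly how the login flow from before could look with the snapshot / write batch API. The keys and the PasswordMatches / ReadFailureCount helpers are the same invented ones as in the earlier sketch:

// Sketch only: the read side uses a snapshot, and the (rare) write side goes
// through a write batch, so the common case never takes a write transaction.
public bool TryLogin(string user, string password)
{
    using (var snapshot = Env.CreateSnapshot())
    {
        using (var stored = snapshot.Read(null, "users/" + user + "/password-hash"))
        {
            if (stored != null && PasswordMatches(stored, password)) // helper, assumed
                return true;
        }

        // Record the failure on the side.
        var failures = ReadFailureCount(snapshot, user) + 1; // helper, assumed
        var batch = new WriteBatch();
        batch.Add("users/" + user + "/failures",
            new MemoryStream(Encoding.UTF8.GetBytes(failures.ToString())), null);
        Env.Writer.Write(batch); // may be merged with other concurrent batches
        return false;
    }
}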
Problems with this approach:
- You can’t read stuff that you just added, because it hasn’t been added to the actual storage yet. (Generally not that much of an issue in our expected use case; see the sketch after these lists.)
- You need to worry about concurrently modifying the same value in different write batches. (We’re going to add an optimistic concurrency option for that purpose.)
Benefits of this approach:
- We can optimize concurrent writes.
- We don’t have to decide in advance whether we need read only or read / write.
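To make the first problem concrete, here is a small sketch. It only strings together the calls shown above, assumes (like the earlier sketches) that Read returns null for a missing key, and "key/2" is just an illustration:

var batch = new WriteBatch();
batch.Add("key/2", new MemoryStream(Encoding.UTF8.GetBytes("456")), null);

using (var snapshot = Env.CreateSnapshot())
{
    // The batch hasn't been written yet, so the snapshot can't see "key/2".
    Assert.Null(snapshot.Read(null, "key/2"));
}

Env.Writer.Write(batch); // only now does "key/2" become visible to new snapshots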
Comments
Rafal, Same thing we did in RavenDB a while ago. Take multiple concurrent transactions and merge all their writes. We have to change the API to do so, but I think it is worth it.
<3 for using sodding streams.
I'm in the JVM at the moment and the wrappers around any of the decent storage engines I'd want to use are all exposed as either large byte arrays or strings. Hiss.
Incidentally, in case it matters to you, this is exactly the type of post of yours that I love. You asked a while back. API design and discussions (cost-benefit type stuff) is really fun and interesting, and it's always good to see other viewpoints aside from my own.
The LMDB API was modeled after the BDB API. It was designed to allow rapid porting from BDB code. Since BDB is still the #1 embedded transactional key value store, it was a pragmatic choice.
Some things were tweaked for OpenLDAP's convenience, though. BDB doesn't treat read-only transactions in any special way, but since the majority of LDAP operations are reads, it made sense to tailor the LMDB API for reads. That works fine for us, and it's real world for LDAP.
Also, in your login example, those really are two separate actions - checking the credential is a read-only action. Recording a failed login is a write action, and there is no valid reason why the two steps should be contained in a single transaction. You picked a pretty good example here, since 90% of the use for LDAP is in authentication systems...
The failed login example would be better handled by publishing an event (LoginFailed) which would be routed to some subscriber responsible for recording the failure (as @HowardChu said) in a separate transaction.
@Udi but as I understand it, writing and reading are always done in separate transactions. WriteBatch doesn't do any reading and the snapshot can't write.
Rafal - the LevelDB API doesn't actually support transactions. In particular, transactions must allow reads within a txn to see what was written in that txn, while preventing anything outside the txn from seeing it. If you have a chain of dependent modifications, where each mod depends on effects of prior mods, you cannot support that using the Writebatch model. Likewise you can't do a simple Iterate + Modify loop where the modify actions alter the iteration scope. You can do that with real transactions, and most RDBMSs depend on this.
The WriteBatch model only supports blind writes - where what you're writing has no dependency on what already exists. Most RDBMS transactions that perform writes are read-modify-write operations.
This is why it's a bad idea to use LevelDB as a backend for an RDBMS (but the MariaDB folks are trying it anyway. Suckers.)
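[Editor's note: to illustrate the read-modify-write pattern Howard describes, here is a minimal sketch against the explicit-transaction API from the post. The "counters/visits" key is made up, and it assumes Read returns null for a missing key.]

// Sketch only: increment a counter atomically. The read and the dependent
// write live in one transaction, so no other writer can interleave between them.
using (var tx = Env.NewTransaction(TransactionFlags.ReadWrite))
{
    var current = 0;
    using (var stream = Env.Root.Read(tx, "counters/visits"))
    {
        if (stream != null)
            using (var reader = new StreamReader(stream))
                current = int.Parse(reader.ReadToEnd());
    }

    Env.Root.Add(tx, "counters/visits",
        new MemoryStream(Encoding.UTF8.GetBytes((current + 1).ToString())));
    tx.Commit();
}
// With a write batch, the read has to come from a snapshot taken before the
// batch is written, so two concurrent increments could both read the same old
// value - the "blind write" limitation described above.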
So, now I'm not sure if I get Ayende's idea. The LevelDB API requires you to know upfront if you'll be reading or writing, so it's no better than the original LMDB/Voron API in this aspect. And the original API had an option to do R/W transactions, which LevelDB doesn't. Apart from that, I think it will not improve write concurrency too much. After all, the underlying database is based on a double-buffering mechanism, so if you have two open read snapshots with two different versions no writing can be done, no matter how you shuffle and combine your transactions. So, the improvement in write performance can only be a result of these 'blind writes', which can be combined in any way because there's no reading in between.
Howard, We run into that, and while it wasn't trivial to make that happen, it was pretty easy to have a merge of snapshot & write batch, resulting in pretty much the same thing.
Udi, Great, now I have CQRS and messaging in my login page. And it doesn't handle the "last login time" for the success case, unless I make that into an event as well. There are many scenarios where pub/sub is never an issue (this blog, for example), and trying to introduce this in order to compensate for a feature issue would be bad.
Rafal, The ability to read from a snapshot and create writes on the side, then write them in a single batch is what allows us to actually merge multiple transactions into a single lock.
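[Editor's note: a very rough sketch of the merging idea described here, NOT Voron's actual implementation. ApplyToTransaction is a hypothetical helper, ConcurrentQueue comes from System.Collections.Concurrent, and error handling is omitted.]

// Batches queue up, and whichever caller gets the write lock applies every
// queued batch inside one write transaction.
private readonly ConcurrentQueue<WriteBatch> _pending = new ConcurrentQueue<WriteBatch>();
private readonly object _writeLock = new object();

public void Write(WriteBatch batch)
{
    _pending.Enqueue(batch);
    lock (_writeLock)
    {
        if (_pending.IsEmpty)
            return; // another caller already committed our batch for us

        using (var tx = Env.NewTransaction(TransactionFlags.ReadWrite))
        {
            WriteBatch current;
            while (_pending.TryDequeue(out current))
                ApplyToTransaction(tx, current); // hypothetical helper
            tx.Commit();
        }
    }
}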