Building data stores – Append Only
One of the interesting aspects of building a data store is that you run head-on into things that you would generally leave to the infrastructure. By far, most developers deal with concurrency by relegating that responsibility to a database.
When you write your own database, you have to build this sort of thing. In essence, we have two separate issues here:
- Maximizing Concurrency – do readers wait for writers? do writers wait for readers? do writers wait for writers?
- Ensuring Consistency – can I read uncommitted data? can I read partially written data?
As I mentioned in my previous post, there are two major options when building a data store: Transaction Log & Append Only. There is probably a better name for each, but that is how I know them.
This post is going to focus on append only. An append only store is a very simple idea, in both concept and implementation. It requires that you always append to the file. That makes things a bit finicky with regard to the data structures you can use, since typical persistent data structures rely on being able to modify data on the disk. But once you get over that issue, it is actually very simple.
An append only store works in the following manner:
- On startup, the data is read in reverse, trying to find the last committed transaction.
- That committed transaction contains pointers to locations in the file where the actual data is stored.
- A crash in the middle of a write just means garbage at the end of the file, which you skip when scanning for the last committed transaction (a sketch of this backward scan follows the list).
- In memory, the only thing that you have to keep is the last committed transaction.
- A reader with a copy of the last committed transaction can execute independently of any other reader / writer. It will not see any changes made by writers after it started, but it also doesn’t have to wait for any writers.
- Concurrency control is simple:
- Readers don’t wait for readers
- Readers don’t wait for writers
- Writers don’t wait for readers
- There can be only one concurrent writer
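Here is a minimal sketch of that startup scan, in C#. The record layout (a magic commit marker, the offset of the data root, and a checksum) is entirely hypothetical; a real store would use its own framing:

```csharp
using System;
using System.IO;

public static class Recovery
{
    const uint CommitMarker = 0xC0FFEE01; // hypothetical magic value
    const int CommitRecordSize = 16;      // marker(4) + root offset(8) + checksum(4)

    // Scan backwards from the end of the file for the newest commit record
    // whose checksum is intact. Garbage from a torn write simply fails the
    // check, and we keep scanning toward the start of the file.
    public static long FindLastCommittedRoot(string path)
    {
        byte[] file = File.ReadAllBytes(path); // fine for a sketch; real code would read in pages
        for (int pos = file.Length - CommitRecordSize; pos >= 0; pos--)
        {
            if (BitConverter.ToUInt32(file, pos) != CommitMarker)
                continue;
            long rootOffset = BitConverter.ToInt64(file, pos + 4);
            uint storedChecksum = BitConverter.ToUInt32(file, pos + 12);
            if (storedChecksum == Fnv1a(file, pos, 12)) // commit record is intact
                return rootOffset; // where the last committed transaction's data lives
        }
        return -1; // empty file, or no committed transaction survived
    }

    static uint Fnv1a(byte[] data, int from, int count)
    {
        uint hash = 2166136261;
        for (int i = from; i < from + count; i++)
            hash = (hash ^ data[i]) * 16777619;
        return hash;
    }
}
```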
The last rule, that there can be only one concurrent writer, is a natural fallout of the append only model: only one thread can write to the end of the file at a given point in time. That is actually a performance boost, not something that slows the database down, as you might expect.
The reason for that is pretty obvious, once you start thinking about it. Writing to disk is a physical action, and the head can only be in a single place at any given point in time. By ensuring that all writes go to the end of the file, we gain a big perf advantage, since we don’t do any seeks.
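To make the single-writer scheme concrete, here is a sketch of what the write path might look like (again, the names and layout are hypothetical): one lock serializes writers at the end of the file, while readers just grab a reference to the latest committed root and never block.

```csharp
using System.IO;
using System.Threading;

public class AppendOnlyStore
{
    readonly object _writeLock = new object(); // there can be only one concurrent writer
    readonly FileStream _file;
    long _lastCommittedRoot; // published to readers after each commit

    public AppendOnlyStore(string path)
    {
        _file = new FileStream(path, FileMode.OpenOrCreate,
                               FileAccess.ReadWrite, FileShare.Read);
    }

    public void Commit(byte[] transactionData)
    {
        lock (_writeLock) // writers wait only for other writers
        {
            _file.Seek(0, SeekOrigin.End); // always append, so no seeks between writes
            long root = _file.Position;
            _file.Write(transactionData, 0, transactionData.Length);
            _file.Flush(true); // force the bytes to the device before publishing
            Interlocked.Exchange(ref _lastCommittedRoot, root); // make it visible to readers
        }
    }

    // Readers never take the write lock; they work from a stable snapshot.
    public long CurrentSnapshot()
    {
        return Interlocked.Read(ref _lastCommittedRoot);
    }
}
```

A reader that grabbed an older root keeps working against a consistent snapshot; it simply doesn't see commits published after it started.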
Comments
Good to see you posting again.
I'm not sure if it's appropriate for your application, but there is an optimization you can use to possibly do bulk writes instead of doing them one at a time, to improve write performance/concurrency.
When a transaction calls Commit(), instead of taking the single writer, writing that transaction to disk, and then releasing it (causing other transactions to repeat this step in sequence in a blocking fashion, waiting for IO to complete), you could do async or bulk writes.
This technique has been used before in the InnoDB database, by having a Prepare() method simply reserve the amount of space needed in the data file by advancing the 'next write position' pointer by the size of the data to be written (so the next transaction can start after our reserved space).
Then the actual Commit() method does the writing in one hit (e.g. several transactions are written to disk at once).
Of course, this means you lose transaction safety: when Commit() returns, the data the client asked to commit may still be in memory, pending a bulk write.
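Roughly, the reservation idea would look like this (just a sketch, and the names are mine, not InnoDB's):

```csharp
using System.IO;
using System.Threading;

public class GroupCommitLog
{
    readonly FileStream _file;
    long _nextWritePosition; // the shared 'next write position' pointer

    public GroupCommitLog(FileStream file)
    {
        _file = file;
        _nextWritePosition = file.Length;
    }

    // Prepare() only reserves room: advance the shared pointer so the next
    // transaction starts after our slot. No IO happens here.
    public long Prepare(int size)
    {
        return Interlocked.Add(ref _nextWritePosition, size) - size;
    }

    // Commit() copies the transaction into its reserved slot; several
    // transactions can be outstanding at once.
    public void Commit(long slot, byte[] data)
    {
        lock (_file)
        {
            _file.Seek(slot, SeekOrigin.Begin);
            _file.Write(data, 0, data.Length);
        }
    }

    // One flush then persists all of them in a single hit, and this is
    // exactly the caveat: nothing is durable until this runs.
    public void FlushAll()
    {
        _file.Flush(true);
    }
}
```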
Andrew,
Yeah, that is a nice optimization, but I consider the downside to be pretty bad.
Isn't a single write/flush atomic?
What I mean is: if you take all your records in a buffer, then write it to a file and flush, is it written as a whole or not?
If so, you can have a sequential id (those can be made thread safe with CAS operations) in each record of the append only file, and when you read backwards you check whether you got the biggest id.
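In .NET, the CAS-style primitive for that would be Interlocked.Increment; something like this (a sketch):

```csharp
using System.Threading;

public static class RecordIds
{
    static long _lastId;

    // Thread safe without locks: each writer stamps its record with a
    // unique, strictly increasing id, so the backwards scan can simply
    // pick the record with the biggest id.
    public static long Next()
    {
        return Interlocked.Increment(ref _lastId);
    }
}
```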
Uriel,
Writes to a single sector are supposed to be atomic. The problem is that HDs no longer really use sectors.
I find it safest to just assume that writes aren't really atomic at the HD level until fsync is called.
In Windows, disk IO is abstracted to a level where you'll never be able to say "this byte is now physically stored on disk at this very moment"; you leave that to Windows, to the low level disk subsystem, and to the HDD hardware and its command queue. So this comes down to bundling commands to the HDD in such a way that the disk head steps as little as possible due to YOUR actions. You can never anticipate another process getting the CPU and causing a disk step as well, and you shouldn't; that's what the low level subsystems are for. So you bundle as many commands as possible in such a way that head movement is avoided as much as possible _inside your bundle_, and that's all you can do, nothing more. What Andrew suggests seems to me impossible due to the nature of how Windows handles IO.
About the article: what if you delete elements from the data? RDBMSs use their own filesystem inside files, using pages (which are typically the size of a disk block or smaller, and fit in a disk block without being split up), and have to have a form of re-use to avoid fragmentation. Appending to a file ultimately runs you into the situation where deletes fragment the file a lot. Unless your db is insert only, which is not what you are striving for, IMHO.
Frans,
In Windows you can call FlushFileBuffers to ensure that the data is on the disk.
Without that facility, you couldn't build databases on Windows.
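For reference, here is roughly how that looks from .NET (a sketch; in .NET 4.0 and later, FileStream.Flush(true) routes through FlushFileBuffers, or you can P/Invoke the Win32 call directly):

```csharp
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class Durability
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FlushFileBuffers(SafeFileHandle handle);

    public static void MakeDurable(FileStream file)
    {
        file.Flush(true); // flushes the .NET and OS buffers down to the device
        // or, equivalently, straight at the Win32 API:
        // FlushFileBuffers(file.SafeFileHandle);
    }
}
```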
Yes, in a multi-application system, it is entirely possible that two different applications would queue requests to different parts of the disk at the same time.
This reduces the benefit of seekless writes, but doesn't eliminate it.
Deletes are done by removing the data from the index and writing the new metadata.
Yes, that wastes space, and append only data stores usually handle that by running a compaction process every now and then.
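A naive compaction pass might look something like this (a sketch; it assumes the index can tell you which records are still live, and real code would also have to rewrite the index, since every offset changes):

```csharp
using System.Collections.Generic;
using System.IO;

static class Compaction
{
    // liveRecords maps the offset of each record the index still
    // references to that record's length.
    public static void Compact(string path, IDictionary<long, int> liveRecords)
    {
        byte[] old = File.ReadAllBytes(path); // fine for a sketch
        using (var fresh = File.Create(path + ".compacting"))
        {
            foreach (var record in liveRecords)
                fresh.Write(old, (int)record.Key, record.Value); // copy only live data
            fresh.Flush(true);
        }
        File.Delete(path);
        File.Move(path + ".compacting", path); // swap in the compacted file
    }
}
```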
CouchDB has great documentation; just linking to the free online book:
http://books.couchdb.org/relax/appendix/btrees
http://books.couchdb.org/relax/
It just helps to appreciate how RavenDB brings document + search so very nicely packaged for .NET developers...
Ajai
Have SSDs changed anything?
OmariO,
SSDs means that random reads are much faster, but random writes are still more expensive than sequential writes.
Also, one of the good points of an append only data store is that you don't have to lock the data store or shut it down to do a backup. Backups are really fast and cheap.
Thanks a lot for the series and for pointing me to very interesting topics.