The building blocks of a database: Transactional & persistent hash table
I have been working on a managed storage solution for RavenDB in an off-again, on-again fashion for a while now. The main problem is that doing something like this is not particularly hard, but it is complex. You either have to go with a transaction log or an append-only model.
There is more than enough material on the matter, so I won't touch that. The problem is that building that is going to take time, and probably a lot of it. I decided that it is better to have something than nothing, and scaled back the requirements.
The storage has to have:
- ACID
- Multi threaded
- Fully managed
- Support both memory and files
- Fast
- Easy to work with
The last one is important. Go and take a look at the codebase of any of the available databases. They can be… pretty scary.
But something has to give, so I decided that to make things easier, I am not going to implement indexing on the file system. Instead, I'll store the data on the disk, and keep the actual index completely in memory. There is an overhead of roughly 16 bytes per key plus the key itself; let us round it to 64 bytes per key held. Holding 10 million keys would cost ~600 MB. That sounds like a lot, because it is. But it is actually not too bad. It isn't that much memory for modern hardware. And assuming that our documents are 5 KB in size, we are talking about 50 GB for the database size, anyway.
Just to be clear, the actual data is going to be on the disk; it is the index that we keep in memory. And once we have that decision, the rest sort of follows on its own.
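To make that concrete, here is a minimal sketch (my own illustration, not the actual code), assuming a single data file and ignoring concurrency, transactions and crash recovery: the in-memory dictionary holds only the key and the position/length of the value inside the data file.

using System.Collections.Generic;
using System.IO;

// Minimal sketch: keys and file offsets live in memory, values live on disk.
public class InMemoryIndexStore
{
    private readonly Dictionary<string, (long Position, int Length)> index =
        new Dictionary<string, (long Position, int Length)>();
    private readonly FileStream dataFile;

    public InMemoryIndexStore(string path)
    {
        dataFile = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }

    public void Add(string key, byte[] value)
    {
        dataFile.Seek(0, SeekOrigin.End);
        long position = dataFile.Position;
        dataFile.Write(value, 0, value.Length);
        index[key] = (position, value.Length); // only the pointer is kept in memory
    }

    public byte[] Read(string key)
    {
        if (index.TryGetValue(key, out var entry) == false)
            return null;
        var buffer = new byte[entry.Length];
        dataFile.Seek(entry.Position, SeekOrigin.Begin);
        dataFile.Read(buffer, 0, entry.Length); // sketch only: ignores partial reads
        return buffer;
    }
}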
It is an intentionally low-level interface; mostly what it gives you is an Add/Read/Remove interface, but it gives you multiple tables, the ability to do key and range scans, and full ACID compliance (including crash recovery).
And the fun part, it does so in 400 lines!
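To give a feel for what that surface looks like, here is an illustrative API sketch; the names here are made up for the example and are not the actual ones:

using System;
using System.Collections.Generic;

// Illustrative only - the real API and names may differ.
public interface ITable
{
    void Add(string key, byte[] data);
    byte[] Read(string key);
    void Remove(string key);

    // key scans and range scans are served by the in-memory index
    IEnumerable<string> ScanKeysInRange(string fromKey, string toKey);
}

public interface IStorage : IDisposable
{
    ITable GetTable(string name); // multiple tables
    void Commit();                // full ACID: the batch either commits or it doesn't
}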
Comments
I remember years ago reading how Interbase does versioned transactions so that people can read existing data with repeatable read whilst someone else is still in a transaction that has updated the data.
Is that the approach you meant when you said "an append only model"?
@Peter Morris
I don't know anything about Interbase, and just a little bit of CouchDB, but this is how I understand CouchDB works.
CouchDB uses immutable trees for its tables. When you modify the table, a new tree is created (creating only one new leaf and only the branches from the leaf up to the root). This way you can ask CouchDB for the current root and then at your own pace browse through that snapshot, while the rest of the world races by, creating updates by creating new trees.
cont. (heh heh..)
CouchDB uses an append-only file to store your changes. At the end of the file, your data, the new leaf and all the new branches will be written.
'Snapshotting' doesn't require an immutable structure and an immutable structure doesn't require append-only, but with my feeble mind I wouldn't want to implement them any other way.
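A rough sketch of that path-copying idea, using a plain immutable binary tree (nothing CouchDB-specific, and ignoring balancing): an insert returns a new root, copies only the nodes on the path to the key, and shares every untouched subtree with the old root.

// Sketch: path copying on an immutable binary tree (illustrative only).
public sealed class Node
{
    public readonly string Key;
    public readonly byte[] Value;
    public readonly Node Left;
    public readonly Node Right;

    public Node(string key, byte[] value, Node left, Node right)
    {
        Key = key;
        Value = value;
        Left = left;
        Right = right;
    }
}

public static class ImmutableTree
{
    // Returns a NEW root; only the nodes on the path to the key are copied.
    public static Node Insert(Node root, string key, byte[] value)
    {
        if (root == null)
            return new Node(key, value, null, null);

        int cmp = string.CompareOrdinal(key, root.Key);
        if (cmp == 0) // overwrite, reusing both existing subtrees
            return new Node(key, value, root.Left, root.Right);
        if (cmp < 0)
            return new Node(root.Key, root.Value, Insert(root.Left, key, value), root.Right);
        return new Node(root.Key, root.Value, root.Left, Insert(root.Right, key, value));
    }
}

A reader that grabbed the old root before the insert still sees its snapshot unchanged; in an append-only file the same trick works by writing the new nodes at the end and then publishing the position of the new root.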
looks cool!
I have a question though
public Stream CreateTemporaryStream()
{
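    // (body elided in this comment; per a later comment, it builds the path with Path.GetTempFileName(), which creates a file on disk)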
}
who's gonna delete that file?
Dear Ayende,
what exactly makes you think that you as a (unquestionably exceptionally talented) developer and architect will not run into the same troubles that all those huge clumsy databases ran into after a while? There is a big codebase not only because all other coders suck.
Don't get me wrong: Sometimes small projects change the whole way we see things (Ruby is the most prominent example).
How about, though, starting with a more "humble" idea like: "Let's skip some things and see how far we get with it and whether it is usable for some scenarios."
Simplifying DBs is good, I support that. Conveying or implying the message that they are complicated without any reason is misleading.
Or is it a bit of "Not Invented Here" again?
Peter,
I am not aware of the Interbase model, but an append-only model makes MVCC trivial to work with.
Torvin,
That is done in the ReplaceAtomically function.
Rickrock,
Mostly, because I am not trying to target the same feature set.
And by not trying to hit the same feature set, I have drastically simplified my problem set.
I don't recall exactly how it works but this link looks as though it explains it quite well. It might give you some additional ideas...
conferences.embarcadero.com/.../32280
Something sounds wrong in this code:
You are writing commands to the log twice. When do you write commands to data?
Jesús,
Please note that the codebase there has undergone a lot of changes. I'll have a lot to say about this shortly, but you can look at the current codebase here:
github.com/ayende/ravendb/tree/master/Raven.Munin/
I spent time going back into Ayende's archives starting from the beginning while he was on jail duty and coding. It reminded me a little of a science fiction short story I once read about a child who was a gifted composer whom they kept apart from all other people in order that he should not hear any music composed by others and have his creativity tainted (whether through imitation or by consciously trying to avoid imitation).
I think Ayende is the type who should try and rewrite everything from scratch.
For example, even Microsoft re-implemented JavaScript regex for IE9, as mentioned here:
blogs.msdn.com/.../...-in-internet-explorer-9.aspx
If the indexes are kept in memory, it will take a significant amount of time to rebuild them every time.
Can you comment on it, please?
Dmyrii,
Not really. We tested it with millions of items, and it just works.
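(For a sense of why the rebuild is cheap: it is a single sequential pass over the data file. Here is a hedged sketch, assuming each record is stored as [key length][key bytes][value length][value bytes]; that layout is an assumption for illustration, not necessarily the real on-disk format.)

using System.Collections.Generic;
using System.IO;
using System.Text;

// Rebuild the in-memory index with a single sequential scan of the data file.
public static class IndexRebuilder
{
    public static Dictionary<string, long> Rebuild(string dataFilePath)
    {
        var index = new Dictionary<string, long>();
        using (var stream = File.OpenRead(dataFilePath))
        using (var reader = new BinaryReader(stream, Encoding.UTF8))
        {
            while (stream.Position < stream.Length)
            {
                int keyLength = reader.ReadInt32();
                string key = Encoding.UTF8.GetString(reader.ReadBytes(keyLength));
                int valueLength = reader.ReadInt32();
                long valuePosition = stream.Position;
                stream.Seek(valueLength, SeekOrigin.Current); // skip the value bytes
                index[key] = valuePosition;                   // keep only the pointer
            }
        }
        return index;
    }
}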
My first question would be exactly the same as Torvin already wrote: In "Path.Combine(basePath, Path.GetFileName(Path.GetTempFileName()));", who is going to delete that empty file created in the TEMP directory by Path.GetTempFileName()? Despite its name, GetTempFileName() doesn't just generate a name, it creates an actual file you are supposed to delete...
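(For reference, a hedged illustration of the difference, not the blog's actual code: Path.GetTempFileName() creates a real zero-byte file in %TEMP%, while Path.GetRandomFileName() only produces a name, so the latter avoids the leaked file.)

using System.IO;

public static class TempFileNames
{
    public static string BuildTempPath(string basePath)
    {
        // Path.GetTempFileName() creates an actual zero-byte file in the TEMP directory;
        // reusing only its name leaves that file behind:
        // string leaky = Path.Combine(basePath, Path.GetFileName(Path.GetTempFileName()));

        // Path.GetRandomFileName() just generates a name, no file is created on disk:
        return Path.Combine(basePath, Path.GetRandomFileName());
    }
}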