Get thou out of my head, damn idea
Sometimes I get ideas, and they just won’t leave my head no matter what I do.
In this case, I decided that I wanted to see what it would take to implement an event store in terms of writing a fully managed version.
I am not really interested in the actual event store; I care a lot more about the implementation idea that I had (I/O queues in append-only mode, if you care to know).
After giving it some thought, I managed to create a version that allows me to write the following code:
var diskData = new OnDiskData(new FileStreamSource(), "Data");

var data = JObject.Parse("{'Type': 'ItemCreated', 'ItemId': '1324'}");
var sp = Stopwatch.StartNew();
Parallel.For(0, 1000*10, i =>
{
    var tasks = new Task[1000];
    for (int j = 0; j < 1000; j++)
    {
        tasks[j] = diskData.Enqueue("users/" + i, data);
    }
    Task.WaitAll(tasks);
});

Console.WriteLine(sp.ElapsedMilliseconds);
Admittedly, it isn't really interesting client code, but it is plenty good enough for what I need, and it allowed me to check something really interesting: just how far would I have to go to actually get really good performance? As it turned out, not that far.
This code writes 10 million events, and it does so in under a minute (on my laptop, with an SSD drive). Just to give you some idea, that is more than 600 MB of events, and about 230 events per millisecond, or about 230 thousand events per second. Yes, that is 230,000 events/sec.
The limiting factor seems to be the disk, and I have some ideas on how to improve that. I am still only getting roughly 12 MB/s, so there is certainly room for improvement.
How does this work? Here is the implementation of the Enqueue method:
public Task Enqueue(string id, JObject data)
{
    var item = new WriteState
    {
        Data = data,
        Id = id
    };

    writer.Enqueue(item);
    hasItems.Set();
    return item.TaskCompletionSource.Task;
}
In other words, this is a classic producer/consumer problem.
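WriteState and the queue themselves are barely worth showing, but for completeness, here is a minimal sketch of the supporting pieces as Enqueue uses them (the exact types here are an assumption, not the real code):

public class WriteState
{
    public WriteState()
    {
        TaskCompletionSource = new TaskCompletionSource<object>();
    }

    public string Id { get; set; }
    public JObject Data { get; set; }

    // Completed by the writer thread once the event has been flushed to disk.
    public TaskCompletionSource<object> TaskCompletionSource { get; private set; }
}

// The producer side: a lock-free queue plus an event that wakes the writer thread.
private readonly ConcurrentQueue<WriteState> writer = new ConcurrentQueue<WriteState>();
private readonly ManualResetEventSlim hasItems = new ManualResetEventSlim(false);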
The other side is reading the events from the queue and writing them to disk. There is just one thread doing that, and it is always appending to the end of the file. Moreover, because of the way it works, we gain the ability to batch a lot of items together into a stream of really nice I/O calls that optimize the actual disk access. Only when we have finished with a batch of items and flushed them to disk do we complete the tasks, so the fun part is that for all intents and purposes we get this while preserving the transactional guarantees of the system. Once the Enqueue task has returned, we can be sure that the data is fully saved on disk.
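To make that concrete, here is a rough sketch of that single writer thread (illustrative only; AppendEntry and the file field are stand-ins, not the real code):

// Sketch of the consumer side: drain the queue into a batch, append everything
// to the end of the file, flush once for the whole batch, and only then
// complete the tasks that Enqueue handed back to the callers.
private void ConsumeWrites()
{
    while (true)
    {
        hasItems.Wait();
        hasItems.Reset();

        // Drain everything that is currently queued into one batch.
        var batch = new List<WriteState>();
        WriteState item;
        while (writer.TryDequeue(out item))
        {
            batch.Add(item);
        }

        if (batch.Count == 0)
            continue;

        // Append-only writes, one after another, at the end of the file.
        foreach (var state in batch)
        {
            AppendEntry(file, state.Id, state.Data); // hypothetical append helper
        }

        // A single flush covers the whole batch.
        file.Flush();

        // Only now, with the data on disk, do the Enqueue tasks complete.
        foreach (var state in batch)
        {
            state.TaskCompletionSource.SetResult(null);
        }
    }
}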
That was an interesting spike, and I wonder where else I would be able to make use of something like this in the future.
Yes, those are pretty small events, and yes, that is a fake test, but the approach seems to be very solid.
And just for fun, with absolutely no optimizations whatsoever, no caching, no nothing, I am able to load 1,000 events per stream in less than 10 ms.
Comments
Ayende,
Yes, the mechanism is age old and works very well (lots of systems use similar mechanisms: SQL Server, Bitcask in Riak, Cassandra). A few comments on your implementation. In the code I have, you are flushing once per second or when the queue is empty. The second one is a great optimization for lowering request latency when you don't have tons of requests.
if (hadWrites) { if ((DateTime.UtcNow - lastWrite).TotalSeconds > 1) {
While this will give you very high throughput, it also introduces a massive amount of latency under the load you gave it (assuming a dual-append commit model, you are talking about 1-2 seconds of latency per transaction, assuming durability). If you look in the Event Store code, there is actually a heuristic for working around this by looking at the queue and at how long your disk takes to fsync.
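The general shape of that heuristic is something like this (a sketch of the idea only, not the actual Event Store code):

// Flush when the queue has drained (low load, keep latency down); otherwise
// flush at an interval tied to how long the disk actually takes to fsync,
// so each flush covers a big enough batch to be worth its cost.
static bool ShouldFlush(int queueLength, DateTime lastFlushUtc, TimeSpan observedFsyncCost)
{
    if (queueLength == 0)
        return true;

    return DateTime.UtcNow - lastFlushUtc > observedFsyncCost;
}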
Another comment worth mentioning is that you use
streamSource.Flush(file);
From looking at the code you sent me, flush is implemented by:
This means that your test is not durable and is most likely running solely in memory (kernel-level OS memory, but memory nonetheless). Turning on a real flush will knock down your performance quite a bit. The other thing you have to be very careful about here is that consumer hardware will very often deliberately lie when you tell it to flush and will say "sure, I did it" even though it didn't. :)
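For reference, the distinction in .NET is between the two FileStream flush overloads; only the second one asks the OS to write its cache through to the device:

file.Flush();     // flushes the FileStream's internal buffer to the OS cache only
file.Flush(true); // .NET 4.0+ overload: also tells the OS to flush its cache to the physical disk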
In general, though, yes, reading and writing sequential streams is very, very fast.
I should add that another beauty of the model is if you DON'T want to flush: you can just allow read-aheads.
Is the code in question publicly available? It would be interesting to see it in context.
The write model should also work rather nicely with old magnetic-tape storage. If we use the numbers provided by wikipedia on the UNISERVO (128 chars / inch density), then in order to sustain the write speed you observed, based on a message size of 50 chars, you would need 456.4 metres of tape per second. Hence the tape would have to move at a speed of roughly Mach 1.34.
Assuming those tapes were available in lengths of 730 metres, you would have to change the tape every 1.6 seconds.
The title should read "Get thee..." or "Get thyself...". Just sayin'
Now you're thinking like me, which is dangerous - you will eventually end up writing your storage engine for RavenDB or something, heh ;-)
Not to be snarky, but 'just getting throughput' is usually the easy part, especially with the excellent language support you get in .NET. Getting everything to work with low latencies, transactionally, and without corrupting data when the power goes out is decidedly non-trivial. Or is this just an exercise in what the upper bound of the performance can be?
I agree with Roy. People make careers out of writing production quality transactional storage engines. The benchmark numbers aren't interesting to me on a storage backend hacked in a weekend.
The compaction strategy and failure handling concern me. I don't think it would do well in a busy system over time.
All of these intricate details affect benchmarks quite a bit. It's easy to get favourable numbers by taking liberties with things that are required by production-grade software but can be ignored in a benchmark.
That's why weekend hacks always outperform well-established software.
"That's why weekend hacks always out perform well established software"
Might want to rephrase that just a touch.
@Roy, Ayende blogged about those cases and more in a few posts last week about how certain features in RavenDb were implemented.
Roy, The entire thing works with ACID guarantees. And the latency is at most 200 ms under very heavy load, with read latency that is very low.
Nick, No, that is actually pretty good. It is easy to get good perf when you don't have to consider all of the other stuff that you want done.
Still disagree that weekend hacks ALWAYS outperform established software. I don't buy into context-insensitive absolutes. Lazy thinking.
Ayende,
This is how most systems are built under the covers. Check out Cassandra, Bitcask, or even Esent, which you use in RavenDb. Bitcask is OSS; you can see that it does essentially what the code you have written does (well, with more concerns, but the same general idea). The Event Store works this way as well and is OSS (BSD license); you could even use core bits of it in RavenDb.
"The entire thing work with ACID guarantees"
There is no C here. If by ACID you mean durable, then yes, I agree durability has been reached. What do you intend to make C? For us, our indexes are C. As you know from Raven, indexes that are C can be expensive to get.
Even doing something as simple as keeping a current version number for, say, the document you are writing to, in order to provide basic optimistic concurrency, is actually reasonably difficult and slow to provide on top once we start talking about not being able to fit all your keys in memory. Especially since, in order to maintain your consistency, it would either have to run on the same thread as your write or you would end up with lots of very intricate locking code.
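To illustrate what that involves, here is a toy sketch of an expected-version check, assuming the easy case where every stream's current version fits in a single in-memory dictionary (the names and shape here are illustrative, not anyone's actual code):

// Toy sketch only: an expected-version check done on the writer thread.
// The hard problems called out above (versions that don't fit in memory,
// coordinating this check with the write path) are exactly what this ignores.
private readonly Dictionary<string, long> currentVersions = new Dictionary<string, long>();

private void CheckExpectedVersion(string streamId, long expectedVersion)
{
    long current;
    currentVersions.TryGetValue(streamId, out current);

    if (current != expectedVersion)
        throw new InvalidOperationException(
            "Expected version " + expectedVersion + " but stream '" + streamId + "' is at " + current);

    currentVersions[streamId] = current + 1;
}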
To be clear on my consistency comment:
"The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof."
I guess if you have no rules at all to keep consistent, then you are consistent, but most systems have some rules they actually need to keep consistent. :)
Greg,
+1 about consistency.
From my understanding, I didn't see any isolation (the I in ACID) implemented either.
If I'm incorrect, please point it out to me, as I'd like to learn how it's implemented, but as far as I could tell, I couldn't see any.
Also, from my understanding, it doesn't have atomicity (the A in ACID) either, since a batch doesn't fail as a unit. As far as I can tell, you can get partial writes.
Kelly & Greg,
Atomicity - We don't support the notion of a batch, so either we have successfully written an event to the disk, or we didn't. There isn't a middle ground.
Consistency - We don't make the newly written event visible to the app code until we have saved it to disk; therefore, we are consistent.
Isolated - Same as before, an event can't touch another event, or impact it in any way.
Durable - Data goes to disk, and you can verify when it is actually flushed there.
Atomicity - We don't support the notion of a batch, so either we have successfully written an event to the disk, or we didn't. There isn't a middle ground.
Most of the time you have to support a batch.
Consistency - We don't make the newly written event visible to the app code until we have saved it to disk; therefore, we are consistent.
This is an interesting version of consistency. There are basically no rules. Add a small rule: I can put an expected version (optimistic concurrency) on the stream when writing, and things get quite a bit trickier, especially with supporting batches.
To be clear, what you are describing here has no form of dependency between any two things being written or read.
What I read has no dependency on what I write. When I write multiple things, they can have no dependency on each other.
Well, yeah, you can make this very fast. It's just a log file. You can get it even faster very simply: put it on 5 drives. No write or read from any drive can possibly have a dependency on a read/write from another drive.
Ayende,
That statement about isolation doesn't make sense to me. Perhaps you can elaborate?
If isolation is guaranteed, how are you isolating reads from writes?
I'll observe that the use of ACID here seems to be pure marketing. Since there is no batching, this is obviously about single records that are written into a fairly simple store. That's basic consistency. The common notion of transactions starts at two or more things that need to happen at the same time.
If isolation guarantees were provided, then if I were to start a read operation where I want to iterate over a stream of 100 million events to make some sort of calculation that may take an hour, there would be no possible way for me to retrieve events committed via other transactions during that time period. Isolation by a cursor position isn't good enough either, because you could guess what the next sequence number is and get an event that didn't exist at the time the read began.
Kelly, Isolation here works by having the reads always go through the idToPos dictionary, which contains the file positions. Only after we flush to disk do we update that dictionary. Therefore, reads see the "old" state, and only after the disk flush ("tx commit") does the new state become visible.
Kelly, Let us say that you want to read an existing event stream. It has 10,000,000 events in it, and it takes an hour to iterate over it. During that hour, you have what is effectively snapshot isolation over it. You can only see it as it was when you started the read.
This is because of the way the system works. You get the latest file offset from the in-memory data structure, then you start moving backward. The file is immutable, and since you are always moving backward, there is no chance of you seeing other data.
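A sketch of what such a read looks like (illustrative only; ReadEntryAt and the entry layout are stand-ins for the real thing):

public IEnumerable<JObject> ReadStream(string id)
{
    long pos;
    if (idToPos.TryGetValue(id, out pos) == false)
        yield break; // the stream does not exist yet

    // 'pos' is a snapshot: idToPos is only updated after a disk flush, so
    // anything committed after this point is simply never reachable from here.
    while (pos > 0)
    {
        var entry = ReadEntryAt(pos); // hypothetical helper: reads one entry at a file offset
        yield return entry.Data;
        pos = entry.PreviousPosition; // walk backward through the immutable file
    }
}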
+1 Clemens