Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 6 min | 1006 words

There are some features that on completion, just made my day/week/month. This is one of them. I’ve only just started recovering from the marathon of build this feature (this is written on Friday, the feature was completed on Sunday, around 11 AM, after about 12 hours or so of work).

Why am I so excited, and why did it merit such efforts? Transaction lock handoff is a more accurate name to our version of early lock release, which is a feature that I have been wanting for over four years.

Let me try to explain why this is important. Voron is a single writer storage engine, which means that there can only be a single write transaction at any given point in time. A lot of that is mitigated by transaction merging, which means that we can do a lot of the preparation work ahead of time, and only send the processed work to be done as part of the transaction. But it does mean that there a single write transaction, and under load, it means that we have the following pattern:

image

The wavy line is # of writes / sec, and you can see that it is going up & down like crazy. The reason for that is that whenever we actually need to commit a transaction, we can’t continue processing requests. They have to wait for the next transaction to start . And that means that the old one has to complete, which require us to finish doing a write all the way to the disk.

So basically, the drops in performance happens whenever we have to wait for I/O. But we have to wait for the transaction to complete before we can start the next one, so we are effectively bottlenecked.

Early lock release is a technique which alleviate the problem. In effect, instead of waiting for the I/O to complete before starting the next transaction, we start it immediately, in parallel with the I/O work required to commit the previous transaction. The key part here is that we don’t report success on the first transaction until the commit has been successful, and that the 2nd transaction may fail because the first one had (this sounds bad, until you realize that failure to write to disk is pretty much always catastrophic for a database). 

If you look at the previous post (from Jan 2014!) about this, you’ll see that we actually implement that at the time, and rolled it back because it wasn’t doing much for us. I’ll have another post to explain what we are doing different now that allows us to take full advantage of this.

The idea with early lock release is that the transaction will free its lock as soon as it is done, and allow additional transactions to hold that lock while waiting for I/O. This isn’t actually what we have done.

The idea of transaction merging is deeply rooted into the design of RavenDB 4.0, and it isn’t something that we can (or want) to change. About 98% of all write work in RavenDB will always go through the transaction merger. That means that just releasing the lock isn’t really going to do much for us. The transaction merger thread will be busy waiting for the I/O to complete and then start a new transaction (re-acquiring the lock), so there isn’t actually any benefit here.

Instead, we implemented a different system. When a transaction (let’s call it tx #1) is over, it checks whatever there is additional work pending, and it there is, tx #1 generate a new transaction (tx #2). The second transaction has the same in memory state as tx #1, including all the modifications that tx #1 has made. More crucially, tx #1 also hand off all of the locks that it holds to tx #2, and then triggers the async process of writing tx #1 data to the journal.

In the meantime, tx #@ gets to run and operate (and doesn’t have to compete for any locks). Tx #2 will process work until tx #1 has completed its I/O work. At that point, tx #2 will call back into tx #1, letting it complete its commit process, and then we can  the cycle repeats, if there is even more work pending, tx #2 will generate tx #3, transfer the lock to it and initiate an async process of writing to the journal. Tx #3 will run until tx #2 is done with its I/O, and so forth.

Here it what this looks like:

image

The thread on the left is the transaction merger, processing incoming write requests. The thread on the right is the one doing the async write process. It is interesting to note that while we call it an async write process, the actual time we spend writing to disk is relatively low, we spend most of our time actually preparing to write. That involves running diffs against old version, compressing the data, etc.

The end result is that we get several very important properties:

  • We split the transaction processing work and the writes.
  • We get automatic adjustment of the system based on actual load (if the disk is slow, we’ll try to do more work and have larger merged transactions, for example).
  • The transaction merger doesn’t have to compete for the transaction lock.
  • We have managed to increase parallelism in a previously highly serial process.

The details of the change are gnarly, because we had to make sure that pieces of the code that assumed that we are running in a serial fashion can run concurrently, but the performance boost is over 45% under heavy load, and the behavior will auto adjust to handle the specific circumstances at hand, trying to keep all pieces of the system running at full throttle.

time to read 1 min | 98 words

In this talk from the RavenDB conference, Federico Lois is discussing building Codealike,a collaborative platform for developers analytics.

Codealike plugins in Visual Studio, Eclipse and Chrome, track developers while they code and perform analytic calculations at the millisecond level. For such write heavy workloads and using RavenDB as the main and only database was not without challenge. In this talk, we will reveal how we built and scaled such a solution, how we were able to improve performance with Voron and glance at our own mistakes and architectural choices down the line.

time to read 2 min | 228 words

image

This is probably very much related to this post. Our office manager has been sick for about a week a while back, and that led to an interesting observation on my part. There was milk in the office fridge.

Now, one of the (minor) things that she does is make sure that there are such essential things as coffee and milk are stocked.

She was sick for a week, and yet there was still milk in the fridge.

I’m not sure how it got there, I assume that the milk didn’t develop self awareness and the desire to be consumed in large quantities (those two seems to be quite unlikely to develop at the same time) and therefore manage to snick into our fridge.  So someone must have made that happen, but the really nice thing about it is that I have no idea how, or who.

I can guess the why, and I’m deeply appreciative (see attached image Smile).

But I do wonder if I need to find out who got the milk, or decide that if it works and you don’t know how, write a unit test that confirms it will continue to work and move to the next bug?

time to read 1 min | 115 words

In this talk from the RavenDB conference, Elemar Júnior is talking about CQRS and using RavenDB for event souring.

CQRS stands for Command Query Responsibility Segregation. That is, that command stack and query stack are designed separately. This leads to a dramatic simplification of design and potential enhancement of scalability.

Events are a new trend in software industry. In real-world, we perform actions and these actions generate a reaction. Event Sourcing is about persisting events and rebuilding the state of the aggregates from recorded events.

In this talk I will share a lot of examples about how to effective implementing CQRS and Event Sourcing with RavenDB

time to read 4 min | 720 words

This post is the story of RavenDB-6230, or as it is more commonly known as: “Creating auto-index on non-existent field breaks querying via Id”. It isn’t a big or important bug, and it has very little real world impact. But it is an interesting story because it shows one of the hardest things that we deal with, not an issue with a specific feature, but the behavior of the system as a whole, especially when we have multiple things that may affect the end result.

Let us see what the bug actually is. We have the following query:

image

And this works and gets you the right results. Under the cover, it will select the appropriate index (and create one if it isn’t there) to query and get the right results.

Now, issue the following query on a non existent field:

image

And issue the first query again. You’ll get no results.

That is a bit surprising, I’ll admit, but it makes absolute sense when you break it up to component parts.

First, a query by id can be answered by any index, but if you don’t have an index, one will be created for you. In this case, it will be on the “__document_id” field, since we explicitly queried on that. This isn’t typically done explicitly, which is important to understand the bug.

Then, we have another query, on another field, so we generate an index on that as well. We do that while taking into account the historical behavior of the queries on the server. However, we ignore the “__document_id” field because all indexes already contain it, so it is superfluous. That means that we have an index with one explicit field (Nice_Doggy_Nice, in this case) and an implicit one (__document_Id). Which works great, even though in one case it is implicit and the other explicit, there is no actual difference in how we treat them.

So what is the problem? The problem is that the field Nice_Doggy_Nice doesn’t actually exists. So when it comes the time to actually index documents using this index, we read the document and index that, but find that we have nothing to index. At this point, we have only a single field to index, just the document id, but as it is an implicit field, and we have nothing else, we skip indexing that document entirely.  The example I used in the office is that you can’t get an answer for “when was the last time you had given birth” if you ask a male (except Schwarzenegger, in which case the answer is 1994).

So far, it all makes sense. But we need to introduce another feature into the mix. The RavenDB query optimizer.

That component is responsible for routing dynamic queries to the most relevant index, and it is doing that with the idea that we should direct work to the biggest index around, because that would make it the most active, at which point we can retire the smaller indexes (which are superfluous once the new wider index is up to date).

All features are working as intended so far. The problem is that the query optimizer indeed selects the wider index, and it is an index that has filtered all the results, so the query by id returns nothing.

Everything works as designed, and yet the user is surprised.

In practice, there are several mitigating factors here. The only way you can get this issue is if you never made any queries on any other (valid) fields on the documents in question. If you have made even a single such query, you’ll not be able to reproduce it. So you have to really work hard at it to get it to fail. But the point isn’t so much the actual bug, but pointing how how multiple unrelated behaviors can combine to cause a bit of a problem.

Highly recommended reading: How Complex Systems Fail – Cook 2000

time to read 1 min | 138 words

In this post from the RavenDB conference, Hagay Albo talks about substantial performance gain as a result of using RavenDB.

oin a real uplift experience with Hagay Albo, the CTO of the Zap/Yellow Page Group in Israel, in which he explains how his team was able to take a legacy (slow and hard to modify) group of sites and make them easier to work with, MUCH faster and greatly simplified the operational environment.

By prioritizing high availability, flexible data modeling and focusing on raw speed Zap was able to reduce its load times by Two Orders of Magnitudes. Using RavenDB as the core engine behind Zap's new sites had improved site traffic, reduced time to market and made it possible to implement the next-gen features that were previously beyond reach.

time to read 5 min | 884 words

One of the key rules in optimization work is that you want to avoid work as much as possible. In fact, any time that you can avoid doing work that is a great help to the entire system. You can do that with caching, buffering, pooling or many other such common patterns.

With Voron, one of our most common costs is related to writing to files. We are doing quite a lot of work around optimizing that, but in the end, this is file I/O and it is costly.

A big reduction in the cost of doing such I/O is to pre-allocate the journal files. That means that instead of each write extending the file, we ask the operation system to allocate it to its full expected size upfront. This saves time and also ensures that the OS has a chance to allocate the entire file in as few fragments as it possible can.

However, كل كلب له يومه (every dog has its day), and eventually a journal has outlived its usefulness, which means that it is time to make a hotdog. Or, as the case may be, delete the now useless journal file.

Of course, eventually the current journal file will be full, and we’ll need a new journal file, in which case we’ll ask the OS to allocate us a new one, and pay the cost of doing all of this I/O and the cost of file allocations.

Hm… that seems pretty stupid, isn’t it, when you think about the whole system like that…

Instead we now reuse those journals. We rely on the fact that file rename is atomic in both Windows and Posix, and so we can avoid expensive allocation calls and reuse the buffers.

Here is what this looks like, when doing heavy writes benchmark:

image

It is important to note that we also have to do some management here (to only keep pending journals for a period of time if they aren’t being used) but also need to handle a very strange case. Because we are now reusing a valid journal file, we now have a case where we might read valid transactions, but ones that are obsolete. This means that we need to be aware that beyond just garbage, we might have to encounter some valid data that is actually invalid. That made us tighten our journal validation routine by quite a bit. 

There is also another advantage of this approach is that this also plays very well with the underlying hardware. The reuse of the already allocated files means that the disk has to do a lot less work, it reduces fragmentation and it allows much faster responses overall. According to research papers, the difference can be a factor of 4 difference on modern SSD drives. This is a really good thing, since this means that this approach has wide applicability across mass storage devices (SSD, HDD, etc). I actually had a meeting with a storage company to better understand the low level details of how a disk manages the bits, and some of this behavior is influenced by those discussions.

I’m ignoring a lot of previous work that we have done around that (aligned writes, fixed sizes, pre-allocation, etc) of course, and just focusing on the new stuff.

Some of that only applies to that particular manufacturer disks, but a lot of that has broader applicability. In short, the idea is that if we can keep the amount of writes we do to a few hot spots, the disk can recognize that and organize things so this would be optimized. You can read a bit more about this here, where it discusses the notion of multiple internal storage tiers inside a disk. The idea is that we provide the disk with an easily recognizable pattern of work that it can optimize. We looked at using the disk low level options to tell it directly what we expect from it, but that is both hard to do and will only work in specific brand of disks. In particular, with cloud storage, it is very common to just lose all such notions of being able to pass hints to the disk itself, even while the underlying storage could handle it. (In the previous presentation, this is call I/O tagging and latency / priority hints).

Instead, by intentionally formatting our I/O in easily recognizable pattern, we have much higher applicability and ensure that the Right Thing will happen. Sequential writes, in particular (the exact case for journals) will typically hit a non volatile buffer and stay there for a while, letting the disk optimize its I/O behavior even further.

Another good read on this is here, where it talks about StableBuffer (you can ignore all the other stuff about decomposing and reoredering I/O), just the metrics about how much a focused write like that can help is very good.

Other resource also indicate that this is an optimal data access pattern, preserving the most juice from the drive and giving us the best possible performance.

time to read 1 min | 196 words

In this talk from the RavenDB conference, Elemar Júnior is talking about the differences between relational and document databases, and how you can utilize RavenDB for best effect.

I’ll hint that the answer to the question in the title is: Yes, RavenDB.

For the last 40 years or so, we used relational databases successfully in nearly all business contexts and systems of nearly all sizes. Therefore, if you feel no pain using a RDBMS, you can stay with it. But, if you always have to work around your RDBMS to get your job done, a document oriented database might be worth a look.

RavenDB is a 2nd generation document database that allows you to write a data-access layer with much more freedom and many less constraints. If you have to work with large volumes of data, thousands of queries per second, unstructured/semi-structured data or event sourcing, you will find RavenDB particularly rewarding.

In this talk we will explore some document database usage scenarios. I will share some data modeling techniques and many architectural criteria to help you to decide where safely adopt RavenDB as a right choice.

time to read 1 min | 102 words

We recently made some big changes in how we handle writing to the Voron journal. As part of that, we introduced a subtle bug. It would only happen on specific data, and only if you were unlucky enough to hit it with the right time.

It took a lot of effort to track that done, but here is the offending line:

image

Sometimes, it just isn’t plain that the code is snigger to itself and thinking “stupid”.

time to read 1 min | 131 words

In this talk from the RavenDB conference, Federico Lois is discussing the kind of performance work and optimizations that goes into RavenDB.

Performance happens. Whether you're designed for it or not it doesn’t matter, she is always invited to the party (and you better find her in a good mood). Knowing the cost of every operation, and how it distributes on every subsystem will ensure that when you are building that proof-of-concept (that always ends up in production) or designing the latest’s enterprise-grade application; you will know where those pesky performance bugs like to inhabit. In this session, we will go deep into the inner working of every performance sensitive subsystem. From the relative safety of the client to the binary world of Voron.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}