Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 5 min | 812 words

One of the most exciting new features that got into RavenDB 2.0 is the notion of bulk inserts. Unlike the “do batches in a loop” approach, we created an optimized, hand-crafted code path that avoids much of the overhead of the standard RavenDB saves (which do a lot, but come at a cost).

In particular, we made sure that we can parallelize the operation between the client and the server, so we don’t have to build the entire request in memory on the client and then wait for it all to be in memory on the server before we can start the operation. Instead, we have a fully streamed operation from end to end.

Here is what the API looks like:

    using (var bulkInsert = store.BulkInsert())
    {
        for (int i = 0; i < 1000*1000; i++)
        {
            bulkInsert.Store(new User {Name = "Users #" + i});
        }
    }

This uses a single request to the server to do all the work. And here are the results:

image

This API has several limitations:

  • You must provide the id at the client side (in this case, generated via hilo).
  • It can't take part in DTC transactions.
  • If you want to update existing documents, you need to state that explicitly (otherwise, it will throw).
  • Put triggers will execute, but the AfterCommit will not.
  • This bypasses the indexing memory prefetching layer.
  • Changes() will not be raised for documents inserted using bulk-insert.
  • There isn't a single transaction for the entire operation, rather, this is done in batches and each batch is transactional on its own.

This is explicitly meant to drop a very large number of records to RavenDB very fast, and it does this very well, typically an order of magnitude or more faster than the “batches in a loop”  approach.

A note about the last limitation, though. The whole idea here is to reduce, as much as possible, the costs of actually doing a bulk insert. That means that we can’t keep a transaction of millions of items open. Instead, we periodically flush the transaction buffer throughout the process. Assuming the default batch size of 512 documents, that means that an error in one of those documents will result in the entire batch of 512 being rolled back, but will not roll back previously committed batches.

This is done to reduce transaction log size and to make sure that even during a bulk insert operation, we can index the incoming documents while they are being streamed in.
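To make the batch semantics concrete, here is a toy model of them. It is not RavenDB code: StoreInBatches and its validate callback are names I made up for the illustration, and "committing" is just appending to a list. It only shows which documents survive when one of them fails.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of per-batch transactions (hypothetical, not RavenDB's implementation).
static List<string> StoreInBatches(IEnumerable<string> docs, int batchSize, Func<string, bool> validate)
{
    var committed = new List<string>();
    var batch = new List<string>(batchSize);
    foreach (var doc in docs)
    {
        if (!validate(doc))
        {
            // The failing document takes its whole in-flight batch with it...
            batch.Clear();
            // ...but batches that were already committed stay committed.
            return committed;
        }
        batch.Add(doc);
        if (batch.Count == batchSize)
        {
            committed.AddRange(batch); // flush: this batch is its own transaction
            batch.Clear();
        }
    }
    committed.AddRange(batch); // final, possibly partial, batch
    return committed;
}

// 1,200 documents, batch size 512, and document number 1,100 is invalid.
var docs = Enumerable.Range(1, 1200).Select(i => "users/" + i);
var stored = StoreInBatches(docs, 512, d => d != "users/1100");
Console.WriteLine(stored.Count); // 1024: two full batches survive, the third is rolled back
```

With these numbers, the first two batches (1,024 documents) are kept, and the 75 valid documents that shared a batch with the bad one are lost along with it.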

time to read 4 min | 617 words

I mentioned before that the hard part in building RavenDB now isn’t the actual features that we add, it is the intersection of features that is causing problems.

Case in point, let us look at the new referenced document indexing, which allows you to index data from a related document, and have RavenDB automatically keep it up to date. This was a feature that was requested quite often. Implementing that was complex, but straightforward. We now track which documents are referenced by each document, and we know how to force reindexing of a document if a document it was referencing was changed.

So far, so good. It was actually quite easy for us to force re-indexing, all we had to do was to force the referencing document etag to change, and the indexing code would pick it up and re-index that. Simple & easy.
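Here is roughly what that looks like as a toy model. None of this is RavenDB internals: the dictionaries, Put and RecordReference are hypothetical names, and etags are plain counters. The point is just that writing a referenced document also bumps the etag of whoever references it.

```csharp
using System;
using System.Collections.Generic;

// Toy model (hypothetical names, not RavenDB internals).
long lastEtag = 0;
var etags = new Dictionary<string, long>();                   // document id -> current etag
var referencedBy = new Dictionary<string, HashSet<string>>(); // referenced id -> ids that reference it

// Remember that 'referencing' loaded 'referenced' while it was being indexed.
void RecordReference(string referencing, string referenced)
{
    if (!referencedBy.TryGetValue(referenced, out var set))
        referencedBy[referenced] = set = new HashSet<string>();
    set.Add(referencing);
}

// Store a document, then "touch" every document that references it,
// so the indexing code sees a new etag and re-indexes those documents too.
void Put(string id)
{
    etags[id] = ++lastEtag;
    if (referencedBy.TryGetValue(id, out var referencing))
        foreach (var other in referencing)
            etags[other] = ++lastEtag;
}

Put("customers/1");                            // customers/1 gets etag 1
Put("invoices/1");                             // invoices/1 gets etag 2
RecordReference("invoices/1", "customers/1");
Put("customers/1");                            // customers/1 -> 3, invoices/1 touched -> 4
Console.WriteLine(etags["invoices/1"]);        // 4
```

Note that invoices/1 now has a new etag even though nobody wrote to it, and that is exactly the kind of side effect that bites the rest of the system.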

Except… we use Etags for a lot more than just indexing. For example, we use etags for replication.

Now, imagine, if you will, two nodes set up as master/master. Both nodes have an index that uses LoadDocument to refer to another document.

We are now in a stable state, both nodes have all documents.  We modify a document, which causes that document to be replicated to the second node. That triggers (on both servers) re-indexing of the referencing document.  And that, in turn, would cause both servers to want to replicate the new “change” to the other one. What is worse, RavenDB is smart enough to detect that this isn’t a conflict, so what we actually get is an infinite distributed loop.

Or, another case, pre fetching. As you probably know, an important optimization in RavenDB is the ability to prefetch documents from disk and not have to wait for them. We even augment that by putting incoming documents directly into the prefetching queue, never needing to hit the disk throughout the process.

Except that when we designed prefetching, there was never the idea of having holes in the middle. But touching a document (updating its etag), causes just that. Let us assume that we have three documents (items/1, items/2, items/3).

We are saving items/1 and items/3 as part of our standard work. items/1 is being referenced by items/2. That means that on disk, we would have the following etags: (4 – items/1, 5 – items/2, 6 – items/3). However, the prefetching queue will have just (4 – items/1, 6 - items/3). This is a hole, and we didn’t use to have those (we might have gaps, but there were never any documents in those gaps). So we had to re-write the prefetching behavior to accommodate that (along the way, we made it much better, but still).
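The difference between a gap and a hole is easy to show in code. This is a simplified sketch using the integer etags from the example above; FindHoles is a hypothetical helper, not something in the RavenDB codebase. A missing etag is only a hole if there is an actual document sitting at that etag on disk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns the etags that the in-memory queue skipped
// even though a document with that etag exists on disk.
static List<long> FindHoles(IReadOnlyList<long> queuedEtags, ISet<long> etagsOnDisk)
{
    var holes = new List<long>();
    for (int i = 1; i < queuedEtags.Count; i++)
    {
        // Walk every etag strictly between two neighbours in the queue.
        for (long etag = queuedEtags[i - 1] + 1; etag < queuedEtags[i]; etag++)
        {
            if (etagsOnDisk.Contains(etag))
                holes.Add(etag); // a real document that the queue does not know about
        }
    }
    return holes;
}

var onDisk = new HashSet<long> { 4, 5, 6 }; // items/1, items/2 (touched), items/3
var queue = new List<long> { 4, 6 };        // what was pushed into the prefetching queue
Console.WriteLine(string.Join(", ", FindHoles(queue, onDisk))); // 5
```

If etag 5 did not exist on disk, the same queue would just have a gap, and the function would return nothing.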

Then there were issues relating to optimizations, it turned out that allowing a lot of holes was also not a good idea, so we changed our etag implementation to reduce the chance of holes, and…

It is interesting work, but it can be quite a hurdle when we want to do a new feature.

And then there are the really tough questions. When we load another document during the indexing of another document, what operation should we pass to the read trigger that decides if we can or cannot see this index? Is it Index operation, which means that you won’t be able to load versioned documents? Or is it Load documents, which would allow us to read versioned documents, but bring the question of how to deal with this situation? Add a new option? And make each read trigger choose its own behavior?

It is a sign of maturity, and I really like the RavenDB codebase, but it is increasing in complexity.

time to read 2 min | 305 words

The following code will not result in the expected output:

using(var mem = new MemoryStream())
{
    using(var gzip = new GZipStream(mem, CompressionMode.Compress, leaveOpen:true))
    {
        gzip.WriteByte(1);
        gzip.WriteByte(2);
        gzip.WriteByte(1);
        gzip.Flush();
    }
    
    using (var gzip = new GZipStream(mem, CompressionMode.Compress, leaveOpen: true))
    {
        gzip.WriteByte(2);
        gzip.WriteByte(1);
        gzip.WriteByte(2);
        gzip.Flush();
    }

    mem.Position = 0;

    using (var gzip = new GZipStream(mem, CompressionMode.Decompress, leaveOpen: true))
    {
        Console.WriteLine(gzip.ReadByte());
        Console.WriteLine(gzip.ReadByte());
        Console.WriteLine(gzip.ReadByte());
    }


    using (var gzip = new GZipStream(mem, CompressionMode.Decompress, leaveOpen: true))
    {
        Console.WriteLine(gzip.ReadByte());
        Console.WriteLine(gzip.ReadByte());
        Console.WriteLine(gzip.ReadByte());
    }
}

Why? And what can be done to solve this?
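One likely explanation, offered as my own reading and not as the official answer: the decompressing GZipStream pulls data from the underlying stream in chunks, so the first reader can consume bytes that belong to the second compressed block, and the second reader then starts from the wrong position. One way around that (an assumption on my part, there are others) is to frame each block with a length prefix, so every reader is handed exactly its own bytes. WriteBlock, ReadBlock and FillBuffer are hypothetical helper names.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Compress one block into its own buffer, then write it with a 4-byte length prefix.
static void WriteBlock(Stream output, byte[] data)
{
    using var compressed = new MemoryStream();
    using (var gzip = new GZipStream(compressed, CompressionMode.Compress, leaveOpen: true))
        gzip.Write(data, 0, data.Length);
    // gzip is disposed at this point, so the gzip trailer has been written.
    output.Write(BitConverter.GetBytes((int)compressed.Length), 0, 4);
    compressed.Position = 0;
    compressed.CopyTo(output);
}

// Read the length prefix, take exactly that many bytes, and decompress only those.
static byte[] ReadBlock(Stream input)
{
    var header = new byte[4];
    FillBuffer(input, header);
    var compressed = new byte[BitConverter.ToInt32(header, 0)];
    FillBuffer(input, compressed);
    using var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
    using var result = new MemoryStream();
    gzip.CopyTo(result);
    return result.ToArray();
}

// Stream.Read may return fewer bytes than requested, so loop until the buffer is full.
static void FillBuffer(Stream input, byte[] buffer)
{
    int offset = 0;
    while (offset < buffer.Length)
    {
        int read = input.Read(buffer, offset, buffer.Length - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;
    }
}

using var mem = new MemoryStream();
WriteBlock(mem, new byte[] { 1, 2, 1 });
WriteBlock(mem, new byte[] { 2, 1, 2 });
mem.Position = 0;
Console.WriteLine(BitConverter.ToString(ReadBlock(mem))); // 01-02-01
Console.WriteLine(BitConverter.ToString(ReadBlock(mem))); // 02-01-02
```

The cost is an extra copy per block, which is usually a fair price for readers that cannot step on each other.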

time to read 2 min | 350 words

One of the really annoying things about doing production readiness testing is that you often run into the same bug over & over again. In this case, we have fixed memory obesity issues over and over again.

Just recently, we had the following major issues that we had to deal with:

Overall, not fun.

But the saga ain’t over yet. We had a test case, we figured out what was going on, and we fixed it, damn it. And then we went to prod and figured out that we didn’t fix it after all. I’ll spare you the investigative story, suffice to say that we finally ended up figuring out that we are to blame for optimizing for a specific scenario.

In this case, we have done a lot of work to optimize for very large batches (import scenario), and we set the Lucene merge factor at a very high level (way too high, as it turned out). That was perfect for batching scenarios. But not so good for non batching scenarios. That resulted in us having to hold in memory a lot of Lucene segments. Segments aren’t expensive, but they each have their own data structures. That works, sure, but when you start having tens of thousands of those, we are back in the previous story, where relatively small objects come together in unexpected ways to kill us in nasty ways. Reducing the merge factor meant that we would keep only a very small number of segments, and avoided the problem entirely.

The best thing about this? I had to chase a bunch of false leads and ended up fixing what would have been a separate memory leak that would have gone unnoticed otherwise.

And now, let us see if stopping work at quarter to six in the morning is conducive to proper rest, excuse me, I am off to bed.

time to read 1 min | 121 words

image

To start with, I don’t have any association with them, I got nothing (no money, free license, promise of goodwill or anything else at all) from SciTech Software (the creators of .NET Memory Profiler).

This tool has been instrumental in figuring out our recent memory issues. I have tried dotTrace Memory, JustTrace and WinDBG, but this tool outshone them all and was able to point us quite quickly to the root cause that we had to deal with, and from there, it was quite easy to reach a solution.

Highly recommended.

time to read 3 min | 593 words

I am pretty sure that this feature is going to be at the very top of the chart when people talk about 2.0 features that they love. This is a feature that we didn’t plan for in 2.0. But we got held up by the memory issues, and I really needed to do something awesome rather than trolling through GBs of dump files. So I decided to give myself a little gift and do a big cool feature as a reward.

Let us imagine the following scenario:

image

We want to search for invoices based on the customer name. That one is easy enough to do, because you can use the approach outlined here. First do a search on the customer name, then do a search based on the customer id. In most cases, this actually results in better UX, because you have the chance to do more stuff to find the right customer.

That said, a common requirement that also pops up is the need to sort based on the customer name. And that is where things get complex. You need to do things like multi map reduce, and it gets hairy (or gets bald, depending on whether you tear at your hair often or not).

Let us look at another example:

image

I want to look for courses that have students named Oren.  There are solutions for that, but they aren’t nice.

Here is where we have the awesome feature, indexing related documents:

image

And now we can query things like this:

image

And obviously, we can do all the rest, such as sort by it, do full text searching, etc.

What about the more complex example? Students & Courses? This is just as easy:

image

And then we can query it on:

image

But wait! Yes, I know what you are thinking. What about updates? RavenDB will take care of all of that for you behind the scenes. When the referenced document changes, the value will be reindexed automatically, meaning that you will get the updated value easily.

image

image

This feature is going to deal with a lot of pain points that people are currently getting, and I am so excited I can barely sit.

time to read 7 min | 1352 words

Well, we got it. Dear DB, get your hands OFF my memory (unless you really need it, of course).

The actual issue was so hard to figure out because it was not a memory leak. It exhibited all of the signs of one, sure, but it was not.

Luckily for RavenDB, we have a really great team, and the guy who provided the final lead is Arek, from AIS.PL, who does a really great job. Arek managed to capture the state in a way that showed that a lot of the memory was held by the OptimizedIndexReader class, to be accurate, about 2.45GB of it. That made absolutely no sense, since OIR is a relatively cheap class, and we don’t expect to have many of them.

Here is the entire interesting part of the class:

   2: public class OptimizedIndexReader<T> where T : class
   3: {
   4:     private readonly List<Key> primaryKeyIndexes;
   5:     private readonly byte[] bookmarkBuffer;
   6:     private readonly JET_SESID session;
   7:     private readonly JET_TABLEID table;
   8:     private Func<T, bool> filter;
   9:  
  10:     public OptimizedIndexReader(JET_SESID session, JET_TABLEID table, int size)
  11:     {
  12:         primaryKeyIndexes = new List<Key>(size);
  13:         this.table = table;
  14:         this.session = session;
  15:         bookmarkBuffer = new byte[SystemParameters.BookmarkMost];
  16:     }

As you can see, this isn’t something that looks like it can hold 2.5GB. Sure, it has a collection, but the collection isn’t really going to be that big.  It may get to a few thousand, but it is capped at around 131,072 or so. And the Key class is also small. So that can’t be it.

There was a… misunderstanding in the way I grokked the code. I assumed that we would have one OIR with a collection of 131,072 items. No, the situation was a lot more involved. When using map/reduce indexes, we would have as many of the readers as we would have (keys times buckets). When talking about large map/reduce indexes, that meant that we might need tens of thousands of the readers to process a single batch. Now, each of those readers would usually contain just one or two items, so that wasn’t deemed to be a problem.

Except that we have this thing on line 15. BookmarkMost is actually 1,001 bytes. With the rest of the reader, let us call this an even 1KB. And we had up to 131,072 of those around, per index. Now, we weren’t going to hang on to those guys for a long while, just until we were done indexing. Except… Since this took up a lot of memory, this also meant that we would create a lot of garbage memory for the GC to work on, that would slow everything down, and result in us needing to process larger and larger batches.  As the size of the batches would increase, we would use more and more memory. And eventually we would start paging.

Once we did that, we were basically in slowville, carrying around a lot of memory that we didn’t really need. If we were able to complete the batch, all of that memory would instantly turn to garbage, and we could move on. But if we had another batch with just as much work to do…

And what about prefetching? Well, as it turned out, we had our own problems with prefetching, but they weren’t related to this. Prefetching simply made things so fast that they served the data to the map/reduce index at a rate fast enough to expose this issue, ouch!
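To put a number on that, here is the back-of-the-envelope math, using only the figures above (roughly 1KB per reader, up to 131,072 readers per index). The variable names are mine, and this is the worst case at the cap, not a measurement.

```csharp
using System;

const int approxReaderSizeInBytes = 1024; // "an even 1KB" per reader, as rounded above
const int maxReadersPerIndex = 131072;    // the cap mentioned above

// Worst-case memory held by readers for a single index while a batch is in flight.
long bytesPerIndex = (long)approxReaderSizeInBytes * maxReadersPerIndex;
Console.WriteLine(bytesPerIndex / (1024 * 1024) + " MB"); // 128 MB
```

That is per index, and all of it is short-lived, which is exactly the kind of allocation pattern that keeps the GC busy.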

We probably still need to go over some things, but this looks good.

time to read 2 min | 390 words

In the past few days, it sometimes felt like RavenDB is a naughty boy who wants to eat all of the cake and leave none for others.

The issue is that under a certain set of circumstances, RavenDB memory usage would spike until it would consume all of the memory on the machine. The problem is that we were pretty sure what the root cause was: it is the prefetching data that is killing us. Proven by the fact that when we disable that, we seem to be operating fine. And we did find quite a few such issues. And we got them fixed.

And still the problem persists… (picture torn hair and head banging now).

To make things worse, in our standard load tests, we couldn’t see this problem. It was our dog fooding tests that actually caught it. And it only happened after a relatively long time in production. That sucked, a lot.

The good news is that I eventually sat down and wrote a test harness that could pretty reliably reproduce this issue. That narrowed down things considerably. This issue is related to map/reduce and to prefetching, but we are still investigating.

Here are the details:

  • Run RavenDB on a machine that has at least 2 GB of free RAM.
  • Run the Raven.SimulatedWorkLoad, it will start writing documents and creating indexes.
  • After about 50,000 – 80,000 documents have been imported, you’ll begin seeing memory rise rapidly, to use as much free memory as you have.

On my machine, it got to 6 GB before I had to kill it. I took a dump of the process memory at around 4.3GB, and we are analyzing this now. The frustrating thing is that the act of taking the mem dump dropped the memory usage to 1.2GB.

I wonder if we aren’t just creating so much memory garbage that the GC just lets us consume all available memory. The problem with that is that it gets so bad that we start paging, and I don’t think the GC should allow that.

The dump file can be found here (160MB compressed), if you feel like taking a stab at it. Now, if you’ll excuse me, I need to open WinDBG and see what I can find.

time to read 1 min | 99 words

Well, as the year draws to a close, it is that time again, I got older, apparently. Yesterday marked my 31st trip around the sun.

To celebrate, I decided to give the first 31 people a 31% discount for all of our products.

This offer applies to:

This also applies to our support & consulting services.

All you have to do is to use the following coupon code: goodbye-2012

Enjoy the end of the year, and happy holidays.

time to read 2 min | 269 words

We just finished doing a big optimization in RavenDB, and one of the things that we needed to do was to store additional (internal) information so we could act upon it later on. If you must know, we now keep track of stats during indexing and can select the appropriate indexing approach based on the amount of data that we have available.

The details about this aren’t that important. What is important is that this is a piece of data that is used by RavenDB to make decisions. That means that just about the worst thing that we could possibly do is leave things at this state:

Think about what will happen in production, when you have an annoyed (and tired) ops team trying to figure out what is going on. Having a black box is the worst thing that you could possibly do, because you give the admin absolutely no input. And remember, you are going to be the one on call when the support phone rings.

One of the very final touches that we did was to add a debug endpoint that will expose those details to the user, so we could actually inspect them at runtime, and in production.  We have a lot of those, some are intended for monitoring purposes, such as the /admin/stats or the /databases/db-name/stats endpoints, some are meant for troubleshooting, such as the /databases/db-name/logs?type=error endpoint and some are purely for debugging purposes, such as /databases/db-name/indexes/index/name?debug=keys which gives you the stats about all the keys in a map/reduce index.

Trust me, you are going to need those, at some point.
