Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 457 words

In 2009, I decided to make a pretty big move. I decided to take RavenDB and turn that from a side project into a real product. That was something that I actually had a lot of trouble with. Unlike the profiler suite, which is a developer tool, and has a relatively short time to purchase, building a database was something that I knew was going to be a lot more complex in terms of just getting sales.

Unlike a developer tool, which is a pretty low risk investment, a database is something that is pretty significant, and that means that it would take time to settle into the market, and even if a user starts developing with RavenDB, it is usually 3 – 6 months minimum just to get to the part where they order the license. Add that to the cost of brining a new product to market, and…

Anyway, it wasn’t an easy decision. Today I was looking at some reports when I noticed something interesting. The following is the breakdown of our product based revenue since the first sale of NHibernate Profiler. Note that there is no doubt that NH Prof is a really good product for us. But it is actually pretty awesome that RavenDB is at second place.

image

This is especially significant in that the profilers has several years of lead time in the market over RavenDB. In fact, running the numbers, until 2011, we sold precious few licenses of RavenDB. In fact, here are the sales numbers for the past few years:

image

Obviously, the numbers for 2013 are still not complete, but we have already more than surpassed 2012, and we still have a full quarter to go.

For that matter, looking at the number just for 2013, we see:

image

So NH Prof is still a very major product, but RavenDB is now our top performing product for 2013, which makes me a whole lot better.

Of course, it also means that we probably need to get rid of a few other products, in particular, LLBLGen, Linq to Sql and Hibernate profilers don’t look like they are worth the trouble to keep them going. But that is a matter for another time.

time to read 22 min | 4333 words

A while ago I got a comment to a post of mine “if you [ in here the commenter is talking to another commenter ] think good code does need comments you are obviously unqualified to be in this conversation”. And I wanted to demonstrate that this is a very false statement.

Let us look at the following piece of code.

   1: private long GetNewLength(long current)
   2: {
   3:     DateTime now = DateTime.UtcNow;
   4:     TimeSpan timeSinceLastIncrease = (now - _lastIncrease);
   5:     if (timeSinceLastIncrease.TotalSeconds < 30)
   6:     {
   7:         _increaseSize = Math.Min(_increaseSize * 2, MaxIncreaseSize);
   8:     }
   9:     else if (timeSinceLastIncrease.TotalMinutes > 2)
  10:     {
  11:         _increaseSize = Math.Max(MinIncreaseSize, _increaseSize / 2);
  12:     }
  13:     _lastIncrease = DateTime.UtcNow;
  14:     current = Math.Max(current, 256 * PageSize);
  15:     var actualIncrease = Math.Min(_increaseSize, current / 4);
  16:     return current + actualIncrease;
  17: }

This piece of code is responsible for decided the new file size when we run out of room in the current space we use.

That is 10 lines of code. And what happens is quite clear, even if I say so myself. But the why for this code is lost. In particular, there is a lot of reasoning behind the actual logic here. You can see that we play with the actual increase size, to make sure that if we increase often, we will reserve more space from the OS. That seems clear enough, but what about lines 14 – 15?

In this case, just before those two lines we have:

   1: // At any rate, we won't do an increase by over 25% of current size, to prevent huge empty spaces
   2: // and the first size we allocate is 256 pages (1MB)
   3: // 
   4: // The reasoning behind this is that we want to make sure that we increase in size very slowly at first
   5: // because users tend to be sensitive to a lot of "wasted" space. 
   6: // We also consider the fact that small increases in small files would probably result in cheaper costs, and as
   7: // the file size increases, we will reserve more & more from the OS.
   8: // This also plays avoids "I added 300 records and the file size is 64MB" problems that occur when we are too
   9: // eager to reserve space
  10:  

And yes, this isn’t about comments such as // increment by 1. I try writing comments like that when explaining policy or heuristics. Or when I am doing something pretty funky all around for some (hopefully) good reason.

Another example might be this:

   1: // here we try to optimize the amount of work we do, we will only 
   2: // flush the actual dirty pages, and we will do so in sequential order
   3: // ideally, this will save the OS the trouble of actually having to flush the 
   4: // entire range
   5: long start = sortedPagesToFlush[0];
   6: long count = 1;
   7: for (int i = 1; i < sortedPagesToFlush.Count; i++)
   8: {
   9:     var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
  10:     // if the difference between them is not _too_ big, we will just merge it into a single call
  11:     // we are trying to minimize both the size of the range that we flush AND the number of times
  12:     // we call flush, so we need to balance those needs.
  13:     if (difference < 64)
  14:     {
  15:         count += difference;
  16:         continue;
  17:     }
  18:     FlushPages(start, count);
  19:     start = sortedPagesToFlush[i];
  20:     count = 1;
  21: }
  22: FlushPages(start, count);

Again, relatively short code. And really easy to follow. But good luck trying to figure out that this code is responsible for a 100% perf boost, or that you need to balance multiple aspects to get an optimal solution without getting comments.

Now, I think that when people talk about “good code doesn’t need comments”, they think about code like this (all samples taken from a popular OSS project):

   1: var query = _backInStockSubscriptionRepository.Table;
   2: //customer
   3: query = query.Where(biss => biss.CustomerId == customerId);
   4: //store
   5: if (storeId > 0)
   6:     query = query.Where(biss => biss.StoreId == storeId);
   7: //product
   8: query = query.Where(biss => !biss.Product.Deleted);
   9: query = query.OrderByDescending(biss => biss.CreatedOnUtc);

Yes, I can read, thank you. And here is an example of two comments, the first one good, and the second bad:

   1: if (_commonSettings.UseStoredProceduresIfSupported && _dataProvider.StoredProceduredSupported)
   2: {
   3:     //stored procedures are enabled and supported by the database. 
   4:     //It's much faster than the LINQ implementation below 
   5:  
   6:     #region Use stored procedure
   7:     
   8:     //pass category identifiers as comma-delimited string
   9:     string commaSeparatedCategoryIds = "";
  10:     if (categoryIds != null)
  11:     {
  12:         for (int i = 0; i < categoryIds.Count; i++)

The first comment explain why. And that is crucial. You should only need to explain the how very rarely, and then, you need a comment to justify why you need the comment. (Performance, maybe?)

time to read 1 min | 180 words

One of the things that always bugged me when working at a corporate environment was the sheer amount of work you had to do just to be able to do your work.  For example, I know of people who purchase our profilers out of their own pocket, because they want to do a better job, but their workplace won’t or can’t authorize that.

A lot of the time this has to do with the way the business allocate money. They may have a developer budget, but no tools budget, etc. Honestly, I think that this is stupid. And one of the things that I try to do is avoid being stupid.

In the end, almost any tool a developer want is going to cost significantly less then the lack of tool.

Hence…

image

Sure, it is a 10$ charge, but we do the same for pretty much every tool requested.

time to read 19 min | 3616 words

In the RavenDB mailing list, we had a question about how we can combine the standard unit of work pattern of working with RavenDB in MVC applications with async. In particular, the problematic code was:

   1: public class HomeController : Controller
   2: {
   3:     public IAsyncDocumentSession Db { get; set; }
   4:  
   5:     public async Task<ActionResult> Index()
   6:     {
   7:         var person = new Person {Name = "Khalid Abuhakmeh"};
   8:         await Db.StoreAsync(person);
   9:  
  10:         return View(person);
  11:     }
  12:  
  13:     protected override void OnActionExecuting(ActionExecutingContext filterContext)
  14:     {
  15:         Db = MvcApplication.DocumentStore.OpenAsyncSession();
  16:         base.OnActionExecuting(filterContext);
  17:     }
  18:  
  19:     protected override void OnActionExecuted(ActionExecutedContext filterContext)
  20:     {
  21:         Db.SaveChangesAsync()
  22:             .ContinueWith(x => { });
  23:  
  24:         base.OnActionExecuted(filterContext);
  25:     }
  26:  
  27:     public class Person
  28:     {
  29:         public string Id { get; set; } 
  30:         public string Name { get; set; }
  31:     }
  32: }

As you probably noticed, the problem is with line 21. We want to execute the save changes in an async manner, but we don’t want to do that in a way that would block the thread. The current code just assume the happy path, and any error would be ignored. That ain’t right. If we were using Web API, this would be trivially easy, but we aren’t. So let us see what can be done about it.

I created a new MVC 4 application and wrote the following code:

image

As you can see, I have a break point after the await, which means that when that break point is hit, I’ll be able to see what is responsible for handling async calls in MVC4. When the breakpoint was hit, I looked at the call stack, and saw:

image

Not very useful, right? But we can fix that:

image

And now we get:

image

This is a whole bunch of stuff that doesn’t really help, I am afraid. But then I thought about putting the breakpoint before the await, which gave me:

image

And this means that I can check the code here. I got the code and started digging. At first I thought that I couldn’t do it, but then I discovered that I could. See, all you have to do is to create you own async action invoker, like so:

   1: public class UnitOfWorkAsyncActionInvoker : AsyncControllerActionInvoker
   2: {
   3:     protected override IAsyncResult BeginInvokeActionMethod(
   4:         ControllerContext controllerContext,
   5:         ActionDescriptor actionDescriptor,
   6:         IDictionary<string, object> parameters, AsyncCallback callback,
   7:         object state)
   8:     {
   9:         return base.BeginInvokeActionMethod(controllerContext, actionDescriptor, parameters,
  10:                                             result => DoSomethingAsyncAfterTask().ContinueWith(task => callback(task)),
  11:                                             state);
  12:     }
  13:  
  14:     public async Task DoSomethingAsyncAfterTask()
  15:     {
  16:         await Task.Delay(1000);
  17:     }
  18: }

And then register it:

   1: DependencyResolver.SetResolver(type =>
   2:     {
   3:         if (type == typeof (IAsyncActionInvoker))
   4:             return new UnitOfWorkAsyncActionInvoker();
   5:         return null;
   6:     }, type => Enumerable.Empty<object>());

And you are golden.

Note: Except for doing a minimum of F5 in the debugger, I have neither tested nor verified this code. It appears to do what I want it to, and since I am only getting to this because a customer asked about this in the mailing list, that is about as much investigation time that I can dedicate to it.

time to read 2 min | 333 words

A fairly common question that gets ask is how to maintain identifiers between RavenDB and a secondary system. For this example, we will assume that we have a table in SQL Server that have a list of users. In SQL Server, we use a identity to identify the users. But in RavenDB we want to use different ids (maybe we want to have ids that are created only in RavenDB).

That means that we have something like:

image

As you can see, users 1 – 2 map to the same id in the RDBMS system, but users/3 was created in the RavenDB system, and the other users are on different ids.

The question now becomes, how do you keep track of that, especially with regards to having updates later on. The easiest way to go about it would be to just keep the RDMBS id in the document, and query on that. That means that you need to query, and that subject to RavenDB BASE queries. So we want to try to do something better.

We can just keep a document reference, so we could do something like Load(“mapping/3”) –> “users/4”. This means that we keep a document with the RDBMS  id that just points to the actual user’s document. That works, but that isn’t very optimal. Mostly because you will need to do a lot of requests (or load a lot of documents) to get that working nicely in bulk scenarios.

But that isn’t the only option. Instead, you are going to use a document per 1000 ids. That would be called something like: “mappings/1-1000”, “mappings/1001-2000”, etc. The idea here is that we can load a single document, and get a thousand mapping all at once. Since we will usually be processing those values in order, this gives us a cheap way to preload stuff.

time to read 4 min | 695 words

I have been testing out several approaches for writing out to files. And I thought that the results are interesting enough to share. In all cases, I was writing a 128Kb buffer of random data to a file with size of 256Mb.

The first thing that I wanted to try was the trivial managed memory map approach:

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        for (int i = 0; i < accessor.Capacity; i += buffer.Length)
        {
            accessor.WriteArray(i, buffer, 0, buffer.Length);
        }
        accessor.Flush();
    }
}

This completed in 3.871 seconds.

Next, I Wanted to see what would happen if I were using direct memory access, and used CopyMemory to do that:

[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
static extern void CopyMemory(byte* dst, byte* src, long size);

using (var mmf = MemoryMappedFile.CreateFromFile("test.bin", FileMode.Create, "test", 1024*1024*256))
{
    using (var accessor = mmf.CreateViewAccessor())
    {
        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        fixed (byte* src = buffer)
        {
            for (int i = 0; i < accessor.Capacity; i += buffer.Length)
            {
                CopyMemory(p + i, src, buffer.Length);
            }
        }
        accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        accessor.Flush();
    }
}

As you can see, this is somewhat more complex, and require unsafe code. But this completed in 2.062 seconds. Nearly twice as fast.

Then I decided to try with raw file IO:

using (var f = new FileStream("test.bin",FileMode.Create))
{
    f.SetLength(1024*1024*256);
    for (int i = 0; i < f.Length; i += buffer.Length)
    {
        f.Write(buffer, 0, buffer.Length);
    }
    f.Flush(true);
}

This is about the most trivial code that you can think of, and this completed in about 1.956 seconds. Slightly faster, but within the margin of error (note, in repeated tests, they were consistently very close, and the file I/O was very near).

So, in other words, the accessor code adds a lot of overhead when using Memory Mapped Files.

time to read 2 min | 215 words

Okay, having gone through the LMDB codebase with a fine toothed comb, I think that I can safely say that it is both a very impressive codebase and one the dearly need some TLC. I’ll freely admit that I am by no means a C guy. And it is entirely possible that a lot of the issues that I have been bugging me are standard C things. But I don’t think so. Methods that go on for hundreds of lines, duplicated code and plethora of gotos hardly seem to be the things that pop to mind when I hear good C code.

But beyond my issues with the code, the implementation is really quite brilliant. The way LMDB manages to pack so much functionality by not doing things is quite impressive. Interestingly, you couldn’t write this database even 5 years ago. LMDB relies on being able to map the db into memory, and up until x64 became prevalent, you just couldn’t do that for any db with a meaningful size. With x64 and the effectively unlimited address space we have (will I be laughing at my naivety in a few years?), that is no longer an issue.

I learned quite a lot from the project, and it has been frustrating, annoying and fascinating experience.

time to read 4 min | 641 words

Kelly has an interesting post about memory mapped files and the cloud. This is in response to a comment on my post where I stated that we don’t reserve space up front in Voron because we  support cloud providers that charge per storage.

From Kelly’s post, I assume she thinks about running it herself on her own cloud instances, and that is what here pricing indicates. Indeed, if you want to get a 100GB cloud disk from pretty much anywhere, you’ll pay for the full 100GB disk from day 1. But that isn’t the scenario that I actually had in mind.

I was thinking about the cloud providers. Imagine that you want to go to RavenHQ, and get a db there. You sigh up for a 2 GB plan, and all if great. Except that on the very first write, we allocate a fixed 10 GB, and you start paying overage charges. This isn’t what you pay when you run on your own hardware. This is what you would have to deal with as a cloud DBaaS provider, and as a consumer of such a service.

That aside, let me deal a bit with the issues of memory mapped files & sparse files. I created 6 sparse files, each of them 128GB in size in my E drive.

As you can see, this is a 300GB disk, but I just “allocated” 640GB of space in it.

image

This also shows that there has been no reservation of space on the disk. In fact, it is entirely possible to create files that are entirely too big for the volume they are located on.

image

I did a lot of testing with mmap files & sparseness, and I came to the conclusion that you can’t trust it. You especially can’t trust it in a cloud scenario.

But why? Well, imagine the scenario where you need to use a new page, and the FS needs to allocate one for you. At this point, it need to find an available page. That might fail, let us imagine that this fails because of no free space, because that is easiest.

What happens then? Well, you aren’t access things via an API, so there isn’t an error code it can return, or an exception to be thrown.

In Windows, it will use Standard Exception Handler to throw the error. In Linux, that will be probably generate a SIVXXX error. Now, to make things interesting, this may not actually happen when you are writing to the newly reserved page, it may be deferred by the OS to a later point in time (or if you call msync / FlushViewOfFile).  At any rate, that means that at some point the OS is going to wake up and realize that it promised something it can’t deliver, and in that point (which, again, may be later than the point you actually wrote to that page) you are going to find yourself in a very interesting situation. I’ve actually tested that scenario, and it isn’t a good one form the point of view of reliability. You really don’t want to get there, because then all bets are off with regards to what happens to the data you wrote. And you can’t even do graceful error handling at that point, because you might be past the point.

Considering the fact that disk full is one of those things that you really need to be aware about, you can’t really trust this intersection of features.

time to read 3 min | 526 words

This is more of less how a page is represented in memory.

image

There are a few things to note. The entries immediately following the page header points to the actual data at the end of the page. Why is that? Well, remember that we are working with a fixed size page. And we need to add data to the page in a sorted fashion, and do this cheaply. In order to do that, we always add the actual data at the end of the file, before any other data. When the page is empty, we add the value in the end. And the next value is immediately before the previous value, etc.

However, as you can see in the image, the fact that we add values in a particular order doesn’t mean that they are sorted in that order. In order to get good sorting, we keep a list of the entries offsets in the beginning of the page, just after the header. This is a sorted list, which gives us the ability to easy add / move an item in the page without actually moving the page, just by changing the offset position it is located on. If we would add another entry to this page, with the key C, the actual data would go on top of entry# 3. But the offset recording this entry would go between B & A at the head of the page.

In other words, the actual data of the page grows upward, and the offsets pointing to the locations of entries in the page grows downward. A page is full when you can’t fit the next data item and its offset in the space between the data & the offsets.

So that is addition. How about deletion. When we delete, we need to recover the space, but that is actually very easy to handle. Let us say that we want to delete the A entry. That means that we need to do the following. Move all the higher entries data downward to cover the space taken by the entry, and then update the offsets to reflect that. In the end, we are left with:

image

Updates work as a pair of delete/add operations, with some important edge cases. What happen if we have a page that is nearly full, and we want to update an entry that is bigger than it can currently hold? We have to check if the value is also too big if the previous value is removed. If not, we can fit it into the currently page. Otherwise, we would have to do a page split to accommodate it.

Surprisingly enough, the code that is required to handle that is quite pretty, although the actual mechanism probably require a whiteboard to explain. And a lot of hand waving.

time to read 3 min | 417 words

In the previous post, I gave a brief introduction about B+Trees. Now I want to talk about a crucial part of handling them. Let us start with the basic, we have the following tree, which consist of a single page:

image

As you can imagine, attempting to add a single item to this page isn’t going to happen (there isn’t enough space), so we are going to have to something about it. That something is call page splitting. The easiest way to do that is to simply cut things in half, like so:

image

We have split the page, and now we are left with two pages that we can start writing to. Everyone is happy. Almost… in this case, you can see that we are doing sequential inserts. So page #0 is never going to have any additional data written to it. That means that we are wasting half this page!

What I have done is to add just a tiny bit of smarts into the mix. Instead of doing a 50/50 split, we can detect if this is a sequential insert that is causing the split. If it is, we are going to split the data in a way that put more of it in the original page, like so:

image

We leave 85% of the data in the original page because that give some room to play with if one of the entries change there. This gives us a better page utilization in the common case of sequential inserts. But what about non sequential inserts? Well, if we detect that the page is being split because of a non sequential insert, we are going to do a 50/50 split, expecting that there would be additional non sequential inserts along the way.

I am not sure how useful that would be, but it seems like something that could be a nice optimization for a common case, and it requires nothing else from the user, we can do it all on our own.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}