Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 772 words

One of the most important things that you need to do for high performance is to control your allocations. Indeed, the blittable format is almost entirely implemented in unmanaged memory. And we get a great deal of performance from not having the GC poke its annoying little nose into our data structures. That said, it means that we take the onus of managing the memory ourselves.

This post is about a couple of changes that we made to the memory management system in the blittable format, and some future directions we intend to explore.

The first thing that we did was to implement an unmanaged memory pool based on ConcurrentDictionary<int, ConcurrentStack<IntPtr>>. The int is the size (in power of two) of the allocated blocks. And we would allocate from the heap, and then store the buffers internally for reuse. That worked, but profiling showed that we had quite a bit of work in just leasing and returning the buffers to the thread safe pool.

Note that this was when we only tested using a single thread, we expect that the cost of working with it would only increase when using multiple threads.

That sucked.

Luckily, we use the context pattern for pretty much everything in blittable format (and more generally in RavenDB), so we can take advantage of that. We have a three stages process.

First, we allocate the memory when we first request it. Then, when we are done using it, we return it to the context, not the global pool. This allows us to avoid cross thread coordination costs entirely. And it is quite common for threads to need to run the same buffers multiple times, so we save the “check in the buffer” just to “lease me this buffer again, please” style of work.

Here are the costs when using a single shared, thread safe pool:

image

image

As you can see, we spent quite a bit of time just managing the thread safe collections. You don’t see it in this profile, but other profiling sessions show the concurrent dictionary under load as well. And this is in a single threaded, no contention benchmark.

We moved the memory allocations to the context. Because a context is single treaded, we can rely on much simpler (and cheaper) collections, and we can also reuse contexts. A request checks out a context, which has its own cached buffers. It runs through the request, then return the context as a whole to the buffer. Each of those contexts can then manage their own memory and only rarely need to go to the global pool. There is some complexity about making sure that overly fat contexts hang around, but that is basically it.

And when we need to release the data from the context, we can do all of that in a single bulk operation. Here are the profiler results after:

image

image

I admit, it is a bit disappointing to see only a 100% improvement, but I can comfort myself with knowing that the saving in multi threaded scenarios are much higher.

I mentioned that we also have some ideas on improving this furhter. This idea (which we deferred right now because there are other more important things to do) include just allocating a big buffer (128MB in size) per context. We'll not commit all of it, merely reserve the address space, and start allocating from it directly. Basically, each allocation would simply be a pointer increment (with occational calls to commit the rest of the reserved address space).

Now, in order to free the memory, all we need would be to reset the position to zero, and we are set to reuse the context again at effectively zero cost. If this sounds familiar to you, this is because this is primarily how allocations actually work in .NET, except that we explicitly control the size and the lifetime of all the objects in there.

It also has no cost in GC, which is quite nice. As I said, we are thinking about this, but at this point, I think that we are fast enough that we don't need to go there yet. We'll have to see under concurrent usage what this will be.

time to read 6 min | 1066 words

This series of posts is continuting the work I have outlined in this series. While in the previous series, I focused on the overall picture, here I want to talk about the small things we did to speedup pretty much every aspect of the system. In some cases, those are order of magnitude differences.

In this post, I want to talk about the following:

“Oren\r\nEini”

Such a simple thing, if you just look at the surface, and yet if you’ll profile the Json.NET code, you’ll see an interesting issue. I took an existing JSON file (about 67MB in size) and continiously wrote it to /dev/null under a profiler. The results are actually quite surprising.

image

In fact, this is a bit worse.

image

dotTrace has a great feature that allows you to remove “noise” from a trace. Here is what happens if we eliminate this call:

image

So, pretty obivously, this is an really expensive call. But how expensive is it, really? Note that this profiling run used a Null Stream, so there isn’t any I/O cost to consider. This is the pure computational cost of calling this method.

Let us look at a deeper profiling profiler of this method:

image

As you can see, WriteEscapedString does about 9% of work, then call WriteEscapedJavaScriptString, which make some calls to the StreamWriter. But about 33% of the time is spend doing its own thing, looking at the code, it is clear what is going on:

image

Basically, before it can write anything, this code needs to go over the string and find out if it has any escape characters. If it does, it needs write the escape sequence, then resume writing the string. This means that this code has to scan the string that it is writing several times, and much worse, it means that you can’t really do justice with those operations. Consider the string above, writing it out is actually several different calls to the writer. In fact, we have:

  • Oren
  • \r
  • \n
  • Eini

The problem here is that we lose a very powerful feature, the ability to batch multiple operations into a single I/O call. Sure, the TextWriter is going to be doing a bit of buffering for us, but that doesn’t help us really. Let us consider one of the most important calls we have in this code:

image

For the string above, we call this method twice, once for each portion of the string that isn’t escaped. That means that argument validation needs to run each and every time. And we also have to copy the memory in itsy bitsy pieces. Memory copy routines can get far greater speed if they can copy a large bunch of memory all at once, rather than being called on small sections of it.

In short, I hope that this convinced you that escaping strings is expensive. Stupidly so, when you come right down to it.

But what can you do, that is the cost of doing business, right?

Indeed, JSON.Net code is heavily optimized, and there isn’t much you can do about improving things at this point. You are stuck with the JSON format and the implications of it.

This is already a long post, but bear with me while I explain the reasoning.

The blittable format isn’t JSON. We read JSON and output JSON, but we aren’t limited to the same constraints as someone who is just consuming JSON. There are two separate use cases for the document using the blittable format. The first is when we need to work with them on the server side. This include such things as indexing, transformers, patching, etc. For those cases, if there is an escaped value, we need to get the unescaped version. In other words, I need the string with the line break in it, not the \r\n escape sequences. The other common use case, unfortunately, is when we need to write the document to the user, in which case we do need to write it out with all of the escape sequences.

What is left for us, apparently, is to choose our poision. We can optimize for either scenario, but not for both.

Except, that maybe we can. And that is what we actually did.

When we write a string value into the blittable format, we read escape sequences, and translate them into their unescaped values. In other words, we take a “\n” and store just the byte 13. But we also remember its position. The string we keep talking about? It would be stored as:

[10][O][r][e][n][10][13][E][i][n][i][2][4][0]

The first byte (10), is the length of the string. Then we have the actual unescaped version. If the server side code wants to read this string, we can just serve it directly from this memory, without any additional work. But what about when we need to write it down, and maintain the escape sequences? That is where the numbers in the end come into play. We need to skip 10 bytes past the length, where we’ll encounter 3 bytes: 2, 4, 0.

The first byte is the number of escape sequences in the string. The second is the number of bytes that we can copy until we get to the next escape sequence. Which means that our code for writing string has narrowed down to:

image

And that, in turn, means that if you don’t have any escape sequences, we can just call memcpy directly to write to the output buffer. And if you do have escape sequences, we just need to scan the escaped characters, we don’t need to scan the entire string.

There is so much saving in the air, it feels like Black Friday.

time to read 5 min | 930 words

Unlike previous posts in this series, this is actually something that happened to our own production server today. It resulted in our website being inaccessible for a a couple of hours, and like most such stories, its root cause is tremedously simple, and through a series of unfortunate accidents, it had escalated to a major issue.

First, this post is dedicted to this book, which should be read by any self respecting developer whose code is expected to hit production.

Cover image for Release It!

This post is also written a few hours only after the incident was resolved. Before we actually implemented anything except a temporary workaround. I’ll probably have another post in a couple of days to talk about the steps we are going to take to alleviate a repeat of this incident.

The incident started innocently enough, when one of the guys on the team discovered that the startup time of a certain instance jumped by a lot. Investigating into why he realized that the issue was extremely slow responses from our server. That was a cause of triple concern, actually. First, why are we accepting such slow responses instead of time limiting them? Second, why are we making a remote syncronous call during startup? And third, why on Earth is our server so slow?

Logging into the server, it didn’t take long to see what the problem was. The www.RavenDB.net website (the code that runs the RavenDB website, not RavenDB itself) was consuming a lot of CPU and quite a bit of memory. In a bit to restore the other services which reside on the same box, we reset the process. Our main concern at the time was to restore service as soon as possible, and we planned on investigating further through the logs.

However, in a few minutes, the www.RavenDB.net website started consuming more and more resources. At that point, we started considering a DoS attach of some sort and looked a the logs. The logs did show a very high number of requests, much more than I would expect. But looking further into them, it looked like they were mostly bots of various kinds indexing our site.

Considering that this might be the case of Google hammering us, we configured a robots.txt on the site and waited to see if this would have an impact. It didn’t.

The next step was to take a process dump of the process, and then analyze it. During this period, we had to shut down www.RavenDB.net because it was killing all other services running on the server.

Looking at the dump in WinDBG, we started with the obvious commands.

  • !runaway – to find out the thread cpu times
  • switching to the busiest threads
  • !clrstack - to see what it is doing

Honestly, this is a much nicer way of looking at this, though:

image

As you can see, the threads are currently actually doing parsing of a Razor template, and seems to be doing that on a fairly continous basis, consuming all system resources.

At that point, I started getting concerned to the well being of the poor guy’s inbox as a result of this code. That was the point where we actually did what should have probably been our first action, and looked at the error log of the website.

Previously, we looked at the live metrics and the request log, but didn’t consider looking into the error log for the system. The error log for the website, for today only, was 6GB in size, and was pretty full of errors such as:

- error: (221, 88) 'HibernatingRhinos.Orders.Common.EmailProcessing.EmailTemplates.RavenDBWebsite.Models.DownloadQuestionMailInput' does not contain a definition for 'Unsubscribe' and no extension method 'Unsubscribe' accepting a first argument of type

And at that point, we had enough to point a suspicious finger. We have an email that we send out, and we used to have a valid template. At some point, the code was changed and the Unsubscribe was removed. Nothing broke because the template is just a file, not actually compiling code. However, in production, when we tried to send the email, Razor would parse the text, fail compilation because of the missing member, and basically thorw a hissy fit.

Update: We investigated further, and it looks like the following was the actual “solution” to the outage:

image

The “solution” is in quotes, becasue this fixes the problem, but we need to still implement steps to ensure that something like that doesn’t repeat.

Unforuntately, at that point, we would consider this email as failing, and move on to the next one. That next one would also fail, and so would the next one, etc. Because all of them failed, they would get picked up again next time this run.

Once we knew where the problem was. The workaround was to deploy a version with no email sending. For this weekend, that will work. But come Sunday, someone is going to go over this piece of code with a veyr fine comb. I’ll post more about it once this actually roll around.

time to read 1 min | 100 words

Aren’t you surprised by that title? Today we were summarizing what our current status in the recruitment round after a particularly disappointing interview, and we actually looked at the number and I was pretty bummed.

Here they are:

image

The good news it that we have a couple of really promising candidates coming by soon, and I remember feeling this annoyed with the whole process in pretty much all other recruitment rounds.

time to read 4 min | 800 words

After quite a bit of work, we have finished (at least for now) the Blittable format. As you can probably guess from the previous posts, we have decided to also implement small string compression in the Blittable format, based on the Smaz algorithm.

We also noticed that there are two distinct use cases for the Blittable format. The first is when we actually want to end up with the document on disk. In this case, we would like to trade CPU time for I/O time (especially since usually we write once, but read a lot, so we'll keep paying that I/O cost). But there are also many cases in which we read JSON from the network jus to do something with it. For example, if we are going to read an index definition (which is sent as a JSON document), there is no need to compress it, we'll just have to immediately uncompress it again.

So we added the option to choose what mode you'll use the data. Here are the benchmarks for speed and size for Json, Blittable (for disk) and Blittable (for memory)

image image

As expected, we see a major difference in performance between Blit to disk and Blit to mem. Blit to mem is comparable or better from Json in nearly all scenarios, while Blit to disk is significantly smaller than Json while having comparable or better performance at parsing time, even though it is compressing a lot of the data.

For large documents, this is even more impressive:

image image

Blit to mem is 40% – 75% of the speed of Json at parsing, while Blit to disk can be much more expensive (us to x2.5 times more than Json), depending on how compressible the data is. On the other hand, Blit to mem is already smaller than Json in 33% – 20% for most datasets, but Blit to disk improves on that significantly, often by as much as 60% of the original JSON. In the companies.json case, the original file size is 67.25 MB, using Blit to mem we have a 45.29 MB file, and using Blit to disk gives us a 40.97MB.

This is a huge reduction in size (compared to appreciable fraction of a second on most I/O systems).

So far, we look at the big boys, but what happens if we try this on a large number of small documents? For example, as would commonly happen during indexing?

In this case, we see the following:

image

Note that Blit to disk is very expensive in this mode, this is because we actually do a lot. See the next chart:

image

Blit to disk is actually compressing the data by close to 20MB, and it is 5 MB smaller than Blit to memory.

The question on what exactly is the difference between Blit to memory and Blit to disk came up in our internal mailing list. The actual format is identical, there are currently only two differences:

  • Blit to disk will try to compress all strings as much as possible, to reduce I/O. This is often at the cost of parsing time. Blit to mem just store the strings as is, which is much faster, obviously. It takes more memory, but we generally have a lot of that available, and when using Blit to memory, we typically work with small documents, anyway.
  • Blit to disk also does additional validations (for example, that a double value is a valid double), while Blit to memory skip those. The assumption is that if the double value isn’t valid, it will fail when you actually access the double’s value, whereas we want to have such issues happen when we persist the data if we are actually going to access it later when using Blit to disk.

Overall, a pretty good job all around.

time to read 1 min | 79 words

I admit, she only go the interview because of who her father is. But she still managed to show more enthusiasm for the job than some candidates.

image

Her test was:

try
{
	Tamar.LearnToWalk();
	Tamar.LearnToProgram();
	Tamar.Debug();
}
catch(PoopException)
{
	Tamar.Change(DiaperMode.Clean);
}

She passed, but decided to take a nap instead of the job.

time to read 2 min | 291 words

In the previous post, I discussed the way the Smaz library works. Smaz uses a static shared dictionary, that was produced once from a known training set. FemtoZip, on the other hand, also uses a shared dictionary, but it is using a much more sophisticated system. To start with, it allows you to create the shared dictionary from your own data, not just the fixed data such as Smaz. However, that does require you to carefully pick your training set.

And of course, once you have picked a training set, it is very hard if not impossible to change without complex versioning rules. I would have expected the FemtoZip version to compress much better, but it depend heavily on the training set it has, and selecting the appropriate one is relatively hard in a general purpose manner.

In my tests, I wasn't able to find a training set that would do better than Smaz, I tried a bunch, including using Smaz's own dictionary, but it seems like Smaz has done a pretty good job in selecting the relevant outputs. Or perhaps the strings I was testing were optimal for Smaz, I'm not sure.

I also looked at Shoco, but it took very little time to rule it out. It only supports ASCII strings, which is about as far from helpful as you can get, in this day and age. Smaz isn't doing too great on non ASCII characters, but it has a fixed growth per length of unfamiliar terms, which is much better than doubling the size as in Shoco.

For our needs, I believe that we'll need to have a static version of the dictionary anyway, so there isn't a major benefit to being able to train them.

time to read 9 min | 1737 words

Smaz is a small library for compressing small strings. It is strange to talk about reverse engineering such a project, because Smaz is open source and is available on GitHub. Nevertheless the code on GitHub is actually only part of the story. It is dense,  and it was generated by a script that isn't included. That means that when we read the code, we are missing some key components, in particular, you are looking at an end result, without knowing how you got there.

Compression algorithms take advantage of repetitions in text to reduce the overall size. State of the art algorithms typically use Lempel-Ziv and Huffman coding to compress the data. But because they are typically making no assumptions about the data, and they require a bit of length before they can gather enough state to truly gather some steam and start reducing the output length.

That means that for small strings (up to a hundred bytes or so), there is no real benefit in standard compression algorithms, and often you'll see an increase in the space taken. Smaz is meant to solve that for most common cases. It does so by using a prepared dictionary of common terms that can be easily compressed.

The first thing that you'll encounter when you read the Smaz code is this:

image

This goes on for about 50 lines, and at first glance, it looks utterly incomprehensible. We'll ignore this for now, and read further. The next item of interest is:

image

The cb postfix stands for code book, and rcb stands for reverse code book.

It took me a while to figure it out, but the idea in the code book entries is that this is actually a hash table. Each string in the codebook is actually multiple entries, composed of a single byte length, the term text, and then the index in the codebook. There are possibly multiple entries in each value in the array, and we use the length of each entry to index into them.

Let us see how this works:

int smaz_compress(char *in, int inlen, char *out, int outlen) {
    unsigned int h1,h2,h3=0;
    int verblen = 0, _outlen = outlen;
    char verb[256], *_out = out;

    while(inlen) {
        int j = 7, needed;
        char *flush = NULL;
        char *slot;

        h1 = h2 = in[0]<<3;
        if (inlen > 1) h2 += in[1];
        if (inlen > 2) h3 = h2^in[2];
        if (j > inlen) j = inlen;

This code just setup a temporary buffer, and hash the current input term's bytes. Then we start getting interesting.

        /* Try to lookup substrings into the hash table, starting from the
         * longer to the shorter substrings */
        for (; j > 0; j--) {
            switch(j) {
            case 1: slot = Smaz_cb[h1%241]; break;
            case 2: slot = Smaz_cb[h2%241]; break;
            default: slot = Smaz_cb[h3%241]; break;
            }
            while(slot[0]) {
                if (slot[0] == j && memcmp(slot+1,in,j) == 0) {
                    /* Match found in the hash table,

The value j is the current size we are checking, and we are using the hash value to get a particular slot in the hash table. Note that only the first 3 bytes are actually hashed. After the appropriate slot is found, we check whatever the term length is a match, and if so, whatever the actual strings match. If they don't, we run the following code:

slot += slot[0]+2;

this was a bit confusing at first, but what is basically going on here is that we move to the next entry in this slot. This is pretty neat, since it covers both empty slots (defined in the code as "") and multi value slots and take advantage on the fact that C strings end with \0 without having to specify it explicitly.

So what happens if we have a match?

image

In this case, we check if there are any verbatim bytes (bytes that we haven't been able to compress that are stored in a buffer.). If there are such bytes, we do the following:

  • Compute the space needed for the verbatim value.
  • Set the next flush point in the buffer for verbatim values to the current output location.
  • Move the output pointer after the location where the verbatim string will be written.
  • Reduce the length of the remaining output size by the verbatim length.

Note that we aren't actually going to write anything yet. First, we need to emit the newly captured match to the code book:

image

This is done by first checking if there is space (if there isn't, we return an error).

Then we write a byte to the output buffer. But this is a very strange byte.

What is this "slot[slot[0]+1]" ? Well, remember the structure that we talked about for the hash table entries? The first byte is a length, and the last byte is the index into the code book. So what we are actually doing is indexing into the entry to get the last value, which his the code book index, which we write to the output.

The rest is pretty much just book keeping information. We move the input buffer pointer according to the just discovered term, etc.

image

If no match was found, we just add the current byte to the verbatim buffer, and move on to the next one.

Now, let us look at this out label, and what it does:

image

Basically, it is a repeat of the earlier code when finding a match. If we have a large enough verbatim string, we need to flush it, so this takes care of it. The actual flushing is interesting:

image

If there is a single byte, we write 254 (single verbatim byte marker), then we write the byte. If there is more than a byte, we write 255 (variable length size), the length, then the verbatim string.

It is interesting to note that this can be written after the output byte has been written. It took me a while to understand this. Nothing wrong with it, but I like sequential writes much better.

The decompression is very simple. Read a byte from the compressed input, if it is  254, then the next byte should be copied verbatim to the output. If the byte is 255, then read the next byte, which is the length, and copy the next length bytes to the output. Finally, if the byte isn't 255 or 254, it is an index into the code book table, and the value should be taken from there.

I wrote a managed implementation of this, which allows me to play much more easily with the codebook definition. This was because there are a bunch of terms in the Smaz codebook that aren't really relevant for what we need. In particular, it looks like it was trained to find the most common values from a set of HTML documents, it has entries for "<div>", "><", etc. Instead of going with the 254 values that can be compressed, I pruned the list a bit, and ended up with just 245 items.

I was then able to change the compression format to be:

  • If it is a value under 245, use the term from the codebook.
  • If it is a value higher than 245, it is a verbatim value whose actual length need to be figured out by subtracting 245 from it.

This allows me to save a byte if we have more than two verbatim bytes.

Smaz is a really simple and clean library, but it isn't doing something very complex. It is a static shared dictionary approach, without the use of more advanced approaches. I previously written about a more complex compression systems for small strings, such as FemtoZip. You can find the previous post here (there was a whole series of them).

In my next post, I'll try to compare the two options.

time to read 1 min | 169 words

I’m currently reviewing CVs, seemingly by the hundreds*. And I run into a guy which has a Github profile link in the CV. Such links are always followed, because seeing someone’s actual work is so much better than just reading some document about it.

But then I saw this:

image

And looking into the actual repository we have:

image

While this isn’t quite enough to give you a Darwin Award in the job hunting department (sadly, I saw worse), how could anyone think that having a publicly visible repository that says “I do illegal things to software” is a good idea. Leaving aside that you link that from your CV.

* It isn’t that many, it is just annoying.

time to read 4 min | 725 words

We recently had to deal with an interesting problem in an unreliable application. To make it simple, I'll use this blog as an example, since that is much easier.

Let us assume that you are reading this post, and have a sudden desire to post a comment. Now, you can do so, we'll refresh the page, and you can see your new comment. So far, pretty simple and easy. That is the way we have been writing web applications since the nineties.

However, let us say that for whatever reason, we decided to put the actual content of the page inside a CDN, like Amazon CloudFront. The minimum amount of time we can store a page in a CDN is 5 minutes or so. The reason we want to use a CDN is so users' won't be impacted when the site is down for whatever reason.

Now, there are two types of user interactions in the blog. There is me, writing articles, and posting them. And then there is the reader, who read the post, and then send a scathing comment about the post. The software in question suffers from occasional outages, so we needed to have a way to cover that without the users' noticing. In this case, we divided the operations into two categories. Rare, which require the software to be fully operational (in the example above, posting a new blog post) and common, such as a user posting a comment, which should work even in reduced availability mode.

The solution was to utilize the CDN both for caching and reliability. The blog post itself is going to be hosted on a CDN, so if the core software fail, the user doesn't see that. But what about the actual interactions? As it turns out, while we want the user to think that everything is fine, we are actually okay with delaying some things. In this case, the perceived operation is actually more important than the actual operation.

Let us consider the standard system behavior, when there is no failure. The user goes to a particular page, and get the cached results from CDN. At the same time, we contact the live site, to try to get anything that is happening after the cached version. In the blog post example, the Ajax call would be something like "give me all the comments after 123". This takes us half way there, even though we use a CDN, we now have all the updated data when the user loads the page.

Except that we can take it further. Instead of doing an Ajax call, we can open a web socket, and ask the server "stream me everything after comment 123". As new comments come in, they will show up on the page automatically. If there is an outage, the user is going to see the latest version cached in the CDN, and won't be aware that they aren't getting the new comments. When the site is up again, we'll have all the new comments show up, just like magic.

This takes care of the user's reading habits. But what happens if the user is actually commenting? In the good case, the system is up and running, accept the new comment, distribute it to all the web sockets listening to the page's comment, and everything si good. But what happens in the failure case?

In this case, we are going to accept the user's comment, and then write it to local storage. And from local storage, we will display it to the user so as far as the user is concerned, the comment was successfully posted. Naturally we need to retry on the background, and eventually we'll send the comment to the server successfully.

And yes, this is only relevant for cases where we can actually afford to have the user believe that an operation has been successful when it hasn't been. But for stuff like comments, this is a really good system to make sure that everyone has a good user experience.

Oh, and this also has the benefit of having an easy way to handle abuse. You can not send the comment to the server, but do show it to the user. This way the spammer think that they have successfully posted, but they are just spamming themselves.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}