Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 234 words

We had a need to generate unique ids for connections in our db. We used the following code:

public class NotificationsClientConnection : IDisposable
{
    private static int _counter = 0;

    private readonly WebSocket _webSocket;
    private readonly DocumentDatabase _documentDatabase;
    private readonly DateTime _startedAt;


    public NotificationsClientConnection(WebSocket webSocket, DocumentDatabase documentDatabase)
    {
        _webSocket = webSocket;
        _documentDatabase = documentDatabase;
        _startedAt = SystemTime.UtcNow;
    }

    public int Id => Interlocked.Increment(ref _counter);

    public TimeSpan Age => SystemTime.UtcNow - _startedAt;
}

Do you see the bug? It is a subtle one.

When you are ready, peek below.

.

.

.

.

.

.

.

.

.

.

.

.

.

The issue is in the following line:

    public int Id => Interlocked.Increment(ref _counter);

The intent was to create a field that would be initialized on startup. Unfortunately, the usage of the lambda symbol "=>" instead of assignment "=" turned that into a computed property, which will return a new value on every call.

time to read 3 min | 459 words

In the new blittable format we use in RavenDB, we make heavy use of compression to reduce the size of documents.

Compression has an unfortunate problem, though. It doesn't work for all inputs. The proof of that is a very neat one:

  • We have a set of compression / decompression function: compress(plainText) –> compressedText and decompress(compressedText) –> plainText.
  • Those are lossless functions, that is, for any input decompress( compress(X) ) == X
  • Let us assume that for any input, size(plainText) > size(compressedText)

But that causes a problem. Let us assume that our plain text size is N, and that the compression algorithm reduce that size by just one bit, so the size of the compressedText is N-1.

We'll compress all possible permutations of N bits using this algorithm. Given that the compression results in at least N-1  bits, there must now be two different values of the plainText that result in the same compressedText. That breaks the ability to decompress them successfully. Because there isn't a one to one mapping between them. Common compression algorithms rely on the source data to either have repetitions (LZ derivatives) or are based on shared dictionaries that match a particular set of data.

In practice, this has real world implications when you are designing a data format. For example, the blittable format compress strings using two different algorithms. For large strings, we use LZ4, because it has much higher compression rate and doesn't require any special knowledge of the data. For small strings, we use a Smaz variants, which is a shared dictionary of common terms. Because the dictionary is small, we can't put a lot of data into it, so we concentrated on common Latin character sequences.

That means that if you are trying to compress a Unicode string like:

רוח צפונית

You are going to use up more bytes than the original plain text. This is easy to experiment with using Smaz variant, because it is very simple. But it also happens using LZ4 for certain inputs.

That causes a problem for the blittable format, because we want to compress the data, but for certain inputs, that means that we are going to get more data.

We solved that by doing conditional compression. We designate a buffer that is smaller than the plain text that we are compressing (this reflect the maximum amount of compression that is valuable for us), and compress to that buffer. If the compression routine was unable to compress to that buffer (because it needed more space), we fail the compression, and just store the plain text.

Now we have an optimal compression rate, this is going to always be equal to or (hopefully usually) smaller than the original text.

time to read 8 min | 1552 words

I got into a discussion with Mark about the usefulness of async. In particular, Mark said:

Sync will always have less CPU usage because of less kernel transitions, less allocation, less synchronization.

This effect only comes into play when after issuing the IO the thread pool can immediately process a queued task. This is only the case when the CPU is highly saturated. At 90% saturation you have a good chance that this is happening. But nobody runs their production systems that way.

And while this is sort of correct, in the sense that a major benefit of async is that you free the working thread for someone else to work on, and that this is typically mostly useful under very high load, async is most certainly not useful just for high throughput situations.

The fun part about having async I/O is the notion of interleaving both I/O and computation together. Mark assumes that this is only relevant if you have high rate of work, because if you are starting async I/O, you have to wait for it to complete before you can do something interesting, and if there isn't any other task waiting, you are effectively blocked.

But that doesn't have to be the case. Let us take the following simple code. It isn't really doing something amazing, it is just filtering a text file:

public void FilterBadWords(string intputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(intputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }

            if(hasBadWord == false)
                writer.WriteLine(line);
        }
    }
}

Here is the async version of the same code:

public async Task FilterBadWords(string intputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(intputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }

            if(hasBadWord == false)
                await writer.WriteLineAsync(line);
        }
    }
}

If we'll assume that we are running on a slow I/O system (maybe large remote file), in both version of the code, we'll see execution pattern like so:

image

In the sync case, the I/O is done in a blocking fashion, in the async case, we aren't holding up a thread ,but the async version need to do a more complex setup, so it is likely to be somewhat slower.

But the key is, we don't have to write the async version in this manner. Consider the following code:

 public async Task FilterBadWords(string intputFile, string outputFile)
 {
     var badWords = new[] { "opps", "blah", "dang" };

     using (var reader = File.OpenText(intputFile))
     using (var writer = File.AppendText(outputFile))
     {
         var lineTask = reader.ReadLineAsync();
         Task writeTask = Task.CompletedTask;
         while (true)
         {
             var currentLine = await lineTask;
             await writeTask;
             if (currentLine == null)
                 break;
             lineTask = reader.ReadLineAsync();

             bool hasBadWord = false;
             foreach (var word in badWords)
             {
                 if (currentLine.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                 {
                     hasBadWord = true;
                     break;
                 }
             }

             if(hasBadWord == false)
                 writeTask = writer.WriteLineAsync(currentLine);
         }
     }
 }

The execution pattern of this code is going to be:

image

The key point is that we start async I/O, but we aren't going to await of it immediately. Instead, we are going to do some other work first (processing the current line while we fetch & write the next line).

In other words, when we schedule the next bit of I/O to be done, we aren't going to ask the system to find us some other piece of work to execute, we are the next piece of work to execute.

Nitpicker corner: This code isn't actually likely to have this usage pattern, this code is meant to illustrate a point.

time to read 7 min | 1393 words

I got the following feedback on this post on Twitter.

image

Using trie for routing isn't new by a long shot. It is a fairly obvious optimization. But not all tries are equal to one another. I decided to see what it would look like using Nancy.

Here is the code I run:

 class Program
 {
     static void Main(string[] args)
     {
         using (var host = new NancyHost(new Uri("http://localhost:8080")))
         {
             host.Start();
             Console.ReadLine();
         }
     }
 }

 public class SampleModule : Nancy.NancyModule
 {
     public SampleModule()
     {
         Get["/"] = _ => "Hello World!";
     }
 }

And like before, we tested this by running it in Release mode using the following gobench command:

.\gobench.exe -c 100 -r 25000 -u http://localhost:8080/

The results weren't good.

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             23,364 hits/sec
Read throughput:                   3,785,046 bytes/sec
Write throughput:                  2,009,345 bytes/sec
Test time:                               107 sec

On the same machine, similar tests using WebAPI gave me a 56,818 hits/sec and using my routing trie we get 86,206 hits/sec. Nancy also seems to be using a lot more CPU than the other two solutions.

Now, we could leave it like that, but that wouldn't be fair. Let us dig a little bit further and see if we can figure out what is costing us there?

I run the same scenario under the profiler, and we got some interesting results.

image

In terms of percentage, it compares favorably to my trie's results. 1.91% compared to 1.02%. But it terms of absolute time (same load, both running under the profiler), we see 64.7 seconds for 2.5 million routes in the nancy implementation vs. 2.2 seconds for 2.5 million routes in my implementation. We'll ignore the rest of the costs of Nancy, because there is quite a lot here that doesn't matter. Let us focus specifically on the routing trie, shall we?

image

Now, here is the really sad thing. This is running under a profiler, which slows things down tremendously.

Under those conditions, this code still manage to process a route in just 25 μs. That is pretty awesome, regardless of how you look at it. And for the vast majority of applications, that is way more than what you'll see in most applications. The moment you start actually doing stuff, you aren't even in the same territory. But one of the things that we have been focusing on is the need to speed RavenDB up by [not ready to divulge yet, but let us say… a lot]. So we pay attention to every little bit.

Now, let us look in detail on this code in GetMatches, which at the time of this writing looked something like this:

return this.routeTries[method].GetMatches(path.Split(splitSeparators, StringSplitOptions.RemoveEmptyEntries), context)
                                          .ToArray();

I'm actually going to ignore the code we see here and just talk about the profiler output, because in this case, the code is misleading.

We can see a whole bunch of things that would fail our current code review. Here is a commonly called method that does a whole bunch of expensive allocations.

On each call, we:

  • Split the string – which is both expensive in computation  and creates multiple sub strings and the array to contain them.
  • Create two dictionaries:
    • one is done inside the TrieNode.GetMatches call:
      return this.GetMatches(segments, 0, new Dictionary<string, object>(this.AdditionalParameters), context);
    • one is done inside the TrieNode.BuildResults call:
      var parameters = new Dictionary<string, object>(capturedParameters);
  • Call ToArray – which create another allocation for the results.
  • Uses Any & Select – which often requires allocation of a lambda instance.

There are also the dictionary lookup (which is actually done twice, once using ConstainsKey, then using the indexer):

if (!this.routeTries.ContainsKey(method))
{
    return MatchResult.NoMatches;
}

return this.routeTries[method] /* removed for clarity */;

Note that just replacing this with a TryGetValue is likely to give us about 8% improvement in this method.

It actually gets better, looking at the code, it looks like we hit the code path for the root route, while other routes need to do more work. I changed the route to be Get["/values"], and run the benchmark under profiler again.

The only change I made is that now I have a non root route, the cost for routing is up by about 75%:

image

 

Now, I'm being utterly unfair here. NancyFX actually has a really sophisticated routing system, it does a lot. I mean, really a lot.

My routing trie, on the other hand, is intentionally dumb and require the request handling code to participate in some decisions. This is done intentionally, because it gives me the chance to do a whole bunch of optimizations cheaply. To start with, I don't need to do any allocations whatsoever throughout the call, and the request handling code can also skip string allocations in most cases.

This isn't an issue of algorithms. It is an issue of what you are trying to do.

For comparisons, doing the same amount of work in my current version (which was even further optimized), give us this:

image

You can see what the cost of routing is for us. While the optimized root route cost for Nancy is 25 μs, we are able to route the request in 0.23 μs, that is two orders of magnitudes faster. My entire routing cost can fit 10 times in the cost of a single dictionary lookup in the NancyFX case.

So, to answer the original question, no, I really couldn't just use NancyFX code to get "this optimization". The algorithm is important, but the actual implementation details are far more interesting and relevant.

time to read 3 min | 480 words

Typically when people talk about dependencies they talk about how easy it is to version them, deploy them, change & replace them, etc. There seems to be a very deep focus on the costs of dependencies during development.

Today I want to talk about another aspect of that. The cost of dependencies when you have a crisis. In particular, consider the case of having a 2 AM support call that is rooted to one of your dependencies. What do you do then?

The customer see a problem in your software, so they call you, and you are asked to resolve it. After you narrowed the problem down to a particular dependency, you now need to check whatever this is your usage of the dependency that is broken, or whatever there is a genuine issue with the dependency.

Let us take a case in point with a recent support call we had. When running RavenDB on a Windows Cluster with both nodes sharing the same virtual IP, authentication doesn’t work. It took us a while to narrow it down to Windows authentication doesn’t work, and that is where we got stuck. Windows authentication is a wonderfully convenient  tool, but if there is an error, just finding out about it require specialized knowledge and skills. After verifying that our usage of the code looked correct, we ended up writing a minimal reproduction with about 20 lines of code, which also reproduced the issue.

At that point, we were able to escalate to Microsoft with the help of the customer. Apparently this is a Kerberos issue and you need to use NTLM and there was a workaround with some network configuration (check our docs if you really care about the details). But the key point here is that we would really have absolutely no way to figure it out on our own. Our usage of Windows authentication was according to the published best practices, but in this scenario you had to do something different to get it to work.

The point here is that if we weren’t able to escalate that to Microsoft, we would be in a pretty serious issue with the customer “we can’t fix this issue” is something that no one wants to hear.

As much as possible, we try to make sure that any dependencies that we take are either:

  • Stuff that we wrote and understand.
  • Open source* components that are well understood.
  • Have a support contract that we can fall back on, with the SLA we require.
  • Non essential / able to be disabled without major loss of functionality.

* Just taking OSS component from some GitHub repo is a bad idea. You need to be able to trust them, which means that you need to be sure that you can go into the code and either fix things or understand why they are broken.

time to read 8 min | 1522 words

As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.

The result was decidedly not what we expected. On average we saw about 10% – 30% reduction in speed, just from the use of aysnc operations. So we decided to test this.

The test is simple, make a read of a large file (1.4GB) from a network drive without buffering. The synchronous code is:

 private static void SyncWork(int pos)
 {
     var sp = Stopwatch.StartNew();
     var buffer = new byte[1024 * 4];
     long size = 0;
     using (var sha = SHA1.Create())
     using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
         FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
     {
         stream.Seek(pos * ReportSize, SeekOrigin.Begin);
         int read;
         while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
         {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
         }
     }
 }

To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.

for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}

The basic idea was to simulate work coming in, doing different things, and to simulate slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we’ll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).

The output of this code is:

Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243

Now, let us look at the equivalent async code:

private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
        }
    }
}

Note that here I’m using async handle, to allow for better concurrency. My expectation was that this code will interleave I/O and CPU together and result in less context switches, more CPU utilization and overall faster responses.

Here is the network utilization during the async test:

image

And here is the network utilization during the sync test:

image

Trying the async using 64Kb buffers gives better results:

image

And output of:

Read 32 mb async  8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129

So it is significantly worse than the sync version when using 4KB buffers. The bad thing is that when using 64Kb buffer in the sync version, we have:

image

And the whole process completed in about 2 seconds.

I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.

Short summary, the solution is throw all of async code way in favor of pure sync code, because it is so much faster. Banish async, all hail to the synchronous overload.

However, the plot thickens!

Before before declaring death to asynchronicity, with thunderous applause, I decided to look further into things, and pulled out my trusty profiler.

Here is the sync version:

image

As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:

image

This is interesting, because no I/O actually occurs here. At first I thought that this is because we are only using async I/O, so all of the missing time (notice that this is just 625 ms) is lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.

Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:

image

And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I accidently opened the file in sync mode, turning the async call into a sync call. That didn’t explain the different in costs from the sync version, through, and I verified that the calls are actually async.

Then I looked deeper:

image

Why do we have so many seeks?

The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.

Calling SetFilePointer is typically very fast, since the OS just need to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it need to communicate with the remote share to update it on its position. The sad thing is that this is all wasted anyway, since the file position isn’t used in async calls, each actual call to ReadFileNative will be given the offset to read there.

time to read 4 min | 764 words

In my previous post, I discussed the cost of routing in MVC vs. our own implementation. The MVC routing / infrastructure consumed about 90% of the time, while the code that actually does stuff is left with less than 10%. In our implementation, the actual request handling code gets about 85% of the time, while all the infrastructure is handled in the other 15%.

Of that , a large portion of that time is actually spent on routing. In our tests, we saw something like 40% of the time spent in selecting the right controller action to execute, this is when there is just one such action that can be run. Now, to be fair, this is a scenario where we don’t actually have any meaningful logic. We are just saying “hello world” over and over again, so the infrastructure costs are really noticeable in this benchmark. But that is the point, we want to see what are the fixed costs of the infrastructure we use.

Now, in RavenDB we have several hundreds routes (last count was something upward of 400), and we noticed that routing take a lot of time to select the right action to run. We optimized the hell out of that to the point by specializing that to our own situation and based on our own knowledge. But when we started to look at moving our code to Kestrel, we took the time to move to a system that is a better fit for us.

The major reason routing is such an expensive issue in MVC is that it does quite a lot. It needs to handle route selection, capturing values, coercing data to the right types and a lot more. All of that takes time, memory and cpu cycles. But as it turn out, we don’t really need all of that. We have relatively simple routing needs.

Here are a few example of RavenDB routes:

  • /admin/stats
  • /databases/{databaseName}/docs
  • /databases/{databaseName}/indexes/{*id}

So we have the fixed routes, like /admin/stats. We have routes that contain a database name in the middle, and we have routes with a suffix. We can take advantage of that to create an optimal solution.

In our case, we used a trie:

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.

The cost of querying a trie is O(L), where L is the length of the key that we are querying. Here is how a very small trie looks in our tests.

image

We start from the route, matching each character against the trie node key. Once we have consumed the full key, we check if it has any children starting with the next character in the url. Note that the Children array here is of size 127. This allows us to index directly into the array (at cost of O(1)) to find the next trie node. We didn’t want to use a dictionary here because of its size and its cost relative to an array access.

This has some interesting implications:

  • The route definition is limited to ASCII characters (but the actual urls are not, mind).
  • Regardless of the number of routes, we are ensured of reliable performance.

But that still doesn’t let us handle routes like the ones above. In order to do that, we had to extend the syntax a bit.

/admin/stats would match exactly. That is easiest.

/databases/*/docs is a route that has a capture. The * symbol says “matches anything but /”, so we can get the database name captured without complex matching. Note that by capture, I mean just record the start position and length, we don’t want to pull a substring out of that and pay for an extra allocation.

/databases/*/indexes/$ is another route that has special meaning. The $ symbol says “the match is done, even if the url has additional characters”. That way we can capture embedded ids easily (again, no allocations).

The end result?

image

Or roughly a route match in 0.9 μs. Pretty good, even if I say so myself.

time to read 8 min | 1565 words

I was asked to see what is the cost of WebAPI on Kestrel, and it lines up nicely with a topic that I wanted to talk about, routing. So I thought I’ll start with a small benchmark.

Here is a small Kestrel based server, using MVC / WebAPI:

public class Program
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app, ILoggerFactory loggerfactory)
    {
        app.UseMvc();
    }

    public static void Main(string[] args)
    {
        var configBuilder = new ConfigurationBuilder()
            .Add(new MemoryConfigurationProvider(new Dictionary<string, string>
            {
                ["server.urls"] = "http://localhost:8080"
            }));

        IHostingEngine application = new WebHostBuilder(configBuilder.Build())
            .UseStartup<Program>()
            .UseServer("Microsoft.AspNet.Server.Kestrel")
            .Build();


        using (application.Start())
        {
            Console.WriteLine("ready");
            Console.ReadLine();
        }
    }
}

And here is a simple controller:

[Route("api/[controller]")]
public class ValuesController : Controller
{
    // GET api/values
    [HttpGet()]
    public string Get()
    {
        return "Hello World";
    }
}

As you can see, this couldn’t really be any more trivial. I have tested this with release mode, using the following gobench command:

.\gobench.exe -c 100 -r 25000 -u http://localhost:8080/api/values

This was run after running a couple thousand requests as warm up.  In other words, we are going to do a 2.5 million requests. I intentionally choose that value, because it force the server to do enough work to be meaningful, and the test run is long enough to see other factors creeping in. Note that this isn’t a “real” benchmark, both gobench and this code are actually running on the same machine. But it is a good baseline.

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             56,818 hits/sec
Read throughput:                   9,261,363 bytes/sec
Write throughput:                  5,454,545 bytes/sec
Test time:                                44 sec

Those are good numbers. Now let us see that with my code:

[RavenAction("/api/values", "GET")]
public Task ValuesGet()
{
    return HttpContext.Response.WriteAsync("Hello world");
}

There is no funny business here. We just route the request using my code, and then just write it to the client. The results:

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             86,206 hits/sec
Read throughput:                  10,517,241 bytes/sec
Write throughput:                  8,275,862 bytes/sec
Test time:                                29 sec

That is over 25% improvement.

Now, there are a couple of things that works out better for my code. To start with, it is calling Response.WriteAsync directly, while the standard WebAPI code needs to get the result, figure out how to write it, etc.

In the profiler, this looks really interesting:

image

So the time for actually executing the action is less than 10% of the request time. Everything else is some sort of setup.

In fact, let us dig deeper, the time to run our action is 0.2% of the 10% that this method runs:

image

To be fair, the method just returns a constant string. I would be shocked if it wasn’t really fast.

Most of the time happens in the CreateActionResult, in which we select how we write. Let us look at this a bit differently:

image

The InvokeExceptionFilterAsync is the part responsible for actually calling our code and getting the result.

The InvokeAllResultFiltersAsync is responsible for figuring out how to write the output to the client. And this is the part where we actually write to the network:

image

And here is how we do in the same scenario:

image

You can note that 85% of the time we are spending in servicing the request, with just 15% dedicated to managing the infrastructure.

I’ll discuss the details of that in my next post, though.

time to read 9 min | 1629 words

In my previous post, I have discussed how using Smaz-like compression, I can achieve a smaller (and more professional looking) license key.

Here is the original license:

{
  "id": "cd6fff02-2aff-4fae-bc76-8abf1e673d3b",
  "expiration": "2017-01-17T00:00:00.0000000",
  "type": "Subscription",
  "version": "3.0",
  "maxRamUtilization": "12884901888",
  "maxParallelism": "6",
  "allowWindowsClustering": "false",
  "OEM": "false",
  "numberOfDatabases": "unlimited",
  "fips": "false",
  "periodicBackup": "true",
  "quotas": "false",
  "authorization": "true",
  "documentExpiration": "true",
  "replication": "true",
  "versioning": "true",
  "maxSizeInMb": "unlimited",
  "ravenfs": "true",
  "encryption": "true",
  "compression": "false",
  "updatesExpiration": "2017-Jan-17",
  "name": "Hibernating Rhinos"
}

And here is the resulting output:

0172020007 93050A0B28 0D0D0D682C 24080D0D2C
260D080C2C 090A29282C 2A08090D23 0C2829250B
2509060C1F 171019081B 1016150587 2C672C7795
5422220322 2203222295 2E22222222 2222220692
058F064407 4A0537064B 0528064C8C 4D8C4E0549
065A8C4F8D 528C538D54 8D558D568D 5705490659
8D508D518C 5805872C5B 2C77069105 954810090C
1915081B10 150E962052 0F1015161A 069553100E
15081B1C19 0C05954511 740C1C984D 4B4641270B
95560F1608 209841492F 4108962B58 0B9556120C
95590C954C 2195441695 5623954C29 2695482009
1D1C97514D 2B1117974E 492B150E96 3D3D070157
A9E859DD06 1EE0EC6210 674ED4CA88 C78FC61D20
B1650BF992 978871264B 57994E0CF3 EA99BFE9G1

It occurred to me that I was approaching it in completely the wrong direction. Instead of trying to specialize a generic solution, why not actually create a dedicated solution.

I started by defining the license terms:

public static string[] Terms =
{
    "id","expiration","type","version","maxRamUtilization","maxParallelism",
    "allowWindowsClustering","OEM","numberOfDatabases","fips","periodicBackup",
    "quotas","authorization","documentExpiration","replication","versioning",
    "maxSizeInGb","ravenfs","encryption","compression","updatesExpiration",
};

Then I observed that I really only had five date types: boolean, int, date, guid and string. And that I only had (right now) 21 terms to work with. String is problematic, and we only ever use it for either the name on the license or for enum values (such as the type of the license). We can switch to using numeric values for enums, and we'll ignore the name for now. The dates we have are just that, dates, we don't need timing information. We also have a relatively small range of valid dates. From now to (lets be generous) 100 years from now. Finally, our integers are mostly very small. Version numbers and the like. All except the maxRamUtilization, which is in bytes. I switch that one to GB, which gives us small numbers again. We also have values that are numeric, but can have the string "unlimited", we'll use value 0 as the magic value to say unlimited.

What is the point of all of that the data we need to store is:

  • Booleans
  • Integers
  • Dates

Since we want to conserve space, we'll limit the integers to byte size (pun intentional Smile) and we'll store the dates in DOS format:

image

This means that we can pack a date (but not time, mind) into two bytes.

So the format is going to be pretty simple:

[index into the terms, type of value, value]

However, since we have so few value types, we can do better and actually store them as bits.

So the first 5 bits in the token would be the terms index (which limits us to 32 terms, since we currently only have 21, I'm fine with that) and the last 3 bits are used for the type of the value. I'm using the following types:

image

Note that we have two definitions of boolean value, true & false. The idea is that we can use a single byte to both index into the terms and specify what the actual value is. Since a lot of our properties are boolean, that saves quite a lot of space.

The rest of the format is pretty easy, and the whole thing can be done with the following code:

 var ms = new MemoryStream();
 var bw = new BinaryWriter(ms);
 foreach (var attribute in attributes)
 {
     if (attribute.Value == null)
         throw new InvalidOperationException("Cannot write a null value");

     var index = Array.IndexOf(Terms, attribute.Key);
     if (index == -1)
         throw new InvalidOperationException("Unknown term " + attribute.Key);


     if (attribute.Value is bool)
     {
         var type = (byte)((bool)attribute.Value ? ValueType.True : ValueType.False) << TypeBitsToShift;
         bw.Write((byte)((byte)index | type));
         continue;
     }
     if (attribute.Value is DateTime)
     {
         bw.Write((byte)((byte)index | ((byte)ValueType.Date << TypeBitsToShift)));
         var dt = (DateTime)(attribute.Value);
         bw.Write(ToDosDateTime(dt));
         continue;
     }
     if (attribute.Value is int || attribute.Value is long)
     {
         var val = Convert.ToByte(attribute.Value);
         bw.Write((byte)((byte)index | ((byte)ValueType.Int << TypeBitsToShift)));
         bw.Write((byte)val);
         continue;
     }

     throw new InvalidOperationException("Cannot understand type of " + attribute.Key + " because it is " + attribute.Value.GetType().FullName);
 }

Nothing truly interesting, I'll admit. But it means that I can pack this:

{
  "expiration": "2017-01-17T00:00:00",
  "type": 1,
  "version": 3,
  "maxRamUtilization": 12,
  "maxParallelism": 6,
  "allowWindowsClustering": false,
  "OEM": false,
  "numberOfDatabases": 0,
  "fips": false,
  "periodicBackup": true,
  "quotas": false,
  "authorization": true,
  "documentExpiration": true,
  "replication": true,
  "versioning": true,
  "maxSizeInGb": 0,
  "ravenfs": true,
  "encryption": true,
  "compression": false,
  "updatesExpiration": "2017-01-17T00:00:00"
}

Into 30 bytes.

Those of you with sharp eyes might have noticed that we dropped two fairly important fields. The license id and the name for whom the license is issued to. Let us deal with the name first.

Instead of encoding the name inside the license, we can send it separately. The user will have two fields to enter, the name and the actual license key.  But the 30 bytes we compacted the license attributes into aren't really useful. Anyone can generate them, after all. What we need it to sign them, and we do that using DSA public key.

Basically, we take the license attributes that we built, concat them with the name's bytes, and then generate a digital signature for that. Then we just add that to the license. Since DSA signature is 40 bytes in size, it means that our license has ballooned into whooping 70 bytes.

Using base 64, we get the following license key:

Hibernating Rhinos

YTFKQgFDA0QMRQYGB0gACSoLLC0uL1AAMTITdDFKTDd2c6+XrGQW/+wEvo5YUE7g55xPC+FS94s7rUmKOto8aWo/m7+pSg==

And now that looks much more reasonable. This also explains why we dropped the license id. We don't need it anymore. The license itself (short & compact) gives us as good a way to refer to the license as the license id used to be, and it isn't much longer.

For clarity's sake it might be clearer to understand if we split this into separate fields:

{
  "Name": "Hibernating Rhinos",
  "Options": "YTFKQgFDA0QMRQYGB0gACSoLLC0uL1AAMTITdDFK",
  "Signature": "LAfgqs3MPzfRERCY+DWjZoso95lh+AzmOdt2+fC+p2TgC16hWKDESw=="
}

I'll admit that I went a bit overboard here and started doing all sort of crazy things here. For example, here is another representation of the same license scheme:

test

For mobile developers, I think that this would be an excellent way to enter the product registration Smile.

And here is the code, if you want to go deep: https://gist.github.com/ayende/6ecb9e2f4efb95dd98a0

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}