Oren Eini
CEO of RavenDB, a NoSQL Open Source Document Database

time to read 9 min | 1642 words

The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.

The problem was that it took time - many days or multiple weeks - for us to observe this. But we had the charts to prove that the trend was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade over time.

The fact that the customer didn’t notice this is an interesting story on its own. RavenDB will automatically prioritize the fastest node in the cluster to be the “customer-facing” one, which alleviated the issue to such an extent that the metrics the customer usually monitors looked fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted the users’ attention.

I hate these sorts of issues because they are really hard to figure out and subject to basically every caveat under the sun. In this case, we had exactly nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were all acceptable. There was no obvious starting point for the investigation.

Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?

After a long while, we got a good handle on what was actually going on. When RavenDB starts, it creates memory maps of the files it is working with. Over time, RavenDB will map, unmap, and remap those files as needed. A process that has been running for a long while, with many databases and indexes operating, will have done a lot of memory-mapping work.

In Linux, you can inspect those details by running:


$ cat /proc/22003/smaps


600a33834000-600a3383b000 r--p 00000000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 28 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  28 kB
Pss:                  26 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        24 kB
Private_Dirty:         0 kB
Referenced:           28 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd mr mw me dw
600a3383b000-600a33847000 r-xp 00006000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 48 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  48 kB
Pss:                  46 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           48 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd ex mr mw me dw

Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be really invaluable for figuring out certain types of problems. In the snippet above, you can see that we have some executable memory ranges mapped, for example.

The problem is that over time, memory becomes fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.

Here is the result of running perf top on the system; you can see that the top three items hogging most of the resources are all related to smaps accounting.

This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects. Consider that while we are scanning through all the memory mappings, we may also need to change the memory mappings for the process. That leads to contention on the kernel locks that protect those mappings, of course.

It’s expensive to generate the smaps file

Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.

When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.

If the number of memory mappings is high, and the frequency with which we monitor is short… I hope you can see where this is going. We effectively spent so much time running over this file that we blocked other operations.

This wasn’t an issue when we just started the process, because the number of memory mappings was small, but as we worked on the system and the number of memory mappings grew… we eventually started hitting contention.

The solution had a few parts. We made sure that there is only ever a single thread reading the information from smaps (previously, reads might have been triggered from multiple locations). We added throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information when needed). And we switched from smaps to smaps_rollup, which provides much better performance since it deals with summary data only.
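
To make the shape of the fix concrete, here is a minimal sketch of a throttled, single-reader cache over smaps_rollup (an illustration only, not RavenDB’s actual implementation; the path and the refresh interval are assumptions):

using System;
using System.IO;

public static class SmapsRollupReader
{
    private static readonly object _lock = new();
    private static string _cached = string.Empty;
    private static DateTime _lastRead = DateTime.MinValue;

    // Minimum time between actual reads of the kernel file (assumed value).
    private static readonly TimeSpan _minInterval = TimeSpan.FromMinutes(1);

    public static string Read()
    {
        // A single lock ensures only one caller ever touches the file;
        // everyone else gets the cached result.
        lock (_lock)
        {
            if (DateTime.UtcNow - _lastRead < _minInterval)
                return _cached;

            // smaps_rollup returns a single summarized entry instead of
            // one entry per memory mapping, so the kernel does far less work.
            _cached = File.ReadAllText("/proc/self/smaps_rollup");
            _lastRead = DateTime.UtcNow;
            return _cached;
        }
    }
}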

With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.

time to read 3 min | 457 words

We ran into a memory issue recently in RavenDB, which had a pretty interesting root cause. Take a look at the following code and see if you can spot what is going on:


ConcurrentQueue<Buffer> _buffers = new();

void FlushUntil(long maxTransactionId)
{
    List<Buffer> toFlush = new();
    while (_buffers.TryPeek(out var buffer) &&
           buffer.TransactionId <= maxTransactionId)
    {
        if (_buffers.TryDequeue(out buffer))
        {
            toFlush.Add(buffer);
        }
    }

    FlushToDisk(toFlush);
}

The code handles flushing data to disk based on the maximum transaction ID. Can you see the memory leak?

If we have a lot of load on the system, this will run just fine. The problem is when the load is over. If we stop writing new items to the system, it will keep a lot of stuff in memory, even though there is no reason for it to do so.

The reason for that is the call to TryPeek(). You can read the source directly, but the basic idea is that when you peek, you have to guard against concurrent TryTake(). If you are not careful, you may encounter something called a torn read.

Let’s try to explain it in detail. Suppose we store a large struct in the queue and call TryPeek() and TryTake() concurrently. The TryPeek() starts copying the struct to the caller at the same time that TryTake() does the same and zeros the value. So it is possible that TryPeek() would get an invalid value.

To handle that, if you are using TryPeek(), the queue will not zero out the values. This means that until that queue segment is completely full and a new one is generated, we’ll retain references to those buffers, leading to an interesting memory leak.
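
To see the retention in action, here is a small repro sketch (this relies on internal details of ConcurrentQueue, so treat it as illustrative and run it in Release mode):

using System;
using System.Collections.Concurrent;

var queue = new ConcurrentQueue<byte[]>();

var buffer = new byte[1024];
var weakRef = new WeakReference(buffer);
queue.Enqueue(buffer);
buffer = null;

// TryPeek marks the current segment as "preserved for observation",
// so the following dequeue will not clear its slot.
queue.TryPeek(out _);
queue.TryDequeue(out _);

GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();

// Expected to print True: the dequeued buffer is still rooted by the
// queue segment, even though the queue is logically empty.
Console.WriteLine(weakRef.IsAlive);
GC.KeepAlive(queue); // keep the queue (and its segment) alive across the collection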

time to read 15 min | 2973 words

RavenDB is a transactional database, we care deeply about ACID. The D in ACID stands for durability, which means that to acknowledge a transaction, we must write it to a persistent medium. Writing to disk is expensive, writing to the disk and ensuring durability is even more expensive.

After seeing some weird performance numbers on a test machine, I decided to run an experiment to understand exactly how durable writes affect disk performance.

A few words about the term durable writes. Disks are slow, so we use buffering & caches to avoid going to the disk. But a write to a buffer isn’t durable. A failure could cause it to never hit a persistent medium. So we need to tell the disk in some way that we are willing to wait until it can ensure that this write is actually durable.

This is typically done using either fsync or O_DIRECT | O_DSYNC flags. So this is what we are testing in this post.

I wanted to test things out without any of my own code, so I ran the following benchmark.

I pre-allocated a file and then ran the following commands.

Normal writes (buffered) with different sizes (256 KB, 512 KB, etc).


dd if=/dev/zero of=/data/test bs=256K count=1024
dd if=/dev/zero of=/data/test bs=512K count=1024

Durable writes (force the disk to acknowledge them) with different sizes:


dd if=/dev/zero of=/data/test bs=256K count=1024 oflag=direct,sync
dd if=/dev/zero of=/data/test bs=512K count=1024 oflag=direct,sync

The code above opens the file using:


openat(AT_FDCWD, "/data/test", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_DIRECT, 0666) = 3

I got myself an i4i.xlarge instance on AWS and started running some tests. That machine has a local NVMe drive of about 858 GB, 32 GB of RAM, and 4 cores. Let’s see what kind of performance I can get out of it.

Write size    Total writes    Buffered writes

256 KB        256 MB          1.3 GB/s
512 KB        512 MB          1.2 GB/s
1 MB          1 GB            1.2 GB/s
2 MB          2 GB            731 MB/s
8 MB          8 GB            571 MB/s
16 MB         16 GB           561 MB/s
2 MB          8 GB            559 MB/s
1 MB          1 GB            554 MB/s
4 KB          16 GB           557 MB/s
16 KB         16 GB           553 MB/s

What you can see here is that writes are really fast when buffered. But once the total volume crosses a certain size (above 1 GB or so), we probably start having to write to the disk itself (which is NVMe, remember). Our top speed is about 550 MB/s at that point, regardless of the size of the buffers I’m passing to the write() syscall.

I’m writing here using cached I/O, which is something that as a database vendor, I don’t really care about. What happens when we run with direct & sync I/O, the way I would with a real database? Here are the numbers for the i4i.xlarge instance for durable writes.

Write size    Total writes    Durable writes

256 KB        256 MB          1.3 GB/s
256 KB        1 GB            1.1 GB/s
16 MB         16 GB           584 MB/s
64 KB         16 GB           394 MB/s
32 KB         16 GB           237 MB/s
16 KB         16 GB           126 MB/s

In other words, when using direct I/O, the smaller the write, the more time it takes. Remember that we are talking about forcing the disk to write the data, and we need to wait for it to complete before moving to the next one.

For 16 KB writes, buffered writes achieve a throughput of 553 MB/s vs. 126 MB/s for durable writes. This makes sense, since those writes are cached, so the OS is probably sending big batches to the disk. The numbers we have here clearly show that bigger batches are better.

My next test was to see what would happen when I try to write things in parallel. In this test, we run 4 processes that write to the disk using direct I/O and measure their output.

I assume that I’m maxing out the throughput on the drive, so the total rate across all commands should be equivalent to the rate I would get from a single command.

To run this in parallel I’m using a really simple mechanism - just spawn processes that would do the same work. Here is the command template I’m using:


parallel -j 4 --tagstring 'Task {}' dd if=/dev/zero of=/data/test bs=16M count=128 seek={} oflag=direct,sync ::: 0 1024 2048 3072

This would write to 4 different portions of the same file, but I also tested that on separate files. The idea is to generate a sufficient volume of writes to stress the disk drive.

Write size    Total writes    Durable & parallel writes

16 MB         8 GB            650 MB/s
16 KB         64 GB           252 MB/s

I also decided to write some low-level C code to test out how this works with threads and a single program. You can find the code here.  I basically spawn NUM_THREADS threads, and each will open a file using O_SYNC | O_DIRECT and write to the file WRITE_COUNT times with a buffer of size BUFFER_SIZE.

This code just opens a lot of files and tries to write to them using direct I/O with 8 KB buffers. In total, I’m writing 16 GB (128 MB x 128 threads) to the disk. I’m getting a rate of about 320 MB/sec when using this approach.

As before, increasing the buffer size seems to help here. I also tested a version where we write using buffered I/O and call fsync every now and then, but I got similar results.
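
For reference, the buffered-writes-plus-periodic-fsync idea can be sketched like this in C# (a sketch only, not the code used for the test above; the file path and the 64 MB flush interval are assumptions):

using System.IO;

// Buffered writes, forced to disk every so often via Flush(flushToDisk: true),
// which translates to fsync / FlushFileBuffers under the hood.
var payload = new byte[8 * 1024];
const long flushEvery = 64 * 1024 * 1024; // assumed interval
long sinceFlush = 0;

using var file = new FileStream("/data/test", FileMode.Open, FileAccess.Write);
for (int i = 0; i < 16 * 1024; i++)
{
    file.Write(payload, 0, payload.Length);
    sinceFlush += payload.Length;
    if (sinceFlush >= flushEvery)
    {
        file.Flush(flushToDisk: true);
        sinceFlush = 0;
    }
}
file.Flush(flushToDisk: true);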

The interim conclusion that I can draw from this experiment is that NVMes are pretty cool, but once you hit their limits you can really feel it. There is another aspect to consider, though: I’m running this on a disk that is literally called ephemeral storage. I need to repeat those tests on real hardware to verify whether the cloud disk simply ignores the command to persist properly and always uses the cache.

That is supported by the fact that using direct & sync I/O on small data sizes didn’t have a big impact (and I expected it would). Given that the point of direct I/O in this case is to force the disk to properly persist (so the data would be durable in the case of a crash), while at the same time an ephemeral disk is wiped if the host machine is restarted, I have good reason to believe that these numbers come from hardware that “lies” to me.

In fact, if I were in charge of those disks, lying about the durability of writes would be the first thing I would do. Those disks are local to the host machine, so we have two failure modes that we need to consider:

  • The VM crashed - in which case the disk is perfectly fine and “durable”.
  • The host crashed - in which case the disk is considered lost entirely.

Therefore, there is no point in trying to achieve durability, so we can’t trust those numbers.

The next step is to run it on a real machine. The economics of benchmarks on cloud instances are weird. For a one-off scenario, the cloud is a godsend. But if you want to run benchmarks on a regular basis, it is far more economical to just buy a physical machine. Within a month or two, you’ll already see a return on the money spent.

We got a machine in the office called Kaiju (a Japanese term for enormous monsters, think: Godzilla) that has:

  • 32 cores
  • 188 GB RAM
  • 2 TB NVMe for the system disk
  • 4 TB NVMe for the data disk

I ran the same commands on that machine as well and got really interesting results.

Write size    Total writes    Buffered writes

4 KB          16 GB           1.4 GB/s
256 KB        256 MB          1.4 GB/s
2 MB          2 GB            1.6 GB/s
2 MB          16 GB           1.7 GB/s
4 MB          32 GB           1.8 GB/s
4 MB          64 GB           1.8 GB/s

We are faster than the cloud instance, and we don’t have a drop-off point when we hit a certain size. We are also seeing higher performance when we throw bigger buffers at the system.

But when we test with small buffers, the performance is also great. That is amazing, but what about durable writes with direct I/O?

I tested the same scenario with both buffered and durable writes:

Mode                         Buffered    Durable

1 MB buffers, 8 GB write     1.6 GB/s    1.0 GB/s
2 MB buffers, 16 GB write    1.7 GB/s    1.7 GB/s

Wow, that is an interesting result. Because it means that when we use direct I/O with 1 MB buffers, we lose about 600 MB/sec compared to buffered I/O. Note that this is actually a pretty good result. 1 GB/sec is amazing.

And if you use big buffers, then the cost of direct I/O is basically gone. What about when we go the other way around and use smaller buffers?

Mode                           Buffered    Durable

128 KB buffers, 8 GB write     1.7 GB/s    169 MB/s
32 KB buffers, 2 GB write      1.6 GB/s    49.9 MB/s
Parallel: 8, 1 MB, 8 GB        5.8 GB/s    3.6 GB/s
Parallel: 8, 128 KB, 8 GB      6.0 GB/s    550 MB/s

For buffered I/O - I’m getting simply dreamy numbers, pretty much regardless of what I do 🙂.

For durable writes, the situation is clear. The bigger the buffer we write, the better we perform, and we pay for small buffers. Look at the numbers for 128 KB in the durable column for both single-threaded and parallel scenarios.

169 MB/s in the single-threaded result, but with 8 parallel processes, we didn’t reach 1.3 GB/s (which is 169x8). Instead, we achieved less than half of our expected performance.

It looks like there is a fixed cost for making a direct I/O write to the disk, regardless of the amount of data that we write.  When using 32 KB writes, we are not even breaking into the 200 MB/sec. And with 8 KB writes, we are barely breaking into the 50 MB/sec range.

Those are some really interesting results because they show a very strong preference for bigger writes over smaller writes.

I also tried using the same C code as before. As a reminder, we use direct I/O to write to 128 files in batches of 8 KB, writing a total of 128 MB per file. All of that is done concurrently to really stress the system.

When running iotop in this environment, we get:


Total DISK READ:         0.00 B/s | Total DISK WRITE:       522.56 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     567.13 M/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
 142851 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142901 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142902 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142903 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142904 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
... redacted ...

So each thread is getting about 4.09 MB/sec for writes, but we total 522 MB/sec across all writes. I wondered what would happen if I limited it to fewer threads, so I tried with 16 concurrent threads, resulting in:


Total DISK READ:         0.00 B/s | Total DISK WRITE:        89.80 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     110.91 M/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
 142996 be/4 kaiju-1     0.00 B/s    5.65 M/s ./a.out
 143004 be/4 kaiju-1     0.00 B/s    5.62 M/s ./a.out
 142989 be/4 kaiju-1     0.00 B/s    5.62 M/s ./a.out
... redacted ..

Here we can see that each thread is getting (slightly) more throughput, but the overall system throughput is greatly reduced.

To give some context, with 128 threads running, the process wrote 16GB in 31 seconds, but with 16 threads, it took 181 seconds to write the same amount. In other words, there is a throughput issue here. I also tested this with various levels of concurrency:

Concurrency (8 KB x 16K writes = 128 MB per thread)    Throughput per thread    Time / MB written

1                                                      15.5 MB/sec              8.23 seconds / 128 MB
2                                                      5.95 MB/sec              18.14 seconds / 256 MB
4                                                      5.95 MB/sec              20.75 seconds / 512 MB
8                                                      6.55 MB/sec              20.59 seconds / 1024 MB
16                                                     5.70 MB/sec              22.67 seconds / 2048 MB

To give some context, here are a few attempts to write a couple of GB to the disk:

Concurrency    Write                     Throughput     Total written    Total time

16             128 MB in 8 KB writes     5.7 MB/sec     2,048 MB         22.67 sec
8              256 MB in 16 KB writes    12.6 MB/sec    2,048 MB         22.53 sec
16             256 MB in 16 KB writes    10.6 MB/sec    4,096 MB         23.92 sec

In other words, we can see the impact of concurrent writes. There is absolutely some contention at the disk level when making direct I/O writes. The impact is related to the number of writes rather than the amount of data being written.

Bigger writes are far more efficient. And concurrent writes allow you to get more data overall but come with a horrendous latency impact for each individual thread.

The difference between the cloud and physical instances is really interesting, and I have to assume that this is because the cloud instance isn’t actually forcing the data to the physical disk (it doesn’t make sense that it would).

I decided to test that on an m6i.2xlarge instance with a 512 GB io2 disk with 16,000 IOPS.

The idea is that an io2 disk has to be durable, so it will probably have similar behavior to physical hardware.

Disk    Buffer size (KB)    Writes       Durable    Parallel    Total (MB)    Rate (MB/s)

io2     256                 1,024        No         1           256           1,638.40
io2     2,048               1,024        No         1           2,048         1,331.20
io2     4                   4,194,304    No         1           16,384        1,228.80
io2     256                 1,024        Yes        1           256           144.00
io2     256                 4,096        Yes        1           1,024         146.00
io2     64                  8,192        Yes        1           512           50.20
io2     32                  8,192        Yes        1           256           26.90
io2     8                   8,192        Yes        1           64            7.10
io2     1,024               8,192        Yes        1           8,192         502.00
io2     1,024               2,048        No         8           2,048         1,909.00
io2     1,024               2,048        Yes        8           2,048         1,832.00
io2     32                  8,192        No         8           256           3,526.00
io2     32                  8,192        Yes        8           256           150.90
io2     8                   8,192        Yes        8           64            37.10

In other words, we are seeing pretty much the same behavior as on the physical machine, unlike the ephemeral drive.

In conclusion, it looks like the limiting factor for direct I/O writes is the number of writes, not their size. There appears to be some benefit for concurrency in this case, but there is also some contention. The best option we got was with big writes.

Interestingly, big writes are a win, period. For example, 16 MB writes, direct I/O:

  • Single-threaded - 4.4 GB/sec
  • 2 threads - 2.5 GB/sec X 2 - total 5.0 GB/sec
  • 4 threads - 1.4 GB/sec x 4 - total 5.6 GB/sec
  • 8 threads - ~590 MB/sec x 8 - total 4.6 GB/sec

Writing 16 KB, on the other hand:

  • 8 threads - 11.8 MB/sec x 8 - total 93 MB/sec
  • 4 threads - 12.6 MB/sec x 4 - total 50.4 MB/sec
  • 2 threads - 12.3 MB/sec x 2 - total 24.6 MB/sec
  • 1 thread - 23.4 MB/sec

This leads me to believe that there is a bottleneck somewhere in the stack, where we need to handle the durable write, but it isn’t related to the actual amount we write. In short, fewer and bigger writes are more effective, even with concurrency.

As a database developer, that leads to some interesting questions about design. It means that I want to find some way to batch more writes to the disk, especially for durable writes, because it matters so much.

Expect to hear more about this in the future.

time to read 7 min | 1357 words

When building RavenDB, we occasionally have to deal with some ridiculous numbers in both size and scale. In one of our tests, we ran into an interesting problem. Here are the performance numbers of running a particular query 3 times.

First Run: 19,924 ms

Second Run: 3,181 ms

Third Run: 1,179 ms

Those are not good numbers, so we dug into this to try to figure out what is going on. Here is the query that we are running:


from index 'IntFloatNumbers-Lucene' where Int > 0

And the key here is that this index covers 400 million documents, all of which actually match the query (their Int value is greater than 0). So this is a pretty complex task for the database to handle, mostly because of the internals of how Lucene works.

Remember that we provide both the first page of the results and the total number of matches. So we have to go through the entire result set to find out how many items we have. That is a lot of work.

But it turns out that most of the time here isn’t actually processing the query, but dealing with the GC. Here are some entries from the GC log while the queries were running:


2024-12-12T12:39:40.4845987Z, Type: GC, thread id: 30096, duration: 2107.9972ms, index: 25, generation: 2, reason: Induced
2024-12-12T12:39:53.1359744Z, Type: GC, thread id: 30096, duration: 1650.9207ms, index: 26, generation: 2, reason: Induced
2024-12-12T12:40:07.5835527Z, Type: GC, thread id: 30096, duration: 1629.1771ms, index: 27, generation: 2, reason: Induced
2024-12-12T12:40:20.2205602Z, Type: GC, thread id: 30096, duration: 776.24ms, index: 28, generation: 2, reason: Induced

That sound you heard was me going: Ouch!

Remember that this query actually goes through 400M results. Here are the details about its Memory Usage & Object Count:

  • Number of objects for GC (under LuceneIndexPersistence): 190M (~12.63GB)
  • Managed Memory: 13.01GB
  • Unmanaged Memory: 4.53MB

What is going on? It turns out that Lucene handles queries such as Int>0 by creating an array with all the unique values, something similar to:


string[] sortedTerms = new string[190_000_000];
long[] termPostingListOffset = new long[190_000_000];

This isn’t exactly how it works, mind. But the details don’t really matter for this story. The key here is that we have an array with a sorted list of terms, and in this case, we have a lot of terms.

Those values are cached, so they aren’t actually allocated and thrown away each time we query. However, remember that the .NET GC uses a Mark & Sweep algorithm. Here is the core part of the Mark portion of the algorithm:


long _marker;
void Mark()
{
    var currentMarker = ++_marker;

    foreach (var root in GetRoots())
    {
        Mark(root);
    }

    void Mark(object o)
    {
        // already visited
        if (GetMarker(o) == currentMarker)
            return;
        SetMarker(o, currentMarker);

        foreach (var child in GetReferences(o))
        {
            Mark(child);
        }
    }
}

Basically, start from the roots (static variables, items on the stack, etc.), scan the reachable object graph, and mark all the objects in use. The code above is generic, of course (and basically pseudo-code), but let’s consider what the performance will be like when dealing with an array of 190M strings.

It has to scan the entire thing, which means it is proportional to the number of objects. And we do have quite a lot of those.

The problem was the number of managed objects, so we pulled all of those out. We moved the term storage to unmanaged memory, outside the purview of the GC. As a result, we now have the following Memory Usage & Object Count:

  • Number of objects for GC (under LuceneIndexPersistence): 168K (~6.64GB)
  • Managed Memory: 6.72GB
  • Unmanaged Memory: 1.32GB
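
To make the idea concrete, here is a minimal sketch of the approach (an illustration, not RavenDB’s actual implementation): the term bytes live in unmanaged memory, and the GC only ever sees a handful of objects no matter how many terms we store.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public sealed unsafe class NativeTermStore : IDisposable
{
    private byte* _buffer;
    private readonly nuint _capacity;
    private nuint _used;
    // One list of value tuples instead of millions of string objects.
    private readonly List<(nuint Offset, int Length)> _terms = new();

    public NativeTermStore(nuint capacity)
    {
        _capacity = capacity;
        _buffer = (byte*)NativeMemory.Alloc(capacity);
    }

    public int Add(ReadOnlySpan<byte> term)
    {
        if (_used + (nuint)term.Length > _capacity)
            throw new InvalidOperationException("Term buffer is full");

        term.CopyTo(new Span<byte>(_buffer + _used, term.Length));
        _terms.Add((_used, term.Length));
        _used += (nuint)term.Length;
        return _terms.Count - 1;
    }

    public ReadOnlySpan<byte> Get(int index)
    {
        var (offset, length) = _terms[index];
        return new ReadOnlySpan<byte>(_buffer + offset, length);
    }

    public void Dispose()
    {
        if (_buffer != null)
        {
            NativeMemory.Free(_buffer);
            _buffer = null;
        }
    }
}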

Looking at the GC logs, we now have:


2024-12-16T18:33:29.8143148Z, Type: GC, thread id: 8508, duration: 93.6835ms, index: 319, generation: 2, reason: Induced
2024-12-16T18:33:30.7013255Z, Type: GC, thread id: 8508, duration: 142.1781ms, index: 320, generation: 2, reason: Induced
2024-12-16T18:33:31.5691610Z, Type: GC, thread id: 8508, duration: 91.0983ms, index: 321, generation: 2, reason: Induced
2024-12-16T18:33:37.8245671Z, Type: GC, thread id: 8508, duration: 112.7643ms, index: 322, generation: 2, reason: Induced

So the GC time is now in the range of 100ms, instead of several seconds. This change helps both reduce overall GC pause times and greatly reduce the amount of CPU spent on managing garbage.

Those are still big queries, but now we can focus on executing the query, rather than managing maintenance tasks. Incidentally, those sorts of issues are one of the key reasons why we built Corax, which can process queries directly on top of persistent structures, without needing to materialize anything from the disk.

time to read 9 min | 1622 words

RavenDB is a database, a transactional one. This means that we have to reach the disk and wait for it to complete persisting the data to stable storage before we can confirm a transaction commit. That represents a major challenge for ensuring high performance because disks are slow.

I’m talking about disks, which can be rate-limited cloud disks, HDDs, SSDs, or even NVMe drives. From the perspective of the database, all of them are slow. We spend a lot of time and effort making RavenDB run fast, even though the disk is slow.

An interesting problem we routinely encounter is that our test suite would literally cause disks to fail because we stress them beyond warranty limits. We actually keep a couple of those around, drives that have been stressed to the breaking point, because it lets us test unusual I/O patterns.

We recently ran into strange benchmark results, and during the investigation, we realized we are actually running on one of those burnt-out drives. Here is what the performance looks like when writing 100K documents as fast as we can (10 active threads):

As you can see, there is a huge variance in the results. To understand exactly why, we need to dig a bit deeper into how RavenDB handles I/O. You can observe this in the I/O Stats tab in the RavenDB Studio:

There are actually three separate (and concurrent) sets of I/O operations that RavenDB uses:

  • Blue - journal writes - unbuffered direct I/O - in the critical path for transaction performance because this is how RavenDB ensures that the D(urability) in ACID is maintained.
  • Green - flushes - where RavenDB writes the modified data to the data file (until the flush, the modifications are kept in scratch buffers).
  • Red - sync - forcing the data to reside in a persistent medium using fsync().

The writes to the journal (blue) are the most important ones for performance, since we must wait for them to complete successfully before we can acknowledge that the transaction was committed. The other two ensure that the data actually reached the file and that we have safely stored it.

It turns out that there is an interesting interaction between those three types. Both flushes (green) and syncs (red) can run concurrently with journal writes. But on bad disks, we may end up saturating the entire I/O bandwidth for the journal writes while we are flushing or syncing.

In other words, the background work will impact the system performance. That only happens when you reach the physical limits of the hardware, but it is actually quite common when running in the cloud.

To handle this scenario, RavenDB does a number of what I can only describe as shenanigans. Conceptually, here is how RavenDB works:


def txn_merger(self):
  while self._running:
    with self.open_tx() as tx:
      while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
        curOp = self._operations.take()
        if curOp is None:
          break # no more operations
        curOp.exec(tx)
      tx.commit()
      # here we notify the operations that we are done
      tx.notify_ops_completed()

The idea is that you submit the operation for the transaction merger, which can significantly improve the performance by merging multiple operations into a single disk write. The actual operations wait to be notified (which happens after the transaction successfully commits).

If you want to know more about this, I have a full blog post on the topic. There is a lot of code to handle all sorts of edge cases, but that is basically the story.

Notice that processing a transaction is actually composed of two steps. First, there is the execution of the transaction operations (which reside in the _operations queue), and then there is the actual commit(), where we write to the disk. It is the commit portion that takes a lot of time.

Here is what the timeline will look like in this model:

We execute the transaction, then wait for the disk. This means that we are unable to saturate either the disk or the CPU. That is a waste.

To address that, RavenDB supports async commits (sometimes called early lock release). The idea is that while we are committing the previous transaction, we execute the next one. The code for that is something like this:


def txn_merger(self):
  prev_txn = completed_txn()
  while self._running:
    executedOps = []
    with self.open_tx() as tx:
      while tx.total_size < MAX_TX_SIZE and tx.time < MAX_TX_TIME:
        curOp = self._operations.take()
        if curOp is None:
          break # no more operations
        executedOps.append(curOp)
        curOp.exec(tx)
        if prev_txn.completed:
           break
      # verify success of previous commit
      prev_txn.end_commit() 
      # only here we notify the operations that we are done
      prev_txn.notify_ops_completed()
      # start the commit in async manner
      prev_txn = tx.begin_commit()

The idea is that we start writing to the disk, and while that is happening, we are already processing the operations in the next transaction. In other words, this allows both writing to the disk and executing the transaction operations to happen concurrently. Here is what this looks like:

This change has a huge impact on overall performance. Especially because it can smooth out a slow disk by allowing us to process the operations in the transactions while waiting for the disk. I wrote about this as well in the past.

So far, so good, this is how RavenDB has behaved for about a decade or so. So what is the performance optimization?

This deserves an explanation. The relevant piece of code determines whether the transaction completes in a synchronous or asynchronous manner. It used to decide that based on whether there were more operations to process in the queue: when we completed a transaction and needed to decide whether to complete it asynchronously, we would check if there were additional operations in the queue (currentOperationsCount).

The change modifies the logic so that we complete in an async manner if we executed any operation. The change is minor but has a really important effect on the system. The idea is that if we are going to write to the disk (since we have operations to commit), we’ll always complete in an async manner, even if there are no more operations in the queue.
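
Conceptually, the change is tiny. Here is an illustrative sketch of the decision (the names below are just for illustration, not RavenDB’s actual code):

// Before: go async only if more operations are already queued up.
static bool ShouldCompleteAsyncOld(int currentOperationsCount)
    => currentOperationsCount > 0;

// After: go async whenever this transaction executed any operation,
// even if the queue happens to be empty right now.
static bool ShouldCompleteAsyncNew(int executedOperationsCount)
    => executedOperationsCount > 0;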

The effect is that the next operation will start processing immediately, instead of waiting for the commit to complete and only then starting to process. It is such a small change, but it had a huge impact on system performance.

Here you can see the effect of this change when writing 100K docs with 10 threads. We tested it on both a good disk and a bad one, and the results are really interesting.

The bad disk chokes when we push a lot of data through it (gray line), and you can see it struggling to pick up. On the same disk, using the async version (yellow line), you can see it still struggles (because eventually, you need to hit the disk), but it is able to sustain much higher numbers and complete far more quickly (the yellow line ends before the gray one).

On the good disk, which is able to sustain the entire load, we are still seeing an improvement (Blue is the new version, Orange is the old one). We aren’t sure yet why the initial stage is slower (maybe just because this is the first test we ran), but even with the slower start, it was able to complete more quickly because its throughput is higher.

time to read 4 min | 771 words

In RavenDB, we really care about performance. That means that our typical code does not follow idiomatic C# conventions. Instead, we make use of everything that the framework and the language give us to eke out that additional push for performance. Recently we ran into a bug that was quite puzzling. Here is a simple reproduction of the problem:


using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();

var totalKey = 10_000;

ref var total = ref CollectionsMarshal.GetValueRefOrAddDefault(
                               counts, totalKey, out _);

for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
                               counts, key, out _);
    count++;

    total++;
}

Console.WriteLine(counts[totalKey]);

What would you expect this code to output? We are using two important features of C# here:

  • Value types (in this case, an int, but the real scenario was with a struct)
  • CollectionsMarshal.GetValueRefOrAddDefault()

The latter method is a way to avoid performing two lookups in the dictionary to get the value if it exists and then add or modify it.

If you run the code above, it will output the number 2.

That is not expected, but when I sat down and thought about it, it made sense.

We are keeping track of the reference to a value in the dictionary, and we are mutating the dictionary.

The documentation for the method very clearly explains that this is a Bad Idea. It is an easy mistake to make, but still a mistake. The challenge here is figuring out why this is happening. Can you give it a minute of thought and see if you can figure it out?

A dictionary is basically an array that you access using an index (computed via a hash function), that is all. So if we strip everything away, the code above can be seen as:


var buffer = new int[2];
ref var total = ref buffer[0];

We simply have a reference to the first element in the array, that’s what this does behind the scenes. And when we insert items into the dictionary, we may need to allocate a bigger backing array for it, so this becomes:


var buffer = new int[2];
ref var total = ref buffer[0];
var newBuffer = new int[4];
buffer.CopyTo(newBuffer, 0);
buffer = newBuffer;

total = 1;
var newTotal = buffer[0]; // still 0 - total points into the old buffer

In other words, the total variable is pointing to the first element in the two-element array, but we allocated a new array (and copied all the values). That is the reason why the code above gives the wrong result. Makes perfect sense, and yet, was quite puzzling to figure out.
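
One way to avoid the problem in this snippet (a sketch, not necessarily how the real code was fixed) is to keep the running total in a local variable and only touch the dictionary for it after all the other mutations are done:

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

var counts = new Dictionary<int, int>();
var totalKey = 10_000;
var total = 0;

for (int i = 0; i < 4; i++)
{
    var key = i % 32;
    // The ref is used immediately and never outlives a dictionary mutation.
    ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, key, out _);
    count++;

    total++;
}

// Write the total only after all other inserts, so no resize can
// invalidate a ref we are still holding on to.
CollectionsMarshal.GetValueRefOrAddDefault(counts, totalKey, out _) = total;

Console.WriteLine(counts[totalKey]); // 4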

time to read 5 min | 803 words

We received a really interesting question from a user, which basically boils down to:

I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.

That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.

Basically, we want to be able to get the following data, sorted by age ascending:

Name                   Birthday    Death

Michael Stonebraker    1943        N/A
Sir Tim Berners-Lee    1955        N/A
Narges Mohammadi       1972        N/A
Sir Terry Pratchett    1948        2015
Agatha Christie        1890        1976

This doesn’t look hard, right? I mean, all you need to do is something like:


order by datediff( coalesce(Death, now()), Birthday )

Easy enough, and would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?

Look at the manner in which we are ordering: it requires us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it, which can be really expensive. And because we are sorting over a date (which changes), you can’t even get away with a computed field.

RavenDB will refuse to run queries that can only work with small amounts of data but will fail as the data grows. This is part of our philosophy, saying that things should Just Work. Of course, in this case, it doesn’t work, so the question is how this aligns with our philosophy?

The idea is simple. If we cannot make it work in all cases, we will reject it outright. The idea is to ensure that your system is not susceptible to hidden traps. By explicitly rejecting it upfront, we make sure that you’ll have a good solution and not something that will fail as your data size grows.

What is the appropriate behavior here, then? How can we make it work with RavenDB?

The key issue is that we want to be able to figure out what is the value we’ll sort on during the indexing stage. This is important because otherwise we’ll have to compute it across the entire dataset for each query. We can do that in RavenDB by exposing that value to the index.

We cannot just call DateTime.Today, however. That won’t work when the day rolls over, of course. So instead, we store that value in a document config/current-date, like so:


{ // config/current-date
  "Date": "2024-10-10T00:00:00.0000000"
}

Once this is stored as a document, we can then write the following index:


from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date", "Config").Date
select new
{
  Age = end - p.Birthday 
}

And then query it using:


from index 'People/WithAge'
order by Age desc

That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.

At that point, because there is an association between the configuration document and all the documents that loaded it during indexing, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value, so we can run really quick queries and sort on it.

When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won’t need to do a huge amount of work all at once.

You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.

In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn’t too big a task, I think.
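
The scheduled job itself is tiny. Here is a sketch of what the daily update could look like with the RavenDB .NET client (the class name is an assumption, and the @collection is set to "Config" to match the index above; hook the Run method to whatever scheduler you prefer):

using System;
using Raven.Client.Documents;

public class CurrentDate
{
    public DateTime Date { get; set; }
}

public static class CurrentDateJob
{
    // Run this once a day (cron, a hosted service timer, etc.).
    public static void Run(IDocumentStore store)
    {
        using var session = store.OpenSession();

        var config = session.Load<CurrentDate>("config/current-date");
        if (config == null)
        {
            config = new CurrentDate();
            session.Store(config, "config/current-date");
            // Make sure the document lands in the collection the index loads from.
            session.Advanced.GetMetadataFor(config)["@collection"] = "Config";
        }

        config.Date = DateTime.UtcNow.Date;
        session.SaveChanges();
    }
}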

You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.

time to read 5 min | 862 words

It has been almost a year since the release of RavenDB 6.0. The highlights of the 6.0 release were Corax (a new blazing-fast indexing engine) and Sharding (server-side and simple to operate at scale). We made 10 stable releases in the 6.0.x line since then, mostly focused on performance, stability, and minor features.

The new RavenDB 6.2 release is now out and it has a bunch of new features for you to play with and explore. The team has been working on a wide range of new features, from enabling serverless triggers to quality-of-life improvements for operations teams.

RavenDB 6.2 is a Long Term Support (LTS) release

RavenDB 6.2 is a Long Term Support release, replacing the current 5.4 LTS (released in 2022). That means that we’ll support RavenDB 5.4 until Oct 2025, and we strongly encourage all users to upgrade to RavenDB 6.2 at their earliest convenience.

You can get the new RavenDB 6.2 bits on the download page. If you are running in the cloud, you can open a support request and ask to be upgraded to the new release.

Data sovereignty and geo-distribution via Prefixed Sharding

In RavenDB 6.2 we introduced a seemingly simple change to the way RavenDB handles sharding, with profound implications for what you can do with it. Prefixed sharding allows you to define which shards a particular set of documents will go to.

Here is a simple example:

In this case, data for users in the US will reside in shards 0 & 1, while the EU data is limited to shards 2 & 3. The data from Asia is spread over shards 0, 2, & 4.  You can then assign those shards to specific nodes in a particular geographic region, and with that, you are done.

RavenDB will ensure that documents will stay only in their assigned location, handling data sovereignty issues for you. In the same manner, you get to geographically split the data so you can have a single world-spanning database while issuing mostly local queries.

You can read more about this feature and its impact in the documentation.

Actors architecture with Akka.NET

New in RavenDB 6.2 is the integration of RavenDB with Akka.NET. The idea is to allow you to easily manage state persistence of distributed actors in RavenDB. You’ll get both the benefit of the actor model via Akka.NET, simplifying parallelism and concurrency, while at the same time freeing yourself from persistence and high availability concerns thanks to RavenDB.

We have an article out discussing how you use RavenDB & Akka.NET, and if you are into that sort of thing, there is also a detailed set of notes covering the actual implementation and the challenges involved.

Azure Functions integration with ETL to Azure Queues

This is the sort of feature with hidden depths. ETL to Azure Queue Storage is fairly simple on the surface, it allows you to push data using RavenDB’s usual ETL mechanisms to Azure Queues. At a glance, this looks like a simple extension of our already existing capabilities with queues (ETL to Kafka or RabbitMQ).

The reason that this is a top-line feature is that it also enables a very interesting scenario. You can now seamlessly integrate Azure Functions into your RavenDB data pipeline using this feature. We have an article out that walks you through setting up Azure Functions to process data from RavenDB.

OpenTelemetry integration

In RavenDB 6.2 we have added support for the OpenTelemetry framework. This allows your operations team to more easily integrate RavenDB into your infrastructure. You can read more about how to set up OpenTelemetry for your RavenDB cluster in the documentation.

OpenTelemetry integration is in addition to Prometheus, Telegraf, and SNMP telemetry solutions that are already in RavenDB. You can pick any of them to monitor and inspect the state of RavenDB.

Studio Omni-Search

We made some nice improvements to RavenDB Studio as well, and probably the most visible of those is the Omni-Search feature.  You can now hit Ctrl+K in the Studio and just search across everything:

  • Commands in the Studio
  • Documents
  • Indexes

This feature greatly enhances the discoverability of features in RavenDB as well as makes it a joy for those of us (myself included) who love to keep our hands on the keyboard.

Summary

I’m really happy about this release. It follows a predictable and stable release cadence since the release of 6.0 a year ago. The new release adds a whole bunch of new features and capabilities, and it can be upgraded in place (including cross-version clusters) and deployed to production with no hassles.

Looking forward, we have already started work on the next version of RavenDB, tentatively meant to be 7.0. We have some cool ideas about what will go into that release (check the roadmap), but the key feature is likely to make RavenDB a more intelligent database, one might even say, artificially so.

time to read 1 min | 121 words

Corax was released just under a year ago, and we are seeing more customers deploying that to production. During a call with a customer, we noticed the following detail:

(screenshot: indexing time for two identical indexes, one running on Corax and one on Lucene)

Let me explain what we are seeing here. The two indexes are the same, operating on the same set of documents. The only difference between those indexes is the indexing engine.

What is really amazing here is that Corax is able to index in 3:21 minutes what Lucene takes 17:15 minutes to index. In other words, Corax is more than 5 times faster than Lucene in a real world scenario.

And that news makes me very happy.

time to read 21 min | 4146 words

RavenDB is a pretty old codebase, hitting 15+ years in production recently. In order to keep it alive & well, we make sure to follow the rule of always leaving the code in a better shape than we found it.

Today’s tale is about the StreamBitArray class, deep in the guts of Voron, RavenDB’s storage engine. The class itself isn’t really that interesting; it is just a bit array implementation that we use for a bitmap. We wrote it (based on Mono’s code, it looks like) very early in the history of RavenDB and have never really touched it since.

The last time anyone touched it was 5 years ago (fixing the namespace), 7 years ago we created an issue from a TODO comment, etc. Most of the code dates back to 2013, actually. And even then it was moved from a different branch, so we lost the really old history.

To be clear, that class did a full tour of duty. For over a decade, it has served us very well. We never found a reason to change it, never got a trace of it in the profiler, etc. As we chip away at various hurdles inside RavenDB, I ran into this class and really looked at it with modern sensibilities. I think that this makes a great test case for code refactoring from the old style to our modern one.

Here is what the class looks like:
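
(What follows is a rough sketch reconstructed from the usage later in this post; the actual listing also includes the constructor and ToSlice() methods quoted below, and the details here are approximate.)

public class StreamBitArray
{
    private readonly int[] _inner = new int[64]; // 2048 bits - one per page in a section
    public int SetCount;

    public StreamBitArray()
    {
    }

    public bool Get(int index)
    {
        return (_inner[index >> 5] & (1 << (index & 31))) != 0;
    }

    public void Set(int index, bool value)
    {
        if (Get(index) == value)
            return;

        if (value)
        {
            _inner[index >> 5] |= 1 << (index & 31);
            SetCount++;
        }
        else
        {
            _inner[index >> 5] &= ~(1 << (index & 31));
            SetCount--;
        }
    }
}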

Already, we can see several things that really bug me. That class is only used in one context, to manage the free pages bitmap for Voron. That means we create it whenever Voron frees a page. That can happen a lot, as you might imagine.

A single bitmap here covers 2048 pages, so when we create an instance of this class we also allocate an array with 64 ints. In other words, we need to allocate 312 bytes for each page we free. That isn’t fun, and it actually gets worse. Here is a typical example of using this class:


using (freeSpaceTree.Read(section, out Slice result))
{
    sba = !result.HasValue ? 
              new StreamBitArray() : 
              new StreamBitArray(result.CreateReader());
}
sba.Set((int)(pageNumber % NumberOfPagesInSection), true);
using (sba.ToSlice(tx.Allocator, out Slice val))
    freeSpaceTree.Add(section, val);

And inside the ToSlice() call, we have:


public ByteStringContext.InternalScope ToSlice(ByteStringContext context,
    ByteStringType type, out Slice str)
{
    var buffer = ToBuffer();
    var scope = context.From(buffer, 0, buffer.Length,
        type, out ByteString byteString);
    str = new Slice(byteString);
    return scope;
}

private unsafe byte[] ToBuffer()
{
    var tmpBuffer = new byte[(_inner.Length + 1) * sizeof(int)];
    unsafe
    {
        fixed (int* src = _inner)
        fixed (byte* dest = tmpBuffer)
        {
            *(int*)dest = SetCount;
            Memory.Copy(dest + sizeof(int), (byte*)src,
                tmpBuffer.Length - 1);
        }
    }
    return tmpBuffer;
}

In other words, ToSlice() calls ToBuffer(), which allocates an array of bytes (288 bytes are allocated here), copies the data from the inner buffer to a new one (using fixed on the two arrays, which is a performance issue all in itself) and then calls a method to do the actual copy. Then in ToSlice() itself we allocate it again in native memory, which we then write to Voron, and then discard the whole thing.

In short, somehow it turns out that freeing a page in Voron costs us ~1KB of memory allocations. That sucks, I have to say. And the only reasoning I have for this code is that it is old.

Here is the constructor for this class as well:


public StreamBitArray(ValueReader reader)
{
    SetCount = reader.ReadLittleEndianInt32();
    unsafe
    {
        fixed (int* i = _inner)
        {
            int read = reader.Read((byte*)i, _inner.Length * sizeof(int));
            if (read < _inner.Length * sizeof(int))
                throw new EndOfStreamException();
        }
    }
}

This accepts a reader to a piece of memory and does a bunch of things. It calls a few methods, uses fixed on the array, etc., all to get the data from the reader to the class. That is horribly inefficient.

Let’s write it from scratch and see what we can do. The first thing to notice is that this is a very short-lived class, it is only used inside methods and never held for long. This usage pattern tells me that it is a good candidate to be made into a struct, and as long as we do that, we might as well fix the allocation of the array as well.

Note that I have a hard constraint, I cannot change the structure of the data on disk for backward compatibility reasons. So only in-memory changes are allowed.

Here is my first attempt at refactoring the code:


public unsafe struct StreamBitArray
{
    private fixed uint _inner[64];
    public int SetCount;


     public StreamBitArray()
     {
         SetCount = 0;
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[0]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[8]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[16]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[24]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[32]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[40]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[48]);
         Vector256<uint>.Zero.StoreUnsafe(ref _inner[56]);
     }


     public StreamBitArray(byte* ptr)
     {
         var ints = (uint*)ptr;
         SetCount = (int)*ints;
         var a = Vector256.LoadUnsafe(ref ints[1]);
         var b = Vector256.LoadUnsafe(ref ints[9]);
         var c = Vector256.LoadUnsafe(ref ints[17]);
         var d = Vector256.LoadUnsafe(ref ints[25]);
         var e = Vector256.LoadUnsafe(ref ints[33]);
         var f = Vector256.LoadUnsafe(ref ints[41]);
         var g = Vector256.LoadUnsafe(ref ints[49]);
         var h = Vector256.LoadUnsafe(ref ints[57]);


         a.StoreUnsafe(ref _inner[0]);
         b.StoreUnsafe(ref _inner[8]);
         c.StoreUnsafe(ref _inner[16]);
         d.StoreUnsafe(ref _inner[24]);
         e.StoreUnsafe(ref _inner[32]);
         f.StoreUnsafe(ref _inner[40]);
         g.StoreUnsafe(ref _inner[48]);
         h.StoreUnsafe(ref _inner[56]);
     }
}

That looks like a lot of code, but let’s see what changes I brought to bear here.

  • Using a struct instead of a class saves us an allocation.
  • Using a fixed array means that we don’t have a separate allocation for the buffer.
  • Using [SkipLocalsInit] means that we ask the JIT not to zero the struct. We do that directly in the default constructor.
  • We are loading the data from the ptr in the second constructor directly.

The fact that this is a struct and using a fixed array means that we can create a new instance of this without any allocations, we just need 260 bytes of stack space (the 288 we previously allocated also included object headers).

Let’s look at the actual machine code that these two constructors generate. Looking at the default constructor, we have:


StreamBitArray..ctor()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: xor eax, eax
    L0008: mov [ecx+0x100], eax
    L000e: vxorps ymm0, ymm0, ymm0
    L0012: vmovups [ecx], ymm0
    L0016: vmovups [ecx+0x20], ymm0
    L001b: vmovups [ecx+0x40], ymm0
    L0020: vmovups [ecx+0x60], ymm0
    L0025: vmovups [ecx+0x80], ymm0
    L002d: vmovups [ecx+0xa0], ymm0
    L0035: vmovups [ecx+0xc0], ymm0
    L003d: vmovups [ecx+0xe0], ymm0
    L0045: vzeroupper
    L0048: pop ebp
    L0049: ret

There is the function prolog and epilog, but the code of this method uses 4 256-bit instructions to zero the buffer. If we were to let the JIT handle this, it would use 128-bit instructions and a loop to do it. In this case, our way is better, because we know more than the JIT.

As for the constructor accepting an external pointer, here is what this translates into:


StreamBitArray..ctor(Byte*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: mov eax, [edx]
    L0008: mov [ecx+0x100], eax
    L000e: vmovups ymm0, [edx+4]
    L0013: vmovups ymm1, [edx+0x24]
    L0018: vmovups ymm2, [edx+0x44]
    L001d: vmovups ymm3, [edx+0x64]
    L0022: vmovups ymm4, [edx+0x84]
    L002a: vmovups ymm5, [edx+0xa4]
    L0032: vmovups ymm6, [edx+0xc4]
    L003a: vmovups ymm7, [edx+0xe4]
    L0042: vmovups [ecx], ymm0
    L0046: vmovups [ecx+0x20], ymm1
    L004b: vmovups [ecx+0x40], ymm2
    L0050: vmovups [ecx+0x60], ymm3
    L0055: vmovups [ecx+0x80], ymm4
    L005d: vmovups [ecx+0xa0], ymm5
    L0065: vmovups [ecx+0xc0], ymm6
    L006d: vmovups [ecx+0xe0], ymm7
    L0075: vzeroupper
    L0078: pop ebp
    L0079: ret

This code is exciting to me because we are also allowing instruction-level parallelism. We effectively allow the CPU to execute all the operations of reading and writing in parallel.

Next on the chopping block is this method:


public int FirstSetBit()
{
    for (int i = 0; i < _inner.Length; i++)
    {
        if (_inner[i] == 0)
            continue;
        return i << 5 | HighestBitSet(_inner[i]);
    }
    return -1;
}


private static int HighestBitSet(int v)
{


    v |= v >> 1; // first round down to one less than a power of 2 
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;


    return MultiplyDeBruijnBitPosition[(uint)(v * 0x07C4ACDDU) >> 27];
}
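
A sketch of the replacement (reconstructed from the assembly below, so treat the details as approximate; it assumes the usual System.Runtime.Intrinsics and System.Numerics usings):

public int FirstSetBit()
{
    for (var i = 0; i < 64; i += 8)
    {
        // Load 8 ints at once and skip the block if none of them has a bit set.
        var block = Vector256.LoadUnsafe(ref _inner[i]);
        if (block == Vector256<uint>.Zero)
            continue;

        // Find the first non-zero lane, then the first set bit inside it.
        var mask = Vector256.ExtractMostSignificantBits(
            Vector256.GreaterThan(block, Vector256<uint>.Zero));
        var index = i + BitOperations.TrailingZeroCount(mask);
        return index << 5 | BitOperations.TrailingZeroCount(_inner[index]);
    }
    return -1;
}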

We are using vector instructions to scan 8 ints at a time, trying to find the first one that is set. Then we find the right int and locate the first set bit there. Here is what the assembly looks like:


StreamBitArray.FirstSetBit()
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: vzeroupper
    L0006: xor edx, edx
    L0008: cmp [ecx], cl
    L000a: vmovups ymm0, [ecx+edx*4]
    L000f: vxorps ymm1, ymm1, ymm1
    L0013: vpcmpud k1, ymm0, ymm1, 6
    L001a: vpmovm2d ymm0, k1
    L0020: vptest ymm0, ymm0
    L0025: jne short L0039
    L0027: add edx, 8
    L002a: cmp edx, 0x40
    L002d: jl short L000a
    L002f: mov eax, 0xffffffff
    L0034: vzeroupper
    L0037: pop ebp
    L0038: ret
    L0039: vmovmskps eax, ymm0
    L003d: tzcnt eax, eax
    L0041: add eax, edx
    L0043: xor edx, edx
    L0045: tzcnt edx, [ecx+eax*4]
    L004a: shl eax, 5
    L004d: add eax, edx
    L004f: vzeroupper
    L0052: pop ebp
    L0053: ret

In short, the code is simpler, shorter, and more explicit about what it is doing. The machine code that is running there is much tighter. And I don’t have allocations galore.

This particular optimization isn’t about showing better numbers in a specific scenario that I can point to. I don’t think we ever delete enough pages to actually see this in a profiler output in such an obvious way. The goal is to reduce allocations and give the GC less work to do, which has a global impact on the performance of the system.
