Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,163
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 589 words

The previous post has a code sample in it that was figuratively* physically painful for me to write. Avoiding the number of syscalls that are invoked, the code isn’t all too efficient as I now measure things, it uses way too much managed memory and it is subject to failures as we increase the amount of data we push through. For this post, I’m going to be rewriting the CodexWriter class as I would for code that is going into RavenDB.

* I literally know what literally used to mean, amazing.

I’m sorry, there is going to be a big jump in the complexity of the code, because I’m going to try to handle performance, parallelism and resource utilization all at once. The first thing to do is to go into the project’s settings and enable both unsafe code (without which is it nearly impossible to write high performance code) and C# 7.3 features, we’ll need these.

We can divide the task of gather the inputs into several stages. First, we need to write the data to the file. This is similar to the way we did it before, here is the Add() method:

As you can see, there isn’t really much that changed here, but we have this notion of a segment, which is created every million keys. But what is this segment?

It is a way to refer to a specific section of records in the file. In particular, it has just one primary role, it exists to sort the records. Let’s take a look at the code:

There are a few key points. Instead of using file I/O directly, we are using memory mapped files. Why is that? Because, as we have seen, the cost of syscalls is non trivial in the extreme, and using memory mapped files means that we can access the data natively without having to pay any price aside from page fault if the data isn’t already in memory.

The EnsureSorted() method is also interesting, it spawns a new task to sort the entries inside the segment in parallel with inserting the data to the main file. The actual sort is handled in the Compare() methods.

As we write the data into the codex, we sort the data as we run through it, but what happens in the end? In this case, we have about 13 million items that we inserted, so we have 13 segments that are each individually sorted. To get the final sort, we basically merge from all of them. Here is the relevant code:

This used the SortedSet as a heap, to always get the minimum value from the sorted inner values in the set. Note that we need to wait for the parallel searches to complete, then merge from all of them to the final result. We can write the result of the sort directly to the end of the file.

Overall, this process takes: 59.01 seconds to complete. Remember that this is when we are pushing unsorted data through. If we pass the data sorted, we get a significant improvement and only take: 35.91 seconds.

To compare, I run the same sort of test on Voron, and I got: 59.15 seconds for the unsorted case and for the sorted case: 13.85 seconds. This is when Voron is also doing ACID writes, which we obviously don’t in Codex.

I guess that spending four to five years with a whole team doing performance optimization is a better way to get storage performance than a couple of evenings hacking before I go to bed, who knew?

time to read 3 min | 553 words

We are exploring a few data structure for a particular feature in RavenDB, and I run into something that is elegant, simple, easy and deep enough that we can discuss serious implementation details upon without getting too bogged down in the details.

The idea is that I’m going to be using this series of blog post to post a detailed walk through about building a key value store from scratch. Including all the intermediate steps and wrong turns along the way. In other words, this is a “Show YourWork” kind of series. The end result is going to be a key/value store that can:

  • Store arbitrary keys / values.
  • Get key’s value by the key.
  • Support range queries and iteration.
  • Support some form of ACID.

In this case, I’m going to start from the very basics and build up. The challenge we are going to deal with is ingesting all the titles of articles in Wikipedia, about 277MB of them. I took them from here: (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz). There are 13,755,095 of them in my case.

I’m calling the KV store that we’ll be creating Codex. And I’m going to start from the most basic of example, just being able to store and check if a value exists in the store. Here is the code that reads from the titles list and add them to the store. Note that the articles are sorted, but we don’t want this advantage of adding sorted data, so we randomize things.

The question here, how are we going to store these titles in a way that allow us fast retrieval?  Here is the idea, we are going to write the strings to the output file as they come, and also record their positions. When we are done inserting strings to the codex, we’ll run a sort based on the positions, and that will give us an array of offsets to the strings in the files, sorted by their value. The first version of this code looks like this:

If you’ll run this code on the Wikipedia titles, you’ll find that it takes a while to run. On my machine, that took just under 50 minutes.

Well, we are dealing with the full set of Wikipedia titles, but even so, that doesn’t sound like it should take this long. What gives?

Let’s analyze what is going on here, okay? If you run this code, you’ll note that it isn’t using CPU or I/O or really seems to be doing much. What is going on?

The key here is in the ReadFrom method. There, we do two seemingly innocent actions. We set the file’s position (translate to SetFilePointer call) and read a short string (translate to a ReadFile call). Now, why is that expensive? Well, the ReadFrom method is called twice each time we need to sort an entry. In this case, it means that ReadFrom will be called a total of 575,616,878 times.

That is not a typo. And each invocation means two separate system call. In other words, this innocent seeming piece of code executed over 1.15 billion system calls.

For reference, simple by reading the entire file to a MemoryStream and keeping everything else the same, I was able to bring the cost of this operation down to under 3 minutes.

Lesson learned, system calls are expensive, let’s try to reduce them as much as we can.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}