Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 4 min | 764 words

I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

This program will output: File size: 2.00 GB

And it will write 2 GB of zeros to the file:


~$ head  large_file.bin  | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
7ffff000

The question is why? And the answer is quite simple: Linux limits how much a single write() call will transfer - at most 0x7ffff000 bytes (2 GiB minus one 4 KiB page), which is exactly the final offset you can see in the hexdump above. A write() call that attempts to write more than that will only write that much, and you’ll have to call it again for the remainder. This is not an error, mind. The write() call is always free to write less than the size of the buffer you passed to it.

Windows has the same limit, but it is honest about it

In Windows, WriteFile() takes the buffer size as a 32-bit DWORD, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.

And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. But because this is a large allocation, malloc() doesn’t carve it out of an existing heap; it asks the OS for the buffer directly, and the OS is contractually obligated to hand newly mapped pages to a process already zeroed.

time to read 3 min | 536 words

Here is a pretty simple C program, running on Linux. Can you tell me what you expect its output to be?


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

And what happens when I run:


head  large_file.bin  | hexdump -C

This code exhibits surprising behavior and serves as a good opening for discussing a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.

time to read 9 min | 1642 words

The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.

The problem was that this took time - many days or multiple weeks - for us to observe. But we had the charts to prove that this was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade over time.

The fact that the customer didn’t notice is an interesting story on its own. RavenDB automatically prioritizes the fastest node in the cluster to be the “customer-facing” one, and that alleviated the issue to such an extent that the metrics the customer usually monitors looked fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted users’ attention.

I hate these sorts of issues because they are really hard to figure out and subject to every caveat under the sun. In this case, we had exactly nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were all acceptable. There was no obvious starting point.

Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?

After a long while, we got a good handle on what was actually going on. When RavenDB starts, it creates memory maps of the files it works with. Over time, as needed, RavenDB will map, unmap, and remap those files. A process that has been running for a long while, with many databases and indexes in operation, will have accumulated a great deal of memory-mapping work.

In Linux, you can inspect those details by running:


$ cat /proc/22003/smaps


600a33834000-600a3383b000 r--p 00000000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 28 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  28 kB
Pss:                  26 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        24 kB
Private_Dirty:         0 kB
Referenced:           28 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd mr mw me dw
600a3383b000-600a33847000 r-xp 00006000 08:30 214585                     /data/ravendb/Raven.Server
Size:                 48 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  48 kB
Pss:                  46 kB
Shared_Clean:          4 kB
Shared_Dirty:          0 kB
Private_Clean:        44 kB
Private_Dirty:         0 kB
Referenced:           48 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    0
VmFlags: rd ex mr mw me dw

Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be invaluable for figuring out certain types of problems. In the snippet above, for example, you can see that we have some executable memory ranges mapped.

The problem is that over time, memory becomes fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.

Here is the result of running perf top on the system; you can see that the top three items hogging most of the resources are all related to smaps accounting.

This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects. Consider that while we are scanning through all the memory mappings, the process may also need to change its memory mappings. That leads to contention on the kernel locks that protect the mapping, of course.

It’s expensive to generate the smaps file

Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.

When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.

If the number of memory mappings is high, and the frequency with which we monitor is short… I hope you can see where this is going. We effectively spent so much time running over this file that we blocked other operations.

This wasn’t an issue when we just started the process, because the number of memory mappings was small, but as we worked on the system and the number of memory mappings grew… we eventually started hitting contention.

The solution was three-fold. We made sure that there is only ever a single thread reading the information from smaps (previously the read might have been triggered from multiple locations). We added throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information when needed). And we switched from smaps to smaps_rollup, which performs much better since it deals with summary data only.

With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.

time to read 8 min | 1561 words

We got an interesting question in the RavenDB discussion group: How to do aggregation on a tree structure?

The task is to build a Work Breakdown Structure, where you have:

  • Projects
  • Major deliverables
  • Sub-deliverables
  • Work packages

The idea is to be able to track EstimatedHours and CompletedHours across the entire tree. For example, let’s say that I have the following:

  • Project: Bee Keeper Chronicle App
  • Major deliverable: App Design
  • Sub-deliverable: Wireframes all screens
  • Work Package: Login page wireframe

Users can add the EstimatedHours and CompletedHours at any level, and we want to be able to aggregate the data upward. So the Project: “Bee Keeper Chronicle App” should have a total estimated time and the number of hours that were worked on.

The question is how to model & track that in RavenDB. Here is what I think the document structure should look like:


{
    "Name": "Login page wire frame",
    "Parent": {
        "Type": "Subs",
        "Id": "subs/0000000000000000009-A"
    },
    "EsimatedHours": 8,
    "CompletedHours": 3,
    "@metadata": {
        "@collection": "WorkPackages"
    }
}


{
    "Name": "Wire frames all screens",
    "Parent": {
        "Type": "Majors",
        "Id": "major/0000000000000000008-A"
    },
    "EsimatedHours": 20,
    "CompletedHours": 7,
    "@metadata": {
        "@collection": "Subs"
    }
}


{
    "Name": "App Design",
    "Parent": {
        "Type": "Projects",
        "Id": "projects/0000000000000000011-A"
    },
    "EsimatedHours": 50,
    "CompletedHours": 12,
    "@metadata": {
        "@collection": "Majors"
    }
}


{
    "Name": "Bee Keeper Chronicle App",
    "EsimatedHours": 34,
    "CompletedHours": 21,
    "@metadata": {
        "@collection": "Projects"
    }
}

I used a Parent relationship, since that is the most flexible (it allows you to move each item to a completely different part of the tree easily). Now, we need to aggregate the values for all of those items using a Map-Reduce index.

Because of the similar structure, I created the following JS function:


function processWorkBreakdownHours(doc) {
    let hours = {
        EsimatedHours: doc.EsimatedHours,
        CompletedHours: doc.CompletedHours
    };
    let results = [Object.assign({
        Scope: id(doc)
    }, hours)];


    let current = doc;
    while (current.Parent) {
        current = load(current.Parent.Id, current.Parent.Type);
        results.push(Object.assign({
            Scope: id(current)
        }, hours));
    }
    return results;
}

This will scan over the hierarchy and add the number of estimated and completed hours to all the levels. Now we just need to add this file as Additional Sources to an index and call it for all the relevant collections, like this:


map("WorkPackages",processWorkBreakdownHours);
map("Subs",processWorkBreakdownHours);
map("Majors",processWorkBreakdownHours);
map("Projects",processWorkBreakdownHours);

And the last step is to aggregate across all of them in the reduce function:


groupBy(x => x.Scope).aggregate(g => {
    return {
        Scope: g.key,
        EsimatedHours: g.values.reduce((c, val) => val.EsimatedHours + c, 0),
        CompletedHours: g.values.reduce((c, val) => val.CompletedHours + c, 0)
    };
})


The end result is automatic aggregation at all levels. Change one item, and it will propagate upward.

Considerations:

I’m using load() here, which creates a reference from the parent to the child. The idea is that if we move a Work Package from one Sub-deliverable to another (in the same or a different Major & Project), this index will automatically re-index what is required and get you the right result.

However, that also means that whenever the Major document changes, we’ll have to re-index everything below it (because it might have changed the Project). There are two ways to handle that.

  • Instead of using load(), we’ll use noTracking.load(), which tells RavenDB that when the referenced document changes, we should not re-index.
  • Provide the relevant scopes at the document level, like this:


{
    "Name": "Login page wire frame",
    "Scope": [
       "subs/0000000000000000009-A",
       "major/0000000000000000008-A",
       "projects/0000000000000000011-A"
    ],
    "EsimatedHours": 8,
    "CompletedHours": 3,
    "@metadata": {
        "@collection": "WorkPackages"
    }
}

Note that in this case, changing the root will be more complex because you have to scan / touch everything if you move between parts of the tree.

In most cases, that is such a rare event that it shouldn’t be a consideration, but it depends largely on your context.

And there you have it, a simple Map-Reduce index that can aggregate across an entire hierarchy with ease.

time to read 5 min | 803 words

We received a really interesting question from a user, which basically boils down to:

I need to query over a time span, either known (start, end) or (start, $currentDate), and I need to be able to sort on them.

That might sound… vague, I know. A better way to explain this is that I have a list of people, and I need to sort them by their age. That’s trivial to do since I can sort by the birthday, right? The problem is that we include some historical data, so some people are deceased.

Basically, we want to be able to get the following data, sorted by age ascending:

Name                | Birthday | Death
Michael Stonebraker | 1943     | N/A
Sir Tim Berners-Lee | 1955     | N/A
Narges Mohammadi    | 1972     | N/A
Sir Terry Pratchett | 1948     | 2015
Agatha Christie     | 1890     | 1976

This doesn’t look hard, right? I mean, all you need to do is something like:


order by datediff( coalesce(Death, now()), Birthday )

Easy enough, and would work great if you have a small number of items to sort. What happens if we want to sort over 10M records?

Look at the manner in which we are ordering: it requires us to evaluate each and every record. That means we’ll have to scan through the entire list and sort it, which can be really expensive. And because we are sorting over a date that changes from day to day, you can’t even get away with a computed field.

RavenDB will refuse to run queries that can only work with small amounts of data and will fail as the data grows. This is part of our philosophy that things should Just Work. Of course, in this case it doesn’t work, so how does refusing the query align with that philosophy?

The idea is simple: if we cannot make a query work in all cases, we reject it outright. This ensures that your system is not susceptible to hidden traps. By explicitly rejecting the query upfront, we make sure that you’ll build a good solution rather than something that will fail as your data size grows.

What is the appropriate behavior here, then? How can we make it work with RavenDB?

The key issue is that we want to figure out the value we’ll sort on during the indexing stage. This is important because otherwise we would have to compute it across the entire dataset on every query. We can do that in RavenDB by exposing the value to the index.

We cannot just call DateTime.Today, however. That won’t work when the day rolls over, of course. So instead, we store that value in a document config/current-date, like so:


{ // config/current-date
  "Date": "2024-10-10T00:00:00.0000000"
}

Once this is stored as a document, we can then write the following index:


from p in docs.People
let end = p.Death ?? LoadDocument("config/current-date", "Config").Date
select new
{
  Age = end - p.Birthday 
}

And then query it using:


from index 'People/WithAge'
order by Age desc

That works beautifully, of course, until the next day. What happens then? Well, we’ll need to schedule an update to the config/current-date document to correct the date.

At that point, because there is an association created between all the documents that loaded the current date, the indexing engine in RavenDB will go and re-index them. The idea is that at any given point in time, we have already computed the value, and can run really quick queries and sort on it.

When you update the configuration document, it is a signal that we need to re-index the referencing documents. RavenDB is good at knowing how to do that on a streaming basis, so it won’t need to do a huge amount of work all at once.

You’ll also note that we only load the configuration document if we don’t have an end date. So the deceased people’s records will not be affected or require re-indexing.

In short, we can benefit from querying over the age without incurring query time costs and can defer those costs to background indexing time. The downside is that we need to set up a cron job to make it happen, but that isn’t too big a task, I think.

You can utilize similar setups for other scenarios where you need to query over changing values. The performance benefits here are enormous. And what is more interesting, even if you have a huge amount of data, this approach will just keep on ticking and deliver great results at very low latencies.

time to read 4 min | 728 words

I got into an interesting discussion on LinkedIn about my previous post, talking about Code Rot. I was asked about Legacy Code defined as code without tests and how I reconcile code rot with having tests.

I started to reply there, but it really got out of hand and became its own post.

“To me, legacy code is simply code without tests.” Michael Feathers, Working Effectively with Legacy Code

I read Working Effectively with Legacy Code for the first time in 2005 or thereabout, I think. It left a massive impression on me and on the industry at large. The book is one of the reasons I started rigorously writing tests for my code, it got me interested in mocking and eventually led me to writing Rhino Mocks.

It is ironic that the point of this post is that I disagree with this statement by Michael because of Rhino Mocks. Let’s start with numbers: the last commit to the Rhino Mocks repository was about a decade ago. It has just under 1,000 tests and code coverage that ranges between 95% and 100%.

I can modify this codebase with confidence, knowing that I will not break stuff unintentionally. The design of the code is very explicitly meant to aid in testing and the entire project was developed with a Test First mindset.

I haven’t touched the codebase in a decade (and it has been close to 15 years since I really delved into it). The code itself was written in .NET 1.1 around the 2006 timeframe. It literally predates generics in .NET.

It compiles and runs all tests when I try to run it, which is great. But it is still very much a legacy codebase.

It is a legacy codebase because changing this code is a big undertaking. This code will not run on modern systems. We need to address issues related to dynamic code generation between .NET Framework and .NET.

That in turn requires a high level of expertise and knowledge. I’m fairly certain that given enough time and effort, it is possible to do so. The problem is that this will now require me to reconstitute my understanding of the code.

The tests are going to be invaluable for actually making those changes, but the core issue is that a lot of knowledge has been lost. It will be a Project just to get it back to a normative state.

This scenario is pretty interesting because I am actually looking back at my own project. Thinking about having to do the same to a similar project from someone else’s code is an even bigger challenge.

Legacy code, in this context, means that there is a huge amount of effort required to start moving the project along. Note that if we had kept the knowledge and information within the same codebase, the same process would be far cheaper and easier.

Legacy code isn’t about the state of the codebase in my eyes, it is about the state of the team maintaining it. The team, their knowledge, and expertise, are far more important than the code itself.

An orphaned codebase, one that has no one to take care of, is a legacy project even if it has tests. Conversely, a project with no tests but with an actively knowledgeable team operating on it is not.

Note that I absolutely agree that tests are crucial regardless. The distinction that I make between legacy projects and non-legacy projects is whether we can deliver a change to the system.

Reminder: A codebase that isn’t being actively maintained and has no tests is the worst thing of all. If you are in that situation, go read Working Effectively with Legacy Code, it will be a lifesaver.

I need a feature with an ideal cost of X (time, materials, effort, cost, etc). A project with no tests but people familiar with it will be able to deliver it at a cost of 2-3X. A legacy project will need 10X or more. The second feature may still require 2X from the maintained project, but only 5X from the legacy system. However, that initial cost to get things started is the killer.

In other words, what matters here is the inertia, the ability to actually deliver updates to the system.

time to read 4 min | 765 words

I’m currently deep in the process of modifying the internals of Voron, trying to eke more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t been looked at in a long time.

In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.

We have enough tests that I’ve great confidence I would catch any actual stability issues, but the drive back toward a fully green build has been a slog.

The process is straightforward:

  • Change something.
  • Verify that it works better than before.
  • Run the entire test suite (upward of 30K tests) to see if there are any breaks.

The last part can be frustrating because it takes a while to run this sort of test suite. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any access to that memory would result in an access violation.

I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.

One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that the checksum matches on read and abort early. Because that test directly modifies what is now read-only memory, it generates a crash due to a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.

I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.

And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about exactly what was going on. When trying to read data from disk, it would follow up a pointer to an invalid location. That is not supposed to happen, obviously.

Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.

My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.

RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.

It was also quite frustrating to track because any small-stage test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.

I started practicing my Fields Medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually, I tracked it down to registering to the TransactionCommit event when we open a new file.

The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.

As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the data but would end up with still encrypted data (looking like random fluff).

The fix was simply moving the event registration to the transaction level, not the file level. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.

time to read 10 min | 1997 words

I usually talk about the things I do that are successful. Today I want to discuss something that I tried but failed at. Documenting failed approaches is just as important as documenting successes, though less enjoyable.

In order to explain the failure, I need to get a bit deeper into how computers handle memory. There is physical memory, the RAM sticks that you have in your machine, and then there is how the OS and CPU present that memory to your code. Usually, the abstraction is quite seamless, and we don’t need to pay attention to it.

Occasionally, we can take advantage of this model. Consider the following memory setup, showing a single physical memory page that was mapped in two different locations:

In this case, it means that you can do things like this:


*page1 = '*';
printf("Same: %d - Val: %c\n", (page1 == page2), *page2); 
// output is:
// Same: 0 - Val: *

In other words, because the two virtual pages point to the same physical page in memory, we can modify memory in one location and see the changes in another. This isn’t spooky action at a distance, it is simply the fact that the memory addresses we use are virtual and they point to the same place.

Note that in the image above, I modified the data using the pointer to Page 1 and then read it from Page 2. The Memory Management Unit (MMU) in the CPU can do a bunch of really interesting things because of this. You’ll note that each virtual page is annotated with an access permission.

In this case, the second page is marked as Copy on Write. That means that when we read from this page, the MMU will happily read the data from the physical page it is pointed to. But when we write, the situation is different.

The MMU will raise an exception to the operating system, telling it that a write was attempted on this page, which is forbidden. At this point, the OS will allocate a new physical page, copy the data to it, and then update the virtual address to point to the new page. Here is what this looks like:

Now we have two distinct mappings. A write to either one of them will not be reflected on the other. Here is what this looks like in code:


*page1 = '1'; // now 
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 1
*page2 = '2'; // force the copy on write to occur
printf("Page1: %c, Page2: %c\n", *page1, *page2); 
// output: Page1: 1, Page2: 2

As long as the modifications happened through the first page address (the orange one in the image), there was no issue and any change would be reflected in both pages. When we make a modification to the second page (the green one in the image), the OS will create a new physical page and effectively split them forever.

Changes made to either page will only be reflected in that page, not both, since they aren’t sharing the same page.

Note that this behavior applies at a page boundary. What happens if I have a buffer, 1GB in size, and I use this technique on it? Let’s assume that I created a copy-on-write mapping on top of that 1GB buffer.

The amount of physical memory that I would consume is still just 1GB.

In fact, I get what is effectively a very fast memcpy(), since I’m not actually copying anything. And for all intents and purposes, it works. I can change the data through the second buffer, and it would not show up in the first buffer. Of particular note is that when I modify the data through the second buffer, only the touched page is actually copied. Here is what this looks like:

So instead of having to copy 1GB all at once, we map the buffer again as copy on write, and we can get a new page whenever we actually modify our “copy” of the data.

So far, this is great, and it is heavily used for many optimizations. It is also something that I want to use to implement cheap snapshots of a potentially large data structure.

Here is the kind of code that I want to write:


var list = new CopyOnWriteList();
list.Put(1);
list.Put(2);

var snapshot1 = list.CreateSnapshot();

list.Put(3);

var snapshot2 = list.CreateSnapshot();

list.Put(4);

And the idea is that I’ll have (at the same time) the following:

list    | snapshot1 | snapshot2
1,2,3,4 | 1,2       | 1,2,3

I want to have effectively unlimited snapshots, and the map may contain a large amount of data. In graphical form, you can see it here:

We started with Page 1, created a Copy on Write mapping for Page 2, modified Page 2 (breaking the Copy on Write), and then attempted to create a Copy on Write on top of Page 2. That turns out to be a problem.

Let’s see the code that we need in order to create a copy using copy-on-write mapping on Linux:


#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int shm_fd = shm_open("/third", O_CREAT | O_RDWR, 0666);
ftruncate(shm_fd, 4096);
char *page1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
page1[0] = 'A'; page1[1] = 'B';
// page1 = 'AB'
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0);
// page2 = 'AB'
page1[0] = 'a';
// page1 = 'aB'
// page2 = 'aB' (still the same physical page)
page2[2] = 'C'; // force creation of the private copy
// page1 = 'aB'
// page2 = 'aBC'
page1[1] = 'b';
// page1 = 'ab'
// page2 = 'aBC' (no change here)

The code in Windows is pretty similar and behaves in the same manner:


HANDLE hMapFile = CreateFileMapping(INVALID_HANDLE_VALUE,
    NULL, PAGE_READWRITE, 0, 4096, TEXT("Local\\MySharedMemory"));
char* page1 = MapViewOfFile(hMapFile,
    FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 4096);
page1[0] = 'A'; page1[1] = 'B';
// page1 = 'AB'
char* page2 = MapViewOfFile(hMapFile,
    FILE_MAP_COPY, 0, 0, 4096);
// page2 = 'AB'
page1[0] = 'a';
// page1 = 'aB'
// page2 = 'aB' (still the same physical page)
page2[2] = 'C'; // force a copy on write
// page1 = 'aB'
// page2 = 'aBC'
page1[1] = 'b';
// page1 = 'ab'
// page2 = 'aBC' (no change here)

Take a look at the API we have for creating a copy-on-write:


MapViewOfFile(hMapFile, FILE_MAP_COPY, 0, 0, 4096); // windows
mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, shm_fd, 0); // linux

A key aspect of the API is that we need to provide a source for the copy-on-write operation, and that source must be a file descriptor or mapping handle. We cannot perform a copy-on-write on top of a page that is already marked as copy-on-write, because there is nothing that refers to it: there is simply no source we can hand to the API for this sort of mapping.

I tried being clever and wrote the following code on Linux:


int selfmem = open("/proc/self/mem", O_RDWR);
char *page2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
                   MAP_PRIVATE, selfmem, (off_t)page1);

On Linux, you can use the special file /proc/self/mem to refer to your memory using file I/O. That means that I can get a file descriptor for my own memory, which provides a source for my copy-on-write operation.

I was really excited when I realized that this was a possibility. I spent a lot of time trying to figure out how I could do the same on Windows. However, when I actually ran the code on Linux, I realized that this doesn’t work.

The mmap() call will return ENODEV when I try that. It looks like this isn’t a supported action.

Linux has another call that looks almost right, which is mremap(), but that either zeros out the region or sets up a userfaultfd handler for it. So it can’t serve my needs.

Looking around, I’m not the first person to try this, but it doesn’t seem like there is an actual solution.

This is quite annoying since we are almost there. All the relevant pieces are available; if we had a way to tell the kernel to create the mapping, everything else would just work from there.

Anyway, this is my tale of woe, trying (and failing) to create a snapshot-based system using the Memory Management Unit (MMU). Hopefully, you’ll either learn something from my failure or let me know that there is a way to do this…

time to read 4 min | 683 words

Reading code is a Skill (with a capital letter, yes) that is really important for developers. You cannot be a good developer without it.

Today I want to talk about one aspect of this. The ability to go into an unfamiliar codebase and extract one piece of information out. The idea is that we don’t need to understand the entire system, grok the architecture, etc. I want to understand one thing about it and get away as soon as I can.

For example, you know that project Xyz is doing some operation, and you want to figure out how this is done. So you need to look at the code and figure that out, then you can go your merry way.

Today, I’m interested in understanding how the LMDB project writes data to the disk on Windows. This is because LMDB is based around a memory-mapped model, and Windows doesn’t keep the data between file I/O and mmap I/O coherent.

LMDB is an embedded database engine (similar to Voron, and in fact, Voron is based on some ideas from LMDB) written in C. If you are interested in it, I wrote 11 posts going through every line of code in the project.

So I’m familiar with the project, but the last time I read the code was over a decade ago. From what I recall, the code is dense. There are about 11.5K lines of code in a single file, implementing the entire thing.

I’m using the code from here.

The first thing to do is find the relevant section in the code. I started by searching for the WriteFile() function, the Win32 API for writing to a file. The first call to this function is in the mdb_page_flush function.

I look at this code, and… there isn’t really anything there. It is fairly obvious and straightforward code (to be clear, that is a compliment). I was expecting to see a trick there. I couldn’t find it.

That meant either the code had a gaping hole and potential data corruption (highly unlikely) or I was missing something. That led me to a long trip of trying to distinguish between documented guarantees and actual behavior.

The documentation for MapViewOfFile is pretty clear:

A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function.

I have had my own run-ins with this behavior, which were super confusing. This means that I had experimental evidence to say that this is broken. But it didn’t make sense; there was no code in LMDB to handle it, and this is pretty easy to trigger.

It turns out that while the documentation is pretty broad about not guaranteeing the behavior, the actual issue only occurs if you are working with remote files or using unbuffered I/O.

If you are working with local files and buffered I/O (which is 99.99% of the cases), then you can rely on this behavior. I found some vague references to this, but that wasn’t enough. There is this post that is really interesting, though.

I pinged Howard Chu, the author of LMDB, for clarification, and he was quick enough to assure me that yes, my understanding was (now) correct. On Windows, you can mix memory map operations with file I/O and get the right results.

The documentation appears to be a holdover from Windows 9x, with the NT line always being able to ensure coherency for local files. This is a guess about the history of documentation, to be honest. Not something that I can verify.

I had the wrong information in my head for over a decade. I did not expect this result when I started this post, I was sure I would be discussing navigating complex codebases. I’m going to stand in the corner and feel upset about this for a while now.

time to read 8 min | 1431 words

Today I got in my car to drive to work and realized that Waze suggested “Work” as the primary destination to select. I had noticed that before, and it is a really nice feature. Today, I got to thinking about how I would implement something like that.

That was a nice drive since I kept thinking about algorithms and data flow. When I got to the office, I decided to write about how we can implement something like that. Based on historical information, let’s suggest the likely destinations.

Here is the information we have:

The Lat & Lng coordinates represent the start location, the time is the start time for the trip, and the destination is obvious. In the data set above, we have trips to and from work, to the gym once a week, and to our parents over the weekends.

Based on this data, I would like to build recommendations for destinations. I could try to analyze the data and figure out all sorts of details. The prediction that I want to make is, given a location & time, to find where my likely destination is going to be.

I could try to analyze the data on a deep level, drawing on patterns, etc. Or I can take a very different approach and just throw some computing power at the problem.

Let’s talk in code since this is easier. I have a list of trips that look like this:


public record Trip(double Lat, double Lng, string Destination, DateTime Time);
Trip[] trips = RecentTrips(TimeSpan.FromDays(90));

Given that, I want to be able to write this function:


string[] SuggestDestination((double Lat, double Lng) location, DateTime now)

I’m going to start by processing the trips data, to extract the relevant information:


var historyByDest = new Dictionary<string, List<double[]>>();
foreach (var trip in trips)
{
    if (historyByDest.TryGetValue(trip.Destination, out var list) is false)
    {
        historyByDest[trip.Destination] = list = new();
    }
    list.Add([
        trip.Lat,
        trip.Lng,
        trip.Time.Hour * 100 + trip.Time.Minute, // time of day encoded as HHMM (9:30 -> 930)
        trip.Time.DayOfYear,
        (int)trip.Time.DayOfWeek
    ]);
}

What this code does is extract details (location, day of the week, time of day, etc.) from the trip information and store them in an array. For each trip, we basically break apart the trip across multiple dimensions.

The next step is to make the actual prediction we want, which will begin by extracting the same dimensions from the inputs we get, like so:


double[] compare = [
    location.Lat, 
    location.Lng, 
    now.Hour * 100 + now.Minute, 
    now.DayOfYear, 
    (int)now.DayOfWeek
];

Now we basically have an array of values from which we want to predict, and for each destination, an array that represents the same dimensions of historical trips. Here is the actual computation:


List<(string Dest, double Score)> scores = new();


foreach (var (dest, items) in historyByDest)
{
    double score = 0;
    foreach (var cur in items)
    {
        for (var i = 0; i < cur.Length; i++)
        {
            score += Math.Abs(cur[i] - compare[i]);
        }
    }
    score /= items.Count;
    scores.Add((dest, score));
}


scores.Sort((x, y) => x.Score.CompareTo(y.Score));

What we do here is compute the difference between the two arrays: the current start location & time compared to the start location & time of historical trips. We do that not only on the raw data but also extract additional features from the information.

For example, one dimension is the day of the week, and the other is the time of day. It is not sufficient to compare just the date itself.

The end result is the distance between the current trip start and previous trips for each of the destinations I have. Then I can return the destinations that most closely match my current location & time.

Running this over a few tests shows that this is remarkably effective. For example, if I’m at home on a Saturday, I’m very likely to visit either set of grandparents. On Sunday morning, I head to the Gym or Work, but on Monday morning, it is more likely to be Work.

All of those were mostly fixed, with the day of the week and the time being different. But if I’m at my parents’ house on a weekday (which is unusual), the location would have a far greater weight on the decision, etc. Note that the code is really trivial (I spent more time generating the actual data), but we can extract some nice information from this.

The entire code is here, admittedly it’s pretty dirty code since I wanted to test how this would actually work. At this point, I’m going to update my Curriculum Vitae and call myself a senior AI developer.

Joking aside, this approach provides a good (although highly simplified) overview of how modern AI systems work. Given a data item (image, text, etc.), you run that through the engine that outputs the embedding (those arrays we saw earlier, with values for each dimension) and then try to find its nearest neighbors across multiple dimensions.

In the example above, I explicitly defined the dimensions to use, whereas LLMs would have their “secret sauce” for this. The concept, at a sufficiently high level, is the same.
