time to read 8 min | 1553 words

In the previous post I outlined the requirements we have for FastPFor in RavenDB. Now I want to dig into the actual implementation. Here is the shape of the class in question:

image

The process starts when we start encoding the items. Each call to encode is going to process a different posting list, and it is expected that a single instance will process many posting lists during a single indexing run.

Here is the internal state of this class:

image

Here is how we start the process:

As you can see, there isn't much here. We initialize the state of exceptions and exceptionsStarts (I'll discuss these in more detail later in this post) and then we ensure that we have a big enough buffer to encode the data. Note that we retain a pointer to the original entries that we were passed; that will be important later as well.

Here is the code of the encoding process:

I create a couple of vector variables, one to hold the first item in the list and the other to hold uint.MaxValue. Then I start processing the buffers in blocks of 256 items.

A good resource for using vectors and SIMD in C# can be found here.

The classic FastPFor implementation uses vectors of 128 bits (although there are implementations of up to 512 bits). I chose to use Vector256 for the implementation. Today, pretty much any x64 machine is going to have native support for AVX2 (CPUs have been shipping with it since 2013). On ARM, these operations are implemented using Vector128, so they are still vectorized there as well.

What this means is that we can process 256 numbers as a block. But what are we doing with them? Look at the inner loop, where you can see some interesting details. We are using shuffle instructions as well as masking to shift the current value and add the previous one.

In other words, let’s assume that we have the following input array: 1,2,3,4,5,6,7,8 – using longs, so Vector256 will contain 4 elements.

On line 5, we create the prev vector and assign it to have 4 copies of the first item in the array:

prev = [1,1,1,1]

On line 13, we load the 4 elements from the array into cur, giving us:

cur = [1,2,3,4]

On line 14, we shuffle both the prev and the cur values, here are intermediate states:

curShuffled = [0,1,2,3] – note that we have 0 at the start because of the mask

prevShuffled = [1,0,0,0] – the last three items are zero because of the mask.

We then or those two intermediates together, giving us:

mixed = [1,1,2,3]

We subtract that from cur to get the delta. The idea is that we massively reduce the number of operations that we have to perform to compute the delta of the values.
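
To make the shuffle trick concrete, here is a minimal, self-contained C# sketch of the same idea (my own illustration, not RavenDB's actual code, which writes the narrowed 32-bit deltas to a separate buffer and handles the tail separately):

```csharp
using System;
using System.Runtime.Intrinsics;

// Sketch: compute running deltas in place, 4 longs per iteration, using the
// shuffle/or/subtract pattern described above. Out-of-range shuffle indices
// produce zero, which plays the role of the mask.
static void DeltaEncode(Span<long> entries)
{
    if (entries.Length < Vector256<long>.Count)
        return; // the real code falls back to scalar handling for small tails

    var prev = Vector256.Create(entries[0]); // [v0, v0, v0, v0]
    for (int i = 0; i + Vector256<long>.Count <= entries.Length; i += Vector256<long>.Count)
    {
        var cur = Vector256.Create(entries[i], entries[i + 1], entries[i + 2], entries[i + 3]);

        var curShuffled  = Vector256.Shuffle(cur,  Vector256.Create(4L, 0, 1, 2)); // [0, c0, c1, c2]
        var prevShuffled = Vector256.Shuffle(prev, Vector256.Create(3L, 8, 8, 8)); // [p3, 0, 0, 0]
        var mixed = curShuffled | prevShuffled;  // the "previous value" for every lane

        (cur - mixed).CopyTo(entries.Slice(i));  // running delta for 4 items at once
        prev = cur;                              // carry the last value into the next block
    }
    // remaining items (entries.Length % 4) are left for scalar handling
}
```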

On lines 19..22 we check if any of the deltas is greater than uint.MaxValue; we'll dig into that in a bit.

The last part of the loop is where we deal with deltaInts, and we are doing some interesting things there. We are pulling the low bits from all four elements in the vector to the start of the vector. Then we write the deltas to the output entries buffer.

However, we write 8 values but increment the position by only 4, so we'll actually overwrite the last 4 items in the next loop iteration. The output buffer is sized based on the entries buffer (with room to spare), so we are guaranteed to have enough space, and the double writes give us better performance.

The end result of the inner loop is that we are performing a running delta over the items, converting them from longs to ints. Then we can call the ProcessBlock() method to actually run the core of the FastPFor algorithm:

I already explained how this works in this post.

It is important to understand that we have multiple data channels that we operate on here:

  • The entries output, where we write the packed bits
  • Metadata, where we describe each block
  • Exceptions, where we store the exception data

The encoding process runs through all of them, but it isn't actually writing them out anywhere; that happens later.

We process the entries in the array in blocks of 256 numbers each, but what happens when we get to the end? The last block size may not be a multiple of 256, after all. For the trailing items, we use simple delta encoding plus variable-sized integers. Nothing fancy is needed at this point, since the amount of data involved is very small.

Finally, we run over all the exceptions data that we created during the encoding process and compute their size. The final result we have is the total size we would need to store all the encoded data as a single buffer.

What about handling numbers whose delta is greater than uint.MaxValue?

You can see that we expect this to be rare, so we handle that to be on the safe side, but aren’t trying to be too optimal about it. We simply take the high portion of the current delta and write it to the metadata. That is wasteful in terms of space, but it is simple and doesn’t require complex code. We’ll see how that is being used during the decoding process in the next post.

Finally, it is now time to actually write things out to the buffer. We have to be careful here, since we need to support partial writes. Let’s see how that works:

We define a header for the encoded data, which looks like this:

image

The header contains the initial value of the encoded data, the exceptions bitmap (discussed here) and the exception and metadata offsets. Because we always limit the size of the buffer to a maximum of 8KB, I used ushort here.
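
To make that concrete, here is a guess at what such a header could look like in C#. The field names and ordering are mine; only the contents (baseline value, exceptions bitmap, and the two ushort offsets) come from the description above:

```csharp
using System.Runtime.InteropServices;

// Hypothetical header layout, for illustration only; the actual RavenDB struct may differ.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PForHeaderSketch
{
    public long Baseline;            // the value the first delta in this buffer is relative to
    public uint ExceptionsBitmap;    // which exception bit widths are present in this buffer
    public ushort ExceptionsOffset;  // fits in a ushort because a buffer is capped at 8KB
    public ushort MetadataOffset;
}
```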

The first value we set in the header is the baseline, which needs a bit of explanation. When we encode the data, we start from the very first item in the list and go forward. The problem with doing things in this manner is that when we need to split the encoded data, we need to use as the baseline the previous value that we encoded against. For that reason, we use either the first item in the entries array or the item from the end of the previous block that we consumed.

Next, we need to deal with the exceptions. As a reminder, we have a set of exception arrays that are shared across blocks. When we do a partial write, we need to keep track of how many exceptions we consumed from each of the exception arrays.

The most interesting part comes when we start iterating over the metadata values, which tell us how to deal with the data:

Here we start with the exceptional cases: a metadata value that is greater than the max delta, or the varint batch marker (which occurs at the end).

You can see that there isn't much there, except that we verify that we have enough space for the written buffers. For fun, I'm also using a goto here; it is needed to break out of the loop, since a break statement inside a switch case would only exit the switch.

The more interesting aspect happens when we process full blocks of 256 compressed numbers:

This looks like a lot of code, but most of what it is doing is checking that we have enough space in the buffer for the block we are about to write. Of particular interest to us here is that we may have enough space to write the block itself, but we also need to ensure that we have space for the metadata and exceptions data that are associated with the current and all previous blocks.

If we have enough space, we continue to write more blocks to the output, or we exit if we are about to run out of room. After the loop, we have the header and compressed integers data in the output buffer, but we still need to write the metadata and exceptions data; this is handled here:

We compute the bitmap of all the exceptions, writing them out as packed segments (basically, 256 blocks of packed bits). Note that we have to account for how many of them we actually write out to the output, since we may have stopped in the middle.

Then we can simply write out the metadata buffer and return the total size we took.

I find it very amusing that we use the same bit packing routines to store the numbers and the exceptions data as well.

Overall, most of the code isn’t particularly interesting, but there are a few highlights that I would like to point out:

  • We compute the deltas and the narrowing from int64 to int32 on the buffer during the Encode()
  • We do the actual bit packing during the Write()

There is an argument to be made that we could do the bit packing during the Encode() as well, but I believe it is unlikely to matter.

In the next post, I’m going to be diving into how we implement decoding for FastPFor. That is actually of far more interest to me, since that happens a lot more often. It’s going to be interesting…

time to read 6 min | 1155 words

In this series so far, I explored several ways that we can implement integer compression. I focused on the FastPFor algorithm and dove deeply into how it works. In the last post, I showed how we can use the SIMD intrinsics to generate vectorized code that can process a lot of data very quickly.

With this baseline established, we need to understand the use case we have in RavenDB for integer compression. We are actually using this for the classic scenario, posting lists as part of the indexing engine. However, there are quite a few differences in how we want to actually utilize it.

Let's take Lucene as a good example, since that is one of the classic examples of a search engine. From the ground up, Lucene is meant to be used in batching scenarios. That has a lot of impact on the design. In particular, it means that a posting list for Lucene is something that you write once in a big batch of operations. For RavenDB, however, we explicitly reject this approach. We want to allow on-the-fly indexing and we cannot afford to pay the cost of merging many small writes to big files over time.

The FastPFor algorithm is meant to get a lot of data, encode that in a compressed form and then allow you to iterate over the results as fast as possible. But RavenDB’s needs are different. At the most basic level, RavenDB thinks about things in 8KB pages. A posting list for RavenDB is composed of the following:

  • Single – just a single value (common when indexing unique values)
  • Small – can fit in under 4KB
  • Large – requires more than 4KB of data, in which case we split it on 8KB boundaries.

What’s the point here? The idea is that when we need to update a posting list (add or remove items), we have one of those three options:

  • Single – we just update the value, duh!
  • Small – FastPFor is fast, right? It is cheap enough to unpack the compressed values, merge with the new items (or the removals) and compress again.
  • Large – here we are talking about a lot of values, potentially tens of millions or higher. How do we handle updates in this case?

The answer is that we are going to be making use of a B+ Tree, with some caveats. The basic idea is that we have two types of pages in the tree. A branch page is the classic B+ Tree concept, which points to leaf pages below it that contain some range of values. The leaf pages, on the other hand, are very different.

Each leaf page is just an 8KB buffer that holds the compressed data, nothing more. When we want to update the values in the posting list, we find the relevant page that needs to be updated, then we just unpack, merge & compress the values. It’s actually cheaper to go this route than any other alternative.

There is another important aspect. Lucene (and most other indexing systems) assume that your posting list is using 32 bits integers. That limits the number of items in an index to 2.1 billion (up to 4.2 billion, for some implementations). I don’t like that limit, not at today’s scale. So we internally use 64bits integers in our posting lists.

That… however, is actually too big. In practice, we use 54 bits for the indexed entries. That gives us a maximum of 18,014,398,509,481,984 items, and at a rate of 100,000 items per second, we are good for 5,712 years.  Those 10 bits allow us to encode additional information about the actual entry that is being indexed, and reduce the number of external lookups we need to do. Taking 10 bits from the entry id means that we are going to hit the 32 bits limit at around 2 million entries, which is really low.
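
As an illustration only, here is one way such a split could look in code. I'm assuming the 10 extra bits sit in the low part of the value (which is what makes the 32 bits limit arrive roughly a thousand times earlier); the actual layout RavenDB uses may differ:

```csharp
const int ExtraBits = 10;
const long ExtraMask = (1 << ExtraBits) - 1;

// hypothetical helpers, not RavenDB's API
static long PackEntryId(long entryId, int extraInfo) =>
    (entryId << ExtraBits) | ((long)extraInfo & ExtraMask);

static long EntryId(long packed) => (long)((ulong)packed >> ExtraBits);

static int ExtraInfo(long packed) => (int)(packed & ExtraMask);
```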

Then there is another problem: if I have a posting list with 100,000 items in it that I want to compress, I would need to compress that into multiple pages, but figuring out how to split the compressed stream in a way that remains usable to work with is not trivial at all.

In short, we have two major concerns with using FastPFor in RavenDB, int64 and paged writes. There is another aspect that I didn’t mention, however. FastPFor will take an array of numbers and compress them. If you want to also use delta compression, that is something that you need to add on top of that.

Now that we understand the constraints we have to operate around, let’s talk about the kind of API that I would like to build:

image

What is going on here?  Well, the basic idea is that you’ll create an encoder as part of the indexing process. For each posting list that you want to process, you’ll call the Encode() method, passing it the data to be encoded.

The Encode() method will return the required size for encoding the entire list as a single operation. Note that it does not actually write the data out (indeed, it has nowhere to write it). The calling code is then able to make decisions based on the returned size. If they have a buffer big enough to hold the encoded data, they can pass it to the Write() method.

Things get more interesting when we don’t have a suitable buffer. Remember that we have a limit of 4KB for small posting lists and 8KB per page for large posting lists. The idea with Write() is that it will write as much as it can to the output, allowing you to use multiple buffers to store the compressed posting list.

In terms of work being done, the Encode() phase copies the posting list entries to its own buffer, preparing them to be written out. The process of actually writing the data to the output buffer is mostly concerned with ensuring that we don't exceed the size of the provided buffer.

This approach means that we only have to pay the encoding cost once, which is quite nice. Moreover, the number of checks that we have to make to ensure that we can fit the data into the buffer is also greatly reduced.

What about decoding? That is handled using:

image

Here we have a very interesting observation: the Encoder is a class, but the Decoder is a struct. We assume that decoding is far more frequent, so we use a struct to avoid an unnecessary memory allocation.

In order to ensure high performance, the decoder makes assumptions about the provided buffers. We will always accept a buffer that has a minimum size, for example.

With all of this out of the way, I’m going to dedicate the next post to discussing the actual implementation and semantics we have for FastPFor encoding.

time to read 4 min | 715 words

In the code of the simdcomp library there is a 25KLOC file that does evil things to SIMD registers to get bit packing to work. When I looked at the code the first few dozen times, I had a strong desire to run away screaming. Luckily, this isn't just some pile of complicated code, but a well-thought-out set of functions that are meant to provide optimal code for specific tasks. In particular, the code is specialized for each bit width that you may want to bit pack (0 .. 32). Even better, no one actually sat down to write it out by hand; there is a Python script that generates the code.

The first step was to understand what exactly is going on in the code, and then see how we can translate that to C#. Even just a few years ago, that would have been either an impossible dream or required the use of a native library (with the associated PInvoke overhead). However, .NET today has a very robust intrinsics support, and vectors / SIMD instructions are natively supported.

I actually had a tougher challenge, since I want this code to run on x64 and ARM64 instances. The original code is x64 only, of course. The nice thing about SIMD support for .NET is that most of that can be done in a cross platform manner, with the JIT deciding what instructions will actually be emitted. There is still a very strong correlation between the vectorized code and the instructions that are being emitted, which means that I get both great control of what is going on and the appropriate abstraction. I was actually able to implement the whole thing without dropping to architecture-specific code, which makes me very happy.

Before we get any deeper, let’s take a simple challenge. We want to take an array of 32 integers and pack each one of them in 4 bits into an array of 16 bytes. Here is what the code will look like:

This is a bit dense, but let’s see if I can explain what is going on here.

We load from the array a vector (4 items) at the 0, 4, 8, 12, 16, 20, 24, and 28 intervals. For each one of those, we shift the values by the required offset and or all of them together. Note that this means that the first item’s four bits go in the first nibble, but the second item’s bits go in the fifth nibble, etc.
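
Here is a rough sketch of that packing step using the cross-platform Vector128 API (my own illustration of the idea, assuming the inputs already fit in 4 bits, i.e. the "without mask" flavor):

```csharp
using System;
using System.Runtime.Intrinsics;

// Pack 32 small integers (each < 16) into 16 bytes: 8 loads of 4 values,
// each spliced in at bit offset 4 * i and OR-ed into the accumulator.
static Vector128<uint> Pack4Bits(ReadOnlySpan<uint> input) // input.Length == 32
{
    var output = Vector128<uint>.Zero;
    for (int i = 0; i < 8; i++)
    {
        var chunk = Vector128.Create(input[i * 4], input[i * 4 + 1],
                                     input[i * 4 + 2], input[i * 4 + 3]);
        output |= Vector128.ShiftLeft(chunk, i * 4);
    }
    return output; // 16 bytes holding all 32 nibbles (in an interleaved layout, not sequential order)
}
```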

The idea is that we are operating on 4 items at a time, reducing the total number of operations we have to perform. It may be easier to understand if you see those changes visually:

image

What is happening here, however, is that we are able to do this transformation in very compact code. That isn’t just an issue of high-level code, let’s take a look at the assembly instructions that this generates:

I'm going to assume that you aren't well versed in assembly, so let's explain what is going on. This code contains zero branches; it does four reads from memory, masks the elements, shifts them, and ORs them together.

The relevant instructions are:

  • vmovupd – read 4 integers to the register
  • vpand – binary and with a value (masking)
  • vpslld – shift to the left
  • vpor – binary or
  • vmovdqu – write 16 bytes to memory

There are no branches, nothing to complicate the code at all. This is about as tight as you can get, at the level of machine instructions.

Of particular interest here is the C# code. The entire thing is basically a couple of lines of code, and I could express the whole thing as a single expression in a readable manner. Let’s look at the C code, to compare what is going on:

Note that both should generate the same machine code, but being able to rely on operator overloading means that I can now get a far more readable code.

From that point, the remaining task was to re-write the Python script so it would generate C# code, not C. In the next post I’m going to be talking more about the constraints we have and what we are actually trying to solve with this approach.

time to read 3 min | 563 words

As I mentioned, I spent quite a lot of time trying to understand the mechanism behind how the FastPFor algorithm works. A large part of that was the fact that I didn't initially distinguish the separate steps in the process. Finding the bit widths, packing the bits, metadata and exception management are all integral to how this works, and I tried to grok it all in one shot. I kept getting lost in 14KLOC files with dense SIMD instructions.

When looking at the code, it is also important to keep in mind that this is not a library that you are meant to pick up & use. This is a research project. As such, there was enough work done to show the results, but all the spit & polish that are associated with making a library ready for general consumption weren’t done.

A simple example of that is that the code just assumes that the output buffer provided will be large enough (with no way to estimate upfront) and will overflow the buffer if that happens. Perfectly acceptable for research code, hell no for production usage, naturally.

In the same manner, the repository contains a lot of code. Most of that is being used to compare the FastPFor algorithm to various other options. That can make it hard to understand what the core parts of the algorithm are and what is effectively chaff.

To be honest, the hardest part for me, however, was figuring out the memory allocation patterns that are going on in here. There is a high usage of C++ containers, with implicit allocations and hidden memory management. Here is a piece of code that I had to struggle to understand for quite some time, for example:

This does bit-packing using SIMD without a mask (the naming convention is something that you really have to get used to, I don’t think I encountered justlowercase before).

Note the call to fastpackwithoutmask() in there; it assumes that the provided buffer has exactly 32 elements. A lot of the code in the project assumes that you are passed a buffer of a certain size and operates on that directly. That produces great results in terms of efficiency and code density. But when I read this code it was "obvious" to me that there is out-of-band work here. If the provided buffer isn't an exact multiple of 32 elements, that method will read or write beyond the end of the buffer.

Look at line 9 there, where the source container is being resized to ensure that this is fine, since we increase the container size to be exactly 32 elements aligned. We revert this operation later in line 20.

For some SIMD instructions, the alignment of the buffers we are working on matters. This is handled using a custom allocator on the containers, which is not usually something that I would pay attention to.

In short, from my perspective, there is a lot of hidden behavior in non-trivial C++ code usage there that masks the actual behavior of the algorithm in question.

If you are looking at FastPFor (and if you care enough to look at this series of posts, you probably should), take into consideration that this code shouldn’t be applied as is, but should probably be adapted for your needs.

In the next post, I’m going to start discussing how we did just that for RavenDB.

time to read 9 min | 1766 words

FastPFor is an integer compression algorithm that was initially published in 2012. You can read the paper about it here: Decoding billions of integers per second through vectorization.

I’ve run into this algorithm many times in the past. You pretty much can’t be in the database arena and not run into that. It is an interesting paper, and it has a GitHub repository with the code, which is great. Except that I couldn’t figure out what was going on there.

I actually tried stepping through the code a bunch of times, and I always ended up getting lost. The code is in C++ and makes heavy use of templates, containers and non-trivial amounts of magic. To give some context, I gave up when I ran into these code snippets:

The paper itself describes the algorithm, but not in a way that actually made sense to me. I tried looking at the code, and I basically noped out. Too complex for me to understand, it seems.

But I kept running into this, and we recently had a strong need for something similar. So I basically took the time to meditate on this for a while.

On a personal level, I realized that I was trying to understand how the whole thing worked by just stepping through the code and reading the paper. The problem was that I was mixing several different layers and wasn’t able to create the proper separation between them.

FastPFor builds upon the previously discussed simdcomp, once you understand that, you can go through this in isolation. For this reason, I first talked about SIMD bit-packing and showed an example usage, before actually diving into how FastPFor itself works.

As a reminder, simdcomp provides routines for quick packing and unpacking of integers, nothing more. FastPFor is where the actual compression happens. Let’s see what is actually going on here. Consider the following list, which we previously discussed:

1871143144
4
4
4
4
4
4
4
4
4
4
7984
4
4
4
4

If we try to bit-pack this list, we'll find that we aren't gaining much. The first value is 31 bits in length, after all, and that means that we aren't able to save much. We talked about Frame of Reference (FOR) as a way to handle that. We can treat every block of 128 numbers as a block that has its own reference point and compute the maximum number of bits for that block. That would actually save us a lot of space, but it isn't ideal. In the case above, most entries in the list can be packed in just 3 bits, except for 2 of them. That is where PFor comes into play, which stands for Patched Frame of Reference.

The key aspects of FastPFor are how it figures out what is the best size to use to pack the numbers, the way it detects what should be patched, and the manner in which it stores this.

The code boils down to a single function:

What is going on here? This function accepts an array of integers (whose size is 128) and computes the best number of bits to use to pack it.

How does that work? The first thing this does is compute how many numbers we have for each bit range. This is done using the asmbits() call, which is basically a LeadingZeroCount(). The end result is a frequency map that we can use to estimate how many bits will be required to store the items from this block.

We start from the maximum number of bits that we need to store all the numbers in the block, and we go down, checking what would be the best cost (in terms of space). Since we need to store exceptions somewhere, we compute the cost of that as we go down in the list.

This function gives us:

  • bestb – the number of bits needed to pack all the items in the block that would result in the smallest output size
  • bestcexception – the count of exceptions (numbers that do not fit into bestb bits)
  • maxb – the maximum number of bits that we need to store for this block

That is the key function, the one that is responsible for getting the best compression ratio.
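
To make the cost model concrete, here is a scalar sketch of the same idea (the real function is vectorized and its exact cost accounting differs, so treat this as an approximation of the logic, not a port):

```csharp
using System;
using System.Numerics;

// For a block of 128 values: find the bit width b that minimizes
// (packed block) + (per-exception metadata byte) + (high bits stored per exception).
static (int bestB, int exceptions, int maxB) GetBestB(ReadOnlySpan<uint> block)
{
    Span<int> freq = stackalloc int[33]; // freq[n] = how many values need exactly n bits
    foreach (var v in block)
        freq[32 - BitOperations.LeadingZeroCount(v | 1)]++;

    int maxB = 32;
    while (maxB > 1 && freq[maxB] == 0) maxB--;

    int bestB = maxB, bestCost = maxB * block.Length, bestExceptions = 0;
    for (int b = maxB - 1, exceptions = 0; b >= 1; b--)
    {
        exceptions += freq[b + 1]; // values that no longer fit in b bits become exceptions
        int cost = b * block.Length + exceptions * 8 + exceptions * (maxB - b);
        if (cost < bestCost)
        {
            bestCost = cost;
            bestB = b;
            bestExceptions = exceptions;
        }
    }
    return (bestB, bestExceptions, maxB);
}
```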

What I found that made FastPFor challenging to understand is that it has a lot of state that you need to keep track of. What I described so far is a single block, but FastPFor operates over sets of numbers, and so needs to tie all of them together.

At any given point, FastPFor has:

  • The block that is being outputted
  • The metadata about the entire process
  • 32 arrays of exceptions

The process interleaves all of them together in interesting ways, and I had a hard time figuring out how it all ties together.

Let’s talk about the way FastPFor processes a single block. We process the data in the array in blocks of 128 numbers at a time, like so:

The first thing that happens is the computation of the best bits widths for the current block. Then, we use the bc value to record the metadata about the block.

That is an array of bytes with the following format:

  1. Packed number of bits
  2. Number of exceptions

If there are exceptions for this block, we also add:

  1. Maximum number of bits
  2. Offset of exception in the block (repeated as the number of exceptions)

The metadata is shared for the entire encoding operation, across blocks.
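
Put together, the per-block metadata described above could be written out with something like this (illustrative only; the field order is the one given in the list above, not the library's actual code):

```csharp
using System;
using System.Collections.Generic;

// Append one block's metadata to the shared metadata buffer.
static void WriteBlockMetadata(List<byte> metadata, byte bestB, byte exceptionCount,
                               byte maxB, ReadOnlySpan<byte> exceptionOffsets)
{
    metadata.Add(bestB);              // 1. packed number of bits
    metadata.Add(exceptionCount);     // 2. number of exceptions
    if (exceptionCount > 0)
    {
        metadata.Add(maxB);           // 3. maximum number of bits
        foreach (var offset in exceptionOffsets)
            metadata.Add(offset);     // 4. offset of each exception (a block is 128 values, so it fits in a byte)
    }
}
```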

You can see in the code that if bestcexcept (which holds the number of exceptions) is greater than zero, we find the appropriate exceptions buffer to use. That requires a better explanation.

The getBestBFromData() call gave us two important data points: the best number of bits to pack the numbers, and the maximum number of bits. We are going to pack all the numbers in the block to the output, but what about the exceptions? Where do they go?

It turns out that we need to store the exceptions, but we don't actually need to store the max bits, only the difference between the best number of bits and the maximum. This is what thisexceptioncontainer is holding: the remaining bits for exceptions. It's important to understand that this is done across blocks. So the thisexceptioncontainer value holds exceptions from many different blocks. That will turn out to be very important in just a little bit. We then scan the block and write the remaining bits to the container, and write to the metadata the offset of the value in the block. Since we are using blocks of 128 numbers, this is guaranteed to fit inside a byte.

The last step that we do for the block is to call packblockupsimd(), which ends up calling simdpack() and packs all the numbers from the block at bestb bits into the output.

It’s really important to understand that we still have two data items that haven’t been written. The first is the metadata for the encoding process (bits in blocks, exceptions offsets, etc). The second is the set of exceptions themselves.

This process repeats itself for each block, until the end of the buffer. It is at this point that we need to write the remaining data to the output. Here is what the code looks like:

What is going on? This is dense, and it takes a while to unpack (pun intended) what is going on here.

First, we write the size of the metadata and then we copy the metadata that we have to the output. That is the description of all the blocks that we just went over. Then, we run over the set of exceptions that we gathered. Remember, the datatobepacked is an array that holds lists of exception data for each bit range that we have to store. We iterate over all the bit widths where we wrote exception data and generate a bitmap. That will help us later during the decoding process.

Finally, we run over the same exceptions and write them out. Note that we do that using the same simd bit-packing approach as the rest of the code.

The end result is that we have the following format:

image

For decoding, we start from the metadata offset and jump to that. Then we read the exceptions bitmap. That tells us how many exception lists we have and how many bits are used in each one of them. In the image above, you can see that we have 3 exception lists, for 4 bits, 12 bits and 19 bits.

We use the simd bit-packing to decode those values into memory. We'll need them shortly.  Now, we start iterating over the metadata; each block has an overhead of at least two metadata bytes (number of bits and number of exceptions). Let's assume that we don't have any exceptions in the block. In that case, the process of decoding is simple. Just do the unpacking from bits to numbers and you are done.

If we have exceptions, on the other hand, we have to deal with them. At that point, the next metadata byte would contain the maximum number of bits for the exceptions in this block. Using that and the normal number of bits, we can tell where the extra bits are located. Here is the relevant code:

Note that there is a special case here if the difference in the number of bits between the block bits and the maximum number of bits is one. In that case, we don’t need to store the data, we already know what the value is then (it’s one, after all). So we can compute it once and set it in the output.

For scenarios where we can't tell from the bit width alone, we read the relevant exception array based on the difference between the bits in the block and the exception bits. That gives us an array that is shared across blocks. The idea is that the metadata contains the offset in the block, and we read from the relevant array one item at a time. This two-step process will result in us setting the right value and getting ready for the next call, which may happen in a different block.

Note that the exception bits buffers only care about the number of bits, not where they come from. So two blocks, which have (bestb: 4, maxb: 7) and (bestb: 7, maxb: 10), will both go to the 3 bits container.
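
Here is a scalar sketch of that patching step, just to pin down the flow (the real code is vectorized; the parameter shapes here are my own assumptions):

```csharp
using System;

// Patch one decoded block: splice the exception high bits back in above bestB.
// exceptionValues is the shared (maxB - bestB)-bits container; exceptionIndex
// tracks how much of it has been consumed so far, across blocks.
static void PatchBlock(Span<uint> block, int bestB, int maxB,
                       ReadOnlySpan<uint> exceptionValues, ref int exceptionIndex,
                       ReadOnlySpan<byte> offsets)
{
    foreach (var offset in offsets)
    {
        uint highBits = maxB - bestB == 1
            ? 1u                                  // the single missing bit is known to be set
            : exceptionValues[exceptionIndex++];  // otherwise, read from the shared container
        block[offset] |= highBits << bestB;
    }
}
```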

Okay, this blog post is getting long enough, so I think that would be it. Hopefully, it will make it easier to understand exactly how the FastPFor format is working. I’m going to be talking more about FastPFor and the format implications in the next post, then move on to actually using that in C#.

time to read 3 min | 430 words

I talked a bit before about the nature of bit packing and how the simdcomp library isn’t actually doing compression. Why do I care about that, then?

Because the simdcomp library provides a very useful building block. A simple set of functions that allows you to very quickly pack and unpack bits. That is important, since those operations are fast enough that we can treat them as effectively free. Once that exists, we can now start finding interesting usages for those.

Let’s assume that we have the following functions:

  • Pack(Span<uint> input, Stream output, int bit);
  • Unpack(Stream input, Span<uint> output, int bit);

They get a set of numbers and the number of bits, and encode or decode them as needed.

Because those methods are fast, it means that we can play with various interesting designs. For example, let’s take dictionary compression. Assume that we have a set of values, such as country names. This looks like this:

[ "United States", "China", "Japan", "India", "United States", "Brazil", "Germany", "India", "France", "China", "Japan", "Brazil", … ]

This array contains thousands or millions of entries. We want to encode that in an efficient manner, how can we do that? Here is a simple way to utilize the speed of bit-packing to do something useful:

What is going on here?

  • We first iterate over the set of countries, finding the unique names and giving an index for each one of them.
  • Along the way, we produce an integer array of the indexes of country names.
  • Then we compute the maximum number of bits that we have to use to store those indexes.
  • To the output, we write the number of unique country names, the names themselves and then the bit-packed set of indexes.
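
Here is a minimal sketch of those steps. It assumes a Pack() routine with the signature listed above; everything else is plain C# and purely illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Numerics;
using System.Text;

static void EncodeWithDictionary(IReadOnlyList<string> values, Stream output)
{
    var dictionary = new Dictionary<string, uint>();
    var names = new List<string>();
    var indexes = new uint[values.Count];
    for (int i = 0; i < values.Count; i++)
    {
        if (!dictionary.TryGetValue(values[i], out var idx))
        {
            dictionary[values[i]] = idx = (uint)names.Count;
            names.Add(values[i]);
        }
        indexes[i] = idx;
    }

    // how many bits we need to store the largest index
    int bits = 32 - BitOperations.LeadingZeroCount((uint)Math.Max(1, names.Count - 1));

    using var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true);
    writer.Write(names.Count);
    foreach (var name in names)
        writer.Write(name);          // the unique values themselves
    writer.Write((byte)bits);
    writer.Flush();

    Pack(indexes, output, bits);     // bit-pack millions of tiny indexes very cheaply (assumed routine)
}
```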

Let’s assume that we have 30 unique values in this list, but the list itself is 5 million items in size. What is the actual cost here?

We’ll assume an average country name of 8 bytes, to make the math easy. That means that to store the list of countries would cost us 38MB, if we did that using the most obvious approach.

Using the code above, assuming 30 unique values, we’ll have 5 million values with 5 bits each, leading to a total cost of about 3MB, instead of 38MB.

The nice thing about this approach is that once you have fast bit-packing routines, you can now start using them in all sorts of interesting ways.

The reason I talked about the topic before starting to discuss FastPFor is that it makes a lot more sense to understand how bit-packing can be applied separately from FastPFor, since that is just making interesting use of the capability.

time to read 3 min | 520 words

In the last post, I talked about how the simdcomp library is actually just doing bit-packing. Given a set of numbers, it will put the first N bits in a packed format. That is all. On its own, that isn’t very useful, we have to pair it with something else to make it do something more interesting.

Previously, I talked about delta compression, what if we put both of those together?

Given a sorted list of numbers, we’ll compute the delta between the numbers, then we can figure out how many bits we need to store them, and we can pack that tightly.

Analyzing the sample posting list I’m using, we get the following batches (of 128 numbers each)

  • 328 batches at 13 bits
  • 11 batches at 22 bits
  • 7 batches at 23 bits
  • 3 batches at 27 bits
  • 1 batch at 31 bits
  • 1 batch at 28 bits

Doing the math, we can fit all of those using 77KB or so.

The idea is that for each batch, we can use simdpack() to write just the relevant bits. We'll need some extra metadata to keep track of the number of bits per batch, but that is just one byte for every 128 values. For that matter, we can bit-pack that as well.

Note that in this case, we are using the simd packing routines just for bit packing. The actual compression aspect comes not from the packing, but from computing the deltas and then figuring out how densely we can pack those values.
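
A scalar sketch of that scheme, assuming the deltas were already computed and a Pack(Span<uint>, Stream, int bits) routine like the one discussed earlier exists, might look like this:

```csharp
using System;
using System.IO;
using System.Numerics;

// Frame Of Reference: one bit width per batch of 128 deltas, then bit-pack the batch.
static void ForEncode(Span<uint> deltas, Stream output, int batchSize = 128)
{
    for (int start = 0; start < deltas.Length; start += batchSize)
    {
        var batch = deltas.Slice(start, Math.Min(batchSize, deltas.Length - start));

        uint max = 0;
        foreach (var v in batch)
            max = Math.Max(max, v);
        int bits = 32 - BitOperations.LeadingZeroCount(max | 1);

        output.WriteByte((byte)bits); // one metadata byte per batch
        Pack(batch, output, bits);    // assumed bit-packing routine
    }
}
```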

This approach is called Frame Of Reference (FOR), and it has some really nice advantages. By splitting the data into blocks, we can take advantage of the patterns in the data. Here are the first 16 deltas in the posting list:

 

                   

|       | Value      | Bits |
|-------|------------|------|
| [0]   | 1871143144 | 31   |
| [1]   | 4          | 3    |
| [2]   | 4          | 3    |
| [3]   | 4          | 3    |
| [4]   | 4          | 3    |
| [5]   | 4          | 3    |
| [6]   | 4          | 3    |
| [7]   | 4          | 3    |
| [8]   | 4          | 3    |
| [9]   | 4          | 3    |
| [10]  | 4          | 3    |
| [11]  | 7984       | 13   |
| [12]  | 4          | 3    |
| [13]  | 4          | 3    |
| [14]  | 4          | 3    |
| [15]  | 4          | 3    |

Let's assume that the batch size is 4 numbers; what would the numbers be here? Basically, we take the max number of bits for each batch of 4 numbers.

The first block will require us to use 31 bits, the second would be just 3 bits, but the third would require 13 bits and the last is 3 bits again.

The idea is that for each block, we have a separate frame of reference, the number of bits that are required for us to record & retrieve the data.

Of course, if you look at the data itself, we are wasting quite a lot of space here needlessly. There are just two numbers (at indexes 0 and 11) that require more than 3 bits.

There is an algorithm called Patched Frame Of Reference (PFor) that handles this. Basically, when we compute a batch, we’ll record it using 3 bits only in this case. What about the rest of the values? Those we’ll store outside the batch, as an exception. On reading, we’ll need to patch the data to make it whole.

That adds a bit of complexity, but it also means that you can compress the data a lot more densely.

In the next post, I’m going to be talking about FastPFor, which combines simd bit packing, patched frame of references and some really clever tricks to get a very interesting integer compression algorithm.

time to read 4 min | 728 words

In the previous post, I showed how you can use integer compression using variable-size integers. That is a really straightforward approach for integer compression, but it isn’t ideal. To start with, it means that we take at least one byte for each number, but in many cases, especially if we are using delta encoding, we can pack things a lot more tightly. For reference, here are the first 15 records we have in the posting list I’m using for testing, delta encoded:

1871143144
4
4
4
4
4
4
4
4
4
4
7984
4
4
4

See the run of 4s, repeating. Each one of them can be represented by 3 bits, but we'll waste 32 to 64 bits to hold it.

When you are searching for data on “database stuff”, the building blocks that are being used to build databases, you’ll usually find Lemire there. In this case, I would like to talk about the SimdComp library. From the GitHub repository:

A simple C library for compressing lists of integers using binary packing and SIMD instructions. This library can decode at least 4 billion of compressed integers per second on most desktop or laptop processors.

Those are some really nice numbers, but I actually have a pet peeve with the name of the library. It isn’t actually doing compression, what it does is bit packing, which is something quite different.

Bit packing just takes numbers that are 32 bits in length and stores them using fewer bits. For example, let's consider the following list: 1,2,3,4,5,6,7,8,9,10. If I were to store that as 32-bit integers, it would take 40 bytes. But given that the maximum value is 10, I can use 4-bit integers instead, taking only 5 bytes.

The core of the library is this set of routines:

  • simdpack()
  • simdpackwithoutmask()
  • simdunpack()

All of them have roughly the same structure, something like this:

image

This switch statement goes on for 32 bits. Each time selecting a different function. This code makes a lot of assumptions. You’ll always give it exactly an input of 128 numbers and it will pack them and write them to output. The idea is to be as compact as you possibly can.

Reading this code is.. a challenge, because there is so much mental weight behind it, or so it seems. For example, take a look at how we pack a set of numbers into 2-bit integers:

The actual code goes on for quite a while (104 lines of dense SIMD code), so let's try to break it up a bit. If we translate this from SIMD to scalar, we'll have code like this:

If you are used to manipulating bits, it's fairly straightforward. We start by setting the output to the first value, then the second value is shifted by 2 bits and OR-ed into the output, then the third value is shifted by 4 bits, and so on.
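
As a stand-in for that scalar version, here is what the idea boils down to (my own sketch, not the library's generated code):

```csharp
using System;

// Fold 16 two-bit values into a single 32-bit word: value i lands at bit offset 2 * i.
static uint Pack2BitsScalar(ReadOnlySpan<uint> input) // input.Length == 16, each value < 4
{
    uint output = 0;
    for (int i = 0; i < 16; i++)
        output |= (input[i] & 0b11u) << (i * 2);
    return output;
}
```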

With SIMD, the interesting part is that we aren’t operating on a single value, but 4 at a time. That also means that we have a difference in how the data actually end up being packed.

Given a list of values from 1 .. 16, bit packing them will usually result in the data looking like this:

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16

In contrast, the SIMD bit packing model will store the data like so:

1	5	9	13	2	6	10	14	3	7	11	15	4	8	12	16

This is a result of operating on 4 items at a time, but in practice, the actual placement of the bits does not matter much, just that we are able to store and retrieve them.

Note that for packing, we have two options, one that uses a mask and the other that does not. That will be important later. The only difference between those options is whether a mask is applied before we splice the bits into the output, nothing else.

The reason I said that this isn’t compression is that this library is focused solely on packing the bits, it has no behavior / concept beyond that stage. In the next post, we’ll start looking into how we can make use of this in more interesting ways.

time to read 3 min | 418 words

If you are building a database, the need to work with a list of numbers is commonplace. For example, when building an index, we may want to store all the ids of documents of orders from Europe.

You can just store the raw values, of course, but that can be incredibly wasteful in terms of disk space, memory bandwidth and performance. Consider a database where millions of orders are from Europe. We’ll need multiple megabytes just to store the document ids. We can utilize patterns in the numbers to reduce the total storage size. For example, if you are storing a list of sorted numbers, you can store the delta of the numbers, instead of the whole value. The delta tends to be much smaller, after all. Take a look at the following code:

This will generate (in place) delta for the sorted list. We can then write it to a stream, using 7-bit integer (also called variable size integers) encoding. This way, smaller numbers take less space, which is the point of first doing delta encoding.
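
Since the code itself isn't shown here, this is roughly what those two steps amount to (a sketch of the general technique, not the exact code from the post):

```csharp
using System;
using System.IO;

// Turn a sorted list into gaps, in place (the first entry keeps its full value).
static void DeltaEncodeInPlace(Span<long> sorted)
{
    for (int i = sorted.Length - 1; i > 0; i--)
        sorted[i] -= sorted[i - 1];
}

// Write each delta as a 7-bit encoded (variable size) integer: small gaps take one byte.
static void WriteVarInts(ReadOnlySpan<long> deltas, Stream output)
{
    foreach (var delta in deltas)
    {
        var v = (ulong)delta;
        while (v >= 0x80)
        {
            output.WriteByte((byte)(v | 0x80)); // 7 bits of payload plus a continuation bit
            v >>= 7;
        }
        output.WriteByte((byte)v);
    }
}
```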

For testing purposes, I have a list of 44,956 items, which looks like this (full list is here):

1871143144
1871143148
1871143152
1871143156
1871143160
1871143164
1871143168
1871143172
1871143176
1871143180

Note that I'm using int64, so if we were to store the data as is, it would take ~351KB. Using delta encoding, we can pack that into just ~45KB.

I care a lot more about the decoding of the values, so let’s see how we read it back, shall we?
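
A matching read-back sketch (again, the general technique rather than the exact code) looks like this:

```csharp
using System.IO;

// Read `count` 7-bit encoded deltas and reconstruct the original sorted values.
// Assumes a well-formed stream.
static long[] ReadVarIntDeltas(Stream input, int count)
{
    var result = new long[count];
    long previous = 0;
    for (int i = 0; i < count; i++)
    {
        ulong value = 0;
        int shift = 0, b;
        do
        {
            b = input.ReadByte();
            value |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);

        previous += (long)value; // undo the delta encoding
        result[i] = previous;
    }
    return result;
}
```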

With very little code, we can gain quite a jump in information density. That is quite good. For reference purposes, taking the same data and compressing it using GZip would cost us ~58KB, so significantly more than this simpler mechanism.

On the other hand, using GZip on the delta encoded list will put us under 8KB. This is because in this context, there is a great deal of repetition in the data, there are many runs that are exactly 4 values apart from one another, so we can get quite a good compression ratio out of this.

That does require that we’ll have patterns in the data, which is by no means a guarantee. There is a whole field out there of research into integer compression, since it is such an important aspect of how database internals are working.

In this series, I’m going to focus on detailing how we implemented FastPFor inside of RavenDB, what we learned along the way and what the final results were.

time to read 7 min | 1244 words

In this series so far, we reduced the storage cost of key/value lookups by a lot. And in the last post we optimized the process of encoding the keys and values significantly. This is great, but the toughest challenge is ahead of us, because as much as encoding efficiency matters, the absolute cost we have is doing lookups. This is the most basic operation, which we do billions of times a second. Any amount of effort we’ll spend here will be worth it. That said, let’s look at the decoding process we have right now. It was built to be understandable over all else, so it is a good start.

What this code does is accept a buffer and an offset into the buffer. But the offset isn't just a number; it is composed of two values. Twelve bits contain the offset in the page, and since we use 2-byte alignment for the entry position, one bit of that offset is implicitly zero and doesn't need to be stored. That is why we compute the actual offset in the page by masking off the four indicator bits and shifting to restore the implicit zero bit. That extracts the actual offset in the page (usually a 13 bits value) using just 12 bits. The four indicator bits in the offset select the key and value lengths. There are 15 known values, which we computed based on probabilities, and one value reserved to say: rare key/val length combination, the actual sizes are stored as the first byte in the entry.
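
As a sketch of that scheme (I'm assuming here that the indicator occupies the top four bits and the stored offset the low twelve; the real layout may put them at the other end):

```csharp
// Unpack a 16-bit entry position: 4 bits select the key/value length combination,
// 12 bits hold the 2-byte-aligned offset of the entry in the page.
static (int entryOffset, int lengthIndicator) DecodeOffset(ushort packed)
{
    int lengthIndicator = packed >> 12;       // 15 known combinations + 1 "rare" marker
    int entryOffset = (packed & 0x0FFF) << 1; // restore the implicit zero bit from the alignment
    return (entryOffset, lengthIndicator);
}
```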

Note that in the code, we handle that scenario by reading the key and value lengths (stored as two nibbles in the first byte) and incrementing the offset in the page. That means that we skip past the header byte in those rare situations.

The rest of the code is basically copying the key and value bytes to the relevant variables, taking advantage of partial copy and little-endian encoding.

The code in question takes 512 bytes and has 23 branches. In terms of performance, we can probably do much better, but the code is clear in what it is doing, at least.

The first thing I want to try is to replace  the switch statement with a lookup table, just like we did before.  Here is what the new version looks like:

The size of the function dropped by almost half and only 7 branches remain. There are also a couple of calls to the memory copy routines that weren't inlined. In the encoding phase, we reduced branches due to bound checks by using raw pointers, and we skipped the memory copy routines by copying a fixed size value at varied offsets to be able to get the data properly aligned. In this case, we can't really do the same. One thing that we have to be aware of is the following situation:

image

In other words, we may have an entry that is at the end of the page; if we unconditionally try to read 8 bytes, we may read past the end of the buffer. That is not something that we can do. In the Encode() case, we know that the caller gave us a buffer large enough to accommodate the largest possible size, so that isn't an issue. That complicates things, sadly, but we can go the other way around.

The Decode() function will always be called on an entry that is part of a page. The way we place entries means that we are starting at the top and moving down. The structure of the page means that we can never actually place an entry below the first 8 bytes of the page. That is where the header and the offsets array go, after all. Given that, we can do an unconditional read backward from the entry. As you can see in the image below, we are reading some data that we don't care about, but this is fine, we can fix it later, and without any branches.

image

The end result is that we can have the following changes:

I changed the code to use a raw pointer, avoiding the bound checks that we already reasoned about. Most interesting is the ReadBackward function. This is an inner function that was properly inlined during JIT compilation; it implements the backward reading of the value. Here is what the assembly looks like:

With this in place, we are now at 133 bytes and a single branch operation. That is pretty awesome, but we can do better still. Consider the following code (explanations to follow):

Note that the first element in the table here is different: it is now setting the 4th bit. This is because we are making use of that. The bytes in the table are structured as two nibbles, but no other value in the table sets the 4th bit. That means that we can take advantage of it.

Indeed, what we are doing is using the decoder byte to figure out what sort of shift we want. We have the byte from the table and the byte from the buffer. And we use the fact that masking the table byte with 8 gives (just for this value) the value of 8. We can then use that to select the appropriate byte. If we have an offloaded byte, then we'll shift the value by 8, getting the byte from the buffer. For any other value, we'll get 0 as the shift index, resulting in us getting the value from the table. That gives us a function with zero branches, and 141 bytes.
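
The selection trick can be sketched in isolation like so (illustrative, not the actual decoder):

```csharp
// tableByte sets its 4th bit only for the "rare lengths" entry, so (tableByte & 8)
// is either 8 or 0 and can be used directly as a shift amount to pick a byte.
static byte SelectLengthByte(byte tableByte, byte bufferByte)
{
    int shift = tableByte & 8;                    // 8 for the rare case, 0 otherwise
    int combined = tableByte | (bufferByte << 8); // low byte: table, high byte: buffer
    return (byte)(combined >> shift);             // rare case picks the buffer byte
}
```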

I spent a lot of time thinking about this, so now that we have those two approaches, let's benchmark them. The results were surprising:

|                  Method |       Mean |    Error |   StdDev |
|------------------------ |-----------:|---------:|---------:|
|  DecodeBranchlessShifts | 2,107.1 ns | 20.69 ns | 18.34 ns |
|           DecodeBranchy |   936.2 ns |  1.89 ns |  1.68 ns |

It turns out that the slightly smaller code with the branches is able to beat the branchless code. When looking into what they are doing, I think that I can guess why. Branches aren't a huge problem if they are predictable, and in our case, the whole point of all of this is that the offload process where we need to go to the entry to get the value is meant to be a rare event. In branchless code, on the other hand, you have to do something several times to avoid a branch (like shifting the value from the buffer up and maybe shifting it down, etc).

You can’t really argue with a difference like that. We also tried an AVX version, to see if this would have better performance. It turns out that there is really no way for us to beat the version with the single branch. Everything else was at least twice as slow.

At this point, I believe that we have a winner.
