Reducing parsing costs in RavenDB

Note: this is something we are considering for 4.0.

RavenDB uses JSON natively to handle pretty much everything, which makes sense for a JSON document database. JSON is easy to parse, human readable, and for the most part has no real complexity associated with it.

However, while JSON is easy to parse, there is still real computational overhead in parsing it. In fact, in our perf testing, we spend quite a lot of time just serializing and deserializing JSON. It is one of the major costs we have to deal with, mostly because it is something that happens all the time. We have previously used BSON to store documents internally, but profiling actually showed that it is cheaper to hold the data as JSON text and parse it, so that is what we are currently doing.

However, even faster parsing is still parsing, and that is something we would like to avoid. There is also the issue of how the parsed document is actually represented in memory. Let us consider the following JSON:

{"FirstName":"John","LastName":"Smith"}

In memory, this is represented as (highly simplified):

  • Dictionary<string,object> instance
    • string[] for the keys
      • “FirstName” string
      • “LastName” string
    • object[] for the values
      • “John” string
      • “Smith” string

In other words, especially for large documents, a lot of small objects are created. Those allocations are not the main cost during parsing itself, but every one of those objects needs to be garbage collected after the fact, and that is something we would like to avoid.
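
To make the allocation pattern concrete, here is what a generic JSON DOM looks like as a minimal C++ sketch (the types here are illustrative, not our actual representation):

#include <map>
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct JsonValue;
using JsonObject = std::map<std::string, std::unique_ptr<JsonValue>>;
using JsonArray  = std::vector<std::unique_ptr<JsonValue>>;

struct JsonValue
{
    // Every node in the tree is a separate heap allocation.
    std::variant<std::nullptr_t, bool, double, std::string,
                 JsonObject, JsonArray> data;
};

// Parsing {"FirstName":"John","LastName":"Smith"} allocates the object node,
// two value nodes, and the key and value strings: a handful of heap
// allocations for a two-field document, every one of which must eventually
// be collected.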

We are currently considering using FlatBuffers for internal document storage. The good thing about FlatBuffers is that there is no intermediate parsing step; you get a block of memory that you can access immediately. That has two major advantages. Loading a document into memory means just reading a buffer from disk, with no extra cost. And releasing a document becomes much cheaper as well, since we only need to collect the buffer, not potentially tens of thousands of small objects.
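
Loading a document then becomes little more than handing the raw bytes to FlatBuffers. A minimal C++ sketch, assuming the draft schema below has been compiled with flatc (so Document and its accessors are hypothetical generated names):

#include <cstdint>
#include <vector>
#include "flatbuffers/flatbuffers.h"
#include "document_generated.h" // hypothetical flatc output for the schema below

const Document* LoadDocument(const std::vector<uint8_t>& buffer)
{
    // No parse step: the bytes read from disk are the document.
    // GetRoot just computes an offset into the existing memory.
    return flatbuffers::GetRoot<Document>(buffer.data());
}

// Releasing the document frees a single buffer, instead of
// potentially tens of thousands of small objects.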

Another advantage is that we usually need to load documents in order to index them, and indexing typically requires only a few fields from each document. By avoiding the parsing cost entirely and paying only for the fields we actually touch, we are in a much better position to reduce indexing costs.
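
For indexing, that means we can pull a single field out of the buffer and leave everything else untouched. A sketch, again against the hypothetical generated code (the union accessor name Value_as_Value is an assumption about what flatc would generate for the schema below):

#include <cstring>

const flatbuffers::String* GetFieldForIndexing(const Document* doc,
                                               const char* name)
{
    for (const FieldValue* field : *doc->Values())
    {
        // Only the field names we compare against are touched; the values
        // of all other fields remain unread bytes in the buffer.
        if (std::strcmp(field->Name()->c_str(), name) == 0)
            return field->Value()->Value_as_Value()->String();
    }
    return nullptr; // field not present in this document
}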

A rough draft schema using FlatBuffers would be:

union AnyValueUnion { Document, Array, Value }

table AnyValue
{
    Value: AnyValueUnion;
}

table Value
{
    Type: byte;
    String: string;
    Bytes: [byte];
}

table Array
{
    Values: [AnyValue];
}

table FieldValue 
{
    Name: string;
    Value: AnyValue;
}

table Document
{
    Values: [FieldValue];    
}

The Values inside a document are sorted by field name, so we can find a particular field using binary search.
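
Instead of the linear scan in the earlier sketch, the lookup becomes a binary search over the sorted field names:

const FieldValue* FindField(const Document* doc, const char* name)
{
    const auto* values = doc->Values();
    int lo = 0, hi = static_cast<int>(values->size()) - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        const FieldValue* field = values->Get(mid);
        int cmp = std::strcmp(field->Name()->c_str(), name);
        if (cmp == 0)
            return field;  // exact match
        if (cmp < 0)
            lo = mid + 1;  // entry sorts before the target, search the upper half
        else
            hi = mid - 1;  // entry sorts after the target, search the lower half
    }
    return nullptr;        // field not present
}

As an aside, FlatBuffers has built-in support for this pattern: marking a field with the key attribute keeps the vector sorted at build time and generates a LookupByKey helper on the vector.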

Nitpicker corner: Yes, we would probably want to use a hash here, but this is a rough draft to test things out; that will come later.

We need to optimize the schema, and using it will not be a lot of fun, but the speed and memory improvements should be significant.

Another alternative would be a schema such as this:

table FieldValue 
{
    Name: int;
    Value: AnyValue;
}

table Document
{
    Values: [FieldValue];    
}

table Root
{
    Fields: [string];
    Document: Document;
}

Here we store all the field names once, then refer to each field name by its index in the root object. This should have the advantage of eliminating repeated field name strings.
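
A lookup against this layout resolves the field name to an index first. A sketch, continuing the earlier ones, with the same caveat that the generated accessor names are assumptions:

const FieldValue* FindFieldByName(const Root* root, const char* name)
{
    // Resolve the field name to its index in the shared name table.
    const auto* fields = root->Fields();
    int index = -1;
    for (flatbuffers::uoffset_t i = 0; i < fields->size(); i++)
    {
        if (std::strcmp(fields->Get(i)->c_str(), name) == 0)
        {
            index = static_cast<int>(i);
            break;
        }
    }
    if (index == -1)
        return nullptr; // the field name does not appear in this document

    // Field comparisons are now integer checks instead of string compares.
    for (const FieldValue* field : *root->Document()->Values())
    {
        if (field->Name() == index)
            return field;
    }
    return nullptr;
}

The trade-off is an extra indirection when resolving the name, but integer comparisons are cheaper than string comparisons, and each field name is stored exactly once per document.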