Reducing parsing costs in RavenDB
Note, this is something that we are considering for 4.0.
RavenDB uses JSON natively to handle pretty much everything, which makes sense for a JSON Document Database. JSON is easy to parse, human readable, and for the most part has no real complexity associated with it.
However, while JSON is easy to parse, there is still some computational overhead around parsing it. In fact, in our perf testing, we spend quite a lot of time just serializing and deserializing JSON. It is one of the major costs we have to deal with, mostly because it is something that happens all the time. We previously used BSON to store documents internally, but profiling actually showed that it is cheaper to hold the data as JSON text and parse it, so that is what we are currently doing.
However, even faster parsing is still parsing, and that is something we would like to avoid entirely. There is also the issue of how the parsed document is actually represented in memory. Let us consider the following JSON:
{"FirstName":"John","LastName":"Smith"}
In memory, this is represented as (highly simplified):
- Dictionary<string,object> instance
  - string[] for the keys
    - "FirstName" string
    - "LastName" string
  - object[] for the values
    - "John" string
    - "Smith" string
In other words, especially for large documents, a lot of small objects are created. That doesn't add much to the immediate parsing cost, but all of those objects need to be garbage collected after the fact, and that is something we would like to avoid.
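To make the allocation cost concrete, here is a minimal sketch in C# using Json.NET. The document shape is just the example above; the exact types we allocate internally differ, so treat this as an illustration of the pattern, not our actual code:

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

class AllocationDemo
{
    static void Main()
    {
        var json = "{\"FirstName\":\"John\",\"LastName\":\"Smith\"}";

        // This single call allocates the dictionary, its internal
        // buckets and entries, one string per key and one object
        // (here, a string) per value.
        var doc = JsonConvert.DeserializeObject<Dictionary<string, object>>(json);

        // Multiply that by thousands of fields in a large document,
        // and the GC ends up with a lot of short-lived garbage.
        Console.WriteLine(doc["FirstName"]); // John
    }
}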
We are currently considering using Flat Buffers for internal document storage. The good thing about Flat Buffers is that there is no intermediate parsing step. You get a block of memory that you can access immediately. That has two major advantages. Loading a document into memory would mean just reading a buffer from disk, with no extra cost. And releasing a document would be much cheaper as well: we would only need to collect the buffer itself, not potentially tens of thousands of small objects.
Another advantage is that we usually load documents in order to index them, and indexing usually requires only a few fields from each document. By avoiding the parsing cost entirely and paying only for the objects we actually touch, we are in a much better position to reduce indexing costs.
A rough scratch schema using Flat Buffers would be:
union AnyValueUnion { Document, Array, Value }

table AnyValue {
    Value: AnyValueUnion;
}

table Value {
    Type: byte;
    String: string;
    Bytes: [byte];
}

table Array {
    Values: [AnyValue];
}

table FieldValue {
    Name: string;
    Value: AnyValue;
}

table Document {
    Values: [FieldValue];
}
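With such a schema, loading a stored document becomes a matter of wrapping the raw bytes. A minimal sketch, assuming the flatc-generated C# accessors for the schema above and the FlatBuffers .NET runtime; LoadDocumentBytes is a hypothetical helper for reading the stored buffer:

using FlatBuffers; // the ByteBuffer type from the FlatBuffers runtime

byte[] bytes = LoadDocumentBytes("users/1"); // hypothetical disk read

// No intermediate parse: the buffer is the document. GetRootAsDocument
// only wraps the bytes; fields are decoded lazily, on access.
var doc = Document.GetRootAsDocument(new ByteBuffer(bytes));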
The Values inside the documents are sorted by field name, so we can look up a field using binary search.
Nitpicker corner: Yes, we would probably want to use a hash here, but this is a rough draft to test things out; that will come later.
We need to optimize the schema, and using it will not be a lot of fun, but the speed and memory improvements should be significant.
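To illustrate the lookup, here is a minimal sketch of the binary search over the sorted Values vector. The accessor names (ValuesLength, Values(i), Name) follow the pattern the FlatBuffers compiler generates for C#, but since the schema itself is a draft, treat them as assumptions:

// Find a field by name in a Document whose Values vector is sorted
// by field name. Only the entries the search touches are decoded;
// the rest of the buffer is never read.
static FieldValue? FindField(Document doc, string name)
{
    int lo = 0, hi = doc.ValuesLength - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        var field = doc.Values(mid).Value;
        int cmp = string.CompareOrdinal(field.Name, name);
        if (cmp == 0)
            return field;     // found it, no full document parse needed
        if (cmp < 0)
            lo = mid + 1;     // target sorts after mid, search upper half
        else
            hi = mid - 1;     // target sorts before mid, search lower half
    }
    return null;              // the document has no such field
}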
Another alternative would be a schema such as this:
table FieldValue {
    Name: int;
    Value: AnyValue;
}

table Document {
    Values: [FieldValue];
}

table Root {
    Fields: [string];
    Document: Document;
}
Here we store all the field names once, in the root object, and then refer to a field name by its index into that list. This has the advantage of not repeating the field name strings throughout the document.
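Field lookup would then go through the name table first. Again a sketch with assumed generated accessors (FieldsLength, Fields(i)); null handling of the generated nullable accessors is elided for brevity:

static FieldValue? FindField(Root root, string name)
{
    // Resolve the name to its index in the shared name table.
    // If Fields were kept sorted, this linear scan could also be
    // a binary search.
    int nameIndex = -1;
    for (int i = 0; i < root.FieldsLength; i++)
    {
        if (root.Fields(i) == name)
        {
            nameIndex = i;
            break;
        }
    }
    if (nameIndex == -1)
        return null; // the name is not used anywhere in this document

    // Search the document's values for the matching name index.
    var doc = root.Document.Value;
    for (int i = 0; i < doc.ValuesLength; i++)
    {
        var field = doc.Values(i).Value;
        if (field.Name == nameIndex)
            return field;
    }
    return null;
}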
Comments
Hi Oren,
Sounds good for times when RavenDB is opening a document on the server side. How would this impact retrieval, update and storage when the action is happening against the RavenDB client on a different server? I guess (given it's across the wire) the parsing cost would still be there for that type of document access?
So the main benefit would be, as you say, things like indexing performance. Or when using ResultsTransformers, where the document itself is not sent to the client?
Cheers,
Ian
Ian, Over the wire, there should be no change. This is an internal change on the server side. Indexing, loading documents from disk, transformers, etc. would be impacted.
That's interesting. From the very beginning of RavenDB, that was the question in my head: why would one store documents in such an inefficient format as JSON? It impacts the size of the stored blobs, has an impact on indexing, etc. On the other hand, storing docs in another format makes you pay the transformation tax over and over (more of my thoughts about this can be found here: http://blog.scooletz.com/2015/01/14/do-we-really-need-all-these-data-transformations/). It can be easily addressed by a proxy just before the server endpoint, but it still should be tested. Then there are concerns when generating the structures; for example, one can think about merging them when they are similar enough, which depends on the serializer as well. I'm looking forward to reading about this attempt.
Hi Oren. I recently read about capnproto-net on Marc Gravell's blog. Maybe you would be interested, in case you didn't know about it.
https://github.com/mgravell/capnproto-net
@Crowley, cap'n proto and Flat Buffers share one major feature: they allow zero-copy operations and accessing a property without walking over the whole document/message. One may dive into a more subtle comparison, but after all, zero-copy is the feature you want in order to lower the allocation/decoding pressure.
JSON deserializers are getting better at the job:
http://theburningmonk.com/2015/06/binary-and-json-benchmarks-updated-3
I also gathered my own performance figures on non-trivial JSON data shapes, FWIW:
https://github.com/ysharplanguage/FastJsonParser#Performance
HTH,
Y Sharp, Please note that while this is interesting, there is a major issue at hand. It still requires doing the full serialization. Compare that to not having to do anything at all.
Nice read, Flat Buffers look useful. Thanks for sharing, I'd not heard of them before.
You might also want to have a read of the Dremel paper (from Google); it covers a similar area: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
In particular, see the parts about how it efficiently creates a "nested column storage" (pg 3) and then iterates over it for aggregation (using state machines). You might find some tricks in there that you can use for RavenDB.