The importance of a data format, Part I – Current state problems
JSON is a really simple format, which makes it very easy to work with, interchange, read, etc. Here is the full JSON format definition:
- object = {} | { members }
- members = pair | pair , members
- pair = string : value
- array = [] | [ elements ]
- elements = value | value , elements
- value = string | number | object | array | true | false | null
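To make the grammar concrete, here is a quick sketch, assuming the Newtonsoft.Json library (the class name is mine, not from the post): a single value can exercise every production above, and reading it is a one-line parse call.

using System;
using Newtonsoft.Json.Linq;

class GrammarExample
{
    static void Main()
    {
        // An object with pairs, a nested object, an array with elements,
        // and every primitive value kind: string, number, true/false, null.
        var value = JToken.Parse(
            @"{ ""text"": ""abc"", ""number"": 42,
                ""object"": { ""nested"": true },
                ""array"": [1, ""two"", false, null] }");

        Console.WriteLine(value["array"][1]); // prints: two
    }
}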
So far, so good. But JSON also has a few major issues. In particular, JSON requires that you read and parse the entire document (at least until the part you actually care about) before you can do anything with it. Reading JSON documents into memory and actually working with them means loading and parsing the whole thing, and typically requires the use of dictionaries to get fast access to the data. Let us look at this typical document:
{ "firstName": "John", "lastName": "Smith", "address": { "state": "NY", "postalCode": "10021-3100" }, "children": [{"firstName": "Alice"}] }
How would this look in memory after parsing?
- Dictionary (root)
  - firstName –> John
  - lastName –> Smith
  - address –> Dictionary
    - state –> NY
    - postalCode –> 10021-3100
  - children –> array
    - [0] –> Dictionary
      - firstName –> Alice
So that is three dictionaries and an array (even assuming we ignore all the strings). Using Newtonsoft.Json, the above document takes 3,840 bytes in managed memory (measured using objsize in WinDbg). As text, the document is only 126 bytes. The reason for the difference is the dictionaries. Here is a single allocation that costs 320 bytes:
new Dictionary<string, object> { { "test", "tube" } };
And as you can see, this adds up fast. For a database that mostly deals with JSON data, this is a pretty important factor. Controlling memory is a very important aspect of the work of a database, and JSON is really inefficient in this regard. For example, imagine that we want to index documents by the names of the children. That is going to force us to parse the entire document, incurring a high penalty in both CPU and memory. We need a better internal format for the data.
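As a hedged illustration of that indexing scenario, here is a minimal sketch using Newtonsoft.Json (JObject.Parse and SelectTokens; the class and variable names are mine, not RavenDB code). Even though the index only needs the children's names, the whole document tree has to be parsed and allocated first:

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class IndexByChildName
{
    static void Main()
    {
        var json = @"{ ""firstName"": ""John"", ""lastName"": ""Smith"",
                       ""address"": { ""state"": ""NY"", ""postalCode"": ""10021-3100"" },
                       ""children"": [{ ""firstName"": ""Alice"" }] }";

        // JObject.Parse reads and allocates the entire document tree...
        JObject doc = JObject.Parse(json);

        // ...just to extract the one field the index actually needs.
        var childNames = doc.SelectTokens("$.children[*].firstName")
                            .Select(t => (string)t)
                            .ToList();

        Console.WriteLine(string.Join(", ", childNames)); // Alice
    }
}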
In my next post, I’ll go into details on this format and what constraints we are working under.
More posts in "The importance of a data format" series:
- (25 Jan 2016) Part VII – Final benchmarks
- (15 Jan 2016) Part VI – When two orders of magnitude aren't enough
- (13 Jan 2016) Part V – The end result
- (12 Jan 2016) Part IV – Benchmarking the solution
- (11 Jan 2016) Part III – The solution
- (08 Jan 2016) Part II – The environment matters
- (07 Jan 2016) Part I – Current state problems
Comments
I wouldn't say that this is an issue of JSON. JSON is a data format; the way you go about parsing it is entirely up to you. More precisely, the issue only affects parsers that are unaware of the data structure.
If you have a POCO that represents the JSON structure, the memory footprint is (or can be) as low as it can get. The part about having to "parse the entire thing" also only applies if you are dealing with unknown JSON data. It is easy to write a parser that skips objects and values and only reads the ones you care about; of course, such a parser must be implemented manually.
I understand your situation and why JSON is not an ideal choice, but this is not a shortcoming of JSON itself, rather a problem of your specific scenario. I will go as far as to say that most generic data exchange formats have this "shortcoming", e.g. XML or YAML. Google's Protocol Buffers format is very efficient in terms of space, but requires you to know its structure in order to be able to parse it. This, I think, further demonstrates how this is not a problem of the data format, but rather a reason why a certain format is inadequate in your specific scenario.
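For what it's worth, here is a minimal sketch of the kind of hand-written skipping parser described above, assuming Newtonsoft.Json's streaming JsonTextReader (the class and method names are illustrative, not RavenDB code). It only materializes the children's names, though, as the reply below notes, it still reads through the entire stream:

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

static class ChildNamesReader
{
    // Pulls only children[*].firstName out of a document shaped like the example in the post.
    public static List<string> Read(TextReader input)
    {
        var names = new List<string>();
        using (var reader = new JsonTextReader(input))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "children")
                {
                    reader.Read(); // StartArray
                    while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    {
                        // each element is an object such as { "firstName": "Alice" }
                        while (reader.Read() && reader.TokenType != JsonToken.EndObject)
                        {
                            if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "firstName")
                                names.Add(reader.ReadAsString());
                            else
                                reader.Skip(); // ignore properties we don't care about
                        }
                    }
                }
                else if (reader.TokenType == JsonToken.PropertyName)
                {
                    reader.Skip(); // skip the value of any other property without building objects
                }
            }
        }
        return names;
    }
}

Calling ChildNamesReader.Read(new StringReader(json)) with the example document from the post returns just "Alice", without allocating dictionaries for the rest of the document.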
EnCey, you have to read the entire thing. Consider the document in this post: even if you write a dedicated JSON parser for this document type, you have to go through the whole document (read it from disk, parse it, etc.) to get to the names of the children. You don't have to store the information, for sure, but you do have to read through the data, which can be expensive at scale.
Yes, I should have been more specific there. My point was that you don't have to map all the information to e.g. Dictionaries and can simply ignore it if you don't need it. I got hooked by your statement
I'm curious as to whether or not you've investigated Netflix's Falcor open source project? It seems to support the scenario where you have a large JSON document but only want/need to retrieve pieces of it possibly without loading the whole document into memory.
Binary JSON (BSON) was designed for this purpose. It is both efficient to store and fast to scan. Curious what needs RavenDB has that it wouldn't meet?
I believe we tried BSON and ReadByte was very expensive. Also, BSON still requires loading it from disk and parsing the whole object.
My first idea was also BSON...
When parsing is the problem, a custom JSON parser (like EnCey already suggested) which streams the document could help, the way you parse large XML files with a SAX parser instead of a DOM parser. This would reduce the memory load.
Another approach could perhaps be denormalized data storage. Storage size increases because of redundant information, but parsing/accessing is fast and doesn't require loading the whole document. e.g.
PATH : VALUE
firstName : John
lastName : Smith
address/state : NY
address/postalCode : 10021-3100
children[0]/firstname : Alice
If you need the field names/nodes, they could be put in an extra field. e.g.
PATH : FIELD : VALUE
(root) : firstName : John
(root) : lastName : Smith
address : state : NY
address : postalCode : 10021-3100
children[0] : firstname : Alice
Denormalizing and re-normalizing the JSON will also be expensive, but it depends on how often that is needed.
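A rough sketch of that flattening idea, assuming Newtonsoft.Json's JToken.Path (which uses dots and brackets, e.g. address.state and children[0].firstName, rather than slashes; the class name is mine):

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class FlattenToPathValue
{
    static void Main()
    {
        var doc = JObject.Parse(
            @"{ ""firstName"": ""John"", ""lastName"": ""Smith"",
                ""address"": { ""state"": ""NY"", ""postalCode"": ""10021-3100"" },
                ""children"": [{ ""firstName"": ""Alice"" }] }");

        // Emit one PATH : VALUE row per leaf value, e.g. "address.state : NY"
        foreach (var leaf in doc.Descendants().OfType<JValue>())
            Console.WriteLine($"{leaf.Path} : {leaf}");
    }
}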
EnCey, RavenDB is a perf-critical application. Every CPU cycle or memory byte that we can save means that we can handle more requests and respond to them much faster.
Tom, Falcor is an interchange / API format. We are talking about the internal storage format inside RavenDB. Completely different scenarios.
Matt, BSON is nice, but it is actually more costly to parse than JSON in many cases. It is also a streaming format, which doesn't allow you to just get the relevant value without going through all the previous data.
Thomas, Deconstructing the document would mean a very high cost of reconstructing it when the user needs that document back.
One of the recurring problems in computing. ASN.1 solved a lot of this already, but it also has some of the same weaknesses as e.g. BSON - you have to start reading from the beginning to find a particular element. Still, ASN.1 is superior to JSON and XML in that it is completely self-describing - every element is a TLV (type-length-value) and you can quickly skip past the elements you don't recognize or care about using the length of each field, instead of needing to parse them first to find out how big they are.
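To illustrate that skip-by-length property, here is a hedged sketch with a made-up fixed-size header (1-byte tag, 4-byte little-endian length), not real ASN.1/BER encoding, where tags and lengths are themselves variable-length; error handling and short reads are omitted:

using System;
using System.IO;

static class TlvScanner
{
    // Assumed toy layout: 1-byte tag, 4-byte little-endian length, then the payload.
    // Requires a seekable stream.
    public static void ScanForTag(Stream input, byte wantedTag, Action<byte[]> onFound)
    {
        var header = new byte[5];
        while (input.Read(header, 0, header.Length) == header.Length)
        {
            byte tag = header[0];
            int length = BitConverter.ToInt32(header, 1);

            if (tag == wantedTag)
            {
                var payload = new byte[length];
                input.Read(payload, 0, length);
                onFound(payload);
            }
            else
            {
                // The key property: skip an element we don't care about
                // by seeking past its declared length, without parsing its contents.
                input.Seek(length, SeekOrigin.Current);
            }
        }
    }
}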