The importance of a data format, Part I – Current state problems
JSON is a really simple format, which makes it very easy to work with, interchange, read, etc. Here is the full JSON format definition:
- object = {} | { members }
- members = pair | pair , members
- pair = string : value
- array = [] | [ elements ]
- elements = value | value , elements
- value = string | number | object | array | true | false | null
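To make the grammar concrete, here is a quick sketch, assuming the Newtonsoft.Json library (the class name is mine, not from the post): a single value can exercise every production above, and reading it is a one-line parse call.

using System;
using Newtonsoft.Json.Linq;

class GrammarExample
{
    static void Main()
    {
        // An object with pairs, a nested object, an array with elements,
        // and every primitive value kind: string, number, true/false, null.
        var value = JToken.Parse(
            @"{ ""text"": ""abc"", ""number"": 42,
                ""object"": { ""nested"": true },
                ""array"": [1, ""two"", false, null] }");

        Console.WriteLine(value["array"][1]); // prints: two
    }
}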
So far, so good. But JSON also has a few major issues. In particular, JSON requires that you read and parse the entire document (at least until the part you actually care about) before you can do anything with it. Reading JSON documents into memory and actually working with them means loading and parsing the whole thing, and typically requires the use of dictionaries to get fast access to the data. Let us look at this typical document:
{ "firstName": "John", "lastName": "Smith", "address": { "state": "NY", "postalCode": "10021-3100" }, "children": [{"firstName": "Alice"}] }
How would this look in memory after parsing?
- Dictionary (root)
  - firstName –> John
  - lastName –> Smith
  - address –> Dictionary
    - state –> NY
    - postalCode –> 10021-3100
  - children –> array
    - [0] –> Dictionary
      - firstName –> Alice
So that is three dictionaries and an array (even assuming we ignore all the strings). Using Newtonsoft.Json, the above document takes 3,840 bytes in managed memory (measured using objsize in WinDbg). As text, the document is only 126 bytes. The reason for the difference is the dictionaries. Here is a single allocation that costs 320 bytes:
new Dictionary<string, object> { { "test", "tube" } };
And as you can see, this adds up fast. For a database that mostly deals with JSON data, this is a pretty important factor. Controlling memory is a very important aspect of the work of a database, and JSON is really inefficient in this regard. For example, imagine that we want to index documents by the names of the children. That is going to force us to parse the entire document, incurring a high penalty in both CPU and memory. We need a better internal format for the data.
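As a hedged illustration of that indexing scenario, here is a minimal sketch using Newtonsoft.Json (JObject.Parse and SelectTokens; the class and variable names are mine, not RavenDB code). Even though the index only needs the children's names, the whole document tree has to be parsed and allocated first:

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class IndexByChildName
{
    static void Main()
    {
        var json = @"{ ""firstName"": ""John"", ""lastName"": ""Smith"",
                       ""address"": { ""state"": ""NY"", ""postalCode"": ""10021-3100"" },
                       ""children"": [{ ""firstName"": ""Alice"" }] }";

        // JObject.Parse reads and allocates the entire document tree...
        JObject doc = JObject.Parse(json);

        // ...just to extract the one field the index actually needs.
        var childNames = doc.SelectTokens("$.children[*].firstName")
                            .Select(t => (string)t)
                            .ToList();

        Console.WriteLine(string.Join(", ", childNames)); // Alice
    }
}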
In my next post, I’ll go into details on this format and what constraints we are working under.
More posts in "The importance of a data format" series:
- (25 Jan 2016) Part VII – Final benchmarks
- (15 Jan 2016) Part VI – When two orders of magnitude aren't enough
- (13 Jan 2016) Part V – The end result
- (12 Jan 2016) Part IV – Benchmarking the solution
- (11 Jan 2016) Part III – The solution
- (08 Jan 2016) Part II – The environment matters
- (07 Jan 2016) Part I – Current state problems
Comments
I wouldn't say that this is an issue of JSON. JSON is a data format; the way you go about parsing it is entirely up to you. More precisely, the issue only affects parsers that are unaware of the data structure.
If you have a POCO that represents the JSON structure, the memory footprint is (or can be) as low as it can get. The part about having to "parse the entire thing" also only applies if you are dealing with unknown JSON data. It is easy to write a parser that skips objects and values and only reads the ones you care about; of course, such a parser must be implemented manually.
I understand your situation and why JSON is not an ideal choice, but this is not a shortcoming of JSON itself, rather a problem of your specific scenario. I will go as far as to say that most generic data exchange formats have this "shortcoming", e.g. XML or YAML. Google's Protocol Buffers format is very efficient in terms of space, but requires you to know its structure in order to be able to parse it. This, I think, further demonstrates how this is not a problem of the data format, but rather a reason why a certain format is inadequate in your specific scenario.
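For what it's worth, here is a minimal sketch of the kind of hand-written skipping parser described above, assuming Newtonsoft.Json's streaming JsonTextReader (the class and method names are illustrative, not RavenDB code). It only materializes the children's names, though, as the reply below notes, it still reads through the entire stream:

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

static class ChildNamesReader
{
    // Pulls only children[*].firstName out of a document shaped like the example in the post.
    public static List<string> Read(TextReader input)
    {
        var names = new List<string>();
        using (var reader = new JsonTextReader(input))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "children")
                {
                    reader.Read(); // StartArray
                    while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    {
                        // each element is an object such as { "firstName": "Alice" }
                        while (reader.Read() && reader.TokenType != JsonToken.EndObject)
                        {
                            if (reader.TokenType == JsonToken.PropertyName && (string)reader.Value == "firstName")
                                names.Add(reader.ReadAsString());
                            else
                                reader.Skip(); // ignore properties we don't care about
                        }
                    }
                }
                else if (reader.TokenType == JsonToken.PropertyName)
                {
                    reader.Skip(); // skip the value of any other property without building objects
                }
            }
        }
        return names;
    }
}

Calling ChildNamesReader.Read(new StringReader(json)) with the example document from the post returns just "Alice", without allocating dictionaries for the rest of the document.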
EnCey, you have to read the entire thing. Consider the document in this post: even if you write a dedicated JSON parser for this document type, you have to go through the whole document (read it from disk, parse it, etc.) to get to the names of the children. You don't have to store the information, for sure, but you do have to read through the data, which can be expensive at scale.
Yes, I should have been more specific there. My point was that you don't have to map all the information to e.g. Dictionaries and can simply ignore it if you don't need it. I got hooked by your statement
I'm curious as to whether or not you've investigated Netflix's Falcor open source project? It seems to support the scenario where you have a large JSON document but only want/need to retrieve pieces of it possibly without loading the whole document into memory.
Binary JSON (BSON) was designed for this purpose. It is both efficient to store and fast to scan. Curious what needs RavenDB has that it wouldn't meet?
I believe we tried BSON and ReadByte was very expensive. Also, BSON still requires loading it from disk and parsing the whole object.
My first idea was also BSON...
When parsing is the problem, a custom JSON parser (like EnCey already suggested) which streams the document could help, the way you parse large XML files with a SAX parser instead of a DOM parser. This would reduce the memory load.
Another approach could perhaps be denormalized data storage. Storage size increases because of redundant information, but parsing/accessing is fast and doesn't require loading the whole document. e.g.
PATH : VALUE
firstName : John
lastName : Smith
address/state : NY
address/postalCode : 10021-3100
children[0]/firstname : Alice
If you need the field names/nodes, they could be put in an extra field. e.g.
PATH : FIELD : VALUE
(root) : firstName : John
(root) : lastName : Smith
address : state : NY
address : postalCode : 10021-3100
children[0] : firstname : Alice
Denormalizing and re-normalizing the JSON will also be expensive, but it depends on how often that is needed.
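A rough sketch of that flattening idea, assuming Newtonsoft.Json's JToken.Path (which uses dots and brackets, e.g. address.state and children[0].firstName, rather than slashes; the class name is mine):

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class FlattenToPathValue
{
    static void Main()
    {
        var doc = JObject.Parse(
            @"{ ""firstName"": ""John"", ""lastName"": ""Smith"",
                ""address"": { ""state"": ""NY"", ""postalCode"": ""10021-3100"" },
                ""children"": [{ ""firstName"": ""Alice"" }] }");

        // Emit one PATH : VALUE row per leaf value, e.g. "address.state : NY"
        foreach (var leaf in doc.Descendants().OfType<JValue>())
            Console.WriteLine($"{leaf.Path} : {leaf}");
    }
}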
EnCey, RavenDB is a perf-critical application. Every CPU cycle or memory byte that we can save means that we can handle more requests and respond to them much faster.
Tom, Falcor is an interchange / API format. We are talking about the internal storage format inside RavenDB. Completely different scenarios.
Matt, BSON is nice, but it is actually more costly to parse than JSON in many cases. It is also a streaming format, which doesn't allow you to just get the relevant value without going through all the previous data.
Thomas, Deconstructing the document would mean a very high cost of reconstructing it when the user needs that document back.
One of the recurring problems in computing. ASN.1 solved a lot of this already, but it also has some of the same weaknesses as e.g. BSON - you have to start reading from the beginning to find a particular element. Still, ASN.1 is superior to JSON and XML in that it is completely self-describing - every element is a TLV (type-length-value) and you can quickly skip past the elements you don't recognize or care about using the length of each field, instead of needing to parse them first to find out how big they are.
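To illustrate that skip-by-length property, here is a hedged sketch with a made-up fixed-size header (1-byte tag, 4-byte little-endian length), not real ASN.1/BER encoding, where tags and lengths are themselves variable-length; error handling and short reads are omitted:

using System;
using System.IO;

static class TlvScanner
{
    // Assumed toy layout: 1-byte tag, 4-byte little-endian length, then the payload.
    // Requires a seekable stream.
    public static void ScanForTag(Stream input, byte wantedTag, Action<byte[]> onFound)
    {
        var header = new byte[5];
        while (input.Read(header, 0, header.Length) == header.Length)
        {
            byte tag = header[0];
            int length = BitConverter.ToInt32(header, 1);

            if (tag == wantedTag)
            {
                var payload = new byte[length];
                input.Read(payload, 0, length);
                onFound(payload);
            }
            else
            {
                // The key property: skip an element we don't care about
                // by seeking past its declared length, without parsing its contents.
                input.Seek(length, SeekOrigin.Current);
            }
        }
    }
}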