JSON Packing, Text-Based Formats and other stuff that comes to mind at 5 AM

This post was written at 5:30 AM. I ran into this while doing research for another post, and I couldn’t really let it go.

XML as a text-based format is really wasteful in space. But that wasn’t what really made it lose its shine. That happened when it became so complex that it stopped being human readable. For example, I give you:

<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope
  xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/1999/XMLSchema"
  xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   <SOAP-ENV:Body>
       <ns1:getEmployeeDetailsResponse
        xmlns:ns1="urn:MySoapServices"
        SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
           <return xsi:type="ns1:EmployeeContactDetail">
               <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
               <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
               <tempPhoneNumber
                xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
                xsi:type="ns2:Array"
                ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37060</startDate>
                       <endDate xsi:type="xsd:int">37064</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
                   </item>
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37074</startDate>
                       <endDate xsi:type="xsd:int">37078</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
                   </item>
                   <item xsi:type="ns1:TemporaryPhoneNumber">
                       <startDate xsi:type="xsd:int">37088</startDate>
                       <endDate xsi:type="xsd:int">37092</endDate>
                       <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
                   </item>
               </tempPhoneNumber>
           </return>
       </ns1:getEmployeeDetailsResponse>
   </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

After XML was thrown out of the company of respectable folks, we had JSON show up and entertain us. It is smaller and more concise than XML, and so far it has resisted the efforts to turn it into some sort of uber-complex, enterprisey tool.

But today I ran into quite a few efforts to do strange things to JSON. I am talking about things like JSON DB (a compressed JSON format, not an actual JSON database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.

Let us take an example. The following is a JSON document representing one of the RavenDB builds:

{
  "BuildName": "RavenDB Unstable v2.5",
  "IsUnstable": true,
  "Version": "2509-Unstable",
  "PublishedAt": "2013-02-26T12:06:12.0000000",
  "DownloadsIds": [],
  "Changes": [
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
      "Href": "/app/rest/changes/id:21039",
      "TeamCityId": 21039,
      "Username": "david walker",
      "Comment": "Do not save Has-Api-Key header to metadata\n",
      "Date": "2013-02-20T23:22:43.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
      "Href": "/app/rest/changes/id:21040",
      "TeamCityId": 21040,
      "Username": "david walker",
      "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
      "Date": "2013-02-20T23:23:59.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "5919521286735f50f963824a12bf121cd1df4367",
      "Href": "/app/rest/changes/id:21035",
      "TeamCityId": 21035,
      "Username": "ayende rahien",
      "Comment": "Better disposal\n",
      "Date": "2013-02-26T10:16:45.0000000",
      "Files": [
        "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
      "Href": "/app/rest/changes/id:21036",
      "TeamCityId": 21036,
      "Username": "ayende rahien",
      "Comment": "Will ensure that the value is always positive or zero (never negative).\nWhen using numeric calc, will div by 1,024 to get more concentration into buckets.\n",
      "Date": "2013-02-26T10:17:23.0000000",
      "Files": [
        "Raven.Database/Indexing/IndexingUtil.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "7bf51345d39c3993fed5a82eacad6e74b9201601",
      "Href": "/app/rest/changes/id:21037",
      "TeamCityId": 21037,
      "Username": "ayende rahien",
      "Comment": "Fixing a bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
      "Date": "2013-02-26T10:53:01.0000000",
      "Files": [
        "Raven.Database/Indexing/MapReduceIndex.cs",
        "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
        "Raven.Database/Storage/IMappedResultsStorageAction.cs",
        "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
        "Raven.Tests/Issues/RavenDB_784.cs",
        "Raven.Tests/Storage/MappedResults.cs",
        "Raven.Tests/Views/ViewStorage.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
      "Href": "/app/rest/changes/id:21038",
      "TeamCityId": 21038,
      "Username": "ayende rahien",
      "Comment": "No need for so many repeats\n",
      "Date": "2013-02-26T11:27:49.0000000",
      "Files": [
        "Raven.Tests/Bugs/MultiOutputReduce.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
      "Href": "/app/rest/changes/id:21041",
      "TeamCityId": 21041,
      "Username": "ayende rahien",
      "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
      "Date": "2013-02-26T11:41:39.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    }
  ],
  "ResolvedIssues": [],
  "Contributors": [
    {
      "FullName": "Ayende Rahien",
      "Email": "ayende@ayende.com",
      "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
    },
    {
      "FullName": "David Walker",
      "Email": "david@davidwalker.org",
      "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
    }
  ],
  "BuildTypeId": "bt8",
  "Href": "/app/rest/builds/id:588",
  "ProjectName": "RavenDB",
  "TeamCityId": 588,
  "ProjectId": "project3",
  "Number": 2509
}

This document is 4.52KB in size. Running this through JSONH gives us the following:

   1: [
   2:     14,
   3:     "BuildName",
   4:     "IsUnstable",
   5:     "Version",
   6:     "PublishedAt",
   7:     "DownloadsIds",
   8:     "Changes",
   9:     "ResolvedIssues",
  10:     "Contributors",
  11:     "BuildTypeId",
  12:     "Href",
  13:     "ProjectName",
  14:     "TeamCityId",
  15:     "ProjectId",
  16:     "Number",
  17:     "RavenDB Unstable v2.5",
  18:     true,
  19:     "2509-Unstable",
  20:     "2013-02-26T12:06:12.0000000",
  21:     [
  22:     ],
  23:     [
  24:         {
  25:             "Commiter": {
  26:                 "Email": "david@davidwalker.org",
  27:                 "Name": "David Walker"
  28:             },
  29:             "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
  30:             "Href": "/app/rest/changes/id:21039",
  31:             "TeamCityId": 21039,
  32:             "Username": "david walker",
  33:             "Comment": "Do not save Has-Api-Key header to metadata\n",
  34:             "Date": "2013-02-20T23:22:43.0000000",
  35:             "Files": [
  36:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  37:             ]
  38:         },
  39:         {
  40:             "Commiter": {
  41:                 "Email": "david@davidwalker.org",
  42:                 "Name": "David Walker"
  43:             },
  44:             "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
  45:             "Href": "/app/rest/changes/id:21040",
  46:             "TeamCityId": 21040,
  47:             "Username": "david walker",
  48:             "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
  49:             "Date": "2013-02-20T23:23:59.0000000",
  50:             "Files": [
  51:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  52:             ]
  53:         },
  54:         {
  55:             "Commiter": {
  56:                 "Email": "ayende@ayende.com",
  57:                 "Name": "Ayende Rahien"
  58:             },
  59:             "Version": "5919521286735f50f963824a12bf121cd1df4367",
  60:             "Href": "/app/rest/changes/id:21035",
  61:             "TeamCityId": 21035,
  62:             "Username": "ayende rahien",
  63:             "Comment": "Better disposal\n",
  64:             "Date": "2013-02-26T10:16:45.0000000",
  65:             "Files": [
  66:                 "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
  67:             ]
  68:         },
  69:         {
  70:             "Commiter": {
  71:                 "Email": "ayende@ayende.com",
  72:                 "Name": "Ayende Rahien"
  73:             },
  74:             "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
  75:             "Href": "/app/rest/changes/id:21036",
  76:             "TeamCityId": "...bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
  77:             "Date": "2013-02-26T10:53:01.0000000",
  78:             "Files": [
  79:                 "Raven.Database/Indexing/MapReduceIndex.cs",
  80:                 "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
  81:                 "Raven.Database/Storage/IMappedResultsStorageAction.cs",
  82:                 "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
  83:                 "Raven.Tests/Issues/RavenDB_784.cs",
  84:                 "Raven.Tests/Storage/MappedResults.cs",
  85:                 "Raven.Tests/Views/ViewStorage.cs"
  86:             ]
  87:         },
  88:         {
  89:             "Commiter": {
  90:                 "Email": "ayende@ayende.com",
  91:                 "Name": "Ayende Rahien"
  92:             },
  93:             "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
  94:             "Href": "/app/rest/changes/id:21038",
  95:             "TeamCityId": 21038,
  96:             "Username": "ayende rahien",
  97:             "Comment": "No need for so many repeats\n",
  98:             "Date": "2013-02-26T11:27:49.0000000",
  99:             "Files": [
 100:                 "Raven.Tests/Bugs/MultiOutputReduce.cs"
 101:             ]
 102:         },
 103:         {
 104:             "Commiter": {
 105:                 "Email": "ayende@ayende.com",
 106:                 "Name": "Ayende Rahien"
 107:             },
 108:             "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
 109:             "Href": "/app/rest/changes/id:21041",
 110:             "TeamCityId": 21041,
 111:             "Username": "ayende rahien",
 112:             "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
 113:             "Date": "2013-02-26T11:41:39.0000000",
 114:             "Files": [
 115:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
 116:             ]
 117:         }
 118:     ],
 119:     [
 120:     ],
 121:     [
 122:         {
 123:             "FullName": "Ayende Rahien",
 124:             "Email": "ayende@ayende.com",
 125:             "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
 126:         },
 127:         {
 128:             "FullName": "David Walker",
 129:             "Email": "david@davidwalker.org",
 130:             "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
 131:         }
 132:     ],
 133:     "bt8",
 134:     "/app/rest/builds/id:588",
 135:     "RavenDB",
 136:     588,
 137:     "project3",
 138:     2509
 139: ]

It reduced the document size to 2.93KB! Awesome, nearly half of the size was gone. Except: this actually generates an utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on?

I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I threw this through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I have with the JSONH stuff.
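As a rough way to reproduce that kind of comparison, here is a minimal Python sketch. The document in it is a made-up stand-in shaped like the build JSON above (it is not the original file, so the byte counts will differ from the figures quoted here), and `gzip` from the standard library stands in for the zip tool.

```python
import gzip
import json

# Hypothetical stand-in for the build document: a few repetitive change records.
doc = {
    "BuildName": "RavenDB Unstable v2.5",
    "Changes": [
        {
            "Commiter": {"Email": "ayende@ayende.com", "Name": "Ayende Rahien"},
            "TeamCityId": 21035 + i,
            "Username": "ayende rahien",
            "Files": ["Raven.Database/Indexing/IndexingUtil.cs"],
        }
        for i in range(7)
    ],
}

pretty = json.dumps(doc, indent=2).encode("utf-8")                 # human readable
minified = json.dumps(doc, separators=(",", ":")).encode("utf-8")  # whitespace removed
zipped = gzip.compress(pretty)                                     # generic compression

print(len(pretty), len(minified), len(zipped))

# Decompressing gives back the exact readable document.
restored = json.loads(gzip.decompress(zipped))
```

The point of the round trip at the end is that the compressed form costs nothing in readability: any tool that can gunzip gets the original, fully labelled JSON back.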

To be frank, I just don’t get efforts like this. JSON is a text-based, human-readable format. If you lose the human-readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient and you don’t lose anything by it.

And if you want to compress your data, it is probably better to use an actual compression tool. HTTP compression, for example, is practically free, since all servers and clients should be able to consume it by now, and any tool that you use should be able to inspect traffic through it. It is also likely to generate much better results on your JSON documents than a clever format like this.
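For reference, the trick behind packers of this kind can be sketched in a few lines of Python. This is an illustration of the general key-hoisting idea, consistent with the JSONH listing above (a count, then the key names, then the values); it is not the actual JSONH library, and `pack` and `unpack` are hypothetical names.

```python
import json


def pack(rows):
    """Hoist the shared keys of a homogeneous list of dicts into a header."""
    if not rows:
        return [0]
    keys = list(rows[0])
    if not keys or any(list(row) != keys for row in rows):
        raise ValueError("rows must share the same non-empty keys, in the same order")
    flat = [len(keys), *keys]
    for row in rows:
        flat.extend(row[k] for k in keys)
    return flat


def unpack(flat):
    """Rebuild the list of dicts from the count + keys + values layout."""
    n = flat[0]
    if n == 0:
        return []
    keys, values = flat[1:1 + n], flat[1 + n:]
    return [dict(zip(keys, values[i:i + n])) for i in range(0, len(values), n)]


contributors = [
    {"FullName": "Ayende Rahien", "Email": "ayende@ayende.com"},
    {"FullName": "David Walker", "Email": "david@davidwalker.org"},
]
packed = pack(contributors)
print(json.dumps(packed))
# [2, "FullName", "Email", "Ayende Rahien", "ayende@ayende.com", "David Walker", "david@davidwalker.org"]
```

The output is still valid JSON, but the meaning of each value now depends on its position in the array, which is exactly the readability loss I am complaining about.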
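A minimal sketch of what that looks like on the sending side, assuming a hypothetical helper `encode_body` that gzips a JSON payload only when the peer's `Accept-Encoding` header offers gzip. The header parsing is deliberately naive (q-values are ignored), and the payload is made up; real servers and HTTP libraries usually handle this negotiation for you.

```python
import gzip
import json


def encode_body(payload, accept_encoding=""):
    """Serialize payload as JSON, gzipping it only if the peer advertises gzip."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    # Naive Accept-Encoding parsing: split on commas and drop any ";q=..." part.
    offered = {token.split(";")[0].strip().lower() for token in accept_encoding.split(",")}
    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return headers, body


# Hypothetical payload with the kind of repetition typical of JSON arrays.
payload = {
    "Changes": [
        {"Username": "ayende rahien", "TeamCityId": 21035 + i} for i in range(50)
    ]
}
plain_headers, plain = encode_body(payload)
gz_headers, gz = encode_body(payload, "gzip, deflate")
print(len(plain), len(gz))
```

The document itself stays plain, readable JSON at both ends; the compression lives entirely in the transport, flagged by the `Content-Encoding` header.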

Comments

17 Sep 2013
09:48 AM

Patrick Huizinga

Ayende, I can't find any savings between the two documents. Unless you count the fact that two changes were 'merged' somehow.

The 'never negative' change and 'decrement reduce stats bug' are 'merged', which accounts for the roughly 500 byte difference. Which is about 10% btw, not nearly half.

So this makes me conclude that either JSONH is destructive and less than useless, or you shouldn't create blog posts at 5:30 AM. And given the bad math, I'm afraid you're part of the problem...

Anyway, I think your conclusion is still correct. Choose between compression and readability. Trying to do both results in neither.

17 Sep 2013
09:50 AM

Patrick Huizinga

Oh, forgot to mention that when I count the characters I ended up on roughly 4500 and 4000. The only way to get the 3000 characters of JSONH is when you ignore unnecessary spaces, which is a 'compression' you should also get for free with a regular JSON document.

17 Sep 2013
10:20 AM

Andy Dent

That seems like somewhat of a strawman sample of XML.

For starters, you are comparing a strongly-typed set of data with loosely typed.

Then let's consider why that particular XML is so wordy - it's got data types on every item! Why? You can declare a schema up front and then the element data types will still be strongly typed.

I ran into the same issue with supposed compaction when dealing with geological data in XML. There were concerns about a format which recorded lab results for assay samples. I spent a day refining a schema to get a 600KB sample file down to about half its size. Then I compared a zip of the original with a zip of my reduced file - 39KB vs 36KB!

(Note that data which is mainly lots of different floating point numbers doesn't zip as well as some other plain text so this zip ratio is, if anything, on the high side.)

17 Sep 2013
11:38 AM

tobi

Dictionary encoders replace the redundancy anyway. They are probably slower.

There are custom XML compression algorithms (with competitions) as well. The best ones compress 10-20% better than ZIP by doing a preprocessing step basically.

17 Sep 2013
12:01 PM

Khalid Abuhakmeh

I know the horrors of XML, especially when you start getting into namespaces and trying to extract values using XPath. It is a complete nightmare in C# and just feels gross.

Have you looked at ServiceStack's JSV format? It is JSON with a few compression techniques that ultimately help, but don't break human readability.

http://www.servicestack.net/docs/text-serializers/jsv-format http://www.servicestack.net/mythz_blog/?p=176

I myself would probably just stick with JSON, but I know these kinds of optimizations, small as they might seem, ultimately help the overall performance of an app.

17 Sep 2013
15:45 PM

Anders

It feels like the backlash against XML was due to the enterprise mess it had become in some instances. I find "simple" XML to be quite readable, and built-in tool support is better than for JSON on almost all platforms. Doc size is marginally larger than JSON, and compression makes them even closer. Throwing out XML because of horrible SOAP formats is like throwing out the JVM because of Struts. So I find the choice between JSON and XML arbitrary. If you care about size you choose neither, and if you can't make your doc readable in both XML and JSON you shouldn't have chosen a text format.

17 Sep 2013
16:44 PM

Nick

Google's Protocol Buffers came to mind... By the way, YAML is human readable too (4.2KB). Choosing between JSON and YAML is just a matter of applicability to the actual solution (where in many cases JSON wins).

17 Sep 2013
20:37 PM

Kelly Summerlin

I found a very specific use case for JSONH that worked very well. If your browser app is sending/receiving javascript that contains arrays of homogeneous JSON objects then JSONH compresses things very well. Compression gets better as the number of properties in the homogeneous JSON object increases or the number of objects in the arrays gets really large. This is because JSONH takes the property names out of the arrays and moves them into a schema-like set of properties in an outer JSON object. Thus it removes the property name duplication in the arrays. The 14 on line 2 is the number of properties in the JSON array, followed by the 14 property names. It is true, I would never use JSONH for storage because it makes it much more difficult for a human to read.

In my specific use case, JSONH was 3 times faster than a client-side dictionary compression and compressed the JSON very well (70-74% in some cases). A big win when dealing with browser traffic. At a later date, this application will move to streaming the JSON in smaller increments and then the JSONH will very likely go away. JSONH has delayed some of that pain for now.

So as an 'over the wire' format JSONH worked well here.

17 Sep 2013
21:50 PM

Ayende Rahien

Kelly, now compare this to doing gzip compression. What would the results be?

18 Sep 2013
00:33 AM

Kelly Summerlin

Ayende, yes gzip compression solves the problem in the browser, but only for responses from the server back to the browser. There is no corresponding gzip option for GET/POST with a large request. I would absolutely love it if browsers recognized Content-Encoding: gzip and would automatically gzip requests. There are of course other options for large requests: plugins (sketchy at best), websockets (promising, maybe someday, still no compression though), write to file and transfer (not exactly kosher or well supported). For us, JSONH was a nice option because it compressed our AJAX request payload well but was still valid JSON.

18 Sep 2013
01:06 AM

Kelly Summerlin

One more thing. I mentioned before that we tried several different javascript compression libraries. All of them work quite well in a Node.js scenario, but kinda fall flat in the browser. I think this has to do mainly with trying to emulate 8-byte binary read/writes in the browser. Certain browsers (uhm - IE9 and below) really struggle with the code and turn out to be very slow: 5x slower in IE 8 & 9, 2.3x slower in IE10. FF20+ and Chrome 18+ do pretty well with client-side compression, about ~1.3x slower than JSONH in the browser. Once you have a compressed 8-byte emulation buffer, what do you do with it though? Send it via AJAX as what encoding exactly? GZIP encoding was not recognized as valid GZIP by our server (Python Tornado in our case).

18 Sep 2013
06:46 AM

Ayende Rahien

You can actually compress the request contents. And the HTTP spec supports it. In fact, we routinely compress both requests & responses from RavenDB. But yes, it is not easy to do in the browser.

18 Sep 2013
06:47 AM

Ayende Rahien

Kelly, You need to specify Content-Encoding, and the server should be able to recognize it. If it doesn't, you can set it up to recognize it.

18 Sep 2013
17:35 PM

Kelly Summerlin

I tried Content-Encoding: gzip but the server still didn't accept it from the browser. We could send a gzipped file upload and the server would recognize it. We also sent a gzipped request from another server app, so we know it was not the web server. I never had a chance to compare the browser-generated request and the server-app-generated request in Fiddler. One day in my spare time (hah, did I really say that?) I need to go back and compare the two.

'But yes, it is not easy to do in the browser.' -- that was the kicker for this project. You can compress requests, but browser support for this feature is sadly lacking.

25 Sep 2013
17:18 PM

Anonymous Coward

I only have one thing to say: http://www.catb.org/esr/writings/taoup/html/ch05s01.html

Oren Eini

CEO of RavenDB