JSON Packing, Text Based Formats and other stuff that comes to mind at 5 AM
This post was written at 5:30 AM. I ran into this while doing research for another post, and I couldn’t really let it go.
XML as a text-based format is really wasteful in space. But that wasn’t what really made it lose its shine. That happened when it became so complex that it stopped being human readable. For example, I give you:
<?xml version="1.0" encoding="UTF-8" ?>
<SOAP-ENV:Envelope
  xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
  xmlns:xsd="http://www.w3.org/1999/XMLSchema"
  xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <ns1:getEmployeeDetailsResponse
      xmlns:ns1="urn:MySoapServices"
      SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
      <return xsi:type="ns1:EmployeeContactDetail">
        <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
        <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
        <tempPhoneNumber
          xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
          xsi:type="ns2:Array"
          ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37060</startDate>
            <endDate xsi:type="xsd:int">37064</endDate>
            <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
          </item>
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37074</startDate>
            <endDate xsi:type="xsd:int">37078</endDate>
            <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
          </item>
          <item xsi:type="ns1:TemporaryPhoneNumber">
            <startDate xsi:type="xsd:int">37088</startDate>
            <endDate xsi:type="xsd:int">37092</endDate>
            <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
          </item>
        </tempPhoneNumber>
      </return>
    </ns1:getEmployeeDetailsResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
After XML was thrown out of the company of respectable folks, we had JSON show up and entertain us. It is smaller and more concise than XML, and so far has resisted the efforts to make it into some sort of an uber-complex enterprisey tool.
But today I ran into quite a few efforts to do strange things to JSON. I am talking about things like JSON DB (a compressed JSON format, not an actual JSON database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.
Let us take an example. The following is a JSON document representing one of RavenDB's builds:
{
  "BuildName": "RavenDB Unstable v2.5",
  "IsUnstable": true,
  "Version": "2509-Unstable",
  "PublishedAt": "2013-02-26T12:06:12.0000000",
  "DownloadsIds": [],
  "Changes": [
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
      "Href": "/app/rest/changes/id:21039",
      "TeamCityId": 21039,
      "Username": "david walker",
      "Comment": "Do not save Has-Api-Key header to metadata\n",
      "Date": "2013-02-20T23:22:43.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "david@davidwalker.org",
        "Name": "David Walker"
      },
      "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
      "Href": "/app/rest/changes/id:21040",
      "TeamCityId": 21040,
      "Username": "david walker",
      "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
      "Date": "2013-02-20T23:23:59.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "5919521286735f50f963824a12bf121cd1df4367",
      "Href": "/app/rest/changes/id:21035",
      "TeamCityId": 21035,
      "Username": "ayende rahien",
      "Comment": "Better disposal\n",
      "Date": "2013-02-26T10:16:45.0000000",
      "Files": [
        "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
      "Href": "/app/rest/changes/id:21036",
      "TeamCityId": 21036,
      "Username": "ayende rahien",
      "Comment": "Will ensure that the value is always positive or zero (never negative).\nWhen using numeric calc, will div by 1,024 to get more concentration into buckets.\n",
      "Date": "2013-02-26T10:17:23.0000000",
      "Files": [
        "Raven.Database/Indexing/IndexingUtil.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "7bf51345d39c3993fed5a82eacad6e74b9201601",
      "Href": "/app/rest/changes/id:21037",
      "TeamCityId": 21037,
      "Username": "ayende rahien",
      "Comment": "Fixing a bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
      "Date": "2013-02-26T10:53:01.0000000",
      "Files": [
        "Raven.Database/Indexing/MapReduceIndex.cs",
        "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
        "Raven.Database/Storage/IMappedResultsStorageAction.cs",
        "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
        "Raven.Tests/Issues/RavenDB_784.cs",
        "Raven.Tests/Storage/MappedResults.cs",
        "Raven.Tests/Views/ViewStorage.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
      "Href": "/app/rest/changes/id:21038",
      "TeamCityId": 21038,
      "Username": "ayende rahien",
      "Comment": "No need for so many repeats\n",
      "Date": "2013-02-26T11:27:49.0000000",
      "Files": [
        "Raven.Tests/Bugs/MultiOutputReduce.cs"
      ]
    },
    {
      "Commiter": {
        "Email": "ayende@ayende.com",
        "Name": "Ayende Rahien"
      },
      "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
      "Href": "/app/rest/changes/id:21041",
      "TeamCityId": 21041,
      "Username": "ayende rahien",
      "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
      "Date": "2013-02-26T11:41:39.0000000",
      "Files": [
        "Raven.Abstractions/Extensions/MetadataExtensions.cs"
      ]
    }
  ],
  "ResolvedIssues": [],
  "Contributors": [
    {
      "FullName": "Ayende Rahien",
      "Email": "ayende@ayende.com",
      "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
    },
    {
      "FullName": "David Walker",
      "Email": "david@davidwalker.org",
      "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
    }
  ],
  "BuildTypeId": "bt8",
  "Href": "/app/rest/builds/id:588",
  "ProjectName": "RavenDB",
  "TeamCityId": 588,
  "ProjectId": "project3",
  "Number": 2509
}
This document is 4.52KB in size. Running this through JSONH gives us the following:
1: [
2: 14,
3: "BuildName",
4: "IsUnstable",
5: "Version",
6: "PublishedAt",
7: "DownloadsIds",
8: "Changes",
9: "ResolvedIssues",
10: "Contributors",
11: "BuildTypeId",
12: "Href",
13: "ProjectName",
14: "TeamCityId",
15: "ProjectId",
16: "Number",
17: "RavenDB Unstable v2.5",
18: true,
19: "2509-Unstable",
20: "2013-02-26T12:06:12.0000000",
21: [
22: ],
23: [
24: {
25: "Commiter": {
26: "Email": "david@davidwalker.org",
27: "Name": "David Walker"
28: },
29: "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
30: "Href": "/app/rest/changes/id:21039",
31: "TeamCityId": 21039,
32: "Username": "david walker",
33: "Comment": "Do not save Has-Api-Key header to metadata\n",
34: "Date": "2013-02-20T23:22:43.0000000",
35: "Files": [
36: "Raven.Abstractions/Extensions/MetadataExtensions.cs"
37: ]
38: },
39: {
40: "Commiter": {
41: "Email": "david@davidwalker.org",
42: "Name": "David Walker"
43: },
44: "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
45: "Href": "/app/rest/changes/id:21040",
46: "TeamCityId": 21040,
47: "Username": "david walker",
48: "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
49: "Date": "2013-02-20T23:23:59.0000000",
50: "Files": [
51: "Raven.Abstractions/Extensions/MetadataExtensions.cs"
52: ]
53: },
54: {
55: "Commiter": {
56: "Email": "ayende@ayende.com",
57: "Name": "Ayende Rahien"
58: },
59: "Version": "5919521286735f50f963824a12bf121cd1df4367",
60: "Href": "/app/rest/changes/id:21035",
61: "TeamCityId": 21035,
62: "Username": "ayende rahien",
63: "Comment": "Better disposal\n",
64: "Date": "2013-02-26T10:16:45.0000000",
65: "Files": [
66: "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
67: ]
68: },
69: {
70: "Commiter": {
71: "Email": "ayende@ayende.com",
72: "Name": "Ayende Rahien"
73: },
74: "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
75: "Href": "/app/rest/changes/id:21036",
76: "TeamCityId": "...bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
77: "Date": "2013-02-26T10:53:01.0000000",
78: "Files": [
79: "Raven.Database/Indexing/MapReduceIndex.cs",
80: "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
81: "Raven.Database/Storage/IMappedResultsStorageAction.cs",
82: "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
83: "Raven.Tests/Issues/RavenDB_784.cs",
84: "Raven.Tests/Storage/MappedResults.cs",
85: "Raven.Tests/Views/ViewStorage.cs"
86: ]
87: },
88: {
89: "Commiter": {
90: "Email": "ayende@ayende.com",
91: "Name": "Ayende Rahien"
92: },
93: "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
94: "Href": "/app/rest/changes/id:21038",
95: "TeamCityId": 21038,
96: "Username": "ayende rahien",
97: "Comment": "No need for so many repeats\n",
98: "Date": "2013-02-26T11:27:49.0000000",
99: "Files": [
100: "Raven.Tests/Bugs/MultiOutputReduce.cs"
101: ]
102: },
103: {
104: "Commiter": {
105: "Email": "ayende@ayende.com",
106: "Name": "Ayende Rahien"
107: },
108: "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
109: "Href": "/app/rest/changes/id:21041",
110: "TeamCityId": 21041,
111: "Username": "ayende rahien",
112: "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
113: "Date": "2013-02-26T11:41:39.0000000",
114: "Files": [
115: "Raven.Abstractions/Extensions/MetadataExtensions.cs"
116: ]
117: }
118: ],
119: [
120: ],
121: [
122: {
123: "FullName": "Ayende Rahien",
124: "Email": "ayende@ayende.com",
125: "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
126: },
127: {
128: "FullName": "David Walker",
129: "Email": "david@davidwalker.org",
130: "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
131: }
132: ],
133: "bt8",
134: "/app/rest/builds/id:588",
135: "RavenDB",
136: 588,
137: "project3",
138: 2509
139: ]
It reduced the document size to 2.93KB! Awesome, nearly half of the size was gone. Except: this actually generates an utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on?
I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I threw this through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I have with the JSONH stuff.
To be frank, I just don’t get efforts like this. JSON is a text-based, human-readable format. If you lose the human-readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient and you don’t lose anything by it.
And if you want to compress your data, it is probably better to use something like a compression tool. HTTP compression, for example, is practically free, since all servers and clients should be able to consume it now. And any tool that you use should be able to inspect through it. And it is likely to generate much better results on your JSON documents than if you try a clever format like this.
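To make that comparison concrete, here is a minimal sketch in Python (my choice of language, not something the tools above use) that measures the same document pretty-printed, minified, and gzipped. The `size_report` helper and the `sample` data are made up for illustration; the sample is only loosely shaped like the build document above, so the numbers it prints will not match the sizes quoted in this post.

```python
import gzip
import json


def size_report(document):
    """Return the UTF-8 byte size of a JSON-serializable object in three forms."""
    pretty = json.dumps(document, indent=2).encode("utf-8")
    # separators=(",", ":") drops the optional whitespace between tokens.
    minified = json.dumps(document, separators=(",", ":")).encode("utf-8")
    gzipped = gzip.compress(minified)
    return {
        "pretty": len(pretty),
        "minified": len(minified),
        "gzipped": len(gzipped),
    }


# Hypothetical sample data, loosely modeled on the build document above.
sample = {
    "BuildName": "RavenDB Unstable v2.5",
    "Changes": [
        {
            "Commiter": {"Email": "ayende@ayende.com", "Name": "Ayende Rahien"},
            "TeamCityId": 21035 + i,
            "Username": "ayende rahien",
            "Files": ["Raven.Abstractions/Extensions/MetadataExtensions.cs"],
        }
        for i in range(7)
    ],
}

if __name__ == "__main__":
    for name, size in size_report(sample).items():
        print(f"{name}: {size} bytes")
```

The repeated property names are exactly the kind of redundancy a general-purpose compressor removes on its own, and the document stays plain JSON on both ends.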
Comments
Ayende, I can't find any savings between the two documents. Unless you count the fact that two changes were 'merged' somehow.
The 'never negative' change and 'decrement reduce stats bug' are 'merged', which accounts for the roughly 500 byte difference. Which is about 10% btw, not nearly half.
So this makes me conclude that either JSONH is destructive and less than useless or you shouldn't create blog posts at 5:30 AM. And given the bad math, I'm afraid you're part of the problem...
Anyway, I think your conclusion is still correct. Choose between compression and readability. Trying to do both results in neither.
Oh, forgot to mention that when I count the characters I ended up on roughly 4500 and 4000. The only way to get the 3000 characters of JSONH is when you ignore unnecessary spaces, which is a 'compression' you should also get for free with a regular JSON document.
That seems like somewhat of a strawman sample of XML.
For starters, you are comparing a strongly-typed set of data with loosely typed.
Then let's consider why that particular XML is so wordy - it's got data types on every item! Why? You can declare a schema up front and then the element data types will still be strongly typed.
I ran into the same issue with supposed compaction when dealing with geological data in XML. There were concerns about a format which recorded lab results for assay samples. I spent a day refining a schema to get a 600KB sample file down to about half its size. Then I compared a zip of the original with a zip of my reduced file - 39KB vs 36KB!
(Note that data which is mainly lots of different floating point numbers doesn't zip as well as some other plain text so this zip ratio is, if anything, on the high side.)
Dictionary encoders replace the redundancy anyway. They are probably slower.
There are custom XML compression algorithms (with competitions) as well. The best ones compress 10-20% better than ZIP by doing a preprocessing step basically.
I know the horrors of XML, especially when you start getting into namespaces and trying to extract values using XPath. It is a complete nightmare in C# and just feels gross.
Have you looked at ServiceStack's JSV format? It is JSON with a few compression techniques that ultimately help, but don't break human readability.
http://www.servicestack.net/docs/text-serializers/jsv-format http://www.servicestack.net/mythz_blog/?p=176
I myself would probably just stick with JSON. I know these kinds of optimizations might seem small, but they ultimately help the overall performance of an app.
It feels like the backlash against XML was due to the enterprise mess it had become in some instances. I find "simple" XML to be quite readable, and built-in tool support is better than for JSON on almost all platforms. Doc size is marginally larger than JSON and compression makes them even closer. Throwing out XML because of horrible SOAP formats is like throwing out the JVM because of Struts. So the choice between JSON and XML I find arbitrary. If you care about size you choose neither, and if you can't make your doc readable in both XML and JSON you shouldn't have chosen a text format.
Google's Protocol Buffers came to mind... By the way, YAML is human readable too (4.2KB). Choosing between JSON and YAML is just a matter of applicability to the actual solution (where in many cases JSON wins).
I found a very specific use case for JSONH that worked very well. If your browser app is sending/receiving JavaScript that contains arrays of homogeneous JSON objects then JSONH compresses things very well. Compression gets better as the number of properties in the homogeneous JSON object increases or the number of objects in the arrays gets really large. This is because JSONH takes the property names out of the arrays and moves them into a schema-like set of properties in an outer JSON object. Thus it removes the property name duplication in the arrays. The 14 on line 2 is the number of properties in the JSON array, followed by the 14 property names. It is true, I would never use JSONH for storage because it makes it much more difficult for a human to read.
In my specific use case, JSONH was 3 times faster than a client-side dictionary compression and compressed the JSON very well (70-74% in some cases). A big win when dealing with browser traffic. At a later date, this application will move to streaming the JSON in smaller increments and then the JSONH will very likely go away. JSONH has delayed some of that pain for now.
So as an 'over the wire' format JSONH worked well here.
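The key-hoisting idea described in this comment can be sketched in a few lines of Python. This is not the JSONH library's code; `pack` and `unpack` are hypothetical helpers that follow the layout visible in the output above (a count, then that many property names, then the values in order), and they assume a flat list of objects that all share the same, non-empty set of keys.

```python
def pack(records):
    """Pack a list of same-shaped dicts into [n, key1..keyn, value, value, ...]."""
    if not records:
        return [0]
    keys = list(records[0])
    for record in records:
        # dict key views compare like sets, so key order may differ per record.
        if record.keys() != records[0].keys():
            raise ValueError("all records must share the same set of keys")
    packed = [len(keys), *keys]
    for record in records:
        # Values are always emitted in the order of the hoisted key list.
        packed.extend(record[key] for key in keys)
    return packed


def unpack(packed):
    """Rebuild the list of dicts from the packed form produced by pack()."""
    count = packed[0]
    if count == 0:
        return []
    keys = packed[1:1 + count]
    values = packed[1 + count:]
    return [
        dict(zip(keys, values[start:start + count]))
        for start in range(0, len(values), count)
    ]
```

For example, `pack([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])` yields `[2, "a", "b", 1, "x", 2, "y"]`: each property name appears once instead of once per object, which is where the savings on large homogeneous arrays come from, and also why a single heterogeneous document like the build above gains little.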
Kelly, Now compare this to doing gzip compression, what would be the results?
Ayende, yes gzip compression solves the problem in the browser, but only for responses from the server back to the browser. There is no corresponding gzip option for GET/POST with a large request. I would absolutely love it if browsers recognized Content-Encoding: gzip and would automatically gzip requests. There are of course other options for large requests: plugins (sketchy at best), websockets (promising, maybe someday, still no compression though), write to file and transfer (not exactly kosher or well supported). For us, JSONH was a nice option because it compressed our AJAX request payload well but was still valid JSON.
One more thing. I mentioned before that we tried several different JavaScript compression libraries. All of them work quite well in a Node.js scenario, but kinda fall flat in the browser. I think this has to do mainly with trying to emulate 8-byte binary read/writes in the browser. Certain browsers (uhm - IE9 and below) really struggle with the code and turn out to be very slow: 5x slower in IE 8 & 9 and 2.3x slower in IE10. FF20+ and Chrome 18+ do pretty well with client-side compression, about 1.3x slower than JSONH in the browser. Once you have a compressed 8-byte emulation buffer, what do you do with it though? Send it via AJAX as what encoding exactly? GZIP encoding was not recognized as valid GZIP by our server (Python Tornado in our case).
You can actually compress the request contents. And the HTTP spec supports it. In fact, we routinely compress both requests & responses from RavenDB. But yes, it is not easy to do in the browser.
Kelly, You need to specify Content-Encoding, and the server should be able to recognize it. If it doesn't, you can set it up to recognize it.
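Outside the browser, the two halves of this look roughly like the sketch below, written in Python with only the standard library. It is an illustration under assumptions, not how RavenDB or Tornado implement it: `build_gzip_request` and `decode_body` are hypothetical helper names, and the URL in the usage line is a placeholder.

```python
import gzip
import json
import urllib.request


def build_gzip_request(url, payload):
    """Client side: gzip the JSON body and declare that via Content-Encoding."""
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
        },
    )


def decode_body(content_encoding, body):
    """Server side: undo the declared encoding before parsing the JSON."""
    if content_encoding and content_encoding.lower() == "gzip":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Placeholder URL; the request is only built here, not sent.
request = build_gzip_request("http://localhost:8080/docs", {"Number": 2509})
```

The server has to look at the Content-Encoding request header itself (or have middleware that does) before handing the body to its JSON parser; nothing about the payload format changes.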
I tried Content-Encoding: gzip but the server still didn't accept it from the browser. We could send a gzipped file upload and the server would recognize it. We also sent a gzipped request from another server app, so we know it was not the web server. I never had a chance to compare the browser-generated request and the server-app-generated request in Fiddler. One day in my spare time (hah, did I really say that?) I need to go back and compare the two.
'But yes, it is not easy to do in the browser.' -- that was the kicker for this project. You can compress requests, but browser support for this feature is sadly lacking.
I only have one thing to say: http://www.catb.org/esr/writings/taoup/html/ch05s01.html