The problem with compression & streaming
I spent some time today trying to optimize the amount of data the profiler sends on the wire. My first thought was that I could simply wrap the output stream with a compressing stream and use that; indeed, in my initial testing it proved to be quite simple to do and reduced the amount of data being sent by a factor of 5. I played around a bit more and discovered that different compression implementations could bring me up to a factor of 50!
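To make that concrete, here is a minimal sketch of what I mean by wrapping the output stream. This is not the profiler's actual code; the message contents and the stand-in MemoryStream are made up for illustration.

```csharp
// Minimal sketch: wrap the outgoing stream in a DeflateStream so that
// everything written to it is compressed on the fly.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressedWriterSketch
{
    static void Main()
    {
        // Stand-in for the network stream the profiler writes to.
        using var wire = new MemoryStream();

        using (var compressed = new DeflateStream(wire, CompressionLevel.Optimal, leaveOpen: true))
        {
            // Stand-in for serialized profiler messages.
            for (int i = 0; i < 1_000; i++)
            {
                byte[] message = Encoding.UTF8.GetBytes($"SELECT * FROM Users WHERE Id = {i}\n");
                compressed.Write(message, 0, message.Length);
            }
        } // disposing the DeflateStream finishes the compressed output

        Console.WriteLine($"Compressed size: {wire.Length:N0} bytes");
    }
}
```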
Unfortunately, I did all my initial testing on files, and while the profiler is able to read files just fine, it is most commonly used for live profiling, to see what is going on in the application right now. The problem here is that adding compression is a truly marvelous way to screw that up. Basically, I want to compress live data, and most compression libraries are not up for that task. It gets a bit more complex when you realize that what I actually wanted was a way to get compression to work on relatively small data chunks.
When you think about how most compression algorithms work (there is a dictionary in there somewhere), you realize what the problem is. You need to keep updating the dictionary while you are compressing the stream, and at the same time, you need that same dictionary to decompress things. That makes it… difficult to handle. I thought about compressing small chunks (say, every 256Kb), but then I ran into problems figuring out when exactly I am supposed to be flushing them, how to handle partial messages, and more.
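For illustration only (hypothetical names, nothing from the actual profiler), here is roughly what that chunked approach looks like: each chunk gets its own compressor, so the dictionary restarts every time, and each compressed chunk has to be length-prefixed so the reader can tell where it ends, which is exactly where the flushing and partial-message questions come from.

```csharp
// Sketch of compressing independent 256Kb chunks. Each chunk is compressed
// with a fresh DeflateStream (fresh dictionary) and written with a length
// prefix so the receiver can frame it.
using System;
using System.IO;
using System.IO.Compression;

static class ChunkedCompressionSketch
{
    const int ChunkSize = 256 * 1024; // 256Kb, as in the post

    static void Main()
    {
        using var wire = new MemoryStream();   // stand-in for the connection
        byte[] chunk = new byte[ChunkSize];    // stand-in for buffered profiler output
        WriteChunk(wire, chunk, chunk.Length);
        Console.WriteLine($"Framed compressed chunk: {wire.Length:N0} bytes");
    }

    static void WriteChunk(Stream wire, byte[] buffer, int count)
    {
        using var temp = new MemoryStream();
        using (var deflate = new DeflateStream(temp, CompressionLevel.Fastest, leaveOpen: true))
        {
            deflate.Write(buffer, 0, count);
        }

        // Length prefix, so the reader knows how many bytes belong to this chunk.
        byte[] prefix = BitConverter.GetBytes((int)temp.Length);
        wire.Write(prefix, 0, prefix.Length);
        temp.Position = 0;
        temp.CopyTo(wire);
    }
}
```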
In the end, I decided that while it was a very interesting trial run, this is not something that is likely to show good ROI.
Comments
Ayende,
There's a whole branch of compression algorithms dealing with streams. While in theory they are not as efficient as "file"-based compression algorithms, they should be able to provide you with reasonable results.
The problems you describe are the exact challenges they are dealing with.
Lior,
Yes, I am aware of that; the issue is just that I figured out that there isn't enough ROI for this.
This is the best compression library I've ever seen: http://www.codeplex.com/DotNetZip
It supports "creating zip files from stream content, saving to a stream, extracting to a stream, reading from a stream"
Giorgi,
There is a BIG difference between a stream (an IO abstraction) and streaming
Giorgi,
The library you recommend is helpful, but it has serious flaws. Firstly, it ain't thread-safe. Secondly, its performance becomes awful once the number of entries in the archive goes beyond two digits.
Oren,
ROI notwithstanding, couldn't you cheat by pre-populating the dictionary with common strings from known framework log messages and, at runtime, table metadata?
One observation is that you don't actually need live realtime streaming. You're fine as long as blocks of messages arrive frequently enough to convince the user that it's realtime.
To that end, just flush the stream at message boundaries every 50-100ms or so. For example, after writing a message, check whether there is pending data and it has been X time since the last flush; if so, do a flush and reset the timestamp. Make sure to flush at the end of the message stream too, of course.
You can "sync flush" as often as you like. A sync flush doesn't empty the dictionary. It's a bit like a checkpointing operation and is perfect for streaming. Pretty sure SharpZipLib supports this behaviour.
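To make that concrete, here is a rough sketch, assuming the plain System.IO.Compression.DeflateStream (as far as I know, on modern .NET its Flush() emits a sync-flush block without resetting the dictionary; on the old full framework it was a no-op, which is part of what makes this fiddly) and made-up message contents.

```csharp
// Sketch: flush the compressed stream only at message boundaries, and only
// when enough time has passed since the last flush.
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;

class SyncFlushSketch
{
    static readonly TimeSpan FlushInterval = TimeSpan.FromMilliseconds(75); // ~50-100ms

    static void Main()
    {
        using var wire = new MemoryStream();   // stand-in for the live connection
        using var compressed = new DeflateStream(wire, CompressionLevel.Fastest, leaveOpen: true);

        var sinceLastFlush = Stopwatch.StartNew();
        bool pendingData = false;

        // In the real profiler, messages arrive over time; here they are generated in a loop.
        for (int i = 0; i < 10_000; i++)
        {
            byte[] message = Encoding.UTF8.GetBytes($"message {i}\n");
            compressed.Write(message, 0, message.Length);
            pendingData = true;

            if (pendingData && sinceLastFlush.Elapsed >= FlushInterval)
            {
                compressed.Flush();            // sync flush: the dictionary is kept
                sinceLastFlush.Restart();
                pendingData = false;
            }
        }

        compressed.Flush();                    // flush whatever is left at the end
        Console.WriteLine($"Bytes on the wire: {wire.Length:N0}");
    }
}
```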
I was also going to suggest pre-populating a dictionary based on some large corpus of typical data.
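Something along these lines, maybe, assuming SharpZipLib's low-level Deflater (its SetDictionary call is the relevant bit). The dictionary contents here are made up; the real one would be built from typical framework log messages and table metadata, and the receiving side has to be primed with exactly the same bytes.

```csharp
// Sketch of a preset dictionary: seed the compressor with strings that show up
// in almost every message, so even the first small chunk compresses well.
using System;
using System.Text;
using ICSharpCode.SharpZipLib.Zip.Compression;

class PresetDictionarySketch
{
    static void Main()
    {
        // Hypothetical dictionary of common substrings.
        byte[] presetDictionary = Encoding.UTF8.GetBytes(
            "SELECT INSERT UPDATE DELETE FROM WHERE ORDER BY JOIN ExecuteReader");

        var deflater = new Deflater(9);            // level 9 = best compression
        deflater.SetDictionary(presetDictionary);  // must happen before compressing

        byte[] input = Encoding.UTF8.GetBytes("SELECT * FROM Users WHERE Id = 42");
        deflater.SetInput(input);
        deflater.Finish();

        var output = new byte[1024];
        int compressedLength = 0;
        while (!deflater.IsFinished)
        {
            compressedLength += deflater.Deflate(output, compressedLength, output.Length - compressedLength);
        }

        Console.WriteLine($"{input.Length} bytes in, {compressedLength} bytes out");
    }
}
```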
Eric,
Oh, I can do that, sure. But when it became hard, I decided that it doesn't make sense to devote that much effort to this use case.
It was more exploratory in nature, seeing if I could get a good perf benefit out of a potential low-hanging fruit.
Waste of space, this blog post. I tried to zip up a stream, it didn't work, fail. If you had done some thinking before you started, you could have improved on your ROI