The problem with compression & streaming


I spent some time today trying to optimize the amount of data the profiler sends on the wire. My first thought was that I could simply wrap the output stream with a compressing stream and use that. Indeed, in my initial testing it proved quite simple to do and reduced the amount of data being sent by a factor of 5. Playing around a bit more, I discovered that a different compression implementation could bring me up to a factor of 50!
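In code terms, the idea is just this. Here is a minimal sketch in Python (not the profiler's actual code; the messages and their contents are invented for illustration):

```python
import gzip
import io

# Hypothetical stand-ins for the profiler's messages.
messages = [f"event {i}: some profiling data".encode() for i in range(10_000)]

raw = io.BytesIO()
# Wrap the output stream with a compressing stream; everything written
# to `compressed` is deflated before it reaches the underlying buffer.
with gzip.GzipFile(fileobj=raw, mode="wb") as compressed:
    for message in messages:
        compressed.write(message)

total = sum(len(m) for m in messages)
print(f"raw: {total} bytes, compressed: {raw.tell()} bytes")
```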

Unfortunately, I did all my initial testing on files, and while the profiler can read files just fine, it is most commonly used for live profiling, to see what is going on in the application right now. The problem here is that adding compression is a truly marvelous way to screw that up. Basically, I wanted to compress live data, and most compression libraries are not up for that task. It gets a bit more complex when you realize that what I actually needed was a way to get compression to work on relatively small data chunks.
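You can see the issue with a zlib-style stream compressor (again, an illustration, not the profiler's code):

```python
import zlib

comp = zlib.compressobj()

# A small, live message -- the kind of thing a profiler sends moment to moment.
out = comp.compress(b"query executed: select * from users")
print(len(out))  # typically 0 -- the compressor buffered it; a live viewer sees nothing
```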

When you think about how most compression algorithms work (there is a dictionary in there somewhere), you realize what the problem is. You need to keep updating the dictionary while you are compressing the stream, and at the same time, you need that dictionary to decompress things on the other side. That makes it… difficult to handle. I thought about compressing small chunks (say, every 256KB), but then I ran into problems figuring out exactly when I am supposed to flush them, how to handle partial messages, and more.
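For what it's worth, zlib does have a sync-flush mode that forces out the buffered data while keeping the dictionaries on both sides in step. Here is a minimal sketch of that chunked approach (the message content is made up, and all the framing and partial-message headaches are left out):

```python
import zlib

comp = zlib.compressobj()
decomp = zlib.decompressobj()

def send_chunk(payload: bytes) -> bytes:
    # Z_SYNC_FLUSH forces out everything buffered so far while keeping
    # the shared dictionary intact, so the receiver can decode the chunk
    # immediately -- at the cost of a worse compression ratio per flush.
    return comp.compress(payload) + comp.flush(zlib.Z_SYNC_FLUSH)

# The receiver can decode each chunk as it arrives, no waiting for close.
wire = send_chunk(b"event: session opened")
print(decomp.decompress(wire))
```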

In the end, I decided that while it was a very interesting trial run, this is not something that is likely to show good ROI.