The problem with compression & streaming
I spent some time today trying to optimize the amount of data the profiler sends on the wire. My first thought was that I could simply wrap the output stream with a compressing stream and use that; indeed, in my initial testing it proved to be quite simple to do and reduced the amount of data being sent by a factor of 5. I played around a bit more and discovered that different compression implementations could bring me up to a factor of 50!
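To make that concrete, here is a minimal sketch of what I mean by wrapping the output stream. This is not the profiler's actual code; the message contents and the stand-in MemoryStream are made up for illustration.

```csharp
// Minimal sketch: wrap the outgoing stream in a DeflateStream so that
// everything written to it is compressed on the fly.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressedWriterSketch
{
    static void Main()
    {
        // Stand-in for the network stream the profiler writes to.
        using var wire = new MemoryStream();

        using (var compressed = new DeflateStream(wire, CompressionLevel.Optimal, leaveOpen: true))
        {
            // Stand-in for serialized profiler messages.
            for (int i = 0; i < 1_000; i++)
            {
                byte[] message = Encoding.UTF8.GetBytes($"SELECT * FROM Users WHERE Id = {i}\n");
                compressed.Write(message, 0, message.Length);
            }
        } // disposing the DeflateStream finishes the compressed output

        Console.WriteLine($"Compressed size: {wire.Length:N0} bytes");
    }
}
```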
Unfortunately, I did all my initial testing on files, and while the profiler is able to read files just fine, it is most commonly used for live profiling, to see what is going on in the application right now. The problem here is that adding compression is a truly marvelous way to screw that up. Basically, I want to compress live data, and most compression libraries are not up for that task. It gets a bit more complex when you realize that what I actually wanted was a way to get compression to work on relatively small data chunks.
When you think about how most compression algorithms work (there is a dictionary in there somewhere), you realize what the problem is. You need to keep updating the dictionary while you are compressing the stream, and at the same time, you need that same dictionary to decompress things. That makes it… difficult to handle. I thought about compressing small chunks (say, every 256Kb), but then I ran into problems figuring out when exactly I am supposed to be flushing them, how to handle partial messages, and more.
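For illustration only (hypothetical names, nothing from the actual profiler), here is roughly what that chunked approach looks like: each chunk gets its own compressor, so the dictionary restarts every time, and each compressed chunk has to be length-prefixed so the reader can tell where it ends, which is exactly where the flushing and partial-message questions come from.

```csharp
// Sketch of compressing independent 256Kb chunks. Each chunk is compressed
// with a fresh DeflateStream (fresh dictionary) and written with a length
// prefix so the receiver can frame it.
using System;
using System.IO;
using System.IO.Compression;

static class ChunkedCompressionSketch
{
    const int ChunkSize = 256 * 1024; // 256Kb, as in the post

    static void Main()
    {
        using var wire = new MemoryStream();   // stand-in for the connection
        byte[] chunk = new byte[ChunkSize];    // stand-in for buffered profiler output
        WriteChunk(wire, chunk, chunk.Length);
        Console.WriteLine($"Framed compressed chunk: {wire.Length:N0} bytes");
    }

    static void WriteChunk(Stream wire, byte[] buffer, int count)
    {
        using var temp = new MemoryStream();
        using (var deflate = new DeflateStream(temp, CompressionLevel.Fastest, leaveOpen: true))
        {
            deflate.Write(buffer, 0, count);
        }

        // Length prefix, so the reader knows how many bytes belong to this chunk.
        byte[] prefix = BitConverter.GetBytes((int)temp.Length);
        wire.Write(prefix, 0, prefix.Length);
        temp.Position = 0;
        temp.CopyTo(wire);
    }
}
```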
In the end, I decided that while it was a very interesting trial run, this is not something that is likely to show good ROI.
Comments
Ayende,
There's a whole branch of compression algorithms dealing with streams. While in theory they are not as efficient as "file"-based compression algorithms, they should be able to provide you with reasonable results.
The problems you describe are the exact challenges they are dealing with.
Lior,
Yes, I am aware of that; the issue is just that I figured out that there isn't enough ROI for this.
This is the best compression library I've ever seen: http://www.codeplex.com/DotNetZip
It supports "creating zip files from stream content, saving to a stream, extracting to a stream, reading from a stream"
Giorgi,
There is a BIG difference between a stream (an IO abstraction) and streaming
Giorgi,
The library you recommend is helpful, but it has serious flaws. Firstly, it ain't thread-safe. Secondly, its performance becomes awful once the number of entries in the archive goes beyond two digits.
Oren,
ROI notwithstanding, couldn't you cheat by pre-populating the dictionary with common strings from known framework log messages and, at runtime, table metadata?
One observation is that you don't actually need live realtime streaming. You're fine as long as blocks of messages arrive frequently enough to convince the user that it's realtime.
To that end, just flush the stream at message boundaries every 50-100ms or so. For example, after writing a message, check whether there is pending data and it has been X time since the last flush; if so, do a flush and reset the timestamp. Make sure to flush at the end of the message stream too, of course.
You can "sync flush" as often as you like. A sync flush doesn't empty the dictionary. It's a bit like a checkpointing operation and is perfect for streaming. Pretty sure SharpZipLib supports this behaviour.
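To make that concrete, here is a rough sketch, assuming the plain System.IO.Compression.DeflateStream (as far as I know, on modern .NET its Flush() emits a sync-flush block without resetting the dictionary; on the old full framework it was a no-op, which is part of what makes this fiddly) and made-up message contents.

```csharp
// Sketch: flush the compressed stream only at message boundaries, and only
// when enough time has passed since the last flush.
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Text;

class SyncFlushSketch
{
    static readonly TimeSpan FlushInterval = TimeSpan.FromMilliseconds(75); // ~50-100ms

    static void Main()
    {
        using var wire = new MemoryStream();   // stand-in for the live connection
        using var compressed = new DeflateStream(wire, CompressionLevel.Fastest, leaveOpen: true);

        var sinceLastFlush = Stopwatch.StartNew();
        bool pendingData = false;

        // In the real profiler, messages arrive over time; here they are generated in a loop.
        for (int i = 0; i < 10_000; i++)
        {
            byte[] message = Encoding.UTF8.GetBytes($"message {i}\n");
            compressed.Write(message, 0, message.Length);
            pendingData = true;

            if (pendingData && sinceLastFlush.Elapsed >= FlushInterval)
            {
                compressed.Flush();            // sync flush: the dictionary is kept
                sinceLastFlush.Restart();
                pendingData = false;
            }
        }

        compressed.Flush();                    // flush whatever is left at the end
        Console.WriteLine($"Bytes on the wire: {wire.Length:N0}");
    }
}
```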
I was also going to suggest pre-populating a dictionary based on some large corpus of typical data.
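Something along these lines, maybe, assuming SharpZipLib's low-level Deflater (its SetDictionary call is the relevant bit). The dictionary contents here are made up; the real one would be built from typical framework log messages and table metadata, and the receiving side has to be primed with exactly the same bytes.

```csharp
// Sketch of a preset dictionary: seed the compressor with strings that show up
// in almost every message, so even the first small chunk compresses well.
using System;
using System.Text;
using ICSharpCode.SharpZipLib.Zip.Compression;

class PresetDictionarySketch
{
    static void Main()
    {
        // Hypothetical dictionary of common substrings.
        byte[] presetDictionary = Encoding.UTF8.GetBytes(
            "SELECT INSERT UPDATE DELETE FROM WHERE ORDER BY JOIN ExecuteReader");

        var deflater = new Deflater(9);            // level 9 = best compression
        deflater.SetDictionary(presetDictionary);  // must happen before compressing

        byte[] input = Encoding.UTF8.GetBytes("SELECT * FROM Users WHERE Id = 42");
        deflater.SetInput(input);
        deflater.Finish();

        var output = new byte[1024];
        int compressedLength = 0;
        while (!deflater.IsFinished)
        {
            compressedLength += deflater.Deflate(output, compressedLength, output.Length - compressedLength);
        }

        Console.WriteLine($"{input.Length} bytes in, {compressedLength} bytes out");
    }
}
```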
Eric,
Oh, I can do that, sure. But when it became hard, I decided that it doesn't make sense to devote that much effort to this use case.
It was more exploratory in nature, seeing if I could get a good perf benefit out of a potential low-hanging fruit.
Waste of space, this blog post. I tried to zip up a stream, it didn't work, fail. If you had done some thinking before you started, you could have improved on your ROI