Decompression code & Discussion

Jun 23 2014

Decompression code & Discussion

time to read 4 min | 763 words

As I said, a good & small interview question is this one, it is a good one because it is short, relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:
Prefix byte – composed of the highest bit + 7 bits len
If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.
If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
{
    var tmp = new byte[128];
    while (true)
    {
        var readByte = input.ReadByte();
        if (readByte == -1)
            break;
            
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
        {
            while (len > 0)
            {
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            }
        }
        else
        {
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);
        }
    }
}

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.

But that is just part of the question. After reading the protocol, and implementing it. The question now turns to what does the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limit the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use, they use the actual compressed text itself as the dictionary. Of course, I’ve been reading compression algorithms for a while now, so that isn’t that fair. But I would expect people to note that that 7 bit limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entry, not per entry.

Tweet Share Share 5 comments

Tags:

Comments

23 Jun 2014
14:19 PM

Joe Buschmann

My solution is close to yours. One difference is the while (len > 0) loop when compressed is false which my solution was missing. Is this because Stream.Read() may return fewer bytes than requested?

    public static void Decompress(Stream input, Stream output, byte[][] dictionary)
    {
        const int prefixMask = 128;
        int prefix = input.ReadByte();

        while (prefix != -1)
        {
            bool isCompressed = (prefix & prefixMask) == prefixMask;

            if (isCompressed)
            {
                int len = prefix ^ prefixMask;
                int index = input.ReadByte();
                byte[] buffer = dictionary[index];
                output.Write(buffer, 0, len);
            }
            else
            {
                int len = prefix;
                byte[] buffer = new byte[len];
                input.Read(buffer, 0, len);
                output.Write(buffer, 0, len);
            }

            prefix = input.ReadByte();
        }
    }

23 Jun 2014
15:38 PM

Aaron

I wrote a tiny solution just for fun, but I wasn't aware that streams could read fewer than the number of bytes requested outside of the end of the stream.

[code] static void Decompress(Stream i, Stream o, byte[][] d) { for (dynamic b = i.ReadByte(), u = new byte[127]; b > 0; b = i.ReadByte()) if (b < 128) o.Write(u, 0, i.Read(u, 0, b)); else o.Write(d[i.ReadByte()], 0, b & 127); } [/code]

23 Jun 2014
16:09 PM

Ayende Rahien

Joseph, Yes, Read() may return less bytes than you wanted, for any number of reasons.

23 Jun 2014
16:10 PM

Ayende Rahien

Aaron, That code is _ugly_. And why on earth would you use dynamic here?

23 Jun 2014
16:35 PM

Aaron

Ayende: yes it is extremely ugly. As I said, I wrote a tiny answer just for fun. Using dynamic was fewer characters than two declarations outside the for loop. (There's probably a much smaller solution, but I'm not a code golf enthusiast.)

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB