Negative feature response: Automatic attachment compression in RavenDB
Following my previous post, which mentioned that you can save significantly on disk space if you store a plain text attachment using gzip, we got a feature request:
Perhaps in future attachments could have built-in compression as well?
The answer to that is no, but I thought that it is worth a post to explain why not.
Let’s consider the typical types of attachments that you’ll store in RavenDB. Based on experience, we usually see:
- PDF files
- Word / Excel / PowerPoint
- Images (JPEG, PNG, GIF, etc.)
- Videos
- Designs (floor plans, CAD / DWG, etc)
- Text files
Aside from the text files, pretty much all the data you’ll store as an attachment is already compressed. In fact, you’ll be hard pressed today to find any file format that does not already have built-in compression.
Compressing already compressed data is… suboptimal. It will not usually lead to significant space savings and can actually make the file larger. It also burns CPU cycles unnecessarily.
It is better to shift the responsibility to the users in this case, since they have a lot more information about what they actually put into RavenDB and won’t have to guess.
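The claim about compressing already compressed data is easy to check. Here is a minimal sketch in Python, using only the standard library. The word list and the generated text are made up for illustration, and a gzipped copy of that text stands in for a file format with built-in compression, which is an assumption rather than a measurement on real PDF or image files.

```python
import gzip
import random


def compressed_ratio(data: bytes) -> float:
    """Size after gzip divided by size before; below 1.0 means space was saved."""
    return len(gzip.compress(data)) / len(data)


# Hypothetical stand-in for a plain text attachment: pseudo-random words.
rng = random.Random(0)
words = ["raven", "document", "attachment", "index",
         "query", "session", "store", "stream"]
plain_text = " ".join(rng.choice(words) for _ in range(50_000)).encode("utf-8")

# Hypothetical stand-in for a format with built-in compression:
# the same text, already gzipped once.
already_compressed = gzip.compress(plain_text)

text_ratio = compressed_ratio(plain_text)
recompressed_ratio = compressed_ratio(already_compressed)

print(f"plain text:         {text_ratio:.2f}")
print(f"already compressed: {recompressed_ratio:.2f}")
```

The first ratio should come out well below 1.0, while the second should hover around 1.0: the second pass spends CPU time and saves essentially nothing.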
More posts in "Negative feature response" series:
- (20 Dec 2021) Protect the user from accidental collection deletion
- (20 Oct 2021) Automatic attachment compression in RavenDB
Comments
These are fair points. I guess if somebody insisted, it could be overcome by introducing an EligibleForCompression property to the attachments infrastructure, but I totally understand that attachments are a general-purpose mechanism and that special-casing is problematic and, in this case, may not be worth it.
If someone really insisted on having collection-wide compression of texts, I guess they could emulate attachments to some degree and store texts in normal documents within their own entity type, e.g. TextAttachment or, to be more domain-specific, EbookContent.
Regarding user-based compression: I was wondering how one would know whether the text in an attachment is compressed and, if so, how it is compressed. Two ideas came to my mind that make use of the attachment's ContentType property:
- The usage of Media Type's Structured Syntax Name Suffixes. There are 3 compression-related suffixes registered at the moment: +zip, +gzip, +zstd (https://www.iana.org/assignments/media-type-structured-suffix). Example: text/plain+gzip;charset=utf-8. Usage of unregistered suffixes is not recommended "given the possibility of conflicts with future suffix definitions" (https://www.rfc-editor.org/rfc/rfc6838.html#section-4.2.8).
- The usage of one's own Media Type (https://en.wikipedia.org/wiki/Media_type#Registration_trees, https://www.rfc-editor.org/rfc/rfc6838.html#section-3.1).
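The suffix idea can be sketched in a few lines. This is a hypothetical illustration in Python rather than C#, and the helper names with_gzip_suffix and split_gzip_suffix are invented here; they are not part of RavenDB or any client library. The sketch only manipulates the content type string and leaves the compression itself aside.

```python
from typing import Tuple

GZIP_SUFFIX = "+gzip"  # one of the registered structured syntax suffixes


def with_gzip_suffix(content_type: str) -> str:
    """Append +gzip to the media type, keeping any parameters (e.g. charset)."""
    media_type, sep, params = content_type.partition(";")
    return f"{media_type.strip()}{GZIP_SUFFIX}{sep}{params}"


def split_gzip_suffix(content_type: str) -> Tuple[str, bool]:
    """Return (content type without the suffix, whether +gzip was present)."""
    media_type, sep, params = content_type.partition(";")
    media_type = media_type.strip()
    if media_type.lower().endswith(GZIP_SUFFIX):
        stripped = media_type[: -len(GZIP_SUFFIX)]
        return f"{stripped}{sep}{params}", True
    return content_type, False


print(with_gzip_suffix("text/plain;charset=utf-8"))
print(split_gzip_suffix("text/plain+gzip;charset=utf-8"))
print(split_gzip_suffix("image/png"))
```

A reader of the attachment would call split_gzip_suffix on the stored ContentType and decompress only when the flag comes back true.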
I could even imagine myself creating IAttachmentsSessionOperations extension methods called Store{/Get}CompressedText that would wrap a stream in a {de}compressing stream and would construct{/parse} a ContentType string.
Sorry, I forgot to clarify that the hypothetical EligibleForCompression property would be set by the user when storing an attachment.
Milosz,
Yes, technically speaking you can reuse document compression in RavenDB to do cross-text compression. Not something that I had actually thought of, but it would work.
In general, EligibleForCompression is the same as just sending gzip (or zstd, etc.) values; there is no need to get anything inside RavenDB involved.
Sure, it wouldn't be very helpful if it just compressed the attachment locally. What I meant is that with EligibleForCompression it would behave like the smart compression of values in documents that you presented as the very first approach in the previous post (also described here: https://ravendb.net/articles/ravendb-5-0-features-smart-document-compression).
But then again, I totally understand that the gains here are probably negligible compared to the cost of implementing and owning the feature.
We just added 2 extension methods to IAttachmentsSessionOperations, StoreGzipped and TryGetGzipped, for the (very) few cases where we do store some larger text files. StoreGzipped tries to gzip the content and keeps the original when the gzipped result is larger; TryGetGzipped first checks for the binary marker to determine whether the content was gzipped:
bool IsGzipCompressed(byte[] data) => data.Length > 1 && data[0] == 0x1F && data[1] == 0x8B;
This also meant that we didn't have to gzip all the existing documents and could just keep the same logic. When you really need the attachment directly from the database, you can just download it, add .gz to the filename, and use WinRAR or any other tool to decompress it.
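For readers who want to try the same logic outside of .NET, here is a minimal sketch of the approach described in this comment, written in Python. The names pack and unpack are hypothetical and work on plain byte strings; nothing here is RavenDB API, and the actual StoreGzipped / TryGetGzipped implementation is not shown in the comment.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the same two bytes IsGzipCompressed checks for


def is_gzip_compressed(data: bytes) -> bool:
    """True when the payload starts with the gzip marker."""
    return len(data) > 1 and data[:2] == GZIP_MAGIC


def pack(data: bytes) -> bytes:
    """Gzip the payload, but keep the original when gzip does not shrink it."""
    compressed = gzip.compress(data)
    return compressed if len(compressed) < len(data) else data


def unpack(stored: bytes) -> bytes:
    """Decompress only when the gzip marker is present."""
    return gzip.decompress(stored) if is_gzip_compressed(stored) else stored


large_text = b"a line of log output that repeats a lot\n" * 1000
tiny_text = b"hi"

stored_large = pack(large_text)  # stored gzipped
stored_tiny = pack(tiny_text)    # stored as-is, gzip would only add overhead

print(len(large_text), len(stored_large), is_gzip_compressed(stored_large))
print(len(tiny_text), len(stored_tiny), is_gzip_compressed(stored_tiny))
```

One caveat of marker-based detection: a payload that is itself a gzip file and ends up stored as-is will be decompressed on read, so this fits best when the stored content is known to be text.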
Steve,
Awesome that this is that easy to integrate.