The case for size limits vs. number limits vs. time limits
This post was written because of this tweet, asking whether RavenDB has something similar to MongoDB’s capped collections. RavenDB doesn’t have this feature, by design. It would be trivial to add, but it would likely lead to horrible consequences down the line.
The basic idea with a capped collection is that you set a limit on the size of the collection, and for any write beyond that limit, the oldest documents in the collection are deleted to maintain that limit. You can also specify a maximum number of elements, but you must always specify the size (in bytes). Redis has a similar feature with capped lists, but that feature only counts the number of items, not their size. Both MongoDB and Redis have the notion of expiring data (which is, incidentally, how RavenDB would handle a similar case).
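To make the mechanics concrete, here is a minimal sketch of both approaches, assuming local MongoDB and Redis instances and using the pymongo and redis-py clients (the collection and key names are just placeholders):

from pymongo import MongoClient
import redis

# MongoDB capped collection: the size (in bytes) is mandatory,
# the maximum document count is optional.
mongo = MongoClient()  # assumes a local MongoDB instance
mongo.demo.create_collection(
    "events",
    capped=True,
    size=1024 * 1024,  # cap by total size: 1 MB
    max=1000,          # optional cap by document count
)

# Redis: the usual "capped list" idiom is push-then-trim,
# which caps by item count only, never by size.
r = redis.Redis()  # assumes a local Redis instance
r.lpush("events", '{"type": "user-created"}')
r.ltrim("events", 0, 999)  # keep only the newest 1,000 entries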
I can follow the reasoning behind having a limit based on the number of items, even if I think that in most cases it is going to be an issue. But using the overall size of the data for the limit is a recipe for disaster. You need to plan ahead for the capacity you’ll need, and for that, the actual size of the data isn’t really meaningful (assuming that you aren’t going to run out of disk space). In fact, even a small change in the access patterns of your system can cause you to lose documents because the capped collection has reached its size limit.
For example, if you use a capped collection to process events (which seems like a fairly common scenario), you’ll set the maximum size of the collection as a function of how quickly you can process all the events in the collection, plus some padding. But if your event processing is running a bit slow (or has plain crashed), you are racing against the clock to keep up with the incoming events before they are removed by the size limit.
A change in the usage pattern can also mean that users are sending more data: instead of sending 140 characters, they are now sending 280, to keep the example focused on Twitter. That can really mess up any calculation you made based on the expected sizing.
In practice, you will usually set the maximum number of elements and set the maximum size of the collection to something very high, hoping to never hit that limit.
I mentioned that I can follow the logic behind having a collection that is capped to a certain number of items. I can follow the logic, but I disagree with it. In most scenarios where you need this feature, you are looking at requirements like “I would like to retain the last 100 transactions” or something along those lines. The problem is that in the business world, you almost never think this way: you don’t think in counts, you think in time. What you’ll likely be asked is: “I would like to retain the last month’s transactions”. And if you have fewer than 100 transactions a month, these two options end up being very close to one another.
The way we handle it in RavenDB is through document expiration. You can specify a point in time at which a document will be automatically deleted, and RavenDB will take care of that for you. However, we also have a way to globally disable this feature when you need to extend the expiration for some reason. Working in terms of time means that the feature is more closely aligned with the actual business requirements, and that you don’t throw away perfectly good data too soon.
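As a rough sketch of how that looks, assuming the ravendb Python client, a hypothetical Order class, and the expiration feature enabled on the database (the exact session and metadata API may differ between client versions), the document carries an @expires timestamp in its metadata:

from datetime import datetime, timedelta, timezone
from ravendb import DocumentStore


class Order:  # hypothetical document class, for illustration only
    def __init__(self, customer, total):
        self.customer = customer
        self.total = total


store = DocumentStore(urls=["http://localhost:8080"], database="Demo")
store.initialize()

with store.open_session() as session:
    order = Order(customer="customers/1-A", total=42.0)
    session.store(order)

    # The server deletes the document once the @expires timestamp (UTC)
    # has passed, provided expiration is enabled for the database.
    expires_at = datetime.now(timezone.utc) + timedelta(days=30)
    metadata = session.advanced.get_metadata_for(order)
    metadata["@expires"] = expires_at.strftime("%Y-%m-%dT%H:%M:%S.%f0Z")

    session.save_changes()

The point of the sketch is the @expires metadata field: retention is expressed as a date, so a slow consumer or a change in document sizes never causes data to be deleted early.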
Comments
I can also follow the size reasoning, and I have seen it many times. But in all the cases I have seen, the size cap was a derivative of a time constraint. In other words, the business has said: "we just need the last month of data". Some analysis in between turned that into "how many elements do we have a month? No more than 100". So the developer heard "we need to keep up to 100 elements".
And in truth there is often going to be some paging implemented on the receiving side, so there is no harm in using the real "1 month" requirement with the additional requirement of not displaying more than 100 per page.
So I'm with you on that.
Using logging as a real-world example: Microsoft Application Insights only retains the last 90 days of results. It does have a daily data cap, which can easily be enforced with an incremental count. After reaching the data cap, Application Insights won't take any more requests. It won't delete any old data to make room for new data. If the data cap is an issue, the cap is too low; no data should be automatically deleted unless it is time-based.
One of the applications I maintain sets up a local logging file with a size limit. When a major issue happened, I lost a lot of information: the log only contained the period with the exception, not the data from before the exception happened.
For audit or change data capture, in the business world most of the time the rule is never delete; in other words, these are time-based. Number- or size-based retention will always run into trouble, where a massive change within a day ends up not traceable. The use case for audit and change data capture is to show what happened in the past; it is not only for diagnosis, but also for evidence. Will a lawyer's evidence pile auto-delete on a time basis? Will the police's unsolved murder cases be deleted based on time? Never.
I don't think storage is an issue nowadays; even with limitations, you can still use automation to move aged data to low-cost blob storage or file storage.
I'm with you on that too.
Looks like a data-mining vs. transaction-storage use case. When data mining, one doesn't care about trimming off some stuff. MongoDB seems to cater to that more than the legacy SQL platforms.
Ha, just a friendly comment: you might want to change "this twit" to "this tweet". It can be easily interpreted to look like you're calling @evntdrvn a twit.
I might have opinions on fixed size arrays but fixed size is a feature, not a bug :)
Matt, Thanks, I fixed the post.