The case for size limits vs. number limits vs. time limits

This post was written because of this tweet, asking whether RavenDB has something similar to MongoDB’s capped collections. RavenDB doesn’t have this feature, by design. It is something that would be trivial to add and would likely lead to horrible consequences down the line.

The basic idea with a capped collection is that you set a limit on the size of the collection, and for any write beyond that limit, you delete the oldest documents in the collection to maintain that limit. You can also specify a maximum number of elements, but you must always specify the size (in bytes) as well. Redis has a similar feature, with capped lists, but that feature only counts the number of items, not their size. Both MongoDB and Redis have the notion of expiring data (which is, incidentally, how RavenDB would handle a similar case).
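To make the eviction behavior concrete, here is a minimal in-memory sketch of the idea. This is a toy model, not how MongoDB or Redis implement it; the `CappedCollection` class and its parameter names are hypothetical, and it measures size as the raw length of each document in bytes, ignoring any storage overhead.

```python
from collections import deque


class CappedCollection:
    """Toy model of a capped collection (hypothetical, for illustration only)."""

    def __init__(self, max_bytes, max_docs=None):
        self.max_bytes = max_bytes    # the size limit is always required
        self.max_docs = max_docs      # the count limit is optional
        self.docs = deque()           # oldest document sits on the left
        self.total_bytes = 0

    def _over_limit(self):
        too_big = self.total_bytes > self.max_bytes
        too_many = self.max_docs is not None and len(self.docs) > self.max_docs
        return too_big or too_many

    def insert(self, doc: bytes):
        self.docs.append(doc)
        self.total_bytes += len(doc)
        # Evict the oldest documents until both limits hold again.
        while self.docs and self._over_limit():
            oldest = self.docs.popleft()
            self.total_bytes -= len(oldest)


log = CappedCollection(max_bytes=10)
for doc in [b"aaaa", b"bbbb", b"cccc"]:
    log.insert(doc)

remaining = list(log.docs)  # the oldest document, b"aaaa", has been evicted
```

Note that the write itself is what triggers the delete. Nothing asks whether anyone has read the oldest document yet.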

I can follow the reasoning behind having a limit based on the number of items, even if I think that in most cases that is going to be an issue. But using the overall size of the data for the limit is a recipe for disaster. You need to plan ahead for the capacity you’ll get, and in that case, the actual size of the data isn’t really meaningful (assuming that you aren’t going to run out of disk space). In fact, even a small change in the access patterns to your system can cause you to lose documents because they have reached the limit in the capped collection.

For example, if you use a capped collection to process events (which seems like a fairly common scenario), you’ll set the maximum size of the collection as a factor of how quickly you can process all the events in the collection, plus some padding. But if your event processing is running a bit slow (or has just plain crashed), you are racing against the clock to keep up with the incoming events before they are removed by the size limit.

A change in the usage pattern can also be that users are sending more data. Instead of sending 140 chars, they are now sending 280 chars, to keep the example focused on Twitter. That can really mess up any calculation you made based on expected sizing.
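A bit of back-of-the-envelope arithmetic shows how fragile this is. The sketch below uses made-up numbers (a 140 MB cap and 100 documents per second are assumptions, not measurements), treats one char as one byte, and ignores per-document overhead. The helper names are hypothetical.

```python
def retained_docs(max_bytes, avg_doc_bytes):
    """How many documents fit under the size cap."""
    return max_bytes // avg_doc_bytes


def retention_window_seconds(max_bytes, avg_doc_bytes, docs_per_second):
    """How long a document survives before newer writes push it out."""
    return retained_docs(max_bytes, avg_doc_bytes) / docs_per_second


CAP_BYTES = 140_000_000   # assumed size limit for the collection
WRITE_RATE = 100          # assumed incoming documents per second

window_140 = retention_window_seconds(CAP_BYTES, 140, WRITE_RATE)  # 10,000 seconds
window_280 = retention_window_seconds(CAP_BYTES, 280, WRITE_RATE)  # 5,000 seconds

# A processing outage of 90 minutes is survivable with 140-byte documents,
# but loses events once the average document doubles in size.
outage_seconds = 90 * 60
survives_at_140 = outage_seconds < window_140
survives_at_280 = outage_seconds < window_280
```

Nothing in the configuration changed between the two cases. The only difference is what users chose to send, and that was enough to halve the time you have to recover.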

In practice, you will usually set the maximum number of elements and set the maximum size of the collection as something that is very high, hoping to never hit that limit.

I mentioned that I can follow the logic behind having a collection that is capped to a certain number of items. I can follow the logic, but I disagree with it. In most scenarios when you need this feature, you are looking at stuff like “I would like to retain the last 100 transactions” or something along those lines. The problem is that in the business world, you almost never think in this way. You don’t think in counts, you think in time. What you’ll likely be asked is: “I would like to retain last month’s transactions”. And if you have fewer than 100 transactions a month, these two options end up being very close to one another.

The way we handle it with RavenDB is through document expiration. You can specify a certain point in time when a document will be automatically deleted and RavenDB will take care of that for you. We also have a way to globally disable this feature, for when you need to extend the expiration for some reason. Working on top of time means that the feature is more closely aligned with the actual business requirements and that you don’t throw away perfectly good data too soon.
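Here is a rough sketch of what time-based expiration with a global off switch looks like as a model. This is not the RavenDB client API; `ExpiringStore`, `put`, and `purge` are hypothetical names for an in-memory toy. In RavenDB itself, the expiration time is carried in the document's `@expires` metadata property and the server does the cleanup.

```python
from datetime import datetime, timedelta, timezone


class ExpiringStore:
    """Toy model of per-document expiration (hypothetical, not RavenDB's API)."""

    def __init__(self):
        self.docs = {}                  # doc_id -> (document, expires_at or None)
        self.expiration_enabled = True  # global switch for the cleanup pass

    def put(self, doc_id, document, expires_at=None):
        self.docs[doc_id] = (document, expires_at)

    def purge(self, now):
        """Delete documents whose expiration time has passed; return their ids."""
        if not self.expiration_enabled:
            return []
        expired = [
            doc_id
            for doc_id, (_, expires_at) in self.docs.items()
            if expires_at is not None and expires_at <= now
        ]
        for doc_id in expired:
            del self.docs[doc_id]
        return expired


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
store = ExpiringStore()
store.put("tx/1", {"amount": 10}, expires_at=now - timedelta(days=1))
store.put("tx/2", {"amount": 20}, expires_at=now + timedelta(days=29))

store.expiration_enabled = False
removed_while_disabled = store.purge(now)  # nothing is deleted while disabled

store.expiration_enabled = True
removed = store.purge(now)                 # only the document past its time goes
```

The point of the design is that a document's lifetime depends only on the clock and on a decision you made, not on how many other documents happened to be written after it.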