Indexing only recent data - adventures with large datasets & archiving
We recently received a support request from a user describing the following issue:
We have an index that is using way too much disk space. We don’t need to search the entire dataset, just the most recent documents. Can we do something like this?
from d in docs.Events
where d.CreationDate >= DateTime.UtcNow.AddMonths(-3)
select new { d.CreationDate, d.Content };
The idea is that only documents from the past 3 months would be indexed, while older documents would be purged from the index but still retained.
The underlying problem is that this is a full-text search index, and the index data needed to support full-text search across the entire dataset takes more space than the documents themselves (which compress easily).
This is a great example of an XY problem. The request was to allow access to the current date during the indexing process so the index could filter out old documents. However, that is actually something that we explicitly prevent. The problem is that the current date isn’t really meaningful when we talk about indexing. The indexing time isn’t really relevant for filtering or operations, since it has no association with the actual data.
The date of a document and the time it was indexed are completely unrelated. I might update a document (and thus re-index it) whose CreationDate is far in the past, and that would filter it out of the index. However, a document that is never updated is never re-indexed, so it would stay in the index indefinitely, no matter how old it gets, because the filtering occurs only at indexing time.
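To make the mismatch concrete, here is a minimal sketch of what would happen if an index could filter on the current date. It is a toy in-memory simulation, not RavenDB code: `Event`, `ToyIndex`, and `IndexTimeFilterDemo` are hypothetical names invented for this illustration, and the only assumption carried over from a real index is that a document is re-indexed only when it changes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a document; not a RavenDB type.
public record Event(string Id, DateTime CreationDate);

// Toy index that applies the "last 3 months" filter at indexing time.
public class ToyIndex
{
    private readonly Dictionary<string, DateTime> _entries = new();

    // Runs only when a document is created or updated; `now` is the indexing time.
    public void IndexDocument(Event doc, DateTime now)
    {
        if (doc.CreationDate >= now.AddMonths(-3))
            _entries[doc.Id] = doc.CreationDate; // passes the filter: (re)add the entry
        else
            _entries.Remove(doc.Id);             // fails the filter: drop the entry
    }

    public bool Contains(string id) => _entries.ContainsKey(id);
}

public static class IndexTimeFilterDemo
{
    public static (bool UntouchedStillIndexed, bool UpdatedStillIndexed) Run()
    {
        var index = new ToyIndex();
        var created = new DateTime(2024, 1, 10, 0, 0, 0, DateTimeKind.Utc);
        var untouched = new Event("events/1", created);
        var updated = new Event("events/2", created);

        // Both documents are indexed shortly after creation and pass the filter.
        var firstRun = new DateTime(2024, 1, 15, 0, 0, 0, DateTimeKind.Utc);
        index.IndexDocument(untouched, firstRun);
        index.IndexDocument(updated, firstRun);

        // A year later only events/2 is modified, so only it is re-indexed.
        index.IndexDocument(updated, firstRun.AddYears(1));

        return (index.Contains(untouched.Id), index.Contains(updated.Id));
    }
}
```

Calling `IndexTimeFilterDemo.Run()` yields `(true, false)`: two documents with the same CreationDate end up with different index membership, purely because one of them happened to be touched later.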
Going back to the XY problem, what is the user trying to solve? They don’t want to index all data, but they do want to retain it forever. So how can we achieve this with RavenDB?
Data Archiving in RavenDB
One of the things we aim to do with RavenDB is ensure that we have a good fit for most common scenarios, and archiving is certainly one of them. In RavenDB 6.0 we added explicit support for Data Archiving.
When you save a document, all you need to do is add a metadata element, @archive-at, and you are set. For example, take a look at the following document:
{
"Name": "Wilman Kal",
"Phone": "90-224 8888",
"@metadata": {
"@archive-at": "2024-11-01T12:00:00.000Z",
"@collection": "Companies"
}
}
This document is set to be archived on Nov 1st, 2024. What does that mean?
From that day on, RavenDB will automatically mark it as an archived document, meaning it will be stored in a compressed format and excluded from indexing by default.
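For the original request, the archive date could be derived from the document's own CreationDate, for example three months after it. The following is a minimal sketch of computing that value and shaping the JSON, assuming the timestamps are already in UTC. `ArchiveMetadataSketch`, `FormatUtc`, and `BuildEvent` are hypothetical helpers made up for this example; the code only builds the JSON with System.Text.Json and does not talk to RavenDB, so how the document is actually saved is left out.

```csharp
using System;
using System.Globalization;
using System.Text.Json.Nodes;

public static class ArchiveMetadataSketch
{
    // Formats a UTC timestamp in the same shape as the "@archive-at" value above.
    // Assumption: the caller passes a value that is already in UTC.
    public static string FormatUtc(DateTime utc) =>
        utc.ToString("yyyy-MM-dd'T'HH:mm:ss.fff'Z'", CultureInfo.InvariantCulture);

    // Builds an event document scheduled for archiving `months` after its creation date.
    public static JsonObject BuildEvent(string content, DateTime creationDateUtc, int months = 3)
    {
        return new JsonObject
        {
            ["Content"] = content,
            ["CreationDate"] = FormatUtc(creationDateUtc),
            ["@metadata"] = new JsonObject
            {
                // The archive date depends only on the document's data, not on indexing time.
                ["@archive-at"] = FormatUtc(creationDateUtc.AddMonths(months)),
                ["@collection"] = "Events"
            }
        };
    }
}
```

Because the archive date is a function of the document itself, it stays the same no matter when or how often the document is re-indexed, which is the property the date-filtered index could not provide.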
In fact, this exact scenario is detailed in the documentation.
You can decide (on a per-index basis) whether to include archived documents in the index. This gives you a very high level of flexibility without requiring much manual effort.
In short, for this scenario, you can simply tell RavenDB when to archive the document and let RavenDB handle the rest. RavenDB will do the right thing for you.