The non expiring documents and the funky clock
Recently the time.gov site had a complete makeover, which I love. I don’t really have much to do with time in the US in the normal course of things, but the site has a really interesting feature.
Here is what this shows on my machine:
I love this feature because it showcases a real-world problem very easily. Time is hard. The concept we have in our heads about time is completely wrong in many cases, and that leads to interesting bugs. In this case, the second machine will be adjusted at midnight from the network and the clock drift will be fixed (hopefully).
What will happen to any code that runs when this happens? As far as it is concerned, time will move back.
RavenDB has a feature, document expiration. You can set a time for a document to go away. We had a bug which caused us to read the entries to be deleted at time T and then delete the documents that are older than T. Except that in this case, the T wasn’t the same. We travelled back in time (and the log was confusing) and got an earlier result. That meant that we removed the expiration entries but not their related documents. When the time moved forward enough again to have those documents expire, the expiration record was already gone.
As far as RavenDB was concerned, the documents were updated to expire in the future, so the expiration records were no longer relevant. And the documents never expired, ouch.
We fixed that by remembering the original time we read the expiration records. I’m comforted by knowing that we aren’t the only ones having to deal with it.
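To make the failure mode concrete, here is a minimal sketch in Python. It is a toy in-memory model, not RavenDB's actual code: `documents`, `expirations`, `cleanup_buggy`, `cleanup_fixed` and `make_jumping_clock` are all hypothetical names invented for the illustration. The buggy version reads the clock twice, the fixed version reads it once and reuses the value.

```python
from datetime import datetime, timedelta


def make_jumping_clock(*readings):
    """Hypothetical test clock: returns the readings in order, then repeats the last."""
    remaining = list(readings)

    def clock():
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return clock


def cleanup_buggy(documents, expirations, clock):
    # First clock read: find the expiration entries that are due.
    read_time = clock()
    due = [doc_id for doc_id, t in expirations.items() if t <= read_time]
    # Second clock read: if the clock was adjusted backwards in between,
    # the documents no longer look expired.
    delete_time = clock()
    for doc_id in due:
        if doc_id in documents and documents[doc_id] <= delete_time:
            del documents[doc_id]
        # The entry is removed even when the document survived.
        del expirations[doc_id]


def cleanup_fixed(documents, expirations, clock):
    # Read the clock once and use the same value for both steps.
    now = clock()
    due = [doc_id for doc_id, t in expirations.items() if t <= now]
    for doc_id in due:
        if doc_id in documents and documents[doc_id] <= now:
            del documents[doc_id]
        del expirations[doc_id]


if __name__ == "__main__":
    t = datetime(2020, 1, 1)
    for cleanup in (cleanup_buggy, cleanup_fixed):
        docs = {"a": t - timedelta(minutes=1)}
        exps = dict(docs)
        # The clock jumps back one hour between the first and second read.
        cleanup(docs, exps, make_jumping_clock(t, t - timedelta(hours=1)))
        print(cleanup.__name__, "documents:", docs, "expirations:", exps)
```

With a clock that jumps back an hour between reads, the buggy version leaves the document in place while dropping its expiration entry, which is the orphaned state described above. The single-read version removes both.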
Comments
Ha, been there, done that also... I fixed it by separating the expiration from deletion by implementing an algorithm which marks (sets an is_expired bit to 1) all expired entries in one tx and sweeps at leisure (something like do { delete top 5000 from entries order by created_date where is_expired = 1; sleep 5 } while (no. of deletes > 0) ) so that the perf impact of removing large numbers of entries is tunable and can be spread over time (nice if your create rate has big bursts and expiration then causes perf hiccups).
the is_expired bit has the benefit of not being time sensitive but the tradeoff is that there's no upper limit on the delay until actual removal if your removal rate is too low (but you can always check the is_expired bit)
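The mark-and-sweep idea from this comment could look roughly like the sketch below, written against SQLite through Python's standard `sqlite3` module. The table layout is an assumption: the comment only mentions `entries`, `created_date` and `is_expired`, so the `id` and `expires_at` columns and the function names `mark_expired` and `sweep` are made up for the example. SQLite is used only to keep the sketch self-contained; the commenter's database is not stated.

```python
import sqlite3
import time


def mark_expired(conn, cutoff):
    """Flag every entry expiring at or before `cutoff`, in a single transaction."""
    with conn:
        cur = conn.execute(
            "UPDATE entries SET is_expired = 1"
            " WHERE is_expired = 0 AND expires_at <= ?",
            (cutoff,),
        )
    return cur.rowcount


def sweep(conn, batch_size=5000, pause_seconds=5.0):
    """Delete flagged entries in batches, pausing between batches."""
    total = 0
    while True:
        with conn:
            # Delete the oldest flagged rows first, at most batch_size at a time.
            cur = conn.execute(
                "DELETE FROM entries WHERE id IN ("
                " SELECT id FROM entries WHERE is_expired = 1"
                " ORDER BY created_date LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        # Spread the deletion cost over time; tune to taste.
        time.sleep(pause_seconds)
```

Only `mark_expired` looks at a timestamp. Once a row is flagged, the sweep no longer depends on the clock, so a later backwards adjustment cannot un-expire it. Batch size and pause are the two knobs for trading removal delay against load.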