Automatic subscription retries with RavenDB
RavenDB’s subscriptions give you the ability to run batch processing easily and robustly. In other words, you specify a query and subscribe to its results, and RavenDB will send you all the documents matching the query. So far, that is pretty obvious, but what is important about subscriptions is that they keep sending you results. As long as your subscription is open, you’ll get any changed document that matches your query. That gives you a great way to implement event pipelines and batch processes, and in general opens up some interesting options.
In this case, I want to talk about how to handle failures with subscriptions. Not failure in the sense of a server going down or a client crashing; these are already handled by the subscription mechanism itself. A server going down will cause the cluster to change the ownership of the subscription, and your client code will not even notice. A client going down can either fail over to another client, or, upon restart, the client will pick up right from where it left off. No, all of that is handled.
What requires attention is what happens if there is an error during the processing of a batch of documents. Imagine that we want to do some background processing. We could do that in many ways, such as introducing a queuing system and a task queue, but in many cases the overhead of that is quite high. A simpler approach is to just write the tasks out as documents and use a subscription to process them. In this case, let’s imagine that we want to send emails. A subscription will run over the EmailToSend collection, doing whatever processing is required to actually send each email. Once we are done processing a batch, we’ll delete all the items that we processed. Whenever there are new emails to send, the subscription will get them for us immediately.
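For concreteness, a task document in the EmailToSend collection might look something like the sketch below. The field names (To, Subject, Body) are illustrative, not anything prescribed by RavenDB; only the @metadata / @collection structure follows RavenDB's actual document conventions.

```python
# Hypothetical shape of a task document in the EmailToSend collection.
# Field names other than @metadata/@collection are made up for this example.
email_to_send = {
    "Id": "EmailToSend/1-A",           # document id
    "To": "user@example.com",
    "Subject": "Welcome!",
    "Body": "Hello from the batch pipeline.",
    "@metadata": {
        "@collection": "EmailToSend",  # the collection the subscription queries
    },
}
```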
But what happens if there is a failure to send one particular email in a batch? Well, we can ignore it (and not delete the document), but that will require some admin involvement to resolve. Subscriptions will not revisit documents that they have already seen, except if those documents were changed. Here is one way to handle this scenario:
In short, we’ll try to process each document, sending the email, etc. If we fail to do so, we’ll not delete the document; instead, we’ll patch it to increment a Retries property in its metadata. This operation has two interesting effects. First, it means that we can keep track of how often we retried a particular document. Second, as a side effect of modifying the document, we’ll get it back in the subscription again. In other words, this piece of code will give a document 5 retries before it gives up.
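Since the snippet itself is not reproduced here, the following is a minimal sketch of that logic, simulated in plain Python rather than the RavenDB client API: documents are dicts, "patching" just mutates the in-memory metadata, and send_email / MAX_RETRIES are illustrative stand-ins.

```python
# Sketch of the retry pattern described above, simulated without a server.
# In the real code, deleting and patching go through the RavenDB session;
# here they are modeled as list bookkeeping and dict mutation.

MAX_RETRIES = 5

def send_email(doc):
    """Stand-in for the real email-sending work; may raise on failure."""
    if doc.get("fail"):
        raise RuntimeError("transient SMTP failure")

def process_batch(docs):
    """Process one subscription batch: delete successes, re-queue failures.

    Returns (deleted_ids, retried_ids, abandoned_ids)."""
    deleted, retried, abandoned = [], [], []
    for doc in docs:
        meta = doc.setdefault("@metadata", {})
        try:
            send_email(doc)
        except Exception:
            retries = meta.get("Retries", 0) + 1
            if retries > MAX_RETRIES:
                # Give up: leave the document alone for an admin to inspect.
                abandoned.append(doc["Id"])
            else:
                # "Patching" the document changes it, so the subscription
                # will hand it to us again in a later batch.
                meta["Retries"] = retries
                retried.append(doc["Id"])
            continue  # do NOT delete a document we failed to process
        deleted.append(doc["Id"])
    return deleted, retried, abandoned
```

Because every failed document is modified, the subscription delivers it again in a later batch; once the retry count is exhausted, the document is simply left in place for an admin to look at.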
As an admin, you can then peek into your database, see all the documents that have exceeded the allowed retries, and decide what to do with them. But anything that failed because of some transient failure will just work.
Comments
Wow, I just love the simplicity of this code. It's extremely powerful and dead simple. Amazing!
Love this! I've been doing manual task queues to accomplish basically the same thing. This looks way simpler.
It'd be nice to have a back-off/cool-down retry (longer delays between retries) but perhaps that's getting too complex for our simple scenario here.
I'm wondering here if this will result in immediate retries. The example with sending mails is quite good here, because it introduces external dependencies that may have transient errors. Sending a mail may just be unable to contact the relay / smarthost. Normally it would be useful for the first retry to happen after a few seconds, the second after a few minutes, and so on.
Can this be handled here?
Daniel, That is correct, this will cause immediate retries. To handle this, you can set a LastRetriedAt value and keep punting those documents until a certain timeout has passed.

The session.Delete(item.Id); is after the try/catch and will therefore delete items which should be patched? Is this intended or a bug? I would have written the delete directly after line 18.

Fabian, Note that there is a continue in line 38 that handles this.

Ah, I missed that. Now it makes sense!
Oren, wouldn't this create a loop scenario where the system is aggressively polling? The faster RavenDB is, the more load will be induced in this scenario.
Daniel, Yes, that is a concern. That is why I said you might want to punt it. Put it on the side and trigger an update every $TIME to re-process.
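To make the back-off idea from this thread concrete, here is one possible sketch. This is again plain Python simulating the logic, not the RavenDB client API; LastRetriedAt, Retries, and the specific delays are all assumptions, following the names suggested in the comments above.

```python
import datetime

# First retry after 30s, then doubling: 30s, 60s, 120s, ...
BASE_DELAY = datetime.timedelta(seconds=30)

def due_for_retry(doc, now):
    """True if enough time has passed since the last attempt.

    LastRetriedAt and Retries live in the document metadata, just like
    the Retries counter in the patch-based approach from the post."""
    meta = doc.get("@metadata", {})
    last = meta.get("LastRetriedAt")
    if last is None:
        return True  # never tried before, send immediately
    delay = BASE_DELAY * (2 ** (meta.get("Retries", 1) - 1))
    return now - last >= delay
```

A worker would call due_for_retry for each document in the batch and punt (skip without deleting or patching) anything whose delay has not yet elapsed, relying on a later pass to pick it up again.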