Running RavenDB on burstable cloud instances

time to read 8 min | 1502 words

One of the quiet side effects of moving to the cloud is that it makes the relationship between performance and dollar costs evident. In your own data center, it is easy to lose track of the connection between the yearly DC budget and the application's performance. On the cloud, you can track it quite easily.

That immediately leads people to try to optimize their costs and use the minimal amount of resources they can get away with. This is great for people who believe in efficient software.

One of the results of the focus on price and performance has been the introduction of burstable cloud instances. These instances let you host machines that do not need the full machine resources to run effectively. There are many systems whose normal runtime cost is minimal, with only occasional spikes. In AWS, you have the T series, and Azure has the B series. Given how often RavenDB is deployed on the cloud, it shouldn’t surprise you that it is frequently run on such instances. The cost savings can be quite attractive, around 15% in most cases. And in the case of small instances, that can be even more impressive.

RavenDB can run quite nicely on a Raspberry Pi or a resource starved container, and for many workloads, that makes sense. Looking at AWS in this case, consider the t3a.medium instance with 2 cores and 4 GB at 27.4$ / month vs. the a1.large (the smallest non-burstable instance) with the same specs (but an ARM machine) at 37.2$ per month. For that matter, a t3a.small with 2 cores and 2 GB of memory is 13.7$. As you can see, the cost savings add up, and it makes a lot of sense to want to use the most cost effective solution.

Enter the problem with burstable instances. They are bursty. That means you have two options when you need more resources. Either you end up with throttling (reducing the amount of CPU that you can use) or you are charged for the extra CPU power you used.

Our own production environment, running all of our sites and backend systems, runs on a cluster of 3 t3.large instances. As I mentioned, the cost savings are significant. But what happens when you go above the limit? In our production systems, for the past 6 months, we have never had an instance where RavenDB used more than the provided burstable performance. It helps that the burstable machines allow us to accrue CPU credits when we aren’t using them, but overall it means that we are doing a good job of handling requests efficiently. Here are some metrics from one of the nodes in the cluster during a somewhat slow period (the range is usually 20 – 200 requests / sec).

[Image: request rate, latency, and resource utilization metrics from one of the cluster nodes]

So we are doing pretty well in terms of latency and resource utilization, which is great.

However, the immediate response to seeing that we aren’t hitting the system limits is… to reduce the machine size again, to pay even less. There is a lot of material on cost optimization in the cloud that you can read, but that isn’t the point of this post. One of the more interesting choices you can make with burstable instances is to ask not to go over the limit. If your system is using too much CPU, the excess is simply taken away, and the work takes however long it takes. Some workloads are just fine with this, because there is no urgency. Other workloads, such as servicing requests, are less forgiving of this sort of setup. Regardless, this is a common enough deployment model that we have to take it into account.

Let’s consider the worst case scenario from our perspective. We have a user that runs the cheapest possible instance, a t3a.nano with 2 cores and 512 MB of RAM, costing just 3.4$ a month. The caveat with this instance is that you have just 6 CPU credits / hour to your name. A CPU credit is basically 100% of one CPU for 1 minute. Another way of looking at this is that the t3a.nano instance has a budget of 360 CPU seconds per hour. If it uses more than that, it is charged a hefty fee (about ten times the hourly cost of the machine). So we have users that disable the ability to go over the budget.
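To put that budget in concrete terms, here is the arithmetic as a tiny Python sketch (the figures are the AWS numbers quoted above; the snippet itself is purely illustrative):

```python
# Back-of-the-envelope CPU budget for a t3a.nano (figures quoted above).
credits_per_hour = 6           # accrual rate for a t3a.nano
seconds_per_credit = 60        # one credit ~= one vCPU at 100% for a minute

cpu_seconds_per_hour = credits_per_hour * seconds_per_credit
print(cpu_seconds_per_hour)            # 360 CPU-seconds per hour
print(cpu_seconds_per_hour / 3600)     # 0.1 -> a sustained 10% of a single core
```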

Now, let’s consider what is going to happen to such an instance when it hits an unexpected load. In the context of RavenDB, it can be that you created a new index on a big collection. But something seemingly simple, such as generating a client certificate, can also have a devastating impact on such an instance. RavenDB generates client certificates with 4096-bit keys. On my machine (a nice, powerful dev machine), that can take anywhere from 300 – 900 ms of CPU time and cause a noticeable spike in one of the cores:

[Image: CPU spike on one core while generating a client certificate]

On the nano machine, we have measured key creation time of over six minutes.
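If you want a feel for how expensive this operation is on your own hardware, here is a minimal sketch that times a 4096-bit RSA key generation with Python's cryptography package. This only illustrates the cost of the operation; it is not how RavenDB generates its certificates internally:

```python
import time
from cryptography.hazmat.primitives.asymmetric import rsa  # pip install cryptography

start = time.process_time()          # measure CPU time, not wall clock
key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
elapsed = time.process_time() - start
print(f"4096-bit RSA key generated in {elapsed:.2f} CPU seconds")
```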

The answer isn’t that my machine is 800 times more powerful than the cloud machine. The problem is that this takes enough CPU time to consume all the available credits, which causes throttling. At this point, we are in a sad state. Any additional CPU credits (and time) that we earn go directly to the already running computation. That means that any other CPU time is pushed back. Way back. This happens at a level below the operating system (at the hypervisor), so the OS isn’t even aware of it. What is happening from the point of view of the OS is that the CPU is suddenly much slower.

All of this makes sense and is expected, given the burstable nature of the system. But what is the observed behavior?

Well, the instance appears to be frozen. All the cloud metrics will tell you that everything is green, but you can’t access the system (no CPU to accept new connections), you can’t SSH into it (same), and if you have an existing SSH connection, it appears to have frozen. Measuring performance from inside the machine shows that everything is cool. One CPU is busy (expected, generating the certificate, doing indexing, etc.). The other is idle. But the system behaves as if it has no CPU available, which is exactly what is going on, except in a way that we can’t see.

RavenDB goes to a lot of trouble to be a nice citizen and a good neighbor. This includes scheduling operations so the underlying OS will always have the required resources to handle additional load. For example, background tasks are run with lowered priority, etc. But when running on a burstable machine that is being throttled… well, from the outside it looked like certain trivial operations would just take the entire machine down, and it wouldn’t be recoverable short of a hard reboot.
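To illustrate the general idea of giving background work a lower priority, here is a small sketch using Unix process niceness from Python. RavenDB is a .NET server and uses its own scheduling mechanisms, so treat this strictly as an analogy rather than its actual implementation:

```python
import os
from multiprocessing import Process

def background_batch_work():
    # Raise this worker's niceness so foreground request handling wins
    # whenever the CPU is contended (Unix-only).
    os.nice(10)
    # Stand-in for batch work such as indexing a large collection.
    total = sum(i * i for i in range(10_000_000))
    print("background work done:", total)

if __name__ == "__main__":
    worker = Process(target=background_batch_work)
    worker.start()
    worker.join()
```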

Even when you know what is going on, it doesn’t really help you, because from inside the machine there is no way to access the cloud metrics with enough precision to take action.

We have a pretty strong desire not to get into this kind of situation, so we implemented what I can only refer to as countermeasures. When you are running on a burstable instance, you can let RavenDB know what your CPU credits situation is, at which point RavenDB will actively monitor the machine and compute its own tally of the outstanding CPU credits. When we notice that the CPU credits are running short, we’ll start proactively halting internal processes to give the machine room to recover from the CPU credits shortage.
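To give a rough sense of what keeping such a tally could look like, here is a simplified sketch that tracks a credit balance from the accrual rate and the process's measured CPU time. The names, thresholds, and the psutil-based measurement are all assumptions for the sake of the example; RavenDB's actual implementation is more involved:

```python
import time
import psutil  # third-party package: pip install psutil

CREDITS_PER_HOUR = 6      # accrual rate for the instance type you configured
SECONDS_PER_CREDIT = 60   # one credit ~= one vCPU-minute
LOW_WATERMARK = 1.0       # hypothetical threshold for shedding background work

def pause_background_work():
    # Hypothetical hook: pause indexing, move ETL and backups elsewhere, etc.
    print("CPU credits running low - pausing background work")

def run_credit_watchdog(initial_balance, interval=10):
    balance = initial_balance
    proc = psutil.Process()
    times = proc.cpu_times()
    last_cpu = times.user + times.system
    while True:
        time.sleep(interval)
        times = proc.cpu_times()
        now_cpu = times.user + times.system
        used_credits = (now_cpu - last_cpu) / SECONDS_PER_CREDIT
        earned_credits = CREDITS_PER_HOUR * interval / 3600
        balance += earned_credits - used_credits
        last_cpu = now_cpu
        if balance < LOW_WATERMARK:
            pause_background_work()
```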

We’ll move ETL processes and ongoing backups to other nodes in the cluster and even pause indexing in order to recover the CPU time we need. Here is an example of what you’ll see in such a case:

[Image: an example of the notification RavenDB raises when it pauses work due to a CPU credits shortage]

One of the nice things about the kind of promises RavenDB makes about its indexes is that it can make these kinds of decisions without impacting the guarantees that we provide.

If the situation is still bad, we’ll start proactively rejecting requests. The idea is that instead of getting to the point where we are being throttled (and effectively appear down to the outside world), we’ll handle a request by simply sending a 503 Service Unavailable response back. This is very cheap to do, so it won’t put additional strain on our limited budget. At the same time, the RavenDB client treats this kind of error as a trigger for failover and will use another node to service the request.
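The client-side behavior can be approximated like this: a plain HTTP sketch in Python that skips to the next node on a 503 response. The node URLs and function are hypothetical; the real RavenDB client libraries do this for you, along with topology updates and smarter retry rules:

```python
import requests  # pip install requests

# Hypothetical cluster topology - substitute your own node URLs.
NODE_URLS = [
    "https://a.example-cluster.net",
    "https://b.example-cluster.net",
    "https://c.example-cluster.net",
]

def get_with_failover(path):
    last_error = None
    for node in NODE_URLS:
        try:
            response = requests.get(node + path, timeout=5)
            if response.status_code == 503:
                # The node is shedding load to protect its CPU credits - try the next one.
                last_error = RuntimeError(f"{node} returned 503")
                continue
            response.raise_for_status()
            return response
        except requests.RequestException as err:
            last_error = err
    raise RuntimeError("all nodes were unavailable") from last_error

# usage: get_with_failover("/some/endpoint")
```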

This was a pretty long post to explain that RavenDB is going to give you a much nicer experience on burstable machines, even if you don’t allow them to burst beyond their CPU credits.