RavenHQ Outage: What happened and what WILL happen
“We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area,” Amazon reported at 8:30 pm Pacific time on Friday the 29th.
Among those servers were the RavenHQ-web-1 server (responsible for serving the www.RavenHQ.com marketing site) and RavenHQ-DB-1 (responsible for all the databases located on 1.ravenhq.com).
The cause was apparently a big storm that hit the US-East-1 region (the Virginia data center). The data center lost power (and no generators were up, somehow) for about 30 minutes, which caused an extended outage for some of our customers. Small comfort, but we were in the same boat as Netflix, Heroku, Pinterest, and Instagram, among others.
Before we dig any deeper: no data was lost, and we resumed normal operations within a few hours. Users with replicated plans had no interruption of service.
During the outage, we were in the process of bringing up a new node with all of the databases from the impacted servers, but that would have entailed customers having to change connection strings, and the outage was resolved before we got to that point.
I want to show you how RavenHQ is architected from a physical standpoint. What you can see isn’t actually the servers (those are fairly dynamic) but it is enough for you to get the picture.
In particular, both RavenHQ-web-1 and RavenHQ-db-1 were in the US-East-1 region, and were impacted by this issue. The good news is that the rest of our servers were located in different availability regions and were not impacted.
This was also a good stress test (which we could have done without, thank you very much, Amazon’s non-working generators) for our HA scenario. None of the customers on replicated plans experienced much of a problem: we had an automatic failover to the secondary server (and, depending on your plan, if that had gone down as well, the failover would have gone to the tertiary server, etc). We actually have a customer with a 4-way master/master replicated plan, so he is super safe.
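To make the failover order concrete, here is a minimal sketch of the idea in Python. It is not RavenHQ's actual implementation: the `with_failover` helper, the `AllReplicasDown` exception, and the `example.invalid` host names are all hypothetical, and the sketch simply assumes that a request to an unreachable server raises `ConnectionError`.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllReplicasDown(Exception):
    """Raised when every server in the failover list has failed."""


def with_failover(servers: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful result.

    `servers` is ordered by preference: primary, secondary, tertiary, ...
    `request` is any callable that takes a server URL and raises
    ConnectionError when that server cannot be reached (an assumption
    of this sketch, not a statement about any real client library).
    """
    errors: List[Tuple[str, Exception]] = []
    for url in servers:
        try:
            return request(url)
        except ConnectionError as exc:
            # This node is unreachable; remember why and try the next one.
            errors.append((url, exc))
    raise AllReplicasDown(errors)


# Hypothetical replica list for a replicated plan.
SERVERS = [
    "https://primary.example.invalid",
    "https://secondary.example.invalid",
    "https://tertiary.example.invalid",
]


def fake_request(url: str) -> str:
    """Stand-in for a real database call: the primary has lost power."""
    if url == SERVERS[0]:
        raise ConnectionError("primary lost power")
    return f"served by {url}"


result = with_failover(SERVERS, fake_request)
```

With the primary down, the call falls through to the secondary. Only when every server in the list fails does the caller see an error, which is why each additional replica reduces the chance of an outage being visible to the application.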
Unfortunately, it also means that any customer who wasn’t on a replicated plan and was located on a US-East-1 server felt the impact. That included a lot of the free plan customers, since those are predominantly located in that region, as well as a number of paying customers.
We are sorry for that, and we all understand the need to balance between “what happens if” and “what does it cost”. As a result, we are going to offer all existing customers a 25% discount on all replicated plans for the next 6 months. Just contact us and ask for an upgrade to the replicated plan, and we will set it up for you.
One of the things that we are trying to do in RavenHQ is to really commoditize the notion of a database that is just there, one that you don’t have to worry about. Going with the replicated plans is probably the best way to go about doing that, since your data is going to live on at least two physically remote servers, and we have auto failover ready to pop in and support your application if there are any issues. That is why we are offering the discount, as a way to make it even more affordable to go into High Availability mode.
Speaking of High Availability, I should probably talk about what happened to www.ravenhq.com, and the answer is fairly simple. The cobbler's children go barefoot. We spent a lot of time designing and building RavenHQ to be sustainable in the face of outages, but we focused all of our attention on the actual production instances and didn’t really pay any mind to www.ravenhq.com. As far as we were concerned, this was a marketing site, and it was a low priority for our HA story.
Unfortunately, when we actually had an outage, people really freaked out because www.ravenhq.com was down. Even though we had the actual database servers up and running, the website being down gave the impression that all of RavenHQ was down, which was decidedly not the case.
Lessons learned
- We need to encourage customers to go to the replicated plans by default. They are more expensive, yes, but they are also safer.
- We need a better process in the case of outages, to move databases from failed nodes to new ones, and inform customers about this change.
- More parts of our actual core infrastructure need to be HA. In particular, we need to:
  - Make sure that authentication works when core servers are missing (this wasn’t a problem in this particular case, but our investigation revealed that it could be, so we need to solve that).
  - Ensure that www.ravenhq.com is fully replicated.
  - Create a /status page, where you can look at the status of the various servers and see how they are behaving.
The last two are more for peace of mind than any real production need, but any cloud service runs on trust, and we think that adding those would ensure that if there are any problems in the future, we would be able to provide you with better service.
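As a rough illustration of what could sit behind a /status page, here is a small Python sketch. It is an assumption-laden example, not a description of what RavenHQ will ship: the function names `http_probe`, `collect_status`, and `overall`, the three summary labels, and the server names and URLs are all hypothetical. It treats a server as "up" when an HTTP GET returns a 2xx response within a timeout.

```python
import urllib.request
from typing import Callable, Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers an HTTP GET with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def collect_status(
    servers: Dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> Dict[str, str]:
    """Probe every server and report "up" or "down" per server name."""
    return {name: ("up" if probe(url) else "down") for name, url in servers.items()}


def overall(status: Dict[str, str]) -> str:
    """Summarise per-server results into a single headline state."""
    states = set(status.values())
    if states == {"up"}:
        return "operational"
    if "up" in states:
        return "degraded"
    return "down"


# Hypothetical server list; the probe is faked so the example needs no network.
servers = {
    "web-1": "https://web-1.example.invalid/",
    "db-1": "https://db-1.example.invalid/",
    "db-2": "https://db-2.example.invalid/",
}
status = collect_status(servers, probe=lambda url: "db-1" not in url)
summary = overall(status)
```

Keeping the probe injectable means the same code can be exercised without touching the network. It also means the checker can run from any host, including one that shares nothing with the servers it watches.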
Comments
Love the idea of the status page with a dashboard to all servers, etc.
:)
(now i just need to finally get a fricking hobby project out of the door and out of R&D :( )
Hey Ayende,
Just a note about the /status page, usual practice is to have it on a separate web host and on status.mydomain.com
This is to handle the case where (eg, Amazon itself) goes down again, but your status host is still online as they have nothing in common with Amazon etc.