The story of the failed backup
This story started a few years ago, in a very non-technical setting. We changed the accountant that we use for Hibernating Rhinos. We outgrew the office we were using at the time and needed better services. Among the changes that were implemented as a result of this move was the usage of new accounting software. Nothing really that interesting, to be frank. I like that my accounting is boring. However, the new accounting software was an on-premise solution. In other words, we are the ones running it. Which is perfectly fine, we provisioned a VM in our data center (a fancy name for the single rack that we had at the time) and let it run.
As you can imagine, we consider our accounting data to be mission critical, so to speak. I don’t mind not being able to access it for an hour, for example, but losing it is going to be Bad. So we had a backup, nothing really that interesting. The backup went to a local disk on the VM and a remote disk in the office and, just to be safe, was also uploaded to S3. I asked one of our developers to take care of this, and aside from specifying that I wanted backups in triplicate, I didn’t really pay attention. That was around 2017, I believe. I made sure that if the backup failed, we would get notified of that, and that was pretty much it.
One of the reasons that I like my accounting boring is that it simplifies my life and reduces stress. Unfortunately, it seems like my accounting practices have a cost. In particular, it means that I favor paying a bit too much to the taxman. That means that all of the taxes are going out immediately, and the company doesn’t end the year with a large tax bill that we need to cover. But I overdid it a time or two, and we overpaid on our taxes. Well, that was by design, extra money showing up from the taxman is much better than a surprise bill. But at a certain point, we were supposed to get a refund for a non-trivial amount. At which point the tax authorities came a-calling and audited us.
Remember that I talked about boring accounting practices? The day we started the audit, I was having dinner with my wife and being audited was the third topic of the day, if I recall properly. They found a few things that we did wrong (we registered an invoice for the wrong currency, so we cancelled it and issued a new one, instead of refunding it and issuing a new one). That was a Thing, it seemed. But the end result was pretty much nothing. I loved it. Since then, we were audited a few more times, always with no repercussions.
Given that the next audit is a question of when (usually every 18 – 30 months or so, it seems), not if, I really care about my accounting data. Hence the triple backups policy. You might have been going through this post expecting to hear that we lost the accounting data, and the backup failed, and now my accounting outlook is decidedly not boring. I’m afraid that this is only half true. We did have a failed backup, but we caught it before we actually needed it.
At one point, I looked at our backup policies, and I noticed that the most recent accounting backup was months old. That was concerning, I gotta say. Here is the timeline, as best I could piece it together:
- Q2 2017 – Backup process is defined and tested. This is a one-off process that we use only for the accounting database.
- Q1 2018 – Routine key rotation is performed on some of our keys. Unbeknownst to us, the backup process loses the ability to report failure. But given that it doesn’t fail, no one notices.
- Q4 2018 – The developer responsible for setting up the backup process leaves the company. As part of the outgoing employee process, we shut down relevant user accounts.
- Q1 2019 – The accounting server is rebooted. The backup process fails to start, because the user account is disabled.
You might notice the time scale of this issue. The underlying problem was that the developer set up this one-off process as a… well, one-off process. That meant that it wasn’t hooked into any of our usual monitoring / alert systems. It did have a way to report errors, but the credentials it used for that went stale after a year. No one paid attention, since the backups continued to run.
The backup process was also running under the user account of the developer, not a service account. I guess it was easier than creating a dedicated user, but the end result was that when we deactivated the user account after the developer left the company, we also disabled the backup. The process was already running at that point, though, and it continued to run for months. Only much later, after the reboot, did the process fail to start. By then it had no way to report errors, and we noticed the problem only because we looked for it during routine operations.
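The common thread in that timeline is that the only thing that could tell us about a broken backup was the backup process itself. A check that runs somewhere else and looks at the age of the newest backup file would have caught this within a day or two. Here is a minimal sketch of that idea in Python. This is not what we actually ran; the function names and the 36 hour threshold are made up for illustration, and how you raise the alert is up to your monitoring system.

```python
import time
from pathlib import Path
from typing import Optional

# Hypothetical threshold: a daily backup plus some slack.
MAX_AGE_SECONDS = 36 * 60 * 60


def newest_backup_age(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the newest file in backup_dir.

    Returns None if the directory is missing or contains no files.
    """
    now = time.time() if now is None else now
    path = Path(backup_dir)
    if not path.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in path.iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_stale(
    backup_dir: str,
    max_age_seconds: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """A missing or empty backup directory counts as stale, not as 'no news'."""
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age_seconds
```

The important design choice is that silence is treated as failure: no directory, no files, or an old file all report as stale. Run it from your regular monitoring, under a service account, so that it doesn't share the credentials or the user account of the backup job it is watching.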
One of the reasons we built backups directly into the core of RavenDB was exactly this sort of situation. A backup process is not something that you cobble together (that’s on us, to be fair), it is something that should be part and parcel of the operations of your database, and being able to do something like get backups in triplicate is essential for a good operations experience.
Comments
Reminds me of the good old days when I did some small business support stuff. A small company dutifully replaced the tape in the tape drive every day. Master on Monday, incrementals the other days, and they kept the masters for 8 weeks out of the office. Sounds pretty solid, right? Except for the fact that the backup software was configured completely wrong, so the tapes contained exactly zero bytes. They only discovered that when the server killed itself and they lost all company data.
Backups are hard to do right and reliably, unfortunately.
There was this one time about 15 years ago, when I was a junior working at a small company, when I accidentally deleted a website from one of the hosted servers. Not a big problem, except the company owner had accidentally deleted the backups about 1 hour earlier ... somehow we used good old fashioned "undelete" and disk recovery tools to piece things back together.