Emergency fixes & the state of your code base

time to read 2 min | 397 words

One of the most horrible things that can happen to a code base is a production bug that needs to be fixed right now. At that stage, developers usually throw aside all attempts of creating well written code and just try to fix the problem using brute force.

Add to that the fact that very often we don’t have the exact cause of the production bug, and you get an interesting experiment in the scientific method. The devs form a hypothesis, try to build a fix, and “test” that on production. Repeat until something make the problem go away, which isn’t necessarily one of the changes that the developers intended to make.

Consider that this is a highly stressful period of time, with very little sleep (if any), and you get into a cascading problem situation. A true mess.

Here is an example from me trying to figure out an error in NH Prof’s connectivity:

image

This is not how I usually write code, but it was very helpful in narrowing down the problem.

But this post isn’t about the code example, it is about the implications on the code base. It is fine to cut corners in order to resolve a production bug. One thing that you don’t want to do is to commit those to source control, or to keep them in production. In one memorable case, during the course of troubleshooting a production issue, I enabled full logging on the production server, then forgot about turning them off when we finally resolved the issue. Fast forward three months later, and we had a production crash because of a full hard disk.

Another important thing is that you should only try one thing at a time when you try to troubleshoot a production error. I often see people try something, and when it doesn’t work, they try something else on top of the modified code base.

That change that you just made may be important, so by all means save it in a branch and come back to it later, but for production emergency fixes, you want to always work from the production code snapshot, not on top of increasingly more franticly changed code base.