War Stories: Check the stupid stuff, too
A mission-critical application, running quietly in production, suddenly stopped working. No warnings, no errors. The application is still up, still running, but for all intents and purposes, it is not operational. I get a call that basically says: "Fix it!"
The main interface to this application (a Windows service) is the huge volume of logs that it outputs. The target for the logs is the same SQL Server database that the application itself uses. Any error in this application should be logged, and some errors should go directly to the administrators (and me).
But the application is not doing its job, and worse, it doesn't write anything to the logs. This is a heavily multi-threaded application, so the first thing that I thought of was a deadlock, but restarting the application didn't help. I am on the phone with the sys-admin, going over the application configuration, trying to figure out why it doesn't even log its startup event (which is single-threaded, so no chance of deadlocks).
I give up and drive there. Restarting the service doesn't help, and no logs are written, even though the configuration seems just fine. The first thing that I do is take a look at the logs table. But it currently has about 500,000 records, so queries take longer than instantaneous. I get annoyed and try to add an index on the main search criteria. It errors. I double-check my syntax and try again. It errors.
This time I read the error. It says that there isn't enough space to create the index. I didn't think that it was that large an index, but now I am operating on a hunch. I try to insert a row into the database (row size: 40 bytes or so). The database refuses, saying that it doesn't have enough space.
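They had a rigorous backup procedure that kept backups on the same disks as the database files and never purged them. The disks had simply filled up, and since the logs went to that same database, the failure also silenced the very logs that would have reported it. Once I cleared some of the old backups, the application immediately resumed normal processing.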
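For what it's worth, this kind of check is cheap to automate. Here is a minimal sketch in Python, under some loud assumptions: the backups are plain files with a `.bak` extension sitting in a single directory, file modification time is a good enough proxy for backup age, and the function names (`free_fraction`, `purge_old_backups`) are made up for illustration. None of this is what the site actually ran.

```python
import shutil
import time
from pathlib import Path


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def purge_old_backups(backup_dir, max_age_days, now=None):
    """Delete *.bak files older than max_age_days and return their names.

    Assumption: backups are flat files named *.bak, and their modification
    time reflects when the backup was taken.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for entry in sorted(Path(backup_dir).glob("*.bak")):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Something like `purge_old_backups(backup_dir, 30)` on a schedule, plus an alert when `free_fraction(data_dir)` drops below, say, 0.1, would have flagged the problem long before the database refused a 40-byte insert. The alert should go somewhere other than the database it is watching.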
Total time trying to fix the issue: ~3 hours.
Total time fixing the issue since arrival: ~10 minutes.