What kind of logging should you do in production?

That really depends on the type of application that you write and what kind of operations team it is going to have.

I have applications that I set up and then forget about (this blog being one of them). In those types of applications, having a log in production is a burden: I need to purge it occasionally, or write the code to purge it automatically.

Then I have applications that have a strong operations team, where people are looking at the application every single day. An alert raised from the system is actually going to be looked at by a human before irate customers start calling. In those cases, a log is pretty important, and so is understanding how to properly distinguish between real errors (a human needs to look at them) and transient ones (do a review once a month).

Setting things up so that production sites log only error conditions, which is pretty common, is also a mistake. As a simple example, I once saw a log where 40% of the errors were users coming back to the site after their session had timed out, and the error was leading them to the error page.
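To make the distinction between real and transient errors concrete, here is a minimal sketch in Python. The exception classes `SessionExpired` and `TransientError`, and the `classify` and `record` helpers, are hypothetical names invented for this illustration. The level chosen for each category is an assumption, not a rule.

```python
import logging

logger = logging.getLogger("app")


class SessionExpired(Exception):
    """Hypothetical: the user came back after the session timed out."""


class TransientError(Exception):
    """Hypothetical: a failure that is expected to clear up on retry."""


def classify(exc: BaseException) -> int:
    """Pick a log level based on who needs to look at the problem."""
    if isinstance(exc, SessionExpired):
        # Expected behavior, not a fault: send the user to the login page.
        return logging.INFO
    if isinstance(exc, TransientError):
        # Worth a periodic review, but nobody needs to be woken up.
        return logging.WARNING
    # Anything else is treated as a real error that a human should see.
    return logging.ERROR


def record(exc: BaseException) -> int:
    """Log the exception at its classified level and return that level."""
    level = classify(exc)
    logger.log(level, "%s: %s", type(exc).__name__, exc)
    return level
```

With something like this in place, an expired session would stop showing up next to the errors that actually need attention.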

The way I try to do things is:

  • Pay attention to messages that arrive in the error queue, and see if there is anything that can be done about them.
  • Log & alert any time that an error crosses the system boundary (if the users see an error page, I really want to know about it).
  • Set things up so I can change log levels in production without restarting, redeploying, etc.
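The last two points can be sketched roughly as follows, using Python's standard `logging` module. The `alert` hook, the `system_boundary` decorator and the `set_log_level` helper are hypothetical names for this example. How `set_log_level` gets called in a running process (an admin endpoint, a watched config file, a signal handler) is left open.

```python
import functools
import logging

logger = logging.getLogger("app.boundary")


def alert(message: str) -> None:
    """Hypothetical alert hook; a stand-in for a pager, email, etc."""
    logger.critical("ALERT: %s", message)


def system_boundary(handler):
    """Wrap an outermost handler so escaping errors are logged and alerted."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception:
            # The error is about to reach the user, so record the traceback
            # and raise an alert before showing the error page.
            logger.exception("unhandled error in %s", handler.__name__)
            alert(f"user saw an error page from {handler.__name__}")
            return "error page"
    return wrapper


def set_log_level(logger_name: str, level: str) -> None:
    """Change a logger's level in the running process, no restart needed."""
    logging.getLogger(logger_name).setLevel(level.upper())
```

Calling `set_log_level("app.boundary", "debug")` from whatever admin channel the application exposes would turn on verbose logging for that logger only, and it can be turned back down the same way.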

Please note that I am making a distinction here between a developer’s log and audit trails or operations information. Depending on the type of system that you have, and the requirements on it, the latter two can be a gold mine when trying to troubleshoot issues.

Providing things like performance counters or access to internal state in your application is also important. For example, being able to ask the app for the worst performing queries, or for the cache miss ratios, is a great way of troubleshooting perf issues. It isn’t just logging that gives you visibility into the system.
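As a rough illustration of exposing internal state, the sketch below keeps the slowest queries and the cache counters in memory so that they can be queried on demand. The `Diagnostics` class and its method names are hypothetical. A real application would also need to decide how to expose them (an admin page, a management endpoint) and how to make them thread safe.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Diagnostics:
    keep: int = 10                # how many slow queries to remember
    cache_hits: int = 0
    cache_misses: int = 0
    _slowest: list = field(default_factory=list)  # min-heap of (seconds, query)

    def record_query(self, query: str, seconds: float) -> None:
        item = (seconds, query)
        if len(self._slowest) < self.keep:
            heapq.heappush(self._slowest, item)
        else:
            # Push the new item and drop the fastest one, so the heap
            # always holds the `keep` slowest queries seen so far.
            heapq.heappushpop(self._slowest, item)

    def record_cache(self, hit: bool) -> None:
        if hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    def worst_queries(self) -> list:
        """Return the remembered queries, slowest first."""
        return sorted(self._slowest, reverse=True)

    def cache_miss_ratio(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_misses / total if total else 0.0
```

The point of the design is that nothing is written to a log: the numbers stay in the process and cost almost nothing until someone asks for them.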

Something that I haven’t had the chance to do yet (but that I would like to try) is to plug the NH Prof backend (which is basically an event aggregation and analysis system) as a way to analyze log streams. That way, even if you do have some logging turned on, it doesn’t stay in its raw form, but is translated to something much more concise and understandable.
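I can't show what that would look like with the NH Prof backend itself, but the general idea of turning a raw log stream into something more concise can be sketched in a few lines. This is only a stand-in under simple assumptions: the `summarize` function is a hypothetical helper that replaces every number in a line with a placeholder and counts how often each resulting message shape occurs.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def summarize(lines):
    """Collapse raw log lines into (message shape, count), most common first."""
    shapes = Counter(
        _NUMBER.sub("<n>", line.strip())
        for line in lines
        if line.strip()  # skip blank lines
    )
    return shapes.most_common()
```

Fed a few thousand lines of "query took 12 ms" style messages, this would return a handful of rows instead, which is a crude version of the kind of summary described above.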