Pensieri di un lunatico minore

3 November 2008 Technology

Fixing logging (pt 1) – know your audience

In my opening rant on the subject of logging, I mentioned that “logging is not tracing is not narrative”, and I wanted to start by saying that, like any writing, you need to know your audience. So, let’s start by defining what those words mean, and who the likely audience is. I’m going to use a bogus application to demonstrate what I mean by the individual audiences. In addition, I’m going to leave them in “human readable” format for now so that people don’t freak out completely looking at them. I’m also going to leave out common things that people are familiar with like time stamps and “levels”.

Logging

The best way to describe this generically is with the adverb transactional. The target for this form of message is system administrators and, in many places, more importantly system monitoring tools. That means that you’re attempting to describe high level actions taken by the system under normal circumstances primarily, while providing enough information to begin to understand where things went wrong. It is not for debugging at a developer level.

So what’s not a good log message at this level?

Opening file 'foobar.xml' for reading

What’s the point of that log message? It tells me nothing. Was that successful? Was it the file I told the system to open? Let’s look at two messages that are useful to this audience.

SUCCEEDED processing '/tmp/foobar.xml' in 0.32s. 100 entries

See first we know it succeeded, and we know exactly which file is used, how long it took to handle—which might give us a hint at performance issues in the future—and how many entries were in the file. All of that allows us to verify the proper operation of the system. But what if something goes wrong?

FAILED to process '/tmp/foobar.xml'. Unable to open file as user nobody.

In this case, we know clearly that something went wrong (easy to grep for), what file was involved and more importantly, a good hint at why it couldn’t open the file. This would give us a good head-start at figuring out not only what went wrong, but how we can fix it.

Note that never do we talk about exactly where this failed in our code. That wouldn’t help a system administrator in many cases. Often the problem is much simpler and one of permissions or configuration, and code pointers are a distraction to solving these problems.

Tracing

Tracing is what developers want. They want detail about the internal logic of the process and how it’s progressing through it’s flow. This is great for developers, but it’s absolutely useless to most everyone else and tends to create noise in the underlying logs that make it harder to find actual useful information. Tracing should be:

  1. Easy to turn on and off mid flight
  2. Clear about where you are in the code (e.g., include module/class/method info)
  3. Provide information about important variables going in and out of things

The first thing is often missed. To trace a bit of software, you have to restart it, which often means it’s in a totally different state than the one you want to see. I generally write my software to deal with a specific Unix signal (SIGUSR1 usually) that will turn on tracing, and then another to turn it off. This lets you watch something when it’s broken.

It’s much harder to give a good trace message example because it’s totally dependent on system design, programming language choices, etc. It’s something written by the developer for the developer. That’s not to say it holds no value for anyone else, but it’s primarily for someone who is familiar with the underlying code. Let’s hypothesize though:

subsystem_name/SomeClass/handleLine:12 Received line (len=3021; fmt=utf8) starting "when in the course"
subsystem_name/SomeClass/handleLine:42 Found 7 tokens at (9, 30, 42, 59, 1003, 2055, 2155)

Here we have information about the logical flow of the inside of a process at a micro level. We have additional information that otherwise would be of no use, such as line length and format. Finally, we have the “results” of an action such as tokenizing. Obviously, every system is different, but it gives you a strong key as to what’s going on inside the system so that you can diagnose a problem.

There in lies the key. Tracing is only useful when something is broken. The rest of the time, it’s just noise and more megabytes of cruft in the stream that obfuscates other more useful information

Narrative

Finally we come to what most people actually do, which is the worst of both worlds. They provide tracing level “messages”, but without the detail needed to be useful at debugging a problem, and then they wrap them in English that is aimed at the wrong audience. The examples I used in my previous post are typical of this approach. It’s well intentioned, but in the end, fruitless.

Final Thoughts

So where does that leave us? It means that developers need to look at logging as a multi-faceted environment that doesn’t necessarily lend itself to a single solution. In later posts, I want to address the issue of “message catalogs”, logging mechanisms (like syslog) and formatting.

This entry was posted at 11:23 am on 3 November 2008 and is filed under Technology. You can follow any responses to this entry through the post-specific RSS 2.0 feed.

No comments found.

Both comments and pings are currently closed.