The trouble with logging
Logging is a wonderful thing. I think that most applications would be best served by doing a lot more of it than they currently do. Having said that, I do have a few observations with how logging is done by most software and how things might be improved.
Unstructured text is pure evil I don’t know how else to put it. The log files generated by most applications are simply atrocious from a parsing perspective. Sometimes this makes sense, as the log messages are targeted at human beings, but I believe that it’s possible to create something that meets the need of automatic parsing without making it totally unusable for the human eye.
The truth of the matter is that in this day and age a tiny percentage of log files are actually reviewed by organic eyeballs and a gigantic amount is dealt with automatically. For example, at a previous employer, we had literally tens of thousands of lines of Perl (regular expressions, parsers and logic) to deal with the log detritus from security tools and appliances. It was a nightmare of maintenance and a constant chase to discover more undocumented messages and absurdly pointless changes between point releases of products.
Words are inaccurate This plays further into the traditional approach that has been used in many applications of designing log files to be read, rather than parsed. I certainly appreciate the desire to make things easier to “read”, but in the process often the meaning is obfuscated. For example, in the mainframe world, we use message codes. Take IMS for example. There are two possible error messages that can be generated if you issue /START MADSIO command: DFS0007I and DFS0008I, depending on whether it was unable to start or if it as already running. Yes, you could parse this out of the language, but it’s easier to check the message code.
Logs are bloated with human-readable data This plays further into the whole writing for the wrong audience issue. Since a vast majority of log messages either never viewed, or reviewed by machine, a huge amount of the data in the log is useless. Computers don’t care about English grammar. In many cases, just a timestamp plus the message code would be enough, and if more data is included it should be that which adds value, not linguistic noise.
For example, this is a useful log message:
IPT: DENIED IN=eth0 OUT= MAC=40:40:d1:14:58:8c:00:18:8b:f9:70:06:08:00 SRC=86.145.19.153 DST=209.20.88.140 LEN=400 TOS=0x00 PREC=0x00 TTL=103 ID=18046 PROTO=UDP SPT=59787 DPT=500 LEN=380
Everything in the message is easily parsed out and it has a regular form. This, however, is not:
Attempting to swizzle in class 'MessageViewer': swapping method '<em>setUpWindowContents' with 'Letterbox</em>_setUpWindowContents'...
WTF does that even mean? I’m sure it means something to someone, but… it’s not really a log message. It might be a trace message, which brings us to my last thought.
Logging is not tracing is not narration Logging is not a trace. Logging tells me X happened. It shouldn’t be telling me every single step along the way. That’s a trace. When I want a trace, I’ll ask for a trace. In addition, it’s not a narration. To understand what I mean by narration, take this log series from Apple:
Starting standard backup Backing up to: /Volumes/Tardis/Backups.backupdb No pre-backup thinning needed: 350.8 MB requested (including padding), 1.56 GB available Copied 2416 files (5.9 MB) from volume Silver Surfer. No pre-backup thinning needed: 343.9 MB requested (including padding), 1.56 GB available Copied 376 files (59 KB) from volume Silver Surfer. SStarting post-backup thinning Deleted backup /Volumes/Tardis/Backups.backupdb/Silver Surfer/2008-11-01-103735: 1.56 GB now available Post-back up thinning complete: 1 expired backups removed Backup completed successfully.
That’s sure a lot of noise to tell me this something that could have been a single message.
I don’t have a solution for this, but it’s just something that was annoying me this evening in trying to diagnose a problem.
This entry was posted at 11:05 pm on 2 November 2008 and is filed under Technology. You can follow any responses to this entry through the post-specific RSS 2.0 feed.
In my hobbie project I’m writting logs in the human readible data serialization format ‘YAML’. It can be colourised by vim, is indented nicely, is human readible and is parsible by any popular programming (providing the appropriate YAML library is installed). Not sure how it’s going to work out, but just a thought.
It is the intention that analysis and monitoring tools will be built around the logging format later on, and I hope this system will make this very easy indeed.
I need to finish the thoughts in this one, as it was late, but I do have some ideas about how to handle it. Strangely, a lot of it comes from times I spent on weird mainframeish systems in the past. I am not, however, arguing for a return to the book cases full of messages, though that’s better than what we have today.
[...] my opening rant on the subject of logging, I mentioned that “logging is not tracing is not narrative”, [...]
Some log messages have to be recorded in a way so that the app can be internationalized – your ops admins may not read English
That’s no doubt true, but regardless, whether it be English, French or even Hindi, it’s necessary to understand your audience and in many cases, the reduction of narrative also reduces the need to internationalize logging messages.
[...] earlier posts, I ranted about the state of logging and made some observations about who the audience might be. Now I want to talk about something [...]
Both comments and pings are currently closed.
In my book, “Release It!”, I have a lot to say about logging, along very similar lines. Most developers write log messages as though they, the developers, are the audience for the logs. They aren’t.
The true audience for logs is either a) operations staff, or b) monitoring systems. Neither of them are going to know the internal structure of the code, so logs shouldn’t reflect a conversation between the code and the developer.
Instead, logs should be written to inform operations and, particularly, to trigger actions when intervention is needed. Anything logged as “Error” or “Severe” level should mean that a Sev 1 ticket needs to be opened and Incident Management should begin.