Logfiles Suck

Very frequently we get sent logfiles from customers who are having a problem, and who are often surprised when we say that the logs aren’t really of much use. It’s a perfectly reasonable assumption, that’s what the logs are for, right? How can we say they’re useless, especially when Traffic Global – or anything else for that matter – writes to them?

The thing is that logs are really a way of finding out what happened in the past, over time. When you do something significant, or start a long bit of work, that goes in the log. If you find a problem that you know about and can deal with, that goes in the log. Any specific details that might affect things – processor type, screen resolution, version numbers and so on – they all go in the log. While all of these things are useful to know, not one of them is useful at the precise moment that it’s logged.

This partly explains one of the questions that’s asked a lot; the last line in the log is from Traffic Global, which means that if something goes wrong it must be down to Traffic Global, right? Um… no.

Firstly, it might have been logged ten minutes ago. This is why logfiles tend to have timestamps in them, but they’re also often limited to the nearest second. A modern computer can do a shitload of work in one second, so even if the last log entry looks like it happened exactly at the right time, it almost certainly isn’t telling you the last thing that happened.

To add to the mix, things that the application might have logged could easily have not actually been written to disk, especially if the application crashed. Writing to disk is very slow, no matter how fast the disk is, so writes are often stored in memory and written out in batches. When you have a complex application like X-Plane and Traffic Global, there can be several layers of these stores lying between the point where the log entry is created, and when it actually makes it out to disk. There’s also the operating system keeping it’s own store, so even after the application has written the log entries, they still might take a little time to turn up in a file on disk. If the application crashes, all of those layers of in-memory stores are almost certainly just lost, so even if there was something useful logged, it’s gone.

I’ve already said that only key things are logged, and another reason for this is that if absolutely everything were logged, the logfile would easily reach terabytes in no time at all. Modern processors can do many things at the same time because they have more than one core, usually at least four and now 32 or more, and each core would need to log separately. If this weren’t the case then every single action would end up in a massive queue to write logs, effectively wiping out any benefit of having a multi-core processor.

Having multiple threads running at the same time, which is normal and heavily used, is another reason that the last thing in a logfile may have no relationship to the cause of the crash. Thread one writes a log entry, then goes on and does another few million things before that entry makes it out to disk. In the meantime, half a dozen other threads are also doing a few million things and it could easily be one of those, one that hasn’t touched the log whatsoever, that caused the crash.

Sometimes it’s possible for the program to notice that a crash has happened, and make sure that all the log entries are fully written and even maybe add some information about the crash to the log. Both X-Plane and Traffic Global do this. The thing with crashes though, is that they’re unexpected. If the program expected the problem, it wouldn’t have crashed. And sometimes, not uncommonly, it’s just not possible to do this nice cleanup and graceful shutdown. Even if it is, the best the log can do is to note what happened, not why, and if you’re not already aware that a problem exists then it’s almost certainly not going to log the information that’s really needed to see the cause.

So, what’s the answer?

For a programmer, the answer is to get hold of a crash dump. This is a file that’s written out when a program crashes, hence the name. Unlike a logfile, it contains the precise state of the program – all of it – allowing the programmer to see precisely what was happening across all threads at the precise instant of the crash, right down to the exact processor instruction that was being run at the time. It’s not subject to any of the problems of logfiles because the program doesn’t have to do anything to maintain it, it’s only created at the precise moment that the fault happens.

A crash dump shows what it was was working on, how it got there, and exact versions of every module (that’s a DLL, dylib or in X-Plane’s case, .xpl file). It even allows me to see the exact line of code that corresponds to all of this. I can literally double-click on a crash dump, enter three commands and see the line of code that crashed, including any data it was working with, even if the code’s changed since.

And the logfile? Well, it’s still useful in a way. Sometimes it really is useful to find out what happened in the past. One thing that comes to mind is finding minor problems with aircraft or scenery, because missing or damaged files tend to be written to the log but don’t cause a crash. That’s generally where logs are useful – seeing if the program noticed something wrong that was logged but didn’t stop things entirely.

For working out what caused a crash, though? Yeah, logfiles suck.

About the Author: jimkeir