It’s 2AM, the network is down, and the CEO is on the phone asking when it is going to be back up. The overnight job crucial to the business opening in the morning has failed, and the company stands to lose millions of dollars if the network is not fixed in the next hour or so. Almost every network engineer has faced this problem at least once in their career, and it usually means an intense bout of troubleshooting.
And yet troubleshooting is a skill that is hardly ever taught. A number of computer science programs do include classes in troubleshooting, but these tend to focus on tools rather than technique, or on practical skill application. I was also trained in troubleshooting many years ago as a young recruit in the United States Air Force, but that training was, again, practical in bent, with very few theoretical components.
Note to readers: I wrote a short piece on troubleshooting here on rule11, but I have taken that piece down and replaced it with this short series on the topic. I did start writing a book on this topic many years ago, but my co-authors and I soon discovered troubleshooting was going to be a difficult topic to push into a book form. There are a number of presentations in this area, as well, but here I am trying to put my metacognitive spin on the problem more fully, after spending some time researching the topic.
It is always best to begin at the beginning: again, it is 2AM, and some problem in the network is keeping necessary work from getting done. It is tempting to begin with the problem, but it is actually important to back up one step and define what it means for a system to be “broken.” In other words, the first question you need to ask when faced with a broken system is: what should this system be doing?
“Well, it should be transporting this application traffic over there, so this application will work.” It probably should, but I would suggest this is far too narrow a view of the word “working” to be useful in a broader sense.
For instance, one of the pieces of equipment on the flightline was a wind speed indicator. This is a really fancy name for a really simple device; there was a small “bird” attached to the top of a pole with a tail that would guide the bird into facing the wind, and then at the nose of the bird an impeller attached to a DC motor. The DC motor drove a simple DC voltmeter that was graduated with wind speeds, and the entire system was calibrated using a resistive bridge in the wind speed indicator box, and another in the wind bird itself. The power from the impeller was passed to the voltmeter, several miles away, through a 12-gauge cable. These cables were particularly troublesome, as they were buried, and had to be spliced using gel-coated connectors, with splices buried in gel-filled casing. This was all before the advent of nitrogen-filled conduit to keep water out.
In one particular instance, a splice failed, resulting in us digging the cable up by hand and opening the splice. A special team was called in to resplice the cable, but even with the new splice in place, the wind system could not be calibrated to work correctly. The cable team argued that the cable had all the right voltage and resistance readings; we argued back that the equipment had been working correctly before the splice failed, and that everything tested okay on the bench, so the problem must still be in that splice. The argument lasted for days.
From the view of the cable team, their “system” was working properly. From the perspective of the weather techs, the system was not. Who was right? It all came down to this: what does the “system” consist of, and what does “working properly” really mean? Eventually, by the way, the cable splice was fingered as the problem in a capacitive crosstalk test. The splice was redone, and the problem disappeared.
At 2AM, it is easy to think of the “system” as the network path the application runs over, and the application itself. Such a narrow view of the system can be damaging to your efforts to actually repair the problem. Instead, it is important to begin with an expansive view of the system. This should include:
- The specifications to which the network was designed, and to which it must operate in order to fulfill business requirements
- The requirements placed on the network by the applications the network is supporting
- The operation of the protocols used to create the information needed to forward traffic through the network (including policy)
- The operation of the software and hardware on each network (or forwarding) device in the network
In other words: specification + application + protocols + equipment. Again, this might be a little obvious, but it is easy to forget the entire picture at 2AM when the fires are burning hot, and you are trying to put them out.
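One way to keep the whole picture in front of you is to write the four parts down as a checklist and see which ones you have not looked at yet. What follows is only a rough sketch of that idea, not a real tool: the `Check` class, the `failing_layers` and `unchecked_layers` helpers, and the sample checks are all hypothetical names and values made up for illustration.

```python
from dataclasses import dataclass

# The four parts of the "system" from the list above.
LAYERS = ("specification", "application", "protocols", "equipment")


@dataclass(frozen=True)
class Check:
    """One thing you verified, and what you found (hypothetical structure)."""
    layer: str
    description: str
    expected: str
    observed: str

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer!r}")

    @property
    def passed(self) -> bool:
        return self.expected == self.observed


def failing_layers(checks):
    """Layers with at least one check where observed differs from expected."""
    failed = {c.layer for c in checks if not c.passed}
    return [layer for layer in LAYERS if layer in failed]


def unchecked_layers(checks):
    """Layers nobody has looked at yet; the easy ones to forget at 2AM."""
    seen = {c.layer for c in checks}
    return [layer for layer in LAYERS if layer not in seen]


# Illustrative values only; none of these come from a real network.
checks = [
    Check("application", "overnight job completes", "yes", "no"),
    Check("equipment", "interfaces on the path are up", "up", "up"),
    Check("protocols", "route to the job server is present", "present", "missing"),
]

print("failing:", failing_layers(checks))
print("not yet examined:", unchecked_layers(checks))
```

In this made-up example the equipment checks out, much as the cable team's voltage and resistance readings did, yet the system as a whole is still broken, and the specification has not been examined at all.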