The best models will support the second crucial skill required for troubleshooting: seeing the system as a set of problems to be solved. The problem/solution mindset is so critical to understanding how networks really work, and hence how to troubleshoot them, that Ethan Banks and I are writing an entire book around this concept. The essential points are these:
- Understand the set of problems being solved
- Understand a wide theoretical set of solutions for this problem, including how each solution interacts with other problems and solutions, potential side effects of using each solution, and where the common faults lie in each solution
- Understand this implementation of this solution
Having this kind of information in your head will help you pull in detail where needed to fill in the models of each system; just as you cannot keep all four of the primary systems in your head at once, you also cannot effectively troubleshoot without a reservoir of more detailed knowledge about each system, or the ready ability to absorb more information about each system as needed. Having a problem/solution mindset also helps keep you focused in troubleshooting.
So you have built models of each system, and you have learned to think in terms of problems and solutions. What about technique? This is, of course, the step that everyone wants to jump to first—but I would strongly suggest that the technique of troubleshooting goes hand in hand with the models and the mindset of troubleshooting. If you do not have the models and the mindset, the technique is going to be worthless.
In terms of technique, I have tried many through the years, but the most effective is still the one I learned in electronics. This is the third crucial skill for troubleshooting: half split, measure, and move. This technique consists of several steps:
- Trace out the entire path of the signal. This is the first place where your knowledge of the system you are troubleshooting comes into play; if you cannot trace the path of a signal (a flow, or data, or even the way data is passed between layers in the network), then you cannot troubleshoot effectively.
- Find a halfway point in the path of the signal. Ideally, this halfway point is somewhere you can easily measure state while still splitting the entire signal path roughly in half. One mistake rookies make in troubleshooting is to start with the easiest place to measure, or the points in the flow they understand the best. If you don’t understand some part of the data path, then you need to learn it, rather than avoiding it. Again, this is where system knowledge is going to be crucial. If you have inaccurate models of the system in your head, this is where you are going to fail in your endeavor to troubleshoot a problem.
- Measure the signal at this halfway point. If the signal is correct, move closer to the tail of the signal path. If the signal is incorrect, move closer to the source of the signal.
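The three steps above amount to a binary search along the signal path. Here is a minimal sketch in Python, under some loud assumptions: the path is a single ordered chain, there is exactly one fault, and every point past the fault measures bad. The names `half_split`, `probe`, and the hop names in `PATH` are hypothetical, made up purely for illustration.

```python
from typing import Callable, Optional, Sequence


def half_split(path: Sequence[str],
               signal_ok_at: Callable[[str], bool]) -> Optional[str]:
    """Return the first point on `path` where the signal measures bad.

    Assumptions (for illustration only): the path is a single ordered chain,
    there is one fault, and every point past the fault measures bad.
    Returns None if the signal is still good at the tail of the path.
    """
    if not path or signal_ok_at(path[-1]):
        return None  # nothing wrong along this path
    lo, hi = 0, len(path) - 1  # the first bad point is within path[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if signal_ok_at(path[mid]):
            lo = mid + 1  # good here: move toward the tail
        else:
            hi = mid      # bad here: move toward the source
    return path[lo]


# Hypothetical seven-hop path with a simulated fault at "core-rtr".
PATH = ["host-a", "access-sw", "dist-rtr", "core-rtr",
        "firewall", "server-sw", "host-b"]
measured = []  # records which points were actually probed


def probe(point: str) -> bool:
    """Simulated measurement: good before the fault, bad from it onward."""
    measured.append(point)
    return PATH.index(point) < PATH.index("core-rtr")


first_bad = half_split(PATH, probe)
```

In this simulated run the search probes four points rather than walking all seven hops, and the savings grow as the path gets longer. The sketch only holds for a single chain; a system with parallel branches does not satisfy that assumption, and the trace of the path has to account for them.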
The half split method is time-proven across many different fields, from electronics and building to electrical work and fluid dynamics. It might seem like a good idea just to “jump to what I know,” but this is a mistake.
For instance, one time I was called out with another tech from my shop to work on the FPS-77 storm detection radar. There was some problem in the transmitter circuit; the transmitter just was not producing power. There was a resistor that blew in the “right area” all the time, so we checked the resistor, and sure enough, it seemed like it was shorted. We ordered another resistor, shut things down, and went home for the morning (by the time we finished working on this, it was around 3 a.m.). The next day, the part came in and was installed by someone else. The resistor promptly showed a short again, and the radar system failed to come back up.
What went wrong? I checked what was simple to check, what was a common problem, and walked away thinking I had found the problem, that’s what. It took another day’s worth of troubleshooting to actually pin the problem down: a component that was in parallel with the original resistor, but not on the same board, or even in the same area of the schematics, had shorted out. The resistor showed a short because it was in parallel with another component that was actually shorted out.
Lesson learned: do not take shortcuts, do not assume the part you can easily test is the part that is broken, and do not assume you have found the problem the first time you find something that does not look right. Make certain you try to falsify your theory, instead of just trying to prove it.
This, then, is the troubleshooting model I have developed across many years of actually working in and around some very complex systems. To reiterate:
- Build models of the system and all subsystems that are as accurate as possible, particularly the business, the applications, the protocols, and the equipment. This is probably where most failures to effectively troubleshoot problems occur, and the step that takes the longest to complete. In fact, it is probably a truism to say that no one ever really completes this step, as there is always more to learn about every system, and more accurate ways to model any given system.
- Have a problem/solution mindset. This is probably the second most common failure point in the troubleshooting process.
- Half split, measure, and move.