Troubleshooting: Models


How well can you know each of these four systems? Can you actually know them in fine detail, down to the last packet transmitted and the last bit in each packet? Can you know the flow of every packet through the network, every piece of information any particular application pushes into a packet, and the complete set of ever-changing business requirements?

Obviously the answer to these questions is no. As these four components of the network combine, they create a system that suffers from combinatorial explosion. There are far too many combinations, and far too many possible states, for any one person to actually know all of them.
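To put rough numbers on that explosion, here is a minimal sketch. It assumes a deliberately simplified system in which every component has the same number of independent states; the function name is hypothetical, and real networks are far less uniform than this.

```python
def joint_state_count(states_per_component: int, components: int) -> int:
    """Count the combined states of a system, assuming each component
    independently takes one of `states_per_component` states."""
    return states_per_component ** components


# Even with only two states per component (say, "up" or "down"),
# the count doubles with every component added.
for n in (10, 20, 40):
    print(f"{n} components -> {joint_state_count(2, n):,} possible states")
```

Forty two-state components already yield more than a trillion combinations, which is why enumerating states by hand is not an option.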

How can you reduce the amount of information to something a reasonable human can keep in mind? The answer, as with most problems related to having too much information, is abstraction. In turn, what does abstraction really mean? It means you build a model of the system and interact with the system through the model, rather than trying to keep all the information about every subsystem, and how the subsystems interact, in your head. So for each subsystem of the entire system, you have a model you use during the troubleshooting process to understand what the system should be doing, and to figure out why it might not be doing what it should.
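One way to picture working through a model is to compare what the model says should happen with what is actually observed. The sketch below is illustrative only: the device names, the `EXPECTED_NEXT_HOP` table, and the `first_divergence` helper are all hypothetical, and the "model" is reduced to a single expected next hop per device for one flow.

```python
from typing import Dict, List, Optional

# Hypothetical model: the next hop each device is expected to use for one flow.
EXPECTED_NEXT_HOP: Dict[str, str] = {"A": "B", "B": "C", "C": "D"}


def first_divergence(model: Dict[str, str], observed_path: List[str]) -> Optional[str]:
    """Return the first device whose observed next hop differs from the
    model, or None if the observed path matches the model at every hop."""
    for here, nxt in zip(observed_path, observed_path[1:]):
        if model.get(here) != nxt:
            return here
    return None


# An observed trace (for example, gathered hop by hop) that leaves the
# expected path at device B.
print(first_divergence(EXPECTED_NEXT_HOP, ["A", "B", "E", "D"]))
```

The point is not the code itself but the habit it represents: the model tells you where to look first, so you do not have to inspect every device in the path.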

This has some interesting implications, of course. For instance, when a system is a “black box,” which means you are not supposed to know how the system works, your ability to troubleshoot the system itself is non-existent, and your ability to troubleshoot any larger system of which the black box is a component is severely hampered.

In fact, I will go so far as to say that having accurate models of a system, and of each component of that system, is one of the three critical skills you need to develop if you are going to be effective at troubleshooting. It is a skill most people skip, to their disadvantage when it is 2 AM, the network is down, and they are trying to trace a flow through the topology to figure out what is going on.

There are two subcomponents of this first rule:

  • The better your model of the operation of a system, the better you will be at troubleshooting the system when problems arise.
  • The more models you have of how “generic” systems in the space work, the faster you will be able to fit a model to a specific system, to understand its operation, and hence to troubleshoot problems within the system.

You should strive, then, to ensure that your stock of models is accurate and broad. For instance:

  • Monitoring a network over a number of years will help you understand the network itself better, and hence to have a more accurate model of that specific network.
  • Monitoring, and working on/in, a large number of networks over a number of years will help you build models of general network operation in a wide array of environments, and help you understand what questions to ask and where to look for the answers.
  • Learning the theoretical models of how networks actually work will help you understand the why questions that will, in turn, deepen your understanding of the operation of any particular network.

Notice I have used the word network in the points above, but these same points are true of every subsystem in the overall system—businesses, applications, protocols, and equipment. Each model is a different perspective, a different lens through which you can see a particular problem; the faster you can shift between models, and combine these models to form a complete picture, the faster you will be able to figure the problem out and solve it—and the more likely you are to solve problems in a way that does not accrue technical debt over the long term.

As an example of the problem with inaccurate, or “not totally useful,” models, I would like to use the OSI model. Every network engineer in the world memorizes this model, and we all talk in terms of its layers. But how useful is the OSI model, really? In real-world troubleshooting, the concept of layers is useful, but the specific layers laid out by the OSI model do not tend to be especially useful. I would suggest the Recursive InterNetwork Architecture (RINA) is a better model for understanding the way traffic flows through a network.