Mean Time to Innocence is not Enough

A long time ago, I supported a wind speed detection system consisting of an impeller, a small electric generator, a 12 gauge cable running a few miles, and a voltmeter. The entire thing was calibrated through a resistive bridge: attach an electric motor to the generator, run it at a series of fixed speeds, and adjust the resistive bridge until the voltmeter, marked in knots of wind speed, read correctly.

The primary problem in this system was the several miles of 12 gauge cable. It was often damaged, requiring us to dig the cable up (shovel ready jobs!), strip the cable back, splice the correct pairs together, seal it all in a plastic container filled with goo, and bury it all again. There was one instance, however, when we could not get the wind speed system adjusted correctly, no matter how we tried to tune the resistive bridge. We pulled things apart and determined there must be a problem in one of the (many) splices in the several miles of cable.

At first, we ran a Time Domain Reflectometer (TDR) across the cable to see if we could find the problem. The TDR turned up a couple of hot spots, so we dug those points up … and found there were no splices there. Hmmm … So we called in a specialized cable team. They ran the same TDR tests, dug up the same places, and then did some further testing and found … the cable was innocent.

This set up an argument, running all the way to the base commander level, between our team and the cable team. Whose fault was this mess? Our inability to measure the wind speed at one end of the runway was impacting flight operations, so this had to be fixed. But rather than fixing the problem, we were spending our time arguing about whose fault the problem was, and who should fix it.

I recently read this line in a CAIDA research paper:

“Measurement is political, and often adversarial.”

It rang very true. In Internet terms, speed, congestion, and even usage are often political and adversarial. Just like the wind speed system, two teams were measuring the same thing to prove the problem wasn’t theirs, rather than to figure out what the problem was and how to fix it.

In other words, our goal is too often Mean Time to Innocence (MTTI), rather than Mean Time to Repair (MTTR).

MTTI is not enough. We need to work with our application counterparts to find and fix problems, rather than against them. Measurement should not be adversarial; it should be cooperative.

We need to learn to fix the problem, not the blame.

This is a cultural issue, but it also impacts the way we do telemetry. For instance, in the case of the wind speed indicator, the problem was ultimately a connection that “worked,” but with high capacitive reactance, such that some kinds of signals were attenuated while others were not. None of us were testing the cable with the right kind of signal, so we all just sat around arguing about whose problem it was rather than solving the problem.

When a user brings a problem to you, resist the urge to go prove yourself–or your system–innocent. Even if your system isn’t the problem, your system can provide information that can help solve the problem. Treat problems as opportunities to help rather than as opportunities to swish your superhero cape and prove your expertise.

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart that state machine. It just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.
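
To make the state machine framing concrete, here is a minimal sketch, in Python, of what an “active” route with an SIA timer looks like conceptually. The names, structure, and numbers are invented for illustration; real DUAL keeps far more state than this.

```python
import time

# Hypothetical, simplified model of an EIGRP-style "active" route.
# Everything here is illustrative; this is not EIGRP's implementation.

SIA_TIMER_SECONDS = 180  # roughly the classic three-minute active timer;
                         # the network in this story had it cranked to ~30 minutes

class ActiveRoute:
    def __init__(self, prefix, neighbors):
        self.prefix = prefix
        self.outstanding = set(neighbors)  # neighbors we sent a query to
        self.started = time.monotonic()

    def reply_received(self, neighbor):
        self.outstanding.discard(neighbor)

    def check(self):
        """Return 'passive' if all replies are in, 'sia-reset' if the
        active timer expired with replies still outstanding."""
        if not self.outstanding:
            return "passive"  # back inside the normal state machine
        if time.monotonic() - self.started > SIA_TIMER_SECONDS:
            # Stuck-In-Active: we cannot converge within the state
            # machine, so the protocol resets the silent neighbor(s).
            return "sia-reset"
        return "active"

# Example: a route goes active toward two neighbors; only one replies.
route = ActiveRoute("10.1.1.0/24", ["rtr-a", "rtr-b"])
route.reply_received("rtr-a")
print(route.check())  # "active" for now; "sia-reset" once the timer expires
```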

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step was, as always, to reduce the number of parallel links in the network to bring it to a stable state while we worked out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting a large scale system failure. You think you need the redundancy to handle the system failure, but in many cases the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while you figure out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500s with SSEs, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.
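
A toy simulation, sketched below in Python, shows the arithmetic of this failure mode. The process names and sizes are made up, but the effect is the same one we saw in those show tech outputs: plenty of free memory, yet no single block large enough to build a packet.

```python
# Toy illustration of memory fragmentation (this is not IOS code): two
# processes allocate packet-sized blocks in alternating order, then one
# process frees its blocks. The total amount of free memory is large,
# but the largest contiguous free block is only one packet in size.

PACKET = 1500  # bytes; stand-in for a single packet-sized block

def simulate(allocations=1000):
    # Lay out the heap as an ordered list of (owner, size) blocks.
    heap = []
    for _ in range(allocations):
        heap.append(("proc_a", PACKET))
        heap.append(("proc_b", PACKET))

    # proc_a finishes and frees its blocks, leaving a hole between every
    # pair of proc_b blocks.
    heap = [("free" if owner == "proc_a" else owner, size)
            for owner, size in heap]

    free_total = sum(size for owner, size in heap if owner == "free")

    # Find the largest run of contiguous free blocks (there are no
    # adjacent free blocks here, which is exactly the problem).
    largest = current = 0
    for owner, size in heap:
        current = current + size if owner == "free" else 0
        largest = max(largest, current)

    print(f"total free memory: {free_total} bytes")
    print(f"largest contiguous free block: {largest} bytes")

if __name__ == "__main__":
    simulate()
```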

Here, what we were seeing was EIGRP allocating single-packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually requesting some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer was ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes more queries to be sent, resulting in the need for more replies… causing the work queue in the EIGRP process to attempt to allocate more packet-sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

  • EIGRP receives a set of queries to which it must respond
  • EIGRP allocates memory for each packet to build the responses
  • Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
  • EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
  • EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.
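
For those who like to see such loops in motion, here is a rough model of the cycle in Python. Nothing here is EIGRP code; the numbers and the one-reset-per-batch rule are invented purely to show how a single unanswered query amplifies itself once replies can no longer be built.

```python
# A rough, purely illustrative model of the loop above: replies cannot
# be built because no sufficiently large memory block exists, SIA timers
# expire in the neighbors, adjacencies reset, and every reset floods a
# fresh batch of queries.

ROUTES_PER_RESET = 50       # queries generated when one adjacency resets
CAN_ALLOCATE_REPLY = False  # the fragmented heap never yields a block

def run(rounds=5):
    pending = 1  # it only takes one query to start
    for r in range(1, rounds + 1):
        answered = pending if CAN_ALLOCATE_REPLY else 0
        unanswered = pending - answered
        # Roughly one adjacency reset per batch of unanswered queries;
        # each reset withdraws a table's worth of routes, generating a
        # new query for each of them.
        resets = max(1, unanswered // ROUTES_PER_RESET) if unanswered else 0
        pending = unanswered + resets * ROUTES_PER_RESET
        print(f"round {r}: unanswered={unanswered}, now pending={pending}")

if __name__ == "__main__":
    run()
```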

There are two basic problems with positive feedback loops. The first is that they are almost impossible to anticipate: the interaction surfaces between two systems only need to be deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some of the time) and opaque enough to prevent you from seeing the interaction (which is exactly what abstraction is supposed to do). The second is that, once started, they tend to sustain themselves until something breaks the loop. There are many ways to break a positive feedback loop. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIAs in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Troubleshooting: Half Split

The best models will support the second crucial skill required for troubleshooting: seeing the system as a set of problems to be solved. The problem/solution mindset is so critical to really understanding how networks work, and hence how to troubleshoot them, that Ethan Banks and I are writing an entire book around this concept. The essential points are these—

  • Understand the set of problems being solved
  • Understand a wide theoretical set of solutions for this problem, including how each solution interacts with other problems and solutions, potential side effects of using each solution, and where the common faults lie in each solution
  • Understand this implementation of this solution

Having this kind of information in your head will help you pull in detail where needed to fill in the models of each system; just as you cannot keep all four of the primary systems in your head at once, you also cannot effectively troubleshoot without a reservoir of more detailed knowledge about each system, or the ready ability to absorb more information about each system as needed. Having a problem/solution mindset also helps keep you focused in troubleshooting.

So you have built models of each system, and you have learned to think in terms of problems and solutions. What about technique? This is, of course, the step that everyone wants to jump to first—but I would strongly suggest that the technique of troubleshooting goes hand in hand with the models and the mindset of troubleshooting. If you do not have the models and the mindset, the technique is going to be worthless.

In terms of technique, I have tried many through the years, but the most effective is still the one I learned in electronics. This is the third crucial skill for troubleshooting: half split, measure, and move. This technique actually consists of several steps—

  • Trace out the entire path of the signal. This is the first place where your knowledge of the system you are troubleshooting comes into play; if you cannot trace the path of a signal (a flow, or data, or even the way data is passed between layers in the network), then you cannot troubleshoot effectively.
  • Find a halfway point in the path of the signal. This halfway point should ideally be in a place where you can easily measure state, while still effectively splitting the entire signal path in half. One mistake rookies make in troubleshooting is to start with the easiest place to measure, or the points in the flow they understand the best. If you don’t understand some part of the data path, then you need to learn it rather than avoid it. Again, this is where system knowledge is going to be crucial. If you have inaccurate models of the system in your head, this is where you are going to fail in your endeavor to troubleshoot the problem.
  • Measure the signal at this halfway point. If the signal is correct, move closer to the tail of the signal path. If the signal is incorrect, move closer to the source of the signal.

The half split method is time proven across many different fields, from electronics and electrical work to building and fluid systems. It might seem like a good idea just to “jump to what I know,” but this is a mistake.
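
To make the technique concrete, here is a minimal sketch of the half split loop in Python. The path, the fault location, and the `signal_ok` probe are all hypothetical stand-ins for whatever system you are actually measuring.

```python
# A minimal sketch of half-split troubleshooting: a binary search over an
# ordered list of measurement points along a signal path. The path and the
# probe below are made-up stand-ins for whatever you are actually tracing
# (a cable, a flow through a network, data passed between layers).

from typing import Callable, List

def half_split(path: List[str], signal_ok: Callable[[str], bool]) -> str:
    """Return the first point in the path where the signal is bad.

    Assumes the signal is good at the source, bad at the tail, and stays
    bad once it goes bad.
    """
    lo, hi = 0, len(path) - 1  # lo: last known-good, hi: known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if signal_ok(path[mid]):
            lo = mid  # signal still good here: the fault is downstream
        else:
            hi = mid  # signal already bad here: the fault is upstream
    return path[hi]

# Example: a made-up path with a fault just past "splice-2".
path = ["shelter", "splice-1", "splice-2", "field-end", "wind-bird"]
fault_after = "splice-2"

def signal_ok(point: str) -> bool:
    # Pretend measurement: the signal is good up to and including the fault point.
    return path.index(point) <= path.index(fault_after)

print(half_split(path, signal_ok))  # -> "field-end"
```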

For instance, one time I was called out with another tech from my shop to work on the FPS-77 storm detection radar. There was some problem in the transmitter circuit; the transmitter just was not producing power. There was a resistor in the “right area” that blew all the time, so we checked that resistor, and sure enough, it seemed to be shorted. We ordered another resistor, shut things down, and went home for the morning (by the time we finished working on this, it was around 3AM). The next day, the part came in and was installed by someone else. The resistor promptly showed a short again, and the radar system failed to come back up.

What went wrong? I checked what was simple to check and what was a common problem, and walked away thinking I had found the problem, that’s what. It took another day’s worth of troubleshooting to actually pin the problem down: a component that was in parallel with the original resistor, but not on the same board, or even in the same area of the schematics, had shorted out. The resistor showed a short because it was in parallel with another component that was actually shorted.

Lesson learned: do not take shortcuts, do not assume the part you can easily test is the part that is broken, and do not assume you have found the problem the first time you find something that does not look right. Make certain you try to falsify your theory, instead of just trying to prove it.

This, then, is the troubleshooting model I have developed across many years of actually working in and around some very complex systems. To reiterate—

  • Build models of the system and all its subsystems that are as accurate as possible, particularly of the business, the applications, the protocols, and the equipment. This is probably where most failures to effectively troubleshoot problems occur, and the step that takes the longest to complete. In fact, it is probably a truism to say that no one ever really completes this step, as there is always more to learn about every system, and more accurate ways to model any given system.
  • Have a problem/solution mindset. This is probably the second most common failure point in the troubleshooting process.
  • Half split, measure, and move.

Troubleshooting: Models

How well can you know each of these four systems? Can you actually know them in fine detail, down to the last packet transmitted and the last bit in each packet? Can you know the flow of every packet through the network, and every piece of information any particular application pushes into a packet, or the complete set of ever changing business requirements?

Obviously the answer to these questions is no. As these four components of the network combine, they create a system that suffers from combinatorial explosion. There are far too many combinations, and far too many possible states, for any one person to actually know all of them.

How can you reduce the amount of information to something a reasonable human can keep in their mind? The answer—as it is with most problems related to having too much information—is abstraction. But what does abstraction really mean? It means you build a model of the system and interact with the system through the model, rather than trying to keep all the information about every subsystem, and how the subsystems interact, in your head. So for each subsystem of the entire system, you have a model you use, during the troubleshooting process, to understand what the system should be doing, and to figure out why it might not be doing what it should.

This has some interesting implications, of course. For instance, when a system is a “black box,” which means you are not supposed to know how the system works, your ability to troubleshoot the system itself is non-existent, and your ability to troubleshoot any larger system of which the black box is a component is severely hampered.

In fact, I will go so far as to say that having accurate models of a system, and each of the components of that system, is one of the three critical skills you need to develop if you are going to be effective at troubleshooting. This is a crucial skill that most people skip—to their disadvantage when it is actually 2AM, the network is down, and they are trying to trace a flow through the topology to figure out what is going on.

There are two subcomponents to this first skill:

  • The better your model of the operation of a system, the better you will be at troubleshooting the system when problems arise.
  • The more models you have of how “generic” systems in the space work, the faster you will be able to fit a model to a specific system, to understand its operation, and hence to troubleshoot problems within the system

You should strive, then, to ensure that your stock of models is accurate and broad. For instance—

  • Monitoring a network over a number of years will help you understand the network itself better, and hence build a more accurate model of that specific network
  • Monitoring, and working on/in, a large number of networks over a number of years will help you build models of general network operation in a wide array of environments, and help you understand what questions to ask and where to look for the answers
  • Learning the theoretical models of how networks actually work will help you answer the “why” questions that will, in turn, deepen your understanding of the operation of any particular network

Notice I have used the word network in the points above, but these same points are true of every subsystem in the overall system—businesses, applications, protocols, and equipment. Each model is a different perspective, a different lens through which you can see a particular problem; the faster you can shift between models, and combine these models to form a complete picture, the faster you will be able to figure the problem out and solve it—and the more likely you are to solve problems in a way that does not accrue technical debt over the long term.

As an example of the problem with inaccurate, or “not totally useful,” models, I would like to use the OSI model. Every network engineer in the world memorizes this model, and we all talk in terms of its layers. But how useful is the OSI model, really? In real world troubleshooting, the concept of layers is useful, but the specific layers laid out by the OSI model do not tend to be all that useful. I would suggest the Recursive InterNetwork Architecture (RINA) is a better model for understanding the way traffic flows through a network.

Troubleshooting: Basics

It’s 2AM, the network is down, and the CEO is on the phone asking when it is going to be back up—the overnight job crucial to the business opening in the morning has failed, and the company stands to lose millions of dollars if the network is not fixed in the next hour or so. Almost every network engineer has faced this problem at least once in their career, often involving intense bouts of troubleshooting.

And yet—troubleshooting is a skill that is hardly ever taught. There are a number of computer science programs that do include classes in troubleshooting, but these tend to be focused mostly on tools rather than technique, or on practical skill application. I was also trained in troubleshooting many years ago, as a young recruit in the United States Air Force—but that training was, again, practical in its bent, with very few theoretical components.

Note to readers: I wrote a short piece on troubleshooting here on rule11, but I have taken that piece down and replaced it with this short series on the topic. I did start writing a book on this topic many years ago, but my co-authors and I soon discovered troubleshooting was going to be a difficult topic to push into a book form. There are a number of presentations in this area, as well, but here I am trying to put my metacognitive spin on the problem more fully, after spending some time researching the topic.

It is always best to begin at the beginning: again, it is 2AM, and there is some problem in the network that is preventing specific, necessary work from getting done. It is tempting to begin with the problem, but it is actually important to back up one step and define what it means for a system to be “broken.” In other words, the first question you need to ask when faced with a broken system is: what should this system be doing?

“Well, it should be transporting this application traffic over there, so this application will work.” It probably should, but I would suggest this is far too narrow of a view of the word “working” to be useful in a more broad sense.

For instance, one of the pieces of equipment on the flightline was a wind speed indicator. This is a really fancy name for a really simple device: there was a small “bird” attached to the top of a pole, with a tail that would guide the bird into facing the wind, and at the nose of the bird an impeller attached to a DC motor. The DC motor drove a simple DC voltmeter that was graduated in wind speeds, and the entire system was calibrated using a resistive bridge in the wind speed indicator box, and another in the wind bird itself. The power from the impeller was passed to the voltmeter, several miles away, through a 12 gauge cable. These cables were particularly troublesome, as they were buried, and had to be spliced using gel-coated connectors, with the splices buried in gel-filled casings. This was all before the advent of nitrogen-filled conduit to keep water out.

In one particular instance, a splice failed, resulting in us digging the cable up by hand and opening the splice. A special team was called in to resplice the cable, but even with the new splice in place, the wind speed system could not be calibrated to work correctly. The cable team argued that the cable had all the right voltage and resistance readings; we argued back that the equipment had been working correctly before the splice failed, and it all tested okay on the bench, so the problem must still be in that splice. The argument lasted for days.

From the view of the cable team, their “system” was working properly. From the perspective of the weather techs, it was not. Who was right? It all came down to this: what does the “system” consist of, and what does “working properly” really mean? Eventually, by the way, the cable splice was fingered as the problem by a capacitive crosstalk test. The splice was redone, and the problem disappeared.

At 2AM, it is easy to think of the “system” as the network path the application runs over, and the application itself. Such a narrow view of the system can be damaging to your efforts to actually repair the problem. Instead, it is important to begin with an expansive view of the system. This should include:

  • The specifications to which the network was designed, and within which it needs to operate in order to fulfill business requirements
  • The requirements placed on the network by the applications the network is supporting
  • The operation of the protocols used to create the information needed to forward traffic through the network (including policy)
  • The operation of the software and hardware on each network (or forwarding) device in the network

In other words: specification + application + protocols + equipment. Again, this might be a little obvious, but it is easy to forget the entire picture at 2AM when the fires are burning hot, and you are trying to put them out.