I’ve reformatted and rebuilt my network troubleshooting live training for 2023, and am teaching it on the 26th of January (in three weeks). You can register at Safari Books Online. From the site:
The first way to troubleshoot faster is to not troubleshoot at all, or to build resilient networks. The first section of this class considers the nature of resilience, and how design tradeoffs result in different levels of resilience. The class then moves into a theoretical understanding of failures, how network resilience is measured, and how the Mean Time to Repair (MTTR) relates to human and machine-driven factors. One of these factors is the unintended consequences arising from abstractions, covered in the next section of the class.
The class then moves into troubleshooting proper, examining the half-split formal troubleshooting method and how it can be combined with more intuitive methods. This section also examines how network models can be used to guide the troubleshooting process. The class then covers two examples of troubleshooting reachability problems in a small network, and considers using ChaptGPT and other LLMs in the troubleshooting process. A third, more complex example is then covered in a data center fabric.
A short section on proving causation is included, and then a final example of troubleshooting problems in Internet-level systems.
Have you ever taught a kid to ride a bike? Kids always begin the process by shifting their focus from the handlebars to the pedals, trying to feel out how to keep the right amount of pressure on each pedal, control the handlebars, and keep moving … so they can stay balanced. During this initial learning phase, the kid will keep their eyes down, looking at the pedals, the handlebars, and . . . the ground.
After some time of riding, though, managing the pedals and handlebars are embedded in “muscle memory,” allowing them to get their head up and focus on where they’re going rather than on the mechanical process of riding. After a lot of experience, bike riders can start doing wheelies, or jumps, or off-road riding that goes far beyond basic balance.
Network engineer—any kind of engineering, really—is the same way.
I sometimes reference Keith’s Law in my teaching, but I don’t think I’ve ever explained it. Keith’s Law runs something like this:
Any large external step in a system’s capability is the result of many incremental changes within the system.
We tend to think every technology and every product is roughly unique—so we tend to stay up late at night looking at packet captures and learning how to configure each product individually, and chasing new ones as if they are the brightest new idea (or, in marketing terms, the best thing since sliced bread). Reality check: they aren’t. This applies across life, of course, but especially to technology.
I cannot count the number of times I’ve heard someone ask these two questions—
- What are other people doing?
- What is the best common practice?
While these questions have always bothered me, I could never really put my finger on why. I ran across a journal article recently that helped me understand a bit better. The root of the problem is this—what does best common mean, and how can following the best common produce a set of actions you can be confident will solve your problem?
Those who follow my work know I’ve been focused on building live webinars for the last year or two, but I am still creating pre-recorded material for Pearson. The latest is built from several live webinars which I no longer give; I’ve updated the material and turned them into a seven-hour course called How Networks Really Work. Although I begin here with the “four things,” the focus is on a problem/solution view of routed control planes.
I began writing this post just to remind readers this blog does have a number of RSS feeds—but then I thought … well, I probably need to explain why that piece of information is important.
The amount of writing, video, and audio being thrown at the average person today is astounding—so much so that, according to a lot of research, most people in the digital world have resorted to relying on social media as their primary source of news. Why do most people get their news from social media? I’m pretty convinced this is largely a matter of “it saves time.” The resulting feed might not be “perfect,” but it’s “close enough,” and no-one wants to spend time seeking out a wide variety of news sources so they will be better informed.
Decision making, especially in large organizations, fails in many interesting ways. Understanding these failure modes can help us cope with seemingly difficult situations, and learn how to make decisions better. On this episode of the Hedge, Frederico Lucifredi, Ethan Banks, and Russ White discuss Frederico’s thoughts on developing a taxonomy of indecision. You can find his presentation on this topic here.
Jack of all trades, master of none.
This singular saying—a misquote of Benjamin Franklin (more on this in a moment)—is the defining statement of our time. An alternative form might be the fox knows many small things, but the hedgehog knows one big thing.
When we think of automation—and more broadly tooling—we tend to think of automating the configuration, monitoring, and (possibly) the monitoring of a network. On the other hand, a friend once observed that when interviewing coders, the first thing he asked was about the tools they had developed and used for making themselves more efficient. This “self-tooling” process turns out to be important not just to be more efficient at work, but to use time more effectively in general. Join Nick Russo, Eyvonne Sharp, Tom Ammon, and Russ White as we discuss self-tooling.