Modern Network Troubleshooting

I’ve reformatted and rebuilt my network troubleshooting live training for 2023, and am teaching it on the 26th of January (in three weeks). You can register at Safari Books Online. From the site:

The first way to troubleshoot faster is to not troubleshoot at all, or to build resilient networks. The first section of this class considers the nature of resilience, and how design tradeoffs result in different levels of resilience. The class then moves into a theoretical understanding of failures, how network resilience is measured, and how the Mean Time to Repair (MTTR) relates to human and machine-driven factors. One of these factors is the unintended consequences arising from abstractions, covered in the next section of the class.

The class then moves into troubleshooting proper, examining the half-split formal troubleshooting method and how it can be combined with more intuitive methods. This section also examines how network models can be used to guide the troubleshooting process. The class then covers two examples of troubleshooting reachability problems in a small network, and considers using ChaptGPT and other LLMs in the troubleshooting process. A third, more complex example is then covered in a data center fabric.

A short section on proving causation is included, and then a final example of troubleshooting problems in Internet-level systems.