Learning from the Post-Mortem

Post-mortem reviews seem to be quite common on the software engineering and application development sides of the IT world, but I do not recall many post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM last week about how to learn effectively from the post-mortem exercise.

The common pattern seems to be setting aside a one-hour meeting, inviting a lot of people, and trying to shift blame without actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now: fix the problem, not the blame!). Then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan never to do that again.

In a few months’ time, the same team will be in the same room, draw the same drawings, and say the same things all over again. At least that is the way it seems to me. If there is an effective post-mortem process in use by a company someplace, I do not think I have seen it.

From the article—

Are we missing anything in this prevalent rinse-and-repeat cycle of how the industry generally addresses incidents that could be helpful? Put another way: As we experience incidents, work through them, and deal with their aftermath, if we set aside incident-specific, and therefore fundamentally static, remediation items, both in technology and process, are we learning anything else that would be useful in addressing and responding to incidents? Can we describe that knowledge? And if so, how would we then make use of it to leverage past pain and improve future chances at success?

I tend to think, from the few times I have seen network post-mortems performed, that the reason they do not work well is that we slip into the same appliance/configuration frame of mind so quickly. We want to understand which configuration was entered incorrectly, or which defect should be reported back to the vendor, rather than thinking about organizational and process changes. The smaller the detail, the safer the conclusions, after all: aim small, miss small, as we say in the shooting world.

We focus so much on mean time to innocence, and how to create a practically perfect process that will never fail, that we fail to do the one thing we should be doing: learning.

Okay, so enough whining—what can be done about this situation? A few practical suggestions come to mind. These are not, of course, well-thought-out solutions, but rather, perhaps, “part of the solution.”

Rather than trying to figure out the root cause, spend that precious hour of post-mortem time mapping out three distinct workflows. The first should be the process that set up the failure. What drove the installation of this piece of hardware or software? What drove the deployment of this protocol? How did we get to the place where this failure had that effect? Once this is mapped out, see if there is anything in that process, or even in the political drivers and commitments made during that process, that could or should be modified to really change the way technology is deployed in your network.

The second workflow you should map out covers the steps taken to detect the problem. Dwell time, the gap between a failure occurring and that failure being detected, is a huge problem in modern networks. You should constantly focus on bringing dwell time down while paying close attention to the collateral damage of false positives. Mapping out how this failure was detected, and where it should have been caught sooner, can help improve telemetry systems, ultimately decreasing the mean time to repair (MTTR).
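To make those two numbers concrete, here is a minimal sketch of pulling dwell time and MTTR out of incident timestamps; the record fields and the sample incidents are hypothetical, not taken from any particular monitoring system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    # Hypothetical fields; adjust to whatever your incident tracker records.
    failure_start: datetime   # when the failure actually began
    detected: datetime        # when telemetry or a human first noticed it
    resolved: datetime        # when service was restored

    @property
    def dwell_time(self) -> timedelta:
        """Time the failure sat undetected in the network."""
        return self.detected - self.failure_start

    @property
    def time_to_repair(self) -> timedelta:
        """Full outage duration, from failure to restoration."""
        return self.resolved - self.failure_start

def summarize(incidents: list[Incident]) -> None:
    """Print average dwell time and MTTR, in minutes, across a set of incidents."""
    avg_dwell = mean(i.dwell_time.total_seconds() for i in incidents) / 60
    mttr = mean(i.time_to_repair.total_seconds() for i in incidents) / 60
    print(f"average dwell time: {avg_dwell:.1f} minutes")
    print(f"MTTR:               {mttr:.1f} minutes")

if __name__ == "__main__":
    # Two made-up incidents purely for illustration.
    summarize([
        Incident(datetime(2023, 3, 1, 2, 10), datetime(2023, 3, 1, 2, 55),
                 datetime(2023, 3, 1, 4, 20)),
        Incident(datetime(2023, 3, 9, 14, 0), datetime(2023, 3, 9, 14, 12),
                 datetime(2023, 3, 9, 15, 5)),
    ])
```

Tracking these two figures across successive post-mortems is one simple way to tell whether the detection workflow is actually improving.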

The third, and final, workflow you should map out is the troubleshooting process itself. People rarely map out their troubleshooting process for later reference, but this little trick, which I learned back in my tube-type electronics days, used to save me hours of time in the field. As you troubleshoot, make a flowchart. Record what you checked, why you checked it, how you checked it, and what you learned from the check. This flowchart, or workflow, is precious material in the post-mortem process. What can you instrument, or make easier to find, to reduce troubleshooting time in the next go-round? How can you traverse the network and find the root cause faster next time? These are crucial questions you can only answer with the help of a troubleshooting workflow.
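As a rough illustration of what capturing that workflow might look like, each check can be logged as a small record and then dumped as a Graphviz flowchart for the review; the field names, the linear chart, and the sample entries here are my own assumptions rather than any standard format.

```python
from dataclasses import dataclass, field

@dataclass
class Check:
    # Hypothetical fields: what was checked, why, how, and what it showed.
    what: str
    why: str
    how: str
    learned: str

@dataclass
class TroubleshootingLog:
    checks: list[Check] = field(default_factory=list)

    def record(self, what: str, why: str, how: str, learned: str) -> None:
        """Append one troubleshooting step as it happens."""
        self.checks.append(Check(what, why, how, learned))

    def to_dot(self) -> str:
        """Render the recorded steps as a simple linear Graphviz flowchart."""
        lines = ["digraph troubleshooting {", "  node [shape=box];"]
        for i, c in enumerate(self.checks):
            label = f"{c.what}\\nhow: {c.how}\\nlearned: {c.learned}"
            lines.append(f'  step{i} [label="{label}"];')
            if i > 0:
                lines.append(f"  step{i - 1} -> step{i};")
        lines.append("}")
        return "\n".join(lines)

if __name__ == "__main__":
    # Made-up example of a short troubleshooting session.
    log = TroubleshootingLog()
    log.record("BGP session to edge router", "routes missing from the RIB",
               "show bgp summary", "session down, hold timer expired")
    log.record("interface counters on the uplink", "rule out a physical fault",
               "show interface", "CRC errors incrementing")
    print(log.to_dot())
```

Running the output through dot -Tpng (or any other Graphviz renderer) produces a picture the team can mark up together during the post-mortem.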

I don’t know whether you already do post-mortems, or how valuable you think they are, but I would suggest they can be, and are, quite useful, so long as you get out of the narrow details and focus on systems and workflows. Aim small here and you will miss big; aim big, and you will either hit the target or, at worst, miss small.