Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.
The common pattern seems to be setting aside a one hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now—fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.
In a few months’ time, the same team will be in the same room, draw the same drawings, and say the same things all over again. At least that is the way it seems to me. If there is an effective post-mortem process in use by a company someplace, I do not think I have seen it.
From the article—
Are we missing anything in this prevalent rinse-and-repeat cycle of how the industry generally addresses incidents that could be helpful? Put another way: As we experience incidents, work through them, and deal with their aftermath, if we set aside incident-specific, and therefore fundamentally static, remediation items, both in technology and process, are we learning anything else that would be useful in addressing and responding to incidents? Can we describe that knowledge? And if so, how would we then make use of it to leverage past pain and improve future chances at success?
I tend to think, from the few times I have seen network post-mortems performed, that the reason they do not work well is because we slip into the same appliance/configuration frame of mind so quickly. We want to understand what configuration was entered incorrectly, or what defect should be reported back to the vendor, rather than thinking about organizational and process changes. The smaller the detail, the safer the conclusions, after all—aim small, miss big, is what we say in the shooting world.
We focus so much on mean time to innocence, and how to create a practically perfect process that will never fail, that we fail to do the one thing we should be doing: learning.
Okay, so enough whining—what can be done about this situation? A few practical suggestions come to mind. These are not, of course, well-thought-out solutions, but rather, perhaps, “part of the solution.”
Rather than trying to figure out the root cause, spend that precious hour of post-mortem time mapping out three distinct workflows. The first should be the process that set up the failure. What drove the installation of this piece of hardware or software? What drove the deployment of this protocol? How did we get to the place where this failure had that effect? Once this is mapped out, see if there is anything in that process, or even in the political drivers and commitments made during that process, that could or should be modified to really change the way technology is deployed in your network.
The second process you should map out is the steps taken to detect the problem. Dwell time is a huge problem in modern networks—the time between a failure occurring and being detected. You should constantly focus on bringing dwell time down while paying close attention to the collateral damage of false positives. Mapping out how this failure was detected, and where it should have been caught sooner, can help improve telemetry systems, ultimately decreasing MTTR.
The third, and final, workflow you map out should be the troubleshooting process itself. People rarely map out their troubleshooting process for later reference, but this little trick I learned from way back in tube-type electronics days used to save me hours of time in the field. As you troubleshoot, make a flow chart. Record what you checked, why you checked it, how you checked it, and what you learned from the check. This flowchart, or workflow, is precious material in the post-mortem process. What can you instrument, or make easier to find, to reduce troubleshooting time in the next go-round? How can you traverse the network and find the root cause faster next time? These are crucial questions you can only answer with the use of a troubleshooting workflow.
I don’t know if you already do post-mortems or not, or how valuable you think they are—but I would suggest they can be, and are, quite useful. So long as you get out of the narrows and focus on systems and workflows. Aim small, miss big—but aim big and you’ll either hit the target or, at worst, miss small.
Ironies of Automation
In 1983 I was just joining the US Air Force, and still deeply involved in electronics (rather than computers). I had written a few programs in BASIC and assembler on a COCOII with a tape drive, and at least some of the electronics I worked on were used vacuum tube triodes, plate oscillators, and operational amplifiers. This was a magical time, though—a time when “things” were being automated. In fact, one of the reasons I left electronics was because the automation wave left my job “flat.” Instead of looking into the VOR shelter to trace through a signal path using a VOM (remember the safety L!) and oscilloscope, I could sit at a terminal, select a few menu items, grab the right part off the depot shelf, replace, and go home.
Maybe the newer way of doing things was better. On the other hand, maybe not.
What brings all this to mind is a paper from 1983 titled The Ironies of Automation. It might often seem, because of our arrogant belief that we can remake the world through disruption (was the barbarian disruption of Rome in 455 the good sort of disruption, or the bad sort?), we often think we can learn nothing from the past. Reality check: the past is prelude.
What can the past teach us about automation? This is as good a place to start as any other:
There are two general categories of task left for an operator in an automated system. He may be expected to monitor that the automatic system is operating correctly, and if it is not he may be expected to call a more experienced operator or to take-over himself. We will discuss the ironies of manual take-over first, as the points made also have implications for monitoring. To take over and stabilize the process requires manual control skills, to diagnose the fault as a basis for shut down or recovery requires cognitive skills.
This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.
In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why things are stacked the way they are, nor how to sort things out to get things running again.
This seems intuitive. When I’m running the mill through manual control, after I’ve been running it for a while (I’m out of practice right now), I can “sense” when I’m feeding too fast, meaning I need to slow down to prevent chatter from ruining the piece, or worse—a crash resulting in broken bits of bit flying all over the place.
How does this apply to network operations? On the one hand, it seems like once we automate all the things we will lose the skills of using the CLI to do needed things very quickly. I always say “I can look that command up,” but if I were back in TAC, troubleshooting a common set of problems every day, I wouldn’t want to spend time looking things up—I’d want to have the right commands memorized to solve the problem quickly so I can move to the next case.
This seems to argue against automation entirely, doesn’t it? Perhaps. Or perhaps it just means we need to look at the knowledge we need (and want) in a little different way (along with the monitoring systems we use to obtain that knowledge).
Humans think quick and slow. We either react based on “muscle memory,” or we must think through a situation, dig up the information we need, and weigh out the right path forward. When you are pulling a piece of stainless through a bit and the head starts to chatter, you don’t want to spend time assessing the situation and deciding what to do—you want to react.
But if you are working on an automated machine, and the bit starts to chatter, you might want to react differently. You might want to stop the process entirely and think through how to adjust the automated sequence to prevent the bit from chattering the next time through. In manual control, each work piece is important because each one is individually built. In the automated sequence, the work piece itself is subsumed within the process.
It isn’t that you know “less” in the automated process, it’s that you know different things. In the manual process, you can feel the steel under the blade, the tension and torque, and rely on your muscle memory to react when its needed. In the automated process, you need to know more about the actual qualities of the bit and metal under the bit, the mount, and the mill itself. You have to have more of an immediate sense of how things work if you are doing it manually, but you have to have more of a sense of the theory behind why things work the way if it is automated.
A couple of thoughts in this area, then. First, when we are automating things, we need to be very careful to assume there is no “fast thinking” when things ultimately do fail (it’s not if, it’s when). We need to think through what information we are collecting, and how that information is being presented (if you read the original paper, the author spends a great deal of time discussing how to present information to the operator to overcome the ironies she illuminates) so we take maximum advantage of the “slow path” in the human brain, and stop relying on the “fast path” so much. Second, as we move towards an automated world, we need to start learning, and teaching, more about why and less about how, so we can prepare the “slow path” to be more effective—because the slow path is the part of our thinking that’s going to get more of a workout.
Simon Weckhert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 95 cellphones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this group of cell phones as a very congested street—95 cars cannot even physically fit into the street he was walking down—and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.
The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world, because users are going to use wrenches as hammers, cell phones as if they were high-end cameras, and many other things in ways they were never intended. To make matters worse, users often “infer” the way something works, and adapt their actions to get what they want based on their inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, etc. Social media users do the same thing when posting or reading through their timeline, causing people to create novel and interesting ideas about how these things work that have no bearing on reality.
As folks who work in the world of networks, we often “reverse-engineer” a vendor product in much the same way drivers “reverse-engineer” roads and social media users “reverse-engineer” the news feed—we observe how it works in some circumstances, we read some of the documentation, we infer how it must work based on the information we have, and then we design around how we think it works. Sometimes this is a result of abstraction—the vendor has saved us from learning all the “technical details” to make our lives easier. And sometimes abstraction does make our lives easier—but sometimes abstraction makes our lives harder.
I’m reminded of a time I was working with a cable team to bring a wind speed/direction system back up. The system in question relied on several miles of 12c12 cable across which a low voltage signal was driven off a generator attached to an impeller. The folks working on the cable could “see” power flowing on the meter after their repair, so why wouldn’t it work?
In some cases, then, our belief about how these things work is completely wrong, and we end up designing precisely the wrong thing, or doing precisely the wrong thing to bring a failed network back on-line.
Folks involved in networks face this on the other side of the equation, as well—we supply application developers and business users with a set of abstractions they don’t’ really need to understand. In using them, however, they develop “folk theories” about how a network works, coming to conclusions that are often counter-productive to what they are trying to get done. The person in the airline lounge that tells you to reboot your system to see if the WiFi will work doesn’t really understand what the problem is, they just know “this worked once before, so maybe it will work now.”
There is nothing wrong per se with this kind of “reverse-engineering”—we’re going to encounter it every time we abstract things, and abstracting things is necessary to scale. On the other hand, we’re supposed to be the “engineer in the middle”—the person who knows how to relate to the vendor and the user, bridging the gap between product and service. That’s how we add value.
There are some places, like with vendor-supplied gear, that we are dealing with an abstraction we simply cannot rip the lid off. There are many times when we cannot learn the “innards” because there are 24 hours in a day, you cannot learn all that needs to be learned in the available timeframe, and there are times, as a human, that you need to back off and “do something else.” But… there are times when you really need to know what “lies beneath the abstraction”—how things really work.
I suspect the times when understanding “how it really works” would be helpful are very common—and that we would all live a world with a little less vendor hype during the day, and a lot less panic during the night, if we put a little more priority on learning how networks work.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton about the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit—
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy, aggregation, and core. If you’ve built your network with a decent modular design, you buy enough new routers (or switches—but let’s use routers here) to build out a new aggregation module, the additional firewalls and other middleboxes you need, and the additional line cards to scale the core up. You unit test everything you can in the lab, understanding that you will not be able to fully test in the product network until you arrange a maintenance window. If you’re automating things, you build (and potentially test) the scripts—if you are smart, you will test these scripts in a virtual environment before using them.
You arrange the maintenance window, install the hardware, and … run the scripts. If it works, you go to bed, take a long nap, and get back to work doing “normal maintenance stuff” the next day. Of course, it rarely works, so you preposition some energy bars, make certain you have daycare plans, and put the vendor’s tech support number on speed dial.
What’s wrong with this picture? Well, many things, but primarily: this is not engineering. Was there any thought put into how to test beyond the individual unit level? Is there any way to test realistic traffic flows while connecting the new module to the network without impacting the rest of the network’s operation? Is there any real rollback plan in case things go wrong? Can there be?
In “modern” network design, none of these things tend to exist because they cannot exist. They cannot exist because we have not truly learned to do design life-cycles or truly modular designs. In the software world, if you don’t do modular design, it’s either because you didn’t think it through, or because you thought it through and decided the trade-off just wasn’t worth it. In the networking world, we play around the edges of resilient, modular designs, but networking folks don’t tend to know the underlying technologies—and how they work—well enough to understand how to divide a problem into modules correctly, and the interfaces between those modules.
Let’s consider the same example, but with some engineering principles applied. Instead of a traditional two-layer hierarchy, you have a single-SKU spine and leaf fabric with clearly defined separation between the fabric and pods, clearly defined underlay and overlay protocols, etc. Now you can build a pod and test it against a “fake fabric” before attaching it to the production fabric, including any required automation. Then you can connect the pod to the production fabric and bring up just the underlay protocol, testing the entire underlay before pushing the overlay out to the edge. Then you can push the overlay to the edge and test that before putting any workload on the new pod. Then you can test fake load on the new pod before pushing production traffic onto the pod…
Each of these tests, other than the initial test against a lab environment, can take place on the production network with little or no risk to the entire system. You’re not physically modifying current hardware (except plugging in new cables!), so it’s easy to roll changes back. You know the lower layer parts work before putting the higher layer parts in place. Because the testing happens on the real network, these are canaries rather than traditional “certification” style tests. Because you have real modularization, you can fail fast without causing major harm to any system. Because you are doing things in stages, you can build tests that determine clean and correct operation before moving to the next stage.
This is an engineered solution—thought has been put into proper modules, how those modules connect, what information is carried across those modules, etc. Doing this sort of work requires knowing more than how to configure—or automate—a set of protocols based on what a vendor tells you to do. Doing this sort of work requires understanding what failure looks like at each point in the cycle and deciding whether to fail out or fix it.
It may not meet the “formal” process mathematicians might prefer, but neither is it the “move fast and break stuff” attitude many see in “the Valley.” It is fail fast, but not fail foolishly. And its where we need to move to retain the title of “engineer” and not lose the confidence of the businesses who pay us to build networks that work.
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while there are situations that are a zero-sum game, not all situations are. What we need is a way to “cut the middle” to realistically appraise each situation and realistically decide what the tradeoffs might be.
This is where the state/optimization/surface (SOS) model comes into play. You’ll find this model described in several of my books alongside some thoughts on complexity theory (see the second chapter here, for instance, or here), but I don’t spend a lot of time discussing how to apply this concept. The answer lies in the intersection between looking for tradeoffs and the SOS model.
TL;DR version: the SOS model tells you where you should look for tradeoffs.
Take the time-worn example of route aggregation, which improves the operation of a network by reducing the “blast radius” of changes in reachability. Combining aggregation with summarization (as is almost always the intent), it reduces the “blast radius” for changes in the network topology as well. The way aggregation and summarization reduce the “blast radius” is simple: if you define a failure domain as the set of devices which must somehow react to a change in the network (the correct way to define a failure domain, by the way), then aggregation and summarization reduce the failure domain by hiding changes in one part of the network from devices in some other part of the network.
Note: the depth of the failure domain is relevant, as well, but not often discussed; this is related to the depth of an interaction surface, but since this is merely a blog post . . .
According to SOS, route aggregation (and topology summarization) is a form of abstraction, which means it is a way of controlling state. If we control state, we should see a corresponding tradeoff in interaction surfaces, and a corresponding tradeoff in some form of optimization. Given these two pointers, we can search for your tradeoffs. Let’s start with interaction surfaces.
Observe aggregation is normally manually configured; this is an interaction surface. The human-to-device interaction surface now needs to account for the additional work of designing, configuring, maintaining, and troubleshooting around aggregation—these things add complexity to the network. Further, the routing protocol must also be designed to support aggregation and summarization, so the design of the protocol must also be more complex. This added complexity is often going to come in the form of . . . additional interaction surfaces, such as the not-to-stubby external conversion to a standard external in OSPF, or something similar.
Now let’s consider optimization. Controlling failure domains allows you to build larger, more stable networks—this is an increase in optimization. At the same time, aggregation removes information from the control plane, which can cause some traffic to take a suboptimal path (if you want examples of this, look at the books referenced above). Traffic taking a suboptimal path is a decrease in optimization. Finally, building larger networks means you are also building a more complex network—so we can see the increase in complexity here, as well.
Experience is often useful in helping you have more specific places to look for these sorts of things, of course. If you understand the underlying problems and solutions (hint, hint), you will know where to look more quickly. If you understand common implementations and the weak points of each of those implementations, you will be able to quickly pinpoint an implementation’s weak points. History might not repeat itself, but it certainly rhymes.
I have spent many years building networks, protocols, and software. I have never found a situation where the SOS model, combined with a solid knowledge of the underlying problems and solutions (or perhaps technologies and implementations used to solve these problems) have led me astray in being able to quickly find the tradeoffs so I could see, and then analyze, them.
Network engineers do not need to become full-time coders to succeed—but some coding skills are really useful. In this episode of the Hedge, David Barrosso (you can find David’s github repositories here), Phill Simonds, and Russ White discuss which programming skills are useful for network engineers.
Raise your hand if you think moving to platform as a service or infrastructure as a service is all about saving money. Raise it if you think moving to “the cloud” is all about increasing business agility and flexibility.
Put your hand down. You’re wrong.
Before going any further, let me clarify things a bit. You’ll notice I did not say software as a service above—for good reason. Move email to the cloud? Why not? Word processing? Sure, word processing is (relatively) a commodity service (though I’m always amazed at the number of people who say “word processor x stinks,” opting to learn complex command sets to “solve the problem,” without first consulting a user manual to see if they can customize “word processor x” to meet their needs).
What about supporting business-specific, or business-critical, applications? You know, the ones you’ve hired in-house developers to create and curate?
Will you save money by moving these applications to a platform as a service? There is, of course, some efficiency to be gained. It is cheaper for a large-scale manufacturer of potato chips to make a bag of chips than for you to cook them in your own home. They have access to specialized slicers, fryers, chemists, and even special potatoes (with more starch than the ones you can buy in a grocery store). Does this necessarily mean that buying potato chips in a bag is always cheaper? In other words, does the manufacturer pass all these savings on to you, the consumer? To ask the question is to know the answer.
And once you’ve turned making all your potato chips over to the professionals, getting rid of the equipment needed to make them, and letting the skill of making good potato chips atrophy, what is going to happen to the price? Yep, thought so.
This is not to say cost is not a factor. Rather, the cost of supporting customized applications on the cloud or local infrastructure needs to be evaluated on a case-by-case basis—either might be cheaper than the other, and the cost of both will change over time.
Does using the cloud afford you more business flexibility? Sometimes, yes. And sometimes, no. Again, the flexibility benefit normally comes from “business agnostic” kinds of flexibility. The kind of flexibility you need to run your business efficiently may, or may not, be the same as the majority of other business. Moving your business to another cloud provider is not always as simple as it initially seems.
The cost and flexibility benefit come from relatively customer-agnostic parts of the business models. To that extent, you rely more on them than they rely on you. Yes, you can vote with your feet if the mickey is taken, but if we’re honest, this kind of supply is almost as inelastic as your old IT service deal. There are few realistic options for supply at scale, and the act of reversing out of a big contract, selecting a new supplier, and making the operational switch can bleed any foreseeable benefits out of a change—something all parties in the procurement process know too well.
So… saving money is sometimes a real reason to outsource things. In some situations, flexibility or agility is going to be a factor. But… there is a third factor I have not mentioned yet—probably the most important, but almost never discussed. Risk aversion.
Let’s be honest. For the last twenty years we network engineers have specialized in building extremely complex systems and formulating the excuses required when things don’t go right. We’ve specialized in saying “yes” to every requirement (or even wish) because we think that by saying “yes” we will become indispensable. Rather than building platforms on which the business can operate, we’ve built artisanal, complex, pets that must be handled carefully lest they turn into beasts that devour time and money. You know, like the person who tries to replicate store-bought chips by purchasing expensive fryers and potatoes, and ends up just making a mess out of the kitchen?
If you want to fully understand your infrastructure, and the real risk of complexity, you need to ask about risk, money, and flexibility—all three. When designing a network, or modifying things to deploy a new service onto an existing network, you need to think about risk as well as cost and flexibility.
How do you manage risk? Sarah Clarke, in the article I quoted above, gives us a few places to start (which I’ve modified to fit the network engineering world). First, ask the question about risk. Don’t just ask “how much money is this going to cost or save,” ask “what risk is being averted or managed here?” You can’t ever think the problem through if you don’t ever ask the question. Second, ask about how you are going to assess the solution against risk, money, and flexibility. How will you know if moving in a particular direction worked? Third, build out clear demarcation points. This is both about the modules within the system as well as responsibilities.
Finally, have an escalation plan. Know what you are going to do when things go wrong, and when you are going to do it. Think about how you can back out of a situation entirely. What are the alternatives? What does it take to get there? You can’t really “unmake” decisions, but you can come to a point where you realize you need to make a different decision. Know what that point is, and at least have the information on hand to know what decision you should make when you get there.
But first, ask the question. Risk aversion drives many more decisions than you might think.