Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.
The common pattern seems to be setting aside a one-hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now: fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.
This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.
In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why things are stacked the way they are, nor how to sort things out to get things running again.
Simon Weckert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 95 cell phones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this group of cell phones as a very congested street (95 cars cannot even physically fit into the street he was walking down) and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.
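To make the mechanism concrete, here is a minimal sketch of the kind of rule that could be fooled this way. It assumes, purely for illustration, a traffic service that marks a street as congested when enough distinct devices report crawling along it. The names `Report`, `is_congested`, and `wagon_reports`, and both thresholds, are hypothetical; nothing here describes how Google Maps actually works.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
SLOW_KMH = 10.0        # below this speed a device counts as "crawling"
JAM_DEVICE_COUNT = 20  # this many crawling devices marks a segment as jammed


@dataclass
class Report:
    """One position report from a device on a road segment."""
    device_id: str
    segment: str
    speed_kmh: float


def is_congested(reports, segment):
    """Naive rule: a segment is congested when enough distinct devices crawl along it."""
    crawling = {
        r.device_id
        for r in reports
        if r.segment == segment and r.speed_kmh < SLOW_KMH
    }
    return len(crawling) >= JAM_DEVICE_COUNT


def wagon_reports(segment, phones, walking_speed_kmh=4.0):
    """Reports produced by a wagon of phones pulled at walking pace."""
    return [Report(f"phone-{i}", segment, walking_speed_kmh) for i in range(phones)]


# A single pedestrian's phone does not trip the rule.
one_phone = wagon_reports("quiet-street", phones=1)
# A wagon carrying 95 phones is indistinguishable from a traffic jam.
wagon = wagon_reports("quiet-street", phones=95)

print(is_congested(one_phone, "quiet-street"))
print(is_congested(wagon, "quiet-street"))
```

The rule does exactly what it was written to do. What fails is the unstated assumption behind it, that one slow phone means one slow car.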
The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world, because users are going to use wrenches as hammers, cell phones as if they were high-end cameras, and many other things in ways they were never intended. To make matters worse, users often “infer” the way something works, and adapt their actions to get what they want based on their inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, etc. Social media users do the same thing when posting or reading through their timeline, causing people to create novel and interesting ideas about how these things work that have no bearing on reality.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton and the time she spent working on software development for the Apollo space program, along with the lessons she learned there. To wit:
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center, which is designed on a rather traditional two-layer hierarchy of aggregation and core.
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while there are situations that are a zero-sum game, not all situations are. What we need is a way to “cut the middle” to realistically appraise each situation and realistically decide what the tradeoffs might be.
Network engineers do not need to become full-time coders to succeed, but some coding skills are really useful. In this episode of the Hedge, David Barroso (you can find David’s GitHub repositories here), Phill Simmonds, and Russ White discuss which programming skills are useful for network engineers.
Raise your hand if you think moving to platform as a service or infrastructure as a service is all about saving money. Raise it if you think moving to “the cloud” is all about increasing business agility and flexibility.
Put your hand down. You’re wrong.
Let’s be honest. For the last twenty years we network engineers have specialized in building extremely complex systems and formulating the excuses required when things don’t go right. We’ve specialized in saying “yes” to every requirement (or even wish) because we think that by saying “yes” we will become indispensable. Rather than building platforms on which the business can operate, we’ve built artisanal, complex pets that must be handled carefully lest they turn into beasts that devour time and money. You know, like the person who tries to replicate store-bought chips by purchasing expensive fryers and potatoes, and ends up just making a mess out of the kitchen?
The state of automation among enterprise operators has been a matter of some interest this year, with several firms undertaking studies of the space. Juniper, for instance, recently released the first yearly edition of the SONAR report, which surveyed many network operators to set a baseline for a better future understanding of how automation is being used. Another recent report in this area is Enterprise Network Automation for 2020 and Beyond, conducted by Enterprise Management Associates.
While these reports are, themselves, interesting for understanding the state of automation in the networking world, one correlation noted on page 13 of the EMA report caught my attention: “Individuals who primarily engage with automation as users are less likely to fully trust automation.” This observation is set in parallel with two others on that same page: “Enterprises that consider network automation a high priority initiative trust automation more,” and “Individuals who fully trust automation report significant improvement in change management capacity.” It seems somewhat obvious these three are related in some way, but how? The answer to this, I think, lies in the relationship between the person and the tool.
For any field of study, there are some mental habits that will make you an expert over time. Whether you are an infrastructure architect, a network designer, or a network reliability engineer, what habits of mind mark out expertise among those who build and operate networks?
Experts involve the user
Experts don’t just listen to the user, they involve the user. This means taking the time to teach the developer or application owner how their applications interact with the network, showing them how their applications either simplify or complicate the network, and the impact of these decisions on the overall network.
Experts think about data
They think about data rather than applications. What does the data look like? How does the business use the data? Where does the data need to be, when does it need to be there, how often does it need to go, and what is the cost of moving it? What might be in the data that can be harmful? How can I protect the data while at rest and in flight?