This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.
In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why things are stacked the way they are, nor how to sort things out to get things running again.
Simon Weckhert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 95 cellphones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this group of cell phones as a very congested street—95 cars cannot even physically fit into the street he was walking down—and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.
The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world, because users are going to use wrenches as hammers, cell phones as if they were high-end cameras, and many other things in ways they were never intended. To make matters worse, users often “infer” the way something works, and adapt their actions to get what they want based on their inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, etc. Social media users do the same thing when posting or reading through their timeline, causing people to create novel and interesting ideas about how these things work that have no bearing on reality.
What is the best way to build a large-scale network—in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possible can everywhere you can. Or they might say use hierarchical design. These answers are, however, variants of the single word: abstraction.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton about the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit—
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy, aggregation, and core.
How many 9’s is your network? How about your service provider’s? Now, to ask the not-so-obvious question—why do you care? Does the number of 9’s actually describe the reliability of the network? According to Jeffery Mogul and John Wilkes, nines are not enough. The question is—while this paper was written for commercial relationships and cloud providers, is it something you can apply to running your own network? Let’s dive into the meat of the paper and find out.
While 5 9’s is normally given as a form of Service Level Agreement (SLA), there are two other measures of reliability a network operator needs to consider—the Service Level Objective (SLO), and the Service Level Indicator (SLI).
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while there are situations that are a zero-sum game, not all situations are. What we need is a way to “cut the middle” to realistically appraise each situation and realistically decide what the tradeoffs might be.
Raise your hand if you think moving to platform as a service or infrastructure as a service is all about saving money. Raise it if you think moving to “the cloud” is all about increasing business agility and flexibility.
Put your hand down. You’re wrong.
Let’s be honest. For the last twenty years we network engineers have specialized in building extremely complex systems and formulating the excuses required when things don’t go right. We’ve specialized in saying “yes” to every requirement (or even wish) because we think that by saying “yes” we will become indispensable. Rather than building platforms on which the business can operate, we’ve built artisanal, complex, pets that must be handled carefully lest they turn into beasts that devour time and money. You know, like the person who tries to replicate store-bought chips by purchasing expensive fryers and potatoes, and ends up just making a mess out of the kitchen?
If you are looking for a good resolution for 2020 still (I know, it’s a bit late), you can’t go wrong with this one: this year, I will focus on making the networks and products I work on truly simpler. . . We need to go beyond just figuring out how to make the user interface simpler, more “intent-driven,” automated, or whatever it is. We need to think of the network as a system, rather than as a collection of bits and bobs that we’ve thrown together across the years. We need to think about the modules horizontally and vertically, think about how they interact, understand how each piece works, understand how each abstraction leaks, and be able to ask hard questions.
Yep, it’s that time of year when everyone does “retrospective pieces…” So… why not? There were several notable events this year—first and foremost, I kicked off a new podcast called the Hedge for network engineers. It’s probably not going to make anyone’s “top ten list of must listen to podcasts” anytime soon (if ever), but it’s been a lot of fun to move out of the commercial podcast space and just talk about “whatever seems interesting.” The History of Networking podcast also became independent this year; we are chugging along at more than 60 episodes, and there are a lot of great guests yet to come.
On the personal front, I moved from LinkedIn to Juniper Networks, and made some progress at school. I have finished my coursework and passed my comprehensive exams, so I’m now a PhD candidate, or as it is more commonly known, ABD.
Rule11 has, as a blog, had a good year. The most popular posts were:
The state of automation among enterprise operators has been a matter of some interest this year, with several firms undertaking studies of the space. Juniper, for instance, recently released the first yearly edition of the SONAR report, which surveyed many network operators to set a baseline for a better future understanding of how automation is being used. Another recent report in this area is Enterprise Network Automation for 2020 and Beyond, conducted by Enterprise Management Associates.
While these reports are, themselves, interesting for understanding the state of automation in the networking world, one correlation noted on page 13 of the EMA report caught my attention: “Individuals who primarily engage with automation as users are less likely to fully trust automation.” This observation is set in parallel with two others on that same page: “Enterprises that consider network automation a high priority initiative trust automation more,” and “Individuals who fully trust automation report significant improvement in change management capacity.” It seems somewhat obvious these three are related in some way, but how? The answer to this, I think, lies in the relationship between the person and the tool.