There is no enterprise, there is no service provider—there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument—that hyperscaler’s problems are not your problems, the technologies and solutions providers user are fundamentally different than what enterprises require.
Let me try to recap some of the arguments I’ve heard used against my assertion.
Staring at the white line is fun at first, then mesmerizing, then it is frightening… then finally it is just plain dull. But let’s talk about the terrifying bit because it’s the scary stage that makes us all reject change out of fear for the future. And, trust me, a kid sitting in a car with no doors staring down at the white line while his uncle drives 60 miles-per-hour is going to be frightened from time to time.
Will cyber-insurance exist as a “separate thing” in the future? The authors largely answer in the negative. The pressures of “race to the bottom,” providing maximal coverage with minimal costs (which they attribute to the structure of the cyber-insurance market), combined with lack of regulatory clarity and inaccurate measurements, will probably end up causing cyber-insurance to “fold into” other kinds of insurance.
This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.
In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why things are stacked the way they are, nor how to sort things out to get things running again.
Simon Weckhert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 95 cellphones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this group of cell phones as a very congested street—95 cars cannot even physically fit into the street he was walking down—and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.
The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world, because users are going to use wrenches as hammers, cell phones as if they were high-end cameras, and many other things in ways they were never intended. To make matters worse, users often “infer” the way something works, and adapt their actions to get what they want based on their inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, etc. Social media users do the same thing when posting or reading through their timeline, causing people to create novel and interesting ideas about how these things work that have no bearing on reality.
What is the best way to build a large-scale network—in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possible can everywhere you can. Or they might say use hierarchical design. These answers are, however, variants of the single word: abstraction.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton about the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit—
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy, aggregation, and core.
How many 9’s is your network? How about your service provider’s? Now, to ask the not-so-obvious question—why do you care? Does the number of 9’s actually describe the reliability of the network? According to Jeffery Mogul and John Wilkes, nines are not enough. The question is—while this paper was written for commercial relationships and cloud providers, is it something you can apply to running your own network? Let’s dive into the meat of the paper and find out.
While 5 9’s is normally given as a form of Service Level Agreement (SLA), there are two other measures of reliability a network operator needs to consider—the Service Level Objective (SLO), and the Service Level Indicator (SLI).
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while there are situations that are a zero-sum game, not all situations are. What we need is a way to “cut the middle” to realistically appraise each situation and realistically decide what the tradeoffs might be.
Raise your hand if you think moving to platform as a service or infrastructure as a service is all about saving money. Raise it if you think moving to “the cloud” is all about increasing business agility and flexibility.
Put your hand down. You’re wrong.
Let’s be honest. For the last twenty years we network engineers have specialized in building extremely complex systems and formulating the excuses required when things don’t go right. We’ve specialized in saying “yes” to every requirement (or even wish) because we think that by saying “yes” we will become indispensable. Rather than building platforms on which the business can operate, we’ve built artisanal, complex, pets that must be handled carefully lest they turn into beasts that devour time and money. You know, like the person who tries to replicate store-bought chips by purchasing expensive fryers and potatoes, and ends up just making a mess out of the kitchen?