The Art and Necessity of Refocusing

Over at his blog The Forwarding Plane, Nick Buraglio posted about embracing change and how technology is mostly unimportant. In the technology-driven world networking folks live in, how can technology be mostly inconsequential? One answer is people drive technology, rather than the other way around—but this misses the real-world consequences of technological adoption on culture. To paraphrase Andy Crouch, technology makes some things possible that were once impossible, some things easy that were once hard, some things hard that were once easy, and some things impossible that were once possible.

There is another answer to this question, though—the real versus the perceived rate of change. When I was a kid, I would ride around with my uncle in his Jeep, a 1968 CJ5 with a soft top and soft doors. He would take the doors off when he took the top down, and—these older Jeeps being much smaller than current models—you could look just to your right and see the road passing by just there under your feet. What always amazed me was I could make myself think I was moving at different speeds just by changing my focus. If I looked across a field at a telephone pole in the distance, it didn’t seem like I was moving all that quickly. If I stared down at the white line on the side of the road, it looked like I was moving very fast indeed. By shifting my focus from here to there, I could adjust my perceived speed.

Here is where the focus on details becomes critical in networking. We do tend to focus on the details. To make matters worse, the average network operator tends to be something of a generalist. Being a generalist focused on details can be a frightening experience.

If you live entirely in the world of Ethernet, then you see past and future changes in the context of the history of Ethernet. This is something like looking at an object a few hundred feet off the road, perhaps. Things are moving quickly, but they aren’t insanely fast, blurry, up close and personal. If you live wholly in the world of routing protocols, you are going to have a different picture, but the apparent speed is going to be similar, or perhaps even slower.

If you’re a generalist who focuses on detail, though, you’re going to be staring at the white line—at all the features, physical form factors, and products created by a combination of the changes made in routing and Ethernet. If there are two changes in Ethernet, and two in routing, product marketing will create at least four, and probably eight, new features out of the combination of these two, across twenty or thirty product lines. Each of these features will likely be called something different and sold to solve completely different problems.

Staring at the white line is fun at first, then mesmerizing, then it is frightening… then finally it is just plain dull. But let’s talk about the terrifying bit because it’s the scary stage that makes us all reject change out of fear for the future. And, trust me, a kid sitting in a car with no doors staring down at the white line while his uncle drives 60 miles per hour is going to be frightened from time to time.

The point Nick is making is we should back off the details and embrace the change. This is great advice—but how? It can feel like you’re going to run off the road if you don’t keep staring at the white line. The answer lies in putting your eyes someplace else—on the posts way out in the field. Ethernet still solves the same problems Bob Metcalfe designed it to solve, and it always solves those problems using a smallish set of solutions. Routing still solves the same problems it did when Dijkstra was mulling over toy problems to show off the processing power of a new computer some 60-odd years ago, and it still solves these problems using a small set of tools.

If you stop looking at the white line and start looking at the poles out in the distance, you’ll not only save your sanity, you’ll also permit yourself to start looking at the sociological and business impacts of new technology, including what matters and what doesn’t.

Two hundred years ago, if you wanted to get from Memphis to say… Lake Providence, Louisiana, you could take a boat directly between the two. Today you would take a car, and the only paths between the two are pretty roundabout, “small country road” sorts of affairs. On the other hand, getting from Memphis, Tennessee, to Atlanta, Georgia, is now easy, while a couple of hundred years ago it would have been a big deal indeed. The sociological changes wrought by moving from rivers to roads are almost impossible to fathom. But you wouldn’t know that if you just stared at the white lines.

Whither Cyber-Insurance?

Note: I’m off in the weeds a little this week thinking about cyber-insurance because of a paper that landed in one of my various feeds—while this isn’t something we often think about as network operators, it does impact the overall security of the systems we build.

When you go to the doctor for a yearly checkup, do you think about health or insurance? You probably think about health, but the practice of going to the doctor for regular checkups began because of large life insurance companies in the United States. These companies began using statistical methods to measure risk, building actuarial tables they could use to set premiums properly. Originally, life insurance companies relied on the “hunches” of their salesmen, combined with some checking by people in the “back office,” to determine the correct premium. Over time, they developed networks of informers in local communities, such as doctors, lawyers, and even local politicians, who could describe the life of anyone in their area, providing the information the company needed to set premiums correctly.

Over time, however, statistical methods came into play, particularly relying on an initial visit with a doctor. The information these insurance companies gathered, however, gave them insight into what habits increased or decreased longevity—they decided they should use this information to help shape people’s lives so they would live longer, rather than just using it to discover the correct premiums. To gather more information, and to help people live better lives, life insurance companies started encouraging yearly doctor visits, even setting up non-profit organizations to support the doctors who gave these examinations. Thus was born the yearly doctor’s visit, the credit rating agencies, and a host of other things we take for granted in modern life.

You can read about the early history of life insurance and its impact on society in How Our Days Became Numbered.

What does any of this have to do with networks? Only this—we are in much the same position in the cyber-insurance market right now as the life insurance market in the late 1800s through the mid-1900s—insurance agents interview a company and make a “hunch bet” on how much to charge the company for cyber-insurance. Will cyber-insurance ever mature to the same point as life insurance? According to a recent research paper, the answer is “probably not.”  Why not?

First, legal restrictions will not allow a solution such as the one imposed by payment processors. Second, there does not seem to be a lot of leverage in cyber-insurance premiums. The cost of increasing security is generally much higher than any possible premium discount, making it cheaper for companies just to pay the additional premium than to improve their security posture. Third, there is no real evidence tying the use of specific products to reductions in security breaches. Instead, network and data security tend to be tied to practices rather than products, making it harder for an insurer to specify precisely what a company can and should do to improve its posture.

Finally, the largest problem is measurement. What does it look like for a company to “go to the doctor” regularly? Does this mean regular penetration tests? Standardizing penetration tests is difficult, and it can be far too easy to counter pentests without improving the overall security posture. As with medical care in the “early days,” there is no way to know whether you have gathered enough information about the population to understand the kinds of things that improve “health”—there is no way to compel reporting (much less accurate reporting), nor is there any way to compel insurance companies to share the information they have about cyber incidents.

Will cyber-insurance exist as a “separate thing” in the future? The authors largely answer in the negative. The pressures of “race to the bottom,” providing maximal coverage with minimal costs (which they attribute to the structure of the cyber-insurance market), combined with lack of regulatory clarity and inaccurate measurements, will probably end up causing cyber-insurance to “fold into” other kinds of insurance.

Whether this is a positive or negative result is a matter of conjecture—the legacy of yearly doctor’s visits and public health campaigns is not universally “good,” after all.

Ironies of Automation

In 1983 I was just joining the US Air Force, and still deeply involved in electronics (rather than computers). I had written a few programs in BASIC and assembler on a CoCo II with a tape drive, and at least some of the electronics I worked on still used vacuum tube triodes, plate oscillators, and operational amplifiers. This was a magical time, though—a time when “things” were being automated. In fact, one of the reasons I left electronics was that the automation wave left my job “flat.” Instead of looking into the VOR shelter to trace through a signal path using a VOM (remember the safety L!) and oscilloscope, I could sit at a terminal, select a few menu items, grab the right part off the depot shelf, replace it, and go home.

Maybe the newer way of doing things was better. On the other hand, maybe not.

What brings all this to mind is a paper from 1983 titled Ironies of Automation. Because of our arrogant belief that we can remake the world through disruption (was the barbarian disruption of Rome in 455 the good sort of disruption, or the bad sort?), we often think we can learn nothing from the past. Reality check: the past is prelude.

What can the past teach us about automation? This is as good a place to start as any other:

There are two general categories of task left for an operator in an automated system. He may be expected to monitor that the automatic system is operating correctly, and if it is not he may be expected to call a more experienced operator or to take-over himself. We will discuss the ironies of manual take-over first, as the points made also have implications for monitoring. To take over and stabilize the process requires manual control skills, to diagnose the fault as a basis for shut down or recovery requires cognitive skills.

This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.

In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why things are stacked the way they are, nor how to sort things out to get things running again.

This seems intuitive. When I’m running the mill through manual control, after I’ve been running it for a while (I’m out of practice right now), I can “sense” when I’m feeding too fast, meaning I need to slow down to prevent chatter from ruining the piece, or worse—a crash resulting in broken bits of bit flying all over the place.

How does this apply to network operations? On the one hand, it seems like once we automate all the things we will lose the skills of using the CLI to do needed things very quickly. I always say “I can look that command up,” but if I were back in TAC, troubleshooting a common set of problems every day, I wouldn’t want to spend time looking things up—I’d want to have the right commands memorized to solve the problem quickly so I can move to the next case.

This seems to argue against automation entirely, doesn’t it? Perhaps. Or perhaps it just means we need to look at the knowledge we need (and want) in a little different way (along with the monitoring systems we use to obtain that knowledge).

Humans think fast and slow. We either react based on “muscle memory,” or we must think through a situation, dig up the information we need, and weigh out the right path forward. When you are pulling a piece of stainless through a bit and the head starts to chatter, you don’t want to spend time assessing the situation and deciding what to do—you want to react.

But if you are working on an automated machine, and the bit starts to chatter, you might want to react differently. You might want to stop the process entirely and think through how to adjust the automated sequence to prevent the bit from chattering the next time through. In manual control, each work piece is important because each one is individually built. In the automated sequence, the work piece itself is subsumed within the process.

It isn’t that you know “less” in the automated process, it’s that you know different things. In the manual process, you can feel the steel under the blade, the tension and torque, and rely on your muscle memory to react when it’s needed. In the automated process, you need to know more about the actual qualities of the bit and the metal under the bit, the mount, and the mill itself. You have to have more of an immediate sense of how things work if you are doing it manually, but you have to have more of a sense of the theory behind why things work the way they do if it is automated.

A couple of thoughts in this area, then. First, when we are automating things, we need to be very careful to assume there is no “fast thinking” when things ultimately do fail (it’s not if, it’s when). We need to think through what information we are collecting, and how that information is being presented (if you read the original paper, the author spends a great deal of time discussing how to present information to the operator to overcome the ironies she illuminates) so we take maximum advantage of the “slow path” in the human brain, and stop relying on the “fast path” so much. Second, as we move towards an automated world, we need to start learning, and teaching, more about why and less about how, so we can prepare the “slow path” to be more effective—because the slow path is the part of our thinking that’s going to get more of a workout.

Knowing How Things Work

Simon Weckert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 99 cellphones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this group of cell phones as a very congested street—99 cars cannot even physically fit into the street he was walking down—and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.

The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world, because users are going to use wrenches as hammers, cell phones as if they were high-end cameras, and many other things in ways they were never intended. To make matters worse, users often “infer” the way something works, and adapt their actions to get what they want based on their inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, etc. Social media users do the same thing when posting or reading through their timeline, causing people to create novel and interesting ideas about how these things work that have no bearing on reality.

As folks who work in the world of networks, we often “reverse-engineer” a vendor product in much the same way drivers “reverse-engineer” roads and social media users “reverse-engineer” the news feed—we observe how it works in some circumstances, we read some of the documentation, we infer how it must work based on the information we have, and then we design around how we think it works. Sometimes this is a result of abstraction—the vendor has saved us from learning all the “technical details” to make our lives easier. And sometimes abstraction does make our lives easier—but sometimes abstraction makes our lives harder.

I’m reminded of a time I was working with a cable team to bring a wind speed/direction system back up. The system in question relied on several miles of 12c12 cable across which a low voltage signal was driven off a generator attached to an impeller. The folks working on the cable could “see” power flowing on the meter after their repair, so why wouldn’t it work?

In some cases, then, our belief about how these things work is completely wrong, and we end up designing precisely the wrong thing, or doing precisely the wrong thing to bring a failed network back on-line.

Folks involved in networks face this on the other side of the equation, as well—we supply application developers and business users with a set of abstractions they don’t really need to understand. In using them, however, they develop “folk theories” about how a network works, coming to conclusions that are often counter-productive to what they are trying to get done. The person in the airline lounge who tells you to reboot your system to see if the WiFi will work doesn’t really understand what the problem is; they just know “this worked once before, so maybe it will work now.”

There is nothing wrong per se with this kind of “reverse-engineering”—we’re going to encounter it every time we abstract things, and abstracting things is necessary to scale. On the other hand, we’re supposed to be the “engineer in the middle”—the person who knows how to relate to the vendor and the user, bridging the gap between product and service. That’s how we add value.

There are some places, like vendor-supplied gear, where we are dealing with an abstraction we simply cannot rip the lid off. There are many times when we cannot learn the “innards” because there are only 24 hours in a day, you cannot learn everything that needs to be learned in the available timeframe, and there are times, as a human, when you need to back off and “do something else.” But… there are times when you really need to know what “lies beneath the abstraction”—how things really work.

I suspect the times when understanding “how it really works” would be helpful are very common—and that we would all live a world with a little less vendor hype during the day, and a lot less panic during the night, if we put a little more priority on learning how networks work.

Letting go of Clean Design

What is the best way to build a large-scale network—in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possibly can everywhere you can. Or they might say use hierarchical design. These answers are, however, all variants of a single word: abstraction.

This response came to mind when I was reading an article on clean code this last week (it’s amazing how often software architecture overlaps with network architecture):

Once we learn how to create abstractions, it is tempting to get high on that ability, and pull abstractions out of thin air whenever we see repetitive code. After a few years of coding, we see repetition everywhere — and abstracting is our new superpower. If someone tells us that abstraction is a virtue, we’ll eat it. And we’ll start judging other people for not worshipping “cleanliness”.

I have been teaching network design for many, many years. I co-authored my first book on network design, Advanced IP Network Design, with Don Slice and Alvaro Retana; it was published in 1999, and it typically takes about a year to write a book, so we probably started working on it in the middle of 1998. The entire object of that book was to teach hierarchical network design, which relies on modularization through aggregation and summarization to separate complexity from complexity (though I didn’t really use this wording until many years later) in order to break up failure domains.

It has been twenty-two years since Don, Alvaro, and I wrote that book—and hierarchical network design is still as relevant today as it was then. But in the last 22 years, I think I’ve learned just a little more about network design.

Among the things I’ve picked up in those 22 years is this one: if you haven’t found the tradeoffs, you haven’t looked hard enough. Or perhaps there is no such thing as a free lunch. Abstraction is a superpower, and it can make your network a lot cleaner, even when you’re using it correctly (not using it to paper over complexity). But building the perfectly clean network can mean reducing the agility of the design to the point of fragility. For instance, in the article linked above, Dan Abramov notes that changing requirements made his “clean revision” of the code much more complex—a classic sign of fragility.

Perhaps an example would be helpful here. If you think of RIP as a link state protocol with summarization (abstraction of topology) at every hop, given you understand how link state and distance-vector protocols work, you can probably quickly grasp what you have gained by summarizing at every hop—and what you have lost.

You should still use abstraction to break up failure domains. You should still use abstraction to separate complexity from complexity. But you should use abstraction like you would any other tool. You should decide the best places and times to use abstraction after understanding the whole system.

For instance—a lot of people really insist on aggregating routing information in their data center fabric, especially in the underlay control plane. Why? The underlay is a constrained routing domain with known properties. Aggregation in this environment can cause routing black holes and unpredictable traffic flow behavior—both of which require added complexity to “work around.” If there is another solution available, it might be best to use it.
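
To make the black-hole risk concrete, here is a minimal longest-prefix-match sketch in Python; the prefixes, next hops, and pod names are invented for illustration rather than taken from any particular design. When the fabric carries the more-specific routes, a failed leaf prefix simply disappears and the rest of the fabric knows the destination is unreachable; when only the summary is carried, the summary keeps matching and traffic is forwarded toward a pod that no longer has a route for it.

```python
# Sketch only: illustrative prefixes and next hops, not any vendor's behavior.
import ipaddress

def lookup(table, dst):
    """Longest-prefix match: return the next hop for dst, or None."""
    dst = ipaddress.ip_address(dst)
    matches = [(prefix, nh) for prefix, nh in table.items() if dst in prefix]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Spine view WITHOUT aggregation: every leaf prefix is visible.
specifics = {
    ipaddress.ip_network("10.1.1.0/24"): "leaf1",
    ipaddress.ip_network("10.1.2.0/24"): "leaf2",
}
# leaf2's prefix fails and is withdrawn; the spine now knows it is unreachable.
del specifics[ipaddress.ip_network("10.1.2.0/24")]
print(lookup(specifics, "10.1.2.10"))  # None -> unreachability is visible fabric-wide

# Spine view WITH aggregation: only the pod summary is visible.
summary = {ipaddress.ip_network("10.1.0.0/16"): "pod1"}
# leaf2's prefix fails, but the summary is still advertised...
print(lookup(summary, "10.1.2.10"))    # "pod1" -> traffic still forwarded, then dropped
```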

At the same time, I see a lot of people insisting BGP is the only option for data center underlays, or that it is the simplest option because you can use a single protocol for the underlay and overlay. This, in my opinion, is wrong as well—because it does not properly separate two different parts of the network, each with its own purpose, into separate failure domains.

Rather than looking at a network and saying, “we can abstract here, so we should abstract here,” you should look at a network and say, “what are the modules here, and what purposes do they serve?” Once you know that, you can start thinking about when and where abstraction makes sense.

To paraphrase Dan, don’t be a clean network design zealot. Clean network design is not a goal. It’s a good guide when you don’t understand the network; such guides are often useful, but they are guides rather than rules.

Too Little Engineering

One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton and the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit—

Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.

Sounds simple in theory—but it is not in practice.

Let’s take, as an example, replacing some of the capacity in your data center, designed on a rather traditional two-layer hierarchy of aggregation and core. If you’ve built your network with a decent modular design, you buy enough new routers (or switches—but let’s use routers here) to build out a new aggregation module, the additional firewalls and other middleboxes you need, and the additional line cards to scale the core up. You unit test everything you can in the lab, understanding that you will not be able to fully test in the production network until you arrange a maintenance window. If you’re automating things, you build (and potentially test) the scripts—if you are smart, you will test these scripts in a virtual environment before using them.
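
As a sketch of what unit testing those scripts might look like (the template, hostname, VLAN, and addresses here are hypothetical), you can at least prove the configuration generator behaves correctly before the maintenance window ever starts:

```python
# Hypothetical sketch: a tiny config generator plus a plain-assert unit test,
# the kind of check you can run long before the maintenance window.
def aggregation_uplink_config(hostname, vlan, address, prefixlen):
    """Render an uplink interface stanza for a new aggregation router."""
    if not 1 <= vlan <= 4094:
        raise ValueError(f"invalid VLAN {vlan}")
    return (
        f"hostname {hostname}\n"
        f"interface uplink.{vlan}\n"
        f" ip address {address}/{prefixlen}\n"
    )

def test_aggregation_uplink_config():
    rendered = aggregation_uplink_config("agg3", 100, "10.0.3.1", 31)
    assert "interface uplink.100" in rendered
    assert " ip address 10.0.3.1/31" in rendered
    try:
        aggregation_uplink_config("agg3", 5000, "10.0.3.1", 31)
    except ValueError:
        pass  # bad input is rejected, as expected
    else:
        raise AssertionError("bad VLAN should be rejected")

if __name__ == "__main__":
    test_aggregation_uplink_config()
    print("config generator unit tests passed")
```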

You arrange the maintenance window, install the hardware, and … run the scripts. If it works, you go to bed, take a long nap, and get back to work doing “normal maintenance stuff” the next day. Of course, it rarely works, so you preposition some energy bars, make certain you have daycare plans, and put the vendor’s tech support number on speed dial.

What’s wrong with this picture? Well, many things, but primarily: this is not engineering. Was there any thought put into how to test beyond the individual unit level? Is there any way to test realistic traffic flows while connecting the new module to the network without impacting the rest of the network’s operation? Is there any real rollback plan in case things go wrong? Can there be?

In “modern” network design, none of these things tend to exist because they cannot exist. They cannot exist because we have not truly learned to do design life-cycles or truly modular designs. In the software world, if you don’t do modular design, it’s either because you didn’t think it through, or because you thought it through and decided the trade-off just wasn’t worth it. In the networking world, we play around the edges of resilient, modular designs, but networking folks don’t tend to know the underlying technologies—and how they work—well enough to understand how to divide a problem into modules correctly, or how to define the interfaces between those modules.

Let’s consider the same example, but with some engineering principles applied. Instead of a traditional two-layer hierarchy, you have a single-SKU spine and leaf fabric with clearly defined separation between the fabric and pods, clearly defined underlay and overlay protocols, etc. Now you can build a pod and test it against a “fake fabric” before attaching it to the production fabric, including any required automation. Then you can connect the pod to the production fabric and bring up just the underlay protocol, testing the entire underlay before pushing the overlay out to the edge. Then you can push the overlay to the edge and test that before putting any workload on the new pod. Then you can test fake load on the new pod before pushing production traffic onto the pod…
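
A minimal sketch of that staged roll-out might look like the following; every check_* function is a hypothetical placeholder for whatever validation you actually run (ping sweeps, route-table comparisons, synthetic traffic), and the stage-by-stage gating is the only point being illustrated:

```python
# Sketch only: the checks are invented placeholders; the staging logic is the point.
from typing import Callable, List, Tuple

def run_stages(stages: List[Tuple[str, Callable[[], bool]]]) -> bool:
    """Run each stage's check in order; stop at the first failure so the
    stages already completed can be rolled back without touching the rest."""
    for name, check in stages:
        if not check():
            print(f"stage failed: {name} -- stopping, later stages never ran")
            return False
        print(f"stage passed: {name}")
    return True

def check_pod_against_lab_fabric() -> bool:
    return True   # pod and its automation validated against a lab copy of the fabric

def check_underlay_reachability() -> bool:
    return True   # underlay adjacencies and routes verified on the production fabric

def check_overlay_edge_routes() -> bool:
    return True   # overlay pushed to the edge and verified

def check_synthetic_workload() -> bool:
    return True   # fake load run on the new pod before real workloads

if __name__ == "__main__":
    ok = run_stages([
        ("pod against lab fabric", check_pod_against_lab_fabric),
        ("underlay on production fabric", check_underlay_reachability),
        ("overlay pushed to the edge", check_overlay_edge_routes),
        ("synthetic load on the new pod", check_synthetic_workload),
    ])
    print("ready for production traffic" if ok else "rolled back at the failed stage")
```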

Each of these tests, other than the initial test against a lab environment, can take place on the production network with little or no risk to the entire system. You’re not physically modifying current hardware (except plugging in new cables!), so it’s easy to roll changes back. You know the lower layer parts work before putting the higher layer parts in place. Because the testing happens on the real network, these are canaries rather than traditional “certification” style tests. Because you have real modularization, you can fail fast without causing major harm to any system. Because you are doing things in stages, you can build tests that determine clean and correct operation before moving to the next stage.

This is an engineered solution—thought has been put into proper modules, how those modules connect, what information is carried across those modules, etc. Doing this sort of work requires knowing more than how to configure—or automate—a set of protocols based on what a vendor tells you to do. Doing this sort of work requires understanding what failure looks like at each point in the cycle and deciding whether to fail out or fix it.

It may not meet the “formal” process mathematicians might prefer, but neither is it the “move fast and break stuff” attitude many see in “the Valley.” It is fail fast, but not fail foolishly. And it’s where we need to move to retain the title of “engineer” and not lose the confidence of the businesses who pay us to build networks that work.

Nines are not enough

How many 9’s is your network? How about your service provider’s? Now, to ask the not-so-obvious question—why do you care? Does the number of 9’s actually describe the reliability of the network? According to Jeffrey Mogul and John Wilkes, nines are not enough. The question is—while this paper was written for commercial relationships and cloud providers, is it something you can apply to running your own network? Let’s dive into the meat of the paper and find out.

While 5 9’s is normally given as a form of Service Level Agreement (SLA), there are two other measures of reliability a network operator needs to consider—the Service Level Objective (SLO), and the Service Level Indicator (SLI). The SLO defines a set of expectations about the level of service; internal SLO’s define “trigger points” where actions should be taken to prevent an external SLO from failing. For instance, if the external SLO says no more than 2% of the traffic will be dropped on this link, the internal SLO might say if more than 1% of the traffic on this link is dropped, you need to act. The SLA, on the other hand, says if more than 2% of the traffic on this link is dropped, the operator will rebate (some amount) to the customer. The SLI says this is how I am going to measure the percentage of packets dropped on this link.
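
As a rough sketch of how these three concepts relate, using the 1% and 2% figures above with invented packet counters: the SLI is the thing you can actually measure, while the internal SLO, external SLO, and SLA are thresholds evaluated against that measurement.

```python
# Sketch only: the counters and thresholds are illustrative, taken from the
# 1%/2% example in the text, not from any real SLA.
def drop_sli(tx_packets: int, dropped_packets: int) -> float:
    """SLI: the measured percentage of packets dropped on the link."""
    return 100.0 * dropped_packets / tx_packets if tx_packets else 0.0

INTERNAL_SLO = 1.0   # act before the customer-facing objective is at risk
EXTERNAL_SLO = 2.0   # the objective written into the SLA

def evaluate(tx_packets: int, dropped_packets: int) -> str:
    sli = drop_sli(tx_packets, dropped_packets)
    if sli > EXTERNAL_SLO:
        return f"{sli:.2f}% dropped: SLA breached, rebate owed"
    if sli > INTERNAL_SLO:
        return f"{sli:.2f}% dropped: internal SLO tripped, operator must act"
    return f"{sli:.2f}% dropped: within objectives"

if __name__ == "__main__":
    print(evaluate(1_000_000, 4_000))    # 0.40% -> within objectives
    print(evaluate(1_000_000, 15_000))   # 1.50% -> internal SLO tripped
    print(evaluate(1_000_000, 25_000))   # 2.50% -> SLA breached
```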

Splitting these three concepts apart helps reveal what is wrong with the entire 5 9’s way of thinking, because it enables you to ask questions like—can my telemetry system measure and report on the amount of traffic dropped on this link? Across what interval should this SLI apply? If I combine all the SLI’s across my entire network, what does the monitoring system need to look like? Can I support the false positives likely to occur with such a monitoring system?

These questions might be obvious, of course, but there are more non-obvious ones, as well. For instance—how do my internal and external SLO’s correlate to my SLI’s? Measuring the amount of traffic dropped on a link is pretty simple (in theory). Measuring something like “this application will not perform at less than 50% capacity because of network traffic” is going to be much, much harder.

The point Mogul and Wilkes make in this paper is that we just need to rethink the way we write SLO’s and their resulting SLA’s to be more realistic—in particular, we need to think about whether or not the SLI’s we can actually measure and act on can cash the SLO and SLA checks we’re writing. This means we probably need to expose more, rather than less, of the complexity of the network itself—even though this cuts against the grain of the current move towards abstracting the network down to “ports and packets.” To some degree, the consumer of networking services is going to need to be more informed if we are to build realistic SLA’s that can be written and kept.

How does this apply to the “average enterprise network engineer?” At first glance, it might seem like this paper is strongly oriented towards service providers, since there are definite contracts, products, etc.,  in play. If you squint your eyes, though, you can see how this would apply to the rest of the world. The implicit promise you make to an application developer or owner that their application will, in fact, run on the network with little or no performance degradation is, after all, an SLO. Your yearly review examining how well the network has met the needs of the organization is an SLA of sorts.

The kind of thinking represented here, if applied within an organization, could turn the conversation about whether to out- or in-source on its head. Rather than talking about the 5 9’s some cloud provider is going to offer, it opens up discussions about how and what to measure, even within the cloud service, to understand the performance being offered, and how more specific and nuanced results can be measured against a fuller picture of value added.

This is a short paper—but well worth reading and considering.