What percentage of business-impacting application outages are caused by networks? According to a recent survey by the Uptime Institute, 29% of the roughly 300 operators they surveyed have experienced network-related outages in the last three years—the highest percentage of any cause of IT failure across the period.
A secondary question on the survey attempted to “dig a little deeper” to understand the reasons for network failure; the chart below shows the result.
We can be almost certain the third-party failures, if the providers were queried, would break down along the same lines. Is there a pattern among the reasons for failure?
Configuration change—while this could be somewhat managed through automation, these kinds of failures are more generally the result of complexity. Firmware and software failures? The more complex a piece of software, the more likely it is to have mission-impacting errors of some kind—so, again, complexity-related. Corrupted policies and routing tables are also complexity-related. The only item among the top preventable causes that does not seem, at first, to relate directly to complexity is network overload and/or congestion. Many of these cases, however, might also be complexity-related.
The Uptime Institute draws this same lesson, though through a slightly different process, saying: “Networks are complex not only technically, but also operationally.”
For years—decades, even—we have talked about the increasing complexity of networks, but we have done little about it. Yes, we have automated all the things, but automation can only carry us so far in covering complexity up. Automation also adds a large dollop of complexity on top of the existing network—sometimes (not always, of course!) automating a complex system without making substantial efforts at simplification is just like trying to put a fire out with a can of gas (or, in one instance I actually saw, trying to put out an electrical fire with a can of soda, with the predictable trip to the local hospital).
We are (finally) starting to be “bit hard” by complexity problems in our networks—and I suspect this is the leading edge of the problem, rather than the trailing edge.
Maybe it’s time to realize making every protocol serve every purpose in the network wasn’t a good idea—we now have protocols that are so complex that they can only be correctly configured by machines, and then only when you narrow the use case enough to make the design parameters intelligible.
Maybe it’s time to realize optimizing for every edge use case wasn’t a good idea. Sometimes it’s just better to throw resources at the problem, rather than throwing state at the control plane to squeeze out just one more ounce of optimization.
Maybe it’s time to stop building networks around “whatever the application developer can dream up,” and to start working as a team with application developers to build a complete system, one that puts complexity where it makes the most sense and divides complexity from complexity, rather than just assuming “the network can do that.”
Maybe it’s time to stop thinking we can automate our way out of this.
Maybe it’s time to lay our superhero capes down and just start building simpler systems.
Crossing from the domain of test pilots to the domain of network engineering might seem like a large leap indeed—but user interfaces and their tradeoffs are common across physical and virtual spaces. Join Brian Keys, Eyvonne Sharp, Tom Ammon, and Russ White as we start with user interfaces and move into a wider discussion around attitudes and beliefs in the network engineering world.
Every software developer has run into “god objects”—some data structure or database that every process must access no matter what it is doing. Creating god objects in software is considered an anti-pattern—something you should not do. Perhaps the most apt description of the god object I’ve seen recently is this: you ask for a banana, and you get the gorilla as well.
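To make the anti-pattern concrete, here is a minimal Python sketch (all names hypothetical): one “god” class that every caller must drag in, next to the smaller, single-purpose objects that would replace it.

```python
# A minimal sketch (hypothetical names) of the anti-pattern: one object
# every process must touch, no matter what it is doing.
class NetworkGod:
    """Owns routing, policy, inventory, monitoring ... everything."""
    def __init__(self):
        self.routes = {}
        self.policies = {}
        self.inventory = {}
        self.metrics = {}

    def add_route(self, prefix, next_hop): self.routes[prefix] = next_hop
    def add_policy(self, name, rule): self.policies[name] = rule
    def add_device(self, name, model): self.inventory[name] = model
    def record_metric(self, key, value): self.metrics[key] = value
    # ... every new feature lands here, and every caller depends on all of it.

# The alternative: small objects that each solve one problem well.
class RoutingTable:
    def __init__(self):
        self.routes = {}
    def add(self, prefix, next_hop):
        self.routes[prefix] = next_hop

class PolicyStore:
    def __init__(self):
        self.policies = {}
    def add(self, name, rule):
        self.policies[name] = rule

# Ask for a banana, get a banana, not the gorilla too.
rib = RoutingTable()
rib.add("10.0.0.0/24", "192.0.2.1")
```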
We seem to have a deep desire to solve all the complexity of modern networks through god objects. There was ATM, which was going to solve all our networking problems by allowing the edge device (or a centralized controller) to control the path its traffic takes through the network. There is LISP, which is going to solve every mapping and tunneling/transport problem in the entire networking world (including mobility and security). There is SDN, which is going to solve everything by pushing it all into a controller.
And now there is BGP, which can be a link state protocol (LSVR), the ideal DC fabric control plane, the ideal interdomain protocol, the ideal IGP … a sort-of distributed god object that solves everything, everywhere, all the time (life in the fast lane…).
The problem is, a bunch of people are asking for different bananas, and what we keep getting is the gorilla.
A lot of this is our fault. We crave simplicity so much that we are willing to believe just about anyone, or anything, that promises to solve all the problems we face in building and operating networks in a simple and easy way. The truth is, however, there are no simple solutions to hard problems—solving hard problems requires complex solutions.
Okay, so we intuitively know god objects are bad, but why? Because they are too complex to really understand, and therefore they cannot be truly operated, nor effectively troubleshot. There are too many unintended consequences, too many places where you cannot understand the relationship between this and that.
To put this in more human terms—there are many people in the modern world who think that if they just have enough data, they will understand people well enough to operate and troubleshoot them like some kind of machine. Sorry to tell you this—humans are just too complex, and human social institutions, being made up of people (well, duh!), are way more complex than even the most intelligent artificial intelligence can ever “understand.”
We need to fall out of love with the utopia of the god object once and for all. We need to go back to building simpler systems that solve one or two problems well, and then combining these into intelligible solutions that can be understood, managed, and repaired.
But the move back towards real simplicity has to begin with you.
One of the major sources of complexity in modern systems is the simple failure to pull back the curtains. From a recent blog post over at the ACM—
The Wizard of Oz was a charlatan. You’d be surprised, too, how many programmers don’t understand what’s going on behind the curtain either. Some years ago, I was talking with the CTO of a company, and he asked me to explain what happens when you type a URL into your browser and hit enter. Do you actually know what happens? Think about it for a moment.
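For the curious, here is a deliberately incomplete Python sketch of just three of those steps (name resolution, a TCP connection, and an HTTP request), using only the standard library. Real browsers layer caching, TLS, connection reuse, and rendering on top of each of these.

```python
# A deliberately incomplete sketch of a few steps behind "type a URL and
# hit enter": name resolution, a TCP connection, and an HTTP request.
# Real browsers also do caching, TLS, connection reuse, rendering, and more.
import socket

host, path = "example.com", "/"

# Step 1: DNS -- resolve the name to an address (itself many steps deep).
addr = socket.gethostbyname(host)
print(f"{host} resolved to {addr}")

# Step 2: TCP -- three-way handshake to port 80.
sock = socket.create_connection((addr, 80), timeout=5)

# Step 3: HTTP -- send a minimal request and read the start of the reply.
request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
sock.sendall(request.encode())
reply = sock.recv(4096)
print(reply.decode(errors="replace").splitlines()[0])  # e.g. "HTTP/1.1 200 OK"
sock.close()
```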
Yegor describes three different reactions a coder can have on encountering something unexpected while solving a problem:
Throw in the towel. Just give up on solving the problem. This is fairly uncommon in the networking and programming fields, so I don’t have much to say here.
Muddle through. Just figure out how to make it work by whatever means necessary.
Open the curtains and build an excellent solution. Learn how the underlying systems work, understand how to interact with them, and create a solution that best takes advantage of them.
The first and third options are rare indeed; it is the second that seems to dominate our world. What generally happens is this: we set out to solve some problem, we encounter resistance, and we either “just make it work” by fiddling around with the bits or we say “this is just too complex, I’m going to build something new that’s simpler and easier.” The problem with building something new is that the “something new” must go someplace … which generally means on top of existing “stuff.” Adding more stuff you do understand on top of stuff you don’t understand to solve a problem is, of course, a prime way to increase complexity in a network.
And thus we have one of the prime reasons for ever-increasing complexity in networks.
Yegor says being a great programmer by pulling back the curtain increases job satisfaction, helping him avoid depression. The same is probably true of network engineers who are deeply interested in solving problems—who are only happy at the end of the day if they know they have solved some problem, even if no-one ever notices.
Pulling back the curtains, then, not only helps us to manage complexity, it can also improve job satisfaction for those with the problem-solving mindset. Great reasons to pull back the curtains, indeed.
According to Maor Rudick, in a recent post over at Cloud Native, programming is 10% writing code and 90% understanding why it doesn’t work. The same is true of the art of deploying network protocols, security, or anything else that requires thought about where and how. I’m not just talking about the configuration, either—why was this filter deployed here rather than there? Why was this BGP community used rather than that one? Why was this aggregation range used rather than some other? Even in a fully automated world, the saying holds true.
So how can you improve the understandability of your network design? Maor defines understandability as “the dev who creates the software is to effortlessly … comprehend what is happening in it.” Continuing—“the more understandable a system is, the easier it becomes for the developers who created it to change it in a way that is safe and predictable.” What are the elements of understandability?
Documentation must be complete, clear, concise, and organized. The two primary failings I encounter in documentation are completeness and organization. Why something was done, when it was last changed, and why it was changed are often missing. The person making the change just assumes, “I’ll remember this, or someone will figure it out.” You won’t, and they won’t. Concise is the “other side” of complete … recording insubstantial changes just adds information that won’t ever be needed. You have to strike a balance between enough and too much, of course.
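As an entirely hypothetical illustration, even a tiny structured change record forces the “why” to be written down rather than assumed:

```python
# A sketch of a change record that captures the things that go missing:
# not just what changed, but why, when, and by whom. The field names,
# device, and ticket number are all made up -- the point is the "why".
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ChangeRecord:
    device: str
    what: str   # the change itself
    why: str    # the part "someone will figure out" ... won't
    who: str
    when: str

record = ChangeRecord(
    device="dist-sw-03",
    what="added prefix filter PEER-IN-V4 on uplink to transit",
    why="leaked more-specifics from peer caused path churn (ticket NET-1142)",
    who="rwhite",
    when=datetime.now(timezone.utc).isoformat(),
)
print(record.why)
```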
Organization is another entire problem in documentation—most people have a favorite way of organizing things. When you get a team of people all organizing things based on their favorite way, you end up with a mess. Going back in time … I remember that just about everyone assigned to the METNAV shop began their time by re-organizing the tools. Each time, the re-organization made things so much easier to find and improved the MTTR for the airfield equipment we supported … After a while, you’d think someone would ask, “Does re-organizing all the tools every year really help? Or are you just making stuff up for new folks to do?”
Moving beyond documentation, what else can we do to make our networks more understandable?
First, we can focus on actually making networks simpler. I don’t mean just glossing things over with a pretty GUI, or automating thousands of lines of configuration using Python. I mean using protocols that are simpler to run, require less configuration, and produce more information you can use for troubleshooting—choosing something like IS-IS for your DC fabric underlay rather than BGP, unless you really do have several hundred thousand underlay destinations (hint: if you’ve properly separated “customer” routes in the overlay from “infrastructure” routes in the underlay, you shouldn’t have this kind of routing tangle in the underlay anyway).
What about having multiple protocols that do the same job? Do you really need three or four routing protocols, four or five tunneling protocols, and five or six … well, you get the idea. Reducing the sheer number of protocols running in your network can make a huge difference in tooling and troubleshooting time. What about having four or five kinds of boxes in your network that fulfill the same role? Okay—so maybe you have three DC fabrics, and you run each one using a different vendor. But is there any reason to have three DC fabrics, each of which has a broad mix of equipment from five different vendors? I doubt it.
Second, you can think about what you would measure in the case of failure, how you would measure it, and put the basic pieces in place during the design phase to take those measurements. Don’t wait until you need the data to figure out how to get at it, and what the performance cost of getting it is going to be.
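As a sketch of what “putting the basic pieces in place” might look like, here is a minimal Python example (the target names are stand-ins, and the measurement is a bare TCP connect) that records a steady-state baseline you can compare against when things break:

```python
# A sketch of "decide what to measure before you need it": record a
# steady-state latency baseline now, so a 2AM comparison is possible later.
# Targets and the output path are hypothetical stand-ins.
import json
import socket
import time

TARGETS = [("fabric-leaf-01", 22), ("fabric-spine-01", 22)]  # hypothetical

def connect_time_ms(host, port):
    """Time a bare TCP connect; None means unreachable (worth recording too)."""
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=2).close()
        return round((time.monotonic() - start) * 1000, 2)
    except OSError:
        return None

baseline = {f"{h}:{p}": connect_time_ms(h, p) for h, p in TARGETS}
with open("baseline.json", "w") as f:
    json.dump({"taken_at": time.time(), "connect_ms": baseline}, f, indent=2)
```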
Third, you can think about where you put policy in your network. There is no “right” answer to this question, other than: be consistent. The first option is to put all your policy in one place—say, on the devices that connect the core to the aggregation, or on the devices in the distribution layer. The second option is to put the policy as close as possible to the source or destination of the traffic it impacts. In a DC fabric, you should always put policy and external connectivity in the T0 or ToR, never in the spine (it’s not a core, it’s a spine).
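Consistency is also checkable. Here is a hypothetical Python sketch of a lint that flags policy that has crept out of the ToR; in practice the inventory would come from your source of truth rather than a hard-coded dict:

```python
# A sketch of enforcing "be consistent": flag policy that has crept onto
# spines, where (under this design rule) it should live only at the ToR.
# The inventory below is entirely made up for illustration.
inventory = {
    "tor-01":   {"role": "tor",   "policies": ["CUST-A-IN"]},
    "tor-02":   {"role": "tor",   "policies": []},
    "spine-01": {"role": "spine", "policies": ["CUST-A-IN"]},  # violation
    "spine-02": {"role": "spine", "policies": []},
}

ALLOWED_POLICY_ROLES = {"tor"}  # the single, consistent answer for this fabric

for device, attrs in inventory.items():
    if attrs["policies"] and attrs["role"] not in ALLOWED_POLICY_ROLES:
        print(f"{device}: policy {attrs['policies']} on a {attrs['role']}, "
              "move it to the ToR")
```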
Maybe you have other ideas on how to improve understandability in networks … If you do, get in touch and let’s talk about it. I’m always looking for practical ways to make networks more understandable.
I think we can all agree networks have become too complex—and this complexity is the result of the network so often becoming the “final dumping ground” for every problem that seems like it might impact more than one system, or for everything no-one else can figure out how to solve. It’s rather humorous, in fact, to watch a lot of server and application folks sitting around saying “this networking stuff is so complex—let’s design something better and simpler in our bespoke overlay …” and then falling into the same complexity traps as they start facing the real problems of policy and scale.
This complexity cannot be “automated away.” It can be smeared over with intent, but we’re going to find—soon enough—that smearing intent on top of complexity just makes for a dirty kitchen and a sub-standard meal.
While this is always “top of mind” in my world, what brings it to mind this particular week is a paper by Jen Rexford et al. (I know Jen isn’t in the lead position on the author list, but still…) called A Clean Slate 4D Approach to Network Control and Management. Of course, I can appreciate the paper in part because I agree with a lot of what’s being said here. For instance—
We believe the root cause of these problems lies in the control plane running on the network elements and the management plane that monitors and configures them. In this paper, we argue for revisiting the division of functionality and advocate an extreme design point that completely separates a network’s decision logic from the protocols that govern interaction of network elements.
In other words, we’ve not done our modularization homework very well—and our lack of focus on doing modularization right is adding a lot of unnecessary complexity to our world. The four planes proposed in the paper are decision, dissemination, discovery, and data. The decision plane drives network control, including reachability, load balancing, and access control. The dissemination plane “provides a robust and efficient communication substrate” across which the other planes can send information. The discovery plane “is responsible for discovering the physical components of the network,” giving each item an identifier, etc. The data plane carries packets edge-to-edge.
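To make the division concrete, here is a toy Python sketch of the shape of the idea (emphatically not the paper’s implementation), in which each plane has one job and a narrow interface:

```python
# A toy sketch of the 4D separation: each plane has one job and a
# narrow interface. Topology and routing logic are trivial on purpose.
class DiscoveryPlane:
    """Reports what physically exists; no policy lives here."""
    def topology(self):
        return {"leaf1": ["spine1"], "leaf2": ["spine1"],
                "spine1": ["leaf1", "leaf2"]}

class DecisionPlane:
    """Network-wide logic: reachability, load balancing, access control."""
    def compute_routes(self, topology):
        # Trivially: every node can reach its direct neighbors.
        return {node: {n: n for n in nbrs} for node, nbrs in topology.items()}

class DisseminationPlane:
    """A communication substrate between decision logic and elements."""
    def push(self, element, state):
        print(f"installing on {element}: {state}")

class DataPlane:
    """Forwards packets using only the state it was handed."""
    def __init__(self, fib):
        self.fib = fib

discovery, decision, dissemination = DiscoveryPlane(), DecisionPlane(), DisseminationPlane()
routes = decision.compute_routes(discovery.topology())
for element, fib in routes.items():
    dissemination.push(element, fib)
fabric = {element: DataPlane(fib) for element, fib in routes.items()}
```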
I do have some quibbles with this architecture, of course. To begin, I’m not certain the word “plane” is the right one here. Maybe “layers,” or something else that implies more of a modular concept with interactions, and less a “ships in the night” sort of affair. My more substantial disagreement is with the placement of “interface configuration” and where reachability is placed in the model.
Consider this: reachability and access control are, in a sense, two sides of the same coin. You learn where something is to make it reachable, and then you block access to it by hiding reachability from specific places in the network. There are two ways to control reachability—by hiding the destination, or by blocking traffic being sent to the destination. Each of these has positive and negative aspects.
But notice this paradox—the control plane cannot hide reachability towards something it does not know about. You must know about something to prevent someone from reaching it. While reachability and access control are two sides of the same coin, they are also opposites. Access control relies on reachability to do its job.
To solve this paradox, I would put reachability into discovery rather than decision. Discovery would then become the discovery of physical devices, paths, and reachability through the network. No policy would live here—discovery would just determine what exists. All the policy about what to expose about what exists would live within the decision plane.
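Here is a minimal sketch of that split, with hypothetical prefixes: discovery learns everything that exists, while the decision plane applies policy to choose what is exposed, which is exactly why hiding something requires knowing about it first:

```python
# A sketch of the proposed split: discovery learns everything that exists
# (no policy), and the decision plane chooses what to expose. Access
# control can only hide what discovery already knows about.
discovered = {
    "10.1.0.0/24": "leaf1",   # general workload
    "10.9.9.0/24": "leaf2",   # sensitive service
}

def decide_exposure(reachability, hidden_prefixes):
    """Policy lives here, not in discovery: filter what a peer may see."""
    return {p: nh for p, nh in reachability.items() if p not in hidden_prefixes}

# Hide the sensitive prefix from an untrusted edge -- possible only
# because discovery told us it exists in the first place.
exposed = decide_exposure(discovered, hidden_prefixes={"10.9.9.0/24"})
print(exposed)  # {'10.1.0.0/24': 'leaf1'}
```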
While the paper implies this kind of system must wait for some day in the future to build a network using these principles, I think you can get pretty close today. My “ideal design” for a data center fabric right now is (modified) IS-IS or RIFT in the underlay and eVPN in the overlay, with a set of controllers sitting on top. Why?
IS-IS doesn’t rely on IP, so it can serve in a mostly pure discovery role, telling any upper layers where things are, and what is changing. eVPN can provide segmented reachability on top of this, as well as any other policies. Controllers can be used to manage eVPN configuration to implement intent. A separate controller can work with IS-IS on inventory and lifecycle of installed hardware. This creates a clean “break” between the infrastructure underlay and overlays, and pretty much eliminates any dependence the underlay has on the overlay. Replacing the overlay with a more SDN’ish solution (rather than BGP) is perfectly do-able and reasonable.
While not perfect, the link-state underlay/BGP overlay model comes pretty close to implementing what Jen and her co-authors are describing, only using protocols we have around today—although with some modification.
But the main point of this paper stands—a lot of the reason for the complexity we face today is simply because we modularize using aggregation and summarization and call the job “done.” We aren’t thinking about the network as a system, but rather as a bunch of individual appliances we slap together into places, which we then connect through other places (or strings, like between tin cans).
Something to think about the next time you get that 2AM call.
On a Spring 2019 walk in Beijing I saw two street sweepers at a sunny corner. They were beat-up looking and grizzled but probably younger than me. They’d paused work to smoke and talk. One told a story; the other’s eyes widened and then he laughed so hard he had to bend over, leaning on his broom. I suspect their jobs and pay were lousy and their lives constrained in ways I can’t imagine. But they had time to smoke a cigarette and crack a joke. You know what that’s called? Waste, inefficiency, a suboptimal outcome. Some of the brightest minds in our economy are earnestly engaged in stamping it out. They’re winning, but everyone’s losing. —Tim Bray
This, in a nutshell, is what is often wrong with our design thinking in the networking world today. We want things to be efficient, wringing the last little dollar, and the last little bit of bandwidth, out of everything.
This is also, however, a perfect example of the problem of triads and tradeoffs. In the case of the street sweepers, we might think, “well, we could replace those folks sitting around smoking a cigarette and cracking jokes with a robot, making things much more efficient.” We might notice the impact on the street sweepers’ salaries—but after all, it’s a boring job, and they are better off doing something else anyway, right?
We’re actually pretty good at finding, and “solving” (for some meaning of “solving,” of course), these kinds of immediately obvious tradeoffs. It’s obvious the street sweepers are going to lose their jobs if we replace them with a robot. What might not be so obvious is the loss of the presence of a person on the street. That’s a pair of eyes who can see when a child is being taken by someone who’s not a family member, a pair of ears that can hear the rumble of a car that doesn’t belong in the neighborhood, a pair of hands that can help someone who’s fallen, etc.
This is why these kinds of tradeoffs always come in (at least) threes.
Let’s look at the street sweepers in terms of the SOS triad: state, optimization, and surfaces. Replacing the street sweepers with a robot or machine certainly increases optimization. According to the triad, though, increasing optimization in one area should result in some increase in complexity someplace, and some loss of optimization in other places.
What about surfaces? The robot must be managed, and it must interact with people and vehicles on the street—which means people and vehicles must also interact with the robot. Someone must build and maintain the robot, so there must be some sort of system, with a plethora of interaction surfaces, to make all this happen. So yes, there may be more efficiency, but there are also more interaction surfaces to deal with—and those interaction surfaces increase complexity.
What about state? In a sense, there isn’t much change in state other than moving it—purely in terms of sweeping the street, anyway. The sweeper and the robot must both understand when and how to sweep the street, etc., so the state doesn’t seem to change much here.
On the other hand, that extra set of eyes and ears, that extra mind, that is no longer on the street in a personal way represents a loss of state. The robot is an abstraction of the person who was there before, and abstraction always represents a loss of state in some way. Whether this loss of state decreases the optimal handling of local neighborhood emergencies is probably a non-trivial problem to consider.
The bottom line is this—when you go after efficiency, you need to think in terms of efficiency of what, rather than treating efficiency as a goal in itself. There is no such thing as “efficiency-in-itself”; there is only something you are making more efficient—and a lot of things you are potentially making less efficient.
Automate your network, certainly, or even buy a system that solves “all the problems.” But remember there are tradeoffs—often a large number of tradeoffs you might not have thought about—and those tradeoffs have consequences.
It’s not “if you haven’t found the tradeoff, you haven’t looked hard enough”—it’s “if you haven’t found the tradeoffs, you haven’t looked hard enough.” It’s plural for a reason. Don’t stop at one.