Back in January, I ran into an interesting article called The many lies about reducing complexity:
Reducing complexity sells. Especially managers in IT are sensitive to it as complexity generally is their biggest headache. Hence, in IT, people are in a perennial fight to make the complexity bearable.
Gerben then discusses two ways we often try to reduce complexity. First, we try to simply reduce the number of applications we’re using. We see this all the time in the networking world—if we could only get to a single pane of glass, or reduce the number of management packages we use, or reduce the number of control planes (generally to one), or reduce the number of transport protocols … but reducing the number of protocols doesn’t necessarily reduce complexity. Instead, we can just end up with one very complex protocol. Would it really be simpler to push DNS and HTTP functionality into BGP so we can use a single protocol to do everything?
Second, we try to reduce complexity by hiding it. While this is sometimes effective, it can also lead to unacceptable tradeoffs in performance (we run into the state, optimization, surfaces triad here). It can also make the system more complex if we need to go back and leak information to regain optimal behavior. Think of the OSPF type 4, which just reinjects information lost in building an area summary, or even the complexity involved in the type7 to type 5 process required to create not-so-stubby areas.
It would seem, then, that you really can’t get rid of complexity. You can move it around, and sometimes you can effectively hide it, but you cannot get rid of it.
This is, to some extent, true. Complexity is a reaction to difficult environments, and networks are difficult environments.
Even so, there are ways to actually reduce complexity. The solution is not just hiding information because it’s messy, or munging things together because it requires fewer applications or protocols. You cannot eliminate complexity, but if you think about how information flows through a system you might be able to reduce the amount of complexity, and even create boundaries where state (hence complexity) can be more effectively hidden.
As an instance, I have argued elsewhere that building a DC fabric with distinct overlay and underlay protocols can actually create a simpler overall design than using a single protocol. Another instance might be to really think about where route aggregation takes place—is it really needed at all? Why? Is this the right place to aggregate routes? Is there any way I can change the network design to reduce state leaking through the abstraction?
The problem is there are no clear-cut rules for thinking about complexity in this way. There’s no rule of thumb, there’s no best practices. You just have to think through each individual situation and consider how, where, and why state flows, and then think through the state/optimization/surface tradeoffs for each possible way of reducing the complexity of the system. You have to take into account that local reductions in complexity can cause the overall system to be much more complex, as well, and eventually make the system brittle.
There’s no “pat” way to reduce complexity—that there is, is perhaps one of the biggest lies about complexity in the networking world.
Why are networks so insecure?
One reason is we don’t take network security seriously. We just don’t think of the network as a serious target of attack. Or we think of security as a problem “over there,” something that exists in the application realm, that needs to be solved by application developers. Or we think the consequences of a network security breach as “well, they can DDoS us, and then we can figure out how to move load around, so if we build with resilience (enough redundancy) we’re already taking care of our security issues.” Or we put our trust in the firewall, which sits there like some magic box solving all our problems.
The problem is–none of this is true. In any system where overall security is important, defense-in-depth is the key to building a secure system. No single part of the system bears the “primary responsibility” for “security.” The network is certainly a part of any defense-in-depth scheme that is going to work.
Which means network protocols need to be secure, at least in some sense, as well. I don’t mean “secure” in the sense of privacy—routes are not (generally) personally identifiable information (there are always exceptions, however). But rather “secure” in the sense that they cannot be easily attacked. On-the-wire encryption should prevent anyone from reading the contents of the packet or stream all the time. Network devices like routers and switches should be difficult to break in too, which means the protocols they run must be “secure” in the fuzzing sense—there should be no unexpected outputs because you’ve received an unexpected input.
I definitely do not mean path security of any sort. Making certain a packet (or update or whatever else) has followed a specified path is a chimera in packet switched networks. It’s like trying to nail your choice of multicolored gelatin desert to the wall. Packet switched networks are designed to adapt to changes in the network by rerouting traffic. Get over it.
So why are protocols and network devices so insecure? I recently ran into an interesting piece of research that provides some of the answer. To wit—
Our research saw that ambiguous keywords SHOULD and MAY had the second highest number of occurrences across all RFCs. We’ve also seen that their intended meaning is only to be interpreted as such when written in uppercase (whereas often they are written in lowercase). In addition, around 40% of RFCs made no use of uppercase requirements level keywords. These observations point to inconsistency in use of these keywords, and possibly misunderstanding about their importance in a security context. We saw that RFCs relating to Session Initiation Protocol (SIP) made most use of ambiguous keywords, and had the most number of implementation flaws as seen across SIP-based CVEs. While not conclusive, this suggests that there may be some correlation between the level of ambiguity in RFCs and subsequent implementation security flaws.
In other words, ambiguous language leads to ambiguous implementations which leads to security flaws in protocols.
The solution for this situation might be just this—specify protocols more rigorously. But simple solutions rarely admit reality within their scope. It’s easy to build more precise specifications—so why aren’t our specifications more precise?
In a word: politics.
For every RFC I’ve been involved in drafting, reviewing, or otherwise getting through the IETF, there are two reasons for each MAY or SHOULD therein. The first is someone has thought of a use-case where an implementor or operator might want to do something that would be otherwise not allowed by MUST. In these cases, everyone looks at the proposed MAY or SHOULD, thinks about how not doing it might be useful, and then thinks … “this isn’t so bad, the available functionality is a good thing, and there’s no real problem I can see with making this a MAY or SHOULD.” In other words, we can think of possible worlds where someone might want to do something, so we allow them to do it. Call this the “freedom principle.”
The second reason is that multiple vendors have multiple customers who want to do things different ways. When the two vendors clash in the realm of standards, the result is often a set of interlocking MAYs and SHOULDs that allow two implementors to build solutions that are interoperable in the main, but not along the edges, that satisfy both of their existing customer’s requirements. Call this the “big check principle.”
The problem with these situations is—the specification has an undetermined set of MAYs and SHOULDs that might interlock in unforeseen ways, resulting in unanticipated variances in implementations that ultimately show up as security holes.
Okay—now that I’ve described the problem, what can you do about it? One thing is to simplify. Stop putting everything into a small set of protocols. The more functionality you pour into a protocol or system, the harder it is to secure. Complexity is the enemy of security (and privacy!).
As for the political problems, these are human-scale, which means they are larger than any network you can ever build—but I’ll ponder this more and get back to you if I come up with any answers.
What percentage of business-impacting application outages are caused by networks? According to a recent survey by the Uptime Institute, about 30% of the 300 operators they surveyed, 29% have experienced network related outages in the last three years—the highest percentage of causes for IT failures across the period.
A secondary question on the survey attempted to “dig a little deeper” to understand the reasons for network failure; the chart below shows the result.
We can be almost certain the third-party failures, if the providers were queried, would break down along the same lines. Is there a pattern among the reasons for failure?
Configuration change—while this could be somewhat managed through automation, these kinds of failures are more generally the result of complexity. Firmware and software failures? The more complex the pieces of software, the more likely it is to have mission-impacting errors of some kind—so again, complexity related. Corrupted policies and routing tables are also complexity related. The only item among the top preventable causes that does not seem, at first, to relate directly to complexity is network overload and/or congestion problems. Many of these cases, however, might also be complexity related.
The Uptime Institute draws this same lesson, though through a slightly different process, saying: “Networks are complex not only technically, but also operationally.”
For years—decades, even—we have talked about the increasing complexity of networks, but we have done little about it. Yes, we have automated all the things, but automation can only carry us so far in covering complexity up. Automation also adds a large dop of complexity on top of the existing network—sometimes (not always, of course!) automating a complex system without making substantial efforts at simplification is just like trying to put a fire out with a can of gas (or, in one instance I actually saw, trying to put out an electrical fire with a can of soda, with the predictable trip to the local hospital.
We are (finally) starting to be “bit hard” by complexity problems in our networks—and I suspect this is the leading edge of the problem, rather than the trailing edge.
Maybe it’s time to realize making every protocol serve every purpose in the network wasn’t a good idea—we now have protocols that are so complex that they can only be correctly configured by machines, and then only when you narrow the use case enough to make the design parameters intelligible.
Maybe it’s time to realize optimizing for every edge use case wasn’t a good idea. Sometimes it’s just better to throw resources at the problem, rather than throwing state at the control plane to squeeze out just one more ounce of optimization.
Maybe it’s time to stop building networks around “whatever the application developer can dream up.” To start working as a team with the application developers to build a complete system that puts complexity where it most makes sense, and divides complexity from complexity, rather than just assuming “the network can do that.”
Maybe it’s time to stop thinking we can automate our way out of this.
Maybe it’s time to lay our superhero capes down and just start building simpler systems.
Crossing from the domain of test pilots to the domain of network engineering might seem like a large leap indeed—but user interfaces and their tradeoffs are common across physical and virtual spaces. Brian Keys, Eyvonne Sharp, Tom Ammon, and Russ White as we start with user interfaces and move into a wider discussion around attitudes and beliefs in the network engineering world.
Every software developer has run into “god objects”—some data structure or database that every process must access no matter what it is doing. Creating god objects in software is considered an anti-pattern—something you should not do. Perhaps the most apt description of the god object I’ve seen recently is you ask for a banana, and you get the gorilla as well.
We seem to have a deep desire to solve all the complexity of modern networks through god objects. There was ATM, which was going to solve all our networking problems by allowing the edge device (or a centralized controller) to control the path its traffic takes through the network. There is LISP, which is going to solve every mapping and tunneling/transport problem in the entire networking world (including mobility and security). There is SDN, which is going to solve everything by pushing it all into a controller.
And now there is BGP, which can be a link state protocol (LSVR), the ideal DC fabric control plane, the ideal interdomain protocol, the ideal IGP … a sort-of distributed god object that solves everything, everywhere, all the time (life in the fast lane…).
The problem is, a bunch of people are asking for different bananas, and what we keep getting is the gorilla.
A lot of this is our fault. We crave simplicity so much that we are willing to believe just about anyone, or anything, that promises to solve all the problems we face in building and operating networks in a simple and easy way. The truth is, however, there are no simple solutions to hard problems—solving hard problems requires complex solutions.
Okay, so we can intuitively know god objects are bad, but why? Because they are too complex to really understand, and therefore they cannot be truly operated, nor can you effectively troubleshoot them. There are too many unintended consequences, too many places where you cannot understand the relationship between this and that.
To put this in more human terms—there are many people in the modern world who think that if they just have enough data, they will understand people well enough to operate and troubleshoot them like some kind of machine. Sorry to tell you this—humans are just too complex, and human social institutions, being made up of people (well, duh!), are way more complex than even the most intelligent artificial intelligence can ever “understand.”
We need to fall out of love with the utopia of the god object for once and for all. We need to go back to building simpler systems that solve one or two problems well, and then combining these into intelligible solutions that can be understood, managed, and repaired.
But the move back towards real simplicity has to begin with you.
One of the major sources of complexity in modern systems is the simple failure to pull back the curtains. From a recent blog post over at the ACM—
The Wizard of Oz was a charlatan. You’d be surprised, too, how many programmers don’t understand what’s going on behind the curtain either. Some years ago, I was talking with the CTO of a company, and he asked me to explain what happens when you type a URL into your browser and hit enter. Do you actually know what happens? Think about it for a moment.
Yegor describes three different reactions when a coder faces something unexpected when solving a problem.
Throw in the towel. Just give up on solving the problem. This is fairly uncommon in the networking and programming fields, so I don’t have much to say here.
Muddle through. Just figure out how to make it work by whatever means necessary.
Open the curtains and build an excellent solution. Learn how the underlying systems work, understand how to interact with them, and create a solution that best takes advantage of them.
The first and third options are rare indeed; it is the second solution that seems to dominate our world. What generally tends to happen is we set out to solve some problem, we encounter resistance, and we either “just make it work” by fiddling around with the bits or we say “this is just too complex, I’m going to build something new that simpler and easier.” The problem with building something new is the “something new” must go someplace … which generally means on top of existing “stuff.” Adding more stuff you do understand on top of stuff you don’t understand to solve a problem is, of course, a prime way to increase complexity in a network.
And thus we have one of the prime reasons for ever-increasing complexity in networks.
Yegor says being a great programmer by pulling back the curtain increases job satisfaction, helping him avoid depression. The same is probably true of network engineers who are deeply interested in solving problems—who are only happy at the end of the day if they know they have solved some problem, even if no-one ever notices.
Pulling back the curtains, then not only helps us to manage complexity, it can alos improve job satisfaction for those with the problem-solving mindset. Great reasons to pull back to the curtains, indeed.
According to Maor Rudick, in a recent post over at Cloud Native, programming is 10% writing code and 90% understanding why it doesn’t work. This expresses the art of deploying network protocols, security, or anything that needs thought about where and how. I’m not just talking about the configuration, either—why was this filter deployed here rather than there? Why was this BGP community used rather than that one? Why was this aggregation range used rather than some other? Even in a fully automated world, the saying holds true.
So how can you improve the understandability of your network design? Maor defines understandability as “the dev who creates the software is to effortlessly … comprehend what is happening in it.” Continuing—“the more understandable a system is, the easier it becomes for the developers who created it to change it in a way that is safe and predictable.” What are the elements of understandability?
Documentation must be complete, clear, concise, and organized. The two primary failings I encounter in documentation are completeness and organization. Why something is done, when it was last changed, and why it was changed are often missing. The person making the change just assumes “I’ll remember this, or someone will figure it out.” You won’t, and they won’t. Concise is the “other side” of complete … Recording unsubstantial changes just adds information that won’t ever be needed. You have to balance between enough and too much, of course.
Organization is another entire problem in documentation—most people have a favorite way to organize things. When you get a team of people all organizing things based on their favorite way, you end up with a mess. Going back in time … I remember that just about everyone who was assigned to the METNAV shop began their time by re-organizing the tools. Each time the re-organization made things so much easier to find, and improved the MTTR for the airfield equipment we supported … After a while, you’d think someone would ask, “Does re-organizing all the tools every year really help? Or are you just making stuff up for new folks to do?”
Moving beyond documentation, what else can we do to make our networks more understandable?
First, we can focus on actually making networks simpler. I don’t mean just glossing things over with a pretty GUI, or automating thousands of lines of configuration using Python. I mean taking steps by using protocols that are simpler to run, require less configuration, and produce more information you can use for troubleshooting—choose something like IS-IS for your DC fabric underlay rather than BGP, unless you really have several hundred thousand of underlay destinations (hint, if you’ve properly separated “customer” routes in the overlay from “infrastructure” routes in the underlay, you shouldn’t have this kind of routing tangle in the underlay anyway).
What about having multiple protocols that do the same job? Do you really need three or four routing protocols, four or five tunneling protocols, and five or six … well, you get the idea. Reducing the sheer number of protocols running in your network can make a huge difference in the tooling troubleshooting time. What about having four or five kinds of boxes in your network that fulfill the same role? Okay—so maybe you have three DC fabrics, and you run each one using a different vendor. But is there is any reason to have three DC fabrics, each of which has a broad mix of equipment from five different vendors? I doubt it.
Second, you can think about what you would measure in the case of failure, how you would measure it, and put the basic piece in place in the design phase to do those measurements. Don’t wait until you need the data to figure out how to get at it, and what the performance results of trying to get it are going to be.
Third, you can think about where you put policy in your network. There is no “right” answer to this question, other than … be consistent. The first option is to put all your policy in one place—say, on the devices that connect the core to the aggregation, or the devices in the distribution layer. The second option is to always put the policy as close to the source or destination of the traffic impacted by the policy. In a DC fabric, you should always put policy and external connectivity in the T0 or ToR, never in the spine (it’s not a core, it’s a spine).
Maybe you have other ideas on how to improve understandability in networks … If you do, get in touch and let’s talk about it. I’m always looking for practical ways to make networks more understandable.