Details and Complexity

What is the first thing almost every training course in routing protocols begin with? Building adjacencies. What is considered the “deep stuff” in routing protocols? Knowing packet formats and processes down to the bit level. What is considered the place where the rubber meets the road? How to configure the protocol.

I’m not trying to cast aspersions at widely available training, but I sense we have this all wrong—and this is a sense I’ve had ever since my first book was released in 1999. It’s always hard for me to put my finger on why I consider this way of thinking about network engineering less-than-optimal, or why we approach training this way.

This, however, is one thing I think is going on here—

The typical program aims to counter the inherent complexity of the decision by providing in-depth information. By providing such extremely detailed and complex information, these interventions try to enable people to make perfect decisions.

We believe that by knowing ever-deeper reaches of detail about a protocol, we are not only more educated engineers, but we will be able to make better decisions in the design and troubleshooting spaces.

To some degree, we think we are managing the complexity of the protocol by “making our knowledge practical”—by knowing the bits, bytes, and configurations. This natural tendency to “dig in,” to learn more detail, turns out to be counterproductive. Continuing from the same article—

The scientific opinion of many psychologists and behavioral scientists suggests the key to time-sensitive decision making in complex and chaotic situations is simplicity, not complexity. Simple-to-remember rules of thumb, or heuristics, speed the cognitive process, enabling faster decisionmaking and action. Recognizing that heuristics have limitations and are not a substitute for basic research and analysis, they nevertheless help break complexity-induced paralysis and support the development of good plans that can achieve timely and acceptable results. The best heuristics capture useful information in an intuitive, easy-to-recall way. Their utility is in assisting decision makers in complex and chaotic situations to make better and timelier decisions.

Knowing why a protocol works the way it does—understanding what it’s doing and why—from an abstract perspective is, I believe, a more important skill for the average network engineer than knowing the bits and bytes—or the configuration.

Abstract correctly—but abstract more. Get back to the basics and know why things work the way they do. It’s easier to fill in the details if you know the how and why.

Ambiguity and complexity: once more into the breach

Recent research into the text of RFCs versus the security of the protocols described came to this conclusion—

While not conclusive, this suggests that there may be some correlation between the level of ambiguity in RFCs and subsequent implementation security flaws.

This should come as no surprise to network engineers—after all, complexity is the enemy of security. Beyond the novel ways the authors use to understand the shape of the world of RFCs (you should really read the paper; it’s really interesting), this desire to increase security by decreasing the ambiguity of specifications is fascinating. We often think that writing better specifications requires having better requirements, but down this path only lies despair.

Better requirements are the one thing a network engineer can never really hope for.

It’s not just that networks are often used as a sort of “complexity sink,” the place where every hard problem goes to be solved. It’s also the uncertainty of the environment in which the network must operate. What new application will be stuffed on top of the network this week? Will anyone tell the network folks about this new application, or just open a ticket when it doesn’t work right? What about all the changes developers are making to applications right now, and their impact on the network? There are link failures, software failures, hardware failures, and the mean time between mistakes. There is the pace of innovation (which I tend to think is a bit overblown—rule11, after all—we are often talking about new products rather than new ideas).

What the network is supposed to do—just provide IP transport between two devices—turns out to be hard. It’s hard because “just transporting packets” isn’t ever enough. These packets must be delivered consistently (jitter and drops) across an ever-changing landscape.

To this end—

[C]omplexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.

Uncertainty is the key word here. What can we do about all of this?

We can reduce uncertainty. There are three ways to reduce uncertainty. First, you can obfuscate it—this is harmful. Second, you can reduce the scope of the job at hand, throwing some of the uncertainty (and therefore complexity) over the cubicle way. This can be useful in some situations, but remember that the less work you’re doing, the less value you add. Beware of self-commodifying.

Finally, you can manage the uncertainty. This generally means using modularization intelligently to partition off problems into smaller sets. It’s easier to solve a set of well-scope problems with little uncertainty than to solve one big problem with unknowable uncertainty.

This might all sound great in theory, but how do we do this in real life? Where does the rubber hit the road? This is what Ethan and I tried to show in Problems and Solutions—how to understand the problems that need to be solved, and then how to solve each of those problems within a larger system. This is also what many parts of The Art of Network Architecture are about, and then again what Jeff and I wrote about in Navigating Network Complexity.

I know it often seems like it’s not worth learning the theory; it’s so much easier to focus on the day-to-day, the configuration of this device, or the shiny thing that vendor just created. It’s easier to assume that if I can just hide all the complexity behind intent or automation, I can get my weekends back.

The truth is that we’re paid to solve hard problems, and solving hard problems involves complexity. We can either try to cover that up, or we can learn to manage it.

If you haven’t found the tradeoffs …

One of the big movements in the networking world is disaggregation—splitting the control plane and other applications that make the network “go” from the hardware and the network operating system. This is, in fact, one of the movements I’ve been arguing in favor of for many years—and I’m not about to change my perspective on the topic. There are many different arguments in favor of breaking the software from the hardware. The arguments for splitting hardware from software and componentizing software are so strong that much of the 5G transition also involves the open RAN, which is a disaggregated stack for edge radio networks.

If you’ve been following my work for any amount of time, you know what comes next: If you haven’t found the tradeoffs, you haven’t looked hard enough.

This article on hardening Linux (you should go read it, I’ll wait ’til you get back) exposes some of the complexities and tradeoffs involved in disaggregation in the area of security. Some further thoughts on hardening Linux here, as well. Two points.

First, disaggregation has serious advantages, but disaggregation is also hard work. With a commercial implementation you wouldn’t necessarily think about these kinds of supply chain issues. This is an example of the state/optimization/surfaces tradeoff. You can optimize your network more fully using disaggregation techniques, but there are going to be more interaction surfaces, and there’s going to be more state to deal with (for instance, the security state on individual devices).

There are several items on this list that also illustrate the state/optimization/surfaces tradeoff. For instance, eBPF is on the list of things to disable … but eBPF is probably going to be crucial to many future network-facing implementations. Anything that’s useful is going to inherently create attack surfaces you need to deal with. Get over it.

Second, just because you don’t think about these issues with a commercial implementation does not mean you don’t need to think about these things—it just means these kinds of things are opaque to you. Rather than trying to do the “right thing” yourself, you are outsourcing this work to a vendor. This is often a rational decision, and even might often be the right decision, but it’s a decision. We often “bury” these kinds of decisions in our thinking, not realizing we are making tradeoffs.

Complexity Reduction?

Back in January, I ran into an interesting article called The many lies about reducing complexity:

Reducing complexity sells. Especially managers in IT are sensitive to it as complexity generally is their biggest headache. Hence, in IT, people are in a perennial fight to make the complexity bearable.

Gerben then discusses two ways we often try to reduce complexity. First, we try to simply reduce the number of applications we’re using. We see this all the time in the networking world—if we could only get to a single pane of glass, or reduce the number of management packages we use, or reduce the number of control planes (generally to one), or reduce the number of transport protocols … but reducing the number of protocols doesn’t necessarily reduce complexity. Instead, we can just end up with one very complex protocol. Would it really be simpler to push DNS and HTTP functionality into BGP so we can use a single protocol to do everything?

Second, we try to reduce complexity by hiding it. While this is sometimes effective, it can also lead to unacceptable tradeoffs in performance (we run into the state, optimization, surfaces triad here). It can also make the system more complex if we need to go back and leak information to regain optimal behavior. Think of the OSPF type 4, which just reinjects information lost in building an area summary, or even the complexity involved in the type7 to type 5 process required to create not-so-stubby areas.

It would seem, then, that you really can’t get rid of complexity. You can move it around, and sometimes you can effectively hide it, but you cannot get rid of it.

This is, to some extent, true. Complexity is a reaction to difficult environments, and networks are difficult environments.

Even so, there are ways to actually reduce complexity. The solution is not just hiding information because it’s messy, or munging things together because it requires fewer applications or protocols. You cannot eliminate complexity, but if you think about how information flows through a system you might be able to reduce the amount of complexity, and even create boundaries where state (hence complexity) can be more effectively hidden.

As an instance, I have argued elsewhere that building a DC fabric with distinct overlay and underlay protocols can actually create a simpler overall design than using a single protocol. Another instance might be to really think about where route aggregation takes place—is it really needed at all? Why? Is this the right place to aggregate routes? Is there any way I can change the network design to reduce state leaking through the abstraction?

The problem is there are no clear-cut rules for thinking about complexity in this way. There’s no rule of thumb, there’s no best practices. You just have to think through each individual situation and consider how, where, and why state flows, and then think through the state/optimization/surface tradeoffs for each possible way of reducing the complexity of the system. You have to take into account that local reductions in complexity can cause the overall system to be much more complex, as well, and eventually make the system brittle.

There’s no “pat” way to reduce complexity—that there is, is perhaps one of the biggest lies about complexity in the networking world.

The Insecurity of Ambiguous Standards

Why are networks so insecure?

One reason is we don’t take network security seriously. We just don’t think of the network as a serious target of attack. Or we think of security as a problem “over there,” something that exists in the application realm, that needs to be solved by application developers. Or we think the consequences of a network security breach as “well, they can DDoS us, and then we can figure out how to move load around, so if we build with resilience (enough redundancy) we’re already taking care of our security issues.” Or we put our trust in the firewall, which sits there like some magic box solving all our problems.

The problem is–none of this is true. In any system where overall security is important, defense-in-depth is the key to building a secure system. No single part of the system bears the “primary responsibility” for “security.” The network is certainly a part of any defense-in-depth scheme that is going to work.

Which means network protocols need to be secure, at least in some sense, as well. I don’t mean “secure” in the sense of privacy—routes are not (generally) personally identifiable information (there are always exceptions, however). But rather “secure” in the sense that they cannot be easily attacked. On-the-wire encryption should prevent anyone from reading the contents of the packet or stream all the time. Network devices like routers and switches should be difficult to break in too, which means the protocols they run must be “secure” in the fuzzing sense—there should be no unexpected outputs because you’ve received an unexpected input.

I definitely do not mean path security of any sort. Making certain a packet (or update or whatever else) has followed a specified path is a chimera in packet switched networks. It’s like trying to nail your choice of multicolored gelatin desert to the wall. Packet switched networks are designed to adapt to changes in the network by rerouting traffic. Get over it.

So why are protocols and network devices so insecure? I recently ran into an interesting piece of research that provides some of the answer. To wit—

Our research saw that ambiguous keywords SHOULD and MAY had the second highest number of occurrences across all RFCs. We’ve also seen that their intended meaning is only to be interpreted as such when written in uppercase (whereas often they are written in lowercase). In addition, around 40% of RFCs made no use of uppercase requirements level keywords. These observations point to inconsistency in use of these keywords, and possibly misunderstanding about their importance in a security context. We saw that RFCs relating to Session Initiation Protocol (SIP) made most use of ambiguous keywords, and had the most number of implementation flaws as seen across SIP-based CVEs. While not conclusive, this suggests that there may be some correlation between the level of ambiguity in RFCs and subsequent implementation security flaws.

In other words, ambiguous language leads to ambiguous implementations which leads to security flaws in protocols.

The solution for this situation might be just this—specify protocols more rigorously. But simple solutions rarely admit reality within their scope. It’s easy to build more precise specifications—so why aren’t our specifications more precise?

In a word: politics.

For every RFC I’ve been involved in drafting, reviewing, or otherwise getting through the IETF, there are two reasons for each MAY or SHOULD therein. The first is someone has thought of a use-case where an implementor or operator might want to do something that would be otherwise not allowed by MUST. In these cases, everyone looks at the proposed MAY or SHOULD, thinks about how not doing it might be useful, and then thinks … “this isn’t so bad, the available functionality is a good thing, and there’s no real problem I can see with making this a MAY or SHOULD.” In other words, we can think of possible worlds where someone might want to do something, so we allow them to do it. Call this the “freedom principle.”

The second reason is that multiple vendors have multiple customers who want to do things different ways. When the two vendors clash in the realm of standards, the result is often a set of interlocking MAYs and SHOULDs that allow two implementors to build solutions that are interoperable in the main, but not along the edges, that satisfy both of their existing customer’s requirements. Call this the “big check principle.”

The problem with these situations is—the specification has an undetermined set of MAYs and SHOULDs that might interlock in unforeseen ways, resulting in unanticipated variances in implementations that ultimately show up as security holes.

Okay—now that I’ve described the problem, what can you do about it? One thing is to simplify. Stop putting everything into a small set of protocols. The more functionality you pour into a protocol or system, the harder it is to secure. Complexity is the enemy of security (and privacy!).

As for the political problems, these are human-scale, which means they are larger than any network you can ever build—but I’ll ponder this more and get back to you if I come up with any answers.

Complexity Bites Back

What percentage of business-impacting application outages are caused by networks? According to a recent survey by the Uptime Institute, about 30% of the 300 operators they surveyed, 29% have experienced network related outages in the last three years—the highest percentage of causes for IT failures across the period.

A secondary question on the survey attempted to “dig a little deeper” to understand the reasons for network failure; the chart below shows the result.

We can be almost certain the third-party failures, if the providers were queried, would break down along the same lines. Is there a pattern among the reasons for failure?

Configuration change—while this could be somewhat managed through automation, these kinds of failures are more generally the result of complexity. Firmware and software failures? The more complex the pieces of software, the more likely it is to have mission-impacting errors of some kind—so again, complexity related. Corrupted policies and routing tables are also complexity related. The only item among the top preventable causes that does not seem, at first, to relate directly to complexity is network overload and/or congestion problems. Many of these cases, however, might also be complexity related.

The Uptime Institute draws this same lesson, though through a slightly different process, saying: “Networks are complex not only technically, but also operationally.”

For years—decades, even—we have talked about the increasing complexity of networks, but we have done little about it. Yes, we have automated all the things, but automation can only carry us so far in covering complexity up. Automation also adds a large dop of complexity on top of the existing network—sometimes (not always, of course!) automating a complex system without making substantial efforts at simplification is just like trying to put a fire out with a can of gas (or, in one instance I actually saw, trying to put out an electrical fire with a can of soda, with the predictable trip to the local hospital.

We are (finally) starting to be “bit hard” by complexity problems in our networks—and I suspect this is the leading edge of the problem, rather than the trailing edge.

Maybe it’s time to realize making every protocol serve every purpose in the network wasn’t a good idea—we now have protocols that are so complex that they can only be correctly configured by machines, and then only when you narrow the use case enough to make the design parameters intelligible.

Maybe it’s time to realize optimizing for every edge use case wasn’t a good idea. Sometimes it’s just better to throw resources at the problem, rather than throwing state at the control plane to squeeze out just one more ounce of optimization.

Maybe it’s time to stop building networks around “whatever the application developer can dream up.” To start working as a team with the application developers to build a complete system that puts complexity where it most makes sense, and divides complexity from complexity, rather than just assuming “the network can do that.”

Maybe it’s time to stop thinking we can automate our way out of this.

Maybe it’s time to lay our superhero capes down and just start building simpler systems.

The Hedge 74: Brian Keys and the Complexity of User Interfaces

Crossing from the domain of test pilots to the domain of network engineering might seem like a large leap indeed—but user interfaces and their tradeoffs are common across physical and virtual spaces. Brian Keys, Eyvonne Sharp, Tom Ammon, and Russ White as we start with user interfaces and move into a wider discussion around attitudes and beliefs in the network engineering world.

download