Way in the past, the EIGRP team (including me) had an interesting idea–why not aggregate routes automatically as much as possible, along classless bounds, and then deaggregate routes when we could detect some failure was causing a routing black hole? To understand this concept better, consider the network below.
In this network, B and C are connected to four different routers, each of which is advertising a different subnet. In turn, B and C are aggregating these four routes into 2001:db8:3e8:10::/60, and advertising this aggregate towards A. From a control plane state perspective, this is a major win. The obvious gain is that the amount of state is reduced from four routes to one. The less obvious gain is A doesn’t need to know about any changes in the state for the four destinations aggregated into the /60. Depending on how often these links change state, the reduction in the rate of change is, perhaps, more important than the reduction in the amount of control plane state.
We always know there will be a tradeoff when reducing state; what is the tradeoff here? If C somehow loses its connection to one of the four routers, say the router advertising 11::/64, C’s 10::/60 aggregate will not change. Since A thinks C still has a route to every subnet within 10::/60, it will continue sending traffic destined to addresses in 11::/64 towards both B and C. C will not have a route towards these destinations, so it will drop the traffic.
We have a routing black hole.
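The black hole can be sketched in a few lines of Python using the standard `ipaddress` module. This is only an illustration under stated assumptions: the text names just one component (11::/64), so the other three /64s, the table layouts, and the `longest_match` helper are hypothetical.

```python
import ipaddress

AGGREGATE = ipaddress.ip_network("2001:db8:3e8:10::/60")
# Hypothetical component subnets; only 11::/64 is named in the text.
COMPONENTS = [
    ipaddress.ip_network(f"2001:db8:3e8:{h}::/64") for h in ("10", "11", "12", "13")
]
LOST = ipaddress.ip_network("2001:db8:3e8:11::/64")


def longest_match(table, destination):
    """Return the most specific prefix in table that contains destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in table if addr in prefix]
    return max(matches, key=lambda p: p.prefixlen, default=None)


# Every component sits inside the /60, so A needs only one route.
assert all(c.subnet_of(AGGREGATE) for c in COMPONENTS)

# A holds only the aggregate, learned from both B and C.
table_a = {AGGREGATE: ["B", "C"]}

# C has lost its link towards 11::/64 but still advertises the /60.
table_c = {c: "downstream" for c in COMPONENTS if c != LOST}

destination = "2001:db8:3e8:11::1"
at_a = longest_match(table_a, destination)
at_c = longest_match(table_c, destination)

print(f"A matches {at_a} and may forward to {table_a[at_a]}")
print("C drops the packet" if at_c is None else f"C forwards via {at_c}")
```

A still matches the aggregate and may pick C as a next hop, while C finds no matching route for the same destination.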
This much is pretty simple. The harder part is figuring out how to eliminate this routing black hole. Our first choice is to just not aggregate these routes. While you might be cringing right now, this isn’t such a bad option in many networks. We often underestimate the amount of state and the speed of state change modern routing protocols running on modern processors can support. I’ve seen networks running IS-IS in a single flooding domain with tens of thousands of routes and thousands of nodes running “in the wild.” I’ve seen IS-IS networks with thousands of nodes and hundreds of thousands of routes running in lab environments. These networks still converge.
But what if we really think we need to reduce the amount and speed of state, so we really need to aggregate these routes?
One solution that has been proposed a number of times through the years is auto disaggregation.
In this case, suppose D somehow realizes C cannot reach one of the components of a shared aggregate route. D could simply stop advertising the aggregate, advertising each of the components instead. The question here might be: is this a good idea? Looking at this from the perspective of the SOS (state, optimization, surfaces) triad, the aggregation replaced four routes with a single route. In the auto disaggregation case, the single route change is replaced by four route changes. The amount of state is variable, and in some cases the rate of change in state is actually higher than without the aggregation.
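To make the state-versus-rate-of-change tradeoff concrete, here is a rough Python sketch that counts what an upstream router would see when one component fails. It is not how EIGRP or any other protocol implements this; the component subnets (other than 11::/64), the `advertise` and `updates` helpers, and the rule "withdraw the aggregate and advertise the surviving components" are all assumptions made for illustration.

```python
import ipaddress

AGGREGATE = ipaddress.ip_network("2001:db8:3e8:10::/60")
# Hypothetical component subnets; only 11::/64 is named in the text.
COMPONENTS = [
    ipaddress.ip_network(f"2001:db8:3e8:{h}::/64") for h in ("10", "11", "12", "13")
]


def advertise(reachable, auto_disaggregate):
    """Return the set of prefixes a router advertises upstream."""
    if not reachable:
        return set()
    if auto_disaggregate and len(reachable) < len(COMPONENTS):
        # A component is missing: withdraw the aggregate, send the survivors.
        return set(reachable)
    return {AGGREGATE}


def updates(before, after):
    """Count route changes (withdrawals plus new advertisements) seen upstream."""
    return len(before ^ after)


all_up = set(COMPONENTS)
one_down = all_up - {ipaddress.ip_network("2001:db8:3e8:11::/64")}

for label, auto in (("plain aggregate", False), ("auto disaggregation", True)):
    before = advertise(all_up, auto)
    after = advertise(one_down, auto)
    print(f"{label}: {len(after)} route(s), {updates(before, after)} change(s) upstream")

# Without any aggregation the same failure is a single withdrawal.
print(f"no aggregation: {len(one_down)} route(s), {updates(all_up, one_down)} change(s) upstream")
```

Under these assumptions the single link failure costs zero changes with a plain aggregate (at the price of the black hole), four changes with auto disaggregation, and one change with no aggregation at all.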
I don’t hold that auto disaggregation is either good or bad; it just presents a different set of challenges to the network designer. Instead of designing for average rates of change and given table sizes, you can count on much smaller tables, but you might find there are times when the rate of change is dramatically higher than you expect. A good question to ask, before deploying this kind of technology, might be: can I foresee a chain of events that will cause a high enough rate of state change that auto disaggregation is actually more destabilizing than just not summarizing at all in this network?
A real danger with auto disaggregation, by the way, is using summarization to dramatically reduce table sizes without understanding how a goldilocks failure (what we used to call in telco a mother’s day event, or perhaps a black swan) can cascade into widespread failures. If you’re counting on particular devices in your network only having a dozen or two dozen table entries, but just the right set of failures can cause them to have several thousand entries because of auto disaggregation, what kinds of failure modes should you anticipate? Can you anticipate or mitigate this kind of problem?
The idea of automatically summarizing and disaggregating routes is an interesting study in complexity, state, and optimization. It’s a good brain exercise in thinking through what-if situations, and carefully thinking about when and where to deploy this kind of thing.
What do you think about this idea? When would you deploy it, where, and why? When and where would you be cautious about deploying this kind of technology?
Engineers (and marketing folks) love new technology. Watching an engineer learn or unwrap some new technology is like watching a dog chase a squirrel—the point is not to catch the squirrel, it’s just that the chase is really fun. Join Andrew Wertkin (from BlueCat Networks), Tom Ammon, and Russ White as we discuss the importance of simple, boring technologies, and moderating our love of the new.
What is the first thing almost every training course in routing protocols begins with? Building adjacencies. What is considered the “deep stuff” in routing protocols? Knowing packet formats and processes down to the bit level. What is considered the place where the rubber meets the road? How to configure the protocol.
I’m not trying to cast aspersions at widely available training, but I sense we have this all wrong—and this is a sense I’ve had ever since my first book was released in 1999. It’s always hard for me to put my finger on why I consider this way of thinking about network engineering less-than-optimal, or why we approach training this way.
This, however, is one thing I think is going on here—
The typical program aims to counter the inherent complexity of the decision by providing in-depth information. By providing such extremely detailed and complex information, these interventions try to enable people to make perfect decisions.
We believe that by knowing ever-deeper reaches of detail about a protocol, we are not only more educated engineers, but we will be able to make better decisions in the design and troubleshooting spaces.
To some degree, we think we are managing the complexity of the protocol by “making our knowledge practical”—by knowing the bits, bytes, and configurations. This natural tendency to “dig in,” to learn more detail, turns out to be counterproductive. Continuing from the same article—
The scientific opinion of many psychologists and behavioral scientists suggests the key to time-sensitive decision making in complex and chaotic situations is simplicity, not complexity. Simple-to-remember rules of thumb, or heuristics, speed the cognitive process, enabling faster decisionmaking and action. Recognizing that heuristics have limitations and are not a substitute for basic research and analysis, they nevertheless help break complexity-induced paralysis and support the development of good plans that can achieve timely and acceptable results. The best heuristics capture useful information in an intuitive, easy-to-recall way. Their utility is in assisting decision makers in complex and chaotic situations to make better and timelier decisions.
Knowing why a protocol works the way it does—understanding what it’s doing and why—from an abstract perspective is, I believe, a more important skill for the average network engineer than knowing the bits and bytes—or the configuration.
Abstract correctly—but abstract more. Get back to the basics and know why things work the way they do. It’s easier to fill in the details if you know the how and why.
Recent research into the text of RFCs versus the security of the protocols described came to this conclusion—
This should come as no surprise to network engineers—after all, complexity is the enemy of security. Beyond the novel ways the authors use to understand the shape of the world of RFCs (you should really read the paper; it’s really interesting), this desire to increase security by decreasing the ambiguity of specifications is fascinating. We often think that writing better specifications requires having better requirements, but down this path only lies despair.
Better requirements are the one thing a network engineer can never really hope for.
It’s not just that networks are often used as a sort of “complexity sink,” the place where every hard problem goes to be solved. It’s also the uncertainty of the environment in which the network must operate. What new application will be stuffed on top of the network this week? Will anyone tell the network folks about this new application, or just open a ticket when it doesn’t work right? What about all the changes developers are making to applications right now, and their impact on the network? There are link failures, software failures, hardware failures, and the mean time between mistakes. There is the pace of innovation (which I tend to think is a bit overblown—rule11, after all—we are often talking about new products rather than new ideas).
What the network is supposed to do—just provide IP transport between two devices—turns out to be hard. It’s hard because “just transporting packets” isn’t ever enough. These packets must be delivered consistently (jitter and drops) across an ever-changing landscape.
To this end—
[C]omplexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.
Uncertainty is the key word here. What can we do about all of this?
We can reduce uncertainty. There are three ways to reduce uncertainty. First, you can obfuscate it, which is harmful. Second, you can reduce the scope of the job at hand, throwing some of the uncertainty (and therefore complexity) over the cubicle wall. This can be useful in some situations, but remember that the less work you’re doing, the less value you add. Beware of self-commodifying.
Finally, you can manage the uncertainty. This generally means using modularization intelligently to partition off problems into smaller sets. It’s easier to solve a set of well-scoped problems with little uncertainty than to solve one big problem with unknowable uncertainty.
This might all sound great in theory, but how do we do this in real life? Where does the rubber hit the road? This is what Ethan and I tried to show in Problems and Solutions—how to understand the problems that need to be solved, and then how to solve each of those problems within a larger system. This is also what many parts of The Art of Network Architecture are about, and then again what Jeff and I wrote about in Navigating Network Complexity.
I know it often seems like it’s not worth learning the theory; it’s so much easier to focus on the day-to-day, the configuration of this device, or the shiny thing that vendor just created. It’s easier to assume that if I can just hide all the complexity behind intent or automation, I can get my weekends back.
The truth is that we’re paid to solve hard problems, and solving hard problems involves complexity. We can either try to cover that up, or we can learn to manage it.
One of the big movements in the networking world is disaggregation—splitting the control plane and other applications that make the network “go” from the hardware and the network operating system. This is, in fact, one of the movements I’ve been arguing in favor of for many years—and I’m not about to change my perspective on the topic. There are many different arguments in favor of breaking the software from the hardware. The arguments for splitting hardware from software and componentizing software are so strong that much of the 5G transition also involves the open RAN, which is a disaggregated stack for edge radio networks.
If you’ve been following my work for any amount of time, you know what comes next: If you haven’t found the tradeoffs, you haven’t looked hard enough.
This article on hardening Linux (you should go read it, I’ll wait ’til you get back) exposes some of the complexities and tradeoffs involved in disaggregation in the area of security. Some further thoughts on hardening Linux here, as well. Two points.
First, disaggregation has serious advantages, but disaggregation is also hard work. With a commercial implementation you wouldn’t necessarily think about these kinds of supply chain issues. This is an example of the state/optimization/surfaces tradeoff. You can optimize your network more fully using disaggregation techniques, but there are going to be more interaction surfaces, and there’s going to be more state to deal with (for instance, the security state on individual devices).
There are several items on this list that also illustrate the state/optimization/surfaces tradeoff. For instance, eBPF is on the list of things to disable … but eBPF is probably going to be crucial to many future network-facing implementations. Anything that’s useful is going to inherently create attack surfaces you need to deal with. Get over it.
Second, just because you don’t think about these issues with a commercial implementation does not mean you don’t need to think about these things—it just means these kinds of things are opaque to you. Rather than trying to do the “right thing” yourself, you are outsourcing this work to a vendor. This is often a rational decision, and even might often be the right decision, but it’s a decision. We often “bury” these kinds of decisions in our thinking, not realizing we are making tradeoffs.
Back in January, I ran into an interesting article called The many lies about reducing complexity:
Reducing complexity sells. Especially managers in IT are sensitive to it as complexity generally is their biggest headache. Hence, in IT, people are in a perennial fight to make the complexity bearable.
Gerben then discusses two ways we often try to reduce complexity. First, we try to simply reduce the number of applications we’re using. We see this all the time in the networking world—if we could only get to a single pane of glass, or reduce the number of management packages we use, or reduce the number of control planes (generally to one), or reduce the number of transport protocols … but reducing the number of protocols doesn’t necessarily reduce complexity. Instead, we can just end up with one very complex protocol. Would it really be simpler to push DNS and HTTP functionality into BGP so we can use a single protocol to do everything?
Second, we try to reduce complexity by hiding it. While this is sometimes effective, it can also lead to unacceptable tradeoffs in performance (we run into the state, optimization, surfaces triad here). It can also make the system more complex if we need to go back and leak information to regain optimal behavior. Think of the OSPF type 4 LSA, which just reinjects information lost in building an area summary, or even the complexity involved in the type 7 to type 5 process required to create not-so-stubby areas.
It would seem, then, that you really can’t get rid of complexity. You can move it around, and sometimes you can effectively hide it, but you cannot get rid of it.
This is, to some extent, true. Complexity is a reaction to difficult environments, and networks are difficult environments.
Even so, there are ways to actually reduce complexity. The solution is not just hiding information because it’s messy, or munging things together because it requires fewer applications or protocols. You cannot eliminate complexity, but if you think about how information flows through a system you might be able to reduce the amount of complexity, and even create boundaries where state (hence complexity) can be more effectively hidden.
As an instance, I have argued elsewhere that building a DC fabric with distinct overlay and underlay protocols can actually create a simpler overall design than using a single protocol. Another instance might be to really think about where route aggregation takes place—is it really needed at all? Why? Is this the right place to aggregate routes? Is there any way I can change the network design to reduce state leaking through the abstraction?
The problem is there are no clear-cut rules for thinking about complexity in this way. There’s no rule of thumb, and there are no best practices. You just have to think through each individual situation and consider how, where, and why state flows, and then think through the state/optimization/surface tradeoffs for each possible way of reducing the complexity of the system. You have to take into account that local reductions in complexity can cause the overall system to be much more complex, as well, and eventually make the system brittle.
There’s no “pat” way to reduce complexity; the claim that there is one is perhaps one of the biggest lies about complexity in the networking world.
Why are networks so insecure?
One reason is we don’t take network security seriously. We just don’t think of the network as a serious target of attack. Or we think of security as a problem “over there,” something that exists in the application realm, that needs to be solved by application developers. Or we think of the consequences of a network security breach as “well, they can DDoS us, and then we can figure out how to move load around, so if we build with resilience (enough redundancy) we’re already taking care of our security issues.” Or we put our trust in the firewall, which sits there like some magic box solving all our problems.
The problem is–none of this is true. In any system where overall security is important, defense-in-depth is the key to building a secure system. No single part of the system bears the “primary responsibility” for “security.” The network is certainly a part of any defense-in-depth scheme that is going to work.
Which means network protocols need to be secure, at least in some sense, as well. I don’t mean “secure” in the sense of privacy; routes are not (generally) personally identifiable information (there are always exceptions, however). Rather, I mean “secure” in the sense that they cannot be easily attacked. On-the-wire encryption should prevent anyone from reading the contents of the packet or stream all the time. Network devices like routers and switches should be difficult to break into, which means the protocols they run must be “secure” in the fuzzing sense: there should be no unexpected outputs because you’ve received an unexpected input.
I definitely do not mean path security of any sort. Making certain a packet (or update or whatever else) has followed a specified path is a chimera in packet switched networks. It’s like trying to nail your choice of multicolored gelatin dessert to the wall. Packet switched networks are designed to adapt to changes in the network by rerouting traffic. Get over it.
So why are protocols and network devices so insecure? I recently ran into an interesting piece of research that provides some of the answer. To wit—
Our research saw that ambiguous keywords SHOULD and MAY had the second highest number of occurrences across all RFCs. We’ve also seen that their intended meaning is only to be interpreted as such when written in uppercase (whereas often they are written in lowercase). In addition, around 40% of RFCs made no use of uppercase requirements level keywords. These observations point to inconsistency in use of these keywords, and possibly misunderstanding about their importance in a security context. We saw that RFCs relating to Session Initiation Protocol (SIP) made most use of ambiguous keywords, and had the most number of implementation flaws as seen across SIP-based CVEs. While not conclusive, this suggests that there may be some correlation between the level of ambiguity in RFCs and subsequent implementation security flaws.
In other words, ambiguous language leads to ambiguous implementations which leads to security flaws in protocols.
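For a rough feel of the kind of counting the researchers describe, here is a minimal Python sketch that tallies requirement-level keywords from RFC 2119, split into uppercase and lowercase uses. It is not the researchers’ method or tooling; the `count_keywords` helper and the sample sentence are made up for illustration, and a plain regular expression will miscount ordinary words such as the month “May.”

```python
import re
from collections import Counter

# Requirement-level keywords from RFC 2119; multiword forms come first
# so the regex prefers "SHOULD NOT" over "SHOULD".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "MUST", "SHALL", "SHOULD",
    "REQUIRED", "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def count_keywords(text):
    """Count keyword uses, split into uppercase and non-uppercase occurrences."""
    upper, lower = Counter(), Counter()
    for match in PATTERN.finditer(text):
        word = match.group(1)
        bucket = upper if word.isupper() else lower
        bucket[word.upper()] += 1
    return upper, lower


# Made-up specification text, not taken from any RFC.
sample = (
    "The sender MUST set the flag. The receiver SHOULD ignore unknown options "
    "and may log them. Implementations should not retry; a client MAY retry once."
)

upper, lower = count_keywords(sample)
print("uppercase:", dict(upper))
print("lowercase:", dict(lower))
```

Even in this tiny sample, two of the five keyword uses are lowercase, which leaves a reader guessing whether they were meant as requirements.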
The solution for this situation might be just this—specify protocols more rigorously. But simple solutions rarely admit reality within their scope. It’s easy to build more precise specifications—so why aren’t our specifications more precise?
In a word: politics.
For every RFC I’ve been involved in drafting, reviewing, or otherwise getting through the IETF, there are two reasons for each MAY or SHOULD therein. The first is someone has thought of a use-case where an implementor or operator might want to do something that would be otherwise not allowed by MUST. In these cases, everyone looks at the proposed MAY or SHOULD, thinks about how not doing it might be useful, and then thinks … “this isn’t so bad, the available functionality is a good thing, and there’s no real problem I can see with making this a MAY or SHOULD.” In other words, we can think of possible worlds where someone might want to do something, so we allow them to do it. Call this the “freedom principle.”
The second reason is that multiple vendors have multiple customers who want to do things different ways. When the two vendors clash in the realm of standards, the result is often a set of interlocking MAYs and SHOULDs that allow two implementors to build solutions that are interoperable in the main, but not along the edges, that satisfy both of their existing customers’ requirements. Call this the “big check principle.”
The problem with these situations is—the specification has an undetermined set of MAYs and SHOULDs that might interlock in unforeseen ways, resulting in unanticipated variances in implementations that ultimately show up as security holes.
Okay—now that I’ve described the problem, what can you do about it? One thing is to simplify. Stop putting everything into a small set of protocols. The more functionality you pour into a protocol or system, the harder it is to secure. Complexity is the enemy of security (and privacy!).
As for the political problems, these are human-scale, which means they are larger than any network you can ever build—but I’ll ponder this more and get back to you if I come up with any answers.