It’s time for a short lecture on complexity.
Networks are complex. This should not be surprising, as building a system that can solve hard problems, while also adapting quickly to changes in the real world, requires complexity—the harder the problem, and the more adaptable the system needs to be, the more complex the resulting design will tend to be. Networks are bound to be complex, because we expect them to be able to support any application we throw at them, adapt to fast-changing business conditions, and adapt to real-world failures of various kinds.
There are several reactions I’ve seen to this reality over the years, each of which has its own trade-offs.
The first is to cover the complexity up with abstractions. Here we take a massively complex underlying system and “contain” it within certain bounds so the complexity is no longer apparent. I can’t really make the system simpler, so I’ll just make the system simpler to use. We see this all the time in the networking world, including things like intent-driven systems, replacing the command line with a GUI, and replacing the command line with an automation system. The strong point of these kinds of solutions is they do, in fact, make the system easier to interact with, or (somewhat) encapsulate that “huge glob of legacy” into a module so you can interface with it in some way that is not… legacy.
One negative side of these kinds of solutions, however, is that they really don’t address the complexity, they just hide it. Many times hiding complexity has a palliative effect, rather than a real-world one, and the final state is worse than the starting state. Imagine someone who has back pain, so they take painkillers, and then go back to the gym to lift even heavier weights than before. Covering the pain up gives them the room to do more damage to their bodies—complexity, like pain, is sometimes a signal that something is wrong.
Another negative side effect of this kind of solution is described by the law of leaky abstractions: all nontrivial abstractions leak. I cannot count the number of times engineers have underestimated the amount of information that leaks through an abstraction layer and the negative impacts such leaks will have on the overall system.
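As a hypothetical example of such a leak (the class and the delay below are invented for illustration): a wrapper that hides whether a value is local or remote keeps its functional promise, but the network underneath still leaks out as latency.

```python
import time

class Store:
    """Abstracts away whether a key lives in local memory or behind a network call."""
    def __init__(self, local, remote):
        self.local = local
        self.remote = remote

    def get(self, key):
        if key in self.local:
            return self.local[key]
        time.sleep(0.05)  # simulated round trip: the clean get() interface leaks latency
        return self.remote[key]
```

Every caller sees the same `get()`; any caller that times it in a loop rediscovers the network the abstraction was supposed to hide.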
The second solution I see people use on a regular basis is to agglutinate multiple solutions into a single solution. The line of thinking here is that reducing the number of moving parts necessarily makes the overall system simpler. This is actually just another form of abstraction, and it normally does not work. For instance, it’s common in data center designs to have a single control plane for both the overlay and underlay (which is different than just not having an overlay!). This will work for some time, but at some level of scale it usually creates more complexity, particularly in trying to find and fix problems, than it solves in reducing configuration effort.
As an example, consider if you could create some form of wheel for a car that contained its own little engine, braking system, and had the ability to “warp” or modify its shape to produce steering effects. The car designer would just provide a single fixed (not moving) attachment point, and let the wheel do all the work. Sounds great for the car designer, right? But the wheel would then be such a complex system that it would be near impossible to troubleshoot or understand. Further, since you have four wheels on the car, you must somehow allow them to communicate, as well as having communication to the driver to know what to do from moment to moment, etc. The simplification achieved by munging all these things into a single component will ultimately be overcome by complexity built around the “do-it-all” system to make the whole system run.
Or imagine a network with a single transport protocol that does everything—host-to-host, connection-oriented, connectionless, encrypted, etc. You don’t have to think about it long to intuitively know this isn’t a good idea.
An example for the reader: Geoff Huston joins the Hedge this week to talk about DNS over HTTPS. Is this an example of munging systems together that shouldn’t be munged together? Or is this a clever solution to a hard problem? Listen to the two episodes and think it through before answering—because I’m not certain there is a clear answer to this question.
Finally, what a lot of people do is toss the complexity over the cubicle wall. Trust me, this doesn’t work in the long run–the person on the other side of the wall has a shovel, too, and they are going to be pushing complexity at you as fast as they can.
There are no easy solutions to solving complexity. The only real way to deal with these problems is by looking at the network as part of a larger system including applications, the business environment, and many other factors. Then figure out what needs to be done, how to divide the work up (where the best abstraction points are), and build replaceable components that can solve each of these problems while leaking the least amount of information, and are internally as simple as possible.
Every other path leads to building more complex, brittle systems.
It was quite difficult to prepare a tub full of bath water at many points in recent history (and it probably still is in many parts of the world). First, there was the water itself—if you do not have plumbing, then the water must be manually transported, one bucket at a time, from a stream, well, or pump, to the tub. The result, of course, would be someone who was sweaty enough to need the forthcoming bath. Then there was the warming of the water. Shy of building a fire under the tub itself, how can you heat enough water quickly enough to make the eventual bathing experience pleasant? According to legend, this resulted in the entire household using the same tub of water to bathe. The last to bathe was always the smallest, the baby. By then, the water would be murky with dirt, which means the child could not be seen in the tub. When the tub was thrown out, then, no-one could tell if the baby was still in there.
But it doesn’t take a dirty tub of water to throw the baby out with the bath. All it really takes is an unwillingness to learn from the lessons of others because, somehow, you have convinced yourself that your circumstances are so different there is nothing to learn. Take, for instance, the constant refrain, “you are not Google.”
I should hope not.
But this phrase, or something similar, is often used to say something like this: you don’t have the problems of any of the hyperscalers, so you should not look to their solutions to find solutions for your problems. An entertaining read on this from a recent blog:
Software engineers go crazy for the most ridiculous things. We like to think that we’re hyper-rational, but when we have to choose a technology, we end up in a kind of frenzy — bouncing from one person’s Hacker News comment to another’s blog post until, in a stupor, we float helplessly toward the brightest light and lay prone in front of it, oblivious to what we were looking for in the first place. —Oz Nova
There is a lot of truth here—you should never choose a system or solution because it solves someone else’s problem. Do not deploy Kafka if you do not need the scale Kafka represents. Maybe you don’t need four links between every pair of routers “just to be certain you have enough redundancy.”
On the other hand, there is a real danger here of throwing the baby out with the bathwater—the water is murky with product and project claims, so just abandon the entire mess. To see where the problem is here, let’s look at another large scale system we don’t think about very much any longer: the NASA space program from the mid-1960’s. One of the great things the folks at NASA have always liked to talk about is all the things that came out of the space program. Remember Tang? Or maybe not. It really wasn’t developed for the space program, and it’s mostly sugar and water, but it was used in some of the first space missions, and hence became associated with hanging out in space.
There are a number of other inventions, however, that really did come directly out of research into solving problems folks hanging out in space would have, such as the space pen, freeze-dried ice cream, exercise machines, scratch-resistant eyeglass lenses, cameras on phones, battery powered tools, infrared thermometers, and many others.
Since you are not going to space any time soon, you refuse to use any of these technologies, right?
Do not be silly. Of course you still use these technologies. Because you are smart enough not to throw the baby out with the bathwater, right?
You should apply the same level of care to the solutions Google, Amazon, LinkedIn, Microsoft, and the other hyperscalers build. Not everything is going to fit in your environment, of course. On the other hand, some things might fit. And regardless of whether any particular technology fits or not, you can still learn something about how systems work by considering how they are building things to scale to their needs. You can adopt operational processes that make sense based on what they have learned. You can pick out technologies and ways of thinking that make sense.
No, you’re (probably) not Google. On the other hand, we are all building complex networks. The more we can learn from those around us, the better what we build will be. Don’t throw the baby out with the bathwater.
We love layers and abstraction. After all, building in layers and its corollary, abstraction, are the foundation of large-scale system design. The only way to build large-scale systems is to divide and conquer, which means building many different component parts with clear and defined interaction surfaces (most often expressed as APIs) and combining these many different parts into a complete system. But abstraction, layering, and modularization have negative aspects as well as positive ones. For instance, according to the State/Optimization/Surface triad, any time we remove state in order to control complexity, we either add an interaction surface (which adds complexity) or we reduce optimization.
Another impact of abstraction, though, is the side effect of Conway’s Law: “organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.” The structure of the organization that designs a system is ultimately baked into the modularization, abstraction, and API schemes of the system itself.
To take a networking instance, many networks use one kind of module for data centers and another for campuses. The style of network built in each place, where the lines are between these different topological locations in the network, the kind of policies expressed at the borders of these networks, and many other factors are baked into the network design based on the organizational design. This is all well and good within an organization, but what happens when you reach outside the organization and purchase hardware and software from another company to build your network?
When you buy systems designed by other organizations, you are importing their organizational structure into your organization. A corollary: the more vertically integrated a system is, or more aggregated, the more of the external organization’s structure you are importing into your organization. It seems like purchasing a strongly integrated system is “pressing the easy button,” but the result is often actually a mess of tech debt from trying to make the organizing principles of the purchased system fit with the internal logic of your organization.
This is why, for instance, some software developers advocate not using open source libraries and frameworks in building internal projects. The idea sounds radical, contrary to the direction of the entire tech culture. The easy button calls; the network is just an undifferentiated bit moving machine, just give me a black box and two lines of configuration, and I’m happy. I don’t want to know how it works.
Until it doesn’t. Or until it doesn’t do what you want for the thousandth time, and you add the thousandth line of configuration to tweak something, or you run into the thousandth assumption in the way the black box works that you just need to change to make your business run better. When you look at the system architecture as a reflection of organizational structure, these little mismatches begin to make sense.
Lessons for the network? It isn’t really possible to build all the bits and pieces you need to build a network. You are not going to be able to write your own network operating system. You are not going to be able to (from scratch) build your own control plane.
But you can be more careful about the organizational structures you are importing. You can understand the internal components and how they connect. You can understand the problems being solved, the general solution being implemented, and how these solutions ultimately fit together to make a whole.
You can insist on having access to the same APIs the system developers use. You can minimize the number of system elements you rely on, such as protocols and nerd knobs. You can, ultimately, disaggregate, treating your software and hardware as two different “things,” each of which has its own lifecycle, purposes, and value.
This is as much about mindset as anything else; the easy button is an abstraction. Abstractions are abstracting something. Don’t just watch the vendor presentation and gawk; dig in and understand.
Simplification is a constant theme not only here, and in my talks, but across the network engineering world right now. But what does this mean practically? Looking at a complex network, how do you begin simplifying?
The first option is to abstract, abstract again, and abstract some more. But before diving into deep abstraction, remember that abstraction is both a good and bad thing. Abstraction can reduce the amount of state in a network, and reduce the speed at which that state changes. Abstraction can cover a multitude of sins in the legacy part of the network, but abstractions also leak! In fact, all nontrivial abstractions leak. Following this logic through: all nontrivial abstractions leak; the more nontrivial the abstraction, the more it will leak; the more complexity an abstraction is covering, the less trivial the abstraction will be. Hence: the more complexity you are covering with an abstraction, the more it will leak.
Abstraction, then, is only one part of the solution. You must not only abstract, but you must also simplify the underlying bits of the system you are covering with the abstraction. This is a point we often miss.
Which returns us to our original question. The first answer to the question is this: minimize.
Minimize the number of technologies you are using. Of course, minimization is not so … simple … because it is a series of tradeoffs. You can minimize the number of protocols you are using to build the network, or you can minimize the number of things you are using each protocol for. This is why you layer things, which helps you understand how and where to modularize, focusing different components on different purposes, and then thinking about how those components interact. Ultimately, what you want is precisely the number of modules required to do the job to a specific level of efficiency, and not one module more (or less).
Minimize the kinds of “things” you are using. Try to use one data center topology, one campus topology, one regional topology, etc. Try to use one kind of device (whether virtual or physical) in each “role.” Try to reduce the number of “roles” in the network.
Think of everything, from protocols to “places,” as “modules,” and then try to reduce the number of modules. Modules should be chosen for repeatability, functional division, and optimal abstraction.
The second answer to the original question is: architecture should move slowly, components quickly.
The architecture is not the network, nor even the combination of all the modules.
Think of a building. Every building has bathrooms (I assume). All those bathrooms have sinks (I assume). The sinks need to fit the style of the building. The number of sinks needs to match the needs of the building overall. But the sinks can change rapidly, in response to the changing architecture of the building, while the building, its purpose, and its style change much more slowly. Architecture should change slowly, components more rapidly.
This is another reason to create modules: each module can change as needed, but the architecture of the overall system needs to change more slowly and intentionally. Thinking in systemic terms helps differentiate between the architecture and the components. Each component should fit within the overall architecture, and each component should play a role in shaping the architecture. Does the organization you support rely on deep internal communication across a wide geographic area? Or does it rely on lots of smaller external communications across a narrow geographic area? The style of communication in your organization makes a huge difference in the way the network is built, just like a school or hospital has different needs in terms of sinks than a shopping mall.
So these are, at least, two rules for simplification you can start thinking about how to apply in practical ways: modularize, choose modules carefully, reduce the number of kinds of modules, and think about what things need to change quickly and what things need to change slowly.
Throwing abstraction at the problem does not, ultimately, solve it. Abstraction must be combined with a lot of thinking about what you are abstracting and why.
Replace “software” with “network,” and think about it. How often do network engineers select the chassis-based system that promises to “never need to be replaced?” How often do we build networks like they will be “in use” 20+ years from now? Now it does happen from time to time; I have heard of devices with many years of uptime, for instance. I have worked on AT&T Brouters in production—essentially a Cisco AGS+ rebranded and resold by AT&T—that were some ten or fifteen years old even back when I worked on them. These things certainly happen, and sometimes they even happen for good reasons.
But knowing such things happen and planning for such things to happen are two different mindsets. At least some of the complexity in networks comes from just this sort of “must make it permanent” thinking:
Many developers like to write code which handles any problem which might appear at any point in the future. In that regard, they are fortune tellers, trying to find a solution for eventual problems. This can work out very well if their predictions are right. Most of the time, however, this flexibility only causes unneeded complexity in the code which gets in the way and does not actually solve any problems. This is not surprising as telling the future is a messy business.
Let’s refactor: many network engineers like to build networks that can handle any problem or application that might appear at any point in the future. I know I’m close to the truth, because I’ve been working on networks since the mid- to late-1980’s.
So now you are reading this and thinking: “but it is important to plan for the future.” You are not wrong—but there is a balance that often is not well thought out. You should not build for the immediate problem ignoring the future; down this path leads technical debt. You should not plan for the distant future, because this injects complexity that does not need to be there.
How do you find the balance? The place to begin is knowing how things work, rather than just how to make them work. If you know how and why things work, then you can see what things might last for a long time, and what might change quickly.
When you are designing a protocol, does it make sense to use TLVs rather than fixed length fields? Protocols last for 20+ years and are used across many different network devices. Protocols are often extended to solve new problems, rather than being replaced wholesale. Hence, it makes sense to use TLVs.
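As a quick illustration of why TLVs age well (the encoding below is invented for the sketch, not any particular protocol’s wire format): a decoder can skip over types it does not recognize, so the protocol can grow new fields years later without breaking old implementations.

```python
import struct

def encode_tlv(tlv_type, value):
    # Type (1 byte), Length (1 byte), then the value itself
    return struct.pack("!BB", tlv_type, len(value)) + value

def decode_tlvs(data, known_types):
    """Parse a buffer of TLVs, keeping known types and skipping unknown ones."""
    fields = {}
    offset = 0
    while offset < len(data):
        tlv_type, length = struct.unpack_from("!BB", data, offset)
        value = data[offset + 2 : offset + 2 + length]
        if tlv_type in known_types:
            fields[tlv_type] = value
        # unknown types are skipped, not fatal -- this is the extensibility win
        offset += 2 + length
    return fields
```

An old decoder that only knows type 1 still parses a message carrying a brand-new type 99; a fixed-length format would have forced a flag day instead.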
When you are designing a data center or campus network, does it make sense to purchase chassis boxes that are twice as large as you foresee needing over the next three years to future proof the design? Hardware changes are likely to make a device more than three years old easier to replace than upgrade—if you can even get the parts you need in three years. Hence, it makes more sense to plan for the immediate future and leave the crystal ball gazing to someone else.
If you haven’t found the tradeoffs, then you haven’t looked hard enough.
But to look hard enough, you need to go beyond the hype and “future proofing,” beyond how to make things work. You need to ask how and why things work the way they do so you know where to accept complexity to make the design more flexible, and where to limit complexity by planning for today.
A common complaint I hear among network engineers is that the lessons and techniques used by truly huge scale networks simply are not applicable to more “standard scale” networks. The key point, however, is balance—to look for the ideas and concepts that are interesting and at least somewhat novel, and then see how they might be applied to products and systems in all networks. Learning concepts can help you understand design patterns you might encounter almost anywhere. One recent paper, for instance, details Andromeda, a large scale networking system designed and operated by Google, one of the few truly huge networks in the world—
Andromeda is designed around a flexible hierarchy of flow processing paths. Flows are mapped to a programming path dynamically based on feature and performance requirements.
While the paper describes the general compute environment, and the forwarding process on individual nodes, the most interesting part from a network engineering perspective is hoverboard. While the concept behind hoverboard has been implemented in previous systems, it is usually hidden under the covers of a vertically integrated system, and therefore not normally something you see the inner workings of. To understand hoverboard, you have to begin with a little theory about the distribution and management of control plane data in a network.
- Splitting the control plane between reachability, topology, and policy enables some interesting new ways to think about scaling and complexity in network design
- The distribution of policy does not need to be static, or fixed, but rather can be distributed to different places at different times depending on need and efficiency
- The operation of large scale networks has much to teach us about the efficient operation of networks in general
The closer you can implement policy at the edge, the more efficient your use of network bandwidth is. The closer you implement your policy to the edge, however, the more distributed your policy is—and distributed policy tends to be difficult to maintain across time. This is a classic example of the state/optimization/surface triad (described in Navigating Network Complexity, for instance). The more state you add to the network in more places, the more optimal your use of resources will be.
There are a number of different solutions to this problem. One I have often advocated in the past is layering the control plane; one part of the control plane handles topology and reachability, while the other part handles policy implementation. The hoverboard idea is a variant of this idea of layering. In normal IP routing, all traffic from a host passes through a default gateway. Thus, the host only needs to know a minimal amount of control plane information. However, not knowing this information often precludes the host from being able to implement anything other than “blind policy.” Policy, then, is often implemented at the first hop in the network, the default gateway, which often becomes an appliance (like a firewall) to support the policy and forwarding load.
Hoverboard takes a slightly different path. The first-hop router (the default gateway) remains in place, and is the primary point of policy implementation in the network. However, the first-hop router has a back-channel to the host through a controller. The controller manages the policy at all the network edges (the policy overlay in the layered control plane idea above). When the traffic level for a flow reaches a specific level, the first-hop router signals the controller to move the policy from the first-hop to the host itself. Using an illustration from the paper itself—
This balances the distribution of policy against the efficiency of packet forwarding across time. Once a flow has become large enough, or has lasted long enough in time, policies related to that flow can be transferred from the network device to the host. This kind of coordination assumes a number of things, including: the ability to “see” the flows at both the host and the network (or deep telemetry); a layered control plane where reachability, topology, and policy are handled separately; and an overlaying controller that brokers policy onto the network and attached devices.
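A toy model of that offload decision (all names and the threshold below are invented for illustration; the real Andromeda machinery is far richer): the gateway forwards and counts, and once a flow crosses a size threshold the controller installs the policy on the host, so later packets bypass the gateway entirely.

```python
OFFLOAD_THRESHOLD_BYTES = 20_000  # hypothetical "elephant flow" cutoff

class Controller:
    """Brokers policy between the gateway and the hosts."""
    def __init__(self):
        self.host_rules = {}  # flows whose policy has been pushed down to the host

    def report(self, flow, total_bytes):
        # the gateway reports flow volume; heavy flows get their policy moved to the host
        if total_bytes >= OFFLOAD_THRESHOLD_BYTES and flow not in self.host_rules:
            self.host_rules[flow] = "forward-direct"

class Gateway:
    """First-hop router: the default policy point until a flow is offloaded."""
    def __init__(self, controller):
        self.controller = controller
        self.byte_counts = {}

    def forward(self, flow, nbytes):
        if flow in self.controller.host_rules:
            return "host"  # the host now applies policy itself; gateway is bypassed
        self.byte_counts[flow] = self.byte_counts.get(flow, 0) + nbytes
        self.controller.report(flow, self.byte_counts[flow])
        return "gateway"
```

Small flows never pay the cost of distributing policy; only the few large flows earn a host-local rule, which is exactly the balance described above.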
This kind of vertical integration is difficult to achieve in an environment built by multiple vendors. Some form of standard would be needed to carry the policy from one point to another, for instance—such as a well-designed set of policy descriptions built in a common modeling language… Perhaps something like YANG? 🙂
But while this kind of system would be difficult to deploy in an environment across multiple vendors, and without a single point of control from the systems and software side, this is the kind of thing that could make networks scale much better, be simpler to operate, and allow operators to manage complexity in a way that makes sense.
In a recent podcast, Ivan and Dinesh ask why there is a lot of interest in running link state protocols on data center fabrics. They begin with this point: if you have less than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.
- BGP is both simple and complex, depending on your perspective
- BGP is sometimes too much, and sometimes too little for data center fabrics
- We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer
Will these contentions stand up to a rigorous challenge?
I will begin with the last contention first—BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more challenging than BGP’s requirement, which is very distance-vector like; in BGP you only care if any pair of speakers have enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. It certainly seems, on this score, like BGP is simpler.
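A sketch of that distance-vector-like property (this toy function is mine, not actual BGP code): a BGP speaker needs no synchronized topology database to stay loop-free; it only refuses any route whose AS_PATH already contains its own AS number.

```python
def accept_route(as_path, local_asn):
    """BGP's core loop-prevention check: if our AS number already appears in
    the AS_PATH, the route has looped back to us and must be rejected."""
    return local_asn not in as_path
```

Contrast this with link state, where loop-freedom comes from every node computing shortest paths over an identical LSDB—which is exactly the synchronization cost BGP avoids.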
Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.
When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.
There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems, and the most common of those hard problems is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?
Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you, they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.
Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.
In the center of the illustration above there is a small grey circle representing the core features of BGP. The sections of the ten-sided figure around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see the one “slice,” the core functionality, and what we are building on top. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.
Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.
BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.
When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The eVPN part. The intra-AS parts. The inter-AS parts. All of it. The apparent complexity may appear to be low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is very complex. In fact, by any reasonable measure, BGP might have the simplest set of core functions, but it is the most complicated routing protocol in existence.
In other words, complexity is sometimes a matter of perspective. In this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as BGP.
Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?
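To make the WRED reference concrete, here is a minimal sketch of the classic RED drop curve it builds on (the thresholds and probability below are invented example values): by randomly dropping a growing fraction of packets as the average queue deepens, flows back off at different times instead of synchronizing.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Classic RED curve: no drops below min_th, a linear ramp up to max_p
    approaching max_th, and tail-drop behavior beyond max_th. WRED simply
    runs a separate curve like this per traffic class."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

The point of the curve is desynchronization: early, probabilistic drops spread TCP’s back-off across flows rather than letting a full queue knock them all back at once.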
Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.
Is BGP “good enough?” For some applications, it is. For others, however, it might not be.
You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: If you haven’t found the trade-offs, you haven’t looked hard enough.