Is BGP Good Enough?

In a recent podcast, Ivan and Dinesh ask why there is so much interest in running link state protocols on data center fabrics. They begin with this point: if you have fewer than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.


  • BGP is both simple and complex, depending on your perspective
  • BGP is sometimes too much, and sometimes too little for data center fabrics
  • We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer

Will these contentions stand up to a rigorous challenge?

I will take the last contention first: the claim that BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and of a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more demanding than BGP’s requirement, which is distance-vector-like: BGP only cares whether each pair of speakers has enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. On this score, BGP certainly seems simpler.

Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.

When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.

There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems. The most common hard problem complexity is used to solve is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?

Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you, they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.
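The path-hunting behavior behind this slow convergence is easier to see with a toy model. The sketch below is a simplification with hypothetical node names, not any real BGP implementation: it runs BGP-style best-path selection with AS-path loop prevention on a two-spine, two-ToR fabric, converges, then withdraws the prefix at the originating ToR and counts the updates processed while the remaining speakers hunt through ever-longer stale alternate paths.

```python
from collections import deque

# Toy fabric: two ToRs, two spines, every ToR peers with every spine.
peers = {
    "tor1": ["s1", "s2"], "tor2": ["s1", "s2"],
    "s1": ["tor1", "tor2"], "s2": ["tor1", "tor2"],
}
rib = {n: {} for n in peers}  # node -> {learned-from peer: AS path}
msgs = deque()                # (destination, sender, path or None for withdraw)

def best(node):
    return min(rib[node].values(), key=len) if rib[node] else None

def announce(node):
    b = best(node)
    for p in peers[node]:
        msgs.append((p, node, None if b is None else [node] + b))

def run():
    processed = 0
    while msgs:
        dst, src, path = msgs.popleft()
        processed += 1
        old = best(dst)
        if path is None or dst in path:
            # Explicit withdraw, or a loop-prevention reject (which also
            # implicitly withdraws whatever this peer advertised before).
            rib[dst].pop(src, None)
        else:
            rib[dst][src] = path
        if best(dst) != old:
            announce(dst)
    return processed

# tor1 originates the prefix; let the fabric converge.
rib["tor1"]["local"] = []
announce("tor1")
startup = run()

# Now move the prefix away: tor1 withdraws it.
del rib["tor1"]["local"]
announce("tor1")
withdrawal = run()

print(startup, withdrawal)  # withdrawing costs more messages than converging
```

Even on this tiny fabric, removing the prefix generates twice as many update messages as installing it, because each speaker falls back to a stale path learned from a neighbor and re-advertises it before everything finally times out. A link state protocol floods one updated LSP and lets every node run SPF once.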

Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.

Picture a ten-sided figure with a small grey circle at its center. The circle represents the core features of BGP; the sections around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see only one “slice”: the core functionality plus what we are building on top of it. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.

Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.

BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.

When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The EVPN part. The intra-AS parts. The inter-AS parts. All of it. The apparent complexity may seem low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is very high. In fact, by any reasonable measure, BGP might have the simplest set of core functions, but it is the most complicated routing protocol in existence.

In other words, complexity is sometimes a matter of perspective. In this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as BGP.

Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but which exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?
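For readers who have not worked with it: RED computes a drop probability that ramps up as the average queue depth grows, and WRED simply keeps separate thresholds per traffic class, so control plane traffic such as BGP can be protected while application traffic is dropped earlier. A minimal sketch follows; the threshold numbers are made-up illustrations, not recommendations.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Classic RED: no drops below min_th, forced drop at or above max_th,
    and a linear ramp up to max_p in between."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# WRED = per-class thresholds: drop application traffic earlier than BGP.
profiles = {
    "bgp":  (40, 60, 0.05),   # hypothetical: start dropping late, gently
    "bulk": (10, 30, 0.20),   # hypothetical: start dropping early
}

for cls, (lo, hi, p) in profiles.items():
    print(cls, [round(red_drop_probability(q, lo, hi, p), 3)
                for q in (5, 20, 50)])
```

At an average depth of 50 packets, the hypothetical "bulk" class is in forced-drop territory while "bgp" is barely being trimmed; that is the mechanism, but the point in the text stands: it would be simpler not to have the control plane competing with applications in the first place.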

Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.

Is BGP “good enough?” For some applications, it is. For others, however, it might not be.

You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: if you haven’t found the trade-offs, you haven’t looked hard enough.

Reaction: Network software quality

Over at IT ProPortal, Dr Greg Law has an article up chiding the networking world for its poor software quality. To wit—

When networking companies ship equipment out containing critical bugs, providing remediation in response to their discovery can be almost impossible. Their engineers back at base often lack the data they need to reproduce the issue as it’s usually kept by clients on premise. An inability to cure a product defect could result in the failure of a product line, temporary or permanent withdrawal of a product, lost customers, reputational damage, and product reengineering expenses, any of which could have a material impact on revenue, margins, and net income.

Let me begin here: Dr. Law, you are correct—we have a problem with software quality. I think the problem is a bit larger than just the networking world—for instance, my family just purchased two new vehicles, a Volvo and a Fiat. Both have Android systems in the center screen. And neither will connect correctly with our Android based phones. It probably isn’t mission critical, like it could be for a network, but it is annoying.

But even given software quality is a widespread issue in our world, it is still true that networks are something of a special case. While networks are often just hidden behind the plug, they play a much larger role in the way the world works than most people realize. Much like the train system at the turn of the century, and the mixed mode transportation systems that enable us to put dinner on the table every night, the network carries most of what really matters in the virtual world, from our money to our medical records.

Given the assessment is correct—and I think it is—what is the answer?

One answer is to simply do better. To fuss at the vendors, and the open source projects, and to make the quality better. The beatings, as they say, will continue until morale improves. If anyone out there thinks this will really work, raise your hands. No, higher. I can’t see you. Or maybe no-one has their hands raised.

What, then, is the solution? I think Dr. Law actually gets at the corner of what the solution needs to be in this line—

The complexity of the network stack though, is higher than ever. An increased number of protocols leads to a more complex architecture, which in turn severely impacts operational efficiency of networks.

For a short review, remember that complexity is required to solve hard problems. Specifically, the one hard problem complexity is designed to solve is environmental uncertainty. Because of this, we are not going to get rid of complexity any time soon. There are too many old applications, and too many old appliances, that no-one is willing to let go of. There are too many vendors trying to keep people within their ecosystems, and too many resulting one-off connectors bridging the gaps, that will never be replaced. Complexity isn’t really going to be dramatically reduced until we bite the bullet and take these kinds of organizational and people problems on head on.

In the meantime, what can we do?

Design simpler. Stop stacking tons of layers. Focus on solving problems, rather than deploying technologies. Stop being afraid to rip things out.

If you have read my work in the past, for instance Navigating Network Complexity, or Computer Networking Problems and Solutions, or even The Art of Network Architecture, you know the drill.

We can all cast blame at the vendors, but part of this is on us as network engineers. If you want better quality in your network, the best place to start is with the network you are working on right now, the people who are designing and deploying that network, and the people who make the business decisions.

On the ‘net: The Tradeoffs of Information Hiding

I recently joined Ethan Banks for a Packet Pushers episode around the trade offs of hiding information in the control plane. This was a terrific show; you can listen to it by clicking on the link below.

Today on the Priority Queue, we’re gonna hide some information. Oh, like route summarization? Sure, like route summarization. That’s an example of information hiding. But there’s much more to the story than that. Our guest is Russ White. Russ is a serial networking book author, network architect, RFC writer, patent holder, technical instructor, and much of the motive force behind the early iterations of the CCDE program.

The Universal Fat Tree

Have you ever wondered why spine-and-leaf networks are the “standard” for data center networks? While the answer has a lot to do with trial and error, it turns out there is also a mathematical reason the fat-tree spine-and-leaf is used almost universally. There is often some mathematical reason for the decisions made in engineering, although we rarely explore those reasons. If it seems to work, there is probably a reason.

The fat-tree design is explored in a paper published in 2015 (available here at the ACM, and now added to my “classic papers” page so there is a local copy as well), using a novel technique to not only explore why the spine-and-leaf fat-tree is so flexible, but even what the ideal ratio of network capacity is at each stage. The idea begins with this basic concept: one kind of network topology can be emulated on top of another physical topology. For instance, you can emulate a toroid topology on top of a hierarchical network, or a spine-and-leaf on top of a hypercube, etc. To use terms engineers are familiar with in a slightly different way, let’s call the physical topology the underlay, and the emulated topology the overlay. This is not too far off, as emulated topologies do need to be a virtual construction on top of the underlay, which means… tunnels, of course.

In emulating one kind of topology over another kind of topology, however, you lose some performance. This performance loss factor is quite important, in that it tells you whether or not it is worth building your actual network using the design of the underlay, or using the overlay. If, for instance, you emulate a toroid design on top of a folded Clos, and you discover the performance is half as good as a real, physical toroid, then you might be better off building a toroid rather than a folded Clos.

This might be difficult to grasp at first, so let’s use a specific example. The only topologies most engineers are familiar with are rings, hierarchical networks, partial mesh, full mesh, and some form of spine-and-leaf. Consider the case of a full mesh network versus a ring topology. Suppose you have some application that, through careful testing, you have determined best runs on a full mesh network. Now when you go to build your physical network, you find that you have the choice between building a real full mesh network, which will be very expensive, or a ring network, which will be very cheap. How much performance will you lose in running this application on the ring topology?

The way to discover the answer is to build a virtual full mesh overlay onto the ring topology. Once you do this, it turns out there are ways to determine how much performance loss you will encounter using this combination of underlay and overlay. Now, take this one step farther—what if you decide not to build the overlay, but simply run the application directly on the ring topology? Minus the tunnel encapsulation and de-encapsulation, the resulting loss in performance should be roughly the same.
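The performance loss in the full-mesh-on-a-ring example can actually be computed. The sketch below (a simplification of the idea, not the paper’s method) places one unit of traffic between every pair of nodes on an n-node ring and routes each demand along the shorter arc; on a physical full mesh every demand would ride its own link with a load of one, so the worst ring link load is roughly the slowdown factor you accept by building the ring instead.

```python
def ring_link_loads(n):
    """Embed full-mesh demands (1 unit per node pair) onto an n-node ring.
    Edge i connects node i and node (i + 1) % n."""
    load = {i: 0 for i in range(n)}
    for s in range(n):
        for t in range(s + 1, n):
            cw = (t - s) % n
            if cw <= n - cw:            # clockwise arc is shorter (or tied)
                hops = range(s, s + cw)
            else:                       # otherwise walk the arc from t around to s
                hops = range(t, t + n - cw)
            for k in hops:
                load[k % n] += 1
    return load

loads = ring_link_loads(8)
print(max(loads.values()))  # worst link carries several demands at once
```

On an 8-node ring the average link carries 8 units of demand versus 1 on a true full mesh, and the load grows quadratically with n, which is exactly the kind of underlay-versus-overlay penalty the paper formalizes.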

Given this, you should be able to calculate the performance loss from emulating every kind of topology on top of every other kind of topology. From here, you can determine if there is one topology that can be used as an underlay for every other kind of topology with a minimal amount of performance loss. It turns out there is: the fat-tree spine-and-leaf topology. So an application that might best work on a full mesh network will run on a fat-tree spine-and-leaf topology with minimal performance loss. The same can be said of an application that runs best on a ring topology, or a hierarchical topology, or a partial mesh, or a… Thus, the fat-tree spine-and-leaf can be called a universal topology.

After coming to this point, the authors wonder if it is possible to determine how the bandwidth should be apportioned in a fat-tree spine-and-leaf to produce optimal emulation results across every possible overlay. This would give designers the ability to understand how to achieve the most optimal universal application performance on a single physical topology. Again, there turns out to be an answer to this question: doubling the bandwidth at each stage in the fabric is the best way to create an underlay that will emulate every other kind of overlay with the minimal amount of performance loss. The authors come to this conclusion by comparing a data center network to an on-chip connection network, which has a provable “best” design based on the physical layout of the wires. Translating the physical space into a monetary cost produces the same result as above: doubling the bandwidth at each stage creates the least expensive network with the most capabilities.

The result, then, is this: the fat-tree spine-and-leaf design is the most effective design for most applications, a universal topology. Further, building such a design where there are two neighbors down for every neighbor up, and doubling the bandwidth at each stage, is the most efficient use of resources in building such a network.

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. It just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500s with SSEs, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

  • EIGRP receives a set of queries to which it must respond
  • EIGRP allocates memory for each packet to build the responses
  • Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
  • EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
  • EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.
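The memory half of this loop is simple enough to model. In the sketch below (a toy model, not IOS internals), two processes allocate packet-sized buffers in strict alternation; when the second process later frees everything it owns, half the heap is free, yet no two free blocks are adjacent, so a request for anything larger than a single block still fails.

```python
BLOCKS = 1000

# Two processes allocate packet-sized blocks in strict alternation.
heap = ["eigrp" if i % 2 == 0 else "other" for i in range(BLOCKS)]

# The other process frees all of its buffers; EIGRP's packets remain.
free = [owner == "other" for owner in heap]

total_free = sum(free)

# Largest contiguous run of free blocks: the biggest packet we can build.
largest = run = 0
for is_free in free:
    run = run + 1 if is_free else 0
    largest = max(largest, run)

print(total_free, largest)  # half the heap is free, one block at a time
```

Half the heap is free, but the largest contiguous block is exactly one packet, which is why the show tech output showed plenty of free memory alongside a tiny largest block, and why the reply-building worker kept requeuing itself instead of answering queries.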

There are two basic problems with positive feedback loops: they are almost impossible to anticipate, and almost impossible to see while they are happening. The interaction surfaces between two systems just have to be deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some of the time), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIAs in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Giving the Monkey a Smaller Club

Over at the ACM blog, there is a terrific article about software design that has direct application to network design and architecture.

The problem is that once you give a monkey a club, he is going to hit you with it if you try to take it away from him.

What do monkeys and clubs have to do with software or network design? The primary point of interaction is security. The club you hand your network operators to make their lives easier is also a club an attacker can use to break into your network, or to damage its operation. Clubs are just that way. If you think of your collection of tools not just as tools, but also as an attack surface, you can immediately see the correlation between the available tools and the attack surface. One way to increase security is to reduce the attack surface, and one way to reduce the attack surface is to reduce the number of tools—or the size of the club.

The best way to reduce the attack surface of a piece of software is to remove any unnecessary code.

Consider this: the components of any network are actually made up of code. So to translate this to the network engineering world, you can say:

The best way to reduce the attack surface of a network is to remove any unnecessary components.

What kinds of components? Routing protocols, transport protocols, and quality of service mechanisms come immediately to mind, but the number and kind of overlays, the number and kind of virtual networks might be further examples.

There is another issue here that is not security related specifically, but rather resilience related. When you think about network failures, you probably think of bugs in the code, failed connectors, failed hardware, and other such causes. The reality is far different, however—the primary cause of network failures in real life is probably user error in the form of misconfiguration (or misconfiguration spread across a thousand routers through the wonders of DevOps!). The Mean Time Between Mistakes (MTBM) is a much larger deal than most realize. Giving the operator too many knobs to solve a single problem is the equivalent of giving the monkey a club.

Simplicity in network design has many advantages—including giving the monkey a smaller club.

Learning to Ask Questions

One thing I’m often asked in email and in person is: why should I bother learning theory? After all, you don’t install SPF in your network; you install a router or switch, which you then configure OSPF or IS-IS on. The SPF algorithm is not exposed to the user, and does not seem to really have any impact on the operation of the network. Such internal functionality might be neat to know, but ultimately–who cares? Maybe it will be useful in some projected troubleshooting situation, but the key to effective troubleshooting is understanding the output of the device, rather than understanding what the device is doing.

In other words, there is no reason to treat network devices as anything more than black boxes. You put some stuff in, other stuff comes out, and the vendor takes care of everything in the middle. I dealt with a related line of thinking in this video, but what about this black box argument? Do network engineers really need to know what goes on inside the vendor’s black box?

Let me answer this question with another question. When you shift to a new piece of hardware, how do you know what you are trying to configure? Suppose, for instance, that you need to set up BGP route reflectors on a new device, and need to make certain optimal paths are taken from eBGP edge to eBGP edge. What configuration commands would you look for? If you knew BGP as a protocol, you might be able to find the right set of commands without needing to search the documentation, or do an internet search. Knowing how it works can often lead you to knowing where to look and what the commands might be. This can save a tremendous amount of time.

Back up from configuration to making equipment purchasing decisions, or specifying equipment. Again, rather than searching the documentation randomly, if you know what protocol level feature you need the software to implement, you can search for the specific support you are looking for, and know what questions to ask about the possible limitations.

And again, from a more architectural perspective–how do you know what protocol to specify to solve any particular problem if you don’t understand how the protocols actually work?

So from configuration to architecture, knowing how a protocol works can actually help you work faster and smarter by helping you ask the right questions. Just another reason to actually learn the way protocols work, rather than just how to configure them.

Applying Software Agility to Network Design

The paper we are looking at in this post is tangential to the world of network engineering, rather than being directly targeted at network engineering. The thesis of On Understanding Software Agility—A Social Complexity Point of View is that at least some elements of software development are a wicked problem, and hence need to be managed as a complex system. The paper sets the following criteria for complexity—

  • Interaction: made up of a lot of interacting systems
  • Autonomy: subsystems are largely autonomous within specified bounds
  • Emergence: global behavior is unpredictable, but can be explained in subsystem interactions
  • Lack of equilibrium: events prevent the system from reaching a state of equilibrium
  • Nonlinearity: small events cause large output changes
  • Self-organization: self-organizing response to disruptive events
  • Co-evolution: the system and its environment adapt to one another

It’s pretty clear network design and operation would fit into the 7 points made above; the control plane, transport protocols, the physical layer, hardware, and software are all subsystems of an overall system. Between these subsystems, there is clearly interaction, and each subsystem acts autonomously within bounds. The result is a set of systemic behaviors that cannot be predicted from examining the system itself. The network design process is, itself, also a complex system, just as software development is.

Trying to establish computing as an engineering discipline led people to believe that managing projects in computing is also an engineering discipline. Engineering is for the most part based on Newtonian mechanics and physics, especially in terms of causality. Events can be predicted, responses can be known in advance, and planning and optimizing for certain attributes are possible from the outset. Effectively, this reductionist approach assumes that the whole is the sum of the parts, and that parts can be replaced wherever and whenever necessary to address problems. This machine paradigm necessitates planning everything in advance because the machine does not think. This approach is fundamentally incapable of dealing with the complexity and change that actually happens in projects.

In a network, some simple input into one subsystem can cause major changes in the overall system state. How should engineers deal with this situation? One solution is to try to nail each system down more precisely, such as building a “single source of truth” and imposing that single view of the world onto the network; the theory is that controlling change will ultimately remove the lack of equilibrium, and hence reduce complexity. Or perhaps we can centralize the control plane, moving all the complexity into a single point in the network, making it manageable. Or maybe we can automate all the complexity out of the network, feeding the network “intent,” and having, as a result, a nice clean network experience.

Good luck slaying the complexity dragon in any of these ways. What, then, is the solution? According to Pelrine, the right solution is to replace our Newtonian view of software development with a different model. To make this shift, the author suggests moving to the complex quadrant of the Cynefin framework.

The correct way to manage software development, according to Pelrine, is to use a probe/sense/respond model. You probe the software through testing iteratively as you build it, sensing the result, and then responding to the result through modification, etc.

Application to network design

The same process is actually used in the design of networks, once beyond the greenfield stage or initial deployment. Over time, different configurations are tested to solve specific problems, iteratively solving problems while also accruing complexity and ossifying. The problem with network designs is much the same as it is with software projects—the resulting network is never “torn down” and rethought from the ground up. The result is that networks become more complex over time, until they fail, or are replaced because the business fails, or some other major event occurs. There needs to be some way to combat this—but how?

The key counter to this process is modularization. By modularizing, which means containing complexity within bounds, you can build using the probe/sense/respond model. There will still be changes in the interaction between the modules within the system, but so long as the “edges” are held constant, the complexity can be split into two domains: within the module and outside the module. Hence the rule of thumb: separate complexity from complexity.

The traditional hierarchical model, of course, provides one kind of separation by splitting control plane state into smaller chunks, potentially at the cost of optimal traffic flow (remember the state/optimization/surface trilemma here, as it describes the trade-offs). Another form of separation is virtualization, with the attendant costs of interaction surfaces and optimization. A third sort of separation is to split policy from reachability, which attempts to preserve optimization at the cost of interaction surfaces. Disaggregation provides yet another useful separation between complex systems, separating software from hardware, and (potentially) the control plane from the network operating system, and even the monitoring software from the network operating system.

These types of modularization can be used together, of course; topological regions of the network can be separated via control plane state choke points, while either virtualization or splitting policy from reachability can be used to separate complexity from complexity within a module. The number and kinds of separation deployed depend entirely on specific requirements, as well as on the complexity of the overall network. The more complex the overall network, the more kinds of separation should be deployed to contain complexity in manageable chunks where possible.
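The most common way to build a control plane state choke point at a module edge is summarization. A minimal sketch, with made-up prefixes and a made-up `edge_summary` helper, might look like this:

```python
# Minimal sketch of state reduction at a module edge: advertise one
# aggregate out of the module rather than every internal prefix.
# The prefixes and the edge_summary helper are hypothetical.
import ipaddress

def edge_summary(internal_prefixes, summary):
    agg = ipaddress.ip_network(summary)
    # Sanity check: the aggregate must actually cover the internal state
    assert all(ipaddress.ip_network(p).subnet_of(agg) for p in internal_prefixes)
    return [agg]

pod_prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
advertised = edge_summary(pod_prefixes, "10.1.0.0/22")
print(len(pod_prefixes), "internal prefixes ->", len(advertised), "advertised")
```

Everything outside the module carries one route instead of four; the cost, per the state/optimization/surface trilemma, is that the outside world can no longer see (or route around) anything happening inside the aggregate.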

Each module, then, can be replaced with a new one, so long as it provides the same set of services, and any changes in the edge are manageable. Each module can be developed iteratively, by making changes (probing), sensing (measuring the result), and then adjusting the module according to whether or not it fits the requirements. This part would involve using creative destruction (the chaos monkey) as a form of probing, to see how the module and system react to controlled failures.
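To make the probe/sense/respond loop concrete, here is a toy sketch—the topology and names are invented—that probes a module by failing one link at a time, senses whether the module still delivers its service (connectivity between its edges), and flags anything that would need a respond step:

```python
# Toy probe/sense/respond loop: inject controlled single-link failures
# and check whether the module still provides edge-to-edge reachability.
# The leaf/spine topology below is hypothetical.
def connected(links, a, b):
    """Breadth-first search: is b reachable from a over these links?"""
    seen, frontier = {a}, [a]
    while frontier:
        n = frontier.pop()
        for x, y in links:
            for nxt in ((y,) if x == n else (x,) if y == n else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return b in seen

links = {("leaf1", "spine1"), ("leaf1", "spine2"),
         ("leaf2", "spine1"), ("leaf2", "spine2")}

# probe: fail each link in turn; sense: recheck the service
spofs = [f for f in links if not connected(links - {f}, "leaf1", "leaf2")]
print("single points of failure:", spofs)
```

In a real network the probe would be a controlled failure injected by something like a chaos monkey, and the sensing would be telemetry rather than an in-memory reachability check—but the loop is the same.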

Nice Theory, but So What?

This might all seem theoretical, but it is actually extremely practical. Getting out of the traditional model of network design (where the configuration is fixed, there is a single source of truth for the entire network, the control plane is tied to the software, the software is tied to the hardware, and policy is tied to the control plane) can open up new ways to build massive networks against very complex requirements while managing the complexity and the development and deployment processes. The shift is from a mindset of controlling complexity by nailing everything down to a single state to a mindset of managing complexity by finding logical separation points, building in blocks, and then “growing” each module using the appropriate process, whether iterative or waterfall.

Even if scale is not the goal of your network—you “only” have a couple of hundred network devices, say—these principles can still be applied. First, complexity is not really about scale; it is about requirements. A car is not really any less complex than a large truck, and a motor home (or camper) is likely more complex than either. The differences are not in scale, but in requirements. Second, these principles still apply to smaller networks; the primary question is which forms of separation to deploy, rather than whether complexity needs to be separated from complexity.

Moving to this kind of design model could revolutionize the thinking of the network engineering world.

If you haven’t found the tradeoff…

This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:

Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)

In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.

Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.

Cross posted at CircleID

But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs to consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.
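One way to start looking for the tradeoffs is simply to do the arithmetic. The numbers below are entirely hypothetical, but they show the shape of the chassis-versus-scale-out comparison: raw port count against the size of the failure domain:

```python
# Back-of-the-envelope comparison with entirely hypothetical numbers:
# one chassis vs. a small fabric of fixed-configuration boxes.
chassis_ports = 8 * 48                 # 8 line cards, 48 ports each
chassis_blast_radius = chassis_ports   # a supervisor failure can take it all

leafs, leaf_ports, uplinks = 8, 48, 2  # two spines, one uplink to each
fabric_edge_ports = leafs * (leaf_ports - uplinks)
fabric_blast_radius = leaf_ports - uplinks  # one leaf failing

print("ports:", chassis_ports, "vs", fabric_edge_ports)
print("blast radius:", chassis_blast_radius, "vs", fabric_blast_radius)
```

With these (made-up) numbers, the fixed-configuration fabric gives up a few edge ports to uplinks, but a single device failure takes out 46 ports instead of potentially all 384—a tradeoff between port density and failure domain size that the “chassis is always better” assumption hides.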

What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this, right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.

Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.

What about routing protocols? The standards communities seem focused on creating and maintaining a small handful of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a simple idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.

In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.

The way to prevent landing on either extreme—a world where every device is capable of anything but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment and no other device will do—is to consider the tradeoffs.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.

What Kind of Design?

In this short video I work through two kinds of design, or two different ways of designing a network. Which kind of designer are you? Do you see one as better than the other? Which would you prefer to do, and which are you doing right now?

Complexity and the Thin Waist

In recent years, we have become accustomed to—and often accosted by—the phrase software eats the world. It has become a mantra in the networking world that software defined is the future, full stop. This research paper by Microsoft, however, tells a different story. According to Baumann, hardware is the new software. Or, to put it differently, even as software eats the world, hardware is taking over an ever increasing amount of the functionality software has been doing. In making this point, the paper also highlights the complexity problems involved in dissolving the thin waist of an architecture.

The specific example used in the paper is the Intel x86 Instruction Set Architecture (ISA). Many years ago, when I was a “youngster” in the information technology field, there were a number of different processor platforms; the processor wars raged in full. There were, primarily, the x86 platform, by Intel, beginning with the 8086, and its subsequent generations, the 8088, 80286, 80386, then the Pentium, etc. On the other side of the world, there were the RISC based processors, the kind stuffed into Apple products, Cisco routers, and Sun Sparc workstations (like the one that I used daily while in Cisco TAC). The argument between these two came down to this: the Intel x86 ISA was messy, somewhat ad-hoc, difficult to work with, sometimes buggy, and provided a lot of functionality at the chip level. The RISC based processors, on the other hand, had a much simpler ISA, leaving more to software.

Twenty years on, and this argument is no longer even an argument. The x86 ISA won hands down. I still remember when the Apple and Cisco product lines moved from RISC based processors to x86 platforms, and when AMD first started making x86 “lookalikes” that could run software designed for an Intel processor.

But remember this—The x86 ISA has always been about the processor taking in more work over time. Baumann’s paper is, in fact, an overview of this process, showing the amount of work the processor has been taking on over time. The example he uses is the twelve new instructions added to the x86 ISA by Intel in 2015–2016 around software security. Twelve new instructions might not sound like a lot, but these particular instructions necessitate the creation of entirely new registers in the CPU, a changed memory page table format, new stack structures (including a new shadow stack that sounds rather complex), new exception handling processes (for interrupts), etc.

The primary point of much of this activity is to make it possible for developers to stop trusting software for specific slices of security, and start trusting hardware, which cannot (in theory) be altered. Of course, so long as the hardware relies on microcode and has some sort of update process, there is always the possibility of an attack on the software embedded in the hardware, but we will leave this problematic point aside for the moment.

Why is Intel adding more to the hardware? According to Baumann—

The slowing pace of Moore’s Law will make it harder to sell CPUs: absent improvements in microarchitecture, they won’t be substantially faster, nor substantially more power efficient, and they will have about the same number of cores at the same price point as prior CPUs. Why would anyone buy a new CPU? One reason to which Intel appears to be turning is features: if the new CPU implements an important ISA extension—say, one required by software because it is essential to security—consumers will have a strong reason to upgrade.

If we see the software market as composed of layers, with applications on top, and the actual hardware on the bottom, we can see the ISA is part of the thin waist in software development. Baumann says: “As the most stable ‘thin waist’ interface in today’s commodity technology stack, the x86 ISA sits at a critical point for many systems.”

The thin waist is very important in the larger effort to combat complexity. Having the simplest thin waist available is one of the most important things you can do in any system to reduce complexity. What is happening here is that the industry went from having many different ISA options to having one dominant ISA. A thin waist, however, always implies commoditization, which in turn means falling profits. So if you are the keeper of the thin waist, you either learn to live with less profit, you branch out into other areas, or you learn to broaden the waist. Intel is doing all of these things, but broadening the waist is definitely on the table.

If the point about complexity is true, then broadening the waist should imply an increase in complexity. Baumann points this out specifically. In the section on implications, he notes a number of things that relate to complexity, such as increased interactions (surfaces!), security issues, and slower feature cycles (state!). Sure enough, broadening the thin waist always increases complexity, most of the time in unexpected ways.

So—what is the point of this little trip down memory lane (or the memory hole, perhaps)? This all directly applies to network and protocol design. The end-to-end principle is one of the most time tested ideas in protocol design. Essentially, the network itself should be a thin waist in the application flow, providing minimal services and changing little. Within the network, there should be a set of thin waists where modules meet; choke points where policy can be implemented, state can be reduced, and interaction surfaces can be clearly controlled.

We have, alas, abandoned all of this in our drive to make the network the ne plus ultra of all things. Want a layer two overlay? No problem. Want to stretch the layer two overlay between continents? Well, we can do that too. Want it all to converge really fast, and have high reliability? We can shove ISSU at the problem, and give you no downtime, no dropped packets, ever.

As a result, our networks are complex. In fact, they are too complex. And now we are talking about pushing AI at the problem. The reality is, however, we would be much better off if we actually carefully considered each new thing we throw into the mix of the thin waist of the network itself, and of our individual networks.

The thin waist is always tempting to broaden out—it is so simple, and it seems so easy to add functionality to the thin waist to get where we want to go. The result is never what we think it will be, though. The power of unintended consequences will catch up with us at some point or another.

Like… right about the time you decide to try to use asymmetric traffic flows across a network device that keeps state.
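A toy model shows why that particular combination hurts. The firewall names and addresses here are invented, and a real stateful device is obviously more involved than a set of tuples, but the failure mode is the same:

```python
# Toy model of why asymmetric flows break through a stateful device:
# the return packet arrives at a firewall that never saw the outbound
# packet, finds no session entry, and drops it. All names hypothetical.
class StatefulFirewall:
    def __init__(self, name):
        self.name, self.sessions = name, set()

    def forward(self, src, dst):
        self.sessions.add((src, dst))   # outbound packet installs state
        return True

    def reverse(self, src, dst):
        # Return traffic passes only if the forward flow built state here
        return (dst, src) in self.sessions

fw_a, fw_b = StatefulFirewall("fw-a"), StatefulFirewall("fw-b")

fw_a.forward("10.0.0.1", "192.0.2.9")         # outbound takes path A
print(fw_a.reverse("192.0.2.9", "10.0.0.1"))  # symmetric return: passes
print(fw_b.reverse("192.0.2.9", "10.0.0.1"))  # asymmetric return: dropped
```

The outbound packet builds state only on the device it actually transits; when the return path crosses a different stateful device, there is no matching session, and the traffic is silently dropped—state added to the thin waist of the network, with exactly the kind of unintended consequence described above.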

Make the waist thin, and the complexity becomes controllable. This is a principle we have yet to learn and apply in our network and protocol designs.