When you are building a data center fabric, should you run a control plane all the way to the host? This is question I encounter more often as operators deploy eVPN-based spine-and-leaf fabrics in their data centers (for those who are actually deploying scale-out spine-and-leaf—I see a lot of people deploying hybrid sorts of networks designed as “mini-hierarchical” designs and just calling them spine-and-leaf fabrics, but this is probably a topic for another day). Three reasons are generally given for deploying the control plane all on the hosts attached to the fabric: faster down detection, load sharing, and traffic engineering. Let’s consider each of these in turn.
Faster Down Detection. There’s no simple way for ToR switches to determine when the connection to a host has failed, whether the host is single or dual-homed. Somehow the set of routes reachable through the host must be related to the interface state, or some underlying fast hello state (such as BFD), so that if a link fails the ToR knows to pull the correct set of routes from the routing table. It’s simpler to just let the host itself advertise the correct reachability information; when the link fails, the routing session will fail, and the correct routes will automatically be withdrawn.
Load Sharing. While this only applies to hosts with two connections into the fabric (dual-homed hosts), this is still an important use case. If a dual-homed host only has two default routes to work from, the host is blind to network conditions, and can only load share equally across the available paths. Equal load sharing, however, may not be ideal in all situations. If the host is running routing, it is possible to inject more intelligence into the load sharing between the upstream links.
Traffic Engineering. Or traffic shaping, steering, etc. In some cases, traffic engineering requires injecting a label or outer header onto the packet as it enters the fabric. In others, more specific routes might be sent along one path and not another to draw specific kinds of traffic through a more optimal route in the fabric. This kind of traffic engineering is only possible if the control plane is running on the host.
All these reasons are well and good, but they all assume something that should be of great interest to the network designer: which control plane are we talking about?
Most DC fabric designs I see today assume there is a single control plane running on the fabric—generally this single control plane is BGP, and it’s being used both to provide basic IP connectivity through the fabric (the infrastructure underlay control plane) and to provide tunneled overlay reachability (the infrastructure overlay control plane—generally eVPN).
This entangling of the infrastructure underlay and overlay has always seemed, to me, to be less than ideal. When I worked on large-scale transit provider networks in my more youthful days, we intentionally designed networks that separated customer routes from infrastructure routes. This created two separate failure and security domains in the network, as well as dividing the telemetry data in ways that allowed faster troubleshooting of common problems.
The same principles should apply in a DC fabric—after all, the workloads are essentially customers of the fabric, while the basic underlay connectivity counts as infrastructure. The simplest way to adopt this sort of division of labor is the same way large-scale transit providers did (and do)—use two different routing protocols for the underlay and overlay. For instance, IS-IS or RIFT for the underlay and eVPN using BGP for the overlay.
If you move to two layers of control plane, the question above becomes a bit more nuanced—should the overlay control plane run on the hosts? Should the underlay control plane run on the hosts?
For faster down detection—for those hosts that need faster down detection, BFD tied to IGP neighbor state can remove the correct nexthop from the local routing table at a ToR, causing the correct reachable destinations to be withdrawn. Alternatively, the host can run an instance of the overlay control plane, which allows it to advertise and withdraw “customer routes” directly. In neither case is the underlay control plane required to run on the host.
For load sharing and traffic engineering—if something like SRm6, or even other more traditional forms of traffic engineering, the information needed will be carried in the overlay rather than the underlay—so the underlay routing protocol does not need to run on the host.
On the other side of the coin, not running the underlay protocol on the host can help the overall network security posture. Assume a public facing host connected to the fabric is somehow pwned… If the host is running the underlay protocol, its pretty simple to DoS the entire fabric to take it down, or to inject incorrect routing information. If the overlay is configured correctly, however, only the virtual topology which the host has access to can be impacted by an attack—and if microsegmentation is deployed, that damage can be minimized as well.
From a complexity perspective, running the underlay control plane on the host dramatically increases the amount of state the host must maintain; there is no effective filter you can run to reduce state on the host without destroying some of the advantages gained by running the underlay control plane there. On the other hand, the ToR can be configured to filter routing information the host receives, controlling the amount of state the host needs to manage.
Control plane on the host or not? This is one of those questions where properly modularized and layered network design can make a big difference in what the right answer should be.
Many years ago I attended a presentation by Dave Meyers on network complexity—which set off an entire line of thinking about how we build networks that are just too complex. While it might be interesting to dive into our motivations for building networks that are just too complex, I starting thinking about how to classify and understand the complexity I was seeing in all the networks I touched. Of course, my primary interest is in how to build networks that are less complex, rather than just understanding complexity…
This led me to do a lot of reading, write some drafts, and then write a book. During this process, I ended coining what I call the complexity triad—State, Optimization, and Surface. If you read the book on complexity, you can see my views on what the triad consisted of changed through in the writing—I started out with volume (of state), speed (of state), and optimization. Somehow, though, interaction surfaces need to play a role in the complexity puzzle.
First, you create interaction surface when you modularize anything—and you modularize to control state (the scope to set apart failure domains, the speed and volume to enable scaling). Second, adding interaction surfaces adds complexity by creating places where information must be exchanged—which requires protocols and other things. Finally, reducing state through abstraction at an interaction surface is the primary cause of many forms of suboptimal behavior in a control plane, and causes unintended consequences. Since interaction surfaces are so closely tied to state and optimization, then, I added surfaces to the triad, and merged the two kinds of state into one, just state.
I have been thinking through the triad again in the last several weeks for various reasons, and I’m not certain it’s quite right still because I’m not convinced surfaces are really a tradeoff against state and optimization. It seems more accurate to say that state and optimization trade off through interaction surfaces. This does not make it any less of a triad, but it might mean I need to find a little different way to draw it. One way to illustrate it is as a system of moving parts, such as the illustration below.
If you think of the interaction surface between modules 1 and 2—two topological parts, or a virtual topology on top of a physical—then the abstraction is the amount of information allowed to pass between the two modules. For instance, in aggregation the length of the aggregated prefixes, or the aggregated prefix metrics, etc.
When you “turn the crank,” so-to-speak, you adjust the volume, speed (velocity), breadth, or depth of information being passed between the modules—either more or less information, faster or slower, in more places or fewer, or the reaction of the module receiving the state. Every time you turn the crank, however, there is not one reaction but many. Notices optimization 1 will turn in the opposite direction from optimization 2 in the diagram—so turning the crank for 1 to be more optimal will always result in 2 becoming less optimal. There are tens or hundreds of such interactions in any system, and it is impossible for any person to know or understand all of them.
For instance, if you aggregate hundreds of /64’s to tens of /60’s, you reduce the state and optimize by reducing the scope of the failure domain. On the other hand, because you have less specific routing information, traffic is (most likely) going to flow along less-than-optimal paths. If you “turn the crank” by aggregating those same hundreds of /64’s to a 0::0, you will have more “airtight” failure domains or modules, but less optimal traffic flow. Hence …
If you haven’t found the tradeoffs, you haven’t looked hard enough.
What understanding the SOS triad allows you, combined with a fundamental knowledge of how these things work, is to know where to look for the tradeoffs. Maybe it would be better to illustrate the SOS triad with surfaces at the bottom all the time, acting as a sort of fulcrum or balance point between state and optimization… Or maybe a completely different illustration would be better. This is something for me to think about more and figure out.
Complexity interacts with these interaction surfaces as well, of course—the more complex a system becomes, the more complex the interaction surface within the system become or the more of them you have. A key point in design of any kind is balancing the number of interaction surfaces with their complexity, depth, and breath—in other words, where should you modularize, what should each module contain, what sort of state passed between the modules, where does state pass between the modules, etc. Somehow, mentally, you have to factor in the unintended consequences of hiding information (the first corollary to Keith’s Law, in effect), and the law of leaky abstractions (all nontrivial abstractions leak).
This is a far different way of looking at networks and their design than what you learned in any random certification, and its probably not even something you will find in a college textbook. It is quite difficult to apply when you’re down in the configuration of individual devices. But it’s also the key to understanding networks as a system and beginning the process of thinking about where and how to modularize to create the simplest system to solve a given hard problem.
Going back to the beginning, then—one of the reasons we build such complex networks is we do not really think about how the modules fit together. Instead, we use rules-of-thumb and folk wisdom while we mumble about failure domains and “this won’t scale” under our breath. We are so focused on the individual gears becoming commodities that we fail to see the system and all its moving parts—or we somehow think “this is all so easy,” that we build very inefficient systems with brute-force resolutions, often resulting in mass failures that are hard to understand and resolve.
Sorry, there’s no clear point or lesson here… This is just what happens when I’ve been buried in dissertation work all day and suddenly realize I have not written a blog post for this week… But it should give you something to think about.
When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.
The paper begins by defining what complexity in the physical infrastructure of a DC fabric looks like. They focus on packaging, or the layout of the switches in the fabric, the bundles of cabling required to wire the topology, and the number and locations of patch panels required. The packaging and patch panels impact the length and complexity of the cable runs (whether optical or copper), which represents a base complexity for the entire topology.
The second thing they consider is the lifecycle of the physical fabric infrastructure. What steps are required to upgrade the fabric from a smaller configuration to a larger one? Or from a lower speed (higher oversubscription) to a higher speed (lower oversubscription)? The result is the ability to put a number on the overall complexity of each topology.
The first class of topologies they consider are spine-and-leaf, such as the Clos, Benes, and butterfly fabrics. They call all kinds of spine-and-leaf fabrics Clos fabrics. Spine-and-leaf fabrics, they note generally have very low cabling complexity because their symmetry encourages consistent bundling and hardware placement. They call the second kind of topology expander fabrics; the most common fabric in this class is the dragonfly. These topologies are more difficult to wire but simpler to scale out because they can be expanded largely by modifying just the edge of the fabric. Their analysis shows these classes of fabric rate equally on their complexity scale.
A side note they don’t consider in the paper—their complexity computation implies that if you are building a fabric with a somewhat fixed range of sizes, and you can preplan the location of spines leaving enough room for the maximum sized fabric on the first day, spine-and-leaf fabrics are less complex than the fancier topologies you might hear about from time to time. Since most data center fabrics do, in fact, fall into these kinds of constraints (given a good day one designer!), this seems to validate the widespread use of butterfly and Clos fabrics for most applications. This feels like a significant result for most common data center fabric designs.
Finally, they describe an interesting topology they call FatClique, which is an interesting blend of spine-and-lead and edge expander topologies; I’ve screen grabbed the image from the paper below.
Overall, it’s well worth spending the time to read the entire paper if you have an in-depth interest in fabric design.The way this topology is described feels very much like a Benes to me, or a butterfly where the fabric routers are replaced by fabrics (making a seven-stage fabric). It’s hard to tell how useful this topology would be in real deployments—but that researchers are looking into alternatives other than the venerable spine-and-leaf is interesting in its own right.
…we have educated generations of computer scientists on the paradigm that analysis of algorithm only means analyzing their computational efficiency. As Wikipedia states: “In computer science, the analysis of algorithms is the process of finding the computational complexity of algorithms—the amount of time, storage, or other resources needed to execute them.” In other words, efficiency is the sole concern in the design of algorithms. … What about resilience? —Moshe Y. Vardi
This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience—but at what cost in terms of efficiency? Its also obvious highly meshed data center fabrics have plenty of resilience—and yet they still sometimes fail. Why?
These cases can be described as efficiency extremes. The single link between two distant points is extremely efficient at minimizing cost and complexity; there is only one link to pay for, only one pair of devices to configure, etc. The highly meshed data center fabric, on the other hand, is extremely efficient at rapidly carrying large amounts of data between vast numbers of interconnected devices (east/west traffic flows). Have these optimizations towards one goal resulted in tradeoffs in resilience?
Consider the case of the single long-haul link between two routers. In terms of the state/optimization/surfaces (SOS) tirade, this single pair of routers and single link minimize the amount of control plane state and the breadth of surfaces (there is only point at which the control plane and the physical network intersect, for instance). The tradeoff, however, is a single link failure causes all traffic through the network to stop flowing—the network completely fails to do the work its designed to do. To create resiliency, or rather add a second dimension of optimization to the network, a second link and a second pair of routers need to be added. Adding these, however, will increase the amount of state and the number of interaction surfaces in the network. Another way to put this is the overall system becomes more complex to solve a harder set of problems—inexpensive traffic flow versus minimal cost traffic flow with resilience.
The second case is a little harder to understand—we assume all those parallel links necessarily make the network more resilient. If this is the case, then why do data center fabrics ever fail? In fact, DC fabrics are plagued by one of the hardest kinds of failure to understand and repair—grey failures. Going back to the SOS triad, the massive number of parallel links and devices in a DC fabric, designed to optimize the network for carrying massive amounts of traffic, also add lots of state and interaction surfaces to the network. Increasing the amount of state and interaction surfaces should, in theory, reduce some other form of optimization—in this case resilience through overwhelmed control planes and grey failures.
In the case of a DC fabric, simplification can increase resilience. Since you cannot reduce the number of links and devices, you must think through how and where to abstract information to reduce state. Reducing state, in turn, is bound to reduce the efficiency of traffic flows through the network, so you immediately run into a domino effect of optimization tradeoffs. Highly turned optimization for traffic carrying causes a lack of optimization in resilience; optimizing for resilience reduces the optimization of traffic flow through the network. These kinds of chain reactions are common in the network engineering world. How can you optimize against grey failures? Perhaps simplifying design by using a single kind of optic, rather than having multiple kinds, or finding other ways to cope with the complexity in physical design.
Returning to the original quote—we often build a lot of resilience into network designs, so we do not face the same sorts of problems software designers and implementors do. Quite often the hyper-focus on resilience in network design is a result of a lack of resilience in software design—software designers have thrown the complexity of resilient design over the cubicle wall into the network operator’s lap. This clearly does not seem to be the most efficient way to handle things, as network are vastly more complex because of the absolute resilience they are expected to provide; looking at the software and network as a system might produce a more resilient, and yet simpler, system.
The key, in the meantime, is for network engineers to learn how to ply the tradeoffs, understanding precisely what their goals are—or what they are optimizing for—and how those optimizations trade off against one another.
One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.
To produce the study, the authors took data from Facebook’s ticket logging system over 6 years, from 2011 through 2018. They used language-based systems to classify each event based on severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted and tried to understand the results. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).
Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should lower than a manually operated network. That 13% of failures are still accounted for by misconfiguration shows even the best automation program cannot eliminate failures from misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures are not given, this seems to provide both an idea of how much improvement automation can create, as well as a sort of “cap” on how much improvement operators can expect by automating.
If misconfiguration causes 13% of all failures, and software defects cause 12%, then 25% of all failures are caused by human error. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator created code and operator configuration, the percentage of failure seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive it down more than a couple of percentage points.
Another interesting finding here is larger networks increase the time humans take to resolve incidents. As the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find a problem and resolve it. This is, on the other hand, likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.
Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.
There is no enterprise, there is no service provider—there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument—that hyperscaler’s problems are not your problems, the technologies and solutions providers user are fundamentally different than what enterprises require.
Let me try to recap some of the arguments I’ve heard used against my assertion.
The theory that enterprise and service provider networks require completely different technologies and implementations is often grounded in scale. Service provider networks are so large that they simply must use different solutions—solutions that you cannot apply to any network running at a smaller scale.
The problem with this line of thinking is it throws the baby out with the bathwater. Google is using automation to run their network? Well, then… you shouldn’t use automation because Google’s problems are not your problems. Microsoft is deploying 100g Ethernet over fiber? Then clearly enterprise networks should be using Token Ring or ARCnet because… Microsoft’s problems are not your problems.
The usual answer is—“I’m not saying we shouldn’t take good ideas when we see them, but we shouldn’t design networks the way someone else does just because.” I don’t see how this clarifies the solution, though—when is it a good idea or a bad one? What is our criterion to decide what to adopt and what not to adopt? Simply saying “X’s problems aren’t your problems” doesn’t really give me any actionable information—or at least I’m not getting it if it’s buried in there someplace.
Instead—maybe—just maybe—we are looking at this all wrong. Maybe there is some other way classify networks that will help us see the problem set better.
I don’t think networks are undifferentiated—I think the enterprise/service provider/hyerpscaler divide is not helpful to understand how different networks are … different, and how to correctly identify an environment and build to it. Reading a classic paper in software design this week—Programs, Life Cycles, and Laws of Software Evolution—brought all this to mind. In writing this paper, Meir Lehman was facing many of the same classification problems, just in software development rather than in building networks.
Rather than saying “enterprise software is different than service provider software”—an assertion absolutely no-one makes—or even “commercial software is different than private software, and developers working in these two areas cannot use the same tools and techniques,” Lehman posits there are three kinds of software systems. He calls these S-Programs, in which the problem and solution can be fully specified; P-Programs, in which the problem can be fully specified, but the program can only be partially specified because of complexity and scale; and E-Programs, where the program itself become part of the world it models. Lehman thinks most software will move towards S-Program status as time moves on—something that hasn’t happened (the reasons are out of scope for this already-too-long-blog-post).
But the classification is useful. For S-Programs, the inputs and outputs can be fully specified, full-on testing can take place before the software is deployed, and lifecycle management is largely about making the software more fully conform to its original conditions. Maybe there are S-Networks, too? Single-purpose networks which are aimed at fulfilling on well-defined thing, and only that thing. Lehman talks about learning how to breaking larger problems into smaller one so the S-Problems can be dealt with separately—is this anything different than separating out the basic problem of providing IP connectivity in a DC fabric underlay, or even providing basic IP connectivity in a transit or campus network, treating it as a separate module with fairly well design goals and measurements?
Lehman talks about P-Programs, where the problem is largely definable, but the solutions end up being more heuristic. Isn’t this similar to a traffic engineering overlay, where we largely know what the goals are, but we don’t necessarily know what specific solution is going to needed at any moment, and the complete set of solutions is just too large to initially calculate? What about E-Programs, where the software becomes a part of the world it models? Isn’t this like the intent-based stuff we’ve been talking about networking for going one 30 years now?
Looking at it another way, isn’t it possible that some networks are largely just S-Networks? And others are largely E-Networks? And that these classifications have nothing to do with whether the network is being built by what we call an “enterprise” or a “service provider?” Isn’t is possible that S-Networks should probably all use the same basic sort of structure and largely be classified as a “commodity,” while E-Networks will all be snowflakes, and largely classified as having high business importance?
Just like I don’t think the OSI model is particularly helpful in teaching and understanding networks any longer, I don’t find the enterprise/service/hyperscaler model very useful in building and operating networks. The service enterprise/service provider divide tends to artificially limit idea transfer when it wants to be transferred, and artificially “hype up” some networks while degrading others—largely based on perceptions of scale.
Scale != complexity. It’s not about service providers and enterprises. It doesn’t matter if Google’s problems are not your problems; borrowing from the hyperscale is not a “bad thing.” It’s just a “thing.” Think clearly about the problem set, understand the problem set, and borrow liberally. There is no such thing as a “service provider technology,” nor is there any such thing as an “enterprise technology.” There are problems, and there are solutions. To be an engineer is to connect the two.