I recently joined Ethan Banks for a Packet Pushers episode around the trade offs of hiding information in the control plane. This was a terrific show; you can listen to it by clicking on the link below.
Today on the Priority Queue, we’re gonna hide some information. Oh, like route summarization? Sure, like route summarization. That’s an example of information hiding. But there’s much more to the story than that. Our guest is Russ White. Russ is a serial networking book author, network architect, RFC writer, patent holder, technical instructor, and much of the motive force behind the early iterations of the CCDE program.
Have you ever wondered why spine-and-leaf networks are the “standard” for data center networks? While the answer has a lot to do with trial and error, it turns out there is also a mathematical reason the fat-tree spine-and-leaf is used almost universally. There often is some mathematical reason for the decisions made in engineering, although we rarely explore those reasons. If it seems to work, there is probably a reason.
The fat-tree design is explored in a paper published in 2015 (available here at the ACM, and now added to my “classic papers” page so there is a local copy as well), using a novel technique to not only explore why the spine-and-leaf fat-tree is so flexible, but even what the ideal ratio of network capacity is at each stage. The idea begins with this basic concept: one kind of network topology can be emulated on top of another physical topology. For instance, you can emulate a toroid topology on top of a hierarchical network, or a spine-and-leaf on top of a hypercube, etc. To use terms engineers are familiar with in a slightly different way, let’s call the physical topology the underlay, and the emulated topology the overlay. This is not too far off, as emulated topologies do need to be a virtual construction on top of the underlay, which means… tunnels, of course.
In emulating one kind of topology over another kind of topology, however, you lose some performance. This performance loss factor is quite important, in that it tells you whether or not it is worth building your actual network using the design of the underlay, or using the overlay. If, for instance, you emulate a toroid design on top of a folded Clos, and you discover the performance is half as good as a real, physical toroid, then you might be better off building a toroid rather than a folded Clos.
This might be difficult to grasp at first, so let’s use a specific example. The only topologies most engineers are familiar with are rings, hierarchical networks, partial mesh, full mesh, and some form of spine-and-leaf. Consider the case of a full mesh network versus a ring topology. Suppose you have some application that, through careful testing, you have determined best runs on a full mesh network. Now when you go to build your physical network, you find that you have the choice between building a real full mesh network, which will be very expensive, or a ring network, which will be very cheap. How much performance will you lose in running this application on the ring topology?
The way to discover the answer is to build a virtual full mesh overlay onto the ring topology. Once you do this, it turns out there are ways to determine how much performance loss you will encounter using this combination of underlay and overlay. Now, take this one step farther: what if you decide not to build the overlay, but simply run the application directly on the ring topology? Minus the tunnel encapsulation and de-encapsulation, the resulting loss in performance should be roughly the same.
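To make the full mesh over ring comparison a little more concrete, here is a minimal sketch in Python. It is not the measure used in the paper; it assumes shortest-path routing on the ring and uses two rough proxies for performance loss: how many overlay links share the busiest physical link, and the average number of physical hops per overlay link. The function names are my own.

```python
from itertools import combinations


def ring_path(src, dst, n):
    """Return the ring links crossed on the shorter way round from src to dst.

    Each physical link is keyed as (i, (i + 1) % n), whichever way it is crossed.
    """
    clockwise = (dst - src) % n
    if clockwise <= n - clockwise:
        step, hops = 1, clockwise
    else:
        step, hops = -1, n - clockwise
    links = []
    node = src
    for _ in range(hops):
        nxt = (node + step) % n
        links.append((node, nxt) if step == 1 else (nxt, node))
        node = nxt
    return links


def full_mesh_on_ring(n):
    """Map every full mesh overlay link onto an n-node ring underlay.

    Returns (worst link load, average hops per overlay link). In a real
    full mesh both numbers would be 1.
    """
    load = {}
    total_hops = 0
    pairs = list(combinations(range(n), 2))
    for src, dst in pairs:
        path = ring_path(src, dst, n)
        total_hops += len(path)
        for link in path:
            load[link] = load.get(link, 0) + 1
    return max(load.values()), total_hops / len(pairs)


if __name__ == "__main__":
    for size in (5, 9, 17):
        print(size, full_mesh_on_ring(size))
```

For a five node ring, the busiest physical link carries three overlay links and the average overlay link takes 1.5 physical hops. Both numbers grow as the ring gets larger, which is the intuition behind the performance loss described above.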
Given this, you should be able to calculate the performance loss from emulating every kind of topology on top of every other kind of topology. From here, you can determine if there is one topology that can be used as an underlay for every other kind of topology with a minimal amount of performance loss. It turns out there is: the fat-tree spine-and-leaf topology. So an application that might best work on a full mesh network will run on a fat-tree spine-and-leaf topology with minimal performance loss. The same can be said of an application that runs best on a ring topology, or a hierarchical topology, or a partial mesh, or a… Thus, the fat-tree spine-and-leaf can be called a universal topology.
After coming to this point, the authors wonder if it is possible to determine how the bandwidth should be apportioned in a fat-tree spine-and-leaf to produce optimal emulation results across every possible overlay. This would give designers the ability to understand how to achieve the best universal application performance on a single physical topology. Again, there turns out to be an answer to this question: doubling the bandwidth at each stage in the fabric is the best way to create an underlay that will emulate every other kind of overlay with the minimal amount of performance loss. The authors come to this conclusion by comparing a data center network to an on-chip connection network, which has a provable “best” design based on the physical layout of the wires. Translating the physical space into a monetary cost produces the same result as above: doubling the bandwidth at each stage creates the least expensive network with the most capabilities.
The result, then, is this: the fat-tree spine-and-leaf design is the most effective design for most applications, a universal topology. Further, building such a design where there are two neighbors down for every neighbor up, and doubling the bandwidth at each stage, is the most efficient use of resources in building such a network.
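The arithmetic behind the “two down for every one up, double the bandwidth” rule can be sketched in a few lines of Python. This is only my own illustration of the rule as stated here, not the model from the paper; it assumes a strictly binary tree, and the function name and the 10 unit host link are made up for the example.

```python
def fat_tree_stages(levels, host_bw=1.0):
    """Tabulate a binary fat-tree where uplink bandwidth doubles per stage.

    Returns a list of (stage, nodes, uplink_bw, aggregate_bw) tuples,
    starting at the hosts (stage 0) and moving toward the top of the fabric.
    """
    rows = []
    nodes = 2 ** levels   # two neighbors down for every neighbor up
    bw = host_bw
    for stage in range(levels):
        rows.append((stage, nodes, bw, nodes * bw))
        nodes //= 2       # half as many nodes one stage up...
        bw *= 2           # ...each with twice the uplink bandwidth
    return rows


if __name__ == "__main__":
    for row in fat_tree_stages(3, host_bw=10.0):
        print(row)
```

With eight hosts at 10 units each, every stage ends up with the same 80 units of aggregate capacity: halving the number of nodes while doubling the bandwidth keeps the total constant from the edge to the top of the fabric.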
Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.
Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do; I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. It just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.
So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.
In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures, a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500s with SSEs, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?
Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?
After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.
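A toy model shows how there can be plenty of free memory and still no usable block. This sketch is in Python and is in no way the old IOS allocator; the heap is just a list of slots, the two “processes” are labels I made up, and the sizes are arbitrary.

```python
def largest_free_run(heap):
    """Return the length of the longest run of free (None) slots."""
    best = run = 0
    for owner in heap:
        run = run + 1 if owner is None else 0
        best = max(best, run)
    return best


def simulate(heap_size=64, packet=4):
    """Two processes take packet-sized blocks in turn, then one frees its share.

    Returns (total free slots, largest contiguous free block).
    """
    heap = [None] * heap_size
    owners = ["A", "B"]  # hypothetical process names
    for block, start in enumerate(range(0, heap_size, packet)):
        end = min(start + packet, heap_size)
        heap[start:end] = [owners[block % 2]] * (end - start)
    # Process A releases everything it holds.
    heap = [None if owner == "A" else owner for owner in heap]
    return heap.count(None), largest_free_run(heap)


if __name__ == "__main__":
    free, largest = simulate()
    print(f"free slots: {free}, largest free block: {largest}")
```

With the defaults, half the heap (32 slots) is free, but the largest free block is still only 4 slots, one packet. Any request for something larger than a single packet fails even though the free memory counter looks healthy.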
Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.
Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…
You can see how this quickly developed into a positive feedback loop—
EIGRP receives a set of queries to which it must respond
EIGRP allocates memory for each packet to build the responses
Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
EIGRP SIA timer times out, causing a flood of new queries…
Rinse and repeat until the network fails to converge.
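The loop above can be sketched as a small simulation in Python. This is a toy model under my own assumptions, not EIGRP: time moves in ticks, the router can build a fixed number of replies per tick (standing in for how many blocks it can find), and any query left unanswered for sia_ticks causes a reset that dumps a fixed burst of new queries onto the queue. All of the names and numbers are hypothetical.

```python
def simulate_loop(initial_queries, replies_per_tick, sia_ticks, burst, ticks):
    """Return the number of unanswered queries after each tick."""
    pending = [0] * initial_queries  # age of each unanswered query, oldest first
    history = []
    for _ in range(ticks):
        # Answer as many of the oldest queries as memory allows.
        pending = pending[replies_per_tick:]
        # Everything still waiting gets one tick older.
        pending = [age + 1 for age in pending]
        if any(age >= sia_ticks for age in pending):
            # SIA timer fires: the adjacency resets, the table is withdrawn,
            # and a fresh burst of queries arrives.
            pending = [age for age in pending if age < sia_ticks] + [0] * burst
        history.append(len(pending))
    return history


if __name__ == "__main__":
    print("starved:", simulate_loop(10, 2, 3, 20, 6))
    print("healthy:", simulate_loop(10, 4, 3, 20, 6))
```

With only two replies per tick, the backlog goes 8, 6, then jumps to 20 and keeps cycling without ever draining. With four replies per tick, the same starting load drains to zero by the third tick and stays there. The only difference is whether the replies can be built before the timer fires.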
There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least sometimes), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIAs in the real world.
But rest assured—there are still positive feedback loops lurking in some corner of every network.
On this community roundtable at the Network Collective, we’re talking about building resilient networks with Pete Welcher, Jody Lemoine, and John Herbert. This was a terrific discussion of all those things you might not think about.
What do monkeys and clubs have to do with software or network design? The primary point of interaction is security. The club you intend to make your network operator’s life easier is also a club an attacker can use to break into your network, or damage its operation. Clubs are just that way. If you think of the collection of tools as not just tools, but also as an attack surface, you can immediately see the correlation between the available tools and the attack surface. One way to increase security is to reduce the attack surface, and one way to reduce the attack surface is to reduce the number of tools, or the size of the club.
The best way to reduce the attack surface of a piece of software is to remove any unnecessary code.
Consider this: the components of any network are actually made up of code. So to translate this to the network engineering world, you can say:
The best way to reduce the attack surface of a network is to remove any unnecessary components.
What kinds of components? Routing protocols, transport protocols, and quality of service mechanisms come immediately to mind, but the number and kind of overlays, the number and kind of virtual networks might be further examples.
There is another issue here that is not security related specifically, but rather resilience related. When you think about network failures, you probably think of bugs in the code, failed connectors, failed hardware, and other such causes. The reality is far different, however—the primary cause of network failures in real life is probably user error in the form of misconfiguration (or misconfiguration spread across a thousand routers through the wonders of DevOps!). The Mean Time Between Mistakes (MTBM) is a much larger deal than most realize. Giving the operator too many knobs to solve a single problem is the equivalent of giving the monkey a club.
Simplicity in network design has many advantages—including giving the monkey a smaller club.
One thing I’m often asked in email and in person is: why should I bother learning theory? After all, you don’t install SPF in your network; you install a router or switch, which you then configure OSPF or IS-IS on. The SPF algorithm is not exposed to the user, and does not seem to really have any impact on the operation of the network. Such internal functionality might be neat to know, but ultimately–who cares? Maybe it will be useful in some projected troubleshooting situation, but the key to effective troubleshooting is understanding the output of the device, rather than in understanding what the device is doing.
In other words, there is no reason to treat network devices as anything more than black boxes. You put some stuff in, other stuff comes out, and the vendor takes care of everything in the middle. I dealt with a related line of thinking in this video, but what about this black box argument? Do network engineers really need to know what goes on inside the vendor’s black box?
Let me answer this question with another question. When you shift to a new piece of hardware, how do you know what you are trying to configure? Suppose, for instance, that you need to set up BGP route reflectors on a new device, and need to make certain optimal paths are taken from eBGP edge to eBGP edge. What configuration commands would you look for? If you knew BGP as a protocol, you might be able to find the right set of commands without needing to search the documentation, or do an internet search. Knowing how it works can often lead you to knowing where to look and what the commands might be. This can save a tremendous amount of time.
Back up from configuration to making equipment purchasing decisions, or specifying equipment. Again, rather than searching the documentation randomly, if you know what protocol level feature you need the software to implement, you can search for the specific support you are looking for, and know what questions to ask about the possible limitations.
And again, from a more architectural perspective–how do you know what protocol to specify to solve any particular problem if you don’t understand how the protocols actually work?
So from configuration to architecture, knowing how a protocol works can actually help you work faster and smarter by helping you ask the right questions. Just another reason to actually learn the way protocols work, rather than just how to configure them.
The paper we are looking at in this post is tangential to the world of network engineering, rather than being directly targeted at network engineering. The thesis of On Understanding Software Agility—A Social Complexity Point of View, is that at least some elements of software development are a wicked problem, and hence need to be managed through complexity. The paper sets the following criteria for complexity—
Interaction: made up of a lot of interacting systems
Autonomy: subsystems are largely autonomous within specified bounds
Emergence: global behavior is unpredictable, but can be explained in subsystem interactions
Lack of equilibrium: events prevent the system from reaching a state of equilibrium
Nonlinearity: small events cause large output changes
Self-organization: self-organizing response to disruptive events
Co-evolution: the system and its environment adapt to one another
It’s pretty clear network design and operation would fit into the 7 points made above; the control plane, transport protocols, the physical layer, hardware, and software are all subsystems of an overall system. Between these subsystems, there is clearly interaction, and each subsystem acts autonomously within bounds. The result is a set of systemic behaviors that cannot be predicted from examining the system itself. The network design process is, itself, also a complex system, just as software development is.
Trying to establish computing as an engineering discipline led people to believe that managing projects in computing is also an engineering discipline. Engineering is for the most part based on Newtonian mechanics and physics, especially in terms of causality. Events can be predicted, responses can be known in advance and planning and optimize for certain attributes is possible from the outset. Effectively, this reductionist approach assumes that the whole is the sum of the parts, and that parts can be replaced wherever and whenever necessary to address problems. This machine paradigm necessitates planning everything in advance because the machine does not think. This approach is fundamentally incapable of dealing with the complexity and change that actually happens in projects.
In a network, some simple input into one subsystem of the network can cause major changes in the overall system state. The question is: how should engineers deal with this situation? One solution is to try to nail each system down more precisely, such as building a “single source of truth,” and imposing that single view of the world onto the network. The theory is to change control, ultimately removing the lack of equilibrium, and hence reducing complexity. Or perhaps we can centralize the control plane, moving all the complexity into a single point in the network, making it manageable. Or maybe we can automate all the complexity out of the network, feeding the network “intent,” and having, as a result, a nice clean network experience.
Good luck slaying the complexity dragon in any of these ways. What, then, is the solution? According to Pelrine, the right solution is to replace our Newtonian view of software development with a different model. To make this shift, the author suggests moving to the complexity quadrant of the Cynefin framework of complexity.
The correct way to manage software development, according to Pelrine, is to use a probe/sense/respond model. You probe the software through testing iteratively as you build it, sensing the result, and then responding to the result through modification, etc.
Application to network design
The same process is actually used in the design of networks, once beyond the greenfield stage or initial deployment. Over time, different configurations are tested to solve specific problems, iteratively solving problems while also accruing complexity and ossifying. The problem with network designs is much the same as it is with software projects: the resulting network is not ever “torn down,” and rethought from the ground up. The result seems to be that networks will become more complex over time, until they either fail, or they are replaced because the business fails, or some other major event occurs. There needs to be some way to combat this, but how?
The key counter to this process is modularization. By modularizing, which is containing complexity within bounds, you can build using the probe/sense/respond model. There will still be changes in the interaction between the modules within the system, but so long as the “edges” are held constant, the complexity can be split into two domains: within the module and outside the module. Hence the rule of thumb: separate complexity from complexity.
The traditional hierarchical model, of course, provides one kind of separation by splitting control plane state into smaller chunks, potentially at the cost of optimal traffic flow (remember the state/optimization/surface trilemma here, as it describes the trade-offs). Another form of separation is virtualization, with the attendant costs of interaction surfaces and optimization. A third sort of separation is to split policy from reachability, which attempts to preserve optimization at the cost of interaction surfaces. Disaggregation provides yet another useful separation between complex systems, separating software from hardware, and (potentially) the control plane from the network operating system, and even the monitoring software from the network operating system.
These types of modularization can be used together, of course; topological regions of the network can be separated via control plane state choke points, while either virtualization or splitting policy from reachability can be used within a module to separate complexity from complexity within a module. The amount and kind of separations deployed is entirely dependent on specific requirements as well as the complexity of the overall network. The more complex the overall network is, the more kinds of separation that should be deployed to contain complexity into manageable chunks where possible.
Each module, then, can be replaced with a new one, so long as it provides the same set of services, and any changes in the edge are manageable. Each module can be developed iteratively, by making changes (probing), sensing (measuring the result), and then adjusting the module according to whether or not it fits the requirements. This part would involve using creative destruction (the chaos monkey) as a form of probing, to see how the module and system react to controlled failures.
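The probe/sense/respond cycle for a single module can be sketched as a short loop. This Python sketch is only an illustration of the idea; the function, the dictionary based “configuration,” and the measurement callbacks are all assumptions of mine, standing in for whatever change and telemetry tooling you actually have.

```python
def probe_sense_respond(config, candidates, measure, meets_requirements):
    """Try candidate changes one at a time, keeping only those that pass.

    config:             dict describing the module's current settings
    candidates:         list of (name, dict of settings to change)
    measure:            callable taking a config and returning a result
    meets_requirements: callable taking that result and returning True or False
    """
    accepted = []
    for name, change in candidates:
        trial = {**config, **change}    # probe: apply the change to a copy
        result = measure(trial)         # sense: observe what happened
        if meets_requirements(result):  # respond: keep it, or roll back
            config = trial
            accepted.append(name)
    return config, accepted


if __name__ == "__main__":
    # A made-up stand-in for a real measurement: lower is better.
    def estimate(cfg):
        return cfg["timer"] / cfg["links"]

    start = {"links": 2, "timer": 5}
    changes = [("add-link", {"links": 3}), ("long-timer", {"timer": 1800})]
    print(probe_sense_respond(start, changes, estimate, lambda r: r <= 5))
```

The point of the sketch is the shape of the loop: each change is tested against the module’s requirements on its own, and a change that fails is simply never carried forward, so the module’s edge stays stable while its insides evolve.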
Nice Theory, but So What?
This might all seem theoretical, but it is actually extremely practical. Getting out of the traditional model of network design, where the configuration is fixed, there is a single source of truth for the entire network, the control plane is tied to the software, the software is tied to the hardware, and policy is tied to the control plane, can open up new ways to build massive networks against very complex requirements while managing the complexity and the development and deployment processes. The shift is from a mindset of controlling complexity by nailing everything down to a single state to a mindset of managing complexity by finding logical separation points, building in blocks, and then “growing” each module using the appropriate process, whether iterative or waterfall.
Even if scale is not the goal of your network (you “only” have a couple of hundred network devices, say), these principles can still be applied. First, complexity is not really about scale; it is about requirements. A car is not really any less complex than a large truck, and a motor home (or camper) is likely more complex than either. The differences are not in scale, but in requirements. Second, these principles still apply to smaller networks; the primary question is which forms of separation to deploy, rather than whether complexity needs to be separated from complexity.
Moving to this kind of design model could revolutionize the thinking of the network engineering world.
This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:
In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.
Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.
But is either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity: is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs to consider here, as it turns out; it just sometimes takes a little out of the box thinking to find them.
What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this, right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.
Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.
What about routing protocols? The standards communities seem focused on creating and maintaining a small handful of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol, because there is enough to know in this one protocol to spend a lifetime learning it.
In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.
The way to prevent landing on either extreme (a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, or a world where every device is uniquely designed to fit its environment, and no other device will do) is to consider the tradeoffs.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.
In this short video I work through two kinds of design, or two different ways of designing a network. Which kind of designer are you? Do you see one as better than the other? Which would you prefer to do, and which are you doing right now?
In recent years, we have become accustomed to, and often accosted by, the phrase software eats the world. It’s become a mantra in the networking world that software defined is the future, full stop. This research paper by Microsoft, however, tells a different story. According to Baumann, hardware is the new software. Or, to put it differently, even as software eats the world, hardware is taking over an ever increasing amount of the functionality software is doing. In showing this point, the paper also points out the complexity problems involved in dissolving the thin waist of an architecture.
The specific example used in the paper is the Intel x86 Instruction Set Architecture (ISA). Many years ago, when I was a “youngster” in the information technology field, there were a number of different processor platforms; the processor wars waged in full. There were, primarily, the x86 platform, by Intel, beginning with the 8086, and its subsequent generations, the 8088, 80286, 80386, then the Pentium, etc. On the other side of the world, there were the RISC based processors, the kind stuffed into Apple products, Cisco routers, and Sun Sparc workstations (like the one that I used daily while in Cisco TAC). The argument between these two came down to this: the Intel x86 ISA was messy, somewhat ad-hoc, difficult to work with, sometimes buggy, and provided a lot of functionality at the chip level. The RISC based processors, on the other hand, had a much simpler ISA, leaving more to software.
Twenty years on, and this argument is no longer even an argument. The x86 ISA won hands down. I still remember when the Apple and Cisco product lines moved from RISC based processors to x86 platforms, and when AMD first started making x86 “lookalikes” that could run software designed for an Intel processor.
But remember this—The x86 ISA has always been about the processor taking in more work over time. Baumann’s paper is, in fact, an overview of this process, showing the amount of work the processor has been taking on over time. The example he uses is the twelve new instructions added to the x86 ISA by Intel in 2015–2016 around software security. Twelve new instructions might not sound like a lot, but these particular instructions necessitate the creation of entirely new registers in the CPU, a changed memory page table format, new stack structures (including a new shadow stack that sounds rather complex), new exception handling processes (for interrupts), etc.
The primary point of much of this activity is to make it possible for developers to stop trusting software for specific slices of security, and start trusting hardware, which cannot (in theory) be altered. Of course, so long as the hardware relies on microcode and has some sort of update process, there is always the possibility of an attack on the software embedded in the hardware, but we will leave this problematic point aside for the moment.
Why is Intel adding more to the hardware? According to Baumann—
The slowing pace of Moore’s Law will make it harder to sell CPUs: absent improvements in microarchitecture, they won’t be substantially faster, nor substantially more power efficient, and they will have about the same number of cores at the same price point as prior CPUs. Why would anyone buy a new CPU? One reason to which Intel appears to be turning is features: if the new CPU implements an important ISA extension—say, one required by software because it is essential to security—consumers will have a strong reason to upgrade.
If we see the software market as composed of layers, with applications on top, and the actual hardware on the bottom, we can see the ISA is part of the thin waist in software development. Baumann says: “As the most stable “thin waist” interface in today’s commodity technology stack, the x86 ISA sits at a critical point for many systems.”
The thin waist is very important in the larger effort to combat complexity. Having the simplest thin waist available is one of the most important things you can do in any system to reduce complexity. What is happening here is that the industry went from having many different ISA options to having one dominant ISA. A thin waist, however, always implies commoditization, which in turn means falling profits. So if you are the keeper of the thin waist, you either learn to live with less profit, you branch out into other areas, or you learn to broaden the waist. Intel is doing all of these things, but broadening the waist is definitely on the table.
If the point about complexity is true, then broadening the waist should imply an increase in complexity. Baumann points this out specifically. In the section on implications, he notes a lot of things that relate to complexity, such as increased interactions (surfaces!), security issues, and slower feature cycles (state!). Sure enough, broadening the thin waist always increases complexity, most of the time in unexpected ways.
So—what is the point of this little trip down memory lane (or the memory hole, perhaps)? This all directly applies to network and protocol design. The end-to-end principle is one of the most time tested ideas in protocol design. Essentially, the network itself should be a thin waist in the application flow, providing minimal services and changing little. Within the network, there should be a set of thin waists where modules meet; choke points where policy can be implemented, state can be reduced, and interaction surfaces can be clearly controlled.
We have, alas, abandoned all of this in our drive to make the network the ne plus ultra of all things. Want a layer two overlay? No problem. Want to stretch the layer two overlay between continents? Well, we can do that too. Want it all to converge really fast, and have high reliability? We can shove ISSU at the problem, and give you no downtime, no dropped packets, ever.
As a result, our networks are complex. In fact, they are too complex. And now we are talking about pushing AI at the problem. The reality is, however, we would be much better off if we actually carefully considered each new thing we throw into the mix of the thin waist of the network itself, and of our individual networks.
The thin waist is always tempting to broaden out—it is so simple, and it seems so easy to add functionality to the thin waist to get where we want to go. The result is never what we think it will be, though. The power of unintended consequences will catch up with us at some point or another.
Probably not. Nonblocking is a word that is thrown around a lot, particularly in the world of spine and leaf fabric design, but just like calling a Clos a spine and leaf, we tend to misuse the word nonblocking in ways that are unhelpful. Hence, it is time for a short explanation of the two concepts that might help clear up the confusion. To get there, we need a network, preferably a spine and leaf like the one shown below.
Based on the design of this fabric, is it nonblocking? It would certainly seem so at first blush. Assume every link is 10g, just to make the math easy, and ignore the ToR to server links, as these are not technically a part of the fabric itself. Assume the following four 10g flows are set up—
B through [X1,Y1,Z2] towards A
C through [X1,Y2,Z2] towards A
D through [X1,Y3,Z2] towards A
E through [X1,Y4,Z2] towards A
As there are four different paths between these four servers (B through E) and Z2, which serves as the ToR for A, all 40g of traffic can be delivered through the fabric without dropping or queuing a single packet (assuming, of course, that you can carry the traffic optimally, with no overhead, etc.—or reducing the four 10g flows slightly so they can all be carried over the network in this way). Hence, the fabric appears to be nonblocking.
What happens, however, if F transmits another 10g of traffic towards A at the X4 ToR? Again, even disregarding the link between Z2 and A, the fabric itself cannot carry 50g of data to Z2; the fabric must now block some traffic, either by dropping it, or by queuing it for some period of time. In packet switched networks, this kind of contention is always possible.
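The arithmetic behind this example can be sketched in a few lines of Python. This is only a minimal sketch: the figure is not reproduced here, so it assumes each spine (Y1 through Y4) has exactly one 10g link down to Z2, and it assumes F's flow happens to land on Y1. The names `LINK_CAPACITY_G`, `load_per_link`, and `overloaded_links` are made up for illustration.

```python
from collections import defaultdict

LINK_CAPACITY_G = 10  # every fabric link is 10g, as in the text

# (source host, path through the fabric, offered rate in g)
flows = [
    ("B", ["X1", "Y1", "Z2"], 10),
    ("C", ["X1", "Y2", "Z2"], 10),
    ("D", ["X1", "Y3", "Z2"], 10),
    ("E", ["X1", "Y4", "Z2"], 10),
    ("F", ["X4", "Y1", "Z2"], 10),  # assumption: F's flow lands on Y1
]


def load_per_link(flows):
    """Sum the offered load on each directed link along every flow's path."""
    load = defaultdict(int)
    for _source, path, rate in flows:
        for hop, next_hop in zip(path, path[1:]):
            load[(hop, next_hop)] += rate
    return dict(load)


def overloaded_links(flows, capacity=LINK_CAPACITY_G):
    """Return the links whose offered load exceeds the link capacity."""
    return {link: g for link, g in load_per_link(flows).items() if g > capacity}
```

With only the first four flows, `overloaded_links(flows[:4])` is empty; once F is added, the Y1 to Z2 link is offered 20g against 10g of capacity. Whichever spine F's traffic uses, 50g offered against 40g of capacity into Z2 leaves at least one link overloaded.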
You can design the fabric and the applications, the entire network-as-a-system, to reduce contention through the intelligent use of traffic engineering and admission policies (such as bandwidth calendaring). You can also manage contention through QoS policies and flow control mechanisms that signal senders to slow down when the network is congested.
But you cannot build a nonblocking packet switched network of any realistic size or scope. You can, of course, build a network that has two hosts, and enough bandwidth to support the maximum bandwidth of both hosts. But when you attach a third host, and then a fourth, etc., building a nonblocking fabric in a packet switched network becomes a problem; it is always possible for two sources to “gang up” on a single destination, overwhelming the capacity of the network.
It is possible, of course, to build a nonblocking fabric—so long as you use a synchronous or circuit switched network. The concept of a nonblocking network, in fact, comes out of the telephone world, where each user must only connect with one other user, and each user uses and has a fixed amount of bandwidth. In this world, it is possible to build a true nonblocking network fabric.
In the world of packet switching, the closest we can come is a noncontending network, which means every host can (in theory) send to every other host on the network at full rate. From there, it is up to the layout of the workloads on the fabric, and the application design, to reduce contention to the point where no blocking is taking place.
This is the kind of content that will be available in the new book Ethan and I are writing, which should be published around January of 2018, if the current schedule holds.
Note to readers: From time to time I like to highlight solid case studies and design guides in the network engineering space; you can find past highlighted resources under design/resources in the menu.
Data center fabrics are built today using spine and leaf fabrics, lots of fiber, and a lot of routers. There has been a lot of research in all-optical solutions to replace current designs with something different; MegaSwitch is a recent paper that illustrates the research, and potentially a future trend, in data center design. The basic idea is this: give every host its own fiber in a ring that reaches to every other host. Then use optical multiplexers to pull off the signal from each ring any particular host needs in order to provide a switchable set of connections in near real time. The figure below will be used to explain.
In the illustration, there are four hosts, each of which is connected to an electrical switch (EWS). The EWS, in turn, connects to an optical switch (OWS). The OWS channels the outbound (transmitted) traffic from each host onto a single ring, where it is carried to every other OWS in the network. The optical signal is terminated at the hop before the transmitter to prevent any loops from forming (so A’s optical signal is terminated at D, for instance, assuming the ring runs clockwise in the diagram).
The receive side is where things get interesting; there are four full fibers feeding a single fiber towards the server, so it is possible for four times as much information to be transmitted towards the server as the server can receive. The reality is, however, that not every server needs to talk to every other server all the time; some form of switching seems to be in order to only carry the traffic towards the server from the optical rings.
To support switching, the OWS is dynamically programmed to only pull traffic from rings the attached host is currently communicating with. The OWS takes the traffic for each server sending to the local host, multiplexes it onto an optical interface, and sends it to the electrical switch, which then sends the correct information to the attached host. The OWS can increase the bandwidth between two servers by assigning more wavelengths on the OWS to EWS link to traffic being pulled off a particular ring, and reduce available bandwidth by assigning fewer wavelengths.
There are a number of possible problems with such a scheme; for instance—
When a host sends its first packet to another host, or needs to send just a small stream, there is a massive amount of overhead in time and resources setting up a new wavelength allocation at the correct OWS. To resolve these problems, the researchers propose having a full mesh of connectivity at some small portion of the overall available bandwidth; they call this basemesh.
This arrangement allows for bandwidth allocation at a per pair of hosts level, but much of the modern data networking world operates on a per flow basis. The researchers suggest this can be resolved by using the physical connectivity as a base for building a set of virtual LANs, and packets can be routed between these various vLANs. This means that traditional routing must stay in place to actually direct traffic to the correct destination in the network, so the EWS devices must either be routers, or there must be some centralized virtual router through which all traffic passes.
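To make the wavelength assignment and basemesh ideas described above a little more concrete, here is a minimal sketch in Python. It is not the algorithm from the MegaSwitch paper; the proportional sharing policy, the function name `allocate_wavelengths`, and its parameters are all assumptions made purely for illustration. Each sending ring keeps a small fixed basemesh allocation on the OWS to EWS link, and the remaining wavelengths are divided in proportion to current demand.

```python
def allocate_wavelengths(demand_g, total_wavelengths, basemesh=1):
    """Split the wavelengths on one OWS-to-EWS link among sending rings.

    demand_g maps each remote sender to its current demand toward the
    local host. Every sender keeps `basemesh` wavelengths; the rest are
    shared in proportion to demand (largest remainder rounding).
    This policy is hypothetical, not taken from the paper.
    """
    senders = sorted(demand_g)
    reserved = basemesh * len(senders)
    if reserved > total_wavelengths:
        raise ValueError("not enough wavelengths for the basemesh")

    allocation = {s: basemesh for s in senders}
    spare = total_wavelengths - reserved
    total_demand = sum(demand_g.values())
    if spare == 0 or total_demand == 0:
        return allocation

    # Ideal fractional share of the spare wavelengths for each sender.
    ideal = {s: spare * demand_g[s] / total_demand for s in senders}
    for s in senders:
        allocation[s] += int(ideal[s])  # whole wavelengths first

    # Hand out what rounding left over, largest fractional part first.
    leftover = spare - sum(int(ideal[s]) for s in senders)
    by_remainder = sorted(
        senders, key=lambda s: ideal[s] - int(ideal[s]), reverse=True
    )
    for s in by_remainder[:leftover]:
        allocation[s] += 1
    return allocation
```

For example, with eleven wavelengths and demands of 30g, 10g, and 0g from three senders, the idle sender keeps only its single basemesh wavelength, so a first packet from it can still be delivered without waiting for a new allocation.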
Is something like MegaSwitch the future of data center networks? Right now it is hard to tell. All optical fabrics have been a recurring idea in network design, but never seem to have “broken out” as a preferred solution. The idea is attractive, but what essentially amounts to a variable speed optical underlay combined with a more traditional routed overlay seems to add a lot of complexity into the mix, and it is hard to say if the complexity is really worth the tradeoff, which primarily seems to be simpler and cheaper cabling.
Perfect and good: one is just an extension of the other, right?
When I was 16 (a long, long, long time ago), I was destined to be a great graphis—a designer and/or illustrator of some note. Things didn’t turn out that way, of course, but the why is a tale for another day. At any rate, in art class that year, I took an old four foot spool end, stretched canvas across it, and painted a piece in acrylic. The painting was a beach sunset, the sun’s oblong shape offsetting the round of the overall painting, with deep reds and yellows in streaks above the beach, which was dark. I painted the image as if the viewer were standing just on the break at the top of the beach, so there was a bit of sea grass scattered around to offset the darkness of the beach.
And, along one side, a rose.
I really don’t know why I included the rose; I think I just wanted to paint one for some reason, and it seemed like a good idea to combine the ideas (the sunset on the beach and the rose). I entered this large painting in a local art contest, and won… nothing.
In discussion with my art teacher later on, I queried her about why she thought the piece had not won anything. It was well done, probably technically one of the best acrylic pieces I’d ever done to that point in my life. It was striking in its size and execution; there are rarely round paintings of any size, much less of that size.
She said: “The rose.”
It wasn’t that roses don’t grow on the break above a beach. Art, after all, often involves placing things that do not belong together, together, to make a point, or to draw the viewer into the image. It wasn’t the execution; the rose was vivid in its softness, basking in the red and yellow golden hour of sunset. Then why?
My art teacher explained that I had put too much that was good into the painting. The sunset was perfect, the rose was perfect. Between the two, however, the viewer didn’t know where to focus; it was just “too much.” In other words: Because the perfect is the enemy of the good.
This is not always true, of course. Sometimes the perfect is just the perfect, and that’s all there is to it. But sometimes, far too often in fact, when we seek perfection, we fail to consider the tradeoffs, we fail to count the costs, and our failures turn into a failed system, often in ways we do not expect.
There are many examples in the networking industry, of course. IPv6, LISP, eVPN, TRILL, and a thousand other protocols have been designed, and either struggle to take off or do not take off at all. Perhaps putting this in more distinct terms might be helpful.
Perhaps, if you’ve read my work for very long, you’ve seen this diagram before—or one that is very similar. The point of this diagram is that as you move towards reducing state, you move away from the minimal use of resources and the minimal set of interaction surfaces as a matter of course. There is no way to reach a point where there is minimal state in a network without incurring higher costs in some other place, either in resource consumption (such as bandwidth used) or in interaction surfaces (such as having more than one control plane, or using manual configurations throughout the network). Seeking the perfect in one realm will cause you to lose balance; the perfect is the enemy of the good. Here is another diagram that might look familiar to my long time readers—
We often don’t think we’ve reached perfection until we’ve reached the far right side of this chart. Most of the time, when we reach “perfection,” we’ve actually gone way past the point where we are getting any return on our effort, and way into the robust yet fragile part of the chart.
The bottom line?
Remember that you need to look for the tradeoffs, and try to be conscious about where you are in the various complexity scales. Don’t try to make it perfect; try to make it work, try to make it flexible, and try to learn when you’ve gained what can be gained without risking ossification, and hence brittleness.
Your first line of defense to any DDoS, at least on the network side, should be to disperse the traffic across as many resources as you can. Basic math implies that if you have fifteen entry points, and each entry point is capable of supporting 10g of traffic, then you should be able to simply absorb a 100g DDoS attack while still leaving 50g of overhead for real traffic (assuming perfect efficiency, of course—YMMV). Dispersing a DDoS in this way may impact performance—but taking bandwidth and resources down is almost always the wrong way to react to a DDoS attack.
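The basic math here is simple enough to sketch in Python. This is a back of the envelope calculation only; it assumes the attack spreads perfectly evenly across every entry point, which real traffic rarely does, and the function names are made up for illustration.

```python
def edge_headroom_g(entry_points, capacity_per_entry_g, attack_g):
    """Bandwidth left for real traffic if the attack is fully absorbed.

    A negative result means the attack is larger than the total edge
    bandwidth, so dispersal alone cannot absorb it.
    """
    total_g = entry_points * capacity_per_entry_g
    return total_g - attack_g


def attack_share_per_entry_g(entry_points, attack_g):
    """Attack load on each entry point under perfectly even dispersal."""
    return attack_g / entry_points
```

With fifteen 10g entry points and a 100g attack, `edge_headroom_g(15, 10, 100)` gives the 50g described above, with each entry point absorbing a little under 7g of attack traffic. With only two 10g entry points the result is negative, which is the case where dispersal is no longer enough.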
But what if you cannot, for some reason, disperse the attack? Maybe you only have two edge connections, or the size of the DDoS is larger than your total edge bandwidth combined. It is typically difficult to mitigate a DDoS attack, but there is an escalating chain of actions you can take that often prove useful. Let’s deal with local mitigation techniques first, and then consider some fancier methods.
TCP SYN filtering: A lot of DDoS attacks rely on exhausting TCP open resources. If all inbound TCP sessions can be terminated in a proxy (such as a load balancer), the proxy server may be able to screen out half open and poorly formed TCP open requests. Some routers can also be configured to hold TCP SYNs for some period of time, rather than forwarding them on to the destination host, in order to block half open connections. This type of protection can be put in place long before a DDoS attack occurs.
Limiting Connections: It is likely that DDoS sessions will be short lived, while legitimate sessions will be longer lived. The difference may be a matter of seconds, or even milliseconds, but it is often enough to be a detectable difference. It might make sense, then, to prefer existing connections over new ones when resources start to run low. Legitimate users may wait longer to connect when connections are limited, but once they are connected, they are more likely to remain connected. Application design is important here, as well.
Aggressive Aging: In cache based systems, one way to free up depleted resources quickly is to simply age them out faster. The length of time a connection can be held open can often be dynamically adjusted in applications and hosts, allowing connection information to be removed from memory faster when there are fewer connection slots available. Again, this might impact live customer traffic, but it is still a useful technique when in the midst of an actual attack.
Blocking Bogon Sources: While there is a well known list of bogon addresses (address blocks that should never be routed on the global ‘net), these lists should be taken as a starting point, rather than as an ending point. Constant monitoring of traffic patterns on your edge can give you a lot of insight into what is “normal” and what is not. For instance, if your highest rate of traffic normally comes from South America, and you suddenly see a lot of traffic coming from Australia, either you’ve gone viral, or this is the source of the DDoS attack. It isn’t always useful to block all traffic from a region, or a set of source addresses, but it is often useful to use the techniques listed above more heavily on traffic that doesn’t appear to be “normal.”
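Two of the techniques above, preferring existing connections and aggressive aging, can be sketched together in a few lines of Python. This is a toy model, not how any particular router, load balancer, or host implements it; the class name `ConnectionTable`, the linear scaling of the timeout, and the default timeout values are all assumptions chosen for illustration.

```python
class ConnectionTable:
    """Toy connection table: the idle timeout shrinks as the table fills."""

    def __init__(self, max_slots, base_timeout_s=60.0, min_timeout_s=2.0):
        self.max_slots = max_slots
        self.base_timeout_s = base_timeout_s
        self.min_timeout_s = min_timeout_s
        self.last_seen = {}  # connection id -> time of last activity

    def idle_timeout_s(self):
        """Scale the timeout linearly with the fraction of free slots."""
        free_fraction = 1.0 - len(self.last_seen) / self.max_slots
        return max(self.min_timeout_s, self.base_timeout_s * free_fraction)

    def age_out(self, now):
        """Drop connections idle longer than the current timeout."""
        timeout = self.idle_timeout_s()
        stale = [c for c, seen in self.last_seen.items() if now - seen > timeout]
        for conn_id in stale:
            del self.last_seen[conn_id]
        return stale

    def touch(self, conn_id, now):
        """Record activity on a connection.

        Existing connections are always kept; a new connection is only
        admitted if a slot is free after aging out idle entries.
        """
        if conn_id not in self.last_seen:
            self.age_out(now)
            if len(self.last_seen) >= self.max_slots:
                return False  # refuse the new connection, keep existing ones
        self.last_seen[conn_id] = now
        return True
```

When the table is nearly empty, idle connections are held for the full base timeout; as it fills, the timeout drops towards the minimum, so short lived or half open sessions are cleared quickly while active sessions keep refreshing their entries.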
There are, of course, other techniques you can deploy against DDoS attacks, but at some point, you are just not going to have the expertise or time to implement every possible counter. This is where appliance and cloud based services come into play. There are a number of appliance based solutions out there to scrub traffic coming across your links, such as those made by Arbor. The main drawback to these solutions is they scrub the traffic after it has passed over the link into your network. This problem can often be resolved by placing the appliance in a colocation facility and directing your traffic through the colo before it reaches your inbound network link.
There is one open source DDoS scrubbing option in this realm, as well, which uses a combination of FastNetMon, InfluxDB, Grafana, Redis, Morgoth, and Bird to create a solution you can run locally on a spun up VM, or even bare metal on a self built appliance wired in between your edge router and the rest of the network (in the DMZ). This option is well worth looking at, if not to deploy, then to better understand how the kind of dynamic filtering performed by commercially available appliances works.
If the DDoS must be stopped before it reaches your edge link, and you simply cannot handle the volume of the attacks, then the best solution might be a cloud based filtering solution. These tend to be expensive, and they also tend to increase latency for your “normal” user traffic in some way. The way these normally work is the DDoS mitigation provider advertises your routes, or redirects your DNS address to their servers. This draws all your inbound traffic into their network, where it is scrubbed using advanced techniques. Once the traffic is scrubbed, it is either tunneled or routed back to your network (depending on how it was captured in the first place). Most large providers offer scrubbing services, and there are several public offerings available independent of any upstream you might choose (such as Verisign’s line of services).