Learning to Ask Questions

One thing I’m often asked in email and in person is: why should I bother learning theory? After all, you don’t install SPF in your network; you install a router or switch, which you then configure OSPF or IS-IS on. The SPF algorithm is not exposed to the user, and does not seem to really have any impact on the operation of the network. Such internal functionality might be neat to know, but ultimately–who cares? Maybe it will be useful in some projected troubleshooting situation, but the key to effective troubleshooting is understanding the output of the device, rather than in understanding what the device is doing.

In other words, there is no reason to treat network devices as anything more than black boxes. You put some stuff in, other stuff comes out, and the vendor takes care of everything in the middle. I dealt with a related line of thinking in this video, but what about this black box argument? Do network engineers really need to know what goes on inside the vendor’s black box?

Let me answer this question with another question. When you shift to a new piece of hardware, how do you know what you are trying to configure? Suppose, for instance, that you need to set up BGP route reflectors on a new device, and need to make certain optimal paths are taken from eBGP edge to eBGP edge. What configuration commands would you look for? If you knew BGP as a protocol, you might be able to find the right set of commands without needing to search the documentation, or do an internet search. Knowing how it works can often lead you to knowing where to look and what the commands might be. This can save a tremendous amount of time.

Back up from configuration to making equipment purchasing decisions, or specifying equipment. Again, rather than searching the documentation randomly, if you know what protocol level feature you need the software to implement, you can search for the specific support you are looking for, and know what questions to ask about the possible limitations.

And again, from a more architectural perspective–how do you know what protocol to specify to solve any particular problem if you don’t understand how the protocols actually work?

So from configuration to architecture, knowing how a protocol works can actually help you work faster and smarter by helping you ask the right questions. Just another reason to actually learn the way protocols work, rather than just how to configure them.

Applying Software Agility to Network Design

The paper we are looking at in this post is tangential to the world of network engineering, rather than being directly targeted at network engineering. The thesis of On Understanding Software Agility—A Social Complexity Point of View, is that at least some elements of software development are a wicked problem, and hence need to be managed through complexity. The paper sets the following criteria for complexity—

  • Interaction: made up of a lot of interacting systems
  • Autonomy: subsystems are largely autonomous within specified bounds
  • Emergence: global behavior is unpredictable, but can be explained in subsystem interactions
  • Lack of equilibrium: events prevent the system from reaching a state of equilibrium
  • Nonlinearity: small events cause large output changes
  • Self-organization: self-organizing response to disruptive events
  • Co-evolution: the system and its environment adapt to one another

It’s pretty clear network design and operation would fit into the 7 points made above; the control plane, transport protocols, the physical layer, hardware, and software are all subsystems of an overall system. Between these subsystems, there is clearly interaction, and each subsystem acts autonomously within bounds. The result is a set of systemic behaviors that cannot be predicted from examining the system itself. The network design process is, itself, also a complex system, just as software development is.

Trying to establish computing as an engineering discipline led people to believe that managing projects in computing is also an engineering discipline. Engineering is for the most part based on Newtonian mechanics and physics, especially in terms of causality. Events can be predicted, responses can be known in advance and planning and optimize for certain attributes is possible from the outset. Effectively, this reductionist approach assumes that the whole is the sum of the parts, and that parts can be replaced wherever and whenever necessary to address problems. This machine paradigm necessitates planning everything in advance because the machine does not think. This approach is fundamentally incapable of dealing with the complexity and change that actually happens in projects.

In a network, some simple input into one subsystem of the network can cause major changes in the overall system state. The question is: how should engineers deal with this situation? One solution is to try to nail each system down more precisely, such as building a “single source of truth,” and imposing that single view of the world onto the network. The theory is to change control, ultimately removing the lack of equilibrium, and hence reducing complexity. Or perhaps we can centralize the control plane, moving all the complexity into a single point in the network, making it manageable. Or maybe we can automate all the complexity out of the network, feeding the network “intent,” and having, as a result, a nice clean network experience.

Good luck slaying the complexity dragon in any of these ways. What, then, is the solution? According to Pelrine, the right solution is to replace our Newtonian view of software development with a different model. To make this shift, the author suggests moving to the complexity quadrant of the Cynefin framework of complexity.

The correct way to manage software development, according to Pelrine, is to use a probe/sense/respond model. You probe the software through testing iteratively as you build it, sensing the result, and then responding to the result through modification, etc.

Application to network design

The same process is actually used in the design of networks, once beyond the greenfield stage or initial deployment. Over time, different configurations are tested to solve specific problems, iteratively solving problems while also accruing complexity and ossifying. The problem with network designs is much the same as it is with software projects: the resulting network is never “torn down” and rethought from the ground up. The result seems to be that networks will become more complex over time, until they either fail, or they are replaced because the business fails, or some other major event occurs. There needs to be some way to combat this, but how?

The key counter to this process is modularization. By modularizing, which is containing complexity within bounds, you can build using the probe/sense/respond model. There will still be changes in the interaction between the modules within the system, but so long as the “edges” are held constant, the complexity can be split into two domains: within the module and outside the module. Hence the rule of thumb: separate complexity from complexity.

The traditional hierarchical model, of course, provides one kind of separation by splitting control plane state into smaller chunks, potentially at the cost of optimal traffic flow (remember the state/optimization/surface trilemma here, as it describes the trade-offs). Another form of separation is virtualization, with the attendant costs of interaction surfaces and optimization. A third sort of separation is to split policy from reachability, which attempts to preserve optimization at the cost of interaction surfaces. Disaggregation provides yet another useful separation between complex systems, separating software from hardware, and (potentially) the control plane from the network operating system, and even the monitoring software from the network operating system.
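The first trade-off in the paragraph above, less control plane state at the cost of optimal traffic flow, can be made concrete with a toy example. This is only a sketch: the prefixes, the two border routers, and the costs are invented for illustration, and real routing protocols choose paths in more involved ways.

```python
import ipaddress

# Hypothetical module: four /24s reachable through two border routers.
# The costs are made-up numbers chosen only to show the trade-off.
COST_VIA = {
    "10.1.0.0/24": {"border1": 10, "border2": 30},
    "10.1.1.0/24": {"border1": 10, "border2": 30},
    "10.1.2.0/24": {"border1": 30, "border2": 10},
    "10.1.3.0/24": {"border1": 30, "border2": 10},
}


def with_specifics(cost_via):
    """Outside routers see every prefix and pick the cheaper border for each."""
    routes = len(cost_via)
    total_cost = sum(min(costs.values()) for costs in cost_via.values())
    return routes, total_cost


def with_aggregate(cost_via, chosen_border):
    """Outside routers see one aggregate and send everything to one border."""
    networks = [ipaddress.ip_network(prefix) for prefix in cost_via]
    aggregates = list(ipaddress.collapse_addresses(networks))
    total_cost = sum(costs[chosen_border] for costs in cost_via.values())
    return aggregates, total_cost


routes, best_cost = with_specifics(COST_VIA)
aggregates, aggregate_cost = with_aggregate(COST_VIA, "border1")
print(routes, best_cost)                # 4 routes, total cost 40
print(len(aggregates), aggregate_cost)  # 1 route, total cost 80
```

Hiding the four specifics behind a single 10.1.0.0/22 cuts the state seen outside the module from four routes to one, but half the traffic now enters through the more expensive border.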

These types of modularization can be used together, of course; topological regions of the network can be separated via control plane state choke points, while either virtualization or splitting policy from reachability can be used within a module to separate complexity from complexity within a module. The amount and kind of separations deployed is entirely dependent on specific requirements as well as the complexity of the overall network. The more complex the overall network is, the more kinds of separation that should be deployed to contain complexity into manageable chunks where possible.

Each module, then, can be replaced with a new one, so long as it provides the same set of services, and any changes in the edge are manageable. Each module can be developed iteratively, by making changes (probing), sensing (measuring the result), and then adjusting the module according to whether or not it fits the requirements. This part would involve using creative destruction (the chaos monkey) as a form of probing, to see how the module and system react to controlled failures.
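One way to picture the probe/sense/respond loop on a single module is the sketch below. Everything in it is an assumption made for illustration: the `Module` type, the idea that the probe is a controlled failure of one uplink, and the rule that the response is to add an uplink. A real module would be probed and measured in far richer ways.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Module:
    """Hypothetical network module described only by its uplinks."""
    uplinks: int
    uplink_gbps: int = 10


def probe(module):
    """Controlled failure: take one uplink out of service."""
    return replace(module, uplinks=module.uplinks - 1)


def sense(degraded):
    """Measure the capacity that survives the failure."""
    return degraded.uplinks * degraded.uplink_gbps


def respond(module):
    """Adjust the module: here, simply add one uplink."""
    return replace(module, uplinks=module.uplinks + 1)


def grow(module, required_gbps, max_rounds=10):
    """Repeat probe/sense/respond until the module meets the requirement."""
    for round_number in range(1, max_rounds + 1):
        surviving = sense(probe(module))
        if surviving >= required_gbps:
            return module, round_number
        module = respond(module)
    raise RuntimeError("requirement not met within max_rounds")


final, rounds = grow(Module(uplinks=2), required_gbps=30)
print(final.uplinks, rounds)  # 4 3
```

The edge of the module (what it offers to the rest of the network) never changes in this loop; only the inside of the module does, which is the point of separating complexity from complexity.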

Nice Theory, but So What?

This might all seem theoretical, but it is actually extremely practical. Getting out of the traditional model of network design, where the configuration is fixed, there is a single source of truth for the entire network, the control plane is tied to the software, the software is tied to the hardware, and policy is tied to the control plane, can open up new ways to build massive networks against very complex requirements while managing the complexity and the development and deployment processes. This means shifting from a mindset of controlling complexity by nailing everything down to a single state to a mindset of managing complexity by finding logical separation points and building in blocks, then “growing” each module using the appropriate process, whether iterative or waterfall.

Even if scale is not the goal of your network (you “only” have a couple of hundred network devices, say), these principles can still be applied. First, complexity is not really about scale; it is about requirements. A car is not really any less complex than a large truck, and a motor home (or camper) is likely more complex than either. The differences are not in scale, but in requirements. Second, these principles still apply to smaller networks; the primary question is which forms of separation to deploy, rather than whether complexity needs to be separated from complexity.

Moving to this kind of design model could revolutionize the thinking of the network engineering world.

If you haven’t found the tradeoff…

This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:

Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)

In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.

Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.

Cross posted at CircleID

But is either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity: is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs to consider here, as it turns out; it just sometimes takes a little out of the box thinking to find them.

What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this, right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.

Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.

What about routing protocols? The standards communities seem focused on creating and maintaining a small handful of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol, because there is enough to know in this one protocol to spend a lifetime learning it.

In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.

The way to avoid landing on either extreme (a world where every device is capable of anything but cannot be understood by even the smartest of engineers, or a world where every device is uniquely designed to fit its environment and no other device will do) is to consider the tradeoffs.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.

What Kind of Design?

In this short video I work through two kinds of design, or two different ways of designing a network. Which kind of designer are you? Do you see one as better than the other? Which would you prefer to do, and which are you doing right now?

Complexity and the Thin Waist

In recent years, we have become accustomed to (and often accosted by) the phrase software eats the world. It’s become a mantra in the networking world that software defined is the future, full stop. This research paper by Microsoft, however, tells a different story. According to Baumann, hardware is the new software. Or, to put it differently, even as software eats the world, hardware is taking over an ever increasing amount of the functionality software is doing. In showing this point, the paper also points out the complexity problems involved in dissolving the thin waist of an architecture.

The specific example used in the paper is the Intel x86 Instruction Set Architecture (ISA). Many years ago, when I was a “youngster” in the information technology field, there were a number of different processor platforms; the processor wars waged in full. There were, primarily, the x86 platform, by Intel, beginning with the 8086, and its subsequent generations, the 8088, 80286, 80386, then the Pentium, etc. On the other side of the world, there were the RISC based processors, the kind stuffed into Apple products, Cisco routers, and Sun Sparc workstations (like the one that I used daily while in Cisco TAC). The argument between these two came down to this: the Intel x86 ISA was messy, somewhat ad-hoc, difficult to work with, sometimes buggy, and provided a lot of functionality at the chip level. The RISC based processors, on the other hand, had a much simpler ISA, leaving more to software.

Twenty years on, and this argument is no longer even an argument. The x86 ISA won hands down. I still remember when the Apple and Cisco product lines moved from RISC based processors to x86 platforms, and when AMD first started making x86 “lookalikes” that could run software designed for an Intel processor.

But remember this—The x86 ISA has always been about the processor taking in more work over time. Baumann’s paper is, in fact, an overview of this process, showing the amount of work the processor has been taking on over time. The example he uses is the twelve new instructions added to the x86 ISA by Intel in 2015–2016 around software security. Twelve new instructions might not sound like a lot, but these particular instructions necessitate the creation of entirely new registers in the CPU, a changed memory page table format, new stack structures (including a new shadow stack that sounds rather complex), new exception handling processes (for interrupts), etc.

The primary point of much of this activity is to make it possible for developers to stop trusting software for specific slices of security, and start trusting hardware, which cannot (in theory) be altered. Of course, so long as the hardware relies on microcode and has some sort of update process, there is always the possibility of an attack on the software embedded in the hardware, but we will leave this problematic point aside for the moment.

Why is Intel adding more to the hardware? According to Baumann—

The slowing pace of Moore’s Law will make it harder to sell CPUs: absent improvements in microarchitecture, they won’t be substantially faster, nor substantially more power efficient, and they will have about the same number of cores at the same price point as prior CPUs. Why would anyone buy a new CPU? One reason to which Intel appears to be turning is features: if the new CPU implements an important ISA extension—say, one required by software because it is essential to security—consumers will have a strong reason to upgrade.

If we see the software market as composed of layers, with applications on top, and the actual hardware on the bottom, we can see the ISA is part of the thin waist in software development. Baumann says: “As the most stable “thin waist” interface in today’s commodity technology stack, the x86 ISA sits at a critical point for many systems.”

The thin waist is very important in the larger effort to combat complexity. Having the simplest thin waist available is one of the most important things you can do in any system to reduce complexity. What is happening here is that the industry went from having many different ISA options to having one dominant ISA. A thin waist, however, always implies commoditization, which in turn means falling profits. So if you are the keeper of the thin waist, you either learn to live with less profit, you branch out into other areas, or you learn to broaden the waist. Intel is doing all of these things, but broadening the waist is definitely on the table.

If the point about complexity is true, then broadening the waist should imply an increase in complexity. Baumann points this out specifically. In the section on implications, he notes a lot of things that relate to complexity, such as increased interactions (surfaces!), security issues, and slower feature cycles (state!). Sure enough, broadening the thin waist always increases complexity, most of the time in unexpected ways.

So—what is the point of this little trip down memory lane (or the memory hole, perhaps)? This all directly applies to network and protocol design. The end-to-end principle is one of the most time tested ideas in protocol design. Essentially, the network itself should be a thin waist in the application flow, providing minimal services and changing little. Within the network, there should be a set of thin waists where modules meet; choke points where policy can be implemented, state can be reduced, and interaction surfaces can be clearly controlled.

We have, alas, abandoned all of this in our drive to make the network the ne plus ultra of all things. Want a layer two overlay? No problem. Want to stretch the layer two overlay between continents? Well, we can do that too. Want it all to converge really fast, and have high reliability? We can shove ISSU at the problem, and give you no downtime, no dropped packets, ever.

As a result, our networks are complex. In fact, they are too complex. And now we are talking about pushing AI at the problem. The reality is, however, we would be much better off if we actually carefully considered each new thing we throw into the mix of the thin waist of the network itself, and of our individual networks.

The thin waist is always tempting to broaden out—it is so simple, and it seems so easy to add functionality to the thin waist to get where we want to go. The result is never what we think it will be, though. The power of unintended consequences will catch up with us at some point or another.

Like… right about the time you decide to try to use asymmetric traffic flows across a network device that keeps state.

Make the waist thin, and the complexity becomes controllable. This is a principle we have yet to learn and apply in our network and protocol designs.

Nonblocking versus Noncontending

“We use a nonblocking fabric…”

Probably not. Nonblocking is a word that is thrown around a lot, particularly in the world of spine and leaf fabric design—but, just like calling a Clos a spine and leaf, we tend to misuse the word nonblocking in ways that are unhelpful. Hence, it is time for a short explanation of the two concepts that might help clear up the confusion. To get there, we need a network—preferably a spine and leaf like the one shown below.

Based on the design of this fabric, is it nonblocking? It would certainly seem so at first blush. Assume every link is 10g, just to make the math easy, and ignore the ToR to server links, as these are not technically a part of the fabric itself. Assume the following four 10g flows are set up—

  • B through [X1,Y1,Z2] towards A
  • C through [X1,Y2,Z2] towards A
  • D through [X1,Y3,Z2] towards A
  • E through [X1,Y4,Z2] towards A

As there are four different paths between these four servers (B through E) and Z2, which serves as the ToR for A, all 40g of traffic can be delivered through the fabric without dropping or queuing a single packet (assuming, of course, that you can carry the traffic optimally, with no overhead, etc.—or reducing the four 10g flows slightly so they can all be carried over the network in this way). Hence, the fabric appears to be nonblocking.

What happens, however, if F transmits another 10g of traffic towards A at the X4 ToR? Again, even disregarding the link between Z2 and A, the fabric itself cannot carry 50g of data to Z2; the fabric must now block some traffic, either by dropping it, or by queuing it for some period of time. In packet switched networks, this possibility always exists.
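The arithmetic in this example is simple enough to write down. The sketch below assumes, as the flow list suggests, four spines (Y1 through Y4), each with a single 10g link to Z2; the function names are just for illustration.

```python
LINK_GBPS = 10  # every fabric link is 10g, as in the example


def fabric_capacity_to_leaf(num_spines, link_gbps=LINK_GBPS):
    """Most the fabric can deliver to one ToR: one link from each spine."""
    return num_spines * link_gbps


def blocked_gbps(flows_gbps, num_spines):
    """Traffic that must be dropped or queued when every flow targets one ToR."""
    offered = sum(flows_gbps)
    return max(0, offered - fabric_capacity_to_leaf(num_spines))


four_flows = [10, 10, 10, 10]   # B, C, D, and E towards A
five_flows = four_flows + [10]  # F joins in

print(blocked_gbps(four_flows, num_spines=4))  # 0
print(blocked_gbps(five_flows, num_spines=4))  # 10
```

With four senders the fabric looks nonblocking; the fifth sender is all it takes to show that it is not.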

Hence, you can design the fabric and the applications, the entire network-as-a-system, to reduce contention through the intelligent use of traffic engineering and admission policies (such as bandwidth calendaring). You can also manage contention through QoS policies and through flow control mechanisms that signal senders to slow down when the network is congested.

But you cannot build a nonblocking packet switched network of any realistic size or scope. You can, of course, build a network that has two hosts, and enough bandwidth to support the maximum bandwidth of both hosts. But when you attach a third host, and then a fourth, etc., building a nonblocking fabric in a packet switched network becomes problematic; it is always possible for two sources to “gang up” on a single destination, overwhelming the capacity of the network.

It is possible, of course, to build a nonblocking fabric—so long as you use a synchronous or circuit switched network. The concept of a nonblocking network, in fact, comes out of the telephone world, where each user must only connect with one other user, and each user uses and has a fixed amount of bandwidth. In this world, it is possible to build a true nonblocking network fabric.

In the world of packet switching, the closest we can come is a noncontending network, which means every host can (in theory) send to every other host on the network at full rate. From there, it is up to the layout of the workloads on the fabric, and the application design, to reduce contention to the point where no blocking is taking place.
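A rough way to check for the noncontending property at a single ToR is to compare host-facing bandwidth with fabric-facing bandwidth. Treating "noncontending" as an oversubscription ratio of 1:1 or better is my working assumption for this sketch, and the port counts are invented.

```python
def oversubscription(host_ports, host_gbps, uplinks, uplink_gbps):
    """Ratio of host-facing bandwidth to fabric-facing bandwidth at one ToR."""
    return (host_ports * host_gbps) / (uplinks * uplink_gbps)


def is_noncontending(host_ports, host_gbps, uplinks, uplink_gbps):
    """True when every attached host could send into the fabric at full rate."""
    return oversubscription(host_ports, host_gbps, uplinks, uplink_gbps) <= 1.0


def excess_at_destination(senders, host_gbps):
    """Traffic a single destination host link cannot absorb at full sender rate."""
    return max(0, senders * host_gbps - host_gbps)


print(is_noncontending(4, 10, 4, 10))  # True: 1:1
print(is_noncontending(8, 10, 4, 10))  # False: 2:1
print(excess_at_destination(2, 10))    # 10
```

The last line is the "gang up" case: even a 1:1 fabric cannot stop two senders from overrunning one receiver, which is why workload placement and application design still matter.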

This is the kind of content that will be available in the new book Ethan and I are writing, which should be published around January of 2018, if the current schedule holds.

Design Resource: Shared Workspace Infrastructure

This white paper outlines solutions that can provide secure connectivity for Public Sector agencies over shared wired and wireless network infrastructures. This guide is targeted at network professionals and other personnel who assist in the design of Public Sector office networks and complements the design patterns and principles issued by GDS Common Technology Services (CTS).

This design guide includes—

  • MPLS over DMVPN
  • Wireless and wired access control
  • 802.1x
  • Federated RADIUS

note to readers: From time to time I like to highlight solid case studies and design guides in the network engineering space; you can find past highlighted resources under design/resources in the menu.

MegaSwitch: an interesting new data center fabric

Data center fabrics are built today using spine and leaf fabrics, lots of fiber, and a lot of routers. There has been a lot of research in all-optical solutions to replace current designs with something different; MegaSwitch is a recent paper that illustrates the research, and potentially a future trend, in data center design. The basic idea is this: give every host its own fiber in a ring that reaches to every other host. Then use optical multiplexers to pull off the signal from each ring any particular host needs in order to provide a switchable set of connections in near real time. The figure below will be used to explain.

In the illustration, there are four hosts, each of which is connected to an electrical switch (EWS). The EWS, in turn, connects to an optical switch (OWS). The OWS channels the outbound (transmitted) traffic from each host onto a single ring, where it is carried to every other OWS in the network. The optical signal is terminated at the hop before the transmitter to prevent any loops from forming (so A’s optical signal is terminated at D, for instance, assuming the ring runs clockwise in the diagram).

The receive side is where things get interesting; there are four full fibers feeding a single fiber towards the server, so it is possible for four times as much information to be transmitted towards the server as the server can receive. The reality is, however, that not every server needs to talk to every other server all the time; some form of switching seems to be in order to only carry the traffic towards the server from the optical rings.

To support switching, the OWS is dynamically programmed to only pull traffic from rings the attached host is currently communicating with. The OWS takes the traffic for each server sending to the local host, multiplexes it onto an optical interface, and sends it to the electrical switch, which then sends the correct information to the attached host. The OWS can increase the bandwidth between two servers by assigning more wavelengths on the OWS-to-EWS link to traffic being pulled off a particular ring, and reduce available bandwidth by assigning fewer wavelengths.

There are a number of possible problems with such a scheme; for instance—

  • When a host sends its first packet to another host, or needs to send just a small stream, there is a massive amount of overhead in time and resources setting up a new wavelength allocation at the correct OWS. To resolve these problems, the researchers propose having a full mesh of connectivity at some small portion of the overall available bandwidth; they call this basemesh.
  • This arrangement allows for bandwidth allocation at a per-pair-of-hosts level, but much of the modern data networking world operates on a per flow basis. The researchers suggest this can be resolved by using the physical connectivity as a base for building a set of virtual LANs, and packets can be routed between these various vLANs. This means that traditional routing must stay in place to actually direct traffic to the correct destination in the network, so the EWS devices must either be routers, or there must be some centralized virtual router through which all traffic passes.
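To make the wavelength idea a little more concrete, here is a sketch of how a receiving OWS might divide the wavelengths on its OWS-to-EWS link among the rings it is pulling traffic from. This is not the algorithm from the paper; the proportional split, the largest-remainder rounding, and the one-wavelength-per-ring basemesh are all assumptions of mine.

```python
def allocate_wavelengths(demand_gbps, total, basemesh=1):
    """Split `total` wavelengths among sending rings.

    Every ring first gets `basemesh` wavelengths (always-on connectivity);
    the remainder is shared in proportion to demand, with leftover
    wavelengths going to the largest fractional shares.
    """
    rings = sorted(demand_gbps)
    spare = total - basemesh * len(rings)
    if spare < 0:
        raise ValueError("not enough wavelengths for the basemesh")

    alloc = {ring: basemesh for ring in rings}
    total_demand = sum(demand_gbps.values())
    if total_demand == 0 or spare == 0:
        return alloc

    shares = {ring: spare * demand_gbps[ring] / total_demand for ring in rings}
    for ring in rings:
        alloc[ring] += int(shares[ring])

    leftover = spare - sum(int(share) for share in shares.values())
    by_fraction = sorted(
        rings, key=lambda ring: shares[ring] - int(shares[ring]), reverse=True
    )
    for ring in by_fraction[:leftover]:
        alloc[ring] += 1
    return alloc


# Rings are named after the sending host; demand figures are invented.
print(allocate_wavelengths({"A": 30.0, "B": 10.0, "C": 0.0}, total=11))
# {'A': 7, 'B': 3, 'C': 1}
```

Ring C has no demand but keeps its basemesh wavelength, so a first packet from C does not have to wait for a new allocation.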

Is something like MegaSwitch the future of data center networks? Right now it is hard to tell; all-optical fabrics have been a recurring idea in network design, but do not ever seem to have “broken out” as a preferred solution. The idea is attractive, but what essentially amounts to a variable speed optical underlay combined with a more traditional routed overlay seems to add a lot of complexity into the mix, and it is hard to say if that complexity is really worth the payoff, which primarily seems to be simpler and cheaper cabling.

You can read the full MegaSwitch paper here.

The Perfect and the Good

Perfect and good: one is just an extension of the other, right?

When I was 16 (a long, long, long time ago), I was destined to be a great graphic artist, a designer and/or illustrator of some note. Things didn’t turn out that way, of course, but the why is a tale for another day. At any rate, in art class that year, I took an old four foot spool end, stretched canvas across it, and painted a piece in acrylic. The painting was a beach sunset, the sun’s oblong shape offsetting the round of the overall painting, with deep reds and yellows in streaks above the beach, which was dark. I painted the image as if the viewer were standing just on the break at the top of the beach, so there was a bit of sea grass scattered around to offset the darkness of the beach.

And, along one side, a rose.

I really don’t know why I included the rose; I think I just wanted to paint one for some reason, and it seemed like a good idea to combine the ideas (the sunset on the beach and the rose). I entered this large painting in a local art contest, and won… nothing.

In discussion with my art teacher later on, I queried her about why she thought the piece had not won anything. It was well done, probably technically one of the best acrylic pieces I’d ever done to that point in my life. It was striking in its size and execution; there are rarely round paintings of any size, much less of that size.

She said: “The rose.”

It wasn’t that roses don’t grow on the break above a beach. Art, after all, often involves placing things that do not belong together, together, to make a point, or to draw the viewer into the image. It wasn’t the execution; the rose was vivid in its softness, basking in the red and yellow golden hour of sunset. Then why?

My art teacher explained that I had put too much that was good into the painting. The sunset was perfect, the rose was perfect. Between the two, however, the viewer didn’t know where to focus; it was just “too much.” In other words: Because the perfect is the enemy of the good.

This is not always true, of course. Sometimes the perfect is just the perfect, and that’s all there is to it. But sometimes (far too often, in fact), when we seek perfection, we fail to consider the tradeoffs, we fail to count the costs, and our failures turn into a failed system, often in ways we do not expect.

There are many examples in the networking industry, of course. IPv6, LISP, eVPN, TRILL, and a thousand other protocols have been designed and either struggle to take off, or don’t take off at all. Perhaps putting this in more distinct terms might be helpful.

Perhaps, if you’ve read my work for very long, you’ve seen this diagram before—or one that is very similar. The point of this diagram is that as you move towards reducing state, you move away from the minimal use of resources and the minimal set of interaction surfaces as a matter of course. There is no way to reach a point where there is minimal state in a network without incurring higher costs in some other place, either in resource consumption (such as bandwidth used) or in interaction surfaces (such as having more than one control plane, or using manual configurations throughout the network). Seeking the perfect in one realm will cause you to lose balance; the perfect is the enemy of the good. Here is another diagram that might look familiar to my long time readers—

We often don’t think we’ve reached perfection until we’ve reached the far right side of this chart. Most of the time, when we reach “perfection,” we’ve actually gone way past the point where we are getting any return on our effort, and way into the robust yet fragile part of the chart.

The bottom line?

Remember that you need to look for the tradeoffs, and try to be conscious about where you are in the various complexity scales. Don’t try to make it perfect; try to make it work, try to make it flexible, and try to learn when you’ve gained what can be gained without risking ossification, and hence brittleness.

Mitigating DDoS

Your first line of defense to any DDoS, at least on the network side, should be to disperse the traffic across as many resources as you can. Basic math implies that if you have fifteen entry points, and each entry point is capable of supporting 10g of traffic, then you should be able to simply absorb a 100g DDoS attack while still leaving 50g of overhead for real traffic (assuming perfect efficiency, of course—YMMV). Dispersing a DDoS in this way may impact performance—but taking bandwidth and resources down is almost always the wrong way to react to a DDoS attack.
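
The arithmetic above is simple enough to keep in a small helper. This is only a sketch; the function names are mine, and it assumes the attack spreads perfectly evenly across every entry point, which real traffic never quite does.

```python
def dispersal_headroom(entry_points, capacity_per_entry_g, attack_g):
    """Spare capacity (in g) left after absorbing the attack.

    Assumes a perfectly even spread of attack traffic across all entry points.
    """
    total_capacity_g = entry_points * capacity_per_entry_g
    return total_capacity_g - attack_g


def per_entry_attack_load(entry_points, attack_g):
    """Attack traffic each entry point sees under a perfectly even split."""
    return attack_g / entry_points


# The numbers from the text: fifteen entry points of 10g each, a 100g attack.
headroom = dispersal_headroom(entry_points=15, capacity_per_entry_g=10, attack_g=100)
load = per_entry_attack_load(entry_points=15, attack_g=100)
print(f"headroom: {headroom:g}g, attack load per entry: {load:.2f}g")
```

A negative headroom is the signal that dispersal alone will not be enough, and that some of the mitigation techniques need to be layered on top.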

But what if you cannot, for some reason, disperse the attack? Maybe you only have two edge connections, or the size of the DDoS is larger than your total edge bandwidth combined. It is typically difficult to mitigate a DDoS attack, but there is an escalating chain of actions you can take that often proves useful. Let’s deal with local mitigation techniques first, and then consider some fancier methods.

  • TCP SYN filtering: A lot of DDoS attacks rely on exhausting TCP open resources. If all inbound TCP sessions can be terminated in a proxy (such as a load balancer), the proxy server may be able to screen out half open and poorly formed TCP open requests. Some routers can also be configured to hold TCP SYNs for some period of time, rather than forwarding them on to the destination host, in order to block half open connections. This type of protection can be put in place long before a DDoS attack occurs.
  • Limiting Connections: It is likely that DDoS sessions will be short lived, while legitimate sessions will be longer lived. The difference may be a matter of seconds, or even milliseconds, but it is often enough to be a detectable difference. It might make sense, then, to prefer existing connections over new ones when resources start to run low. Legitimate users may wait longer to connect when connections are limited, but once they are connected, they are more likely to remain connected. Application design is important here, as well.
  • Aggressive Aging: In cache based systems, one way to free up depleted resources quickly is to simply age them out faster. The length of time a connection can be held open can often be dynamically adjusted in applications and hosts, allowing connection information to be removed from memory faster when there are fewer connection slots available. Again, this might impact live customer traffic, but it is still a useful technique when in the midst of an actual attack.
  • Blocking Bogon Sources: While there is a well known list of bogon addresses (address blocks that should never be routed on the global ‘net), these lists should be taken as a starting point, rather than as an ending point. Constant monitoring of traffic patterns on your edge can give you a lot of insight into what is “normal” and what is not. For instance, if your highest rate of traffic normally comes from South America, and you suddenly see a lot of traffic coming from Australia, either you’ve gone viral, or this is the source of the DDoS attack. It isn’t always useful to block all traffic from a region, or a set of source addresses, but it is often useful to use the techniques listed above more heavily on traffic that doesn’t appear to be “normal.”
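
To make the aggressive aging idea a little more concrete, here is a minimal sketch of a connection table whose idle timeout shrinks as free slots run out. Everything in it is an assumption for illustration: the function names, the linear scaling, and the 2 and 60 second bounds are mine, not taken from any particular product.

```python
def idle_timeout(free_slots, total_slots, max_timeout_s=60.0, min_timeout_s=2.0):
    """Scale the idle timeout linearly with the fraction of free connection slots."""
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    free_fraction = max(0.0, min(1.0, free_slots / total_slots))
    return min_timeout_s + (max_timeout_s - min_timeout_s) * free_fraction


def expire_idle(connections, now, timeout_s):
    """Remove and return the ids of connections idle for longer than timeout_s.

    `connections` maps a connection id to the time it was last active.
    """
    expired = [cid for cid, last_seen in connections.items() if now - last_seen > timeout_s]
    for cid in expired:
        del connections[cid]
    return expired


# Three connections in a table with four slots, so only one slot is free.
table = {"a": 100.0, "b": 130.0, "c": 139.0}
timeout = idle_timeout(free_slots=1, total_slots=4)
print(timeout, expire_idle(table, now=140.0, timeout_s=timeout))
```

With plenty of free slots the same table would keep connection "a"; it is only aged out because the table is nearly full. The active connections survive, which lines up with preferring existing connections over new ones.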

There are, of course, other techniques you can deploy against DDoS attacks, but at some point, you are just not going to have the expertise or time to implement every possible counter. This is where appliance based and cloud based services come into play. There are a number of appliance based solutions out there to scrub traffic coming across your links, such as those made by Arbor. The main drawback to these solutions is that they scrub the traffic after it has passed over the link into your network. This problem can often be resolved by placing the appliance in a colocation facility and directing your traffic through the colo before it reaches your inbound network link.

There is one open source DDoS scrubbing option in this realm, as well, which uses a combination of FastNetMon, InfluxDB, Grafana, Redis, Morgoth, and Bird to create a solution you can run locally on a spun up VM, or even bare metal on a self built appliance wired in between your edge router and the rest of the network (in the DMZ). This option is well worth looking at, if not to deploy, then to better understand how the kind of dynamic filtering performed by commercially available appliances works.

If the DDoS must be stopped before it reaches your edge link, and you simply cannot handle the volume of the attacks, then the best solution might be a cloud based filtering solution. These tend to be expensive, and they also tend to increase latency for your “normal” user traffic in some way. The way these normally work is the DDoS provider advertises your routes, or redirects your DNS address to their servers. This draws all your inbound traffic into their network, where it is scrubbed using advanced techniques. Once the traffic is scrubbed, it is either tunneled or routed back to your network (depending on how it was captured in the first place). Most large providers offer scrubbing services, and there are several public offerings available independent of any upstream you might choose (such as Verisign’s line of services).

A front line defense against DDoS is to place your DNS name, and potentially your entire site, behind a DDoS detection and mitigation DNS service and/or content distribution network. For instance, CloudFlare is a widely used service that not only proxies and caches your web site, it also protects you against DDoS attacks.

Dispersing a DDoS: Initial thoughts on DDoS protection

Distributed Denial of Service is a big deal—huge pools of Internet of Things (IoT) devices, such as security cameras, are compromised by botnets and being used for large scale DDoS attacks. What are the tools in hand to fend these attacks off? The first misconception is that you can actually fend off a DDoS attack. There is no magical tool you can deploy that will allow you to go to sleep every night thinking, “tonight my network will not be impacted by a DDoS attack.” There are tools and services that deploy various mechanisms that will do the engineering and work for you, but there is no solution for DDoS attacks.

One such reaction tool is spreading the attack. In the network below, the network under attack has six entry points.

Assume the attacker has IoT devices scattered throughout AS65002 which they are using to launch an attack. Due to policies within AS65002, the DDoS attack streams are being forwarded into AS65001, and thence to A and B. It would be easy to shut these two links down, forcing the traffic to disperse across five entries rather than two (B, C, D, E, and F). By splitting the traffic among five entry points, it may be possible to simply eat the traffic—each flow is now less than one half the size of the original DDoS attack, perhaps within the range of the servers at these entry points to discard the DDoS traffic.

However—this kind of response plays into the attacker’s hand, as well. Now any customer directly attached to AS65001, such as G, will need to pass through AS65002, from whence the attacker has launched the DDoS, and enter into the same five entry points. How happy do you think the customer at G would be in this situation? Probably not very…

Is there another option? Instead of shutting down these two links, it would make more sense to try to reduce the volume of traffic coming through the links and leave them up. To put it more shortly—if the DDoS attack is reducing the total amount of available bandwidth you have at the edge of your network, it does not make a lot of sense to reduce the available amount of bandwidth at your edge in response. What you want to do, instead, is reapportion the traffic coming in to each edge so you have a better chance of allowing the existing servers to simply discard the DDoS attack.

One possible solution is to prepend the AS path of the anycast address being advertised from one of the service instances. Here, you could add one prepend to the route advertisement from C, and check to see if the attack traffic is spread more evenly across the three sites. As we’ve seen in other posts, however, this isn’t always an effective solution (see these three posts). Of course, if this is an anycast service, we can’t really break up the address space into smaller bits. So what else can be done?

There is a way to do this with BGP: using communities to restrict the scope of the routes being advertised by A and B. For instance, you could begin by advertising the routes to the destinations under attack towards AS65001 with the NO_PEER community. Given that AS65002 is a transit AS (assume it is for this exercise), AS65001 would accept the routes from A and B, but would not advertise them towards AS65002. This means G would still be able to reach the destinations behind A and B through AS65001, but the attack traffic would still be dispersed across five entry points, rather than two. There are other mechanisms you could use here; specifically, some providers allow you to set a community that tells them not to advertise a route towards a specific AS, whether that AS is a peer or a customer. You should consult with your provider about this, as every provider uses a different set of communities, formatted in slightly different ways; your provider will probably point you to a web page explaining their formatting.

If NO_PEER does not work, it is possible to use NO_ADVERTISE, which blocks the advertisement of the destinations under attack to any of AS65001’s connections of whatever kind. G may well still be able to use the connections to A and B from AS65001 if it is using a default route to reach the Internet at large.
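
The effect of these two communities on what AS65001 passes along can be modeled in a few lines. This is a sketch rather than how any BGP implementation is written: the neighbor labels are local to the example, and it assumes AS65001 treats its session to AS65002 as a bilateral peering (the case NO_PEER is defined against) and honors the community, which is advisory. The numeric values are the well-known ones from RFC 1997 and RFC 3765.

```python
# Well-known community values.
NO_ADVERTISE = 0xFFFFFF02   # 65535:65282, RFC 1997
NO_PEER = 0xFFFFFF04        # 65535:65284, RFC 3765


def may_advertise(communities, neighbor_kind):
    """Decide whether a received route may be sent on to a neighbor.

    neighbor_kind is a label local to this sketch: "customer", "peer", or "upstream".
    """
    if NO_ADVERTISE in communities:
        return False                      # never passed on to any BGP neighbor
    if NO_PEER in communities and neighbor_kind == "peer":
        return False                      # kept off bilateral peering sessions
    return True


# AS65001's view of a route received from A or B.
for community, name in ((NO_PEER, "NO_PEER"), (NO_ADVERTISE, "NO_ADVERTISE")):
    to_g = may_advertise({community}, "customer")      # G, a customer of AS65001
    to_65002 = may_advertise({community}, "peer")      # AS65002, assumed to be a peer
    print(f"{name}: advertise to G={to_g}, advertise to AS65002={to_65002}")
```

With NO_PEER the route still reaches G but not AS65002; with NO_ADVERTISE it reaches neither, which is why G needs a default route in that case.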

It is, of course, possible to automate this reaction through a set of scripts, but as always, it is important to keep a short leash on such scripts. Humans need to be alerted to either make the decision to use these communities, or to continue using these communities; it is too easy for a false positive to lead to a real problem.

Of course, this sort of response is also not possible for networks with just one or two connection points to the Internet.

But in all cases, remember that shutting down links in the face of DDoS is rarely ever a real solution. You do not want to be reducing your available bandwidth when you are under an attack specifically designed to exhaust available bandwidth (or other resources). Rather, if you can, find a way to disperse the attack.

P.S. Yes, I have covered this material before, but I decided to rebuild this post with more in depth information, and to use it to kick off a small series on DDoS protection.

The Back Door Feature Problem

In Don’t Forget to Lock the Back Door! A Characterization of IPv6 Network Security Policy, the authors ran an experiment that tested for open ports in IPv4 and IPv6 across a wide swath of the network. What they discovered was interesting—

IPv6 is more open than IPv4. A given IPv6 port is nearly always more open than the same port is in IPv4. In particular, routers are twice as reachable over IPv6 for SSH, Telnet, SNMP, and BGP. While openness on IPv6 is not as severe for servers, we still find thousands of hosts open that are only open over IPv6.

This result really, on reflection, should not be all that surprising. There are probably thousands of networks in the world with “unintentional” deployments of IPv6. The vendor has shipped new products with IPv6 enabled by default, because one large customer has demanded it. Customers who have not even thought about deploying IPv6, however, end up with an unprotected attack surface.

The obvious solution to this problem is—deploy IPv6 intentionally, including security, and these problems will likely go away.

But the obvious solution, as obvious as it might be, is only one step in the right direction. Instead of just attacking the obvious problem, we should think through the process that caused this situation in the first place, and plug the hole in our thinking. The hole in our thinking is, of course, this:

“More features” is always better, so give me more features.

One of the lessons of the hyperscaler, lessons the rest of the market is just beginning to catch sight of off in the distance, is this more features mantra has led, and is still leading, into dangerous territory. Each feature, no matter how small it might seem, opens some new set of vulnerabilities in the code. Whether the vulnerability is a direct attack vector (such as the example in the paper), or just another interaction surface buried someplace in the vendor’s code, it is still a vulnerability.

In other words, simplicity is not just about the networks we design. Simplicity is not just about the user interface, either. It is integral to every product we buy, from the simplest switch to the most complex “silver bullet” appliance. Hiding complexity inside an appliance does not really make it go away; it just hides it.

You should really go read the paper itself, of course—and you should really deploy IPv6 intentionally in your network, as in yesterday. But you should also not fail to see the larger lesson that can be drawn from such studies. Sometimes it is better to have the complexity on the surface, where you can see and manage it, than it is to bury the complexity in an appliance, user interface, or…

Simplicity needs to be greater than skin deep.

IPv6, DHCP, and Unintended Consequences

I ran into an interesting paper on the wide variety of options for assigning addresses, and providing DNS information, in IPv6, over at ERNW. As always, with this sort of thing, it started me thinking about the power of unintended consequences, particularly in the world of standardization. The authors of this paper noticed there are a lot of different options available in the realm of assigning addresses, and providing DNS information, through IPv6.

Alongside these various options, there are a number of different flags that are supposed to tell the host which of these options should, and which shouldn’t, be used, prioritized, etc. The problem is, of course, that many of these flags, and many of the options, are, well, optional, which means they may or may not be implemented across different versions of code and vendor products. Hence, combining various flags with various bits of information can have a seemingly random impact on the IPv6 addresses and DNS information different hosts actually use. Perhaps the most illustrative chart is this one—

Each operating system tested seems to act somewhat differently when presented with all possible flags, and all possible sources of information. As the paper notes, this can cause major security holes. For instance, if an attacker simply brings up a DHCPv6 server on your network, and you’re not already using DHCPv6, the attacker can position itself to be a “man in the middle” for most DNS queries.

What lessons can we, as engineers and network operators, take away from this sort of thing?

First, standards bodies aren’t perfect. Standards bodies are, after all, made up of people (and a lot fewer people than you might imagine), and people are not perfect. Not only are people not perfect, they are often under pressures of various sorts which can lead to “less than optimal” decisions in many situations, particularly in the case of systems designed by a lot of different people, over a long stretch of time, with different pieces and parts designed to solve particular problems (corner or edge cases), and subjected to the many pressures of actually holding a day job.

Second, this means you need to be an engineer, even if you are relying on standards. In other words, don’t fall back to “but the RFC says…” as an excuse. Do the work of researching why “the RFC says,” find out what implementations do, and consider what alternatives might be. Ultimately, if you call yourself an engineer, be one.

Third, always know what is going on, on your network, and always try to account for negative possibilities, rather than just positive ones. I wonder how many times I have said, “but I didn’t deploy x, so I don’t need to think about how x interacts with my environment.” We never stop to ask if not deploying x leaves us open to security holes or failure modes we have not even considered.

Unintended consequences are, after all, unintended, and hence “Out of sight, out of mind.” But out of sight, and even out of mind, definitely does not mean out of danger.

The One Car

Imagine, for a moment, that you could only have one car. To do everything. No, I don’t mean, “I have access to a moving van through a mover, so I only need a minivan,” I mean one car. Folks who run grocery stores would need to use the same car to stock the shelves as their employees use to shuffle kids to school and back. The only thing about this car is this—it has the ability to add knobs pretty easily. If you need a new feature to meet your needs, you can go to the vendor and ask them to add it—there is an entire process, and it’s likely that the feature will be added at some point.

How does this change the world in which we live? Would it improve efficiency, or decrease it? Would it decrease operational costs (opex) or increase them? And, perhaps, another interesting question: what would this one car look like?

I’m guessing it would look a lot like routers and switches today. A handful of models, with lots of knobs, a complex CLI, and an in depth set of troubleshooting tools to match.

Of course, we actually have many different routers in the world, but compared to the “real car” world? Not really. And it’s not that this is a “bad thing” at some point in the history of a field or industry. The networking industry is, after all, just barely entering middle age in the real world, so maybe it’s okay if we have “one car” right now. And maybe we’re just starting to sort out how to go from “one car” to the specialized variety we see in the automotive world today (don’t just drive by your car dealership to see the variety, drive through your local commercial vehicle lot, too, and then the camper store, and then the off road place, and then the racing shop, and then…).

So—how do we make this transition?

First, we need to stop asking for the “one car.” Rather, we need to recognize that it’s actually okay to have one kind of equipment in one area of our network, and another in another. Second, we need to learn to simplify. We really just don’t need all those nerd knobs in every piece of equipment everywhere. Maybe we can live with fewer features and more specialized “stuff” to do specific things…

And, looking to the cloud, we can see a way forward there, too. If you live in a large city with lots of public transport, and mostly walkable shopping, you’re already “all in” with the cloud… You count on the trucks and cars of others to bring stuff close enough to you that you don’t need a car. If you live in the suburbs, you’re already in “hybrid cloud” mode, relying on the transportation resources of others for some things, and your own lighter weight transportation resources for other things. And if you live out in the country, then you probably have heavier equipment yourself, and just rent the even heavier stuff occasionally…

It’s okay to diversify. We’re an old enough field, now, with enough varying requirements, to let go of our “one car” expectations, and start realizing that maybe a range of solutions more finely tuned, and less finely tunable, is okay.

I2RS and Remote Triggered Black Holes

In our last post, we looked at how I2RS is useful for managing elephant flows on a data center fabric. In this post, I want to cover a use case for I2RS that is outside the data center, along the network edge—remote triggered black holes (RTBH). Rather than looking directly at the I2RS use case, however, it’s better to begin by looking at the process for creating, and triggering, RTBH using “plain” BGP. Assume we have the small network illustrated below—


In this network, we’d like to be able to trigger B and C to drop traffic sourced from 2001:db8:3e8:101::/64 inbound into our network (the cloudy part). To do this, we need a triggering router—we’ll use A—and some configuration on the two edge routers—B and C. We’ll assume B and C have up and running eBGP sessions to D and E, which are located in another AS. We’ll begin with the edge devices, as the configuration on these devices provides the setup for the trigger. On B and C, we must configure—

  • Unicast RPF; loose mode is okay. With loose RPF enabled, any packet sourced from an address whose route points to a null destination in the routing table will be dropped.
  • A route to some destination not used in the network pointing to null0. To make things a little simpler we’ll point a route to 2001:db8:1:1::/64, a prefix that doesn’t exist anyplace in the network, to null0 on B and C.
  • A pretty normal BGP configuration.

The triggering device is a little more complex. On Router A, we need—

  • A route map that—
    • matches some tag in the routing table, say 101
    • sets the next hop of routes with this tag to 2001:db8:1:1::1, an address covered by the route pointing to null0 on B and C
    • sets the local preference to some high number, say 200
  • Redistribution of static routes into BGP, filtered through the route map described above.

With all of this in place, we can trigger a black hole for traffic sourced from 2001:db8:3e8:101::/64 by configuring a static route at A, the triggering router, that points at null0, and has a tag of 101. Configuring this static route sets off the following chain of events:

  • the static route is installed in the local routing table at A with a tag of 101
  • the static route is redistributed into BGP
  • since the route has a tag of 101, the route map sets its local preference to 200 and its next hop to 2001:db8:1:1::1
  • the route is advertised via iBGP to B and C through normal BGP processing
  • when B and C receive this route, each chooses it as the best path to 2001:db8:3e8:101::/64, and installs it in the local routing table
  • since the next hop on this route is 2001:db8:1:1::1, which resolves to null0 on B and C, uRPF is triggered, dropping all traffic sourced from 2001:db8:3e8:101::/64 at the AS edge
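
The last two steps are the ones that tend to confuse people, so here is a small simulation of what happens on an edge router. It is a sketch under stated assumptions, not router code: the routing table is a plain dictionary, the interface name eth0 and the covering /48 learned from the neighboring AS are made up for the example, and "loose uRPF" is reduced to "the source must have a route that does not resolve to null0."

```python
import ipaddress

NULL0 = "null0"


def lookup(rib, address):
    """Longest-prefix match: return the next hop for address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in rib if addr in net]
    if not matches:
        return None
    return rib[max(matches, key=lambda net: net.prefixlen)]


def resolve(rib, address, max_depth=8):
    """Follow recursive next hops until an interface name is reached."""
    hop = lookup(rib, address)
    for _ in range(max_depth):
        if hop is None:
            return None
        try:
            ipaddress.ip_address(hop)
        except ValueError:
            return hop  # not an address, so treat it as an interface name
        hop = lookup(rib, hop)
    return None


def loose_urpf_permits(rib, source):
    """Permit the packet only if its source resolves to something other than null0."""
    interface = resolve(rib, source)
    return interface is not None and interface != NULL0


# Hypothetical routing table on edge router B before the trigger.
rib_b = {
    ipaddress.ip_network("2001:db8:3e8::/48"): "eth0",   # assumed route learned via eBGP
    ipaddress.ip_network("2001:db8:1:1::/64"): NULL0,    # the "magic" black hole route
}

attacker = "2001:db8:3e8:101::10"
bystander = "2001:db8:3e8:102::10"
print("before trigger:", loose_urpf_permits(rib_b, attacker))

# The iBGP route from A arrives: the attacking prefix with the magic next hop.
rib_b[ipaddress.ip_network("2001:db8:3e8:101::/64")] = "2001:db8:1:1::1"
print("after trigger:", loose_urpf_permits(rib_b, attacker), loose_urpf_permits(rib_b, bystander))
```

The attacker's source is permitted before the trigger and dropped after it, while a source in a neighboring /64 is untouched, because only the more specific route resolves to null0.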

It’s possible to have regional, per neighbor, or other sorts of “scoped” black hole routes by using different routes pointing to null0 on the edge routers. These are “magic numbers,” of course—you must have a list someplace that tells you which route causes what sort of black hole event at your edge, etc.

Note—this is a terrific place to deploy a DevOps sort of solution. Instead of using an appliance sort of router for the triggering router, you could run a handy copy of Cumulus or snaproute in a VM, and build scripts that build the correct routes in BGP, including a small table in the script that allows you to say something like “black hole 2001:db8:3e8:101::/64 on all edges,” or “black hole 2001:db8:3e8:101::/64 on all peers facing provider X,” etc. This could greatly simplify the process of triggering RTBH.
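
The "small table in the script" might look something like the following. This is a hypothetical sketch: the scope names, the second magic next hop, and the shape of the returned route are all invented for illustration, and actually injecting the route into BGP is left to whatever routing stack you run.

```python
import ipaddress

# Hypothetical "magic number" next hops. Each one must point to null0 on the
# edge routers that belong to the scope, and nowhere else.
SCOPE_NEXT_HOPS = {
    "all-edges": "2001:db8:1:1::1",
    "provider-x": "2001:db8:1:2::1",
}


def build_black_hole_route(source_prefix, scope, local_pref=200):
    """Translate "black hole <prefix> on <scope>" into the route to originate."""
    network = ipaddress.ip_network(source_prefix)  # raises ValueError on a bad prefix
    if scope not in SCOPE_NEXT_HOPS:
        raise KeyError(f"unknown black hole scope: {scope!r}")
    return {
        "prefix": str(network),
        "next_hop": SCOPE_NEXT_HOPS[scope],
        "local_pref": local_pref,
    }


print(build_black_hole_route("2001:db8:3e8:101::/64", "all-edges"))
print(build_black_hole_route("2001:db8:3e8:101::/64", "provider-x"))
```

Keeping the magic numbers in one table, with validation in front of it, is most of the value here; a typo in a prefix is rejected before it ever becomes a route.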

Now, as a counter, we can look at how this might be triggered using I2RS. There are two possible solutions here. The first is to configure the edge routers as before, using “magic number” next hops pointed at the null0 interface to trigger loose uRPF. In this case, an I2RS controller can simply inject the correct route directly into the routing table at each edge eBGP speaker to block the traffic. There would only need to be one such route; the complexity of choosing which peers the traffic should be black holed on could be contained in a script at the controller, rather than dispersed throughout the entire network. This allows RTBH to be triggered on a per edge eBGP speaker basis with no additional configuration on any individual edge router.

Note the dynamic protocol isn’t being replaced in any way. We’re still receiving our primary routing information from BGP, including all the policies available in that protocol. What we’re doing, though, is removing one specific policy point out of BGP and moving it into a controller, where it can be more closely managed, and more easily automated. This is, of course, the entire point of I2RS—to augment, rather than replace, dynamic routing used as the control plane in a network.

Another option, for those devices that support it, is to inject a route that explicitly filters packets sourced from 2001:db8:3e8:101::/64 directly into the RIB using the filter based RIB model. This is a more direct method, if the edge devices support it.

Either way, the I2RS process is simpler than using BGP to trigger RTBH. It gathers as much of the policy as possible into one place, where it can be automated and managed in a more precise, fine grained way.

Fabric versus Network: What’s the Difference?

We often hear about fabrics, and we often hear about networks, but on paper, and in practice, they often seem to be the same thing. Leaving aside the many realms of vendor hype, what’s really the difference? Poking around on the ‘net, I came across a couple of definitions that seemed useful, at least at first blush. For instance, SDN Search provides the following insight:

The word fabric is used as a metaphor to illustrate the idea that if someone were to document network components and their relationships on paper, the lines would weave back and forth so densely that the diagram would resemble a woven piece of cloth.

While this is interesting, it gives us more of an “on the paper” answer than what might be called a functional view. The entry at Wikipedia is more operationally based:

Switched Fabric or switching fabric is a network topology in which network nodes interconnect via one or more network switches (particularly crossbar switches). Because a switched fabric network spreads network traffic across multiple physical links, it yields higher total throughput than broadcast networks, such as early Ethernet.

Greg has an interesting (though older) post up on the topic, and Brocade has an interesting (and longer) white paper up on the topic. None of these, however, seem to have a complete picture. So what is a fabric?

To define a fabric in terms of functionality, I would look at several attributes, including—

  • the regularity and connectedness of the nodes (network devices) and edges (links)
  • the design of the traffic flow, specifically how traffic is channeled to individually connected devices
  • the performance goals the topology is designed to fulfill in terms of forwarding

You’ll notice that, unlike the definition given by many vendors, I’m not too interested in whether the fabric is treated as “one device” or “many devices.” Many vendors will throw out the idea that a fabric must be treated as a single “thing,” unlike a network, which treats each device independently. This is clever marketing, of course, because it differentiates the vendor’s “fabric offering” from home grown (or built from component) fabrics, but that’s primarily what it is: marketing. While it might be a nice feature of any network or fabric to make administration easier, it’s not definitional in the way the performance and design of the network are.

In fact, what’s bound to start happening in the next few years is vendors are going to use the fabric label for overlay systems, or vertically integrated systems, of all sorts, like a “campus fabric,” or a “wide area fabric.” Another marketing ploy to watch out for is going to be interplay with the software defined moniker: if it’s “software defined,” it’s a “fabric.” Balderdash.

Let’s look at the three concepts I outlined above in a little more detail.

Topology Regularity

Fabrics have a well defined, regularly repeating topology. It doesn’t matter if the topology is planar or non-planar, what matters is that the topology is a regularly repeating “small topology” repeated in a larger topology.


In these diagrams, A is a regular topology; note you can take a copy of any four nodes, overlay them on any other four nodes, and see the repeating topology. A is also planar, as no two links cross. B is a nonplanar regular (or repeating) topology. C is not a regular topology, as the “subtopologies” do not repeat between the nodes. D is an irregular topology that is also nonplanar.

The regularity of the topology is a good rule of thumb that will help you quickly pick out most fabrics against most non-fabrics. Why? To understand the answer, we need to look at the rest of the properties of a fabric.

Design of the Traffic Flow

Several of the definitions given in my quick look through the ‘net mentioned this one: in a fabric, traffic is split across many available paths, rather than being pushed onto a smaller number of higher speed paths. The general rule of thumb is—if traffic can be split over a large number of ECMP paths, then you’re probably looking at a fabric, rather than a network. The only way to get a large number of ECMP paths is to create a regularly repeating topology, like the ones shown in A and B above.
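
One way to see the link between regularity and ECMP is simply to count the equal cost paths. The sketch below builds a two stage leaf and spine topology (my choice of regular topology, not one of the diagrams above), counts shortest paths with a breadth first search, and compares the result against a small ring. It assumes every link has the same cost, so "equal cost" just means "same hop count."

```python
from collections import deque


def build_leaf_spine(leaves, spines):
    """Adjacency map for a fabric where every leaf connects to every spine."""
    graph = {f"leaf{i}": set() for i in range(leaves)}
    graph.update({f"spine{j}": set() for j in range(spines)})
    for i in range(leaves):
        for j in range(spines):
            graph[f"leaf{i}"].add(f"spine{j}")
            graph[f"spine{j}"].add(f"leaf{i}")
    return graph


def count_shortest_paths(graph, src, dst):
    """Count the distinct minimum hop paths from src to dst using BFS."""
    dist = {src: 0}
    paths = {src: 1}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                paths[neighbor] = paths[node]
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                paths[neighbor] += paths[node]   # another equally short way in
    return paths.get(dst, 0)


fabric = build_leaf_spine(leaves=4, spines=4)
ring = {"a": {"b", "e"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d", "a"}}

print("leaf0 to leaf1 in the fabric:", count_shortest_paths(fabric, "leaf0", "leaf1"))
print("a to c in the ring:", count_shortest_paths(ring, "a", "c"))
```

In the leaf and spine case the number of equal cost paths between any two leaves is just the number of spines, and it is the same for every pair of leaves. In the ring, most pairs have exactly one shortest path.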

Performance Goals

But why does the number of ECMP paths matter? Because—fabric performance is normally quantifiable in somewhat regular mathematical terms. In other words, if you want to understand the performance of a fabric, you don’t need to examine the network topology as a “one off.” Perhaps a better way to say this: fabrics are not snowflakes in terms of performance. You might not know why a particular fabric performs a certain way (theoretically), but you can still know how it’s going to perform under specific conditions.

The most common case of this is the ability to calculate the oversubscription rate on the fabric: what amount of traffic can the network switch without contention, given the traffic is evenly distributed across sources and receivers? In a fabric, it’s easy enough to look at the edge ports offered, the bandwidth available to carry traffic at each stage, and determine at what level the fabric is going to introduce buffering as a result of link contention. This is probably the crucial defining characteristic of a fabric from a network design perspective.
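
As a worked example of the calculation, here is the oversubscription ratio for a single leaf. The port counts and speeds are hypothetical, and the sketch assumes traffic is evenly distributed, as in the text.

```python
def oversubscription_ratio(edge_ports, edge_port_gbps, uplinks, uplink_gbps):
    """Ratio of edge facing bandwidth to fabric facing bandwidth on one leaf.

    A value of 1.0 or less means the leaf can carry all its edge traffic
    into the fabric without contention.
    """
    downstream_gbps = edge_ports * edge_port_gbps
    upstream_gbps = uplinks * uplink_gbps
    if upstream_gbps <= 0:
        raise ValueError("a leaf needs some uplink bandwidth")
    return downstream_gbps / upstream_gbps


# Hypothetical leaf: 48 edge ports at 10G, 4 uplinks at 40G.
ratio = oversubscription_ratio(edge_ports=48, edge_port_gbps=10, uplinks=4, uplink_gbps=40)
print(f"{ratio:g}:1 oversubscribed")
```

Because every leaf in a regular fabric looks the same, working this out once tells you the answer for the whole stage.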

Another one that’s interesting, and less often considered, is the maximum or typical jitter through the fabric in the absence of contention. If a fabric is properly designed, and the network devices used to build the fabric don’t mess with your math, you can generally get a pretty good idea of what the minimum and maximum delay will be from any edge port to any other edge port on a fabric. Within the broader class of network topologies, this is generally a matter of measuring the actual delays through the network, rather than a calculation that can be done beforehand.
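
A rough version of that delay calculation, for a two stage leaf and spine fabric, might look like this. The per device and per link figures are placeholders I made up, and the sketch ignores contention entirely, as the paragraph above does.

```python
def delay_bounds_us(device_min_us, device_max_us, link_us):
    """Return (minimum, maximum) edge port to edge port delay in microseconds.

    Minimum: both ports sit on the same leaf (one device, no fabric links).
    Maximum: leaf to spine to leaf (three devices, two fabric links).
    """
    minimum = device_min_us
    maximum = 3 * device_max_us + 2 * link_us
    return minimum, maximum


low, high = delay_bounds_us(device_min_us=1.0, device_max_us=2.0, link_us=0.5)
print(f"delay between {low} and {high} microseconds, spread {high - low}")
```

The spread between the two bounds is the worst case jitter in the absence of contention, and it falls out of the topology and the device data sheets rather than out of measurement.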

While some might disagree, these are the crucial differences between “any old network topology” and a fabric from my perspective.