Rethinking BGP on the DC Fabric (part 3)

The fist post on this topic considered some basic definitions and the reasons why I am writing this series of posts. The second considered the convergence speed of BGP on a dense topology such as a DC fabric, and what mechanisms we normally use to improve BGP’s convergence speed. This post considers some of the objections to slow convergence speed—convergence speed is not important, and ECMP with high fanouts will take care of any convergence speed issues. The network below will be used for this discussion.

Two servers are connected to this five-stage butterfly: S1 and S2 Assume, for a moment, that some service is running on both S1 and S2. This service is configured in active-active mode, with all data synchronized between the servers. If some fabric device, such as C7, fails, traffic destined to either S1 or S2 across that device will be very quickly (within tens of milliseconds) rerouted through some other device, probably C6, to reach the same destination. This will happen no matter what routing protocol is being used in the underlay control plane—so why does BGP’s convergence speed matter? Further, if these services are running in the overlay, or they are designed to discover failed servers and adjust accordingly, it would seem like the speed at which the underlay converges just does not matter.

Consider, however, the case where the services running on S1 and S2 are both reachable through an eVPN overlay with tunnel tail-ends landing on the ToR switch through which each server connects to the fabric. Applications accessing these services, for this example, either access the service via a layer 2 MAC address or through a single (anycast) IP address representing the service, rather than any particular instance. To make all of this work, there would be one tunnel tail-end landing on A8, and another landing on E8.

Now what happens if A8 fails? For the duration of the underlay control plane convergence the tunnel tail-end at A8 will appear to be reachable to the overlay. Thus the overlay tunnel will remain up and carrying traffic to a black hole on one of the routers adjacent to A8. In the case of a service reachable via anycast, the application can react in one of two ways—it can fail out operations taking place during the underlay’s convergence, or it can wait. Remember that one second is an eternity in the world of customer-facing services, and that BGP can easily take up to one second to converge in this situation.

A rule of thumb for network design—it’s not the best-case that controls network performance, it’s the worst-case convergence.

The convergence speed of the underlay leaks through to the state of the overlay. The questions that should pop into your mind about right now is—can you be certain this kind of situation cannot happen in your current network, can you be certain it will never happen, and can you be certain this will not have an impact on application performance? I don’t see how the answer to those questions can be yes. The bottom line: convergence speed should be left out of the equation when building a DC fabric. There may be times when you control the applications, and hence can push the complexity of dealing with slow convergence to the application developers—but this seems like a vanishingly small number of cases. Further, is pushing solving for slow convergence to the application developer optimal?

My take on the argument that convergence speed doesn’t matter, then, is that it doesn’t hold up under deeper scrutiny.

as I noted when I started this series—I’m not arguing that we should rip BGP out of every DC fabric … instead, what I’m trying to do is to stir up a conversation and to get my readers to think more deeply about their design choices, and how those design choices work out in the real world

Rethinking BGP on the DC Fabric (part 2)

In my last post on this topic, I laid out the purpose of this series—to start a discussion about whether BGP is the ideal underlay control plane for a DC fabric—and gave some definitions. Here, I’d like to dive into the reasons to not use BGP as a DC fabric underlay control plane—and the first of these reasons is BGP converges very slowly and requires a lot of help to converge at all.

Examples abound. I’ve seen the results of two testbeds in the last several years where a DC fabric was configured with each router (switch, if you prefer) in a separate AS, and some number of routes pushed into the network. In both cases—one large-scale, the other a more moderately scaled network on physical hardware—BGP simply failed to converge. Why? A quick look at how BGP converges might help explain these results.

Assume we are watching the 110::/64 route (attached to A, on the left side of the diagram), at P. What happens when A loses it’s connection to 110::/64? Assuming every router in this diagram is in a different AS, and the AS path length is the only factor determining the best path at every router.

Watching the route to 110::/64 at P, you would see the route move from G to M as the best path, then from M to K, then from K to N, and then finally completely drop out of P’s table. This is called the hunt because BGP “hunts,” apparently trying every path from the current best path to the longest possible path before finally removing the route from the network entirely. BGP isn’t really “hunting;” this is just an artifact of the way BGP speakers receive, process, and send updates through the network.

If you consider a more complex topology, like a five-stage butterfly fabric, you will find there are many (very many) alternate longer-length paths available for BGP to hunt through on a withdraw. Withdrawing thousands of routes at the same time, combined with the impact of the hunt, can put BGP in a state where it simply never converges.

To get BGP to converge, various techniques must be used. For instance, placing all the routers in the spine so they are in the AS, configuring path filters at ToR switches so they are never used as a transit path, etc. Even when these techniques are used, however, BGP can still require a minute or so to perform a withdraw.

This means the BGP configuration cannot be the same on every device—it is determined by where the device is located—which harms repeatability, the BGP configuration must contain complex filters, and messing up the configuration can bring the entire fabric down.

There are several counters to the problem of slow convergence, and the complex configurations required to make BGP converge more quickly, but this post is pushing against its limit … so I’ll leave these until next time.

Rethinking BGP on the DC Fabric

Everyone uses BGP for DC underlays now because … well, just because everyone does. After all, there’s an RFC explaining the idea, every tool in the world supports BGP for the underlay, and every vendor out there recommends some form of BGP in their design documents.

I’m going to swim against the current for the moment and spend a couple of weeks here discussing the case against BGP as a DC underlay protocol. I’m not the only one swimming against this particular current, of course—there are at least three proposals in the IETF (more, if you count things that will probably never be deployed) proposing link-state alternatives to BGP. If BGP is so ideal for DC fabric underlays, then why are so many smart people (at least they seem to be smart) working on finding another solution?

But before I get into my reasoning, it’s probably best to define a few things.

In a properly design data center, there are at least three control planes. The first of these I’ll call the application overlay. This control plane generally runs host-to-host, providing routing between applications, containers, or virtual machines. Kubernetes networking would be an example of an application overlay control plane.

The second of these I’ll call the infrastructure overlay. This is generally going to be eVPN running BGP, most likely with VXLAN encapsulation, and potentially with segment routing for traffic steering support. This control plane will typically run on either workload supporting hosts, providing routing for the hypervisor or internal bridge, or on the Top of Rack (ToR) routers (switches, but who knows what “router” and “switch” even mean any longer?).

Now notice that not all networks will have both application and infrastructure overlays—many data center fabrics will have one or the other. It’s okay for a data center fabric to only have one of these two overlays—whether one or both are needed is really a matter of local application and business requirements. I also expect both of these to use either BGP or some form of controller-based control plane. BGP was originally designed to be an overlay control plane; it only makes sense to use it where an overlay is required.

I’ll call the third control plane the infrastructure underlay. This control plane provides reachability for the tunnel head- and tail-ends. Plain IPv4 or IPv6 transport is supported here; perhaps some might inject MPLS as well.

My argument, over the next couple of weeks, is BGP is not the best possible choice for the infrastructure underlay. What I’m not arguing is every network that runs BGP as the infrastructure underlay needs to be ripped out and replaced, or that BGP is an awful, horrible, no-good choice. I’m arguing there are very good reasons not to use BGP for the infrastructure underlay—that we need to start reconsidering our monolithic assumption that BGP is the “only” or “best” choice.

I’m out of words for this week; I’ll begin the argument proper in my next post… stay tuned.

Technical Debt (or Is Future Proofing Even a Good Idea?)

What, really, is “technical debt?” It’s tempting to say “anything legacy,” but then why do we need a new phrase to describe “legacy stuff?” Even the prejudice against legacy stuff isn’t all that rational when you think about it. Something that’s old might also just be well-tested, or well-worn but still serviceable. Let’s try another tack.

Technical debt, in the software world, can be defined as working on a piece of software for long periods of time by only adding features, and never refactoring or reorganizing the code to meet current conditions. The general idea is that as new features are added on top of the old, two things happen. First, the old stuff becomes a sort of opaque box that no-one understands. Second, the stuff being added to the old increasingly relies on public behavior that might be subject to unintended consequences or leaky abstractions.

To resolve this problem in the software world, software is “refactored.” In refactoring, every use of a public API is examined, including what information is being drawn out, or what the expected inputs and outputs are. The old code is then “discarded,” in a sense, and a new underlying function written that meets the requirements discovered in the existing codebase. The refactoring process allows the calling functions and called functions, the clients and the servers, to “move together.” As the overall understanding of the system changes, the system itself can change with that understanding. This brings the implementation closer to the understanding of the current engineering team.

So technical debt is really the mismatch between the understanding of the current engineering team imposed onto the implementation of a prior engineering team with a different understanding of the requirements (even if the older engineering team is the same people at a different time).

How can we apply this to networking? In the networking world, we actively work against refactoring by future proofing.

Future proofing: building a network that will never need to be replaced. Or, perhaps, never needing to go to your manager and say “I need to buy new hardware,” because the hardware you bought last week does not have the functionality you need any longer. This sounds great in theory, but the problem with theory is it often does not hold up when the glove hits the nose (every has a plan until they are punched in the face). The problem with future proofing is you can’t see the punch that’s coming until the future actually throws it.

If something is future proofed, it either cannot be refactored, or there will be massive resistence to the refactoring process.

Maybe its time we tried to change our view of things so we can refactor networks. What would refactoring a network look like? Maybe examining all the configurations within a particular module, figuring what they do, and then trying to figure out what application or requirement led to that particular bit of configuration. Or, one step higher, looking at every protocol or “feature” (whatever a feature might be), figuring out what purpose it might serve, and then intentionally setting about finding some other way, perhaps a simpler way, to provide that same service.

One key point in this process is to begin by refusing to look at the network as a set of appliances, and starting to see it as a system made up of hardware and software. Mentally disaggregating the software and hardware can allow you to see what can be changed and what cannot, and inject some flexibility into the network refactoring process.

When you refactor a bit of code, what you typically end up with a simpler piece of software that more closely matches the engineering team’s understanding of current requirements and conditions. Aren’t simplicity and coherence goals for operational networks, too?

If so, ehen was the last time you refactored your network?

Strong Reactions and Complexity

In the realm of network design—especially in the realm of security—we often react so strongly against a perceived threat, or so quickly to solve a perceived problem, that we fail to look for the tradeoffs. If you haven’t found the tradeoffs, you haven’t looked hard enough—or, as Dr. Little says, you have to ask what is gained and what is lost, rather than just what is gained. This failure to look at both sides often results in untold amounts of technical debt and complexity being dumped into network designs (and application implementations), causing outages and failures long after these decisions are made.

A 2018 paper on DDoS attacks, A First Joint Look at DoS Attacks and BGP Blackholing in the Wild provides a good example of causing more damage to an attack than the attack itself. Most networks are configured to allow the operator to quickly configure a remote triggered black hole (RTBH) using BGP. Most often, a community is attached to a BGP route that points the next-hop to a local discard route on each eBGP speaker. If used on the route advertising the destination of the attack—the service under attack—the result is the DDoS attack traffic no longer has a destination to flow to. If used on the route advertising the source of the DDoS attack traffic, the result is the DDoS traffic will no pass any reverse-path forwarding policies at the edge of the AS, and hence be dropped. Since most DDoS attacks are reflected, blocking the source traffic still prevents access to some service, generally DNS or something similar.

In either case, then, stopping the DDoS through an RTBH causes damage to services rather than just the attacker. Because of this, remote triggered black holes should really only be used in the most extreme cases, where no other DDoS mitigation strategy will work.

The authors of the Joint Look use publicly avaiable information to determine the answers to several questions. First, what scale of DDoS attacks are RTBHs used against? Second, how long after an attack begins is the RTBH triggered? Third, for how long is the RTBH left in place after the attack has been mitigated?

The answer to the first question should be—the RTBH is only used against the largest-scale attacks. The answer to the second question should be—the RTBH should be put in place very quickly after the attack is detected. The answer to the third question should be—the RTBH should be taken down as soon as the attack has stopped. The researchers found that RTBHs were most often used to mitigate the smallest of DDoS attacks, and almost never to mitigate larger ones. The authors also found that RTBHs were often left in place for hours after a DDoS attack had been mitigated. Both of these imply that current use of RTBH to mitigate DDoS attacks is counterproductive.

How many more design patterns do we follow that are simply counterproductive in the same way? This is not a matter of “following the data,” but rather one of really thinking through what it is you are trying to accomplish, and then how to accomplish that goal with the simplest set of tools available. Think through what it would mean to remove what you have put in, whether you really need to add another layer or protocol, how to minimize configuration, etc.

If you want your network to be less complex, examine the tradeoffs realistically.

Hints and Principles: Applied to Networks

While software design is not the same as network design, there is enough overlap for network designers to learn from software designers. A recent paper published by Butler Lampson, updating a paper he wrote in 1983, is a perfect illustration of this principle. The paper is caleld Hints and Principles for Computer System Design. I’m not going to write a full review here–you should really go read the paper for yourself–but rather just point out some useful bits of the paper.

The first really useful point of this paper is Lampson breaks down the entire field of software design into three basic questions: What, How, and When (or who)? Each of these corresponds to the goals, techniques, and processes used to design and develop software. These same questions and answers apply to network design–if you are missing one of these three areas, then you are probably missing some important set of questions you have not answered yet. Each of these is also represented by an acronym: what? is STEADY, how? is AID, and when? is ART. Let’s look at a couple of these in a little more detail to see how Lampson’s system works.

STEADY stands for simple, timely, efficient, adaptable, dependable, and yummy. Simple is just what it sounds like –reduce complexity. I’m not entirely on board with Lampson’s description of simplicity, which seems to focus on abstraction–abstraction is one useful tool, but anyone who reads my work regularly knows I’m rather more careful about abstraction than most because it involves often-unexamined tradeoffs. Timely primarily relates to “is there a market for this,” in software design; for networks it might be better put as “does the business need this now or later?” Efficient is one of those tradeoffs involved in abstraction–what I might call one of the various ways of optimizing a system. Adaptable means just what it sounds like–are you creating technial debt that must be resolved later? Dependable could be translated to resilience in network design, but it would also relate to many aspects of security, and even the jitter and delay elements in application support.

Yummy is one many network engineers will not be familiar with, but is worth considering. If I’m reading Lampson right here, another way to say this might be “easy to consume.” Why do you want your customers to be able to consume the network easily? Because you do not want them running off and using the cloud (for instance) because they find committing and understanding resources in your network so difficult. We have, for far to long, assumed that “easy to consume” in the network design world means “just plug it into the wall.” It’s not that simple.

The second one, AID, stands for approximate, incremental, and divide & conquer. These are, again, easily adaptable to network design. You don’t need to make the design perfect the first time. In fact, as a young artist one thing that was drilled into my head was that the perfect was the enemy of the good–it’s better to get it approximately right, right now, than perfectly right ten years down the road (when no-one cares any longer). Incremental speaks to modularization, scale-out, and lifecycle management, for instance.

While not every principle here can be applied, a lot of them can. Having them listed out in an easy-to-remember format like this is a great design aid–learn these, and use them.

Reducing Complexity through Interaction Surfaces

A recent paper on network control and management (which includes Jennifer Rexford on the author list—anything with Jennifer on the author list is worth reading) proposes a clean slate 4d approach to solving much of the complexity we encounter in modern networks. While the paper is interesting, it’s very unlikely we will ever see a clean slate design like the one described, not least because there will always be differences between what the proper splits are—what should go where.

There is one section of the paper that eloquently speaks to current architecture, however. The authors describe a situation where routing and packet filters are used together to prevent one set of hosts from reaching another set of hosts. Changes in the network, however, cause the packet filters to be bypassed, opening up communications between these two sets of hosts.

This is exactly the problem we so often face in network engineering today—overlapping systems used to solve a single problem do not pay attention to the same signals or information to do their jobs. So here’s a thought about an obvious way to reduce the complexity of your network—try to use one tool to do one job. Before the days of automation, this was much harder to do. There was no way to distribute QoS configurations, for instance, or access lists, much less what might be considered an “easy way.” Because of this, it made some kind of sense to use routing protocols as a sort of distributed database and policy engine to move filters and the like around.

Today, however, we have automation. Because of this, it makes more sense to use automation to manage as much data plane policy as you can, leaving the routing protocol to do its job—provide reachability across an ever-changing network. There are still things, like traffic steering and prefix distribution rules, which should stay inside routing. But when you put routing filters in place to solve a data plane problem, it might be worth thinking about whether that is the right thing to do any longer.

Automation, in this case, can change everything.