Rethinking BGP on the DC Fabric (part 4)
Before I continue, I want to remind you what the purpose of this little series of posts is. The point is not to convince you to never use BGP in the DC underlay ever again. There’s a lot of BGP deployed out there, and there are a lot of tools that assume BGP in the underlay. I doubt any of that is going to change. The point is to make you stop and think!
Why are we deploying BGP in this way? Is this the right long-term solution? Should we, as a community, be rethinking our desire to use BGP for everything? Are we just “following the crowd” because … well … we think it’s what the “cool kids” are doing, or because “following the crowd” is what we always seem to do?
In my last post, I argued that BGP converges much more slowly than the other options available for the DC fabric underlay control plane. The pushback I received was two-fold. First, the overlay converges fast enough; the underlay convergence time does not really factor into overall convergence time. Second, there are ways to fix things.
If the first pushback is always true—the speed of the underlay control plane convergence does not matter—then why have an underlay control plane at all? Why not just use a single, merged, control plane for both underlay and overlay? Or … to be a little more shocking, if the speed at which the underlay control plane converges does not matter, why not just configure the entire underlay using … static routes?
The reason we use a dynamic underlay control plane is because we need this foundational connectivity for something. So long as we need this foundational connectivity for something, then that something is always going to be better if it is faster rather than slower.
The second pushback is more interesting: because we work on virtual things rather than physical ones, just about anything can be adapted to serve any purpose. I can, for instance, replace BGP’s bestpath algorithm with Dijkstra’s SPF, and BGP’s packet format with a more straightforward TLV format emulating a link-state protocol, and then say, “see, now BGP looks just like a link-state protocol … I made BGP work really well on a DC fabric.”
Yes, of course you can do these things. Somewhere along the way we became convinced that we are being really clever when we adapt a protocol to do something it wasn’t designed to do, but I’m not certain this is a good way of going about building reliable systems.
Okay, back to the point … the next reason we should rethink BGP on the DC fabric is that it is complex to configure when it’s being used as an IGP. In my last post, when discussing the configuration required to make BGP converge, I noted that AS numbers and AS Path filters must be laid out in a very specific way, following where each device is located in the fabric. The MRAI must be taken down to some minimum on every device (either 0 or 1 second), and individual peers must be configured.
Further, if you are using a version of BGP that follows the IETF’s BCPs for the protocol, you must configure some sort of filter (generally a permit all) to get a BGP speaker to advertise anything to an eBGP peer. If you’re using iBGP, you need to configure route reflectors and tell BGP to advertise multiple paths.
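To make the shape of this complexity a bit more concrete, here is a rough sketch in Python of the per-device decisions described above. The device names, ASN ranges, and policy names are assumptions made up for illustration, not a recommended design; the point is simply that the right answer for each knob depends on where the device sits in the fabric.

# Illustrative sketch only: the ASN scheme, names, and values below are
# assumptions chosen to mirror the knobs described above, not a recipe.

def bgp_knobs(role, pod, index, peers):
    """Return the per-device BGP settings a fabric position implies."""
    # AS numbers must follow the device's place in the fabric so that
    # AS_PATH loop prevention and the path filters work out correctly.
    if role == "spine":
        asn = 65000                        # e.g., one shared ASN for the spine
    elif role == "leaf":
        asn = 65100 + pod                  # e.g., one ASN per pod of leaves
    else:                                  # ToR
        asn = 65200 + (pod * 100) + index  # unique ASN per ToR

    return {
        "asn": asn,
        "mrai_seconds": 0,                 # MRAI forced to its minimum
        "neighbors": [
            {
                "peer": p,                           # every peer configured individually
                "export_policy": "permit-all",       # an explicit policy is required before
                                                     # an eBGP speaker will advertise anything
                                                     # (the RFC 8212 default behavior)
                "as_path_filter": "deny-own-tier",   # keep ToRs from being used as transit
            }
            for p in peers
        ],
    }

print(bgp_knobs("tor", pod=3, index=2, peers=["leaf3-1", "leaf3-2"]))

Even in this toy form, the ASN, the filters, and the peer list all change with the device’s position; none of it can simply be stamped out identically across the fabric.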
There are two ways to solve this problem. First, you can automate all this configuration—of course! I am a huge fan of automation. It’s an important tool because it can make your network consistent and more secure.
But I’m also realistic enough to know that adding the complexity of an automation system on top of a too-complex system to make things simpler is probably not a really good idea. To give a visual example, consider the possibility of automatically wiping your mouth while eating soup.

Yes, automation can be taken too far. A good rule of thumb might be: automation works best on systems intentionally designed to be simple enough to automate. In this case, perhaps it would be simpler to just use a protocol more directly designed to solve the problem at hand, rather than trying to automate our way out of the problem.
Second, you can modify BGP to be a better fit for use as an IGP in various ways. This post has already run far too long, however, so … I’ll hold off on talking about this until the next post.
The Hedge 71: Nick Russo and Automating Productivity
When we think of automation—and more broadly tooling—we tend to think of automating the configuration, monitoring, and (possibly) troubleshooting of a network. On the other hand, a friend once observed that when interviewing coders, the first thing he asked was about the tools they had developed and used for making themselves more efficient. This “self-tooling” process turns out to be important not just to be more efficient at work, but to use time more effectively in general. Join Nick Russo, Eyvonne Sharp, Tom Ammon, and Russ White as we discuss self-tooling.
On the ‘net: The Art of Conviction
I was recently a guest on The Art of Conviction podcast, where we covered a bit of my background, some of the challenges I’ve faced in getting where I am, and then we moved into a discussion around my recently finished dissertation. I’m working to find places to publish more in the area of worldview and culture; I’ll point to those here as I can find a “home” for that side of my life.
You can find the recording here.
Beyond my episode, The Art of Conviction is a fascinating podcast; you should really subscribe and listen in.
Rethinking BGP on the DC Fabric (part 3)
The first post on this topic considered some basic definitions and the reasons why I am writing this series of posts. The second considered the convergence speed of BGP on a dense topology such as a DC fabric, and what mechanisms we normally use to improve BGP’s convergence speed. This post considers some of the objections to slow convergence speed—convergence speed is not important, and ECMP with high fanouts will take care of any convergence speed issues. The network below will be used for this discussion.

Two servers are connected to this five-stage butterfly: S1 and S2. Assume, for a moment, that some service is running on both S1 and S2. This service is configured in active-active mode, with all data synchronized between the servers. If some fabric device, such as C7, fails, traffic destined to either S1 or S2 across that device will be very quickly (within tens of milliseconds) rerouted through some other device, probably C6, to reach the same destination. This will happen no matter what routing protocol is being used in the underlay control plane—so why does BGP’s convergence speed matter? Further, if these services are running in the overlay, or they are designed to discover failed servers and adjust accordingly, it would seem like the speed at which the underlay converges just does not matter.
Consider, however, the case where the services running on S1 and S2 are both reachable through an eVPN overlay with tunnel tail-ends landing on the ToR switch through which each server connects to the fabric. Applications accessing these services, for this example, either access the service via a layer 2 MAC address or through a single (anycast) IP address representing the service, rather than any particular instance. To make all of this work, there would be one tunnel tail-end landing on A8, and another landing on E8.
Now what happens if A8 fails? For the duration of the underlay control plane convergence the tunnel tail-end at A8 will appear to be reachable to the overlay. Thus the overlay tunnel will remain up and carrying traffic to a black hole on one of the routers adjacent to A8. In the case of a service reachable via anycast, the application can react in one of two ways—it can fail out operations taking place during the underlay’s convergence, or it can wait. Remember that one second is an eternity in the world of customer-facing services, and that BGP can easily take up to one second to converge in this situation.
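To put a rough number on what that second can cost, here is a back-of-the-envelope sketch; the request rate and the convergence times are assumptions chosen for illustration, not measurements from any particular fabric.

# Back-of-the-envelope sketch; the request rate and convergence times are
# assumptions for illustration, not measurements.

requests_per_second = 10_000      # assumed load on the service behind A8
convergence = {
    "tuned link-state underlay": 0.05,   # tens of milliseconds (assumed)
    "BGP worst case":            1.00,   # up to a second, per the text
}

for underlay, seconds in convergence.items():
    blackholed = requests_per_second * seconds
    print(f"{underlay}: ~{blackholed:,.0f} requests land in the black hole")

The absolute numbers matter less than the ratio: every additional underlay convergence time is multiplied by whatever load happens to be riding the failed tail-end at that moment.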
A rule of thumb for network design—it’s not the best-case convergence that controls network performance, it’s the worst case.
The convergence speed of the underlay leaks through to the state of the overlay. The questions that should pop into your mind right about now are: can you be certain this kind of situation cannot happen in your current network, can you be certain it will never happen, and can you be certain this will not have an impact on application performance? I don’t see how the answer to those questions can be yes. The bottom line: convergence speed should not be left out of the equation when building a DC fabric. There may be times when you control the applications, and hence can push the complexity of dealing with slow convergence to the application developers—but this seems like a vanishingly small number of cases. Further, is pushing the problem of slow convergence onto the application developers really optimal?
My take on the argument that convergence speed doesn’t matter, then, is that it doesn’t hold up under deeper scrutiny.
As I noted when I started this series, I’m not arguing that we should rip BGP out of every DC fabric. Instead, what I’m trying to do is to stir up a conversation and to get my readers to think more deeply about their design choices, and how those design choices work out in the real world.
The Hedge 70: FR Routing Update
FR Routing is a widely used and supported open source routing stack. In this episode of the Hedge, Alistair Woodman, Quentin Young, Donald Sharp, Tom Ammon, and Russ White discuss recent updates, additions to the CI/CD system, the release process, and operating system support. If you’re looking for a good open source, containerized routing stack for everything from route servers to DC fabrics and labbing to production, you should check out FR Routing.
It is Easier to Move a Problem than Solve it (RFC1925, Rule 6)

Early on in my career as a network engineer, I learned the value of sharing. When I could not figure out why a particular application was not working correctly, it was always useful to blame the application. Conversely, the application owner was often quite willing to share their problems with me, as well, by blaming the network.
A more cynical way of putting this kind of sharing is the way RFC 1925, rule 6, puts it: “It is easier to move a problem around than it is to solve it.”
Of course, the general principle applies far beyond sharing problems with your co-workers. There are many applications in network and protocol design, as well. Perhaps the most widespread case deployed in networks today is the movement to “let the controller solve the problem.” Distributed routing protocols are hard? That’s okay, just implement routing entirely on a controller. Understanding how to deploy individual technologies to solve real-world problems is hard? Simple—move the problem to the controller. All that’s needed is to tell the controller what we intend to do, and the controller can figure the rest out. If you have problems solving any problem, just call it Software Defined Networking, or better yet Intent Based Networking, and the problems will melt away into the controller.
Pay no attention to the complexity of the code on the controller, or those pesky problems with CAP Theorem, or any protests on the part of long-term engineering staff who talk about total system complexity. They are probably just old curmudgeons who are protecting their territory in order to ensure they have a job tomorrow. Once you’ve automated the process you can safely ignore how the process works; the GUI is always your best guide to understanding the overall complexity of the system.
Examples of moving the problem abound in network engineering. For instance, it is widely known that managing customers is one of the hardest parts of operating a network. Customers move around, buy new hardware, change providers, and generally make it very difficult for providers by losing their passwords and personally identifying information (such as their Social Security Number in the US). To solve this problem, RFC8567 suggests moving the problem of storing enough information to uniquely identify each person into the Domain Name System. Moving the problem from the customer to DNS clearly solves the problem of providers (and governments) being able to track individuals on a global basis. The complexity and scale of the DNS system is not something to be concerned with, as DNS “just works,” and some method of protecting the privacy of individuals in such a system can surely be found. After all, it’s just software.
If the DNS system becomes too complex, it is simple enough to relieve DNS of the problem of mapping IP addresses to the names of hosts. Instead, each host can be assigned a single IP address that is used regardless of where it is attached to the network, and hence uniquely identifies the host throughout its lifetime. Such a system is suggested in RFC2100 and appears to be widely implemented in many networks already, at least from the perspective of application developers. Because DNS is “too slow,” application developers find it easier to move the problem DNS is supposed to solve into the routing system by assigning permanent IP addresses.
Another great example of moving a problem rather than solving it is RFC3215, Electricity over IP. Every building in the world, from houses to commercial storefronts, must have multiple cabling systems installed in order to support multiple kinds of infrastructure. If RFC3215 were widely implemented, a single kind of cable (or even fiber optics, if you want your electricity faster) can be installed in all buildings, and power carried over the IP network running on these cables (once the IP network is up and running). Many ancillary problems could be solved with such a scheme—for instance, power usage could be measured based on a per-packet system, rather than the sloppier kilowatt-hour system currently in use. Any bootstrap problems can be referred to the controller. After all, it’s just software.
The bottom line is this: when you cannot figure out how to solve a problem, just move it to some other system, especially if that system is “just software,” so the problem now becomes “software defined.” This is also especially useful if moving the problem can be accomplished by claiming the result is a form of automation.
Moving problems around is always much easier than solving them.
Rethinking BGP on the DC Fabric (part 2)
In my last post on this topic, I laid out the purpose of this series—to start a discussion about whether BGP is the ideal underlay control plane for a DC fabric—and gave some definitions. Here, I’d like to dive into the reasons to not use BGP as a DC fabric underlay control plane—and the first of these reasons is BGP converges very slowly and requires a lot of help to converge at all.
Examples abound. I’ve seen the results of two testbeds in the last several years where a DC fabric was configured with each router (switch, if you prefer) in a separate AS, and some number of routes pushed into the network. In both cases—one large-scale, the other a more moderately scaled network on physical hardware—BGP simply failed to converge. Why? A quick look at how BGP converges might help explain these results.

Assume we are watching the 110::/64 route (attached to A, on the left side of the diagram), at P. What happens when A loses its connection to 110::/64? Assume every router in this diagram is in a different AS, and that the AS path length is the only factor determining the best path at every router.
Watching the route to 110::/64 at P, you would see the route move from G to M as the best path, then from M to K, then from K to N, and then finally completely drop out of P’s table. This is called the hunt because BGP “hunts,” apparently trying every path from the current best path to the longest possible path before finally removing the route from the network entirely. BGP isn’t really “hunting;” this is just an artifact of the way BGP speakers receive, process, and send updates through the network.
If you consider a more complex topology, like a five-stage butterfly fabric, you will find there are many (very many) alternate longer-length paths available for BGP to hunt through on a withdraw. Withdrawing thousands of routes at the same time, combined with the impact of the hunt, can put BGP in a state where it simply never converges.
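Just to give a feel for how quickly the alternates pile up, here is a toy sketch in Python that brute-forces every loop-free path between two ToRs on a deliberately tiny fabric (three pods, two ToRs and two leaves per pod, two spines). The topology and names are assumptions for illustration only; BGP does not literally walk every one of these paths, but the longer alternates are exactly the raw material the hunt has to work through.

# Toy illustration only: a deliberately tiny "five-stage" fabric so the
# paths can be brute-forced. Real fabrics are far larger, and BGP does not
# literally enumerate every path, but the count of longer alternates is
# what feeds the hunt on a withdraw.

from collections import defaultdict
from itertools import product

PODS, TORS, LEAVES, SPINES = 3, 2, 2, 2

links = defaultdict(set)

def connect(a, b):
    links[a].add(b)
    links[b].add(a)

for p, l in product(range(PODS), range(LEAVES)):
    for t in range(TORS):                 # every ToR to every leaf in its pod
        connect(f"tor{p}.{t}", f"leaf{p}.{l}")
    for s in range(SPINES):               # every leaf to every spine
        connect(f"leaf{p}.{l}", f"spine{s}")

def simple_paths(src, dst, seen=()):
    """Yield every loop-free path from src to dst (depth-first)."""
    if src == dst:
        yield seen + (dst,)
        return
    for nxt in links[src]:
        if nxt not in seen + (src,):
            yield from simple_paths(nxt, dst, seen + (src,))

by_length = defaultdict(int)
for path in simple_paths("tor0.0", "tor1.0"):
    by_length[len(path) - 1] += 1         # count hops, not nodes

for hops in sorted(by_length):
    print(f"{by_length[hops]:4d} loop-free paths of {hops} hops")

Even on this miniature fabric, only a handful of the loop-free paths are the shortest ones; the rest are the longer detours BGP can appear to “try” before the route finally drops out.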
To get BGP to converge, various techniques must be used. For instance, placing all the routers in the spine in the same AS, configuring AS path filters at ToR switches so they are never used as a transit path, etc. Even when these techniques are used, however, BGP can still require a minute or so to perform a withdraw.
This means the BGP configuration cannot be the same on every device—it is determined by where the device is located—which harms repeatability. The configuration must also contain complex filters, and messing up the configuration can bring the entire fabric down.
There are several counters to the problem of slow convergence, and the complex configurations required to make BGP converge more quickly, but this post is pushing against its limit … so I’ll leave these until next time.
