Rethinking BGP on the DC Fabric (part 3)

The first post on this topic considered some basic definitions and the reasons why I am writing this series of posts. The second considered the convergence speed of BGP on a dense topology such as a DC fabric, and the mechanisms we normally use to improve BGP’s convergence speed. This post considers two common objections to the slow-convergence argument: that convergence speed is not important, and that ECMP with high fanouts will take care of any convergence speed issues. The network below will be used for this discussion.

Two servers are connected to this five-stage butterfly: S1 and S2. Assume, for a moment, that some service is running on both S1 and S2. This service is configured in active-active mode, with all data synchronized between the servers. If some fabric device, such as C7, fails, traffic destined to either S1 or S2 across that device will be very quickly (within tens of milliseconds) rerouted through some other device, probably C6, to reach the same destination. This will happen no matter what routing protocol is being used in the underlay control plane—so why does BGP’s convergence speed matter? Further, if these services are running in the overlay, or they are designed to discover failed servers and adjust accordingly, it would seem like the speed at which the underlay converges just does not matter.
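The local repair described above can be sketched as a toy hash-based ECMP selector. All device names, addresses, and the hashing scheme here are hypothetical; the point is only that the failover decision is purely local to the router, which is why it is fast regardless of the routing protocol in use:

```python
# Toy hash-based ECMP next-hop selection (hypothetical names and scheme).
# When C7 fails, the local router simply rehashes flows onto the surviving
# equal-cost next hops; no protocol-wide convergence is needed for this step.
import hashlib

def pick_next_hop(flow, next_hops):
    """Hash the flow's 5-tuple onto the list of currently live next hops."""
    key = "|".join(str(f) for f in flow).encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

next_hops = ["C6", "C7"]                          # two equal-cost paths toward S2
flow = ("10.0.0.1", "10.0.1.1", 6, 49152, 443)    # src, dst, proto, sport, dport

before = pick_next_hop(flow, next_hops)
next_hops.remove("C7")                            # C7 fails; local FIB entry removed
after = pick_next_hop(flow, next_hops)
assert after == "C6"                              # flow lands on a surviving path
```

Note that real implementations often use consistent hashing so surviving flows are not reshuffled when a member is removed; the modulo scheme above is the simplest possible illustration.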

Consider, however, the case where the services running on S1 and S2 are both reachable through an eVPN overlay with tunnel tail-ends landing on the ToR switch through which each server connects to the fabric. Applications accessing these services, for this example, either access the service via a layer 2 MAC address or through a single (anycast) IP address representing the service, rather than any particular instance. To make all of this work, there would be one tunnel tail-end landing on A8, and another landing on E8.

Now what happens if A8 fails? For the duration of the underlay control plane convergence the tunnel tail-end at A8 will appear to be reachable to the overlay. Thus the overlay tunnel will remain up and carrying traffic to a black hole on one of the routers adjacent to A8. In the case of a service reachable via anycast, the application can react in one of two ways—it can fail out operations taking place during the underlay’s convergence, or it can wait. Remember that one second is an eternity in the world of customer-facing services, and that BGP can easily take up to one second to converge in this situation.

A rule of thumb for network design—it’s not the best-case that controls network performance, it’s the worst-case convergence.

The convergence speed of the underlay leaks through to the state of the overlay. The questions that should pop into your mind right about now are: can you be certain this kind of situation cannot happen in your current network, can you be certain it will never happen, and can you be certain it will not have an impact on application performance? I don’t see how the answer to any of those questions can be yes. The bottom line: convergence speed should not be left out of the equation when building a DC fabric. There may be times when you control the applications, and hence can push the complexity of dealing with slow convergence to the application developers—but this seems like a vanishingly small number of cases. Further, is pushing the problem of slow convergence onto application developers really optimal?

My take on the argument that convergence speed doesn’t matter, then, is that it doesn’t hold up under deeper scrutiny.

As I noted when I started this series, I’m not arguing that we should rip BGP out of every DC fabric. Instead, what I’m trying to do is stir up a conversation and get my readers to think more deeply about their design choices, and how those design choices work out in the real world.

Rethinking BGP on the DC Fabric (part 2)

In my last post on this topic, I laid out the purpose of this series—to start a discussion about whether BGP is the ideal underlay control plane for a DC fabric—and gave some definitions. Here, I’d like to dive into the reasons to not use BGP as a DC fabric underlay control plane—and the first of these reasons is BGP converges very slowly and requires a lot of help to converge at all.

Examples abound. I’ve seen the results of two testbeds in the last several years where a DC fabric was configured with each router (switch, if you prefer) in a separate AS, and some number of routes pushed into the network. In both cases—one large-scale, the other a more moderately scaled network on physical hardware—BGP simply failed to converge. Why? A quick look at how BGP converges might help explain these results.

Assume we are watching the 110::/64 route (attached to A, on the left side of the diagram) at P. What happens when A loses its connection to 110::/64? Assume every router in this diagram is in a different AS, and that the AS path length is the only factor determining the best path at every router.

Watching the route to 110::/64 at P, you would see the route move from G to M as the best path, then from M to K, then from K to N, and then finally completely drop out of P’s table. This is called the hunt because BGP “hunts,” apparently trying every path from the current best path to the longest possible path before finally removing the route from the network entirely. BGP isn’t really “hunting;” this is just an artifact of the way BGP speakers receive, process, and send updates through the network.
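The hunt at P can be illustrated with a toy model: P holds several alternate paths to 110::/64, and withdrawals for each path arrive one at a time (the shortest paths are generally withdrawn first, since those neighbors are closest to the failure). P’s best path therefore steps through every longer alternative before the route finally disappears. The neighbors and path lengths below are hypothetical, matching the sequence described above:

```python
# Toy model of BGP path hunting at router P (hypothetical neighbors and
# AS-path lengths). Withdrawals arrive shortest-path-first, so P's best
# path "hunts" through every longer alternative before dropping out.

# Alternate paths P has learned, keyed by advertising neighbor,
# each with its AS-path length.
paths = {"G": 3, "M": 4, "K": 5, "N": 6}

def best_path(paths):
    """Shortest AS path wins (the only tiebreaker in this example)."""
    return min(paths, key=paths.get) if paths else None

history = [best_path(paths)]
for neighbor in ["G", "M", "K", "N"]:   # withdrawals arrive in this order
    del paths[neighbor]
    history.append(best_path(paths))

print(history)   # ['G', 'M', 'K', 'N', None] -- the hunt, then removal
```

Each step in `history` can trigger a fresh round of updates to P’s own neighbors, which is why the hunt multiplies so badly across a dense topology.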

If you consider a more complex topology, like a five-stage butterfly fabric, you will find there are many (very many) alternate longer-length paths available for BGP to hunt through on a withdraw. Withdrawing thousands of routes at the same time, combined with the impact of the hunt, can put BGP in a state where it simply never converges.

To get BGP to converge, various techniques must be used: placing all the routers in the spine in a single AS, configuring path filters at the ToR switches so they are never used as a transit path, and so on. Even when these techniques are used, however, BGP can still require a minute or so to fully process a withdraw.
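As a rough sketch of what these workarounds look like in practice (using hypothetical FRR/Cisco-style syntax; all AS numbers and addresses are made up), a ToR configuration might include something like:

```
! Hypothetical ToR configuration sketch. The spine shares one AS (64512),
! and the ToR advertises only locally originated routes (empty AS path),
! so it can never be used as a transit path through the fabric.
router bgp 65001
 neighbor 10.1.1.1 remote-as 64512
 neighbor 10.1.2.1 remote-as 64512
 !
 address-family ipv4 unicast
  neighbor 10.1.1.1 route-map NO-TRANSIT out
  neighbor 10.1.2.1 route-map NO-TRANSIT out
 exit-address-family
!
ip as-path access-list LOCAL-ONLY permit ^$
!
route-map NO-TRANSIT permit 10
 match as-path LOCAL-ONLY
```

The `^$` expression matches an empty AS path, i.e., only routes the ToR itself originates; every neighbor statement needs the outbound policy attached, which is exactly the kind of per-device filter complexity the next paragraph describes.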

This means the BGP configuration cannot be the same on every device—it is determined by where the device sits in the fabric—which harms repeatability. The configuration must also contain complex filters, and a mistake in those filters can bring the entire fabric down.

There are several counters to the problem of slow convergence, and the complex configurations required to make BGP converge more quickly, but this post is pushing against its limit … so I’ll leave these until next time.

The Hedge 69: Container Networking Done Right

Everyone who’s heard me talk about container networking knows I think it’s a bit of a disaster. This is what you get, though, when someone says “that’s really complex, I can discard the years of experience others have in designing this sort of thing and build something a lot simpler…” The result is usually something that’s even more complex. Alex Pollitt joins Tom Ammon and me to discuss container networking, and new options that do container networking right.

download

Rethinking BGP on the DC Fabric

Everyone uses BGP for DC underlays now because … well, just because everyone does. After all, there’s an RFC explaining the idea, every tool in the world supports BGP for the underlay, and every vendor out there recommends some form of BGP in their design documents.

I’m going to swim against the current for the moment and spend a couple of weeks here discussing the case against BGP as a DC underlay protocol. I’m not the only one swimming against this particular current, of course—there are at least three proposals in the IETF (more, if you count things that will probably never be deployed) proposing link-state alternatives to BGP. If BGP is so ideal for DC fabric underlays, then why are so many smart people (at least they seem to be smart) working on finding another solution?

But before I get into my reasoning, it’s probably best to define a few things.

In a properly designed data center, there are at least three control planes. The first of these I’ll call the application overlay. This control plane generally runs host-to-host, providing routing between applications, containers, or virtual machines. Kubernetes networking would be an example of an application overlay control plane.

The second of these I’ll call the infrastructure overlay. This is generally going to be eVPN running over BGP, most likely with VXLAN encapsulation, and potentially with segment routing for traffic steering support. This control plane will typically run either on workload-supporting hosts, providing routing for the hypervisor or internal bridge, or on the Top of Rack (ToR) routers (switches, but who knows what “router” and “switch” even mean any longer?).

Now notice that not all networks will have both application and infrastructure overlays—many data center fabrics will have one or the other. It’s okay for a data center fabric to only have one of these two overlays—whether one or both are needed is really a matter of local application and business requirements. I also expect both of these to use either BGP or some form of controller-based control plane. BGP was originally designed to be an overlay control plane; it only makes sense to use it where an overlay is required.

I’ll call the third control plane the infrastructure underlay. This control plane provides reachability for the tunnel head- and tail-ends. Plain IPv4 or IPv6 transport is supported here; perhaps some might inject MPLS as well.

My argument, over the next couple of weeks, is BGP is not the best possible choice for the infrastructure underlay. What I’m not arguing is every network that runs BGP as the infrastructure underlay needs to be ripped out and replaced, or that BGP is an awful, horrible, no-good choice. I’m arguing there are very good reasons not to use BGP for the infrastructure underlay—that we need to start reconsidering our monolithic assumption that BGP is the “only” or “best” choice.

I’m out of words for this week; I’ll begin the argument proper in my next post… stay tuned.

The Hedge 66: Tyler McDaniel and BGP Peer Locking

Tyler McDaniel joins Eyvonne, Tom, and Russ to discuss a study on BGP peerlocking, which is designed to prevent route leaks in the global Internet. From the study abstract:

BGP route leaks frequently precipitate serious disruptions to interdomain routing. These incidents have plagued the Internet for decades while deployment and usability issues cripple efforts to mitigate the problem. Peerlock, introduced in 2016, addresses route leaks with a new approach. Peerlock enables filtering agreements between transit providers to protect their own networks without the need for broad cooperation or a trust infrastructure.

download

Technologies that Didn’t: ARCnet

In the late 1980’s, I worked at a small value added reseller (VAR) around New York City. While we deployed a lot of thinnet (RG58 coax based Ethernet for those who don’t know what thinnet is), we also had multiple customers who used ARCnet.

Back in the early days of personal computers like the Amiga 500, the 8088-based XT (running at 4.77MHz), and the 80286-based AT, all networks were effectively wide area, used to connect PDP-11’s and similar gear between college campuses and research institutions. ARCnet was developed in 1976, and became popular in the early 1980’s, because it was, at that point, the only available local area networking solution for personal computers.

ARCnet was not an accidental choice in the networks I supported at the time. While thinnet was widely available, it required running coax cable. The only twisted pair Ethernet standard available at the time required new cables to be run through buildings, which could often be an expensive proposition. For instance, one of the places that relied heavily on ARCnet was a legal office in a small town in north-central New Jersey. This law office had started out in an older home over a shop in the square of a smaller town—a truly historic building well over a hundred years old. As the law office grew, they purchased adjacent buildings, and created connecting corridors through closets and existing halls by carefully opening up passages between the buildings. The basements of the buildings were more-or-less connected anyway, so the original telephone cabling was tied together to create a unified system.

When the law office decided to bring email and shared printers up on Novell Netware, they called in the VAR I worked for to figure out how to make it all work. The problem we encountered was that the building had been insulated at some point with asbestos fiber filling the walls. Wiring on the surface of the walls and baseboards was rejected because it would destroy the historical character of the building. Running through the walls would only be possible if the asbestos was torn out—and this would mean removing the walls, again running into major problems with the historical nature of the building.

The solution? ARCnet can run on the wiring used for plain old telephone circuits. Not very fast, of course; the original specification was 2.5Mbit/s. On the other hand, it was fast enough for printers and email before the days of huge image files and cute cat videos. ARCnet could also run in a “star” configuration, meaning a centralized hub (which we would today call a switch) with each host attached as a spoke or point on the star. This kind of wiring had just been introduced for Ethernet, and so was considered novel, but not widely deployed.

ARCnet deployed to well over ten thousand networks globally (a lot of networks for that time period), and then was rapidly replaced by Ethernet. The official reason for this rapid replacement was the greater speed of Ethernet—but as I noted above, most of the applications for networks in those days did not really make use of all that bandwidth, even in larger networks. Routers were not a “thing” at this time, but you could still connect several hundred hosts onto a single ARCnet or Ethernet segment and expect it to work with the common traffic requirements of the day.

At the small VAR I worked at, we had another reason for replacing ARCnet: it blew up too much. The cables over which POTS services run are unshielded, and hence liable to induced high-voltage spikes from other sources. For instance, we had to be quite intentional about not using POTS lines located within a certain distance of the older wiring in the buildings where ARCnet was deployed; a voltage spike could not only cause the network to “blank out” for some amount of time, it could actually put enough voltage on the wires to destroy the network interface cards. We purchased ARCnet interface cards by the case, it seemed. After any heavy thunderstorm, the entire shop went from one ARCnet customer to another replacing cards. At some point, replacing cases of interface cards becomes more expensive than performing asbestos mitigation, or even just running the shielded cable that Ethernet on twisted pair requires. It became cheaper to replace ARCnet than to keep it running.

An interesting twist to this story—there is current work in the Ethernet working group of the IEEE to make Ethernet run on … the cabling used for very old POTS services. This is effectively the same use case ARCnet filled for many VARs in the late 1980’s. The difference, today, is that much more is understood about how to build electronics that can support high voltage spikes while still being able to discriminate a signal on a poor transmission medium. Much of this work has been done for the wireless world already.

So ARCnet failed not so much because it was a bad technology, but because it was ahead of its time in terms of its use case, while being very much of its time in its physical and electronic design.


Current Work in BGP Security

I’ve been chasing BGP security since before the publication of the soBGP drafts, way back in the early 2000’s (that’s almost 20 years for those who are math challenged). The most recent news largely centers on the RPKI, which is used to ensure the AS originating an advertisement is authorized to do so (or rather, “owns” the resource or prefix). If you are not “up” on what the RPKI does, or how it works, you might find this old blog post useful—it’s actually the tenth post in a ten-post series on the topic of BGP security.

Recent news in this space largely centers around the ongoing deployment of the RPKI. According to Wired, Google and Facebook have both recently adopted MANRS, and are adopting the RPKI. While it might not seem like autonomous systems along the edge adopting BGP security best practices and the RPKI can make much of a difference, the “heavy hitters” among the content providers can play a pivotal role here by refusing to accept routes that appear to be hijacked. This not only helps these providers and their customers directly—a point the Wired article makes—it also helps the ‘net in a larger way by denying attackers access to at least some of the “big fish” in terms of traffic.
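The origin-validation decision these providers apply can be sketched in a few lines. This is a simplified illustration of the RFC 6811 logic (valid, invalid, not-found); the ROAs and announcements below are made-up examples, not real allocations:

```python
# Simplified RPKI route origin validation (RFC 6811 logic, toy data).
from ipaddress import ip_network

# Each ROA: (authorized prefix, maxLength, authorized origin AS)
roas = [
    (ip_network("192.0.2.0/24"), 24, 64500),
]

def validate(prefix, origin_as, roas):
    prefix = ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_as in roas:
        # Is the announced prefix covered by this ROA's prefix?
        if prefix.subnet_of(roa_prefix):
            covered = True
            # Covered, within maxLength, and from the authorized origin: valid.
            if prefix.prefixlen <= max_len and origin_as == roa_as:
                return "valid"
    # Covered by some ROA but never matching: invalid; otherwise unknown.
    return "invalid" if covered else "not-found"

print(validate("192.0.2.0/24", 64500, roas))     # valid
print(validate("192.0.2.0/25", 64500, roas))     # invalid (exceeds maxLength)
print(validate("198.51.100.0/24", 64500, roas))  # not-found
```

A provider “refusing routes that appear to be hijacked” is, in essence, dropping (or heavily depreferencing) the announcements this logic marks invalid.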

Leslie Daigle, over at the Global Cyber Alliance—an organization I’d never heard of until I saw this—has a post up explaining exactly how deploying the RPKI in an edge AS can make a big difference in the service level from a customer’s perspective. Leslie is looking for operators who will fill out a survey on the routing security measures they deploy. If you operate a network that has any sort of BGP presence in the default-free zone (DFZ), it’s worth taking a look and filling the survey out.

One of the various problems with routing security is just being able to see what’s in the RPKI. If you have a problem with your route in the global table, you can always go look at a route view server or looking glass (a topic I will cover in some detail in an upcoming live webinar over on Safari Books Online—I think it’s scheduled for February right now). But what about the RPKI? RIPE NCC has released a new tool called the JDR:

Just like RP software, JDR interprets certificates and signed objects in the RPKI, but instead of producing a set of Verified ROA Payloads (VRPs) to be fed to a router, it annotates everything that could somehow cause trouble. It will go out of its way to try to decode and parse objects: even if a file is clearly violating the standards and should be rejected by RP software, JDR will try to process it and present as much troubleshooting information to the end-user afterwards.

You can find the JDR here.

Finally, the folks at APNIC, working with NLnet Labs, have taken a page from the BGP playbook and proposed an opaque object for the RPKI, extending it beyond “just prefixes.” They’ve created new Resource Tagged Attestations, or RTAs, which can carry “any arbitrary file.” They have a post up explaining the rationale and the work here.