In a recent podcast, Ivan and Dinesh ask why there is a lot of interest in running link state protocols on data center fabrics. They begin with this point: if you have less than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.
- BGP is both simple and complex, depending on your perspective
- BGP is sometimes too much, and sometimes too little for data center fabrics
- We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer
Will these contentions stand up to a rigorous challenge?
I will begin with the underlying assertion: BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more challenging than BGP’s requirement, which is very distance-vector-like; in BGP you only care whether any pair of speakers have enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. It certainly seems, on this score, like BGP is simpler.
Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.
When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.
There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems. The most common hard problem complexity is used to solve is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?
Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you, they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.
Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.
In the center of the illustration above there is a small grey circle representing the core features of BGP. The sections of the ten-sided figure around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see only one “slice”: the core functionality and what we are building on top of it. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.
Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.
BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.
When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The eVPN part. The intra-AS parts. The inter-AS parts. All of it. The complexity may appear low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is much higher. In fact, BGP may have the simplest set of core functions, but by any reasonable measure it is the most complicated routing protocol in existence.
In other words, complexity is sometimes a matter of perspective. In this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as BGP.
Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?
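The reason QoS tools like RED break this kind of global synchronization is worth sketching: instead of tail-dropping every stream at once when the queue fills, RED drops packets probabilistically as the average queue depth grows, so TCP streams back off at different times. Below is a minimal sketch of the core RED calculation; WRED simply applies a curve like this per traffic class, which is what lets you place control plane traffic on a different curve than application traffic. The threshold values are illustrative assumptions, not recommendations.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Classic RED: no drops below min_th, certain drop at or above
    max_th, and a linearly increasing drop probability in between."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# Illustrative thresholds: with min_th=10, max_th=30, and max_p=10%,
# an average queue depth of 20 packets yields a 5% drop probability.
print(red_drop_probability(20, 10, 30, 0.1))  # 0.05
```

Because each stream sees drops at slightly different moments, the streams desynchronize rather than all backing off and ramping up in lockstep.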
Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.
Is BGP “good enough?” For some applications, it is. For others, however, it might not be.
You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: if you haven’t found the trade-offs, you haven’t looked hard enough.
The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (ASes), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue it is not enough.
Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI: https://doi.org/10.1145/3098822.3098853
Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.
Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.
The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) nodes pushed to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit provided through peering across the public ‘net. The problem Facebook is trying to solve is not last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.
The authors begin with the observation that, left to its own decision process, BGP will evenly distribute traffic across all available peers, even though each peer may be experiencing a different level of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that chose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Based Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.
The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be the shortest physical or logical path as well, and hence the path with the lowest end-to-end delay. In the same way, local preference is normally set to prefer peer connections over upstream or transit connections. This should mean traffic will take a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. This should result in traffic passing through more lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance, rather than enhancing it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.
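The mismatch is easy to see in a sketch of the first few steps of the BGP decision process: local preference and AS path length pick the winner before any notion of measured performance can enter the picture. This is a simplified model for illustration, not the full decision process; the names and values are invented.

```python
from dataclasses import dataclass

@dataclass
class Route:
    via: str
    local_pref: int
    as_path: list
    med: int = 0

def best_path(routes):
    # Simplified BGP decision process: highest local preference wins,
    # then shortest AS path, then lowest MED; later tie-breakers omitted.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))

# A directly connected peer with high local preference beats a transit
# path, even if the transit path is measurably faster right now.
routes = [
    Route(via="peer", local_pref=200, as_path=[65001, 65002, 65003]),
    Route(via="transit", local_pref=100, as_path=[65010]),
]
print(best_path(routes).via)  # peer
```

Nothing in this ordering ever consults congestion or delay, which is precisely the authors’ point.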
How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—
The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?
Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
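A layered design of this sort can be sketched in a few lines: the routing protocol supplies the set of loop-free candidate next hops, and a separate measurement-driven layer chooses among them, deferring to the routing protocol’s own best path when measurements are missing or stale. The function name, data shapes, and freshness threshold here are assumptions for illustration only.

```python
def select_egress(candidates, rtt_measurements, now, max_age=30.0):
    """Pick the egress with the lowest measured RTT among the loop-free
    candidates supplied by the routing protocol; fall back to the first
    candidate (the protocol's best path) when no fresh measurement exists.

    rtt_measurements maps next hop -> (rtt_ms, timestamp)."""
    fresh = [c for c in candidates
             if c in rtt_measurements
             and now - rtt_measurements[c][1] <= max_age]
    if not fresh:
        return candidates[0]
    return min(fresh, key=lambda c: rtt_measurements[c][0])

measurements = {"peer": (80.0, 100.0), "transit": (35.0, 100.0)}
print(select_egress(["peer", "transit"], measurements, now=110.0))  # transit
```

With fresh measurements the lowest-RTT egress wins; when measurements go stale, the controller simply defers to the routing protocol’s choice, so bad or missing data cannot blackhole traffic.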
Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.
A few items of note from this research.
First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.
Second, there is a fair amount of additional complexity in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase performance, this is not always true. It likely is for many hyperscalers, whose business relies largely on engagement. Given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.
Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities providers allow their customers to use to shape the direction of traffic flows to optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system—if this kind of wrestling match with performance is valuable in your environment.
Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”
Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops, or that reduces performance in the face of negative feedback loops. Hysteresis, the tendency to cause a performance problem in the process of reacting to a performance problem, must also be carefully considered when designing such a system.
The Bottom Line
Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can do so involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.
Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.
BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.
The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained, and the peering shutdown has the minimal possible effect. The normal way of shutting down a link for maintenance would be for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC 8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).
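The mechanism amounts to a narrowly scoped filter: drop only BGP’s TCP segments (port 179) on the link under maintenance, leave all other traffic flowing, and let the hold timers expire so the sessions fail and traffic drains. A sketch of that match logic follows; the function and parameter names are mine, not the draft’s, and a real deployment would express this as an interface ACL rather than code.

```python
BGP_PORT = 179  # well-known TCP port for BGP

def cull_filter_drops(protocol, src_port, dst_port):
    """Session culling: drop only TCP segments to or from the BGP port
    on the link being drained; all other traffic passes untouched."""
    return protocol == "tcp" and BGP_PORT in (src_port, dst_port)
```

Once the filter is in place, the operator waits at least one hold time (often 90 seconds by default) plus a drain period before taking the link down, so the sessions time out and traffic shifts to other paths first.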
Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP session has been shut down, whether it is coming back up again, how long the shutdown is expected to last, etc.
Graceful Restart (GR) is a long-available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router to switch to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted, through the NOTIFICATION/session restart process. The same processing takes place for a hold timer timeout in BGP.
The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.
Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.
In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise B as the next hop towards the destination (for instance, a route server would not normally advertise itself as the next hop towards a destination). The problem with this situation is that B might not have the same capabilities as A. If B, for instance, cannot forward IPv6 traffic, the situation shown in the illustration would clearly not work.
To resolve this, BGP Next-Hop dependent capabilities allows a speaker to advertise the capabilities of these alternate next hops to peered BGP speakers.
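The receiver-side logic this enables can be sketched as a simple check: before using a route whose next hop is a third party, verify the advertised capabilities of that next hop cover the route’s address family. The capability names and data structures here are illustrative placeholders, not the draft’s actual encoding.

```python
def route_usable(route_afi, nexthop, nexthop_capabilities):
    """Only use a route via a third-party next hop if that next hop has
    advertised the capability to forward the route's address family.
    nexthop_capabilities maps next hop -> set of supported AFIs."""
    return route_afi in nexthop_capabilities.get(nexthop, set())

# B has only advertised IPv4 forwarding, so an IPv6 route with B as its
# next hop should not be used.
caps = {"A": {"ipv4", "ipv6"}, "B": {"ipv4"}}
print(route_usable("ipv6", "B", caps))  # False
```

An unknown next hop (one that has advertised nothing) conservatively fails the check, which matches the problem the draft is trying to solve: never forwarding through a next hop whose capabilities you cannot confirm.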
I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.
Yet another protocol episode over at the Network Collective. This time, Nick, Jordan, Eyvonne and I talk about BGP security.
Nick Russo and I stopped by the Network Collective last week to talk about BGP traffic engineering—and in the process I confused BGP deterministic MED and always compare MED. I’ve embedded the video below.