Research: Facebook’s Edge Fabric
The Internet has changed dramatically over the last ten years; more than 70% of Internet traffic is now served by just ten Autonomous Systems (ASes), reshaping the physical topology of the Internet into more of a hub-and-spoke design than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed it as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between server and user, the authors of the paper under review today argue it is not enough.
Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI: https://doi.org/10.1145/3098822.3098853
Why is this not enough? The authors point to two problems with BGP, the routing protocol that ties the Internet together. First, BGP is not capacity-aware. It is important to remember that BGP is focused on policy rather than capacity; the authors report finding many instances where the path preferred by BGP policy cannot support the capacity required to deliver video services. Second, BGP is not performance-aware. The selection criteria BGP uses, such as MED and Local Pref, do not correlate with performance.
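To make this concrete, here is a minimal sketch, in Python rather than any router's actual code, of the kind of comparison BGP's best path selection performs. Every input to the comparison is policy or topology (local preference, AS path length, MED); nothing reflects link capacity, utilization, or measured latency. The structure and values below are purely illustrative.

```python
# Simplified sketch of BGP best-path comparison: policy first, never performance.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CandidateRoute:              # hypothetical structure for illustration
    peer: str
    local_pref: int                # set by local policy (e.g., prefer peers)
    as_path: List[str] = field(default_factory=list)
    med: int = 0                   # Multi-Exit Discriminator, set by the neighbor

def better(a: CandidateRoute, b: CandidateRoute) -> CandidateRoute:
    """Pick the 'best' route using only the static policy attributes."""
    if a.local_pref != b.local_pref:
        return a if a.local_pref > b.local_pref else b
    if len(a.as_path) != len(b.as_path):
        return a if len(a.as_path) < len(b.as_path) else b
    # MED is only compared between routes learned from the same neighboring AS
    if a.as_path and b.as_path and a.as_path[0] == b.as_path[0] and a.med != b.med:
        return a if a.med < b.med else b
    return a  # further tie-breakers (router ID, etc.) omitted

peer_route = CandidateRoute("peer-isp", local_pref=200, as_path=["64500"])
transit_route = CandidateRoute("transit", local_pref=100, as_path=["64510", "64500"])
print(better(peer_route, transit_route).peer)  # always the peer, congested or not
```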
Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.
The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence (PoPs), which are essentially private Content Delivery Network (CDN) nodes, compute and caching pushed out to the edge, and hence as close to users as possible. Each PoP reaches users through a mix of peering and transit connections across the public ‘net. The problem Facebook is trying to solve is not last mile connectivity itself, but rather how traffic leaving these PoPs toward users is steered across those peering and transit connections.
The authors begin with the observation that, left to its own decision process, BGP will distribute traffic across the available peers without any regard for how congested each one is. This is not a surprising observation. In fact, there was at least one last mile provider that chose among its upstreams based on congestion in near real time. This capability was similar to the concept behind Cisco’s Performance Routing (PfR), which was later folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.
The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be the shortest physical or logical path as well, and hence the path with the lowest end-to-end delay. In the same way, local preference is normally set to prefer peer connections over upstream or transit connections. This should mean traffic takes a shorter path through a peer connected to the destination network, rather than traveling up through a transit provider and then back down to the connected network. It should also keep traffic on more lightly loaded last mile provider networks rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance rather than enhance it; sometimes it is better to push traffic to a transit provider, rather than to a directly connected peer.
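A toy example, with made-up numbers, shows the divergence the authors measured: the direct peer wins on BGP’s static proxies even when the transit path is currently faster.

```python
# Illustrative only: the peer path wins on policy proxies (local pref, AS path
# length) even though a hypothetical measurement favors the transit path.
peer_path = {"name": "direct peer", "local_pref": 200, "as_path_len": 1, "measured_rtt_ms": 95}
transit_path = {"name": "via transit", "local_pref": 100, "as_path_len": 2, "measured_rtt_ms": 40}

policy_choice = max((peer_path, transit_path),
                    key=lambda p: (p["local_pref"], -p["as_path_len"]))
performance_choice = min((peer_path, transit_path), key=lambda p: p["measured_rtt_ms"])

print("policy picks:", policy_choice["name"])               # direct peer
print("measurement prefers:", performance_choice["name"])   # via transit
```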
How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—
The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?
Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
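A rough sketch of this layered arrangement might look something like the following; in practice the candidate set would come from the BGP RIB and the measurements from real probes or client-side timings, both of which are stubbed out here with placeholder functions.

```python
# Sketch of a layered control plane: BGP supplies the loop-free candidate
# egress paths, and a separate controller chooses among them using near
# real time measurements. All names and data sources are hypothetical.
import random
from typing import Dict, List

def candidate_paths(prefix: str) -> List[str]:
    # Stand-in for the set of valid, loop-free paths BGP has already found.
    return ["peer-A", "peer-B", "transit-C"]

def measure_rtt_ms(prefix: str, egress: str) -> float:
    # Stand-in for real measurements (probes, client timings, etc.).
    return random.uniform(10, 80)

def choose_egress(prefix: str) -> str:
    """Pick the best-performing egress among the options BGP discovered."""
    measurements: Dict[str, float] = {
        egress: measure_rtt_ms(prefix, egress) for egress in candidate_paths(prefix)
    }
    return min(measurements, key=measurements.get)

print(choose_egress("203.0.113.0/24"))
```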
Facebook’s solution to this problem is to overlay an SDN-ish controller on top of BGP. Unlike many SD-WAN solutions, their approach does not involve tunneling. Rather, they adjust standard BGP attributes in near real time, based on currently measured congestion and performance. The paper goes on to describe their system, which uses nothing but standard BGP attributes to steer traffic onto higher performance paths through the ‘net.
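While the details are in the paper, the general shape of such an override is easy to sketch: when the preferred egress for a prefix nears capacity, pick an alternate from the remaining BGP paths and give it a higher local preference so the ordinary BGP decision process selects it. The thresholds, names, and announcement format below are illustrative assumptions, not Facebook’s implementation.

```python
# Sketch of a controller-generated override, under assumed thresholds and names.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EgressPath:
    interface: str
    utilization: float     # fraction of link capacity currently in use
    local_pref: int        # the statically configured preference

OVERRIDE_LOCAL_PREF = 300  # hypothetical value, higher than any static policy

def build_override(prefix: str, paths: List[EgressPath]) -> Optional[dict]:
    preferred = max(paths, key=lambda p: p.local_pref)
    if preferred.utilization < 0.95:        # illustrative "near capacity" threshold
        return None                         # leave BGP's choice alone
    alternates = [p for p in paths if p is not preferred]
    if not alternates:
        return None
    detour = min(alternates, key=lambda p: p.utilization)
    # This dict stands in for the announcement the controller would inject
    # toward the peering router (for instance, over a standard BGP session).
    return {"prefix": prefix, "egress": detour.interface,
            "local_pref": OVERRIDE_LOCAL_PREF}

paths = [EgressPath("peer-A", 0.97, 200), EgressPath("transit-C", 0.40, 100)]
print(build_override("198.51.100.0/24", paths))
```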
A few items of note from this research.
First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.
Second, there is a fair amount of additional complexity involved in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase performance, this is not always true. It likely is for many hyperscalers, whose business relies largely on engagement; given the directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.
Third, you can replicate this kind of performance-based routing in your own network by building a measurement system and then using the communities providers allow their customers to set to shape the direction of traffic flows. This might not work in every case, but it can give you a fair start on a similar system, if this kind of wrestling match with performance is valuable in your environment.
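A home-grown version might look something like the sketch below; the community values, provider names, and measurement function are placeholders, since every provider publishes its own action communities and you would supply your own measurements.

```python
# Sketch of a do-it-yourself approach: measure each prefix across the available
# exits, then choose the action community to attach. All values are placeholders.
from typing import Dict

# Hypothetical mapping from "exit to prefer" to a provider action community.
ACTION_COMMUNITIES: Dict[str, str] = {
    "provider-A": "64500:80",
    "provider-B": "64510:80",
}

def measured_rtt_ms(prefix: str, exit_name: str) -> float:
    # Replace with real probes or passive measurements in your environment.
    samples = {"provider-A": 42.0, "provider-B": 55.0}
    return samples[exit_name]

def community_for_best_exit(prefix: str) -> str:
    best_exit = min(ACTION_COMMUNITIES, key=lambda e: measured_rtt_ms(prefix, e))
    return ACTION_COMMUNITIES[best_exit]

print(community_for_best_exit("192.0.2.0/24"))  # e.g., "64500:80"
```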
Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”
Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops, or that reduces performance in the face of negative feedback loops. Oscillation, where reacting to a performance problem causes a new performance problem, must also be carefully considered when designing such a system; some form of dampening or hysteresis is normally needed to keep the control loop stable.
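One common way to guard against this is to require both a minimum improvement and a hold-down period before moving traffic; the sketch below shows the idea with illustrative thresholds and timer values.

```python
# Sketch of dampening for a performance-based controller: only move a prefix
# when the measured gain is significant and the hold-down timer has expired,
# so the controller does not flap between exits. Values are illustrative.
import time
from typing import Dict

IMPROVEMENT_THRESHOLD_MS = 10.0   # ignore small measured differences
HOLD_DOWN_SECONDS = 300           # minimum time between moves for a prefix

_last_move: Dict[str, float] = {}

def should_move(prefix: str, current_rtt_ms: float, candidate_rtt_ms: float) -> bool:
    if candidate_rtt_ms >= current_rtt_ms - IMPROVEMENT_THRESHOLD_MS:
        return False                       # not enough benefit to justify churn
    if time.time() - _last_move.get(prefix, 0.0) < HOLD_DOWN_SECONDS:
        return False                       # still in hold-down; avoid flapping
    _last_move[prefix] = time.time()
    return True

print(should_move("203.0.113.0/24", current_rtt_ms=60.0, candidate_rtt_ms=45.0))
```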
The Bottom Line
Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.
Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.