Hosts Roopa Prabhu and Pete Lumbis are joined by a special guest to the podcast, Russ White! The group come together virtually to discuss what we should think about when it comes to routing protocols in the datcenter. What are the tradeoffs when using traditional protocols like OSPF or BGP? What about new protocols like RIFT or a hybrid approach with things like BGP-link state? Spoiler alert: it depends.
The world of provider interconnection is a little … “mysterious” … even to those who work at transit providers. The decision of who to peer with, whether such peering should be paid, settlement-free, open, and where to peer is often cordoned off into a separate team (or set of teams) that don’t seem to leak a lot of information. A recent paper on current interconnection practices published in ACM SIGCOMM sheds some useful light into this corner of the Internet, and hence is useful for those just trying to understand how the Internet really works.
To write the paper, the authors sent requests to fill out a survey through a wide variety of places, including NOG mailing lists and blogs. They ended up receiving responses from all seven regions (based on the RIRs, who control and maintain Internet numbering resources like AS numbers and IP addresses), 70% from ISPs, 14% from content providers, and 7% from “Enterprise” and infrastructure operators. Each of these kinds of operators will have different interconnection needs—I would expect ISPs to engage in more settlement-free peering (with roughly equal traffic levels), content providers to engage in more open (settlement-free connections with unequal traffic levels), IXs to do mostly local peering (not between regions), and “enterprises” to engage mostly in paid peering. The survey also classified respondents by their regional footprint (how many regions they operate in) and size (how many customers they support).
The survey focused on three facets of interconnection: time required to form a connection, the reasons given for interconnecting, and parameters included in the peering agreement. These largely describe the status quo in peering—interconnections as they are practiced today. As might be expected, connections at IXs are the quickest to form. Since IXs are normally set up to enable peering; it makes sense that the preset processes and communications channels enabled by an IX would make the peering process a lot faster. According to the survey results, the most common timeframe to complete peering is days, with about a quarter taking weeks.
Apparently, the vast majority (99%!) of peering arrangements are by “handshake,” which means there is no legal contract behind them. This is one reason Network Operator Groups (NOGs) are so important (a topic of discussion in the Hedge 31, dropping next week); the peering workshops are vital in building and keeping the relationships behind most peering arrangements.
On-demand connectivity is a new trend in inter-AS peering. For instance, interxion recently worked with LINX and several other IXs to develop a standard set of APIs allowing operators to peer with one another in a standard way, often reducing the technical side of the peering process to minutes rather than hours (or even days). Companies are moving into this space, helping operators understand who they should peer with, and building pre-negotiated peering contracts with many operators. While current operators seem to be aware of these options, they do not seem to be using these kinds of services yet.
While this paper is interesting, it does leave many corners of the inter-AS peering world un-exposed. For instance—I would like to know how correct my assumptions are about the kinds of peering used by each of the different classes of providers is, and whether there are regional differences in the kinds of peering. While its interesting to survey the reasons providers pursue peering, it would be interesting to understand the process of making a peering determination more fully. What kinds of tools are available, and how are they used? These would be useful bits of information for an operator who only connects to the Internet, rather than being part of the Internet infrastructure (perhaps a “non-infrastructure operator,” rather than “enterprise”) in understanding how their choice of upstream provider can impact the performance of their applications and network.
Path Computation Element (PCE) is designed to allow the computation of paths for MPLS and GMPLS Point to Point and
Point to Multi-point Traffic Engineered LSPs. Adrian Farrel, who was involved in the early work on designing an specifying PCE, joins us in this episode of the History of Networking to describe the purposes, process, and challenges involved. You can read more about Adrian on his personal home page, and about PCE on the IETF WG page.
There was a time when Software Defined Networking was going to take over the entire networking world—just like ATM, FDDI, and … so many others before. Whatever happened to SDN, anyway? What is its enduring legacy in the world of network engineering? Terry Slattery, Tom Ammon, and Russ White gather at the hedge to have a conversation about whatever happened to SDN?
Multicast is, at best, difficult to deploy in large scale networks—PIM sparse and BIDIR are both complex, adding large amounts of state to intermediate devices. In the worst case, there is no apparent way to deploy any existing version of PIM, such as large-scale spine and leaf networks (variations on the venerable Clos fabric). BEIR, described in RFC8279, aims to solve the per-device state of traditional multicast.
In this network, assume A has some packet that needs to be delivered to T, V, and X. A could generate three packets, each one addressed to one of the destinations—but replicating the packet at A is wastes network resources on the A->B link, at least. Using PIM, these three destinations could be placed in a multicast group (a multicast address can be created that describes T, V, and X as a single destination). After this, a reverse shortest path tree can be calculated from each of the destinations in the group towards the source, A, and the correct forwarding state (the outgoing interface list) be installed at each of the routers in the network (or at least along the correct paths). This, however, adds a lot of state to the network.
BIER provides another option.
Returning to the example, if A is the sender, B is called the Bit-Forwarding Ingress Router, or BFIR. When B receives this packet from A, it determines T, V, and X are the correct destinations, and examines its local routing table to determine T is connected to M, V, is connected to N, and X is connected to R. Each of these are called a Bit-Forwarding Egress Router (BFER) in the network; each has a particular bit set aside in a bit field. For instance, if the network operator has set up an 8-bit BIER bit field, M might be bit 3, N might be bit 4, and R might be bit 6.
Given this, A will examine its local routing table to find the shortest path to M, N, and R. It will find the shortest path to M is through D, the shortest path to N is through C, and the shortest path to R is through C. A will then create two copies of the packet (it will replicate the packet). It will encapsulate one copy in a BIER header, setting the third bit in the header, and send the packet on to D. It will encapsulate the second copy into a BIER header, as well, setting the fourth and sixth bits in the header, and send the packet on to C.
D will receive the packet, determine the destination is M (because the third bit is set in the BIER header), look up the shortest path to M in its local routing table, and forward the packet. When M receives the packet, it pops the BIER header, finds T is the final destination address, and forwards the packet.
When C receives the packet, it finds there are two destinations, R and N, based on the bits set in the BIER header. The shortest path to N is through H, so it clears bit 6 (representing R as a destination), and forwards the packet to H. When H receives the packet, it finds N is the destination (because bit 4 is set in the BIER header), looks up N in its local routing table, and forwards the packet along the shortest path towards N. N will remove the BIER header and forward the packet to V. C will also forward the packet to G after clearing bit 4 (representing N as a destination BFER). G will examine the BIER header, find that R is a destination, examine its local routing table for the shortest path to R, and forward the packet. R, on receiving the packet, will strip the BIER header and forward the packet to X.
While the BFIR must know how to translate a single packet into multiple destinations, either through a group address, manual configuration, or some other signaling mechanism, the state at the remaining routers in the network is a simple table of BIER bitfield mappings to BFER destination addresses. It does not matter how large the tree gets, the amount of state is held to the number of BFERs at the network edge. This is completely different from any version of PIM, where the state at every router along the path depends on the number of senders and receivers.
The tradeoff is the hardware of the BIER Forwarding Routers (BFRs) must be able to support looking up the destination based on the bit set in the BIER bitfield in the header. Since the size of the BIER bitfield is variable, this is… challenging. One option is to use an MPLS label stack as a way to express the BIER header, which leverages existing MPLS implementations. Still, though, MPLS label depth is often limited, the ability to forward to multiple destinations is challenging, and the additional signaling required is more than trivial.
BIER is an interesting technology; it seems ideal for use in highly meshed data center fabrics, and even in longer haul and wireless networks where bandwidth is at a premium. Whether or not the hardware challenges can be overcome is a question yet to be answered.
The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (AS’), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue this is not enough.
Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI: https://doi.org/10.1145/3098822.3098853
Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.
Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.
The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) compute and edge pushed to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit provided through peering across the public ‘net. The problem Facebook is trying to solve is not the last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.
The authors begin with the observation that if left to its own decision process, BGP will evenly distribute traffic across all available peers, even though each peer is actually different levels of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that used their ability to choose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Based Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.
The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be shortest physical or logical path, as well, and hence the path with the lowest end-to-end time. In the same way, local preference is normally set to prefer peer connections rather than upstream or transit connections. This should mean traffic will take a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. This should result in traffic passing through less lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance, rather than enhancing it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.
How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—
The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?
Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.
A few items of note from this research.
First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.
Second, there is a fair amount of additional complexity in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase complexity, this is not always true. It likely is for many hyperscalers, who’s business relies largely on engagement. Given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.
Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities operators providers allow their customers to use to shape the direction of traffic flows to optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system—if this kind of wrestling match with performance is valuable in your environment.
Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”
Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops or reduces performance in the face of negative feedback loops. Hysteresis, the tendency to cause a performance problem in the process of reacting to a performance problem, must be carefully considered when designing such as system, as well.
The Bottom Line
Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.
Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.
While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like there is no single point of failure. This common answer, however, does not go “far enough” in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the time to converge can be reduced enough to offset the higher path availability.
In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.
This post will look at a paper that examines robustness more deeply, Robustness in Complexity Systems, by Steven Gribble. While this is an older paper—it was written in 2000—it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster based Distributed Data Structure (DDS). A more convenient way for readers to think of this is as a distributed database. Several problems discovered when building and deploying this DDS are considered, including—
- A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After the record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coalesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is their initial “guess” at a a good timer was ultimately an order of a magnitude too small, causing some of the nodes to “fall behind” other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to “take up the slack,” causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in a distributed system.
- A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.
Gribble draws several lessons from these, and other, outages in the system.
First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.
Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”
Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measurable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.
Fourth, Gribbles says adaptivity must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.
This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.