OTHER ROUTING
The Hedge 20: Whatever Happened to Software Defined Networking
There was a time when Software Defined Networking was going to take over the entire networking world—just like ATM, FDDI, and … so many others before. Whatever happened to SDN, anyway? What is its enduring legacy in the world of network engineering? Terry Slattery, Tom Ammon, and Russ White gather at the hedge to have a conversation about what happened to SDN.
BIER Basics
Multicast is, at best, difficult to deploy in large-scale networks—PIM sparse mode and BIDIR are both complex, adding large amounts of state to intermediate devices. In the worst case, such as large-scale spine-and-leaf networks (variations on the venerable Clos fabric), there is no apparent way to deploy any existing version of PIM at all. BIER, described in RFC 8279, aims to eliminate the per-device state of traditional multicast.
In this network, assume A has some packet that needs to be delivered to T, V, and X. A could generate three packets, each one addressed to one of the destinations—but replicating the packet at A wastes network resources, at least on the A->B link. Using PIM, these three destinations could be placed in a multicast group (a multicast address can be created that describes T, V, and X as a single destination). After this, a reverse shortest path tree can be calculated from each of the destinations in the group towards the source, A, and the correct forwarding state (the outgoing interface list) can be installed at each of the routers in the network (or at least along the correct paths). This, however, adds a lot of state to the network.
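To make the state problem concrete, here is a rough sketch (Python; the group address and interface names are made up) of the per-router (S,G) state a PIM tree for this example would leave behind:

```python
# Rough sketch of the per-router (S,G) state a PIM tree for this example
# would leave behind (only some routers shown; the group address and
# interface names are hypothetical).
pim_state = {
    "B": {("A", "ff3e::1"): ["to_C", "to_D"]},   # replication point
    "C": {("A", "ff3e::1"): ["to_G", "to_H"]},   # replication point
    "D": {("A", "ff3e::1"): ["to_M"]},
    "H": {("A", "ff3e::1"): ["to_N"]},
    "G": {("A", "ff3e::1"): ["to_R"]},
}

# Every additional group (or source) adds another entry to every router
# along its tree -- this is the state BIER sets out to remove.
total_entries = sum(len(entries) for entries in pim_state.values())
print(f"{total_entries} (S,G) entries on intermediate routers for one group")
```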
BIER provides another option.
Returning to the example, if A is the sender, B is called the Bit-Forwarding Ingress Router, or BFIR. When B receives this packet from A, it determines T, V, and X are the correct destinations, and examines its local routing table to determine T is connected to M, V is connected to N, and X is connected to R. Each of these is called a Bit-Forwarding Egress Router (BFER) in the network; each has a particular bit set aside in a bit field. For instance, if the network operator has set up an 8-bit BIER bit field, M might be bit 3, N might be bit 4, and R might be bit 6.
Given this, B will examine its local routing table to find the shortest path to M, N, and R. It will find the shortest path to M is through D, and the shortest path to both N and R is through C. B will then create two copies of the packet (it will replicate the packet). It will encapsulate one copy in a BIER header, setting the third bit in the header, and send the packet on to D. It will encapsulate the second copy in a BIER header as well, setting the fourth and sixth bits in the header, and send the packet on to C.
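A minimal sketch of this ingress behavior, using the bit assignments and next hops from the example (the data structures are illustrative, not an implementation of RFC 8279):

```python
# Hedged sketch of BFIR behavior at B (not an implementation of RFC 8279).
# Bit positions assigned by the operator to each BFER (from the example):
bfer_bit = {"M": 3, "N": 4, "R": 6}

# Best next hop toward each BFER, taken from the unicast routing table:
next_hop_to = {"M": "D", "N": "C", "R": "C"}

def bfir_replicate(destinations):
    """Group destination BFERs by next hop and build one BIER
    bitstring per copy of the packet."""
    copies = {}
    for bfer in destinations:
        nh = next_hop_to[bfer]
        copies[nh] = copies.get(nh, 0) | (1 << (bfer_bit[bfer] - 1))
    return copies

# The packet for T, V, and X maps to BFERs M, N, and R:
for nh, bits in bfir_replicate(["M", "N", "R"]).items():
    print(f"send copy to {nh} with bitstring {bits:08b}")
# -> one copy to D with bit 3 set, one copy to C with bits 4 and 6 set
```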
D will receive the packet, determine the destination is M (because the third bit is set in the BIER header), look up the shortest path to M in its local routing table, and forward the packet. When M receives the packet, it pops the BIER header, finds T is the final destination address, and forwards the packet.
When C receives the packet, it finds there are two destinations, R and N, based on the bits set in the BIER header. The shortest path to N is through H, so it clears bit 6 (representing R as a destination), and forwards the packet to H. When H receives the packet, it finds N is the destination (because bit 4 is set in the BIER header), looks up N in its local routing table, and forwards the packet along the shortest path towards N. N will remove the BIER header and forward the packet to V. C will also forward the packet to G after clearing bit 4 (representing N as a destination BFER). G will examine the BIER header, find that R is a destination, examine its local routing table for the shortest path to R, and forward the packet. R, on receiving the packet, will strip the BIER header and forward the packet to X.
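The per-hop forwarding decision can be sketched the same way; here is a hypothetical Bit Index Forwarding Table (BIFT) at C, with one entry per BFER regardless of how many flows cross the router:

```python
# Hedged sketch of the per-hop BIER forwarding decision at router C.
bift_at_C = {
    # bit -> (BFER, next hop toward it from C)
    3: ("M", "D"),   # hypothetical next hop; this bit is not set in the packet
    4: ("N", "H"),
    6: ("R", "G"),
}

def bfr_forward(bitstring, bift):
    """For each neighbor that leads to at least one BFER whose bit is
    set, send one copy carrying only the bits reachable through it."""
    copies = {}
    for bit, (_bfer, nh) in bift.items():
        if bitstring & (1 << (bit - 1)):
            copies[nh] = copies.get(nh, 0) | (1 << (bit - 1))
    return copies

incoming = 0b00101000           # bits 4 and 6 set by the BFIR
for nh, bits in bfr_forward(incoming, bift_at_C).items():
    print(f"forward to {nh} with bitstring {bits:08b}")
# -> to H with only bit 4 set, to G with only bit 6 set
```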
While the BFIR must know how to translate a single packet into multiple destinations, either through a group address, manual configuration, or some other signaling mechanism, the state at the remaining routers in the network is a simple table mapping BIER bitfield positions to BFER destination addresses. No matter how large the tree grows, the amount of state is bounded by the number of BFERs at the network edge. This is completely different from any version of PIM, where the state at every router along the path depends on the number of senders and receivers.
The tradeoff is that the hardware of the Bit-Forwarding Routers (BFRs) must be able to support looking up the destinations based on the bits set in the BIER bitfield in the header. Since the size of the BIER bitfield is variable, this is… challenging. One option is to use an MPLS label stack as a way to express the BIER header, which leverages existing MPLS implementations. Still, MPLS label depth is often limited, the ability to forward to multiple destinations is challenging, and the additional signaling required is more than trivial.
BIER is an interesting technology; it seems ideal for use in highly meshed data center fabrics, and even in longer haul and wireless networks where bandwidth is at a premium. Whether or not the hardware challenges can be overcome is a question yet to be answered.
Research: Facebook’s Edge Fabric
The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (AS’), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue this is not enough.
Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI: https://doi.org/10.1145/3098822.3098853
Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.
Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.
The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) compute and caching pushed out to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit and peering across the public ‘net. The problem Facebook is trying to solve is not last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.
The authors begin with the observation that, left to its own decision process, BGP will evenly distribute traffic across all available peers, even though those peers may be experiencing very different levels of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that used its ability to choose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.
The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be the shortest physical or logical path as well, and hence the path with the lowest end-to-end delay. In the same way, local preference is normally set to prefer peer connections over upstream or transit connections. This should mean traffic takes a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. This should result in traffic passing through more lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance rather than enhance it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.
How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—
The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?
Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
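As a rough illustration of this layering (not Facebook's system), here is a sketch in which the routing protocol supplies the loop-free candidate exits and a separate measurement layer picks among them; all names and numbers are made up:

```python
# Sketch of a layered control plane: BGP (or any routing protocol) hands
# over the loop-free candidates; a measurement layer chooses among them.

candidates = {
    # destination prefix -> loop-free exits learned from the routing protocol
    "203.0.113.0/24": ["peer_A", "peer_B", "transit_C"],
}

measurements = {
    # near-real-time measurements per exit: (median RTT ms, utilization %)
    "peer_A":    (42.0, 96.0),   # preferred by policy, but congested
    "peer_B":    (55.0, 40.0),
    "transit_C": (47.0, 35.0),
}

def choose_exit(prefix, max_util=90.0):
    """Among the loop-free candidates, drop exits above the utilization
    threshold, then pick the lowest measured RTT."""
    usable = [c for c in candidates[prefix]
              if measurements[c][1] <= max_util]
    if not usable:                      # nothing under threshold: fall back
        usable = candidates[prefix]     # to the routing protocol's choice
    return min(usable, key=lambda c: measurements[c][0])

print(choose_exit("203.0.113.0/24"))    # -> transit_C, despite policy
```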
Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.
A few items of note from this research.
First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.
Second, there is a fair amount of additional complexity involved in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase performance, this is not always true. It likely is for many hyperscalers, whose business relies largely on engagement; given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.
Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities providers allow their customers to set to shape the direction of traffic flows and optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system—if this kind of wrestling match with performance is valuable in your environment.
Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”
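Tying the third point to something concrete, here is a hedged sketch of turning per-prefix measurement decisions into provider communities. The community values are placeholders; the communities an operator actually honors have to come from that operator's documentation.

```python
# Hedged sketch: translate per-prefix exit decisions into BGP communities
# a provider might honor. The community values below (64500:*) are
# placeholders, not any real provider's scheme.

ACTIONS = {
    "prefer": "64500:100",    # e.g., ask the provider to raise local pref
    "depref": "64500:70",     # e.g., ask the provider to lower local pref
}

def communities_for(decisions):
    """decisions: prefix -> action chosen by the measurement system.
    Returns prefix -> community string to attach to the advertisement."""
    return {prefix: ACTIONS[action] for prefix, action in decisions.items()}

# Hypothetical output of a measurement system:
decisions = {"198.51.100.0/24": "depref", "203.0.113.0/24": "prefer"}
for prefix, community in communities_for(decisions).items():
    print(f"{prefix} -> tag with community {community}")
```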
Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops, or that reduces performance in the face of negative feedback loops. Hysteresis, the tendency to cause a performance problem in the process of reacting to a performance problem, must be carefully considered when designing such a system, as well.
The Bottom Line
Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.
Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.
Research: Robustness in Complex Systems
While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like there is no single point of failure. This common answer, however, does not go far enough in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the increase in convergence time can offset the higher path availability.
In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.
This post will look at a paper that examines robustness more deeply, Robustness in Complex Systems, by Steven Gribble. While this is an older paper—it was written in 2000—it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster-based Distributed Data Structure (DDS); a more convenient way for readers to think of this is as a distributed database. Several problems discovered while building and deploying this DDS are considered, including—
- A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After a record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coalesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is that their initial "guess" at a good timer value was ultimately an order of magnitude too small, causing some of the nodes to "fall behind" other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to "take up the slack," causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in distributed systems.
- A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.
Gribble draws several lessons from these, and other, outages in the system.
First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.
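A token bucket policer is one simple way to picture this kind of edge admission control; here is a minimal sketch, with arbitrary rate and burst parameters:

```python
import time

class TokenBucketPolicer:
    """Minimal token-bucket policer sketch: admit a packet only if
    enough tokens (bytes of credit) have accumulated."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # bytes of credit added per second
        self.burst = burst_bytes        # maximum accumulated credit
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def admit(self, packet_len):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return True                 # conforming: forward the packet
        return False                    # exceeding: drop (police) it

# Police a hypothetical flow to 1 Mb/s with a 15 KB burst allowance:
policer = TokenBucketPolicer(rate_bps=1_000_000, burst_bytes=15_000)
print(policer.admit(1500))             # True while credit remains
```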
Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”
Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measurable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.
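One hedged way to express "knowing what normal looks like" in code is a simple moving baseline with a deviation threshold; the metric, samples, and thresholds here are arbitrary:

```python
class Baseline:
    """Track an exponentially weighted moving average of a metric and
    flag samples that stray too far from it -- a toy stand-in for
    'knowing what normal looks like'."""

    def __init__(self, alpha=0.2, tolerance=0.5):
        self.alpha = alpha           # weight given to each new sample
        self.tolerance = tolerance   # allowed fractional deviation
        self.mean = None

    def observe(self, value):
        if self.mean is None:
            self.mean = value
            return False
        deviates = abs(value - self.mean) > self.tolerance * self.mean
        self.mean = (1 - self.alpha) * self.mean + self.alpha * value
        return deviates

# Hypothetical interface utilization samples (percent):
b = Baseline()
for sample in [30, 32, 31, 29, 33, 70]:
    if b.observe(sample):
        print(f"sample {sample} is outside the expected range")
```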
Fourth, Gribble says adaptivity must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.
This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.
History of Networking: Tony P on BIER
Deconfusing the Static Route
Configuring a static route is just like installing an entry directly in the routing table (or the RIB).
I have been told this many times in my work as a network engineer by operations people, coders, designers, and many other folks. The problem is that, in some routing table implementations, it is all too true. To understand why, it is best to take a short tour through how a typical RIB interacts with a routing protocol. Assume BGP, or IS-IS, learns about a new route that needs to be installed in the RIB:
- The RIB into which the route needs to be installed is somehow determined. This might be through some sort of special tagging, or perhaps each routing process has a separate RIB into which it installs routes, etc. In any case, the routing process must determine which RIB the route should be installed in.
- The next hop is looked up in the RIB to determine whether it is reachable. A route cannot be installed if there is no next hop through which to forward the traffic towards the described destination.
- The RIB interface is called to install the route.
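Before looking at what happens inside the RIB, here is a rough sketch of these three steps from the routing process's point of view; the RIB interface is hypothetical, not any vendor's or FR Routing's API:

```python
# Hedged sketch (hypothetical interfaces, not any real RIB API) of the
# three steps above, from the routing process's point of view.

class Rib:
    def __init__(self):
        self.routes = {}                    # prefix -> route entry

    def resolves(self, next_hop):
        # Toy next-hop check; a real RIB would perform a longest-prefix
        # match against the next hop address here.
        return any(r.get("connected") for r in self.routes.values())

    def install(self, route):
        self.routes[route["prefix"]] = route

ribs = {"default": Rib()}
ribs["default"].install({"prefix": "10.0.0.0/24", "connected": True})

def process_new_route(route):
    rib = ribs[route.get("vrf", "default")]      # 1. pick the right RIB
    if not rib.resolves(route["next_hop"]):      # 2. is the next hop reachable?
        return False
    rib.install(route)                           # 3. call the RIB interface
    return True

print(process_new_route({"prefix": "198.51.100.0/24", "next_hop": "10.0.0.1"}))
```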
The last step results in one of two possible reactions. The first is that the local RIB code compares any existing route to the new route, using the administrative distance and other factors (internal IS-IS routes should be preferred over external routes, eBGP routes should be preferred over iBGP routes, etc.) to decide which route should "win." This process can be quite complex, however, as the rules are different for each protocol, and can change between protocols. To prevent long switch statements that must be maintained in parallel with the routing protocol code, many RIB implementations instead use a set of callback functions to determine whether the existing route or the new route should be preferred.
In this diagram—
- The IS-IS process receives an LSP, calculates a new route based on the information (using SPF), and installs the route into the routing table.
- The RIB calls back to the owner of the current route, BGP, handing the new route to BGP for comparison with the route currently installed in the RIB.
- BGP finds the local copy of the route (rather than the version installed in the RIB) based on the supplied information, and determines the IS-IS route should win over the current BGP route. It sends this information to the RIB.
Using a callback system of this kind allows the “losing” routing protocol to determine if the new route should replace the current route. This might seem to be slower, but the reduced complexity in the RIB code is most often worth the tradeoff in time.
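Here is a hedged sketch of the callback pattern just described, using hypothetical interfaces rather than any real implementation:

```python
# Sketch of a RIB that defers the "which route wins?" decision to the
# protocol owning the currently installed route, via a callback.

class Route:
    def __init__(self, prefix, owner, distance):
        self.prefix = prefix
        self.owner = owner            # the protocol that produced the route
        self.distance = distance      # administrative distance

def bgp_compare(current, candidate):
    """Callback registered by BGP: return the route that should stay in
    the RIB. Here BGP simply defers to a lower administrative distance."""
    return candidate if candidate.distance < current.distance else current

class Rib:
    def __init__(self):
        self.table = {}               # prefix -> installed Route
        self.compare_cb = {}          # owner -> comparison callback

    def register(self, owner, callback):
        self.compare_cb[owner] = callback

    def install(self, route):
        current = self.table.get(route.prefix)
        if current is None:
            self.table[route.prefix] = route
            return
        # Ask the owner of the current route which one should win, rather
        # than keeping per-protocol rules inside the RIB itself.
        winner = self.compare_cb[current.owner](current, route)
        self.table[route.prefix] = winner

rib = Rib()
rib.register("bgp", bgp_compare)
rib.install(Route("2001:db8::/64", owner="bgp", distance=200))    # iBGP route
rib.install(Route("2001:db8::/64", owner="isis", distance=115))   # IS-IS route
print(rib.table["2001:db8::/64"].owner)    # -> isis, chosen by BGP's callback
```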
The static route is, in some implementations, an exception to this kind of processing. For instance, in old Cisco IOS code, the static route code was part of the RIB code. When you configured a static route, the code just created a new routing table entry and a few global variables to keep track of the manual configuration. FR Routing's implementation of the static route is like this today; you can take a look at the zebra_static.c file in the FR Routing code base to see how static routes are implemented.
However, there is current work being done to separate the static route from the RIB code; to create a separate static route process, so static routes are processed in the same way as routes learned from any other process.
Even though many implementations manage static routes as part of the RIB, you should still think of the static route as being like any other route, installed by any other process. Thinking about the static route as a "special case" causes many people to become confused about how routing really works. For instance—
Routing is really just another kind of configuration. After all, I can configure a route directly in the routing table with a static route.
I’ve also heard some folks say something like—
Software defined networks are just like DevOps. Configuring static routes through a script is just like using a southbound interface like I2RS or OpenFlow to install routing information.
The confusion here stems from the idea that static routes directly manipulate the RIB. This is an artifact of the way the code is structured, however, rather than a fact. The processing for static routes is contained in the RIB code, but that just means the code for determining which of a pair of routes wins, etc., is part of the RIB code itself, rather than residing in a separate process. The source of the routing information is different—a human configuring the device—but the actual processing is no different, even if that processing is mixed into the RIB code.
What about a controller that screen scrapes the CLI for link status to discover reachable destinations, calculates a set of best paths through the network based on this information, and then uses static routes to build a routing table in each device? This can be argued to be a “form” of SDN, but the key element is the centralized calculation of loop free paths, rather than the static routes.
The humble static route has caused a lot of confusion for network engineers. Clearly separating the function from the implementation can not only help you understand how and why static routes work, but also clarify the different components of the routing system, and what differentiates SDNs from distributed control planes.
Responding to Readers: Questions on Microloops
Two different readers, in two different forums, asked me some excellent questions about some older posts on microloops. Unfortunately, I didn't take down the names or forums when I noted the questions, but you know who you are! For this discussion, use the network shown below.
In this network, assume all link costs are one, and the destination is the 100::/64 IPv6 prefix connected to A at the top. To review, a microloop will form in this network when the A->B link fails:
- B will learn about the link failure
- B will send an updated router LSP or LSA towards D, with the A->B link removed
- At about the same time, B will recalculate its best path to 100::/64, so its routing and forwarding tables now point towards D as the best path
- D, in the meantime, receives the updated information, runs SPF, and installs the new routing information into its forwarding table, with the new path pointing towards E
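A quick sketch of why the timing of these steps matters; the per-router delays below are invented, but any gap between when B and D update their forwarding tables opens the same window:

```python
# Toy timeline of the convergence steps above; all times are made up.
# Each router's new next hop toward 100::/64 takes effect at fib_update.

events = {
    #        (new next hop, time its FIB entry is updated, in ms)
    "B": ("D", 20),    # B detects the failure and converges quickly
    "D": ("E", 120),   # D must wait for flooding + SPF + FIB install
}

def next_hop_at(router, t):
    new_nh, when = events[router]
    old_nh = {"B": "A", "D": "B"}[router]   # pre-failure next hops
    return new_nh if t >= when else old_nh

for t in (10, 50, 150):
    b_nh, d_nh = next_hop_at("B", t), next_hop_at("D", t)
    loop = b_nh == "D" and d_nh == "B"
    print(f"t={t:>3}ms  B->{b_nh}  D->{d_nh}  microloop={loop}")
# Between B's update (20ms) and D's (120ms), B points at D while D
# still points at B -- the microloop window.
```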
Between the third and fourth steps, B will be using D as its best path, while D is using B as its best path. Hence the microloop. The first question about microloops was—
Would BFD help prevent the microloop (or perhaps make it last a shorter period of time)?
Consider what happens if we have BFD running between A and B in this network. While A and B discover the failure faster—perhaps a lot faster—none of the other timing points change. No matter how fast A and B discover the link failure, B is still going to take some time to flood the change to the topology to D, and D is going to take some time to compute a new set of shortest paths, and install them into its local routing and forwarding tables.
Essentially, if you look at convergence as a four step process—
- Discovery
- Reporting
- Calculation
- Installation
Then you can see the microloop forms because different routers are “doing” steps 3 and 4 at different times. Discovery, which is what BFD is aimed at, is not going to change this dynamic. The second question was—
Can microloops occur on a link coming up?
For this one, let’s start in a different place. Assume, for a moment, that the A->B link is down. Now someone goes in and configures the A->B link to make it operational. At the moment the link comes up, B’s shortest path to 100::/64 is through D. When the new link comes up, B will learn about the new link, calculate a new shortest path tree, and then install the new route through A towards 100::/64. E will also need to calculate a new route, using B as its next hop.
The key point to consider is this: who tells E about this new path to the destination? It is B. To form a microloop, we need D to install a route through B towards 100::/64 before B does. This is theoretically possible in this situation, but unlikely, because D is dependent on B for information about this new path. Things would need to be pretty messed up for B to learn about the new path first, but not recalculate its shortest path tree and install the route before D can. So—while it is possible, it is not likely.
Thanks for sending these terrific questions in.