OTHER ROUTING
Thoughts on Open/R
Since Facebook has released their Open/R routing platform, there has been a lot of chatter around whether or not it will be a commercial success, whether or not every hyperscaler should use the protocol, whether or not this obsoletes everything in routing before this day in history, etc., etc. I will begin with a single point.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
Design is about tradeoffs. Protocol design is no different from any other design. Hence, we should expect that Open/R makes some tradeoffs. I know this might be surprising to some folks, particularly in the crowd that thinks every new routing system is going to be a silver bullet that solves every problem from the past, that the routing singularity has now occurred, etc. I’ve been in the world of routing since the early 1990’s, perhaps a bit before, and there is one thing I know for certain: if you understand the basics, you understand there is no routing singularity, and there never will be—at least not until someone produces a quantum wave routing protocol.
The reality is you always face one of two choices in routing: build a protocol specifically tuned to a particular set of situations, which means application requirements, topologies, etc., or build a general purpose protocol that “solves everything,” at some cost. BGP is becoming the latter, and is suffering for it. Open/R is an instance of the former.
Which means the interesting question is: what are they solving for, and how? Once you’ve answered this question, you can then ask: would this be useful in my network?
A large number of the points, or features, highlighted in the first blog post are well known routing constructions, so we can safely ignore them. For instance: IPv6 link local only, graceful restart, draining and undraining nodes, exponential backoff, carrying random information in the protocol, and link status monitoring. These are common features of many protocols today, so we don’t need to discuss them. There are a couple of interesting features, however, worth discussing.
Dynamic Metrics. EIGRP once had dynamic metrics, and they were removed. This simple fact always makes me suspicious when I see dynamic metrics touted as a protocol feature. Looking at the heritage of Open/R, however, dynamic metrics were probably added for one specific purpose: to support wireless networks. This functionality is, in fact, provided through DLEP, and supported in OLSR, MANET extended OSPF, and a number of other MANET control planes. Supporting DLEP and dynamic metrics based on radio information was discussed in the BABEL working group at the recent Singapore IETF, in fact, and the BABEL folks are working on integrating dynamic metrics for wireless. So this feature not only makes sense in the wireless world, it’s actually much more widespread than might be apparent if you are looking at the world from an “Enterprise” point of view.
But while this is useful, would you want this in your data center fabric? I’m not certain you would. I would argue dynamic metrics are actually counterproductive in a fabric. What you want, instead, is basic reachability provided by the distributed control plane (routing protocol), and some sort of controller that sits on top using an overlay sort of mechanism to do traffic engineering. You don’t want this sort of policy stuff in a routing protocol in a contained environment like a fabric.
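To make the tradeoff concrete, here is a minimal sketch (in Python, with thresholds and scaling invented for illustration) of deriving a link metric from radio signal quality while dampening small changes; this is the kind of hysteresis any dynamic metric scheme needs in order to avoid routing churn:

```python
# Toy sketch: derive a link metric from radio signal quality, with
# hysteresis so small fluctuations don't cause routing churn.
# Thresholds and scaling are invented for illustration.

def quality_to_metric(quality):
    """Map signal quality (0.0 to 1.0) to a metric: better quality, lower cost."""
    quality = max(0.01, min(1.0, quality))
    return int(100 / quality)

class DampenedMetric:
    def __init__(self, change_threshold=0.25):
        self.current = None
        self.change_threshold = change_threshold  # fractional change required

    def update(self, quality):
        candidate = quality_to_metric(quality)
        if self.current is None:
            self.current = candidate
        # Only re-advertise when the metric moves by more than the threshold,
        # trading responsiveness for stability -- the classic dynamic-metric
        # tradeoff EIGRP's designers ran into.
        elif abs(candidate - self.current) > self.current * self.change_threshold:
            self.current = candidate
        return self.current

link = DampenedMetric()
print(link.update(0.9))   # initial metric
print(link.update(0.85))  # small fade: metric held steady
print(link.update(0.4))   # large drop: metric re-advertised
```

The tension lives in the threshold: set it low and the network churns with every fade; set it high and the metric stops reflecting the radio’s real state.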
Which leads us to our second point: The API for the controller. This is interesting, but not strictly new. Openfabric, for instance, already postulates such a thing, and the entire I2RS working group in the IETF was formed to build such an interface (though it has strayed far from this purpose, as usual with IETF working groups). The really interesting thing, though, is this: this southbound interface is built into the routing protocol itself. This design decision makes a lot of sense in a wireless network, but, again, I’m not certain it does in a fabric.
Why not? It ties the controller architecture, including the southbound interface, to the routing protocol. This reduces component flexibility, which means it is difficult to replace one piece without replacing the other. If you wanted to replace the basic functionality of Open/R without replacing the controller architecture at some point in the future, you would have to hack your way around this problem. In a monolithic system like Facebook, this might be okay, but in most other network environments, it’s not. In other words, this is a rational decision for Open/R, but I’m not certain it can, or should, be generalized.
This leads to a third observation: This is a monolithic architecture. While in most implementations, there is a separate RIB, FIB, and interface into the forwarding hardware, Open/R combines all these things into a single system. In any form of Linux based network operating system, for instance, the routing processes install routes into Zebra, which then installs routes into the kernel and notifies processes about routes through the Forwarding Plane Manager (FPM). Some external process (switchd in Cumulus Linux, SWSS in SONiC) then carries this routing information into the hardware.
Open/R, from the diagrams in the blog post, pushes all of this stuff, including the southbound interface from the controller, into a different set of processes. The traditional lines are blurred, which means the entire implementation acts as a single “thing.” You are not going to take the BGP implementation from snaproute or FR Routing and run it on top of Open/R without serious modification, nor are you going to run Open/R on ONL or SONiC or Cumulus Linux without serious modification (or at least a lot of duplication of effort someplace).
This is probably an intentional decision on the part of Open/R’s designers—it is designed to be an “all in one solution.” You RPM it to a device, with nothing else, and it “just works.” This makes perfect sense in the wireless environment, particularly for Facebook. Whether or not it makes perfect sense in a fabric depends—does this fit into the way you manage boxes today? Do you plan on using boxes Facebook will support, or roll your own drivers as needed for different chipsets, or hope the SAI support included in Open/R is enough? Will you ever need segment routing, or some other capability? How will those be provided for in the Open/R model, given it is an entire stack, and does not interact with any other community efforts?
Finally, there are a number of interesting points that are not discussed in the publicly available information. For instance, this controller—what does it look like? What does it do? How would you do traffic engineering with this system? Segment routing, MPLS—none of the standard ways of providing virtualization are mentioned at all. Dynamic metrics simply are not enough in a fabric. How is the flooding of information actually done? In the past, I’ve been led to believe this is based on ZeroMQ—is this still true? How optimal is ZeroMQ for flooding information? What kind of telemetry can you get out of this, and is it carried in the protocol, or in a separate system? I assume they want to carry telemetry as opaque information flooded by the protocol, but does it really make sense to do this?
Overall, Open/R is interesting. It’s a single protocol designed to operate optimally in a small range of environments. As such, it has some interesting features, and it makes some very specific design choices. Are those design choices optimal for more general cases, or even other specific problem spaces? I would argue the architecture, in particular, is going to be problematic in terms of long term maintenance and growth. This can be modified over time, of course, but then we are left with a collection of ideas that are available in many other protocols, making the idea much less interesting.
Is it interesting? Yes. Is it the routing singularity? No. As engineers, we should take it for what it is worth—a chance to see how other folks are solving the problems they are facing in day-to-day operation, and thinking about how some of those lessons might be applied in our own world. I don’t think the folks at Facebook would argue any more than this, either.
On the ‘web: Openfabric and the trash can of the Internet
I sat with Greg Ferro over at Packet Pushers for a few minutes at the Prague IETF. We talked about Openfabric and how we are overusing BGP in many ways, as well as other odds and ends.
Open Source Routing at NANOG
If you want to find out more about open source and disaggregated routing for large scale network design, take a look at the recent webinar on this topic over at ipspace.net.
Anycast and Latency
One of the things I hear from time to time is that smaller Internet facing service deployments, with just a few instances, cannot really benefit from anycast. Particularly in the active-active data center use case, where customers can connect to one data center or another, the cost of advertising the service as an anycast, and the resulting requirement to keep the backend databases tightly synchronized, is often portrayed as eating a lot of complexity just to get the simplicity of a single address in the DNS system—so customers don’t lose interaction time waiting for DNS records to time out before they can reconnect to the service.
There is, in fact, some interesting recent research in this area. The research is directed at the DNS root servers themselves, probably because they are publicly accessible, and a well known system that has relied on anycast for many years (so the operators of the root DNS servers are probably well versed in the ways of anycast). One interesting result from the post over at APNIC’s blog compares the RTT performance of the different root servers.
The C root has 8 servers, while the L root has around 144 (according to the article pointed to above). Why is it that the C and L roots both show about the same performance, from an RTT perspective, even though they have wildly different numbers of servers serving an anycast address? The most likely explanation lies in the problem of diminishing returns; once you get past some number of servers in an anycast cluster, the gain from adding “one more server” really isn’t all that great.
How many servers? The authors of the paper say the magic number is 12.
But—it is important to point out that root servers serve a (largely) global audience, and so the “customers” of each of these anycast addresses are bound to be widely dispersed. For narrower audiences (geographically), the problem of diminishing returns will likely set in more quickly, perhaps with three or four servers servicing the anycast address.
The bottom line? Geographic dispersion across your customer base probably has more impact on the effectiveness of anycast in spreading load than the number of servers serving the anycast address, above some minimal number (probably around three or four). If you are running any sort of service that needs high availability, even if it is grounded in TCP rather than UDP or QUIC, it is well worth looking at anycast to provide faster service and higher availability.
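The diminishing returns argument is easy to see in a toy model. The sketch below is purely illustrative (it has nothing to do with the actual root server study): it drops clients and anycast sites randomly on a unit square, routes each client to the nearest site, and reports the median distance as a crude stand-in for RTT.

```python
# Toy simulation of anycast diminishing returns: each client connects to
# the nearest of N randomly placed sites; the median distance (a crude
# stand-in for RTT) improves quickly at first, then flattens out.
import random
import statistics

def median_distance(num_sites, num_clients=2000, seed=1):
    rng = random.Random(seed)
    sites = [(rng.random(), rng.random()) for _ in range(num_sites)]
    clients = [(rng.random(), rng.random()) for _ in range(num_clients)]
    dists = [
        min(((cx - sx) ** 2 + (cy - sy) ** 2) ** 0.5 for sx, sy in sites)
        for cx, cy in clients
    ]
    return statistics.median(dists)

for n in (1, 2, 4, 8, 16, 32):
    print(n, round(median_distance(n), 3))
```

For uniformly scattered sites, the median distance falls off roughly as one over the square root of the number of sites, so each doubling buys less than the one before it—the same diminishing returns curve the root server data shows.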
On the ‘net: The Network Collective and Choosing a Routing Protocol
The Network Collective is a new and very interesting video cast of various people sitting around a virtual table talking about topics of interest to network engineers. I was on the second episode last night, and the video is already (!) posted this morning. You should definitely watch this one!
In episode 2 our panel discusses some key differences between routing protocols and the details that should be considered before choosing to implement one over another. Is there any difference between IGP routing protocols at this point? When does it make sense to run BGP in an enterprise network? Is IS-IS an old and decaying protocol, or something you should viably consider? Russ White, Kevin Myers, and the co-hosts of Network Collective tackle these questions and more.
Reaction: Openflow and Software Based Switching
Over at the Networking Nerd, Tom has an interesting post up about openflow—this pair of sentences, in particular, caught my eye—
The side effect of OpenFlow is that it proved that networking could be done in software just as easily as it could be done in hardware. Things that we thought we historically needed ASICs and FPGAs to do could be done by a software construct.
I don’t think this is quite right, actually… When I first started working in network engineering (wheels were square then, and dirt hadn’t yet been invented—but we did have solar flares that caused bit flips in memory), we had all software based switching. The Cisco 7200, I think, was the ultimate software based switching box, although the little 2ru 4500 (get your head out of the modern router line, think really old stuff here!) had a really fast processor, and hence could process packets really quickly. These were our two favorite lab boxes, in fact. But in the early 1990’s, the SSE was introduced, soldered on to an SSP blade that slid into a 7500 chassis.
The rest, as they say, is history. The networking world went to chips designed to switch packets (whether custom ASICs or merchant silicon), and hasn’t even begun to think about looking back. I think Tom’s two sentences mash up several different things into one large ball of thread that need to be disentangled just a little.
A good place to start is here: most commercially available packet switching silicon has some form of openflow available. The specific example Tom gives—implementing a filter across multiple switches in a network—actually doesn’t relate to software based switching as much as it does bringing policy out of the distributed control plane and into a controller. So if openflow really hasn’t changed the nature of switching, then what has it changed?
I think what Tom is really getting at here is a shift in our perception of the nature of a router. In times past, we treated the entire router as something like an appliance. There was an operating system which contained an implementation of the distributed control plane, and there was hardware, which contained the switching hardware, LEDs, fans, connectors, and other stuff. Today, we think of the router as being at least two components, and moving to more.
There is the control plane, which is essentially an application riding on top of the operating system. There is the hardware “carrier,” which is essentially the sheet metal, LEDs, fans, etc. There is the switching hardware. And there is an interface between the switching hardware and the control plane.
Openflow, I think, primarily made us start to think about the relationship between these parts; it made us stop thinking of the router as an appliance, and start thinking of the router as a collection of parts. This led to the entire disaggregation movement, and tangentially the white box movement, we have today.
So, in a sense, Tom is correct—what we thought of as needing to be done by a monolithic “thing,” or “appliance,” we now think of as being done by a collection of parts that we can wrap together ourselves. This isn’t a matter of “what used to be done in hardware is now done in software,” although it is a rethinking of the relationship between hardware and software.
Regardless of whether or not Openflow ultimately succeeds or fails in its original form, or its later hyped form, or in some other form, we should all at least look back on Openflow as helping to bring about a major shift in the way we see networks. In this way, at least, Openflow has not failed.
I2RS and Remote Triggered Black Holes
In our last post, we looked at how I2RS is useful for managing elephant flows on a data center fabric. In this post, I want to cover a use case for I2RS that is outside the data center, along the network edge—remote triggered black holes (RTBH). Rather than looking directly at the I2RS use case, however, it’s better to begin by looking at the process for creating, and triggering, RTBH using “plain” BGP. Assume we have the small network illustrated below—
In this network, we’d like to be able to trigger B and C to drop traffic sourced from 2001:db8:3e8:101::/64 inbound into our network (the cloudy part). To do this, we need a triggering router—we’ll use A—and some configuration on the two edge routers—B and C. We’ll assume B and C have up and running eBGP sessions to D and E, which are located in another AS. We’ll begin with the edge devices, as the configuration on these devices provides the setup for the trigger. On B and C, we must configure—
- Unicast RPF; loose mode is okay. With loose uRPF enabled, any packet sourced from an address whose route points to a null destination in the routing table will be dropped.
- A route to some destination not used in the network pointing to null0. To make things a little simpler, we’ll point 2001:db8:1:1::1/64, a route that doesn’t exist anyplace in the network, to null0 on B and C.
- A pretty normal BGP configuration.
The triggering device is a little more complex. On Router A, we need—
- A route map that—
- matches some tag in the routing table, say 101
- sets the next hop of routes with this tag to 2001:db8:1:1::1/64
- sets the local preference to some high number, say 200
- redistribute from static routes into BGP filtered through the route map as described.
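As a sketch, the triggering router’s route map amounts to a simple transform applied to redistributed static routes. The values below match the example in the text; the next hop address sits inside the null-routed prefix configured on B and C.

```python
# Sketch of the triggering router's route-map logic: static routes tagged
# 101 are redistributed into BGP with a rewritten next hop and a high
# local preference. Values match the example in the text.
BLACKHOLE_TAG = 101
BLACKHOLE_NEXT_HOP = "2001:db8:1:1::1"  # inside the null-routed /64 on B and C
BLACKHOLE_LOCAL_PREF = 200

def redistribute_static(route):
    """Apply the route map to a static route; return the BGP route, or None."""
    if route.get("tag") != BLACKHOLE_TAG:
        return None  # route map doesn't match; nothing redistributed here
    return {
        "prefix": route["prefix"],
        "next_hop": BLACKHOLE_NEXT_HOP,
        "local_pref": BLACKHOLE_LOCAL_PREF,
    }

bgp_route = redistribute_static(
    {"prefix": "2001:db8:3e8:101::/64", "next_hop": "null0", "tag": 101}
)
print(bgp_route)
```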
With all of this in place, we can trigger a black hole for traffic sourced from 2001:db8:3e8:101::/64 by configuring a static route at A, the triggering router, that points at null0, and has a tag of 101. Configuring this static route will—
- install a static route into the local routing table at A with a tag of 101
- this static route will be redistributed into BGP
- since the route has a tag of 101, it will have a local preference of 200 set, and the next hop set to 2001:db8:1:1::1/64
- this route will be advertised via iBGP to B and C through normal BGP processing
- when B receives this route, it will choose it as the best path to 2001:db8:3e8:101::/64, and install it in the local routing table
- since the next hop on this route is set to 2001:db8:1:1::1/64, and 2001:db8:1:1::1/64 points to null0 as a next hop, uRPF will be triggered, dropping all traffic sourced from 2001:db8:3e8:101::/64 at the AS edge
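The edge router’s side of this can be sketched as a recursive lookup: the iBGP-learned route resolves, through the pre-installed null route, to null0, so loose uRPF drops the traffic. This toy version does a crude /64-only prefix match rather than real longest-prefix matching.

```python
# Sketch of the edge router's side: the iBGP-learned route resolves,
# through the pre-installed null route, to null0, so loose uRPF drops
# packets sourced from the black-holed prefix. Toy /64-only matching,
# not real longest-prefix matching.
rib = {
    "2001:db8:1:1::/64": "null0",                # pre-installed null route
    "2001:db8:3e8:101::/64": "2001:db8:1:1::1",  # iBGP route from the trigger
}

def covering_prefix(address):
    """Toy lookup: match on the first four hextets (the /64 network)."""
    network = ":".join(address.split(":")[:4])
    for prefix in rib:
        if prefix.startswith(network):
            return prefix
    return None

def urpf_loose_drop(source):
    """Loose uRPF check: drop if the route back to the source resolves to null0."""
    seen = set()
    address = source
    while True:
        prefix = covering_prefix(address)
        if prefix is None or prefix in seen:
            return False  # no resolvable route back; this toy check passes it
        seen.add(prefix)
        next_hop = rib[prefix]
        if next_hop == "null0":
            return True   # recursively resolves to null0: drop at the edge
        address = next_hop  # recurse through the next hop

print(urpf_loose_drop("2001:db8:3e8:101::99"))  # black-holed source
print(urpf_loose_drop("2001:db8:9999::1"))      # unrelated source
```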
It’s possible to have regional, per neighbor, or other sorts of “scoped” black hole routes by using different routes pointing to null0 on the edge routers. These are “magic numbers,” of course—you must have a list someplace that tells you which route causes what sort of black hole event at your edge, etc.
Note—this is a terrific place to deploy a DevOps sort of solution. Instead of using an appliance sort of router for the triggering router, you could run a handy copy of Cumulus or snaproute in a VM, and build scripts that build the correct routes in BGP, including a small table in the script that allows you to say something like “black hole 2001:db8:3e8:101::/64 on all edges,” or “black hole 2001:db8:3e8:101::/64 on all peers facing provider X,” etc. This could greatly simplify the process of triggering RTBH.
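Such a script might look something like the sketch below; the scope names, tags, and CLI syntax here are all invented for illustration.

```python
# Hypothetical trigger script: map a human-friendly scope name to the
# "magic number" tag (and hence the null route used at the edge), and
# emit the static route the triggering router needs. Scope names, tags,
# and the CLI syntax are invented for illustration.
SCOPES = {
    "all-edges": {"tag": 101, "edge_null_route": "2001:db8:1:1::1"},
    "provider-x": {"tag": 102, "edge_null_route": "2001:db8:1:2::1"},
}

def trigger_config(prefix, scope):
    """Return the static route line that triggers a black hole for prefix."""
    tag = SCOPES[scope]["tag"]
    return f"ipv6 route {prefix} null0 tag {tag}"

print(trigger_config("2001:db8:3e8:101::/64", "all-edges"))
```

The small table in the script replaces the list of magic numbers you would otherwise have to keep in your head (or on a wiki), which is most of the operational win.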
Now, as a counter, we can look at how this might be triggered using I2RS. There are two possible solutions here. The first is to configure the edge routers as before, using “magic number” next hops pointed at the null0 interface to trigger loose uRPF. In this case, an I2RS controller can simply inject the correct route at each edge eBGP speaker to block the traffic directly into the routing table at each device. There would only need to be one such route; the complexity of choosing which peers the traffic should be black holed on could be contained in a script at the controller, rather than dispersed throughout the entire network. This allows RTBH to be triggered on a per edge eBGP speaker basis with no additional configuration on any individual edge router.
Note the dynamic protocol isn’t being replaced in any way. We’re still receiving our primary routing information from BGP, including all the policies available in that protocol. What we’re doing, though, is removing one specific policy point out of BGP and moving it into a controller, where it can be more closely managed, and more easily automated. This is, of course, the entire point of I2RS—to augment, rather than replace, dynamic routing used as the control plane in a network.
Another option, for those devices that support it, is to inject a route that explicitly filters packets sourced from 2001:db8:3e8:101::/64 directly into the RIB using the filter based RIB model. This is a more direct method, if the edge devices support it.
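As a hedged illustration of the shape of this idea (this is not the actual I2RS YANG model, just an invented data structure), a controller might build a per-speaker drop entry like this:

```python
# Illustrative sketch only: a controller builds, per edge speaker, a
# filter-style entry that drops packets *sourced* from the offending
# prefix. The field names are invented; the real I2RS models are YANG.
def build_drop_entry(source_prefix):
    return {
        "match": {"ipv6-source-prefix": source_prefix},
        "action": "drop",
        "preference": 10,  # chosen to win over routes learned via BGP
    }

edge_speakers = ["B", "C"]
injected = {
    router: build_drop_entry("2001:db8:3e8:101::/64")
    for router in edge_speakers
}
print(injected["B"]["action"])
```

The point of the sketch is where the policy lives: the per-peer decision logic stays in the controller’s script, and each edge speaker only ever sees the one entry it needs.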
Either way, the I2RS process is simpler than using BGP to trigger RTBH. It gathers as much of the policy as possible into one place, where it can be automated and managed in a more precise, fine grained way.