In this community roundtable over at the Network Collective, Jordan, Eyvonne, Nick, and I discuss some interesting use cases for MPLS. This is part of our ongoing series about protocols.
While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like there is no single point of failure. This common answer, however, does not go “far enough” in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the time to converge can be reduced enough to offset the higher path availability.
In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.
This post will look at a paper that examines robustness more deeply, Robustness in Complexity Systems,” by Steven Gribble. While this is an older paper—it was written in 2000—it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster based Distributed Data Structure (DDS). A more convenient way for readers to think of this is as a distributed database. Several problems discovered when building and deploying this DDS are considered, including—
- A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After the record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coallesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is their initial “guess” at a a good timer was ultimately an order of a magnitude too small, causing some of the nodes to “fall behind” other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to “take up the slack,” causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in a distributed system.
- A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.
Gribble draws several lessons from these, and other, outages in the system.
First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.
Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”
Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measureable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.
Fourth, Gribbles says adaptivitiy must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.
This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.
I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.
A reader recently emailed me with this question: Why isn’t the condition for a Feasible Successor set to less than (<), rather than less than of equal (<=), in EIGRP? It certainly seems, as noted in the email, that this rules out a lot of possible possible loop free alternate paths. The network below will be used to illustrate.
First, assume all links are cost of 1 except D->C, which is cost of 2. Here D will choose B as the Successor, and the FC will be set to 2. The RD of C will be 1, so C will be an FS. Now consider two failures. The first failure is D->B. D will immediately reroute to the FS, which is C, without changing the FC. This works, because C’s cost to 100::/64 via D is 4, much higher than it’s cost to 100::64 along C->A. Now consider what happens if A->100::/64 fails. If the timing of the query “works right,” C and B will be notified first, then finally D. Even if D is somehow notified before C, and D switches to C as its FS, the traffic is dropped, rather than looped—so all is happy.
Now change the situation a little. Assume the A->C link is cost of 2, and the remaining links are 1. Now assume you make the FS condition <=, rather than "just" <. From D's perspective, C is still an FS. From C's perspective, D is also an FS. Suppose the B->D link fails. D switches to C, who’s path is intact; this works. Assume the C->A link fails. C switches to D, who’s path is intact; this works.
Finally, assume the A->100::/64 link fails. If the timing is just right, D and C will receive the update with this failure at the same moment, and will both switch to their FS’s. Now you have a loop. How long will this loop last? Until C and D can do a diffusing update — probably around 250ms or less… But if you count the outside computation time, it’s the SIA timer, which is around 10 minutes in more recent versions. Hence, “just the wrong circumstances” can cause up to a 10 minute loop. Not good.
The bottom line is this: any time you have a situation where two routers can end up pointing at one another as their local FS, you have a ring of some number of hops. If the final destination is outside the ring, some member of the ring must be the point at which traffic leaves the ring, and moves towards the destination. If the link connecting the ring to the destination fails, the update carrying the information about the loss of connectivity to the destination must travel around the ring in both directions.
When the update reaches the point at which routing would normally split horizon—the “point at which the waterfall splits,” so-to-speak, it will like reach both sides of the split horizon point at a close enough interview to cause a loop in the forwarding tables. This situation causes microloops in a link state protocol, but microloops are often resolved quickly, and hence tend to be tolerable. In a distance vector protocol, like EIGRP, the length of time the microloop can exist can be much longer—ultimately it depends on the speed at which a distributed computation can take place (because the computation is not local to each node), and, failing that, the amount of time the network can remain in an unstable state before “something is done about it.”
This, by the way, is why I am always opposed to increasing the SIA timer in EIGRP above the “factory defaults.” The SIA timer is essentially the amount of time you are willing to allow your network to remain unconverged in the worst case, and hence either dropping or looping traffic.
Have you ever wondered why spine-and-leaf networks are the “standard” for data center networks? While the answer has a lot to do with trial and error, it turns out there is also a mathematical reason the fat-tree spine-and-leaf is is used almost universally. There often is some mathematical reason for the decisions made in engineering, although we rarely explore those reasons. If it seems to work, there is probably a reason.
The fat-tree design is explored in a paper published in 2015 (available here at the ACM, and now added to my “classic papers” page so there is a local copy as well), using a novel technique to not only explore why the spine-and-leaf fat-tree is so flexible, but even what the ideal ratio of network capacity is at each stage. The idea begins with this basic concept: one kind of network topology can be emulated on top of another physical topology. For instance, you can emulate a toroid topology on top of a hierarchical network, or a spine-and-leaf on top of of hypercube, etc. To use terms engineers are familiar with in a slightly different way, let’s call the physical topology the underlay, and the emulated topology the overlay. This is not too far off, as emulated topologies do need to be a virtual construction on top of the underlay, which means… tunnels, of course.
In emulating one kind of topology over another kind of topology, however, you lose some performance. This performance loss factor is quite important, in that it tells you whether or not it is worth building your actual network using the design of the underlay, or using the overlay. If, for instance, you emulate a toroid design on top of a folded Clos, and you discover the performance is half as good as a real, physical toroid, then you might be better off building a toroid rather than a folded Clos.
This might be difficult to grasp at first, so let’s use a specific example. The only topologies most engineers are familiar with are rings, hierarchical networks, partial mesh, full mesh, and some form of spine-and-leaf. Consider the case of a full mesh network versus a ring topology. Suppose you have some application that, through careful testing, you have determined best runs on a full mesh network. Now when you go to build your physical network, you find that you have the choice between building a real full mesh network, which will be very expensive, or a ring network, which will be very cheap. How much performance will you lose in running this application on the ring topology?
The way to discover the answer is to build a virtual full mesh overlay onto the ring topology. Once you do this, it turns out there are ways to determine how much performance loss you will encounter using this combination of underlay and overlay. Now, take this one step farther—what if you decide not to build the overlay, but simply run the application directly on the ring topology? Minus the tunnel encapsulation and de-encapsulation, the resulting loss in performance should be roughly the same,
Given this, you should be able to calculate the performance loss from emulating every kind of topology on top of every other kind of topology. From here, you can determine if there is one topology that can be used as an underlay for every other kind of topology with a minimal amount of performance loss. It turns out there is: the fat-tree spine-and-leaf topology. So an application that might best work on a full mesh network will run on a fat-tree spine-and-leaf topology with minimal performance loss. The same can be said of an application that runs best on a ring topology, or a hierarchical topology, or a partial mesh, or a… Thus, the fat-tree spine-and-leaf can be called a universal topology.
After coming to this point, the authors wonder if it is possible to determine how the bandwidth should be apportioned in a fat-tree spine-and-leaf to produce optimal emulation results across every possible overlay. This would give designers the ability to understand how to achieve the most optimal universal application performance on a single physical topology. Again, there turns out to be an answer to this question: doubling the bandwidth at each stage in the fabric is the best way to create an underlay that will emulate every other kind of overlay with the minimal amount of performance loss. The authors come to this conclusion by comparing a data center network to an on-chip connection network, which has a provable “best” design based on the physical layout of the wires. Translating the physical space into a monetary cost produces the same result as above: doubling the bandwidth at each stage creates the least expensive network with the most capabilities.
The result, then, is this: the fat-tree spine-and-leaf design is the most effective design for most applications, a universal topology. Further, building such a design where there are two neighbors down for every neighbor up, and doubling the bandwidth at each stage, is the most efficient use of resources in building such a network.
Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.
In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.
What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.
Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.
However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.
Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.
The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.
Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.
TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.
Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.
Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.
So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.
In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?
Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?
After collecting several
show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.
Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.
Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…
You can see how this quickly developed into a positive feedback loop—
- EIGRP receives a set of queries to which it must respond
- EIGRP allocates memory for each packet to build the responses
- Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
- EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
- EIGRP SIA timer times out, causing a flood of new queries…
Rinse and repeat until the network fails to converge.
There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.
But rest assured—there are still positive feedback loops lurking in some corner of every network.
Two different readers, in two different forums, asked me some excellent questions about some older posts on mircoloops. Unfortunately I didn’t take down the names or forums when I noted the questions, but you know who you are! For this discussion, use the network show below.
In this network, assume all link costs are one, and the destination is the 100::/64 Ipv6 address connected to A at the top. To review, a microloop will form in this network when the A->B link fails:
- B will learn about the link failure
- B will send an updated router LSP or LSA towards D, with the A->B link removed
- At about the same time, B will recalculate its best path to 100::/64, so its routing and forwarding tables now point towards D as the best path
- D, in the meantime, receives the updated information, runs SPF, and installs the new routing information into its forwarding table, with the new path pointing towards E
Between the third and fourth steps, B will be using D as its best path, while D is using B as its best path. Hence the microloop. The first question about microloops was—
Would BFD help prevent the microloop (or perhaps make it last a shorter period of time)?
Consider what happens if we have BFD running between A and B in this network. While A and B discover the failure faster—perhaps a lot faster—none of the other timing points change. No matter how fast A and B discover the link failure, B is still going to take some time to flood the change to the topology to D, and D is going to take some time to compute a new set of shortest paths, and install them into its local routing and forwarding tables.
Essentially, if you look at convergence as a four step process—
Then you can see the microloop forms because different routers are “doing” steps 3 and 4 at different times. Discovery, which is what BFD is aimed at, is not going to change this dynamic. The second question was—
Can microloops occur on a link coming up?
For this one, let’s start in a different place. Assume, for a moment, that the A->B link is down. Now someone goes in and configures the A->B link to make it operational. At the moment the link comes up, B’s shortest path to 100::/64 is through D. When the new link comes up, it will learn about the new link, calculated a new shortest path tree, and then install the new route through A towards 100::/64. E will also need to calculate a new route, using B as its next hop.
The key point to consider is this: who tells E about this new path to the destination? It is B. To form a microloop, we need D to install a route through B towards 100::/64 before B does. This is theoretically possible in this situation, but unlikely, because D is dependent on B for information about this new path. Things would need to be pretty messed up for B to learn about the new path first, but not recalculate its shortest path tree and install the route before D can. So—while it is possible, it is not likely.
Thanks for sending these terrific questions in.
Nick Russo and I stopped by the Network Collective last week to talk about BGP traffic engineering—and in the process I confused BGP deterministic MED and always compare MED. I’ve embedded the video below.
Since Facebook has released their Open/R routing platform, there has been a lot of chatter around whether or not it will be a commercial success, whether or not every hyperscaler should use the protocol, whether or not this obsoletes everything in routing before this day in history, etc., etc. I will begin with a single point.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
Design is about tradeoffs. Protocol design is no different than any other design. Hence, we should expect that Open/R makes some tradeoffs. I know this might be surprising to some folks, particularly in the crowd that thinks every new routing system is going to be a silver bullet that solved every problem from the past, that the routing singularity has now occurred, etc. I’ve been in the world of routing since the early 1990’s, perhaps a bit before, and there is one thing I know for certain: if you understand the basics, you would understand there is no routing singularity, and there never will be—at least not until someone produces a quantum wave routing protocol.
Ther reality is you always face one of two choices in routing: build a protocol specifically tuned to a particular set of situations, which means application requirements, topologies, etc., or build a general purpose protocol that “solves everything,” at some cost. BGP is becoming the latter, and is suffering for it. Open/R is an instance of the former.
Which means the interesting question is: what are they solving for, and how? Once you’ve answered this question, you can then ask: would this be useful in my network?
A large number of the points, or features, highlighted in the first blog post are well known routing constructions, so we can safely ignore them. For instance: IPv6 link local only, graceful restart, draining and undraining nodes, exponential backoff, carrying random information in the protocol, and link status monitoring. These are common features of many protocols today, so we don’t need to discuss them. There are a couple of interesting features, however, worth discussing.
Dynamic Metrics. EIGRP once had dynamic metrics, and they were removed. This simple fact always makes me suspicious when I see dynamic metrics touted as a protocol feature. Looking at the heritage of Open/R, however, dynamic metrics were probably added for one specific purpose: to support wireless networks. This functionality is, in fact, provided through DLEP, and supported in OLSR, MANET extended OSPF, and a number of other MANET control planes. Support DLEP and dynamic metrics based on radio information was discussed at the BABEL working group at the recent Singapore IETF, in fact, and the BABEL folks are working on integration dynamic metrics for wireless. So this feature not only makes sense in the wireless world, it’s actually much more widespread than might be apparent if you are looking at the world from an “Enterprise” point of view.
But while this is useful, would you want this in your data center fabric? I’m not certain you would. I would argue dynamic metrics are actually counter productive in a fabric. What you want, instead, is basic reacbility provided by the distributed control plane (routing protocol), and some sort of controller that sits on top using an overlay sort of mechanism to do traffic engineering. You don’t want this sort of policy stuff in a routing protocol in a contained envrionment like a fabric.
Which leads us to our second point: The API for the controller. This is interesting, but not strictly new. Openfabric, for instance, already postulates such a thing, and the entire I2RS working group in the IETF was formed to build such an interface (though it has strayed far from this purpose, as usual with IETF working groups). The really interesting thing, though, is this: this southbound interface is built into the routing protocol itself. This design decision makes a lot of sense in a wireless network, but, again, I’m not certain it does in a fabric.
Why not? It ties the controller architecture, including the southbound interface, to the routing protocol. This reduced component flexibility, which means it is difficult to replace one piece without replacing the other. If you wanted to replace the basic functionality of Open/R without replacing the controller architecture at some point int he future, you must hack your way around this problem. In a monolithic system like Facebook, this might be okay, but in most other network environments, it’s not. In other words, this is a rational decision for Open/R, but I’m not certain it can, or should, be generalized.
This leads to a third observation: This is a monolithic architecture. While in most implementations, there is a separate RIB, FIB, and interface into the the forwarding hardware, Open/R combines all these things into a single system. In any form of Linux based network operating system, for instance, the routing processes install routes into Zebra, which then installs routes into the kernel and notifies processes about routes through the Forwarding Plane Manager (FPM). Some external process (switchd in Cumulus Linux, SWSS in SONiC), then carry this routing information into the hardware.
Open/R, from the diagrams in the blog post, pushes all of this stuff, including the southbound interface from the controller, into a different set of processes. The traditional lines are blurred, which means the entire implemention acts as a single “thing.” You are not going to take the BGP implementation from snaproute or FR Routing and run it on top of Open/R without serious modification, nor are you going to run Open/R on ONL or SONiC or Cumulus Linux without serious modification (or at least a lot of duplication of effort someplace).
This is probably an intentional decision on the part of Open/R’s designers—it is designed to be an “all in one solution.” You RPM it to a device, with nothing else, and it “just works.” This makes perfect sense in the wrieless environment, particularly for Facebook. Whether or not it makes perfect sense in a fabric depends—does this fit into the way you manage boxes today? Do you plan on using boxex Faebook will support, or roll your own drivers as needed for different chipsets, or hope the SAI support included in Open/R is enough? Will you ever need segment routing, or some other capability? How will those be provided for in the Open/R model, given it is an entire stack, and does not interact with any other community efforts?
Finally, there are a number of interesting points that are not discussed in the publicly available information. For instance, this controller—what does it look like? What does it do? How would you do traffic engineering with this sytem? Segment routing, MPLS—none of the standard ways of providing virtualization are mentioned at all. Dynamic metrics simply are not enough in a fabric. How is the flooding of information actually done? In the past, I’ve been led to believe this is based on ZeroMQ—is this still true? How optimal is ZeroMQ for flooding information? What kind of telemetry can you get out of this, and is it carried in the protocol, or in a separate system? I assume they want to carry telemtry as opaque information flooded by the protocol, but does it really make sense to do this?
Overall, Open/R is interesting. It’s a single protocol designed to opperate optimally in a small range of environments. As such, it has some interesting features, and it makes some very specific design choices. Are those design choices optimal for more general cases, or even other specific problem spaces? I would argue the architecture, in particular, is going to be problematic in terms of long term maintenance and growth. This can modified over time, of course, but then we are left with a collection of ideas that are available in many other protocols, making the idea much less interesting.
Is it interesting? Yes. Is it the routing singularity? No. As engineers, we should take it for what it is worth—a chance to see how other folks are solving the problems they are facing in day-to-day operation, and thinking about how some of those lessons might be applied in our own world. I don’t think the folks at Facebook would argue any more than this, either.
In 2003, the world of network engineering was far different than it is today. For instance, EIGRP was still being implemented on the basis of its ability to support multi-protocol routing. SONET, and other optical technologies, were just starting to come into their own, and all optical switching was just beginning to be considered for large scale deployment. What Hartley says of history holds true when looking back at what seems to be a former age: “The past is a foreign country; they do things differently there.”
In the midst of this change, the Association for Computing Machinery (the ACM) published a paper entitled “Will IP really take over the world (of communications)?” This paper, written during the ongoing discussion within the engineering community over whether packet switching or circuit switching is the “better” technology, provides a lot of insight into the thinking of the time. Specifically, as the author say, the belief that IP is better:
…is based on our collective belief that packet switching is inherently superior to circuit switching because of the efficiencies of statistical multiplexing, and the ability of IP to route around failures. It is widely assumed that IP is simpler than circuit switching, and should be more economical to deploy and manage.
Several of the reasons given are worth considering in light of the intervening years between the paper being written and today. Section 2 of the paper suggest four myths that need to be considering in the world of IP based packet switching.
The first of these myths is that IP is more efficient. Here the authors point out an early argument used by packet switching advocates: when a circuit is idle, reserved bandwidth is unused. Because packets does not reserve bandwidth in this way, packet switching makes more efficient use of the available resources. The authors counter this argument by observing packets switched networks are not that heavily utilized, and the exploring why this might be so. The reasons they give are that packet switched networks become unstable if they are overutilized, and the desire to drive delay down as much as possible. The authors say:
But simply reducing the average link utilization will not be enough to make users happy. For a typical user to experience low utilization, the variance of the network utilization needs to be low, too. Reducing variations in link utilization is hard; today we lack effective techniques to do it. It might be argued that the problem will be solved by research efforts on traffic management, congestion control, and multipath routing. But to-date, despite these problems being understood for many years, effective measures are yet to be introduced.
The second of these myths is that IP is stable. Here the authors point out the many potential problems with packet switched networks, particularly in the area of in band signaling.
The third myth the authors consider is that IP is simpler. Here the argument is that circuit switched networks have fewer moving parts, each of which can be more regimented in their implementation. This means circuit switched networks are generally simpler than their packet switched counterparts.
The fourth myth the authors consider is that real time traffic will be able to run over IP based networks. They argue that because of the various quality of service problems, near real time traffic, such as voice and video, can never be carried over an IP network.
What did the authors do get right, and what did they get wrong? Looking back over the intervening years, some of these problems seem to have been largely solved. For instance, between research into new quality of service solutions, virtualization technologies, and research into and deployment of new methods of managing traffic flows over a packet switched network (for instance, as implemented in QUIC), have largely eased the problems the authors discuss.
Other problems in their list seem to have simply fallen by the wayside. For instance, the authors are largely correct on the stability of IP routing. The Internet, as it exists today, is not really stable. The default free zone, the Internet core, never does really converge. But, on the other hand, it turns out that it might not be all that important. So long as reachability exists, and the applications running over the network can deal with the variable delay, then it does not really matter. It is clear that IP networks can be used for near real time traffic at this point, as well.
The third myth is, perhaps, the most interesting—IP is more complex. Circuit switching generally relies on a centralized control plane, rather than a distributed one. Is this truly simpler? Advocates of Software Defined Networks have argued recently that it is, but I am not convinced. The problem here, it seems to me, is the conflation of reachability and policy into a single “thing” that must be solved using a single subsystem of the network as a system. I tend to think that we will have a much healthier discussion around control plane design and operation once we split the various purposes the control plane serves up into different pieces, and discuss each one as a separate problem and solution space.
What did the authors get right? They were right in surmising that circuit switched networks, particularly in the area of optical switching, would continue to play a role in the global Internet. In fact, you could argue that packet and circuit switched networks often live side-by-side in the global Internet today, with many long haul and metropolitan area links being good examples of circuit switched networks underlying a packet switched overlay. MPLS, of course, also provides a sort of hybrid mode between circuit and packet switching which enables many of the solutions deployed today.
Much as we might like to look back in derision at papers such as this, the reality is the authors were accurately pointing out many of the problems with packet switched networks, some of which still have not been solved today. These objections should not be taken as a relic of the past, but rather as a pointer to what yet remains to be solved, where problems may crop up again, and, finally, where we might look for solutions that may, ultimately, solve the problem of carrying data at scale.
From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Interdomain (SIDR) working group is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.
In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptogrphic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.
To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.
Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.
But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP uses to increase performance is in gathering updates up and sending one update whenever possible using either a peer group or an update group. Worse yet, every reachable destination—NLRI—now must be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets, as well, which means they must be stored, carried across the wire, etc.
The general assumption in the BGPsec community is the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.
But let’s lay aside this problem for a moment, and just assumed every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved right? Not so fast…
Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.
First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?
Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.
Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs down or our, the advertisement must be replaced. Given there are 800k routes in the default free zone, a timer of 24 hours (which would still make me uncomfortable in security terms), there would need to be 800k/24 hours updates per hour added to the load of every router in the Internet. On top of the reduced performance noted above.
Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?
Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customer’s trust, and their ability to out compete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.
Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.
Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, it must find the public key for B and F. This is going to be messy. The second is: By examining the publickeys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.
The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even setting aside BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no-one really wants exposed. The tradeoffs are ultimately unacceptable.
Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.
One of the nice things about IS-IS is the ability to run IPv6 and IPv4 in the same protocol, over a single instance. So long as the two topologies are congruent, deploying v6 as dual stack is very simply. But what if your topologies are not congruent? The figure below illustrates the difference.
In this network, there are two topologies, and each topology has two different set of level 1/level 2 flooding domain boundaries. If topology 1 is running IPv4, and topology 2 is running IPv4, it is difficult to describe such a pair of topologies with “standard” IS-IS. The actual flooding process assumes the flooding domain boundaries are on the same intermediate systems, or that the two topologies are congruent.
One way to solve this problem today is to use IS-IS multi-topology, which allows the IPv6 and IPv4 routing information to be carried in separate TLVs so two different Link State Databases (LSDBs), so each IS can compute a different Shortest Path Tree (SPT), one for IPv4, and another for IPv6. Some engineers might find the concept of multi-topology confusing, and it seems like it might be overkill for other use cases. For instance, perhaps you do not care about incongruent topologies, but you do care about carrying IPv6 in a separate instance of IS-IS just to see how it all works, with the eventual goal of combining IPv4 and IPv6 into a single instance of IS-IS. Or perhaps there is some new feature you want to try on a production network in a different instance of IS-IS, without impacting the IS-IS instance that provides the production routing information. There are a number of use cases, of course—you can probably imagine a few.
How can these kind of use cases be solved in IS-IS? In EIGRP, for instance, the Autonomous System (AS) number is used as the EIGRP protocol port number on the wire, allowing multiple EIGRP processes to run in parallel on the same link (even though this capability has never been implemented, as far as I know). Some sort of parallel capability would need to be created for IS-IS; this is what RFC8202, IS-IS Multi-Instance, provides. Not only does RFC8202 provide this capability, it also illustrates an interesting use case for using TLVs in protocol design, rather than fixed length fields.
For readers not familiar with these concepts, fixed length field protocols marshal data into fields of a fixed length, each of which represents a single piece of information. The metadata required to interpret the data is carried entirely in the protocol specification; the protocol itself does not carry any information about the information, only the information itself. The positive attribute of working with fixed length fields is the amount of information carried on the wire is minimized. The negative is that any change in the protocol requires deploying a new version throughout the network. It is difficult to “ignore” bits that are carried without introducing failures. Further, in a fixed length field format, as new information is pushed into the protocol, either new packet formats must be created and handled, or the length of any given packet must be increased.
Type/Length/Value (TLV) formats carry the kind of information in the specification, but they carry information about the kind of information being carried, and the size of the information being carried, in the protocol itself. This means the packet format is larger, but the protocol is more flexible.
In the case of RFC8202, adding this kind of multi-instance capability in a fixed length field formatted protocol would require a shift in the packet format. In a TLV based protocol, like IS-IS, you can add new features can be added by adding a new TLV; this is precisely what RFC8202 does. To provide multi-instance capability, RFC8202 adds a new multi-instance TLV to the IS-IS PDU, which is the “outer packet format” used to carry every other kind of IS-IS information, including hellos, link state information, etc. This new TLV carries an instance ID, which differentiates each instance of IS-IS.
The instance IDs must be configured the same on each IS so they match in order to build adjacencies. Point to point and broadcast operation works the same as “standard” IS-IS, including Designated Intermediate System operation on each instance, etc. IS-IS would be implemented on each IS so each instance will have a separate LSDB, an a separate SPT would be computed across each of these LSDBs. The other key factor will be implementing multiple routing tables, and then finding some way to route traffic using the correct routing table. In the case of IPv4 and IPv6, this is fairly simple to sort out, but it would be more complex in other cases.
RFC8202 adds a new and interesting capability to IS-IS—it may take some time for vendors to implement and deploy this new capability, but this should make IS-IS more flexible in real world situations where multiple interior gateway protocol on a single network.
On this edition of the History of Networking at the Network Collective, we discuss the history of BGP churn and Add Paths with Daniel Walton of Cumulus Networks. The original post on the Network Collective can be seen here.
Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.