ROUTING

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. It just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network, bringing it to a stable state while we worked out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting a large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500s with SSEs, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, one odd thing stood out: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed” basis. In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.
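To make the failure mode concrete, here is a toy model of that fragmentation pattern in Python. The block sizes, the first-fit policy, and the process names are all invented for illustration; this is not how the IOS allocator actually worked. It only shows how interleaved allocations can leave plenty of free memory in total, with no single block large enough to build a packet.

PACKET = 1500    # bytes needed to build one reply packet
SMALL = 300      # bytes for some other process's long-lived control blocks

# Fill the heap with interleaved allocations from the two processes,
# then let the other process's churn free its small blocks.
heap = []        # blocks in address order: (owner, size)
for _ in range(20):
    heap.append(("eigrp", PACKET))
    heap.append(("other", SMALL))
heap = [("free", size) if owner == "other" else (owner, size)
        for owner, size in heap]

def first_fit(heap, size):
    """Return the index of the first free block big enough, or None."""
    for index, (owner, block) in enumerate(heap):
        if owner == "free" and block >= size:
            return index
    return None

free_total = sum(size for owner, size in heap if owner == "free")
largest_free = max(size for owner, size in heap if owner == "free")
print(f"total free memory: {free_total} bytes")    # several packets' worth
print(f"largest free block: {largest_free} bytes") # far less than one packet
print(first_fit(heap, PACKET))                     # None: the reply cannot be built

When the allocation comes back empty, the only option is to put the reply back on the work queue and try again later, while the neighbor’s SIA timer keeps running.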

What we were seeing here was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer was ticking away in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

  • EIGRP receives a set of queries to which it must respond
  • EIGRP allocates memory for each packet to build the responses
  • Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
  • EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
  • EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.
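The loop is easy to see in a few lines of Python. Every number below is invented (queue depth, buffers per tick, timeout, and the ratio of new queries generated per reset); the only point is the shape of the curve: each timeout creates more work than the box can clear.

BUFFERS_PER_TICK = 2   # reply buffers the router actually manages to allocate
SIA_TIMEOUT = 3        # ticks a query can wait before the neighbor gives up
QUERIES_PER_RESET = 3  # new queries generated for every query that times out

queue = [0] * 20       # ages of queries waiting for a reply buffer
for tick in range(1, 11):
    queue = [age + 1 for age in queue]    # everything waits one more tick
    queue = queue[BUFFERS_PER_TICK:]      # a few replies do get built and sent
    expired = [age for age in queue if age > SIA_TIMEOUT]
    queue = [age for age in queue if age <= SIA_TIMEOUT]
    if expired:                           # SIA: the adjacency resets, the table
        queue += [0] * (len(expired) * QUERIES_PER_RESET)   # is withdrawn, more queries arrive
    print(f"tick {tick:2d}: {len(queue)} queries waiting")

Each pass through the timeout leaves the queue longer than the last, which is exactly the “rinse and repeat” behavior described above.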

The fundamental problem with positive feedback loops is that they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least sometimes), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the SIA incidents in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Thoughts on Open/R

Since Facebook has released their Open/R routing platform, there has been a lot of chatter around whether or not it will be a commercial success, whether or not every hyperscaler should use the protocol, whether or not this obsoletes everything in routing before this day in history, etc., etc. I will begin with a single point.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

Design is about tradeoffs. Protocol design is no different from any other design. Hence, we should expect that Open/R makes some tradeoffs. I know this might be surprising to some folks, particularly in the crowd that thinks every new routing system is going to be a silver bullet that solves every problem from the past, that the routing singularity has now occurred, etc. I’ve been in the world of routing since the early 1990’s, perhaps a bit before, and there is one thing I know for certain: if you understand the basics, you understand there is no routing singularity, and there never will be—at least not until someone produces a quantum wave routing protocol.

The reality is you always face one of two choices in routing: build a protocol specifically tuned to a particular set of situations, which means application requirements, topologies, etc., or build a general purpose protocol that “solves everything,” at some cost. BGP is becoming the latter, and is suffering for it. Open/R is an instance of the former.

Which means the interesting question is: what are they solving for, and how? Once you’ve answered this question, you can then ask: would this be useful in my network?

A large number of the points, or features, highlighted in the first blog post are well known routing constructions: IPv6 link local only addressing, graceful restart, draining and undraining nodes, exponential backoff, carrying random information in the protocol, and link status monitoring. These are common features of many protocols today, so we can safely pass over them here. There are a couple of interesting features, however, worth discussing.

Dynamic Metrics. EIGRP once had dynamic metrics, and they were removed. This simple fact always makes me suspicious when I see dynamic metrics touted as a protocol feature. Looking at the heritage of Open/R, however, dynamic metrics were probably added for one specific purpose: to support wireless networks. This functionality is, in fact, provided through DLEP, and supported in OLSR, MANET extended OSPF, and a number of other MANET control planes. Support for DLEP and dynamic metrics based on radio information was, in fact, discussed in the BABEL working group at the recent Singapore IETF, and the BABEL folks are working on integrating dynamic metrics for wireless. So this feature not only makes sense in the wireless world, it’s actually much more widespread than might be apparent if you are looking at the world from an “Enterprise” point of view.

But while this is useful, would you want this in your data center fabric? I’m not certain you would. I would argue dynamic metrics are actually counterproductive in a fabric. What you want, instead, is basic reachability provided by the distributed control plane (routing protocol), and a controller that sits on top, using an overlay mechanism of some kind to do traffic engineering. You don’t want this sort of policy stuff in a routing protocol in a contained environment like a fabric.

Which leads us to our second point: The API for the controller. This is interesting, but not strictly new. Openfabric, for instance, already postulates such a thing, and the entire I2RS working group in the IETF was formed to build such an interface (though it has strayed far from this purpose, as usual with IETF working groups). The really interesting thing, though, is this: this southbound interface is built into the routing protocol itself. This design decision makes a lot of sense in a wireless network, but, again, I’m not certain it does in a fabric.

Why not? It ties the controller architecture, including the southbound interface, to the routing protocol. This reduces component flexibility, which means it is difficult to replace one piece without replacing the other. If you wanted to replace the basic functionality of Open/R without replacing the controller architecture at some point in the future, you would have to hack your way around this problem. In a monolithic system like Facebook’s, this might be okay, but in most other network environments, it’s not. In other words, this is a rational decision for Open/R, but I’m not certain it can, or should, be generalized.

This leads to a third observation: This is a monolithic architecture. While in most implementations there is a separate RIB, FIB, and interface into the forwarding hardware, Open/R combines all these things into a single system. In any form of Linux based network operating system, for instance, the routing processes install routes into Zebra, which then installs routes into the kernel and notifies processes about routes through the Forwarding Plane Manager (FPM). Some external process (switchd in Cumulus Linux, SWSS in SONiC) then carries this routing information into the hardware.

Open/R, from the diagrams in the blog post, pushes all of this, including the southbound interface from the controller, into a different set of processes. The traditional lines are blurred, which means the entire implementation acts as a single “thing.” You are not going to take the BGP implementation from snaproute or FRRouting and run it on top of Open/R without serious modification, nor are you going to run Open/R on ONL or SONiC or Cumulus Linux without serious modification (or at least a lot of duplication of effort someplace).

This is probably an intentional decision on the part of Open/R’s designers—it is designed to be an “all in one solution.” You RPM it onto a device, with nothing else, and it “just works.” This makes perfect sense in the wireless environment, particularly for Facebook. Whether or not it makes perfect sense in a fabric depends—does this fit into the way you manage boxes today? Do you plan on using boxes Facebook will support, or roll your own drivers as needed for different chipsets, or hope the SAI support included in Open/R is enough? Will you ever need segment routing, or some other capability? How will those be provided for in the Open/R model, given it is an entire stack, and does not interact with any other community efforts?

Finally, there are a number of interesting points that are not discussed in the publicly available information. For instance, this controller—what does it look like? What does it do? How would you do traffic engineering with this system? Segment routing, MPLS—none of the standard ways of providing virtualization are mentioned at all. Dynamic metrics simply are not enough in a fabric. How is the flooding of information actually done? In the past, I’ve been led to believe this is based on ZeroMQ—is this still true? How optimal is ZeroMQ for flooding information? What kind of telemetry can you get out of this, and is it carried in the protocol, or in a separate system? I assume they want to carry telemetry as opaque information flooded by the protocol, but does it really make sense to do this?

Overall, Open/R is interesting. It’s a single protocol designed to operate optimally in a small range of environments. As such, it has some interesting features, and it makes some very specific design choices. Are those design choices optimal for more general cases, or even other specific problem spaces? I would argue the architecture, in particular, is going to be problematic in terms of long term maintenance and growth. This can be modified over time, of course, but then we are left with a collection of ideas that are available in many other protocols, making the idea much less interesting.

Is it interesting? Yes. Is it the routing singularity? No. As engineers, we should take it for what it is worth—a chance to see how other folks are solving the problems they are facing in day-to-day operation, and thinking about how some of those lessons might be applied in our own world. I don’t think the folks at Facebook would argue any more than this, either.

A glance back at the looking glass: Will IP really take over the world?

In 2003, the world of network engineering was far different than it is today. For instance, EIGRP was still being implemented on the basis of its ability to support multi-protocol routing. SONET, and other optical technologies, were just starting to come into their own, and all optical switching was just beginning to be considered for large scale deployment. What Hartley says of history holds true when looking back at what seems to be a former age: “The past is a foreign country; they do things differently there.”

Cross posted at CircleID

In the midst of this change, the Association for Computing Machinery (the ACM) published a paper entitled “Will IP really take over the world (of communications)?” This paper, written during the ongoing discussion within the engineering community over whether packet switching or circuit switching is the “better” technology, provides a lot of insight into the thinking of the time. Specifically, as the authors say, the belief that IP is better:

…is based on our collective belief that packet switching is inherently superior to circuit switching because of the efficiencies of statistical multiplexing, and the ability of IP to route around failures. It is widely assumed that IP is simpler than circuit switching, and should be more economical to deploy and manage.

Several of the reasons given are worth considering in light of the intervening years between the paper being written and today. Section 2 of the paper suggests four myths that need to be considered in the world of IP based packet switching.

The first of these myths is that IP is more efficient. Here the authors point out an early argument used by packet switching advocates: when a circuit is idle, reserved bandwidth is unused. Because packet switching does not reserve bandwidth in this way, it makes more efficient use of the available resources. The authors counter this argument by observing that packet switched networks are not, in fact, that heavily utilized, and then exploring why this might be so. The reasons they give are the instability of packet switched networks when they are overutilized, and the desire to drive delay down as much as possible. The authors say:

But simply reducing the average link utilization will not be enough to make users happy. For a typical user to experience low utilization, the variance of the network utilization needs to be low, too. Reducing variations in link utilization is hard; today we lack effective techniques to do it. It might be argued that the problem will be solved by research efforts on traffic management, congestion control, and multipath routing. But to-date, despite these problems being understood for many years, effective measures are yet to be introduced.
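The authors’ point about variance can be illustrated with a back-of-the-envelope queueing sketch. The numbers and the M/M/1 model are my own illustration, not from the paper; the point is only that delay grows nonlinearly with utilization, so a link with the same average load but higher variance feels much worse to its users.

def mm1_delay(rho, service_time=1.0):
    """Mean time in system for a simple M/M/1 queue at utilization rho."""
    return service_time / (1.0 - rho)

# A link that sits steadily at 40% utilization...
steady = mm1_delay(0.40)

# ...versus a link with the same 40% average, split between quiet
# periods at 10% and busy periods at 70% (a crude stand-in for variance).
bursty = 0.5 * mm1_delay(0.10) + 0.5 * mm1_delay(0.70)

print(f"steady 40% load:          mean delay ~ {steady:.2f} units")
print(f"bursty load, 40% average: mean delay ~ {bursty:.2f} units")

If the variance cannot be controlled, driving the average down is the remaining lever, which is part of why packet switched networks run at utilization levels that look wasteful on paper.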

The second of these myths is that IP is stable. Here the authors point out the many potential problems with packet switched networks, particularly in the area of in band signaling.

The third myth the authors consider is that IP is simpler. Here the argument is that circuit switched networks have fewer moving parts, each of which can be more regimented in their implementation. This means circuit switched networks are generally simpler than their packet switched counterparts.

The fourth myth the authors consider is that real time traffic will be able to run over IP based networks. They argue that because of the various quality of service problems, near real time traffic, such as voice and video, can never be carried over an IP network.

What did the authors get right, and what did they get wrong? Looking back over the intervening years, some of these problems seem to have been largely solved. For instance, research into new quality of service solutions and virtualization technologies, along with the development and deployment of new ways of managing traffic flows over a packet switched network (as implemented in QUIC, for instance), has largely eased the problems the authors discuss.

Other problems in their list seem to have simply fallen by the wayside. For instance, the authors are largely correct on the stability of IP routing. The Internet, as it exists today, is not really stable. The default free zone, the Internet core, never does really converge. But, on the other hand, it turns out that it might not be all that important. So long as reachability exists, and the applications running over the network can deal with the variable delay, then it does not really matter. It is clear that IP networks can be used for near real time traffic at this point, as well.

The third myth is, perhaps, the most interesting—IP is more complex. Circuit switching generally relies on a centralized control plane, rather than a distributed one. Is this truly simpler? Advocates of Software Defined Networks have argued recently that it is, but I am not convinced. The problem here, it seems to me, is the conflation of reachability and policy into a single “thing” that must be solved by a single subsystem of the network. I tend to think we will have a much healthier discussion around control plane design and operation once we split the various purposes the control plane serves into different pieces, and discuss each one as a separate problem and solution space.

What did the authors get right? They were right in surmising that circuit switched networks, particularly in the area of optical switching, would continue to play a role in the global Internet. In fact, you could argue that packet and circuit switched networks often live side-by-side in the global Internet today, with many long haul and metropolitan area links being good examples of circuit switched networks underlying a packet switched overlay. MPLS, of course, also provides a sort of hybrid mode between circuit and packet switching which enables many of the solutions deployed today.

Much as we might like to look back in derision at papers such as this, the reality is the authors were accurately pointing out many of the problems with packet switched networks, some of which still have not been solved today. These objections should not be taken as a relic of the past, but rather as a pointer to what yet remains to be solved, where problems may crop up again, and, finally, where we might look for solutions that may, ultimately, solve the problem of carrying data at scale.

BGPsec and Reality

From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Inter-Domain Routing (SIDR) working group, is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.

In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.

To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.
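Here is a rough sketch of that forward-signing idea in Python. This is emphatically not the real BGPsec wire format or cryptography—BGPsec uses per-AS key pairs and ECDSA over carefully defined path data—so HMACs over invented strings stand in for the signatures, and the prefix is just a documentation address. The shape of the chain is the point: each AS signs over the path so far plus the AS it is sending the update to.

import hashlib
import hmac

AS_KEYS = {65000: b"key-65000", 65001: b"key-65001"}    # stand-ins for key pairs

def sign_hop(prefix, path_so_far, target_as, signing_as):
    """Sign the prefix, the path so far, and the AS this update is sent to."""
    message = f"{prefix}|{','.join(map(str, path_so_far))}|to:{target_as}".encode()
    return hmac.new(AS_KEYS[signing_as], message, hashlib.sha256).hexdigest()

prefix = "192.0.2.0/24"                                  # illustrative prefix

# AS65000 originates the route and signs toward AS65001.
sig_65000 = sign_hop(prefix, [65000], 65001, 65000)

# AS65001, at F, adds itself to the path and signs toward AS65003.
sig_65001 = sign_hop(prefix, [65000, 65001], 65003, 65001)

signed_update = {
    "prefix": prefix,
    "sigs": [(65000, 65001, sig_65000), (65001, 65003, sig_65001)],
}

def validate(update, receiving_as):
    """H re-derives each signature in order; the last hop must target us."""
    path = []
    for signer, target, signature in update["sigs"]:
        path.append(signer)
        if sign_hop(update["prefix"], path, target, signer) != signature:
            return False
    return update["sigs"][-1][1] == receiving_as

print(validate(signed_update, 65003))    # True: AS65000 meant this to reach AS65001,
                                         # and AS65001 meant it to reach AS65003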

Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.

But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP increases performance is by gathering updates together and sending one update whenever possible, using either a peer group or an update group. Worse yet, every reachable destination—NLRI—now must be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets as well, which means they must be stored, carried across the wire, etc.
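A rough count shows why this matters. The peer count and packing density below are invented; only the table size comes from the commonly cited size of the default free zone, but the shape of the result holds regardless.

prefixes = 800_000                  # roughly the default free zone table
peers = 50                          # an arbitrary edge-router peer count
prefixes_per_packed_update = 100    # an arbitrary packing density

# Packed updates, built once and replicated across a peer group...
packed = prefixes / prefixes_per_packed_update

# ...versus one signed update per prefix, per peer.
per_prefix_per_peer = prefixes * peers

print(f"packed, peer grouped: ~{packed:,.0f} updates built")
print(f"signed, per peer:     ~{per_prefix_per_peer:,.0f} updates built")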

The general assumption in the BGPsec community is that the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved, is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.

But let’s lay aside this problem for a moment, and just assume every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved, right? Not so fast…

Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising reachability to this destination? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.

First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?

Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.

Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs out, the advertisement must be replaced. Given there are 800k routes in the default free zone and a timer of 24 hours (which would still make me uncomfortable in security terms), there would be roughly 800k/24, or more than 33,000, additional updates per hour added to the load of every router in the Internet. On top of the reduced performance noted above.
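The arithmetic scales badly as the timer shrinks toward a value that is actually meaningful in security terms. The shorter timer values below are my own comparison points, not anything proposed in the drafts:

DFZ_ROUTES = 800_000    # routes in the default free zone, per the text above

for timer_hours in (24, 4, 1):
    per_hour = DFZ_ROUTES / timer_hours
    per_second = per_hour / 3600
    print(f"{timer_hours:2d} hour timer: ~{per_hour:9,.0f} refresh updates/hour "
          f"(~{per_second:5.1f}/second), before any real topology churn")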

Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?

Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customers’ trust, and on their ability to outcompete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.

Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.

Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, but also the public keys for B and F. This is going to be messy. The second is: by examining the public keys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.

The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even giving BGPsec implementations the benefit of the doubt on doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no one really wants exposed. The tradeoffs are ultimately unacceptable.

Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.

On the ‘web: What’s Wrong with BGP

Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.

You can jump to the original post on Packet Pushers here.

BGP Persistent Oscillation

After Daniel Walton visited the History of Networking at the Network Collective, I went back and poked at BGP permanent route oscillations just to refresh my memory. Since I spent the time, I thought it was worth a post, with some observations. When working with networking problems, it is always wise to begin with a network, so…

For those who are interested, I’m pretty much following RFC3345 in this explanation.

There are two BGP route reflectors here, in two different clusters, labeled A and D. The metric for each link is listed on the links between the RR clients, B, C, and E, and the RRs; the cost of the link between the RRs is 1. A single route, 2001:db8:3e8:100::/64, is being advertised with an AS path of the same length from three different eBGP peering points, each with a different MED. E is receiving the route with a MED of 0, C with a MED of 1, and B with a MED of 10.

Starting with A, walk through one cycle of the persistent oscillation. At A there are two routes—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

When A runs the bestpath calculation, it will determine the best path should be through C, rather than B (because the IGP cost is lower; the MEDs are not compared due to different AS paths), so it will send an update to each of its peers about this change in its best path. This results in D having the following table—

edge MED IGP Cost
E    0   12 (BEST)
C    1   5

When D receives this update from A, it will calculate the best path towards 2001:db8:3e8:100::/64, and choose the path through E because this path has the lowest MED (the MEDs are compared here because the AS path is the same). On completing the best path calculation, D will send an update to its peers, letting them know about its new best path to the destination, which primarily means A. On receiving this update, A has three routes to the destination in its table—

edge MED IGP Cost
E    0   13
C    1   4
B    10  5 (BEST)

The key point to think through here is why the third route in the table is best. The BGP bestpath process compares each pair of routes, starting with the first pair. So the first two routes are compared, the best of those two is chosen and compared with the third, the best of these two is compared with the fourth, etc., until the end of the table is reached. Here the first two paths are compared, and the path through E wins because of the lower MED (the AS path is the same). When the path through E is compared to the path through B, however, the path through B wins because the IGP metric is lower (the AS path is different).
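The non-transitivity at the heart of this is easy to demonstrate with a stripped-down version of the comparison in Python. Only the two rules that matter here are modeled (MED when the AS path is the same, IGP cost otherwise); real bestpath selection has many more steps.

def better(x, y):
    """Return the preferred of two paths under this reduced rule set."""
    if x["as_path"] == y["as_path"]:
        return x if x["med"] <= y["med"] else y
    return x if x["igp"] <= y["igp"] else y

def bestpath(paths):
    best = paths[0]
    for candidate in paths[1:]:
        best = better(best, candidate)
    return best

# A's three-route table from above.
paths_at_a = [
    {"edge": "E", "as_path": "X", "med": 0,  "igp": 13},
    {"edge": "C", "as_path": "X", "med": 1,  "igp": 4},
    {"edge": "B", "as_path": "Y", "med": 10, "igp": 5},
]

print(bestpath(paths_at_a)["edge"])                    # B: the pairwise walk lands here
print(better(paths_at_a[1], paths_at_a[2])["edge"])    # C: yet C beats B head to head

E beats C on MED, B beats E on IGP cost, and C beats B on IGP cost: the preference relation is circular, so the winner depends entirely on which routes happen to be in the table and the order in which they are compared.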

At this point, A will now send an update to its peers, specifically D, informing D of this change in its best path. Note this update removes any information about the path through C from D’s table, so D only has a partial view of the available paths. As a result, D will have the following in its table—

edge MED IGP Cost
E    0   12
B    10  6 (BEST)

D will select the path through B because the IGP cost is lower, and send an update to A with this new information. This will result in A having the table this process started with—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

What is interesting about this is the removal of information about the path through C from D’s view of the network. Essentially, what is happening here is D is switching between two different views of the network topology, one of which includes B, the other of which includes C. The reason the ADD_PATH extension solves this problem is that A and D both have a full view of every exit point once each BGP speaker sends every route to each destination, rather than just the best path.

This is, in effect, another instance of the inconsistency of a distributed database causing a persistent condition in a control plane. In (loosely!) CAP theorem terms, distributed routing protocols always choose availability (the local device can read the database to calculate loop free paths) and partition tolerance (the database is always copied to every device speaking the protocol) over consistency—eventually, or “not always,” consistent databases will always be the result of such a situation. As A and D read their databases, each of which contains incomplete information about the real state of the network, they will make different decisions about what the best path to the destination in question is. As they each change their views of the topology, they will send updated information to one another, causing the other BGP speaker to recompute its view of the topology, and…

Persistent BGP oscillation is an interesting study in the way consistency impacts distributed routing protocol design and convergence.

On the ‘web: All you ever wanted to know about EIGRP

In episode 5 the Network Collective panel dives deep into the inner-workings of EIGRP and how to tune the protocol to work best for you. This isn’t your run of the mill EIGRP training session though, so buckle up and dig in to learn a lot about a protocol which appears pretty straight forward on the surface.

This last week I was on the Network Collective discussing EIGRP with Nick Russo; even if you think this protocol is dead, it’s well worth watching or listening to. And if this isn’t enough EIGRP for you, the EIGRP book from Addison-Wesley is another good resource.


Anycast and Latency

One of the things I hear from time to time is that smaller Internet facing service deployments, with just a few instances, cannot really benefit from anycast. Particularly in the active-active data center use case, where customers can connect to one data center or another, advertising the service as an anycast, with the resulting requirement to keep the backend databases tightly synchronized, is often portrayed as taking on a lot of complexity just for the simplicity of having a single address in the DNS system, and hence not losing customer interaction time while the DNS records time out so the customer can reconnect to the service.

There is, in fact, some interesting recent research in this area. The research is directed at the DNS root servers themselves, probably because they are publicly accessible, and a well known system that has relied on anycast for many years (so the operators of the root DNS servers are probably well versed in the ways of anycast). One interesting chart from the post over at APNIC’s blog compares the RTT performance seen by clients of the different root servers.

The C root has 8 servers, while the L root has around 144 (according to the article pointed to above). Why is it that the C and L roots both show about the same performance, from an RTT perspective, even though they have wildly different numbers of servers serving an anycast address? The most likely explanation lies in the problem of diminishing returns; once you get past some number of servers in an anycast cluster, the gain from adding “one more server” really isn’t all that great.

How many servers? The authors of the paper say the magic number is 12.

But—it is important to point out that root servers serve a (largely) global audience, and so the “customers” of each of these anycast addresses are bound to be widely dispersed. For narrower audiences (geographically), the problem of diminishing returns will likely set in more quickly, perhaps with three or four servers servicing the anycast address.
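A toy model makes the diminishing returns easy to see. The geometry below is entirely made up (random sites and clients in a unit square, distance standing in for RTT, nearest-site selection standing in for BGP’s choice of catchment), so treat the output as an illustration of the curve’s shape rather than a prediction:

import random
random.seed(1)

clients = [(random.random(), random.random()) for _ in range(5000)]

def median_nearest_distance(sites):
    """Median distance from each client to its closest anycast site."""
    distances = sorted(
        min(((cx - sx) ** 2 + (cy - sy) ** 2) ** 0.5 for sx, sy in sites)
        for cx, cy in clients
    )
    return distances[len(distances) // 2]

for count in (1, 2, 4, 8, 16, 32, 64, 128):
    sites = [(random.random(), random.random()) for _ in range(count)]
    print(f"{count:3d} sites: median distance {median_nearest_distance(sites):.3f}")

The first few sites cut the median distance sharply; going from a handful of sites to more than a hundred barely moves it, which is roughly the pattern the C and L roots display.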

The bottom line? Geographic dispersion of your customer base probably has more impact on the effectiveness of anycast in spreading load than the number of servers servicing the anycast address, above some minimal number (probably around three or four). If you are running any sort of service that needs high availability, even if it is grounded in TCP rather than UDP or QUIC, it is well worth looking at anycast to provide faster service and higher availability.

Don’t Leave Features Lying Around

Many years ago, when multicast was still a “thing” everyone expected to spread throughout the Internet itself, a lot of work went into specifying not only IP multicast control planes, but also IP multicast control planes for interdomain use (between autonomous systems). BGP was modified to support IP multicast, for instance, in order to connect IP multicast groups from sender to receiver across the entire ‘net. One of these various efforts was a protocol called the Distance Vector Multicast Routing Protocol, or DVMRP. The general idea behind DVMRP was to extend many of the already well-known mechanisms for signaling IP multicast with interdomain counterparts. Specifically, this meant extending IGMP to operate across provider networks, rather than within a single network.

As you can imagine, one problem with any sort of interdomain effort is troubleshooting—how will an operator be able to troubleshoot problems with interdomain IGMP messages sourced from outside their network? There is no way to log into another provider’s network (some silliness around competition, I would imagine), so something else was needed. Hence the idea of being able to query a router for information about its connected interfaces, multicast neighbors, and other state was written up in draft-ietf-idmr-dvmrp-v3-11 (which expired in 2000). Included in this draft are two extensions to IGMP: Ask Neighbors2 and Neighbors2. If an operator wanted to know about a particular router which seemed to be causing problems with a particular multicast traffic flow, they could ask some local router to send the remote router an Ask Neighbors2 message. The receiving router would respond with a unicast Neighbors2 message providing details about the local configuration of interfaces, multicast neighbors, and other odds and ends.

If this is starting to sound like a bad idea, that’s because it probably is a bad idea… But many vendors implemented it anyway (probably because there were fat checks associated with implementing the feature, the main reason vendors implement most things). Now some more recent security researchers enter into the picture, and start asking questions like, “I wonder if this functionality can be used to build a DDoS attack.” As it turns out, it can.

Team Cymru set about scanning the entire IPv4 address space to discover any routers “out there” that might happen to support Ask Neighbors2, and to figure out what the response packets would look like. The key point is, of course, that the source address of the Ask Neighbors2 packet can be forged, so you can send a lot of routers Ask Neighbors2 packets, and—by using the source address of some device you would like to attack—have all of the routers known to respond send back Neighbors2 messages. The key questions were, then, how many responders there would be, and how large the replies would be.

The good news is they only found around 305,000 responders to the Ask Neighbors2 request. Those responders, however, transmitted some 263 million packets, most of which were much larger than the original query. This could, therefore, actually be a solid base for a nice DDoS attack. Cisco and Juniper issued alerts, and started working to remove this code from future releases of their operating systems.
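A quick bit of arithmetic on the numbers above shows why this qualifies as a solid base for reflection attacks. The assumed response size is my own guess, not a figure from the scan:

responders = 305_000                 # routers that answered Ask Neighbors2
response_packets = 263_000_000       # packets those responders sent back
assumed_response_bytes = 700         # assumed average Neighbors2 reply size

per_responder = response_packets / responders
total_gigabytes = response_packets * assumed_response_bytes / 1e9

print(f"~{per_responder:.0f} response packets per responder, on average")
print(f"~{total_gigabytes:.0f} GB of reflected traffic, if replies average "
      f"{assumed_response_bytes} bytes")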

One interesting result of this test was that the Cisco implementation of the Neighbors2 response actually contained the IOS Software version number, rather than the IGMP version number, or some other version number. The test revealed that some 1.3% of the responding Cisco routers are running IOS version 10.0, which hasn’t been supported in 20 years. 73.7% of the Cisco routers that responded are running IOS 12.x, which hasn’t been supported in 4 years.

There are a number of lessons here, including—

  • Protocols designed to aid troubleshooting can often be easily turned into an attack surface
  • Old versions of code might well be vulnerable to things you don’t know about, would never have looked for, and likely would never even have thought to look for
  • Large feature sets often also have large attack surfaces; it is almost impossible to actually know about, or even think through, where every possible attack surface might be in tens of millions of lines of code

It is the last lesson I think network engineers need to take to heart. The main thing network engineers seem to do all day is chase new features. Maybe we need an attitude adjustment in this—even new features are a tradeoff. This is one of those points that make disaggregation so very interesting in large scale networks.

At some point, it’s not adding features that is interesting. It’s removing them.