CONTENT TYPE
The EIGRP SIA Incident: Positive Feedback Failure in the Wild
Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.
Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.
So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.
In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?
Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?
After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.
Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.
Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…
You can see how this quickly developed into a positive feedback loop—
- EIGRP receives a set of queries to which it must respond
- EIGRP allocates memory for each packet to build the responses
- Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
- EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
- EIGRP SIA timer times out, causing a flood of new queries…
Rinse and repeat until the network fails to converge.
There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.
But rest assured—there are still positive feedback loops lurking in some corner of every network.
DFS and Low Points
On a recent history of networking episode, Alia talked a little about Maximally Redundant Trees (MRTs), and the concept of Depth First Search (DFS) numbering, along with the idea of a low point. While low points are quickly explained in my new book in the context of MRTs, I thought it worthwhile to revisit the concept in a blog post. Take a look at the following network:

On the left side is a small network with the nodes (think of these as routers) being labeled from A through G. On the right side is the same network, only each node has been numbered by traversing the graph, starting at A. This process, in a network, would either require some device which knows about every node and edge (link) in the network, or it would require a distributed algorithm that “walks” the network from one node to another, numbering each node as it is touched, and skipping any node that has already been visited (again, for more details on this, please see the book).
Once this numbering has been done, the numbers now produce this interesting property: if you remove the parent of any node, and the node can still reach a number lower than its own number, the network is two-connected. Take E, numbered as 5, as an example. E’s parent is D, labeled as 3 on the numbered side of the illustration. If you remove D from the network, what is the lowest numbered node E can reach? Start by jumping to the lowest numbered neighbor. In this case, E only has one neighbor remaining, C, which is numbered 6. From here, what is the lowest numbered neighbor of C? It is A, with a number of 1.
D, then, can reach a node which is numbered 1 through some other neighbor than its parent. This means D has some other path to the parent than through its parent, which means D is part of a topology with at least two connections to some other node in the network—it is two connected.
Using this sort of calculation, you can find alternate paths in a network. The problem with using DFS numbering for this is what was stated above—the calculation requires either a “walk through the network” protocol, or it requires some device with a complete view of the network (an LSDB, in link state terms). Neither of these are really conducive to real time calculation during a topology change. MRT solves this by using low points from DFS numbering with Dijkstra’s SPF algorithm to allow the calculation of disjoint paths in near real time in a distributed control plane.
If you haven’t found the tradeoff…
This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:
[time-span]
In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.
Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.
But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.
What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.
Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.
What about routing protocols? The standards communities seem focused on creating and maintaining a small handulf of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.
In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.
The way to prevent landing on either extreme—in a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment, and no other device will do—is consider the tradeoffs.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.
Leave Your Ego at the Door
You are just about to walk into the interview room. Regardless of whether you are being interviewed, or interviewing—what are you thinking about? Are you thinking about winning? Are you thinking about whining? Or are you thinking about engaging? I have noticed, on many mailing lists, and in many other forums, that interviews in our world have devolved into a contest of egos.
The person on the other side of the table has some certification I don’t care about—how can I prove they are dumb, not as smart as their certification might indicate, or… The person on the other side of the table claims to know some protocol, can I find some bit of information they don’t know? These kinds of questions are really just ego questions—and you need to leave them at the door. This is particularly acute with certifications right now—a lot of people doubt the value of certifications, claiming folks who have them don’t know anything, the certifications are worthless, they don’t reflect the real world, etc.
I will agree that we have a problem with the depth and level of knowledge of network engineers at the moment. We all need to grow up a little, learn technologies rather than CLIs, and actually learn how to be engineers. On the other hand, when you interview someone, or when you are being interviewed…
Leave your ego at the door.
Is it really worth losing a really good hire because you needed your ego stroked by “beating” someone in an interview?
No, I didn’t think so.
Securing BGP: A Case Study (10)
The next proposed (and actually already partially operational) system on our list is the Router Public Key Infrastructure (RPKI) system, which is described in RFC7115 (and a host of additional drafts and RFCs). The RPKI systems is focused on solving a single solution: validating that the originating AS is authorized to originate a particular prefix. An example will be helpful; we’ll use the network below.

(this is a graphic pulled from a presentation, rather than one of my usual line drawings)
Assume, for a moment, that AS65002 and AS65003 both advertise the same route, 2001:db8:0:1::/64, towards AS65000. How can the receiver determine if both of these two advertisers can actually reach the destination, or only one can? And, if only one can, how can AS65000 determine which one is the “real thing?” This is where the RPKI system comes into play. A very simplified version of the process looks something like this (assuming AS650002 is the true owner of 2001:db8:0:1::/64):
- AS65002 obtains, from the Regional Internet Registry (labeled the RIR in the diagram), a certificate showing AS65002 has been issued 2001:db8:0:1::/64.
- AS65002 places this certificate into a local database that is synchronized with all the other operators participating in the routing system.
- When AS65000 receives a route towards 2001:db8:0:1::/64, it checks this database to make certain the origin AS on the advertisement matches the owning AS.
If the owner and the origin AS match, AS65000 can increase the route’s preference. If it doesn’t AS65000 can reduce the route’s preference. It might be that AS65000 discards the route if the origin doesn’t match—or it may not. For instance, AS65003 may know, from historical data, or through a strong and long standing business relationship, or from some other means, that 2001:db8:0:1::/64 actually belongs to AS65004, even through the RPKI data claims it belongs to AS65002. Resolving such problems falls to the receiving operator—the RPKI simply provides more information on which to act, rather than dictating a particular action to take.
Let’s compare this to our requirements to see how this proposal stacks up, and where there might be objections or problems.
Centralized versus Decentralized: The distribution of the origin authentication information is currently undertaken with rsync, which means the certificate system is decentralized from a technical perspective.
However—there have been technical issues with the rsync solution in the past, such that it can take up to 24 hours to change the distributed database. This is a pretty extreme case of eventual consistency, and it’s a major problem in the global default free zone. BGP might converge very slowly, but it still converges more quickly than 24 hours.
Beyond the technical problems, there is a business side to the centralized/decentralized issue as well. Specifically, many business don’t want their operations impacted by contract issues, negotiation issues, and the like. Many large providers see the RPKI system as creating just such problems, as the “trust anchor” is located in the RIRs. There are ways to mitigate this—just use some other root, or even self sign your certificates—but the RPKI system faces an uphill battle in this are from large transit providers.
Cost: The actual cost of setting up and running a server doesn’t appear to be very high within the RPKI system. The only things you need to “get into the game” are a couple of VMs or physical servers to run rsync, and some way to inject the information gleaned from the RPKI system into the routing decisions along the network edge (which could even be just plugging the information into existing policy mechanisms).
The business issue described above can also be counted as a cost—how much would it cost a provider if their origin authentication were taken out of the database for a day or two, or even a week or two, while a contract dispute with the RIR was worked out?
Information Cost: There is virtually no additional information cost involved in deploying the RPKI.
Other thoughts: The RPKI system wasn’t designed to, and doesn’t, validate anything other than the origin in the AS Path. It doesn’t, therefore, allow an operator to detect AS65003, for instance, claiming to be connected to AS65002 even though it’s not (or it’s not supposed to transit traffic to AS65002). This isn’t really a “lack” on the part of the RPKI, it’s just not something it’s designed to do.
Overall, the RPKI is useful, and will probably be deployed by a number of providers, and shunned by others. It would be a good component of some larger system (again, this was the original intent, so this isn’t a lack), but it cannot stand alone as a complete BGP security system.
Securing BGP: A Case Study (9)
There are a number of systems that have been proposed to validate (or secure) the path in BGP. To finish off this series on BGP as a case study, I only want to look at three of them. At some point in the future, I will probably write a couple of posts on what actually seems to be making it to some sort of deployment stage, but for now I just want to compare various proposals against the requirements outlined in the last post on this topic (you can find that post here).
The first of these systems is BGPSEC—or as it was known before it was called BGPSEC, S-BGP. I’m not going to spend a lot of time explaining how S-BGP works, as I’ve written a series of posts over at Packet Pushers on this very topic:
Part 1: Basic Operation
Part 2: Protections Offered
Part 3: Replays, Timers, and Performance
Part 4: Signatures and Performance
Part 5: Leaks
Considering S-BGP against the requirements:
- Centralized versus decentralized balance: S-BGP distributes path validation information throughout the internetwork, as this information is actually contained in a new attribute carried with route advertisements. Authorization and authentication are implicitly centralized, however, with the root certificates being held by address allocation authorities. It’s hard to say if this is the correct balance.
- Cost: In terms of financial costs, S-BGP (or BGPSEC) requires every eBGP speaker to perform complex cryptographic operations in line with receiving updates and calculating the best path to each destination. This effectively means replacing every edge router in every AS in the entire world to deploy the solution—this is definitely not cost friendly. Adding to this cost is the simply increase in the table size required to carry all this information, and the loss of commonly used (and generally effective) optimizations.
- Information cost: S-BGP leaks new information into the global table as a matter of course—not only can anyone see who is peered with whom by examining information gleaned from route view servers, they can even figure out how many actual pairs of routers connect each AS, and (potentially) what other peerings those same routers serve. This huge new chunk of information about provider topology being revealed simply isn’t acceptable.
Overall, then, BGP-SEC doesn’t meet the requirements as they’ve been outlined in this series of posts. Next week, I’ll spend some time explaining the operation of another potential system, a graph overlay, and then we’ll consider how well it meets the requirements as outlined in these posts.
Securing BGP: A Case Study (8)
Throughout the last several months, I’ve been building a set of posts examining securing BGP as a sort of case study around protocol and/or system design. The point of this series of posts isn’t to find a way to secure BGP specifically, but rather to look at the kinds of problems we need to think about when building such a system. The interplay between technical and business requirements are wide and deep. In this post, I’m going to summarize the requirements drawn from the last seven posts in the series.
Don’t try to prove things you can’t. This might feel like a bit of an “anti-requirement,” but the point is still important. In this case, we can’t prove which path along which traffic will flow. We also can’t enforce policies, specifically “don’t transit this AS;” the best we can do is to provide information and letting other operators make a local decision about what to follow and what not to follow. In the larger sense, it’s important to understand what can, and what can’t, be solved, or rather what the practical limits of any solution might be, as close to the beginning of the design phase as possible.
In the case of securing BGP, I can, at most, validate three pieces of information:
- That the origin AS in the AS Path matches the owner of the address being advertised.
- That the AS Path in the advertisement is a valid path, in the sense that each pair of autonomous systems in the AS Path are actually connected, and that no-one has “inserted themselves” in the path silently.
- The policies of each pair of autonomous systems along the path towards one another. This is completely voluntary information, of course, and cannot be enforced in any way if it is provided, but more information provided will allow for stronger validation.
There is a fine balance between centralized and distributed systems. There are actually things that can be centralized or distributed in terms of BGP security: how ownership is claimed over resources, and how the validation information is carried to each participating AS. In the case of ownership, the tradeoff is between having a widely trusted third party validate ownership claims and having a third party who can shut down an entire business. In the case of distributing the information, there is a tradeoff between the consistency and the accessibility of the validation information. These are going to be points on which reasonable people can disagree, and hence are probably areas where the successful system must have a good deal of flexibility.
Cost is a major concern. There are a number of costs that need to be considered when determining which solution is best for securing BGP, including—
- Physical equipment costs. The most obvious cost is the physical equipment required to implement each solution. For instance, any solution that requires providers to replace all their edge routers is simply not going to be acceptable.
- Process costs. Any solution that requires a lot of upkeep and maintenance is going to be cast aside very quickly. Good intentions are overruled by the tyranny of the immediate about 99.99% of the time.
Speed is also a cost that can be measured in business terms; if increasing security decreases the speed of convergence, providers who deploy security are at a business disadvantage relative to their competitors. The speed of convergence must be on the order of Internet level convergence today.
Information costs are a particularly important issue. There are at least three kinds of information that can leak out of any attempt to validate BGP, each of them related to connectivity—
- Specific information about peering, such as how many routers interconnect two autonomous systems, where interconnections are, and how interconnection points are related to one another.
- Publicly verifiable claims about interconnection. Many providers argue there is a major difference between connectivity information that can be observed and connectivity information that is claimed.
- Publicly verifiable information about business relationships. Virtually every provider considers it important not to release at least some information about their business relationships with other providers and customers.
While there is some disagreement in the community over each of these points, it’s clear that releasing the first of these is almost always going to be unacceptable, while the second and third are more situational.
With these requirements in place, it’s time to look at a couple of proposed systems to see how they measure up.
