TECH – Page 11 – rule 11 reader

Why is the Feasibility Condition Less Than?

A reader recently emailed me with this question: Why isn’t the condition for a Feasible Successor set to less than (<), rather than less than of equal (<=), in EIGRP? It certainly seems, as noted in the email, that this rules out a lot of possible possible loop free alternate paths. The network below will be used to illustrate.

First, assume all links are cost of 1 except D->C, which is cost of 2. Here D will choose B as the Successor, and the FC will be set to 2. The RD of C will be 1, so C will be an FS. Now consider two failures. The first failure is D->B. D will immediately reroute to the FS, which is C, without changing the FC. This works, because C’s cost to 100::/64 via D is 4, much higher than it’s cost to 100::64 along C->A. Now consider what happens if A->100::/64 fails. If the timing of the query “works right,” C and B will be notified first, then finally D. Even if D is somehow notified before C, and D switches to C as its FS, the traffic is dropped, rather than looped—so all is happy.

Now change the situation a little. Assume the A->C link is cost of 2, and the remaining links are 1. Now assume you make the FS condition <=, rather than "just" <. From D's perspective, C is still an FS. From C's perspective, D is also an FS. Suppose the B->D link fails. D switches to C, who’s path is intact; this works. Assume the C->A link fails. C switches to D, who’s path is intact; this works.

Finally, assume the A->100::/64 link fails. If the timing is just right, D and C will receive the update with this failure at the same moment, and will both switch to their FS’s. Now you have a loop. How long will this loop last? Until C and D can do a diffusing update — probably around 250ms or less… But if you count the outside computation time, it’s the SIA timer, which is around 10 minutes in more recent versions. Hence, “just the wrong circumstances” can cause up to a 10 minute loop. Not good.

The bottom line is this: any time you have a situation where two routers can end up pointing at one another as their local FS, you have a ring of some number of hops. If the final destination is outside the ring, some member of the ring must be the point at which traffic leaves the ring, and moves towards the destination. If the link connecting the ring to the destination fails, the update carrying the information about the loss of connectivity to the destination must travel around the ring in both directions.

When the update reaches the point at which routing would normally split horizon—the “point at which the waterfall splits,” so-to-speak, it will like reach both sides of the split horizon point at a close enough interview to cause a loop in the forwarding tables. This situation causes microloops in a link state protocol, but microloops are often resolved quickly, and hence tend to be tolerable. In a distance vector protocol, like EIGRP, the length of time the microloop can exist can be much longer—ultimately it depends on the speed at which a distributed computation can take place (because the computation is not local to each node), and, failing that, the amount of time the network can remain in an unstable state before “something is done about it.”

This, by the way, is why I am always opposed to increasing the SIA timer in EIGRP above the “factory defaults.” The SIA timer is essentially the amount of time you are willing to allow your network to remain unconverged in the worst case, and hence either dropping or looping traffic.

Posted in TECH, WRITTEN

OSPF Topology Transparent Zones

Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.

In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.

What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.

Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.

However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.

Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.

The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.

Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.

TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.

You can read about TTZs more in RFC8099.

Posted in STANDARDS, TECH, WRITTEN

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

EIGRP receives a set of queries to which it must respond
EIGRP allocates memory for each packet to build the responses
Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.

There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Posted in DESIGN, TECH, WRITTEN

DFS and Low Points

On a recent history of networking episode, Alia talked a little about Maximally Redundant Trees (MRTs), and the concept of Depth First Search (DFS) numbering, along with the idea of a low point. While low points are quickly explained in my new book in the context of MRTs, I thought it worthwhile to revisit the concept in a blog post. Take a look at the following network:

On the left side is a small network with the nodes (think of these as routers) being labeled from A through G. On the right side is the same network, only each node has been numbered by traversing the graph, starting at A. This process, in a network, would either require some device which knows about every node and edge (link) in the network, or it would require a distributed algorithm that “walks” the network from one node to another, numbering each node as it is touched, and skipping any node that has already been visited (again, for more details on this, please see the book).

Once this numbering has been done, the numbers now produce this interesting property: if you remove the parent of any node, and the node can still reach a number lower than its own number, the network is two-connected. Take E, numbered as 5, as an example. E’s parent is D, labeled as 3 on the numbered side of the illustration. If you remove D from the network, what is the lowest numbered node E can reach? Start by jumping to the lowest numbered neighbor. In this case, E only has one neighbor remaining, C, which is numbered 6. From here, what is the lowest numbered neighbor of C? It is A, with a number of 1.

D, then, can reach a node which is numbered 1 through some other neighbor than its parent. This means D has some other path to the parent than through its parent, which means D is part of a topology with at least two connections to some other node in the network—it is two connected.

Using this sort of calculation, you can find alternate paths in a network. The problem with using DFS numbering for this is what was stated above—the calculation requires either a “walk through the network” protocol, or it requires some device with a complete view of the network (an LSDB, in link state terms). Neither of these are really conducive to real time calculation during a topology change. MRT solves this by using low points from DFS numbering with Dijkstra’s SPF algorithm to allow the calculation of disjoint paths in near real time in a distributed control plane.

Posted in STANDARDS, TECH, WRITTEN

SLAAC and DHCPv6

When deploying IPv6, one of the fundamental questions the network engineer needs to ask is: DHCPv6, or SLAAC? As the argument between these two has reached almost political dimensions, perhaps a quick look at the positive and negative attributes of each solution are. Originally, the idea was that IPv6 addresses would be created using stateless configuration (SLAAC). The network parts of the address would be obtained by listening for a Router Advertisement (RA), and the host part would be built using a local (presumably unique) physical (MAC) address. In this way, a host can be connected to the network, and come up and run, without any manual configuration. Of course, there is still the problem of DNS—how should a host discover which server it should contact to resolve domain names? To resolve this part, the DHCPv6 protocol would be used. So in IPv6 configuration, as initially conceived, the information obtained from RA would be combined with DNS information from DHCPv6 to fully configure an IPv6 host when it is attached to the network.

There are several problems with this scheme, as you might expect. The most obvious is that most network operators do not want to deploy two protocols to solve a single problem—configuring IPv6 hosts. What might not be so obvious, however, is that many network operators care a great deal about whether hosts are configured statelessly or through a protocol like DHCPv6.

Why would an operator want stateful configuration? Primarily because they want to control which devices can receive an IPv6 address, and hence communicate with other devices on the network. When using DHCPv6, just like DHCP with IPv4, the operator can set parameters around what kinds of devices, or perhaps even which specific devices, will be able to receive an IPv6 address. Further, the DHCPv6 server can be tied to the DNS server, so each host which connects to the network can also be given a DNS entry. Proper DNS entries are often a requirement for many applications. There are Dynamic DNS (DDNS) implementations that can solve this problem, but they are not often considered secure enough for a controlled network environment.

Why would an operator want stateless autoconfiguration? First, because they want any random user who can successfully connect to the network to be able to get an IPv6 address without any other configuration, and without the provider needing run any sort of special protocol or configuration to allow this. In fact, DHCPv6, in some environments, at least, can be seen as an attack surface, or rather a hole through which attacks can potentially be driven. Second, stateful configuration also has a failover problem; if the DHCPv6 server fails, then hosts can no longer obtain an IPv6 address, and the network no longer works. This could be, to say the least, problematic for service providers. Finally, SLAAC has a set of privacy extensions outlined in RFC4941 that (theoretically) prevent a host from being tracked based on its IPv6 address over time. This is a very attractive property for edge facing service providers.

The original set of drafts, however, only provided for DNS information to be carried through DHCPv6, and had no failover mechanism for DHCPv6. These two things, together, made it impossible to use just one of these two options. More recent work, however, has remedied both parts of this problem, making either option able to stand on its own. RFC6106, which is a bit older (2010), provides for DNS advertisement in the RA protocol. This allows an operator who would like to run everything completely stateless to do so, including hosts learning which DNS resolver to use. On the other side, RFC8156, which was just ratified in July of 2017, allows a pair of DHCPv6 servers to act as a failover pair. While this is more complex than simple DHCPv6, it does solve the problem of a host failing to operate correctly simply because the DHCPv6 server has failed.

Which of the two is now the best choice? If you do not have any requirement to restrict the hosts that can attach to the network using IPv6, then SLAAC, combined with DNS advertisement in the RA, and possibly with DDNS (if needed), would be the right choice. However, if the environment must be more secure, then DHCPv6 is likely to be the better solution.

A word of warning, though—using DHCPv6 to ensure each host received an IPv6 address that can be used anyplace in the network, and then stretching layer 2 to allow any host to roam “anywhere,” is really just not a good idea. I have worked on networks where this kind of thing has been taken to a global scale. It might seem cute at first, but this kind of solution will ultimately become a monster when it grows up.

Posted in STANDARDS, TECH

Securing BGP: A Case Study (10)

The next proposed (and actually already partially operational) system on our list is the Router Public Key Infrastructure (RPKI) system, which is described in RFC7115 (and a host of additional drafts and RFCs). The RPKI systems is focused on solving a single solution: validating that the originating AS is authorized to originate a particular prefix. An example will be helpful; we’ll use the network below.

(this is a graphic pulled from a presentation, rather than one of my usual line drawings)

Assume, for a moment, that AS65002 and AS65003 both advertise the same route, 2001:db8:0:1::/64, towards AS65000. How can the receiver determine if both of these two advertisers can actually reach the destination, or only one can? And, if only one can, how can AS65000 determine which one is the “real thing?” This is where the RPKI system comes into play. A very simplified version of the process looks something like this (assuming AS650002 is the true owner of 2001:db8:0:1::/64):

AS65002 obtains, from the Regional Internet Registry (labeled the RIR in the diagram), a certificate showing AS65002 has been issued 2001:db8:0:1::/64.
AS65002 places this certificate into a local database that is synchronized with all the other operators participating in the routing system.
When AS65000 receives a route towards 2001:db8:0:1::/64, it checks this database to make certain the origin AS on the advertisement matches the owning AS.

If the owner and the origin AS match, AS65000 can increase the route’s preference. If it doesn’t AS65000 can reduce the route’s preference. It might be that AS65000 discards the route if the origin doesn’t match—or it may not. For instance, AS65003 may know, from historical data, or through a strong and long standing business relationship, or from some other means, that 2001:db8:0:1::/64 actually belongs to AS65004, even through the RPKI data claims it belongs to AS65002. Resolving such problems falls to the receiving operator—the RPKI simply provides more information on which to act, rather than dictating a particular action to take.

Let’s compare this to our requirements to see how this proposal stacks up, and where there might be objections or problems.

Centralized versus Decentralized: The distribution of the origin authentication information is currently undertaken with rsync, which means the certificate system is decentralized from a technical perspective.

However—there have been technical issues with the rsync solution in the past, such that it can take up to 24 hours to change the distributed database. This is a pretty extreme case of eventual consistency, and it’s a major problem in the global default free zone. BGP might converge very slowly, but it still converges more quickly than 24 hours.

Beyond the technical problems, there is a business side to the centralized/decentralized issue as well. Specifically, many business don’t want their operations impacted by contract issues, negotiation issues, and the like. Many large providers see the RPKI system as creating just such problems, as the “trust anchor” is located in the RIRs. There are ways to mitigate this—just use some other root, or even self sign your certificates—but the RPKI system faces an uphill battle in this are from large transit providers.

Cost: The actual cost of setting up and running a server doesn’t appear to be very high within the RPKI system. The only things you need to “get into the game” are a couple of VMs or physical servers to run rsync, and some way to inject the information gleaned from the RPKI system into the routing decisions along the network edge (which could even be just plugging the information into existing policy mechanisms).

The business issue described above can also be counted as a cost—how much would it cost a provider if their origin authentication were taken out of the database for a day or two, or even a week or two, while a contract dispute with the RIR was worked out?

Information Cost: There is virtually no additional information cost involved in deploying the RPKI.

Other thoughts: The RPKI system wasn’t designed to, and doesn’t, validate anything other than the origin in the AS Path. It doesn’t, therefore, allow an operator to detect AS65003, for instance, claiming to be connected to AS65002 even though it’s not (or it’s not supposed to transit traffic to AS65002). This isn’t really a “lack” on the part of the RPKI, it’s just not something it’s designed to do.

Overall, the RPKI is useful, and will probably be deployed by a number of providers, and shunned by others. It would be a good component of some larger system (again, this was the original intent, so this isn’t a lack), but it cannot stand alone as a complete BGP security system.

Posted in SECURITY, TECH, WRITTEN

Securing BGP: A Case Study (9)

There are a number of systems that have been proposed to validate (or secure) the path in BGP. To finish off this series on BGP as a case study, I only want to look at three of them. At some point in the future, I will probably write a couple of posts on what actually seems to be making it to some sort of deployment stage, but for now I just want to compare various proposals against the requirements outlined in the last post on this topic (you can find that post here).

The first of these systems is BGPSEC—or as it was known before it was called BGPSEC, S-BGP. I’m not going to spend a lot of time explaining how S-BGP works, as I’ve written a series of posts over at Packet Pushers on this very topic:

Part 1: Basic Operation
Part 2: Protections Offered
Part 3: Replays, Timers, and Performance
Part 4: Signatures and Performance
Part 5: Leaks

Considering S-BGP against the requirements:

Centralized versus decentralized balance: S-BGP distributes path validation information throughout the internetwork, as this information is actually contained in a new attribute carried with route advertisements. Authorization and authentication are implicitly centralized, however, with the root certificates being held by address allocation authorities. It’s hard to say if this is the correct balance.
Cost: In terms of financial costs, S-BGP (or BGPSEC) requires every eBGP speaker to perform complex cryptographic operations in line with receiving updates and calculating the best path to each destination. This effectively means replacing every edge router in every AS in the entire world to deploy the solution—this is definitely not cost friendly. Adding to this cost is the simply increase in the table size required to carry all this information, and the loss of commonly used (and generally effective) optimizations.
Information cost: S-BGP leaks new information into the global table as a matter of course—not only can anyone see who is peered with whom by examining information gleaned from route view servers, they can even figure out how many actual pairs of routers connect each AS, and (potentially) what other peerings those same routers serve. This huge new chunk of information about provider topology being revealed simply isn’t acceptable.

Overall, then, BGP-SEC doesn’t meet the requirements as they’ve been outlined in this series of posts. Next week, I’ll spend some time explaining the operation of another potential system, a graph overlay, and then we’ll consider how well it meets the requirements as outlined in these posts.

Posted in SECURITY, TECH, WRITTEN