OSPF Topology Transparent Zones

Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.

In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.

What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.

Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.

However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.

Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.

The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.

Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.

TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.

You can read about TTZs more in RFC8099.

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

  • EIGRP receives a set of queries to which it must respond
  • EIGRP allocates memory for each packet to build the responses
  • Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
  • EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
  • EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.

There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

What’s wrong with the IETF. And what’s right

I have not counted the IETF’s I have attended; I only know the first RFC on which I’m listed as a co-author was published in 2000, so this must be close to 20 years of interacting with the IETF community. I’m pretty certain I’ve attended at least two meetings a year in some years, and three meetings a year in most of those years. Across that time, there has never been a time when I have not been told, at least once, “the IETF is broken.” And there has not been a single time I cannot remember agreeing with the sentiment.

My belief that the IETF is broken, however, is narrow, and offset by the many ways in which I think the IETF is still useful for the larger networking community.

So, how is the IETF broken? The trend that bothers me the most right now is the gold rush syndrome. A new technology is brought into the IETF, and if it looks like it might somehow be “important,” there is a “land rush” as people stake out new drafts, find use cases, find corner cases, and work to develop drafts and communities around those drafts. This generally results in a sort of ossification process, where there are clear insiders and outsiders, an entirely new vocabulary is developed, and the drafts fly so fast and furious there is almost no time to read them all. There are many problematic parts of this process. For instance, there is often a feeling that “this is important, no need to get the details right,” or “if you don’t understand, butt out of the conversation.”

A particularly troubling aspect of this is the wide desire to “be famous,” to chair a working group, to get your name on a draft, and ultimately on an RFC. This eventually becomes all important, carrying all practical considerations before it. The old ethos of “build small and flexible, code it, and let it grow where needed” is almost always lost in the shuffle of producing tens of drafts. Companies pay by the draft, or only pay for travel if you have a draft—both of which have a tendency to destroy the value of the community itself, and the way the community functions.

So that’s what broken. What’s right?

One night I was walking back from dinner with a couple of friends—Gonzolos and Joe—and I ran into Stewart Bryant in the hotel lobby. Soon enough, Paul Mockapetris joined the conversation. At some point, Dave Oran, Ignas B, and George Swallow joined the conversation. There are few places in the world you can get some collection of folks who had a hand in the creation of technologies like DNS, psuedowires, MPLS/TE, SMTP, IS-IS, IP fast reroute, and probably a dozen other technologies, standing around talking about “the good old days,” or even where to go for dinner. Across this week, I’ve chatted with Tony Li, Tony P, Jeff T, Alvaro Retana, Russ Housley, Fred Baker, Alia Atlas, and… more than I can remember.

If there is one that is striking about all of these people, it is that they are all more interested in solving problems than taking credit. They all live by the old IETF mantra: “it is amazing what can get done when no-one cares who gets the credit.” None of them are obsessed with getting their names on drafts, or with inventing something new that will change the world. They see problems, they develop solutions; that is all.

This, then, is what is right about the IETF. People who care about the challenges users have with networks, and have spent their lives finding solutions. So people are what’s wrong with the IETF, and people are also what’s right. The point?

You can choose to participate in the IETF. In fact, I hope to see you at a future meeting. But if you choose to participate, be a part of the solution, rather than a part of the problem. Be someone who looks on the land rush with skepticism, who doesn’t care about getting their name on a draft, who just wants to help solve a problem that has been fairly explained and defined to the community. Don’t be afraid to work on small things, and to insist that solutions be small and well scoped, even if that means your name is not put up in lights.

Even better advice: carry this into all the communities in which you live in your life. We live in an age that values name recognition far too much, that worries too much about being left out of the latest gold rush, that worries too much about our “rightly deserved” fifteen minutes of fame. This goes far beyond network engineering, the ethos of the “old way” in the IETF. It’s a lesson we can all take away from this little community of engineers who have worked so hard across the years to build something on which we all rely every day—to the very formats of the packets which carry this screed to your computer screen, your email box, or however else you are reading it.

DFS and Low Points

On a recent history of networking episode, Alia talked a little about Maximally Redundant Trees (MRTs), and the concept of Depth First Search (DFS) numbering, along with the idea of a low point. While low points are quickly explained in my new book in the context of MRTs, I thought it worthwhile to revisit the concept in a blog post. Take a look at the following network:

On the left side is a small network with the nodes (think of these as routers) being labeled from A through G. On the right side is the same network, only each node has been numbered by traversing the graph, starting at A. This process, in a network, would either require some device which knows about every node and edge (link) in the network, or it would require a distributed algorithm that “walks” the network from one node to another, numbering each node as it is touched, and skipping any node that has already been visited (again, for more details on this, please see the book).

Once this numbering has been done, the numbers now produce this interesting property: if you remove the parent of any node, and the node can still reach a number lower than its own number, the network is two-connected. Take E, numbered as 5, as an example. E’s parent is D, labeled as 3 on the numbered side of the illustration. If you remove D from the network, what is the lowest numbered node E can reach? Start by jumping to the lowest numbered neighbor. In this case, E only has one neighbor remaining, C, which is numbered 6. From here, what is the lowest numbered neighbor of C? It is A, with a number of 1.

D, then, can reach a node which is numbered 1 through some other neighbor than its parent. This means D has some other path to the parent than through its parent, which means D is part of a topology with at least two connections to some other node in the network—it is two connected.

Using this sort of calculation, you can find alternate paths in a network. The problem with using DFS numbering for this is what was stated above—the calculation requires either a “walk through the network” protocol, or it requires some device with a complete view of the network (an LSDB, in link state terms). Neither of these are really conducive to real time calculation during a topology change. MRT solves this by using low points from DFS numbering with Dijkstra’s SPF algorithm to allow the calculation of disjoint paths in near real time in a distributed control plane.

SLAAC and DHCPv6

When deploying IPv6, one of the fundamental questions the network engineer needs to ask is: DHCPv6, or SLAAC? As the argument between these two has reached almost political dimensions, perhaps a quick look at the positive and negative attributes of each solution are. Originally, the idea was that IPv6 addresses would be created using stateless configuration (SLAAC). The network parts of the address would be obtained by listening for a Router Advertisement (RA), and the host part would be built using a local (presumably unique) physical (MAC) address. In this way, a host can be connected to the network, and come up and run, without any manual configuration. Of course, there is still the problem of DNS—how should a host discover which server it should contact to resolve domain names? To resolve this part, the DHCPv6 protocol would be used. So in IPv6 configuration, as initially conceived, the information obtained from RA would be combined with DNS information from DHCPv6 to fully configure an IPv6 host when it is attached to the network.

There are several problems with this scheme, as you might expect. The most obvious is that most network operators do not want to deploy two protocols to solve a single problem—configuring IPv6 hosts. What might not be so obvious, however, is that many network operators care a great deal about whether hosts are configured statelessly or through a protocol like DHCPv6.

Why would an operator want stateful configuration? Primarily because they want to control which devices can receive an IPv6 address, and hence communicate with other devices on the network. When using DHCPv6, just like DHCP with IPv4, the operator can set parameters around what kinds of devices, or perhaps even which specific devices, will be able to receive an IPv6 address. Further, the DHCPv6 server can be tied to the DNS server, so each host which connects to the network can also be given a DNS entry. Proper DNS entries are often a requirement for many applications. There are Dynamic DNS (DDNS) implementations that can solve this problem, but they are not often considered secure enough for a controlled network environment.

Why would an operator want stateless autoconfiguration? First, because they want any random user who can successfully connect to the network to be able to get an IPv6 address without any other configuration, and without the provider needing run any sort of special protocol or configuration to allow this. In fact, DHCPv6, in some environments, at least, can be seen as an attack surface, or rather a hole through which attacks can potentially be driven. Second, stateful configuration also has a failover problem; if the DHCPv6 server fails, then hosts can no longer obtain an IPv6 address, and the network no longer works. This could be, to say the least, problematic for service providers. Finally, SLAAC has a set of privacy extensions outlined in RFC4941 that (theoretically) prevent a host from being tracked based on its IPv6 address over time. This is a very attractive property for edge facing service providers.

The original set of drafts, however, only provided for DNS information to be carried through DHCPv6, and had no failover mechanism for DHCPv6. These two things, together, made it impossible to use just one of these two options. More recent work, however, has remedied both parts of this problem, making either option able to stand on its own. RFC6106, which is a bit older (2010), provides for DNS advertisement in the RA protocol. This allows an operator who would like to run everything completely stateless to do so, including hosts learning which DNS resolver to use. On the other side, RFC8156, which was just ratified in July of 2017, allows a pair of DHCPv6 servers to act as a failover pair. While this is more complex than simple DHCPv6, it does solve the problem of a host failing to operate correctly simply because the DHCPv6 server has failed.

Which of the two is now the best choice? If you do not have any requirement to restrict the hosts that can attach to the network using IPv6, then SLAAC, combined with DNS advertisement in the RA, and possibly with DDNS (if needed), would be the right choice. However, if the environment must be more secure, then DHCPv6 is likely to be the better solution.

A word of warning, though—using DHCPv6 to ensure each host received an IPv6 address that can be used anyplace in the network, and then stretching layer 2 to allow any host to roam “anywhere,” is really just not a good idea. I have worked on networks where this kind of thing has been taken to a global scale. It might seem cute at first, but this kind of solution will ultimately become a monster when it grows up.

If you haven’t found the tradeoff…

This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:

[time-span]

Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)

In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.

Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.

Cross posted at CircleID

But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.

What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.

Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.

What about routing protocols? The standards communities seem focused on creating and maintaining a small handulf of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.

In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.

The way to prevent landing on either extreme—in a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment, and no other device will do—is consider the tradeoffs.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.

Leave Your Ego at the Door

You are just about to walk into the interview room. Regardless of whether you are being interviewed, or interviewing—what are you thinking about? Are you thinking about winning? Are you thinking about whining? Or are you thinking about engaging? I have noticed, on many mailing lists, and in many other forums, that interviews in our world have devolved into a contest of egos.

The person on the other side of the table has some certification I don’t care about—how can I prove they are dumb, not as smart as their certification might indicate, or… The person on the other side of the table claims to know some protocol, can I find some bit of information they don’t know? These kinds of questions are really just ego questions—and you need to leave them at the door. This is particularly acute with certifications right now—a lot of people doubt the value of certifications, claiming folks who have them don’t know anything, the certifications are worthless, they don’t reflect the real world, etc.

I will agree that we have a problem with the depth and level of knowledge of network engineers at the moment. We all need to grow up a little, learn technologies rather than CLIs, and actually learn how to be engineers. On the other hand, when you interview someone, or when you are being interviewed…

Leave your ego at the door.

Is it really worth losing a really good hire because you needed your ego stroked by “beating” someone in an interview?

No, I didn’t think so.