Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.
In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.
What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.
Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.
However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.
Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.
The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.
Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.
TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.
Nick Russo and I talk about the OSPF routing protocol, covering some of those things you’ve probably never thought about, and giving away one of my favorite interview questions in the process. You can see the original episode at the Network Collective here.
Looking at the capabilities of any given protocol running in our networks today, it certainly seems there are few use cases left the protocol cannot support. In fact, modern interior gateway protocols have become so capable that it almost seems like we only need one to support everything. This is not reality, of course—there are many places where a specialized protocol would do better than a general purpose one, and there are still many use cases current protocols cannot support. One such use case, for OSPF, illustrated below, uses a two part metric to solve a very specific problem, as illustrated below.
On the left side of this diagram you can see the “typical” broadcast network. Originally common in what used to be called local area networks, these sorts of broadcast segments are actually more common on metro edges and wireless networks today than in a campus or data center. Anyone familiar with OSPF should already know what the problem is with this sort of configuration—if you build an adjacency between every pair of routers illustrated here, you end up with just too much state. For instance—
- Each pair of routers in the network will form an adjacency, and hence exchange a full copy of the link state database. This means each change in the network topology or reachability information will cause at least three different flooded packets to cross this segment to report the change (and it is likely more!).
- When building the shortest path tree (SPT), each router in the network will need to find the path through each one of the routers connected to this broadcast segment. This means the SPT will show three possible paths through this single network from the perspective of every router in the network.
This problem is resolved in OSPF through the creation of a pseudonode (pnode), as shown on the right side of the illustration above. One of the routers connected to the network is elected to advertise a network node Link State Advertisement (LSA). Each of the routers connected to the network then advertises a link to this network LSA (a type 2 LSA in OSPFv2, for reference). The pnode then advertises a zero cost link back to each router connected to the network.
So far, so good—this is all fairly standard stuff.
How is the total cost through this network calculated? When moving from any node to the pnode, the cost the node is advertising is used to calculate the cost. Moving from the pnode to the other side of the network, however, the cost is set to zero.
But what if the cost from, say, router A to the network is different from the cost from router B? In some situations, it is quite possible for these two routers to have different speed links onto the common broadcast network. Think metro Ethernet, for instance, or a wireless network, or perhaps a satellite link.
In this situation, there is no way to express the reality that the cost from A to B is different than the cost from A to C because B and C have different access speeds. The only metric calculated into the cost to cross the network is from A to the pnode; the cost from the pnode to either B or C is always 0.
RFC8042 provides a fix for this problem. But before getting to the fix, there is one more problem to be addressed. Assume A is elected to build and transmit the pnode (A is the designated router). How should A discover the cost to every other router connected to the network from to perspective of the pnode? This, in fact, is a major problem.
How does RFC8042 fix this problem? By adding a new TLV into the OSPF router advertisment that allows a router to advertise its cost to reach the pnode. This is called a two part metric. When B advertises its connection to the pnode, then, it can include the cost from the pnode to itself in this new TLV. In the case of OSPFv2, this new information is carried in a TLV in an opaque LSA; in the case of OSPFv3, this new metric is carried as part of the traffic engineering TLV.
When a router calculates the cost through the network, when it encounters a network LSA (which represents a pnode), it will—
- Calculate the cost to the pnode using the metric advertised by the node pointing to the pnode. This is the “entry point” into the broadcast network; this cost is already included in the calculation of the SPT today.
- Add the cost from the new TLV in the information provided by the router the pnode points to. This is the new information that allows the calculating router to determine the best path through the network including the cost from each node to the network itself.
This might not be really useful in a lot of networks, but where it is useful, it is almost essential to properly calculating a truly optimal shortest path through the network. From a complexity perspective, state has been added to increase optimization; that these two are being traded off here is no surprise.
OSPF was originally designed in an age when processors were much less capable, available memory was much smaller, and link bandwidths were much lower. To conserve processing power, memory, and n-the-wire bandwidth, OSPF was designed using fixed length fields (FLFs). TLVs are more difficult to process than an FLF; to process a set of FLFs, you build a structure that mimics the FLF formatting, and simple “impose” it on the memory location where you have stored the data to be decoded, as shown below.
In the FLF model, the structure can simply be imposed on the memory locations, and the values can be read directly. In the TLV model, each type code must be read to determine the kind of information and the length must be read to determine the size of the field. Only once these two items in the TLV header have been read can the actual data be related to a particular field in the resulting data structure.
In the intervening years, however, compute, storage, and network capabilities have increased dramatically; the following chart, taken from a book I’m working on, shows this growth since about the start of the “network era.”
As compute, storage, and network capabilities have increased, the arguments against using TLVs have faded into the background. Keeping with this, OSPF is moving towards a TLV based structure, slowly replacing the FLF structure it was originally designed with (OSPF TLVs!). The first big move in this direction was actually taken in RFC5340 in 2008, when OSPFv3 was designed to carry IPv6 reachability information. Two little noticed, but very important, changes were made between OSPFv2 and OSPFv3, both covered in section 2.8, LSA Format Changes.
The first of these is the removal of address (or reachability) information from network LSAs. While OSPF has always built its SPT in precisely the same way as IS-IS always has—it builds a tree, and hangs reachability off the tree. Many engineers are tricked into thinking SPF integrates reachability into the tree because “everything is IP addresses,” but this simply is not true. Changing the LSA format in this way clearly separates reachability information from topology information in OSPFv3. The second change was this: “Router interface information MAY be spread across multiple router-LSAs. Receivers MUST concatenate all the router-LSAs originated by a given router when running the SPF calculation.” Router information can now be fragmented across multiple packets.
The reason they are important is the frame a current draft that takes OSPF one step further, draft-ietf-ospf-ospfv3-lsa-extend, titled OSPFv3 LSA Extendibility. The general idea is this: build new LSA types that convert the fields that seem likely to need some form of extension in the future into TLVs. The process chosen is simple, on reflection.
Each of the existing LSA types are replicated in a new format. To differentiate between the old and the new, 0x20 (or rather 32 decimal) is added to each LSA type. Hence a type 1 router LSA becomes a type 33 extended router LSA, etc. A chart taken from the draft is below.
In each case, a number of fields have been converted to TLVs (though not all fields). Mostly the entire old LSA type has been converted into a TLV carried within the new LSA type, which allows for new TLV types within each LSA in the future. For instance, the Intra-area prefix TLV, defined in section 3.4, looks like this—
The individual fields are not TLVs; the entire “old LSA” has been put into a TLV. This allows the new inter area ruote LSA to carry new kinds of reachability information in the future by simply including new TLV types in the new LSA type. If a new address family needs to be supported, it is just a matter of creating a new TLV that can be used to describe this new reachability and putting it into the existing inter-area LSA. The LSA provides the context, while the TLV provides the data.
Migration between the old and new TLVs is managed much like migration between narrow and wide metrics in IS-IS; routers capable of generating the new LSA types do so; these new LSAs are marked so routers that do not understand them simply reflood them (they don’t process them in any way). So long as there is any router flooding the older style TLVs within the netwrk (or flooding domain, there are options for both), the new TLVs will not be used to compute loop free paths. Once all the routers in the network (or, again, flooding domain) are advertising the new TLVs, then preference can be given to the new TLVs, and, finally, the old TLVs removed from the network.
Hopefully vendors implement these new TLV based LSA formats quickly, as they will make OSPF much more extensible over the long run.
OSPF and IS-IS, both link state protocols, use mechanisms that manage flooding on a broadcast link, as well as simplify the shortest path tree passing through the broadcast link. OSPF elects a Designated Router (or DR) to simplify broadcast links, and IS-IS elects a Designated Intermediate System (or DIS—a topic covered in depth in the IS-IS Livelesson I recently recorded). Beyond their being used in two different protocols, there are actually subtle differences in the operation of the two mechanisms. So what is the difference?
Before we dive into differences, let’s discuss the similarities. We’ll use the illustration below as a basis for discussion.
Q1 and Q2 illustrate the operation of a link state protocol without any optimization on a broadcast network, with Q1 showing the network, and Q2 showing the resulting shortest path tree. Q3 and Q4 illustrate link state operation with optimization over a broadcast link. It’s important to differentiate between building a shortest path tree (SPT) across the broadcast link and flooding across the broadcast link—flooding is where the primary differences lie in the handling of broadcast links in the two protocols.
Let’s consider building the SPT first. Both protocols operate roughly the same in this area, so I’ll describe both at the same time. In Q1, there is no DIS (DR)—called a pseudonode from this point forward, so each pair of intermediate systems (routers) connected to the link will advertise connectivity to every other IS (router) connected to the broadcast link (so A will advertise B, C, and D as connected; B will advertise A, C, and D as connected, etc.). From E’s perspective, then, the broadcast link will appear to be a full mesh network, as shown in Q2. Full mesh connectivity adds a good bit of complexity to the tree, as you can see in the diagram.
To reduce this complexity, the intermediate systems (routers) connected to the broadcast link can elect an intermediate system (router) to generate a pseudonode, as shown in Q3. Regardless of their actual adjacency state, each intermediate system (router) reports it is only connected to the pseudonode, and the pseudonode reports what appears to be a set of point-to-point links to each of the intermediate systems (routers) connected to the broadcast link. The result is an SPT that looks like Q4; there is an extra hop in the SPT, but it is much simpler to calculate (even in reaction to topology changes).
For calculating the SPT, then, OSPF and IS-IS act much the same. The difference between the two actually lies in the way flooding is handled across the broadcast link.
Assume, for a moment, the illustrated network is running OSPF, and router A receives an updated LSA. Router A will flood this new LSA to a special multicast address that only the DR (and BDR) listens to. Once the DR has received (and acknowledged) the LSA, it will reflood the new LSA to the “all routers” multicast address on the broadcast segment.
Now let’s change the situation, and say the network is running IS-IS. Again, intermediate system A receives an updated LSP—but rather than sending this new information just to the DIS, it floods the LSP onto the entire link. So what does the DIS do in terms of flooding? The only thing a DIS does on the flooding side of things in IS-IS is to send out periodic packets describing its database (Complete Sequence Number Packets, or CSNPs). If an IS happens to fail to receive a particular LSP that was flooded by another IS connected to the same link, it will notice the missing LSP in its database, and request it from the DIS.
The flooding mechanisms, then, are completely different between the two protocols—differences that show up in the implementation details. For instance, IS-IS doesn’t elect a backup DIS, but OSPF does—why? Because if the OSPF DR fails, some router connected to the link must take over flooding changed LSPs, or the database can become desynchronized. On the other hand, if the DIS fails, there’s not much chance of anything bad happening. If one intermediate system drops an LSP, when a new DIS is elected and sends a CSNP, the loss will be noticed and taken care of. For much the same reason, the IS-IS can be preemptively replaced, while the OSPF DR cannot be.
In 2000, Eric Brewer was observing and discussing the various characteristics of database systems. Through this work, he observed that a database generally has three characteristics—
- Consistency, which means the database will produce the same result for any two readers who happen to read a value at the same moment in time.
- Availability, which means the database can be read by any reader at any moment in time.
- Partion Tolerance, which means the database can be partitioned.
Brewer, in explaining the relationship between the three in a 2012 article, says—
The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C (consistency). Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A (availability).
The CAP theorem, therefore, represents a two out of three situation—yet another two out of three “set” we encounter in the real world, probably grounded someplace in the larger space of complexity. We’ll leave the relationship to complexity on the side for the moment, however, and just look at how CAP relates to routing.
Network engineers familiar with the concepts of complexity, and their application in the network design realm, should quickly realize that this “two out of three” paradigm almost never represents a set of “hard and fast realities,” but is better represented as a set of continuums along which a system designer can choose. Moving towards one pole in the two out of three space inevitably means giving up some measure along one of the other two; the primary choices you can make are which one and how much.
One more detail is important before diving into a deeper discussion: A partition is normally thought of as a hard divide in database theory, but for the purposes of explaining how CAP applies to routing, I’m going to soften the definition a little. Instead of strict partitions, I want to consider partitioning as relating to the distribution of the database across a set of nodes separated by a network.
So how does CAP apply to routing?
To see the application, let’s ignore the actual algorithm used to compute the best path in any given routing protocol—whether Dijkstra’s SPF, or Garcia’s DUAL, or path vector—and focus, for a moment, on just getting the information required to compute loop free paths to every router running in a single failure (and/or flooding) domain. This latter point is important, because failure and flooding domains actually represent an intentional form of information hiding that causes the database to be correct for the purposes of routing (most of the time), but inconsistent in strict database consistency terms on a permanent basis. We need, therefore, to treat aggregation and summarization as a separate topic when considering how CAP impacts routing. Since the tradeoffs are easiest to see in link state protocols, we’re actually going to focus in that area.
Once we’ve softened the definition of partition tolerance, it’s immediately obvious that all forms of flooding, for instance, are simply the synchronization process of a highly partitioned database. Further, we can observe that the database flooded through the network must be, largely, always available (as in always able to be read by a local device). Given CAP, we should immediately suspect there is some sort of consistency tradeoff, as we’ve already defined two of the three legs of the two out of three set pretty strongly.
And, in fact, we do. In link state protocols, the result of inconsistency is most easily seen in the problem of microloops during convergence. To understand this, let’s look at the network just below.
Assume, for a moment, that the link between routers A and B fails. At this moment, let’s assume:
- The best path for B is through A
- The best path for C is through B
- The best path for D is through C
Let’s also assume that each router in the network takes precisely the same amount of time to flood a packet, and to run SPF. When the link fails, C is going to detect the failure immediately, and hence will calculate SPF first. If we assume that D can only process this information about the topology change once it’s received an update to its database, then we can see where C will always process the new information before D.
But we can also observe this point: once C has processed the topology change—C has run SPF—router C will point at router D for its new best path. Until router D receives the new topology information and runs SPF, D will continue to point to C as the best path.
This is the microloop.
It happens any time there is a ring in the topology of more than four hops with any link state protocol. And it’s a direct result of the relationship between consistency and availability in the distributed database. We can verify this by examining the way in which the microloop can be resolved—specifically, by using ordered FIB.
In ordered FIB, each router in the network calculates an amount of time it should wait before installing newly calculated routes. This calculation is, in turn, based on the number of routers relying on the local router for the best path. If you back up and think through how the microloop happens, you can see that what’s happening is C is now waiting for D to calculate before installing the new route into its local forwarding table.
In CAP terms, what ordered FIB does is reduce availability some small amount—in effect by not allowing the forwarding plane at C to read the newly calculated routes for some period of time—in order to ensure consistency in the database. This observation should confirm microloops are related to the relationship between CAP and distributed routing.
There are a number of other relationships between CAP and routing; I’ll put some of these on my list for later posts—but this one has reached the 1000 word mark, and so is long enough for a starter on the topic.