The OSPF Two Part Metric

Looking at the capabilities of any given protocol running in our networks today, it certainly seems there are few use cases left the protocol cannot support. In fact, modern interior gateway protocols have become so capable that it almost seems like we only need one to support everything. This is not reality, of course—there are many places where a specialized protocol would do better than a general purpose one, and there are still many use cases current protocols cannot support. One such use case, for OSPF, illustrated below, uses a two part metric to solve a very specific problem, as illustrated below.

On the left side of this diagram you can see the “typical” broadcast network. Originally common in what used to be called local area networks, these sorts of broadcast segments are actually more common on metro edges and wireless networks today than in a campus or data center. Anyone familiar with OSPF should already know what the problem is with this sort of configuration—if you build an adjacency between every pair of routers illustrated here, you end up with just too much state. For instance—

  • Each pair of routers in the network will form an adjacency, and hence exchange a full copy of the link state database. This means each change in the network topology or reachability information will cause at least three different flooded packets to cross this segment to report the change (and it is likely more!).
  • When building the shortest path tree (SPT), each router in the network will need to find the path through each one of the routers connected to this broadcast segment. This means the SPT will show three possible paths through this single network from the perspective of every router in the network.

This problem is resolved in OSPF through the creation of a pseudonode (pnode), as shown on the right side of the illustration above. One of the routers connected to the network is elected to advertise a network node Link State Advertisement (LSA). Each of the routers connected to the network then advertises a link to this network LSA (a type 2 LSA in OSPFv2, for reference). The pnode then advertises a zero cost link back to each router connected to the network.

So far, so good—this is all fairly standard stuff.

How is the total cost through this network calculated? When moving from any node to the pnode, the cost the node is advertising is used to calculate the cost. Moving from the pnode to the other side of the network, however, the cost is set to zero.

But what if the cost from, say, router A to the network is different from the cost from router B? In some situations, it is quite possible for these two routers to have different speed links onto the common broadcast network. Think metro Ethernet, for instance, or a wireless network, or perhaps a satellite link.

In this situation, there is no way to express the reality that the cost from A to B is different than the cost from A to C because B and C have different access speeds. The only metric calculated into the cost to cross the network is from A to the pnode; the cost from the pnode to either B or C is always 0.

RFC8042 provides a fix for this problem. But before getting to the fix, there is one more problem to be addressed. Assume A is elected to build and transmit the pnode (A is the designated router). How should A discover the cost to every other router connected to the network from to perspective of the pnode? This, in fact, is a major problem.

How does RFC8042 fix this problem? By adding a new TLV into the OSPF router advertisment that allows a router to advertise its cost to reach the pnode. This is called a two part metric. When B advertises its connection to the pnode, then, it can include the cost from the pnode to itself in this new TLV. In the case of OSPFv2, this new information is carried in a TLV in an opaque LSA; in the case of OSPFv3, this new metric is carried as part of the traffic engineering TLV.

When a router calculates the cost through the network, when it encounters a network LSA (which represents a pnode), it will—

  • Calculate the cost to the pnode using the metric advertised by the node pointing to the pnode. This is the “entry point” into the broadcast network; this cost is already included in the calculation of the SPT today.
  • Add the cost from the new TLV in the information provided by the router the pnode points to. This is the new information that allows the calculating router to determine the best path through the network including the cost from each node to the network itself.

This might not be really useful in a lot of networks, but where it is useful, it is almost essential to properly calculating a truly optimal shortest path through the network. From a complexity perspective, state has been added to increase optimization; that these two are being traded off here is no surprise.

You an read all of RFC8042 here.

DR versus DIS: What’s the Diff?

OSPF and IS-IS, both link state protocols, use mechanisms that manage flooding on a broadcast link, as well as simplify the shortest path tree passing through the broadcast link. OSPF elects a Designated Router (or DR) to simplify broadcast links, and IS-IS elects a Designated Intermediate System (or DIS—a topic covered in depth in the IS-IS Livelesson I recently recorded). Beyond their being used in two different protocols, there are actually subtle differences in the operation of the two mechanisms. So what is the difference?

Before we dive into differences, let’s discuss the similarities. We’ll use the illustration below as a basis for discussion.

Broadcast network operation in link state protocols

Q1 and Q2 illustrate the operation of a link state protocol without any optimization on a broadcast network, with Q1 showing the network, and Q2 showing the resulting shortest path tree. Q3 and Q4 illustrate link state operation with optimization over a broadcast link. It’s important to differentiate between building a shortest path tree (SPT) across the broadcast link and flooding across the broadcast link—flooding is where the primary differences lie in the handling of broadcast links in the two protocols.

Let’s consider building the SPT first. Both protocols operate roughly the same in this area, so I’ll describe both at the same time. In Q1, there is no DIS (DR)—called a pseudonode from this point forward, so each pair of intermediate systems (routers) connected to the link will advertise connectivity to every other IS (router) connected to the broadcast link (so A will advertise B, C, and D as connected; B will advertise A, C, and D as connected, etc.). From E’s perspective, then, the broadcast link will appear to be a full mesh network, as shown in Q2. Full mesh connectivity adds a good bit of complexity to the tree, as you can see in the diagram.

To reduce this complexity, the intermediate systems (routers) connected to the broadcast link can elect an intermediate system (router) to generate a pseudonode, as shown in Q3. Regardless of their actual adjacency state, each intermediate system (router) reports it is only connected to the pseudonode, and the pseudonode reports what appears to be a set of point-to-point links to each of the intermediate systems (routers) connected to the broadcast link. The result is an SPT that looks like Q4; there is an extra hop in the SPT, but it is much simpler to calculate (even in reaction to topology changes).

For calculating the SPT, then, OSPF and IS-IS act much the same. The difference between the two actually lies in the way flooding is handled across the broadcast link.

Assume, for a moment, the illustrated network is running OSPF, and router A receives an updated LSA. Router A will flood this new LSA to a special multicast address that only the DR (and BDR) listens to. Once the DR has received (and acknowledged) the LSA, it will reflood the new LSA to the “all routers” multicast address on the broadcast segment.

Now let’s change the situation, and say the network is running IS-IS. Again, intermediate system A receives an updated LSP—but rather than sending this new information just to the DIS, it floods the LSP onto the entire link. So what does the DIS do in terms of flooding? The only thing a DIS does on the flooding side of things in IS-IS is to send out periodic packets describing its database (Complete Sequence Number Packets, or CSNPs). If an IS happens to fail to receive a particular LSP that was flooded by another IS connected to the same link, it will notice the missing LSP in its database, and request it from the DIS.

The flooding mechanisms, then, are completely different between the two protocols—differences that show up in the implementation details. For instance, IS-IS doesn’t elect a backup DIS, but OSPF does—why? Because if the OSPF DR fails, some router connected to the link must take over flooding changed LSPs, or the database can become desynchronized. On the other hand, if the DIS fails, there’s not much chance of anything bad happening. If one intermediate system drops an LSP, when a new DIS is elected and sends a CSNP, the loss will be noticed and taken care of. For much the same reason, the IS-IS can be preemptively replaced, while the OSPF DR cannot be.

Flooding Domains versus Areas

At a fundamental level, OSPF and IS-IS are similar in operation. They both build neighbor adjacencies. They both use Dijkstra’s shortest path first (SPF) to find the shortest path to every destination in the network. They both advertise the state of each link connected to a network device. There are some differences, of course, such as the naming (OSI addresses versus IP addresses, intermediate systems versus routers). Many of the similarities and differences don’t play too much in the design of a network, though.

One difference that does play into network design, however, is the way in which the two protocols break up a single failure domain into multiple failure domains. In OSPF we have areas, while in IS-IS we have flooding domains. What’s the difference between these two, and how does it effect network design? Let’s use the illustration below as a helpful reference point for the two different solutions.


In the upper network, we have an illustration of how OSPF areas work. Each router at the border of a flooding domain (an Area Border Router, or ABR), has a certain number of interfaces in each area. Another way of saying this is that an OSPF ABR is never fully within a single area, but rather participates in many areas—thus each area is bounded within or on a router. Thus OSPF areas can be seen as a collection of flooding domains connected via hard boundaries at the ABRs.

In the lower network, we have an illustration of how IS-IS flooding domains work. Each intermediate system is entirely within a single flooding domain; every interface on the intermediate system is part of each flooding domain the IS itself is a part of. Any given IS may be a part of multiple flooding domains, and hence provide connectivity between the flooding domains it’s in. The easiest way to understand IS-IS flooding domains is as a set of overlapping instances of IS-IS; some intermediate systems just happen to be connected to more than one instance, and thus can provide routing between them.

The difference between these two is, by the way, related to the protocol suite in which each was developed. In the ISO suite, each device has a single address; in IP, each interface has an address. Hence, in OSPF, area boundaries are thought of as occurring between interfaces, while in ISO, flooding domain boundaries are thought of as occurring between devices. The old saw is “OSPF areas break on devices, IS-IS flooding domains break on wires.” This isn’t absolutely true, as IS-IS flooding domains still “break” on devices—it’s just that each device is entirely contained in a single flooding domain.

This fundamental difference leads to other, less obvious differences. For instance—

  • Links in an IS-IS network can be configured in multiple flooding domains, as can a device. In OSPF, each link is in precisely one area (although there are workarounds to place a single link in multiple areas).
  • OSPF assumes reachability information at the link (or interface) level, which means reachability information is automatically carried from one area to another. IS-IS assumes reachability at the device level, which means information about link or interface reachability is not automatically carried between flooding domains. Rather, interface level reachability must be leaked, or more properly redistributed, between flooding domains (remember, each flooding domain can be seen as a separate instance of IS-IS, and devices just happen to be in more than one flooding domain).

Learning to design with IS-IS isn’t just “OSPF on stilts;” it’s a different way of looking at the problem space. If you want to know more, take a look at my recently published LiveLesson on the IS-IS protocol.

Slicing and Dicing Flooding Domains (2)

The first post in this series is here.

Finally, let’s consider the first issue, the SPF run time. First, if you’ve been keeping track of the SPF run time in several locations throughout your network (you have been, right? Right?!? This should be a regular part of your documentation!), then you’ll know when there’s a big jump. But a big jump without a big change in some corresponding network design parameter (size of the network, etc.), isn’t a good reason to break up a flooding domain. Rather, it’s a good reason to go find out why the SPF run time changed, which means a good session of troubleshooting what’s probably an esoteric problem someplace.

Assume, however, that we’re not talking about a big jump. Rather, the SPF run time has been increasing over time, or you’re just looking at a particular network without any past history. My rule of thumb is to start really asking questions when the SPF run time gets to around 100ms. I don’t know where that number came from—it’s a “seat of the pants thing,” I suppose. Most networks today seem to run SPF in less than 10ms, though I’ve seen a few that seem to run around 30ms, so 100ms seems excessive. I know a lot of people do lots of fancy calculations here (the speed of the processor and the percentage of processor used for other things and the SPF run time and…), but I’m not one for doing fancy stuff when a simple rule of thumb seems to work to alert me to problems going into a situation.

But before reaching for my flooding domain slicing tools because of a 100ms SPF run time, I’m going to try and bring the time down in other ways.

First, I’m going to make certain incremental and partial SPF are enabled. There’s little to no cost here, so just do it. Second, I’m going to look at using exponential timers to batch up large numbers of changes. Third, I’m going to make certain I’m removing all the information I can from the link state database—see the answer to the third question on the LSDB size, above.

If you’ve done all this—keeping in mind that you need to consider the trade offs (if you don’t see the trade offs, you’re not looking hard enough), then I would consider splitting the flooding domain. If it sounds like I would never split a flooding domain for purely performance or technical reasons, you’ve come to the right conclusion on reading these two posts.

All that said, let me tell you the real reasons I would split a flooding domain.

First, just to make my life easier when troubleshooting the network. The router has a lot larger capacity for looking through screens full of link state information than I do. At 2AM, when the network is down, any little advantage I can give myself to troubleshoot the network faster is worth considering.

Second, again, to make my life easier in the troubleshooting process. Go back and think about the OODA loop. Where can I observe the network to best understand what’s going on? If you thought, “at the flooding domain boundary,” you earn a gold star. You can pick it up at the local office supply store.

Third, to break apart the network in case of a real failure—to provide a “firewall” (in the original sense of the word, rather than the appliance sense) to keep one part of the network from going down when another part falls apart.

Finally, to provide a “choke point” where you can implement policy.

So in the end—you shouldn’t build the world’s largest flooding domain just because you can, and you shouldn’t build a ton of tiny flooding domains just because you can. The technical reasons for slicing and dicing a flooding domain aren’t really that strong, but don’t discount using flooding domains on a more practical level.

Slicing and Dicing Flooding Domains (1)

This week two different folks have asked me about when and where I would split up a flooding domain (IS-IS) or area (OSPF); I figured a question asked twice in one week is worth a blog post, so here we are…

Before I start on the technical reasons, I’m going to say something that might surprise long time readers: there is rarely any technical reason to split a single flooding domain into multiple flooding domains. That said, I’ll go through the technical reasons anyway.

There are really three things to think about when considering how a flooding domain is performing:

  • SPF run time
  • flooding frequency
  • LSDB size

Let’s look at the third issue first, the database size. This is theoretically an issue, but it’s really only an issue if you have a lot of nodes and routes. I can’t ever recall bumping up against this problem, but what if I did? I’d start by taking the transit links out of the database entirely—for instance, by configuring all the interfaces that face actual host devices as passive interfaces (which you should be doing anyway!), and configuring IS-IS to advertise just the passive interfaces. You can pull similar tricks in OSPF. Another trick here is to make certain point-to-point Ethernet links aren’t electing a DIS or DR; this just clogs the database up with meaningless information.

The second issue, the flooding frequency, is more interesting. Before I split a flooding domain because there is “too much flooding,” I would want to look at several things to make certain I’m not doing a lot of work for nothing. Specifically, I would want to look at:

  • Why am I getting all these LSAs/LSPs? A lot of flooding means a lot of changes, which generally means instability someplace or another. I would either want to be able to justify the instability or stop it, rather than splitting a flooding domain to react to it. Techniques I would look at here include interface dampening (if it’s available) and roping off a flapping network behind a nailed up redistributed route of some sort.
  • If the rate of flooding can only be controlled to some degree, or it’s valid, then I would want to look at how I can configure the network to control the flooding in a way that makes sense. Specifically, I’m going to look at using exponential backoff to manage bursts of flooding events while keeping my convergence time down as much as I can, and I’m going to consider my LSP generation intervals to make certain I account for bursts of changes on a single intermediate system. This is where we get into tradeoffs, however—at some point you need to ask if tuning the timers is easier/simpler than breaking the flooding domain into two flooding domains, particularly if you can isolate the bursty parts of the network from the more stable parts.

There are probably few networks in the world where tuning flooding will not hold the rate of flooding down to a reasonable level.

Continued next week…