Why not OSPF for the Internet Core?

Every now and again (not often enough, if I’m to be honest), someone will write me with what might seem like an odd question that actually turns out to be really interesting. This one is from Surya Ahuja, a student at NC State, where I occasionally drop by to do a guest lecture.

We were recently working on an example design problem in one of our courses, and like a dedicated student I was preparing the State, optimisation, surface sheet 🙂 One of the design decisions was to explain the selection of the routing protocol. This got me thinking. When BGP was being created, were there modifications to OSPF itself considered? … it could have been made possible to just use OSPF across enterprise and the internet. Then why not use it?

A quick answer might go somethign like this: OSPF did not exist when BGP was invented. The IETF working group for OSPF was, in fact, started in 1988, while the original version of BGP, called EGP, was originally specified by Eric Rosen in 1982, some 6 years earlier. The short answer might be, then, that OSPF was not considered because OSPF had not been invented yet. 🙂

But this quick answer is really more tongue-in-cheek than useful, because the concept of a link state protocol is actually much older than the concept of a path vector protocol. EGP and BGP are, in fact, the very first path vector protocols, while the fundamental algorithms that underpin link state protocols were invented in the mid-1950’s into the 1960’s. For instance, Dijkstra’s shortest path first algorithm was initial invented in 1956.

So I’m going to back up a little and rephrase the question: if link state protocols existed at the time EGP and BGP were invented, why didn’t the original designers create a link state protocol, like OSPF, to run the core of the ‘net?”

The initial response is likely to be: because link state protocols just won’t scale to the point of supporting the core of the ‘net. This answer, however, is wrong. With optimized flooding, topology summarization, and route aggregation, it is quite possible to scale a link state protocol to the scale of the DFZ. Simply force the entire ‘net into a hierarchical model—the topology of the ‘net was, in fact, largely hierarchical in the early days—intelligently assign addresses, and you could, in fact, scale a link state protocol to the entire Internet. It would have been difficult, and it would have constrained the development of the Internet in several important ways, but it would have been possible.

To understand why a link state protocol was not chosen, you have to dig into the design considerations behind BGP. Go back to “the napkin” and consider what is drawn there.

Click to enlarge, if you need to.

Mostly what you will see is policy. The original design of BGP was primarily concerned with implementing interdomain policy.

Now step back one level, and ask a simple question: how good is a link state protocol at implementing policy, especially on a per-node basis? Before answering, consider the way in which link state protocols operate. Each router must have a complete copy of the Link State Database (LSDB) in order to calculate loop free paths through the network. To go even a bit deeper in theory, the “shortest path” is really just a short hand for a “loop free path.” Once you inject policy into the decision process, you are implying it is okay to take a slightly longer path (a stretched path) instead of the shortest path—so long as it is still loop free. How would you implement policy on a protocol in which every router (or node) in the network must share a common view of the network, when the concept of policy itself implies different views of the available paths for different nodes?

You don’t.

So the primary reason a link state protocol was not ultimately used to build the core of the Internet is not scale. It’s policy.

I know people pooh-pooh theory and history in the networking world, but … perhaps this will serve as a simple illustration of how understanding the protocols themselves, and how they work, can help you understand what to deploy where and why.

Network Collective: Peering with Providers

In this episode of the Network Collective Community Roundtable, the panel discusses the nuances of getting your organization connected to the internet. Is it as simple as connecting a cable and calling it a day, or is there more to think about when designing your Internet edge? Joining the Network Collective team for this conversation is Dr. Pete Welcher and Tom Ammon.

Optimal Route Reflection: Next Hop Self

Recently, I posted a video short take I did on BGP optimal route reflection. A reader wrote in the comments to that post:

…why can’t Router set next hop self to updates to router E and avoid this suboptimal path?

To answer this question, it is best to return to the scene of the suboptimality—

To describe the problem again: A and C are sending the same route to B, which is a route reflector. B selects the best path from its perspective, which is through B, and sends this route to each of its clients. In this case, E will learn the path with a next hop of A, even though the path through C is closer from E’s perspective. In the video, I discuss several ways to solve this problem; one option I do not talk about is allowing B to set the next hop to itself. Would this work?

Before answering the question, however, it is important to make one observation: I have drawn this network with B as a router in the forwarding path. In many networks, the route reflector is a virtual machine, or a *nix host, and is not capable of forwarding the traffic required to self the next-hop to itself. There are many advantages to intentionally removing the route reflector from the forwarding path. So while setting nexthop-self might work in this situation, it will not work in all situations.

But will it work in this situation? Not necessarily. The shortest path, for D, is through C, rather than through A. B setting its next hop to itself is going to draw E’s traffic towards 100::/64 towards itself, which is still the longer path from E’s perspective. So while there are situations where setting nexthop-self will resolve this problem, this particular network is not one of them.

Research: BGP Routers and Parrots

The BGP specification suggests implementations should have three tables: the adj-rib-in, the loc-rib, and the adj-rib-out. The first of these three tables should contain the routes (NLRIs and attributes) transmitted by each of the speaker’s peers. The second table should contain the calculated best paths; these are the routes that will be (or are) installed in the local routing table and used to build a forwarding table. The third table contains the routes which have been sent to each peering speaker. Why three tables? Routing protocols standards are (sometimes—not always) written to provide the maximum clarity to how the protocol works to someone who is writing an implementation. Not every table or process described in the specification is implemented, or implemented the way it is described.

What happens when you implement things in a different way than the specification describes? In the case of BGP and the three RIBs, you can get duplicated BGP updates. What do parrots and BGP have in common describes two situations where the lack of a adj-rib-out can cause duplicate BGP updates to be sent.

David Hauweele, Bruno Quoitin, Cristel Pelsser, and Randy Bush. 2016. “What Do Parrots and BGP Rotuers Have in Common?” Computer Communications Review, July.

The authors of this paper begin by observing BGP updates from a full feed off the default free zone. The configuration of the network, however, is designed to provide not only the feed from a BGP speaker, but also the routes received by a BGP speaker, as shown in the illustration below.

In this figure, all the labeled routers are in separate BGP autonomous systems, and the links represent physical connections as well as eBGP sessions. The three BGP updates received by D are stored in three different logs which are time stamped so they can be correlated. The researchers found two instances where duplicate BGP updates were received at D.

In the first case, the best path at C switches between A and B because of the Multiple Exit Discriminator (MED), but the remainder of the update remains the same. C, however, strips the MED before transmitting the route to D, so D simply sees what appears to be duplicate updates. In the second case, the next hop changes because of an implicit withdraw based on a route change for the previous best path. For instance, C might choose A as the best path, but then A implicitly withdraws its path, leaving the path through B as the best. When this occurs, C recalculates the best path and sends it to D; since the next hop is stripped when C advertises the new route to D, this appears to be a duplicate at D.

In both of these cases, if C had an adj-rib-out, it would find the duplicate advertisement and squash it. However, since C has no record of what it has sent to D in the past, it must send information about all local best path changes to D. While this might seem like a trivial amount of processing, these additional updates can add enough load during link flap situations to make a material difference in processor utilization or speed of convergence.

Why do implementors decide not to include an adj-rib-out in their implementations, or why, when one is provided, do operators disable the adj-rib-out? Primarily because the adj-rib-out consumes local memory; it is cheaper to push the work to a peer than it is to keep local state that might only rarely be used. This is a classic case of reducing the complexity of the local implementation by pushing additional state (and hence complexity) into the overall system. The authors of the paper suggest a better balance might be achieved if implementations kept a small cache of the most recent updates transmitted to an adjacent speaker; this would allow the implementation to reduce memory usage, while also allowing it to prevent repeating recent updates.