Every now and again (not often enough, if I’m to be honest), someone will write me with what might seem like an odd question that actually turns out to be really interesting. This one is from Surya Ahuja, a student at NC State, where I occasionally drop by to do a guest lecture.
We were recently working on an example design problem in one of our courses, and like a dedicated student I was preparing the State, optimisation, surface sheet 🙂 One of the design decisions was to explain the selection of the routing protocol. This got me thinking. When BGP was being created, were there modifications to OSPF itself considered? … it could have been made possible to just use OSPF across enterprise and the internet. Then why not use it?
A quick answer might go something like this: OSPF did not exist when BGP was invented. The IETF working group for OSPF was, in fact, started in 1988, while the original version of BGP, called EGP, was originally specified by Eric Rosen in 1982, some 6 years earlier. The short answer might be, then, that OSPF was not considered because OSPF had not been invented yet. 🙂
But this quick answer is really more tongue-in-cheek than useful, because the concept of a link state protocol is actually much older than the concept of a path vector protocol. EGP and BGP are, in fact, the very first path vector protocols, while the fundamental algorithms that underpin link state protocols were invented in the mid-1950s into the 1960s. For instance, Dijkstra’s shortest path first algorithm was initially invented in 1956.
So I’m going to back up a little and rephrase the question: if link state protocols existed at the time EGP and BGP were invented, why didn’t the original designers create a link state protocol, like OSPF, to run the core of the ‘net?
The initial response is likely to be: because link state protocols just won’t scale to the point of supporting the core of the ‘net. This answer, however, is wrong. With optimized flooding, topology summarization, and route aggregation, it is quite possible to scale a link state protocol to the scale of the DFZ. Simply force the entire ‘net into a hierarchical model—the topology of the ‘net was, in fact, largely hierarchical in the early days—intelligently assign addresses, and you could, in fact, scale a link state protocol to the entire Internet. It would have been difficult, and it would have constrained the development of the Internet in several important ways, but it would have been possible.
To understand why a link state protocol was not chosen, you have to dig into the design considerations behind BGP. Go back to “the napkin” and consider what is drawn there.
Mostly what you will see is policy. The original design of BGP was primarily concerned with implementing interdomain policy.
Now step back one level, and ask a simple question: how good is a link state protocol at implementing policy, especially on a per-node basis? Before answering, consider the way in which link state protocols operate. Each router must have a complete copy of the Link State Database (LSDB) in order to calculate loop free paths through the network. To go even a bit deeper in theory, the “shortest path” is really just shorthand for a “loop free path.” Once you inject policy into the decision process, you are implying it is okay to take a slightly longer path (a stretched path) instead of the shortest path, so long as it is still loop free. How would you implement policy on a protocol in which every router (or node) in the network must share a common view of the network, when the concept of policy itself implies different views of the available paths for different nodes?
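To make the shared-view point concrete, here is a minimal sketch in Python. The four-router topology, the names R1 through R4, and the link costs are all made up for illustration; nothing here comes from a real OSPF or IS-IS implementation. Every router runs the same shortest path calculation over the same LSDB, so the hop-by-hop result is loop free. When one router applies a local policy the others do not know about, the guarantee disappears.

```python
import heapq

# Hypothetical LSDB shared by every router: {router: {neighbor: link cost}}.
LSDB = {
    "R1": {"R2": 1, "R3": 1},
    "R2": {"R1": 1, "R4": 1},
    "R3": {"R1": 1, "R4": 5},
    "R4": {"R2": 1, "R3": 5},
}

def spf(lsdb, root):
    """Dijkstra's SPF: return {destination: (cost, first hop)} as seen from root."""
    best = {root: (0, None)}
    queue = [(0, root, None)]
    while queue:
        cost, node, first_hop = heapq.heappop(queue)
        if cost > best[node][0]:
            continue  # stale queue entry, a shorter path was already found
        for neighbor, link_cost in lsdb[node].items():
            new_cost = cost + link_cost
            # Leaving the root, the first hop is the neighbor itself.
            hop = neighbor if node == root else first_hop
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(queue, (new_cost, neighbor, hop))
    return best

def trace(next_hops, source, destination, max_hops=8):
    """Follow per-router next hops toward destination, giving up after max_hops."""
    path = [source]
    while path[-1] != destination and len(path) <= max_hops:
        path.append(next_hops[path[-1]])
    return path

# Every router computes its own next hop toward R4 from the same LSDB.
shared_view = {r: spf(LSDB, r)["R4"][1] for r in ("R1", "R2", "R3")}
print(trace(shared_view, "R3", "R4"))

# R1 now applies a local policy ("reach R4 through R3") that R3 knows nothing about.
with_policy = dict(shared_view, R1="R3")
print(trace(with_policy, "R3", "R4"))
```

With the shared view, traffic from R3 follows R3, R1, R2, R4. Once R1 overrides its next hop, R3 still forwards to R1 (its SPF result has not changed), R1 forwards back to R3, and the trace never reaches R4.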
So the primary reason a link state protocol was not ultimately used to build the core of the Internet is not scale. It’s policy.
I know people pooh-pooh theory and history in the networking world, but … perhaps this will serve as a simple illustration of how understanding the protocols themselves, and how they work, can help you understand what to deploy where and why.
Recently, I posted a video short take I did on BGP optimal route reflection. A reader wrote in the comments to that post:
…why can’t Router set next hop self to updates to router E and avoid this suboptimal path?
To answer this question, it is best to return to the scene of the suboptimality—
To describe the problem again: A and C are sending the same route to B, which is a route reflector. B selects the best path from its perspective, which is through A, and sends this route to each of its clients. In this case, E will learn the path with a next hop of A, even though the path through C is closer from E’s perspective. In the video, I discuss several ways to solve this problem; one option I do not talk about is allowing B to set the next hop to itself. Would this work?
Before answering the question, however, it is important to make one observation: I have drawn this network with B as a router in the forwarding path. In many networks, the route reflector is a virtual machine, or a *nix host, and is not capable of forwarding the traffic it would attract by setting the next hop to itself. There are many advantages to intentionally removing the route reflector from the forwarding path. So while setting nexthop-self might work in this situation, it will not work in all situations.
But will it work in this situation? Not necessarily. The shortest path, for E, is through C, rather than through A. B setting the next hop to itself is going to draw E’s traffic for 100::/64 towards B, which is still the longer path from E’s perspective. So while there are situations where setting nexthop-self will resolve this problem, this particular network is not one of them.
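The comparison can be sketched numerically. The topology and link costs below are invented for illustration and are not the ones from the diagram in the original post; the only things carried over are the roles (A and C as exits, B as the route reflector, E as the client). The sketch also assumes the two routes tie on every BGP attribute, so the decision falls to the IGP cost to the next hop.

```python
import heapq

# Hypothetical IGP topology: {router: {neighbor: link cost}}.
LINKS = {
    "A": {"B": 1},
    "B": {"A": 1, "C": 2, "E": 2},
    "C": {"B": 2, "E": 1},
    "E": {"B": 2, "C": 1},
}
EXITS = ["A", "C"]  # both advertise the same route toward 100::/64

def igp_cost(links, source):
    """Plain Dijkstra: IGP cost from source to every reachable router."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist[node]:
            continue  # stale queue entry
        for neighbor, link_cost in links[node].items():
            if cost + link_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = cost + link_cost
                heapq.heappush(queue, (dist[neighbor], neighbor))
    return dist

def closest_exit(router):
    """The exit a router would pick if it saw both routes itself."""
    dist = igp_cost(LINKS, router)
    return min(EXITS, key=lambda exit_router: dist[exit_router])

cost_from_b = igp_cost(LINKS, "B")
cost_from_e = igp_cost(LINKS, "E")

reflector_choice = closest_exit("B")  # what B reflects to its clients
optimal_for_e = closest_exit("E")     # what E would have chosen

cost_optimal = cost_from_e[optimal_for_e]
cost_reflected = cost_from_e[reflector_choice]  # next hop unchanged
cost_next_hop_self = cost_from_e["B"] + cost_from_b[reflector_choice]  # next hop is B

print(reflector_choice, optimal_for_e)
print(cost_optimal, cost_reflected, cost_next_hop_self)
```

In this made-up network, B reflects the path through A, E would have preferred C, and setting the next hop to B leaves E paying the same cost as the plain reflected route, because the traffic still has to travel to B and then on to A.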
One problem I’ve heard in the past is that much of the career advice given in the networking world is not practical. In this short take, I take this problem on, explaining why it might be more practical than it initially seems.
Two different readers, in two different forums, asked me some excellent questions about some older posts on microloops. Unfortunately I didn’t take down the names or forums when I noted the questions, but you know who you are! For this discussion, use the network shown below.
In this network, assume all link costs are one, and the destination is the 100::/64 IPv6 address connected to A at the top. To review, a microloop will form in this network when the A->B link fails:
- B will learn about the link failure
- B will send an updated router LSP or LSA towards D, with the A->B link removed
- At about the same time, B will recalculate its best path to 100::/64, so its routing and forwarding tables now point towards D as the best path
- D, in the meantime, receives the updated information, runs SPF, and installs the new routing information into its forwarding table, with the new path pointing towards E
Between the third and fourth steps, B will be using D as its best path, while D is using B as its best path. Hence the microloop. The first question about microloops was—
Would BFD help prevent the microloop (or perhaps make it last a shorter period of time)?
Consider what happens if we have BFD running between A and B in this network. While A and B discover the failure faster—perhaps a lot faster—none of the other timing points change. No matter how fast A and B discover the link failure, B is still going to take some time to flood the change to the topology to D, and D is going to take some time to compute a new set of shortest paths, and install them into its local routing and forwarding tables.
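A little arithmetic shows why. The sketch below models the moment each router switches its forwarding entry; the timer names and the millisecond values are hypothetical, chosen only to illustrate the shape of the problem. Detection time is added to both B's and D's timeline, so it cancels out of the window in which the two routers point at one another.

```python
def switch_times(detect, flood, spf_b, install_b, spf_d, install_d):
    """Return (when B starts forwarding to D, when D stops forwarding to B).

    All arguments are hypothetical delays in milliseconds, measured from
    the moment the A->B link actually fails.
    """
    b_switches = detect + spf_b + install_b
    d_switches = detect + flood + spf_d + install_d
    return b_switches, d_switches

def microloop_window(**timers):
    """Length of time B points at D while D still points at B."""
    b_switches, d_switches = switch_times(**timers)
    return max(0, d_switches - b_switches)

# Same flooding, SPF, and install delays in both cases; only detection changes.
common = dict(flood=20, spf_b=50, install_b=30, spf_d=80, install_d=60)

without_bfd = microloop_window(detect=1000, **common)
with_bfd = microloop_window(detect=150, **common)

print(without_bfd, with_bfd)
```

Both cases give the same window. Faster detection moves the whole sequence earlier, which shortens the total outage, but the microloop itself lasts just as long.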
Essentially, if you look at convergence as a four step process—
Then you can see the microloop forms because different routers are “doing” steps 3 and 4 at different times. Discovery, which is what BFD is aimed at, is not going to change this dynamic. The second question was—
Can microloops occur on a link coming up?
For this one, let’s start in a different place. Assume, for a moment, that the A->B link is down. Now someone goes in and configures the A->B link to make it operational. At the moment the link comes up, B’s shortest path to 100::/64 is through D. When the new link comes up, B will learn about the new link, calculate a new shortest path tree, and then install the new route through A towards 100::/64. D will also need to calculate a new route, using B as its next hop.
The key point to consider is this: who tells D about this new path to the destination? It is B. To form a microloop, we need D to install a route through B towards 100::/64 before B does. This is theoretically possible in this situation, but unlikely, because D is dependent on B for information about this new path. Things would need to be pretty messed up for B to learn about the new path first, but not recalculate its shortest path tree and install the route before D can. So, while it is possible, it is not likely.
Thanks for sending these terrific questions in.
A while back I posted on section 10 routing loops; Daniel responded to the post with this comment:
I am curious how these things are discovered. You said that this is a contrived example, but I assume researchers have some sort of methodology to discover issues like this. I am sure some things have been found through operational mishap, but is there some “standardized” way of testing graph logic for the possibility of loops? I trust this is much easier to do today than even a decade ago.
You would think there would be some organized way to discover these kinds of routing loops, something every researcher and/or protocol designer might follow. The reality is far different—there is no systematic way that I know of to find this sort of problem. What happens, in real life, is that people with a lot of experience at the intersection of protocol design, the bounds of different ways of finding loop free paths (solving the loop free path problem), and a lot of experience in deploying and operating a network using these protocols, will figure these things out because they know enough about the solution space to look for them in the first place.
I don’t know who actually discovered this problem; it is “just” a comment in the RFC, and these kinds of comments are not normally attributed. It might have even been something that developed on a mailing list, or in private conversation between folks sitting at a table drawing diagrams on a napkin. But I would bet it was the normal sort of process—one of two ways:
- Someone thinks: “given the way this works, there should be a loop in there…” They sit down with someone else, and think through how it could happen. Then they go find examples of it in the real world, by talking to folks who have seen the loop but could not figure out how it happened.
- Someone sees a loop, and thinks: “now why did that happen??” They talk to some other folks who know the protocol, sketch the problem out on a napkin, and they work together to figure it out.
There are three key points here. The first is the importance of knowing not only how to configure the protocol, but how the protocol really works. The second is not only knowing how the protocol works, but enough of the theory behind why it works to be able to relate the theory to the reality you are seeing in the network. The third is having someone to talk to with the same sort of understanding, who can hash out what you are seeing, and why.
In other words: operational experience, theoretical understanding, and community.
If these three sound familiar—they should.
What’s your thoughts on how Network Design itself can be Automated and validated. Also from Intent based Networking at some stage Network should re-look into itself and adjust to meet design goals or best practices or alternatively suggest the design itself in green field situation for example. APSTRA seems to be moving into this direction.
The answer to this question, as always, is—how many balloons fit in a bag? 🙂 I think it depends on what you mean when you use the term design. If we are talking about the overlay, or traffic engineering, or even quality of service, I think we will see a rising trend towards using machine learning in network environments to help solve those problems. I am not convinced machine learning can solve these problems, in the sense of leaving humans out of the loop, but humans could set the parameters up, let the neural network learn the flows, and then let the machine adjust things over time. I tend to think this kind of work will be pretty narrow for a long time to come.
There will be stumbling blocks here that need to be solved. For instance, if you introduce a new application into the network, do you need to re-teach the machine learning network? Or can you somehow make some adjustments? Or are you willing to let the new application underperform while the neural network adjusts? There are no clear answers to these questions, and yet we are going to need clear answers to them before we can really start counting on machine learning in this way.
If, on the other hand, you think of design as figuring out what the network topology should look like in the first place, or what kind of bandwidth you might need to build into the physical topology and where, I think machine learning can provide hints, but it is not going to be able to “design” a network in this way. There is too much intent involved here. For instance, in your original question, you noted the network can “look into itself” and “make adjustments” to better “meet the original design goals.” I’m not certain those “original design goals” are ever going to come from machine learning.
If this sounds like a wishy-washy answer, that’s because it is, in the end… It is always hard to make predictions of this kind; I’m just working off of what I know of machine learning today, compared to what I understand of the multi-variable problem of network design, which is then mushed into the almost infinite possibilities of business requirements.
In a recent comment, Dave Raney asked:
Russ, I read your latest blog post on BGP. I have been curious about another development. Specifically is there still any work related to using BGP Flowspec in a similar fashion to RFC1998. In which a customer of a provider will be able to ask a provider to discard traffic using a flowspec rule at the provider edge. I saw that these were in development and are similar but both appear defunct. BGP Flowspec-ORF https://www.ietf.org/proceedings/93/slides/slides-93-idr-19.pdf BGP Flowspec Redirect https://tools.ietf.org/html/draft-ietf-idr-flowspec-redirect-ip-02.
This is a good question, to which there are two answers. The first is this service does exist. While it’s not widely publicized, a number of transit providers do, in fact, offer the ability to send them a flowspec community which will cause them to set a filter on their end of the link. This kind of service is immensely useful for countering Distributed Denial of Service (DDoS) attacks, of course. The problem is such services are expensive. The one provider I have personal experience with charges per prefix, and the cost is high enough to make it much less attractive.
Why would the cost be so high? The same reason a lot of providers do not filter for unicast Reverse Path Forwarding (uRPF) failures at scale: per packet filtering is very performance intensive, sometimes requiring recycling the packet in the ASIC. A line card normally able to support x customers without filtering may only be able to support x/2 customers with filtering. The provider has to pay for additional space, power, and configuration (the flowspec rules must be configured and maintained on the customer facing router). All of these things are costs the provider is going to pass on to their customers. The cost is high enough that I know very few network operators (in fact, so few as to be 0) who will pay for this kind of service.
The second answer is there is another kind of service that is similar to what Dave is asking about. Many DDoS protection services offer their customers the ability to signal a request to the provider to block traffic from a particular source, or to help them manage a DDoS in some other way. This is very similar to the idea of interdomain flowspec, only using a different signaling mechanism. The signaling mechanism, in this case, is designed to allow the provider more leeway in how they respond to the request for help countering the DDoS. This system is called DDoS Open Threats Signaling; you can read more about it at this post I wrote at the ECI Telecom blog. You can also head over to the IETF DOTS WG page, and read through the drafts yourself.
Yes, I do answer reader comments… Sometimes just in email, and sometimes with a post—so comment away, ask questions, etc.