On July 21st I’ll be teaching BGP Policy over at Safari Books Online. From the description:
This course begins by simplifying the entire BGP policy space into three basic kinds of policies that operators implement using BGP—selecting the outbound path, selecting the inbound path, and “do not transit.” A use case is given for each of these three kinds, or classes, of policies from the perspective of a transit provider, and another from the perspective of a nontransit operator connected to the edge of the ‘net. With this background in place, the course will then explore each of the many ways these classes of policy may be implemented using local preference, AS Path prepending, various communities, AS Path poisoning, and other techniques. Positive and negative aspects of each implementation path will be considered.
My courses are going through a bit of updating, but I think August and September will be How the Internet Really Works, followed by an updated course on troubleshooting. I’m incorporating more tools into the course, including (of course!) ChatGPT. Watch this space for upcoming announcements.
If you advertise routes through a provider to the global Internet, you might be wondering if you should go through the trouble of registering in the RPKI and advertising ROAs. What is the tradeoff for the work involved in what seems like a complex process? Cecelia Testart joins Jeremy White and Russ White to discuss recent work in measuring the value of the RPKI.
From the question pile: Route servers (as opposed to route reflectors) don’t change anything about a BGP route when re-advertising it to a peer, whether iBGP or eBGP. Why don’t route servers cause routing loops (or other problems) in a BGP network?
Route servers are often used by Internet Exchange Points (IXPs) to distribute routes between connected BGP speakers. BGP route servers
- Don’t change anything about a received BGP route when advertising the route to its peers (other BGP speakers)
- Don’t install routes received through BGP into the local routing table
Shouldn’t using route servers in a network—pontentially, at least—cause routing loops or other BGP routing issues? Maybe a practical example will help.
Assume b, e, and s are all route servers in their respective networks. Starting at the far left, a receives some route, 101::/64, and sends it on to b,, which then sends the unmodified route to c. When c receives traffic destined to 101::/64, what will happen? Regardless of whether these routers are running iBGP or eBGP, b will not change the next hop, so when c receives the route, a is still the next hop. If there’s no underlying routing protocol, c won’t know how to reach A, so it will ignore the route and drop the traffic. Even if there is an underlying routing protocol, c’s route to 101::/64’s route passes through b, and b isn’t installing any routing information learned from BGP into its local routing table (because it’s a route server). b is going to drop traffic destined to 101::/64.
We can solve this simple problem by adding a new link between the two clients of the route server, as shown in the center diagram. Here, d sends 101::/64 to e, which then sends the unchanged route to g. Since g has a direct connection to d, we can assume g will send traffic destined to 101::/64 directly to d, where it will be forwarded to the destination. Why wouldn’t d and g peer directly instead of counting on e to carry routes between them? In most cases this kind of indirect peering is done to increase network scale. If there are thousand routes like d and g, it will be simpler for them all to peer to e than to build a full mesh of connections.
Why not use a route reflector rather than a route server in this situation? Route reflectors can only be used to carry routes between iBGP peers. If d, e, and g are all in different autonomous systems, route reflectors cannot be used to solve this problem.
But this brings us back to the original question—route reflectors use the cluster list to prevent loops within an AS (the cluster list is similar in form and function to the AS path carried between autonomous systems, but it uses router ID’s rather than AS numbers to describe the path)?
If you have multiple route servers connected to one another you can, in fact, form routing loops.
In this network, a is sending 101::/64 to b, which is then sending the route, unmodified, to e. Because of some local policy, e is choosing the path through a, which means e forwards traffic destined to 101::/64 to c. At the same time, e is advertising 101::/64 to b, which is then sending the route (unmodified) to a, and a is choosing the path through c. In this case, a permanent (persistent) routing loop is formed through the control plane, primarily because no single BGP speaker has a complete view of the topology. The two route servers, by hiding the real path to 101::/64, makes is possible to form a routing loop.
The deploy route servers without forming these kinds of loops—
- BGP speakers learning routes from route servers should be directly connected—there should not be destinations reachable via some “hidden” intermediate hop
- Route servers should send all the routes they learn from clients; they should not use bestpath to choose which routes to send to clients
These restrictions prevent routing loops from forming when deploying route servers—but they also restrict the use of route servers to situations like carrying routes between BGP speakers connected to a single fabric.
Cisco filed a patent some time back describing a method to prevent routing loops when using BGP route servers; it makes interesting reading for folks who want to dive a little deeper.
I recorded the beginnings of a BGP training series over at Packet Pushers a short while back; they’ve released these onto youtube (so you can find the entire series there). I’m highlighting one of these every couple of weeks ’til I’ve gone through the entire set of recordings. In this recording, I’m talking through some more interesting aspects of BGP peering, including challenges with IPv6 link local nexthops, promiscuous peering, and capabilities.
Since BGP is designed to be an overlay protocol, it doesn’t really have good mechanisms for carrying routes within an autonomous system. In this video, I’m discussing some of the techniques developed to carry routes within an AS, including route reflectors.
My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.
What about the much more controlled environment of a data center fabric?
Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.
These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.
Start with A losing its connection to 101::/64—
- T1: A withdraws its route from B and C
- T2: B withdraws its route from D and E, C withdraws its route from F and G
- T3: D and E withdraw their routes from H, F and G withdraw their routes from K
- T4: H and K withdraw their routes from L
Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.
What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.
What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.
If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.
Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.
L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.
What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.
Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.
the bottom line
There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.
Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.
I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …
This lesson in Russ White’s BGP course gets into withdrawing a route, MRAI time, implicit withdraws, BGP Hunt, graceful restart, and other topics.