Archive for 2022
Revisiting BGP Convergence
My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.
What about the much more controlled environment of a data center fabric?
Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.
These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.

Start with A losing its connection to 101::/64—
- T1: A withdraws its route from B and C
- T2: B withdraws its route from D and E, C withdraws its route from F and G
- T3: D and E withdraw their routes from H, F and G withdraw their routes from K
- T4: H and K withdraw their routes from L
Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.
What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.
What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.
If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.
Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.
L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.
What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.
Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.
the bottom line
There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.
Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.
I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …
Hedge 132: DNS Complexity and the DNAME

We all intuitively know the DNS is complex—and becoming more complex over time. Describing just how complex, however, is difficult. Siva Kesava and Ryan Beckett just published a research paper taking on the task of describing DNS complexity, particularly in light of the new DNAME record type. It turns out its complex enough that you can no longer really validate zone files.
How BGP Really Converges

This lesson in Russ White’s BGP course gets into withdrawing a route, MRAI time, implicit withdraws, BGP Hunt, graceful restart, and other topics.
Hedge 131: Easier for the Computer or the Person?
One of the mainstays of scripting—and now network management—are increasingly focused on making things “easier” for the human operator. Does this focus on making things “easier” for the operator produce a better experience, though? Or does it create frustration as humans try to “outguess” the computer’s programming and process? Join Tom Ammon and Russ White as they discuss the problems with scripting, automation, and ease-of-use.
Hedge 130: The Importance of Network Inventories
Inventories are generally hard, and hence don’t tend to be where you’d like to spend your time. The importance of having a good inventory, however, can hardly be overstated. Malcom Booden joins Tom Ammon and Russ White to talk about the importance of inventories and inventory ideas.
OT’N: BGP Loop Free Paths
Hedge 129: Open Source Mentoring
Mentoring is a topic we return to time and again—because it’s one of the most important things we can talk about in terms of building your people skills, your knowledge, and your career. On this episode of the Hedge, Guedis Cardenas joins Tom Ammon and Russ White to talk about open source mentoring. We discuss how this is different than “regular” mentoring, and how it’s the same. Join us as we talk about one of the most important career and personal growth things you can do.
