Rethinking BGP on the DC Fabric (part 2)
In my last post on this topic, I laid out the purpose of this series, which is to start a discussion about whether BGP is the ideal underlay control plane for a DC fabric, and gave some definitions. Here, I'd like to dive into the reasons not to use BGP as a DC fabric underlay control plane. The first of these is that BGP converges very slowly, and requires a lot of help to converge at all.
Examples abound. I’ve seen the results of two testbeds in the last several years where a DC fabric was configured with each router (switch, if you prefer) in a separate AS, and some number of routes pushed into the network. In both cases—one large-scale, the other a more moderately scaled network on physical hardware—BGP simply failed to converge. Why? A quick look at how BGP converges might help explain these results.
Assume we are watching the 110::/64 route (attached to A, on the left side of the diagram) at P. What happens when A loses its connection to 110::/64? Assume every router in this diagram is in a different AS, and that AS path length is the only factor determining the best path at every router.
Watching the route to 110::/64 at P, you would see the route move from G to M as the best path, then from M to K, then from K to N, and then finally completely drop out of P’s table. This is called the hunt because BGP “hunts,” apparently trying every path from the current best path to the longest possible path before finally removing the route from the network entirely. BGP isn’t really “hunting;” this is just an artifact of the way BGP speakers receive, process, and send updates through the network.
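To make the hunt a little more concrete, here is a toy Python model of the decision process at P. The neighbor names and AS-path lengths are assumptions chosen to match the walkthrough above, and the model ignores everything that makes real convergence slower (MRAI timers, update packing, processing queues); it only shows why each incoming withdraw pushes P's best path one step longer before the route finally disappears.

```python
# Toy model of BGP path hunting as seen from router P.
# Assumption (not from the diagram): P has learned 110::/64 from four
# neighbors, each with a different AS-path length, and the withdraw
# reaches each neighbor roughly in order of its distance from A.

from dataclasses import dataclass

@dataclass
class Path:
    neighbor: str       # the peer P learned this path from
    as_path_len: int    # the only tie-breaker in this example

# Paths P currently holds for 110::/64 (shortest AS path wins).
rib_in = {
    "G": Path("G", 3),
    "M": Path("M", 4),
    "K": Path("K", 5),
    "N": Path("N", 6),
}

def best_path(paths):
    return min(paths.values(), key=lambda p: p.as_path_len) if paths else None

# The withdraw propagates outward from A, so the neighbors with shorter
# paths tend to send their withdraws to P first. Each withdraw triggers a
# new best-path run at P, which "hunts" to the next-longest path.
for neighbor in ["G", "M", "K", "N"]:
    del rib_in[neighbor]                      # withdraw received from this peer
    best = best_path(rib_in)
    if best:
        print(f"withdraw from {neighbor}: best path now via {best.neighbor}")
    else:
        print(f"withdraw from {neighbor}: 110::/64 removed from P's table")
```

Running this prints the same sequence described above: the best path steps from G to M to K to N, and only then does the route drop out of P's table.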
If you consider a more complex topology, like a five-stage butterfly fabric, you will find there are many (very many) alternate longer-length paths available for BGP to hunt through on a withdraw. Withdrawing thousands of routes at the same time, combined with the impact of the hunt, can put BGP in a state where it simply never converges.
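As a rough illustration of just how many such paths exist, the sketch below counts the loop-free paths between two leaves in a small, hypothetical leaf-and-spine fabric where nothing stops a leaf from acting as transit. The topology and sizes are mine, not the five-stage fabric from the post, but even at this scale the count dwarfs the handful of two-hop paths you actually intend to use, and every extra path is another step BGP can hunt through.

```python
# Count the loop-free alternate paths BGP could hunt through in a small
# leaf-and-spine fabric. Topology and sizes are illustrative only.

from itertools import product

LEAVES = [f"leaf{i}" for i in range(4)]
SPINES = [f"spine{i}" for i in range(4)]

# Full mesh between leaves and spines.
neighbors = {node: [] for node in LEAVES + SPINES}
for leaf, spine in product(LEAVES, SPINES):
    neighbors[leaf].append(spine)
    neighbors[spine].append(leaf)

def count_simple_paths(src, dst, visited=None):
    """Count loop-free paths from src to dst; each one is a valid AS path
    when every node is its own AS and no transit filters are applied."""
    if src == dst:
        return 1
    visited = (visited or set()) | {src}
    return sum(count_simple_paths(nxt, dst, visited)
               for nxt in neighbors[src] if nxt not in visited)

# Far more than the four leaf-spine-leaf paths you intend to use.
print(count_simple_paths("leaf0", "leaf3"))
```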
To get BGP to converge, various techniques must be used: for instance, placing all the routers in the spine in a single AS, and configuring path filters at the ToR switches so they are never used as a transit path. Even when these techniques are used, however, BGP can still require a minute or so to perform a withdraw.
This means the BGP configuration cannot be the same on every device; it is determined by where each device sits in the fabric, which harms repeatability. The configuration must also contain complex filters, and getting those filters wrong can bring the entire fabric down.
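As a sketch of what position-dependent configuration looks like, here is a hypothetical generator that hands out BGP parameters by role: the spines share a single AS, while each ToR gets its own AS plus an export policy that keeps it out of any transit path. The device names, AS numbers, and policy labels are illustrative assumptions, not taken from any particular implementation.

```python
# Sketch of why the underlay configuration ends up position-dependent.
# Hypothetical values throughout: spines share one AS so the fabric looks
# like a single transit hop; each ToR gets a unique AS and an export policy
# that prevents it from ever being used as transit.

SPINE_AS = 65000
TOR_AS_BASE = 65100

def bgp_config(device: str, role: str, index: int = 0) -> dict:
    if role == "spine":
        return {
            "device": device,
            "local_as": SPINE_AS,              # every spine shares the same AS
            "export_policy": "advertise-all",
        }
    if role == "tor":
        return {
            "device": device,
            "local_as": TOR_AS_BASE + index,   # unique AS per ToR
            # Only advertise prefixes originated on this switch, so the ToR
            # never shows up as a transit hop during the hunt.
            "export_policy": "originated-only",
        }
    raise ValueError(f"unknown role {role!r}")

fabric = (
    [bgp_config(f"spine{i}", "spine") for i in range(4)]
    + [bgp_config(f"tor{i}", "tor", i) for i in range(8)]
)

for cfg in fabric:
    print(cfg)
```

Even in this stripped-down form, no two roles share the same stanza, and the ToR policy is exactly the kind of filter that has to be written, maintained, and gotten right on every edge device.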
There are several counters to the problem of slow convergence, and the complex configurations required to make BGP converge more quickly, but this post is pushing against its limit … so I’ll leave these until next time.