Thoughts on Auto Disaggregation and Complexity
Way in the past, the EIGRP team (including me) had an interesting idea–why not aggregate routes automatically as much as possible, along classless bounds, and then deaggregate routes when we could detect some failure was causing a routing black hole? To understand this concept better, consider the network below.
In this network, B and C are connected to four different routers, each of which is advertising a different subnet. In turn, B and C are aggregating these four routes into 2001:db8:3e8:10::/60, and advertising this aggregate towards A. From a control plane state perspective, this is a major win. The obvious gain is that the amount of state is reduced from four routes to one. The less obvious gain is A doesn’t need to know about any changes in the state for the four destinations aggregated into the /60. Depending on how often these links change state, the reduction in the rate of change is, perhaps, more important than the reduction in the amount of control plane state.
We always know there will be a tradeoff when reducing state; what is the tradeoff here? If C somehow loses its connection to one of the four routers, say the router advertising 11::/64, C’s 10::/60 aggregate will not change. Since A thinks C still has a route to every subnet within 10::/60, it will continue sending traffic destined to addresses in the 11::/64 towards both B and C. C will not have a route towards these destinations, so it will drop the traffic.
We have a routing black hole.
This much is pretty simple. The harder part is figuring out to eliminate this routing black hole. Our first choice is to just not aggregate these routes. While you might be cringing right now, this isn’t such a bad option in many networks. We often underestimate the amount of state and the speed of state change modern routing protocols running on modern processors can support. I’ve seen networks running IS-IS in a single flooding domain with tens of thousands of routes and thousands of nodes running “in the wild.” I’ve seen IS-IS networks with thousands of nodes and hundreds of thousands of routes running in lab environments. These networks still converge.
But what if we really think we need to reduce the amount and speed of state, so we really need to aggregate these routes?
One solution that has been proposed a number of times through the years is auto disaggregation.
In this case, suppose D somehow realizes C cannot reach one of the components of a shared aggregate route. D could simply stop advertising the aggregate, advertising each of the components instead. The question here might be: is this a good idea? Looking at this from the perspective of the SOS triad, the aggregation replaced four routes with a single route. In the auto disaggregation case, the single route change is replaced by four route changes. The amount of state is variable, and in some cases the rate of change in state is actually higher than without the aggregation.
So…
I don’t hold that auto disaggregation is either good nor bad—it just presents a different set of challenges to the network designer. Instead of designing for average rates of change and given table sizes, you can count on much smaller tables, but you might find there are times when the rate of change is dramatically higher than you expect. A good question to ask, before deploying this kind of technology, might be: can I forsee a chain of events that will cause a high enough rate of state change that auto disaggregation is actually more destabilizing than just not summarizing at all in this network?
A real danger with auto disaggregation, by the way, is using summarization to dramatically reduce table sizes without understanding how a goldilocks failure (what we used to call in telco a mother’s day event, or perhaps a black swan) can cascade into widespread failures. If you’re counting on particular devices in your network only have a dozen or two dozen table entries, but just the right set of failures can cause them to have several thousand entries because of auto disaggregation, what kinds of failures modes should you anticipate? Can you anticipate or mitigate this kind of problem?
The idea of automatically summarizing and disaggregating routes is an interesting study in complexity, state, and optimization. It’s a good brain exercise in thinking through what-if situations, and carefully thinking about when and where to deploy this kind of thing.
What do you think about this idea? When would you deploy it, where, and why? When and where would you be cautious about deploying this kind of technology?