Slicing and Dicing Flooding Domains (2)

The first post in this series is here.

Finally, let’s consider the first issue, the SPF run time. First, if you’ve been keeping track of the SPF run time in several locations throughout your network (you have been, right? Right?!? This should be a regular part of your documentation!), then you’ll know when there’s a big jump. But a big jump without a big change in some corresponding network design parameter (size of the network, etc.), isn’t a good reason to break up a flooding domain. Rather, it’s a good reason to go find out why the SPF run time changed, which means a good session of troubleshooting what’s probably an esoteric problem someplace.

Assume, however, that we’re not talking about a big jump. Rather, the SPF run time has been increasing over time, or you’re just looking at a particular network without any past history. My rule of thumb is to start really asking questions when the SPF run time gets to around 100ms. I don’t know where that number came from—it’s a “seat of the pants thing,” I suppose. Most networks today seem to run SPF in less than 10ms, though I’ve seen a few that seem to run around 30ms, so 100ms seems excessive. I know a lot of people do lots of fancy calculations here (the speed of the processor and the percentage of processor used for other things and the SPF run time and…), but I’m not one for doing fancy stuff when a simple rule of thumb seems to work to alert me to problems going into a situation.

But before reaching for my flooding domain slicing tools because of a 100ms SPF run time, I’m going to try and bring the time down in other ways.

First, I’m going to make certain incremental and partial SPF are enabled. There’s little to no cost here, so just do it. Second, I’m going to look at using exponential timers to batch up large numbers of changes. Third, I’m going to make certain I’m removing all the information I can from the link state database—see the answer to the third question on the LSDB size, above.

If you’ve done all this—keeping in mind that you need to consider the trade offs (if you don’t see the trade offs, you’re not looking hard enough), then I would consider splitting the flooding domain. If it sounds like I would never split a flooding domain for purely performance or technical reasons, you’ve come to the right conclusion on reading these two posts.

All that said, let me tell you the real reasons I would split a flooding domain.

First, just to make my life easier when troubleshooting the network. The router has a lot larger capacity for looking through screens full of link state information than I do. At 2AM, when the network is down, any little advantage I can give myself to troubleshoot the network faster is worth considering.

Second, again, to make my life easier in the troubleshooting process. Go back and think about the OODA loop. Where can I observe the network to best understand what’s going on? If you thought, “at the flooding domain boundary,” you earn a gold star. You can pick it up at the local office supply store.

Third, to break apart the network in case of a real failure—to provide a “firewall” (in the original sense of the word, rather than the appliance sense) to keep one part of the network from going down when another part falls apart.

Finally, to provide a “choke point” where you can implement policy.

So in the end—you shouldn’t build the world’s largest flooding domain just because you can, and you shouldn’t build a ton of tiny flooding domains just because you can. The technical reasons for slicing and dicing a flooding domain aren’t really that strong, but don’t discount using flooding domains on a more practical level.