Slicing and Dicing Flooding Domains (2)

The first post in this series appears below.

Finally, let’s consider the first issue, the SPF run time. First, if you’ve been keeping track of the SPF run time in several locations throughout your network (you have been, right? Right?!? This should be a regular part of your documentation!), then you’ll know when there’s a big jump. But a big jump without a big change in some corresponding network design parameter (the size of the network, etc.) isn’t a good reason to break up a flooding domain. Rather, it’s a good reason to go find out why the SPF run time changed, which means a good session of troubleshooting what’s probably an esoteric problem someplace.

Assume, however, that we’re not talking about a big jump. Rather, the SPF run time has been increasing slowly over time, or you’re just looking at a particular network without any past history. My rule of thumb is to start really asking questions when the SPF run time gets to around 100ms. I don’t know where that number came from; it’s a “seat of the pants thing,” I suppose. Most networks today seem to run SPF in less than 10ms, though I’ve seen a few that seem to run around 30ms, so 100ms seems excessive. I know a lot of people do lots of fancy calculations here (the speed of the processor and the percentage of processor used for other things and the SPF run time and…), but I’m not one for doing fancy math when a simple rule of thumb is enough to alert me to problems going into a situation.
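
To make this concrete, here is a minimal sketch, in Python, of the two checks described above: the 100ms ceiling and a sudden jump against the recent baseline. The measurements, thresholds, and function names are all illustrative assumptions, not part of any real monitoring tool.

```python
from statistics import median

SPF_CEILING_MS = 100   # the "start asking questions" rule of thumb
JUMP_FACTOR = 2.0      # treat anything over 2x the recent median as a big jump

def check_spf_history(samples_ms):
    """Return warnings for one router's recorded SPF run times (oldest first)."""
    warnings = []
    if not samples_ms:
        return warnings
    latest = samples_ms[-1]
    baseline = median(samples_ms[:-1]) if len(samples_ms) > 1 else latest
    if latest >= SPF_CEILING_MS:
        warnings.append(f"SPF run time {latest}ms is at or above {SPF_CEILING_MS}ms")
    if latest > JUMP_FACTOR * baseline:
        warnings.append(f"SPF run time jumped from a baseline of ~{baseline}ms to {latest}ms;"
                        " go troubleshoot before splitting the flooding domain")
    return warnings

# Made-up measurements from a single router, oldest to newest
history = [8, 9, 8, 10, 9, 31]
for warning in check_spf_history(history):
    print(warning)
```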

But before reaching for my flooding domain slicing tools because of a 100ms SPF run time, I’m going to try to bring the time down in other ways.

First, I’m going to make certain incremental and partial SPF are enabled. There’s little to no cost here, so just do it. Second, I’m going to look at using exponential timers to batch up large numbers of changes, as sketched below. Third, I’m going to make certain I’m removing all the information I can from the link state database; see the discussion of the LSDB size in the first post in this series.
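
The exponential timers mentioned above work roughly like the sketch below: the first SPF after a quiet period runs quickly, but each additional event pushes the next run further out, so a burst of changes is batched into fewer SPF runs. This is a simplified model with made-up parameter values, not any particular vendor’s implementation; real routers expose similar knobs as SPF throttle or backoff intervals.

```python
# A simplified sketch of exponential SPF backoff: the first SPF after a quiet
# period runs quickly, but each additional event doubles the wait (up to a cap).
# All parameter values here are illustrative, not any vendor's defaults.

INITIAL_WAIT_MS = 50      # delay before the first SPF after a quiet period
MAX_WAIT_MS = 5000        # cap on the backoff
QUIET_PERIOD_MS = 10000   # how long the network must be stable before resetting

class SpfScheduler:
    def __init__(self):
        self.current_wait = INITIAL_WAIT_MS
        self.last_event_ms = None

    def next_spf_delay(self, now_ms):
        """Return how long to wait before running SPF for an event at now_ms."""
        if self.last_event_ms is not None and \
                now_ms - self.last_event_ms > QUIET_PERIOD_MS:
            self.current_wait = INITIAL_WAIT_MS          # network was quiet: reset
        delay = self.current_wait
        self.current_wait = min(self.current_wait * 2, MAX_WAIT_MS)   # back off
        self.last_event_ms = now_ms
        return delay

# A burst of topology changes, then a long quiet period
scheduler = SpfScheduler()
for event_time in (0, 20, 40, 60, 15000):
    print(f"event at {event_time}ms -> run SPF after {scheduler.next_spf_delay(event_time)}ms")
```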

If you’ve done all this, keeping in mind that you need to consider the tradeoffs (if you don’t see the tradeoffs, you’re not looking hard enough), then I would consider splitting the flooding domain. If it sounds like I would never split a flooding domain for purely performance or technical reasons, you’ve read these two posts correctly.

All that said, let me tell you the real reasons I would split a flooding domain.

First, just to make my life easier when troubleshooting the network. The router has a much larger capacity for sorting through screens full of link state information than I do. At 2AM, when the network is down, any little advantage I can give myself to troubleshoot the network faster is worth considering.

Second, again, to make my life easier in the troubleshooting process. Go back and think about the OODA loop. Where can I observe the network to best understand what’s going on? If you thought, “at the flooding domain boundary,” you earn a gold star. You can pick it up at the local office supply store.

Third, to break apart the network in case of a real failure—to provide a “firewall” (in the original sense of the word, rather than the appliance sense) to keep one part of the network from going down when another part falls apart.

Finally, to provide a “choke point” where you can implement policy.

So, in the end: you shouldn’t build the world’s largest flooding domain just because you can, and you shouldn’t build a ton of tiny flooding domains just because you can, either. The technical reasons for slicing and dicing a flooding domain aren’t really that strong, but don’t discount splitting flooding domains for these more practical reasons.

Slicing and Dicing Flooding Domains (1)

This week, two different folks asked me when and where I would split up a flooding domain (IS-IS) or area (OSPF); I figured a question asked twice in one week is worth a blog post, so here we are…

Before I start on the technical reasons, I’m going to say something that might surprise long-time readers: there is rarely any technical reason to split a single flooding domain into multiple flooding domains. That said, I’ll go through the technical reasons anyway.

There are really three things to think about when considering how a flooding domain is performing:

  • SPF run time
  • flooding frequency
  • LSDB size

Let’s look at the third issue first, the database size. This is theoretically an issue, but it only becomes a practical one if you have a lot of nodes and routes. I can’t ever recall bumping up against this problem, but what if I did? I’d start by taking the transit link prefixes out of the database entirely: for instance, by configuring all the interfaces that face actual host devices as passive interfaces (which you should be doing anyway!), and configuring IS-IS to advertise just the passive interfaces. You can pull similar tricks in OSPF. Another trick here is to make certain point-to-point Ethernet links aren’t electing a DIS or DR; the pseudonode (or network LSA) entries they generate just clog the database up with meaningless information.
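
As a rough illustration of why these tricks matter, here is a back-of-envelope sketch, in Python, of how the number of LSDB entries changes when transit-link prefixes are no longer advertised and point-to-point links stop generating pseudonode (or network LSA) entries. The counting model and every number in it are simplifying assumptions for illustration, not an accurate LSDB size calculation.

```python
# A back-of-envelope model of LSDB size, counting one entry per router, one per
# advertised prefix, and one pseudonode entry per link that elects a DIS/DR.
# The model and all the numbers below are simplifying assumptions.

def lsdb_entries(routers, transit_links, host_facing_links,
                 advertise_transit_prefixes, elect_dis_on_links):
    entries = routers                  # one router (node) entry each
    entries += host_facing_links       # prefixes from passive, host-facing interfaces
    if advertise_transit_prefixes:
        entries += transit_links       # a prefix for every transit link
    if elect_dis_on_links:
        entries += transit_links       # pseudonode / network-LSA entries
    return entries

before = lsdb_entries(200, 400, 1000, True, True)
after = lsdb_entries(200, 400, 1000, False, False)
print(f"roughly {before} entries before the cleanup, {after} after")
```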

The second issue, the flooding frequency, is more interesting. Before I split a flooding domain because there is “too much flooding,” I would want to check several things to make certain I’m not doing a lot of work for nothing. Specifically:

  • Why am I getting all these LSAs/LSPs? A lot of flooding means a lot of changes, which generally means instability someplace or another. I would rather justify the instability, or stop it, than split a flooding domain just to react to it. Techniques I would look at here include interface dampening, if it’s available (a sketch of the dampening math follows this list), and roping off a flapping network behind a nailed-up redistributed route of some sort.
  • If the rate of flooding can only be controlled to some degree, or the flooding is legitimate, then I would want to look at how I can configure the network to handle it in a way that makes sense. Specifically, I’m going to look at using exponential backoff to manage bursts of flooding events while keeping my convergence time down as much as I can, and I’m going to tune my LSP generation intervals to make certain I account for bursts of changes on a single intermediate system. This is where we get into tradeoffs, however: at some point you need to ask whether tuning the timers is easier and simpler than breaking the flooding domain into two flooding domains, particularly if you can isolate the bursty parts of the network from the more stable parts.
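
The dampening idea in the first bullet works roughly like the sketch below: each flap adds a penalty, the penalty decays exponentially with a half-life, and the interface is suppressed while the penalty sits above a threshold, then reused once it decays below a lower one. The parameter values and class name here are illustrative assumptions, not any vendor’s defaults.

```python
# A simplified sketch of flap dampening: each flap adds a fixed penalty, the
# penalty decays exponentially with a half-life, and the interface stays
# suppressed while the penalty is above the suppress threshold, coming back
# once it decays below the reuse threshold. Parameter values are illustrative.

import math

PENALTY_PER_FLAP = 1000
SUPPRESS_THRESHOLD = 2000
REUSE_THRESHOLD = 750
HALF_LIFE_S = 15

class DampenedInterface:
    def __init__(self):
        self.penalty = 0.0
        self.last_update_s = 0.0
        self.suppressed = False

    def _decay(self, now_s):
        elapsed = now_s - self.last_update_s
        self.penalty *= math.pow(0.5, elapsed / HALF_LIFE_S)
        self.last_update_s = now_s

    def flap(self, now_s):
        self._decay(now_s)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_THRESHOLD:
            self.suppressed = True

    def usable(self, now_s):
        self._decay(now_s)
        if self.suppressed and self.penalty < REUSE_THRESHOLD:
            self.suppressed = False
        return not self.suppressed

# Three quick flaps suppress the interface; it comes back once the penalty decays
intf = DampenedInterface()
for t in (0, 1, 2):
    intf.flap(t)
print("usable right after the flaps?", intf.usable(3))
print("usable a minute later?", intf.usable(60))
```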

There are probably very few networks in the world where this sort of tuning will not hold the rate of flooding down to a reasonable level.

Continued next week…