What is a Failure Domain?
“No, I wouldn’t do that, it will make the failure domain too large…”
“We need to divide this failure domain up…”
Okay, great—we all know we need to use failure domains, because without them our networks will be unstable, too complex, and all that stuff, right? But what, precisely, is a failure domain? It seems to have something to do with aggregation, because just about every network design book in the world says things like, “aggregating routes breaks up failure domains.” It also seems to have something to do with flooding domains in link state protocols, because we’re often informed that you need to put in flooding domain boundaries to break up large failure domains. Maybe these two things contain a clue: what is common between flooding domain boundaries and aggregating reachability information?
Hiding information.
But how does hiding information create failure domain boundaries?
If Router B is aggregating 2001:db8:0:1::/64 and 2001:db8:0:2::/64 to 2001:db8::/61, then changes in the more specific routes will be hidden from Router A. This hiding of information means a failure of one of these two more specific routes does not cause Router A to recalculate what it knows about reachability in the network. Hence a failure at 2001:db8:0:1::/64 doesn’t impact Router A, which means Router A is in a different failure domain than 2001:db8:0:1::/64. Based on this, we can venture a simple definition:
A failure domain is any group of devices that will share state when the network topology changes.
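To make the aggregation example concrete, here is a minimal Python sketch of what Router A gets to see. It is a toy model, not a routing protocol implementation: the function name `advertised_by_b` is hypothetical, and it assumes Router B keeps advertising the aggregate as long as at least one covered route is still present.

```python
import ipaddress

# The aggregate Router B advertises toward Router A (from the example above).
AGGREGATE = ipaddress.ip_network("2001:db8::/61")


def advertised_by_b(specifics):
    """Return the set of prefixes Router B advertises to Router A.

    Hypothetical helper: any prefix covered by AGGREGATE is hidden behind it;
    anything outside the aggregate is advertised as-is.
    """
    covered = [p for p in specifics if p.subnet_of(AGGREGATE)]
    uncovered = [p for p in specifics if not p.subnet_of(AGGREGATE)]
    advertised = set(uncovered)
    if covered:  # assumption: one surviving component keeps the aggregate up
        advertised.add(AGGREGATE)
    return advertised


net1 = ipaddress.ip_network("2001:db8:0:1::/64")
net2 = ipaddress.ip_network("2001:db8:0:2::/64")

view_before = advertised_by_b({net1, net2})
view_after = advertised_by_b({net2})  # 2001:db8:0:1::/64 has failed

# Router A only needs to recalculate if what it is told changes.
a_must_recalculate = view_before != view_after
print(a_must_recalculate)
```

In this model Router A’s view is the same before and after the failure, which is exactly what puts Router A in a different failure domain than 2001:db8:0:1::/64.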
This definition doesn’t seem to work all the time, though. For example, what if the metric of the 2001:db8::/61 aggregate at Router B is taken from the highest-cost more specific route it covers (or hides)? If the aggregate metric is taken from the 2001:db8:0:1::/64 route attached to Router C, then when that link fails, the aggregate cost will also change, and Router A will need to recalculate reachability. This situation, however, doesn’t change our definition of what a failure domain is; it just alerts us that failure domains can “leak” information if they’re not constructed carefully. In fact, we can trace this back to the law of leaky abstractions: hiding information is just a form of abstraction, and all abstractions leak information in some way to at least one other subsystem within the larger system.
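The leak is easy to see in a few lines of Python. This is only a sketch under stated assumptions: the costs (30 and 10) are invented for illustration, and `aggregate_metric` is a hypothetical helper that copies the highest component cost into the aggregate, as in the scenario just described.

```python
def aggregate_metric(component_costs):
    """Metric of the aggregate: the highest cost among the covered routes.

    Returns None when no component route is left to aggregate.
    """
    if not component_costs:
        return None
    return max(component_costs.values())


# Invented costs, as seen from Router B.
costs = {
    "2001:db8:0:1::/64": 30,  # via Router C
    "2001:db8:0:2::/64": 10,
}

metric_before = aggregate_metric(costs)

# The link toward 2001:db8:0:1::/64 fails and the route is withdrawn.
del costs["2001:db8:0:1::/64"]
metric_after = aggregate_metric(costs)

# The prefix Router A sees is unchanged, but its metric is not,
# so Router A still has to react to a failure it was meant to be shielded from.
leaked = metric_before != metric_after
print(metric_before, metric_after, leaked)
```

One way to plug this particular leak, in this toy model at least, is to give the aggregate a fixed metric that does not depend on any component route.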
Another, harder example might be that of the flooding domain boundary in a link state protocol. Assume, for a moment, that Router A is in Level 2, Routers C and D are in Level 1, and Router B is in both Level 1 and Level 2. Further assume no route aggregation is taking place. What will happen when 2001:db8:0:1::/64 fails? As Router B is advertising 2001:db8:0:1::/64 as if it were directly connected, Router A will see the destination disappear, but it will not see the network topology change. The state of the topology seems to be in one failure domain, while the state of reachability seems to be in another, overlapping, failure domain. This appearance is, in fact, a reflection of reality. Failure domains can, and do, overlap in this way all the time. There’s nothing wrong with overlapping failure domains, so long as you recognize they exist, and therefore you actually look (and plan) for them.
Finally, consider what happens if some link attached to Router A fails. Unless routes are being intentionally leaked into the Level 1 flooding domain at Router B, Router C won’t see any changes to the network, either in topology or reachability. After all, Router C is just depending on Router B’s attached bit to build a default route it uses to reach any destination outside the local flooding domain. This means failure domains can be asymmetric. What breaks a failure domain for one router doesn’t always break it for another. Again, this is okay, so long as you’re aware of this situation, and recognize it when and where it happens.
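Both the overlapping and the asymmetric cases can be captured in a small toy model. The sketch below is not how a link state protocol is implemented; it simply encodes the two scenarios described above, with Router A in Level 2, Routers C and D in Level 1, and Router B in both. The function `failure_domains` and its `leak_l2_into_l1` flag are hypothetical names used only for illustration.

```python
LEVEL1 = {"B", "C", "D"}  # Level 1 flooding domain
LEVEL2 = {"A", "B"}       # Level 2 flooding domain


def failure_domains(event_level, leak_l2_into_l1=False):
    """Return (topology_domain, reachability_domain) for a failure.

    event_level: 1 if the failure is inside Level 1, 2 if inside Level 2.
    leak_l2_into_l1: whether Router B leaks Level 2 routes into Level 1.
    Assumes no aggregation at Router B.
    """
    if event_level == 1:
        # Topology changes stay inside Level 1, but the lost prefix is
        # carried into Level 2, so Level 2 routers see reachability change.
        topology = set(LEVEL1)
        reachability = LEVEL1 | LEVEL2
    else:
        # Level 1 routers follow a default route toward Router B, so they
        # see nothing unless Level 2 routes are leaked to them.
        topology = set(LEVEL2)
        reachability = LEVEL2 | (LEVEL1 if leak_l2_into_l1 else set())
    return topology, reachability


l1_topology, l1_reach = failure_domains(1)
l2_topology, l2_reach = failure_domains(2)

# Overlap: Router A is in the reachability domain but not the topology domain.
print("A" in l1_reach, "A" in l1_topology)
# Asymmetry: Router C sees nothing when a Level 2 link fails.
print("C" in l2_reach)
```

Calling `failure_domains(2, leak_l2_into_l1=True)` pulls Routers C and D back into the reachability domain, which is the route leaking caveat mentioned above.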
So given these caveats, the definition of a failure domain above seems to work well. We can refine it a little, but the general idea of a failure domain as a set of devices that will (or must) react to a change in the state of the network is a good place to start.
Hey Russ. You, I, and it seems all the others who ran away from Cisco have had many a discussion about failure domains. Hint, hint to CCDE candidates… You’d better know how to isolate
Steve, a ghost from the past! Thanks for stopping by… Isolation is the key point, it would seem. Just watch those leaky abstractions.
🙂
Russ
Hey Russ,
Nice complete overview of how failure domains need a bit more thought than we in general give them.
I think one point we often overlook when talking about failure domains is that we (un)consciously tend to think of layer 2 broadcast domains (and most discussions/blogs seem to only look at this aspect).
It might also be that, because we expect *GPs to handle failures and dynamically adapt the topology accordingly, we don’t regard it as a “failure domain”.
And even though we do keep things like flooding domains in mind, the failure domain might be bigger than the flooding domain (as your example shows, due to the potential for a black hole), so the “refined definition” is definitely a good “rule of thumb”.
I think we do tend to think that “if the routing protocol handles it, there’s no failure, and hence no failure domain…” And we often conflate failure domains with layer 2 broadcast domains.
Thanks for stopping by!
🙂
Russ