Simplification is a constant theme not only here, and in my talks, but across the network engineering world right now. But what does this mean practically? Looking at a complex network, how do you begin simplifying?
The first option is to abstract, abstract again, and abstract some more. But before diving into deep abstraction, remember that abstraction is both a good and a bad thing. Abstraction can reduce the amount of state in a network, and reduce the speed at which that state changes. Abstraction can cover a multitude of sins in the legacy part of the network, but abstractions also leak. In fact, all non-trivial abstractions leak. Following this logic through: all non-trivial abstractions leak; the more non-trivial the abstraction, the more it will leak; the more complexity an abstraction is covering, the less trivial the abstraction will be. Hence: the more complexity you are covering with an abstraction, the more it will leak.
Abstraction, then, is only one part of the solution. You must not only abstract, but you must also simplify the underlying bits of the system you are covering with the abstraction. This is a point we often miss.
Which returns us to our original question. The first answer to the question is this: minimize.
Minimize the number of technologies you are using. Of course, minimization is not so … simple … because it is a series of tradeoffs. You can minimize the number of protocols you are using to build the network, or you can minimize the number of things you are using each protocol for. This is why you layer things, which helps you understand how and where to modularize, focusing different components on different purposes, and then thinking about how those components interact. Ultimately, what you want is precisely the number of modules required to do the job to a specific level of efficiency, and not one module more (or less).
Minimize the kinds of “things” you are using. Try to use one data center topology, one campus topology, one regional topology, etc. Try to use one kind of device (whether virtual or physical) in each “role.” Try to reduce the number of “roles” in the network.
Think of everything, from protocols to “places,” as “modules,” and then try to reduce the number of modules. Modules should be chosen for repeatability, functional division, and optimal abstraction.
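One way to start is simply to count the kinds of things you have. The sketch below is a minimal illustration, not a real tool: the inventory, the device names, the role names, and the model names are all hypothetical, and it assumes you can export a flat list of devices tagged with a role and a hardware model. It reports which roles are filled by more than one kind of device, which are the first candidates for consolidation.

```python
from collections import defaultdict

# Hypothetical inventory: (device name, role, hardware model, topology).
# None of these names come from a real network; they are placeholders.
INVENTORY = [
    ("dc1-leaf1", "leaf", "model-a", "clos"),
    ("dc1-leaf2", "leaf", "model-a", "clos"),
    ("dc2-leaf1", "leaf", "model-b", "clos"),
    ("dc1-spine1", "spine", "model-c", "clos"),
    ("cmp-core1", "core", "model-c", "ring"),
]


def kinds_per_role(inventory):
    """Map each role to the set of distinct hardware models filling it."""
    kinds = defaultdict(set)
    for _name, role, model, _topology in inventory:
        kinds[role].add(model)
    return dict(kinds)


def roles_to_consolidate(inventory):
    """Return the roles (sorted) that are filled by more than one model."""
    return sorted(
        role
        for role, models in kinds_per_role(inventory).items()
        if len(models) > 1
    )


if __name__ == "__main__":
    for role, models in sorted(kinds_per_role(INVENTORY).items()):
        print(f"{role}: {len(models)} kind(s) -> {sorted(models)}")
    print("consolidation candidates:", roles_to_consolidate(INVENTORY))
```

The same counting works for topologies per "place" or protocols per function; the point is only to make the number of kinds visible so you can decide, deliberately, which ones to keep.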
The second answer to the original question is: architecture should move slowly, components quickly.
The architecture is not the network, nor even the combination of all the modules.
Think of a building. Every building has bathrooms (I assume). All those bathrooms have sinks (I assume). The sinks need to fit the style of the building. The number of sinks needs to match the needs of the building overall. But the sinks can change rapidly, and in response to the changing architecture of the building, while the building, its purpose, and its style change much more slowly. Architecture should change slowly, components more rapidly.
This is another reason to create modules: each module can change as needed, but the architecture of the overall system needs to change more slowly and intentionally. Thinking in systemic terms helps differentiate between the architecture and the components. Each component should fit within the overall architecture, and each component should play a role in shaping the architecture. Does the organization you support rely on deep internal communication across a wide geographic area? Or does it rely on lots of smaller external communications across a narrow geographic area? The style of communication in your organization makes a huge difference in the way the network is built, just like a school or hospital has different needs in terms of sinks than a shopping mall.
So these are, at least, two rules for simplification you can start thinking about how to apply in practical ways: modularize, choose modules carefully, reduce the number of the kinds of modules, and think about what things need to change quickly and what things need to change slowly.
Throwing abstraction at the problem does not, ultimately, solve it. Abstraction must be combined with a lot of thinking about what you are abstracting and why.
This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:
Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)
In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.
Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.
But is either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity: is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs to consider here, as it turns out. It just sometimes takes a little out of the box thinking to find them.
What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.
Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.
What about routing protocols? The standards communities seem focused on creating and maintaining a small handful of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol, because there is enough to know in this one protocol to spend a lifetime learning it.
In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.
The way to prevent landing on either extreme (a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, or a world where every device is uniquely designed to fit its environment, and no other device will do) is to consider the tradeoffs.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.
Maybe my excuse should be that it was somewhere around two in the morning. Or maybe it was just unclear thinking, and that was that. Sgt P. and I were called out to fix the AN/FPS-77 RADAR system just at the end of our day (I normally came into the shop around 6:30AM after swimming a mile in the Ft. Dix pool, showering, and eating breakfast, so I truly had an early start), so we’d been fighting this problem for some seven or eight hours already. For some reason, a particular fuse down in the high voltage power supply kept blowing. Given this is the circuit that fed the magnetron with 250,000 volts at around 10 amps (yes, that’s a lot of power, especially for a device originally built in 1964), it made for some interesting discussion with the folks in base weather, who were thus dependent on surrounding weather RADAR systems to continue flight operations.
They weren’t happy.
We traced the problem back, using our best half splitting skills in a high voltage circuit that took minutes to power up and down, and finally decided it was a particular resistor located over on a corner of one assembly (we had boards back then, but this particular power supply was actually built on a small metal cage). We ordered another one and went to our respective houses, to sleep.
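For anyone who has not used it, half splitting is essentially a binary search along a signal chain: measure at the midpoint, decide which half the fault is in, and repeat. The sketch below is a minimal illustration under stated assumptions, not a description of the actual power supply. The stage names are hypothetical, there is assumed to be exactly one fault, and every stage upstream of the fault is assumed to measure good while the faulty stage and everything downstream measure bad.

```python
# Hypothetical stages in a linear chain, in signal order. These names are
# placeholders for illustration, not the real AN/FPS-77 power supply layout.
STAGES = [
    "mains input", "transformer", "rectifier", "filter",
    "regulator", "bleeder resistor", "pulse network", "output",
]


def half_split(stages, output_is_good):
    """Find the first faulty stage by repeatedly splitting the chain in half.

    output_is_good(i) stands in for taking a measurement at the output of
    stage i. Assumes a single fault: stages before it measure good, and the
    faulty stage and everything after it measure bad.
    Returns (index of the faulty stage, number of measurements taken).
    """
    low, high = 0, len(stages) - 1
    measurements = 0
    while low < high:
        mid = (low + high) // 2
        measurements += 1
        if output_is_good(mid):
            low = mid + 1   # fault is downstream of this test point
        else:
            high = mid      # fault is at or upstream of this test point
    return low, measurements


if __name__ == "__main__":
    faulty = 5  # pretend the sixth stage is the bad one
    index, count = half_split(STAGES, lambda i: i < faulty)
    print(f"fault at '{STAGES[index]}' after {count} measurements")
```

The method only works when every measurement is right and every step stays inside the remaining half. One bad reading sends the search confidently into the wrong half of the circuit.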
The next morning, I zoomed back over to the shop — skipping my morning swim, of course — and installed the part. Power on, and… the fuse blew. I should have seen that coming, right? In the midst of the storm, we’d totally jumped outside the half split, measured something wrong, and ended up fingering the wrong component.
Back to square one. What happened? We were looking for facts that would guide us to the right component. But the facts, while interesting, were ultimately irrelevant.
It’s not what we knew that led us wrong, it’s what we didn’t know. But at two in the morning, desperate to get the station chief off our backs, and desperate to get test equipment shelved and to crawl into a warm bed, we started looking at what we knew, rather than what we didn’t know. Rather than seeking out what we didn’t know, we started thinking, “well, if this is true, and that is true, then this over here must be true.”
Fish often says that troubleshooting is like playing detective — and she’s right. The key problem in troubleshooting (and engineering in general, in fact), is that we often tend to end up watching the show rather than being the detective. If you really watch any detective show (and I’ve watched hundreds, as it’s just about the only sort of on-screen entertainment I will watch), you’ll discover one interesting thing. The twist is dependent on getting you to focus on one set of facts so you’ll jump to a conclusion about who committed the crime.
But the story is carefully set up so one more fact will change the entire face of the mystery. There’s even a Scooby Doo that plays on this: they get to the end, the part where Fred pulls the monster mask off the perpetrator of some heinous crime, and it’s someone who’s not even been in the show up to this point. Velma screams about how unfair this is, how it’s just not right for someone they hadn’t even met to be the perpetrator, etc.
There’s a reality behind this, though. The facts, while interesting, are irrelevant. What’s relevant is what you don’t know. From design to troubleshooting, the entire point is to find out what you don’t know, not to focus on what you do know.