This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:
Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)
In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.
Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.
But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.
What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.
Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.
What about routing protocols? The standards communities seem focused on creating and maintaining a small handulf of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.
In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.
The way to prevent landing on either extreme—in a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment, and no other device will do—is consider the tradeoffs.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.
Maybe my excuse should be that it was somewhere around two in the morning. Or maybe it was just unclear thinking, and that was that. Sgt P. and I were called out to fix the AN/FPS-77 RADAR system just at the end of our day (I normally came into the shop around 6:30AM after swimming a mile in the Ft. Dix pool, showering, and eating breakfast, so I truly had an early start), so we’d been fighting this problem for some seven or eight hours already. For some reason, a particular fuse down in the high voltage power supply kept blowing. Given this is the circuit that fed the magnetron with 250,000 volts at around 10 amps (yes, that’s a lot of power, especially for a device originally built in 1964), it made for some interesting discussion with the folks in base weather, who were thus dependent on surrounding weather RADAR systems to continue flight operations.
They weren’t happy.
We traced the problem back, using our best half splitting skills in a high voltage circuit that took minutes to power up and down, and finally decided it was a particular resistor located over on a corner of one assembly (we had boards back then, but this particular power supply was actually built on a small metal cage. We ordered another one and went to our respective houses, to sleep.
The next morning, I zoomed back over to the shop — skipping my morning swim, of course — and installed the part. Power on, and… the fuse blew. I should have seen that coming, right? In the midst of the storm, we’d totally jumped outside the half split, measured something wrong, and ended up fingering the wrong component.
Back to square one. What happened? We were looking for facts that would guide us to the right component. But the facts, while interesting, were ultimately irrelevant.
It’s not what we knew that led us wrong, it’s what we didn’t know. But at two in the morning, desperate to get the station chief off our backs, and desperate to get test equipment shelved and the to crawl into a warm bed, we started looking at what we knew, rather than what we didn’t know. Rather than seeking out what we didn’t know, we started thinking, “well, if this is true, and that is true, then this over here must be true.”
Fish often says that troubleshooting is like playing detective — and she’s right. The key problem in troubleshooting (and engineering in general, in fact), is that we often tend to end up watching the show rather than being the detective. If you really watch any detective show (and I’ve watched hundreds, as it’s just about the only sort of on-screen entertainment I will watch), you’ll discover one interesting thing. The twist is dependent on getting you to focus on one set of facts so you’ll jump to a conclusion about who committed the crime.
But the story is carefully set up so one more fact will change the entire face of the mystery. There’s even a Scooby Doo that plays on this — they get to the end, the part where Fred pulls the monster mask off the perpetrator of some heinous crime, and it’s someone that’s not even been in the show up to this point. Thelma screams about how unfair this is, how it’s just not right for someone they hadn’t even met to be the perpetrator, etc.
There’s a reality behind this, though. The facts, while interesting, are irrelevant. What’s relevant is what you don’t know. From design to troubleshooting, the entire point is to find out what you don’t know, not to focus on what you do know.