Letting go of Clean Design
What is the best way to build a large-scale network—in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possible can everywhere you can. Or they might say use hierarchical design. These answers are, however, variants of the single word: abstraction.
This response came to mind when I was reading an article on clean code this last week (it’s amazing how often software architecture overlaps with network architecture):
I have been teaching network design for many, many years. I co-authored my first book on network design, Advanced IP Network Design, with Don Slice and Alvaro Retana; it was published in 1999, and it typically takes about a year to write a book, so we probably started working on it in the middle of 1998. The entire object that book was to teach hierarchical network design, which relies on modularization through aggregation and summarization to separate complexity from complexity (though I didn’t really use this wording until many years later) in order to break up failure domains.
It has been twenty-two years since Don, Alvaro, and I wrote that book—and hierarchical network design is still as relevant today as it was then. But in the last 22 years, I think I’ve learned just a little more about network design.
Among the things I’ve picked up in that 22 years is this one: if you haven’t found the tradeoffs, you haven’t looked hard enough. Or perhaps there is no such thing as a free lunch. Abstraction is a superpower, and it can make your network a lot cleaner, even when you’re using it correctly (not using it to paper over complexity). But building the perfectly clean network can mean reducing the agility of the design to the point of fragility. For instance, in the article linked above, Dan Abramov notes changing requirements made his “clean revision” of the code much more complex—a classic sign of fragility.
Perhaps an example would be helpful here. If you think of RIP as a link state protocol with summarization (abstraction of topology) at every hop, given you understand how link state and distance-vector protocols work, you can probably quickly grasp what you have gained by summarizing at every hop—and what you have lost.
You should still use abstraction to break up failure domains. You should still use abstraction to separate complexity from complexity. But you should use abstraction like you would any other tool. You should decide the best places and times to use abstraction after understanding the whole system.
For instance—a lot of people really insist on aggregating routing information in their data center fabric, especially in the underlay control plane. Why? The underlay is a constrained routing domain with known properties. Aggregation in this environment can cause routing black holes and unpredictable traffic flow behavior—both of which require added complexity to “work around.” If there is another solution available, it might be best to use it.
At the same time, I see a lot of people insisting BGP is the only option for data center underlays, or that it is the simplest option because you can use a single protocol for the underlay and overlay. This, in my opinion, is wrong, as well—because it does not properly separate two different parts of the network, each with their own purpose, into separate failure domains.
Rather than looking at a network and saying, “we can abstract here, so we should abstract here,” you should look at a network and say, “what are the modules here, and what purposes do they serve?” Once you know that, you can start thinking about when and were abstraction makes sense.
To paraphrase Dan, don’t be a clean network design zealot. Clean network design is not a goal. It’s a good guide when you don’t understand the network; such guides are often useful, but they are guides rather than rules.
Hello,
I could read this kind of blog post on fundamental design concepts all day long, and even more when it’s applied to Datacenter networks, thank you Russ !
On the part when you write about aggregation in DC fabric, you assume this is a Clos fabric with no crosslinks between spines, right ?
Speaking of Clos fabric, it reminds me of something that made me think a lot since I read about the basic models on your last book (Computer Networking Problems and Solutions).
Those basic models are listed as : rings, meshes & triangles.
Which basic model would you apply to define a clos fabric ? Is it some kind of partial mesh ? An incomplete triangle ? A hub&spoke (which also looks like some kind of partial mesh?) ? Is partial mesh an actual basic model with its own properties (since it’s not exactly inheriting full meshes properties, nor triangle ones) ?
Thank you very much again !
/bows