According to Maor Rudick, in a recent post over at Cloud Native, programming is 10% writing code and 90% understanding why it doesn’t work. This expresses the art of deploying network protocols, security, or anything that needs thought about where and how. I’m not just talking about the configuration, either—why was this filter deployed here rather than there? Why was this BGP community used rather than that one? Why was this aggregation range used rather than some other? Even in a fully automated world, the saying holds true.
So how can you improve the understandability of your network design? Maor defines understandability as “the dev who creates the software is to effortlessly … comprehend what is happening in it.” Continuing—“the more understandable a system is, the easier it becomes for the developers who created it to change it in a way that is safe and predictable.” What are the elements of understandability?
Documentation must be complete, clear, concise, and organized. The two primary failings I encounter in documentation are completeness and organization. Why something is done, when it was last changed, and why it was changed are often missing. The person making the change just assumes “I’ll remember this, or someone will figure it out.” You won’t, and they won’t. Concise is the “other side” of complete … Recording unsubstantial changes just adds information that won’t ever be needed. You have to balance between enough and too much, of course.
Organization is another entire problem in documentation—most people have a favorite way to organize things. When you get a team of people all organizing things based on their favorite way, you end up with a mess. Going back in time … I remember that just about everyone who was assigned to the METNAV shop began their time by re-organizing the tools. Each time the re-organization made things so much easier to find, and improved the MTTR for the airfield equipment we supported … After a while, you’d think someone would ask, “Does re-organizing all the tools every year really help? Or are you just making stuff up for new folks to do?”
Moving beyond documentation, what else can we do to make our networks more understandable?
First, we can focus on actually making networks simpler. I don’t mean just glossing things over with a pretty GUI, or automating thousands of lines of configuration using Python. I mean taking steps by using protocols that are simpler to run, require less configuration, and produce more information you can use for troubleshooting—choose something like IS-IS for your DC fabric underlay rather than BGP, unless you really have several hundred thousand of underlay destinations (hint, if you’ve properly separated “customer” routes in the overlay from “infrastructure” routes in the underlay, you shouldn’t have this kind of routing tangle in the underlay anyway).
What about having multiple protocols that do the same job? Do you really need three or four routing protocols, four or five tunneling protocols, and five or six … well, you get the idea. Reducing the sheer number of protocols running in your network can make a huge difference in the tooling troubleshooting time. What about having four or five kinds of boxes in your network that fulfill the same role? Okay—so maybe you have three DC fabrics, and you run each one using a different vendor. But is there is any reason to have three DC fabrics, each of which has a broad mix of equipment from five different vendors? I doubt it.
Second, you can think about what you would measure in the case of failure, how you would measure it, and put the basic piece in place in the design phase to do those measurements. Don’t wait until you need the data to figure out how to get at it, and what the performance results of trying to get it are going to be.
Third, you can think about where you put policy in your network. There is no “right” answer to this question, other than … be consistent. The first option is to put all your policy in one place—say, on the devices that connect the core to the aggregation, or the devices in the distribution layer. The second option is to always put the policy as close to the source or destination of the traffic impacted by the policy. In a DC fabric, you should always put policy and external connectivity in the T0 or ToR, never in the spine (it’s not a core, it’s a spine).
Maybe you have other ideas on how to improve understandability in networks … If you do, get in touch and let’s talk about it. I’m always looking for practical ways to make networks more understandable.