In Systemantics: How Systems Really Work and How They Fail, John Gall says:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.
In the software development world, this is called Gall’s Law (even though Gall himself never calls it a law) and is applied to organizations and software systems. How does this apply to network design and engineering? The best place to begin in answering this question is to understand what, precisely, Gall is arguing for; there is more here than what is visible on the surface.
What does a simple system mean? It is, first of all, an argument for underspecification. This runs counter to the way we instinctively want to design systems. We want to begin by discovering all the requirements (problems to be solved and constraints), and then move into an orderly discussion of all the possible solutions and sets of solutions, and then into an orderly discussion of an overall architecture, then into a nice UML chart showing all the interaction surfaces and how they work, and … finally … into building or buying the individual components.
This is beautiful on paper, but it does not often work in real life. What Gall is arguing for is building a small, simple system first that only solves some subset of the problems. Once that is done, add onto the core system until you get to a solution that solves the problem set. The initial system, then, needs to be underspecified. The specification for the initial system does not need to be “complete;” it just needs to be “good enough to get started.”
If this sounds something like agile development, that’s because it is something like agile development.
This is also the kind of thinking that has been discussed on the history of networking series (listen to this episode on the origins of DNS with Paul Mockapetris as an example). There are a number of positive aspects to this way of building systems. First, you solve problems in small enough chunks to see real progress. Second, as you solve each problem (or part of the problem), you are creating a useable system that can be deployed and tested and solves a specific problem. Third, you are more likely to “naturally” modularize a system if you build it in pieces. Once some smaller piece is in production, it is almost always going to be easier to build another small piece than to try to add new functionality and deploy the result.
How can this be applied to network design and operations?
The most obvious answer is to build the network in chunks, starting with the simple things first. For instance, if you are testing a new network design, focus on building just a campus or data center fabric, rather than trying to replace the entire existing network with a new one. This use of modularization can be extended to use cases beyond topologies within the network, however. You could allow multiple overlays to co-exist, each one solving a specific problem, in the data center.
This latter example, however—multiple overlays—shows how and where this kind of strategy can go wrong. In building multiple overlays you might be tempted to build multiple kinds of overlays by using different kinds of control planes, or different kinds of transport protocols. This kind of ad-hoc building can fit well within the agile mindset, but it can result in a system that is flat-out unmaintainable. I have been in two-day meetings where the agenda was just to go over every network management-related application currently deployed in the network. A printed copy of the spreadsheet, one tool per line, came out to tens of pages. This is agile gone wildly wrong, driving unnecessary complexity.
Another problem with this kind of development model, particularly in network engineering, is it is easy to ignore lateral interaction surfaces, particularly among modules that do not seem to interact. For instance, IS-IS and BGP are both control planes, and hence seem to fit at the same “layer” in the design. Since they are lateral modules, each one providing different kinds of reachability information, it is easy to forget they also interact with one another.
Gall’s law, like all laws in the engineering world, can be a good rule of thumb—so long as you keep up a system-level view of the network, and maintain discipline around a basic set of rules (such as “don’t use different kinds of overlays, even if you use multiple overlays”).
Replace “software” with “network,” and think about it. How often do network engineers select the chassis-based system that promises to “never need to be replaced?” How often do we build networks like they will be “in use” 20+ years from now? Now it does happen from time to time; I have heard of devices with many years of uptime, for instance. I have worked on AT&T Brouters in production—essentially a Cisco AGS+ rebranded and resold by AT&T—that were some ten or fifteen years old even back when I worked on them. These things certainly happen, and sometimes they even happen for good reasons.
But knowing such things happen and planning for such things to happen are two different mindsets. At least some of the complexity in networks comes from just this sort of “must make it permanent” thinking:
Many developers like to write code which handles any problem which might appear at any point in the future. In that regard, they are fortune tellers, trying to find a solution for eventual problems. This can work out very well if their predictions are right. Most of the time, however, this flexibility only causes unneeded complexity in the code which gets in the way and does not actually solve any problems. This is not surprising as telling the future is a messy business.
Let’s refactor: many network engineers like to build networks that can handle any problem or application that might appear at any point in the future. I know I’m close to the truth, because I’ve been working on networks since the mid- to late 1980s.
So now you are reading this and thinking: “but it is important to plan for the future.” You are not wrong—but there is a balance that often is not well thought out. You should not build for the immediate problem ignoring the future; down this path leads technical debt. You should not plan for the distant future, because this injects complexity that does not need to be there.
How do you find the balance? The place to begin is knowing how things work, rather than just how to make them work. If you know how and why things work, then you can see what things might last for a long time, and what might change quickly.
When you are designing a protocol, does it make sense to use TLVs rather than fixed length fields? Protocols last for 20+ years and are used across many different network devices. Protocols are often extended to solve new problems, rather than being replaced wholesale. Hence, it makes sense to use TLVs.
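To make the extensibility argument concrete, here is a minimal sketch in Python of TLV encoding and decoding (one-octet type and length fields, roughly in the style of IS-IS; the field sizes and type numbers are illustrative assumptions). A receiver that does not understand a new type can still use the length field to skip past it, which is exactly why TLV-based protocols can be extended for decades without flag days:

```python
import struct

def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    # One-octet type, one-octet length, then the value itself
    return struct.pack("!BB", tlv_type, len(value)) + value

def decode_tlvs(data: bytes):
    """Yield (type, value) pairs; the length lets us skip unknown types."""
    offset = 0
    while offset + 2 <= len(data):
        tlv_type, length = struct.unpack_from("!BB", data, offset)
        offset += 2
        yield tlv_type, data[offset:offset + length]
        offset += length

# A receiver that only understands type 1 still parses cleanly past
# the (hypothetical) future type 99 it has never seen before.
packet = encode_tlv(1, b"known") + encode_tlv(99, b"future extension")
known = [value for tlv_type, value in decode_tlvs(packet) if tlv_type == 1]
```

A fixed-length format, by contrast, would force every implementation in the field to change the moment a new field is added.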
When you are designing a data center or campus network, does it make sense to purchase chassis boxes that are twice as large as you foresee needing over the next three years to future proof the design? Hardware changes are likely to make a device more than three years old easier to replace than upgrade—if you can even get the parts you need in three years. Hence, it makes more sense to plan for the immediate future and leave the crystal ball gazing to someone else.
If you haven’t found the tradeoffs, then you haven’t looked hard enough.
But to look hard enough, you need to go beyond the hype and “future proofing,” beyond how to make things work. You need to ask how and why things work the way they do so you know where to accept complexity to make the design more flexible, and where to limit complexity by planning for today.
Information technology design often follows a common sort of process. First you gather the business requirements. While you do your best to gather all the requirements, you know you will miss some, so you assume the project in hand will change over time. As such, you build some “slop” into the schedule and scaling numbers to account for possible changes, based on prior experience.
Over at the ACM blog, there is a terrific article about software design that has direct application to network design and architecture.
What do monkeys and clubs have to do with software or network design? The primary point of interaction is security. The club you intend to make your network operator’s life easier is also a club an attacker can use to break into your network, or damage its operation. Clubs are just that way. If you think of the collection of tools as not just tools, but also as an attack surface, you can immediately see the correlation between the available tools and the attack surface. One way to increase security is to reduce the attack surface, and one way to reduce the attack surface is to reduce the number of tools—or the size of the club.
The best way to reduce the attack surface of a piece of software is to remove any unnecessary code.
Consider this: the components of any network are actually made up of code. So to translate this to the network engineering world, you can say:
The best way to reduce the attack surface of a network is to remove any unnecessary components.
What kinds of components? Routing protocols, transport protocols, and quality of service mechanisms come immediately to mind, but the number and kind of overlays, the number and kind of virtual networks might be further examples.
There is another issue here that is not security related specifically, but rather resilience related. When you think about network failures, you probably think of bugs in the code, failed connectors, failed hardware, and other such causes. The reality is far different, however—the primary cause of network failures in real life is probably user error in the form of misconfiguration (or misconfiguration spread across a thousand routers through the wonders of DevOps!). The Mean Time Between Mistakes (MTBM) is a much larger deal than most realize. Giving the operator too many knobs to solve a single problem is the equivalent of giving the monkey a club.
Simplicity in network design has many advantages—including giving the monkey a smaller club.
What are your thoughts on how network design itself can be automated and validated? Also, with intent-based networking, at some stage the network should be able to look into itself and adjust to meet design goals or best practices, or alternatively suggest the design itself in a greenfield situation. APSTRA seems to be moving in this direction.
The answer to this question, as always, is—how many balloons fit in a bag? 🙂 I think it depends on what you mean when you use the term design. If we are talking about the overlay, or traffic engineering, or even quality of service, I think we will see a rising trend towards using machine learning in network environments to help solve those problems. I am not convinced machine learning can solve these problems, in the sense of leaving humans out of the loop, but humans could set the parameters up, let the neural network learn the flows, and then let the machine adjust things over time. I tend to think this kind of work will be pretty narrow for a long time to come.
There will be stumbling blocks here that need to be solved. For instance, if you introduce a new application into the network, do you need to re-teach the machine learning network? Or can you somehow make some adjustments? Or are you willing to let the new application underperform while the neural network adjusts? There are no clear answers to these questions, and yet we are going to need clear answers to them before we can really start counting on machine learning in this way.
If, on the other hand, you think of design as figuring out what the network topology should look like in the first place, or what kind of bandwidth you might need to build into the physical topology and where, I think machine learning can provide hints, but it is not going to be able to “design” a network in this way. There is too much intent involved here. For instance, in your original question, you noted the network can “look into itself” and “make adjustments” to better “meet the original design goals.” I’m not certain those “original design goals” are ever going to come from machine learning.
If this sounds like a wishy-washy answer, that’s because it is, in the end… It is always hard to make predictions of this kind—I’m just working off of what I know of machine learning today, compared to what I understand of the multi-variable problem of network design, which is then mushed into the almost infinite possibilities of business requirements.
“No, I wouldn’t do that, it will make the failure domain too large…”
“We need to divide this failure domain up…”
Okay, great—we all know we need to use failure domains, because without them our networks will be unstable, too complex, and all that stuff, right? But what, precisely, is a failure domain? It seems to have something to do with aggregation, because just about every network design book in the world says things like, “aggregating routes breaks up failure domains.” It also seems to have something to do with flooding domains in link state protocols, because we’re often informed that you need to put in flooding domain boundaries to break up large failure domains. Maybe these two things contain a clue: what is common between flooding domain boundaries and aggregating reachability information?
But how does hiding information create failure domain boundaries?
If Router B is aggregating 2001:db8:0:1::/64 and 2001:db8:0:2::/64 to 2001:db8::/61, then changes in the more specific routes will be hidden from Router A. This hiding of information means a failure of one of these two more specific routes does not cause Router A to recalculate what it knows about reachability in the network. Hence a failure at 2001:db8:0:1::/64 doesn’t impact Router A—which means Router A is in a different failure domain than 2001:db8:0:1::/64. Based on this, we can venture a simple definition:
A failure domain is any group of devices that will share state when the network topology changes.
This definition doesn’t seem to work all the time, though. For example, what if the metric of the 2001:db8::/61 aggregate at Router B depends on the highest-cost more specific among the routes it covers (or hides)? If the aggregate metric is taken from the 2001:db8:0:1::/64 route attached to Router C, then when that link fails, the aggregate cost will also change, and Router A will need to recalculate reachability. This situation, however, doesn’t change our definition of what a failure domain is; it just alerts us that failure domains can “leak” information if they’re not constructed carefully. In fact, we can trace this back to the law of leaky abstractions—hiding information is just a form of abstraction, and all abstractions leak information in some way to at least one other subsystem within the larger system.
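This leak is easy to demonstrate. Below is a minimal sketch in Python using the prefixes from the example; the metric values, and the rule that the aggregate inherits its highest-cost component, are assumptions about how such an aggregate might be configured:

```python
import ipaddress

aggregate = ipaddress.ip_network("2001:db8::/61")
more_specifics = {
    ipaddress.ip_network("2001:db8:0:1::/64"): 20,  # via Router C (assumed metric)
    ipaddress.ip_network("2001:db8:0:2::/64"): 10,  # assumed metric
}

# The aggregate covers (hides) both more specifics from Router A
assert all(net.subnet_of(aggregate) for net in more_specifics)

def aggregate_metric(components):
    """Assumed rule: the aggregate takes the metric of its highest-cost component."""
    return max(components.values())

metric_before = aggregate_metric(more_specifics)  # 20
# The 2001:db8:0:1::/64 link at Router C fails...
del more_specifics[ipaddress.ip_network("2001:db8:0:1::/64")]
metric_after = aggregate_metric(more_specifics)   # 10
```

Even though the more specific route never appears in Router A’s table, its failure changes the aggregate’s metric, so state has leaked across the failure domain boundary.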
Another, harder, example might be that of the flooding domain boundary in a link state protocol. Assume, for a moment, that Router A is in Level 2, Routers C and D are in Level 1, and Router B is in both Level 1 and Level 2. Further assume no route aggregation is taking place. What will happen when 2001:db8:0:1::/64 fails? As Router B is advertising 2001:db8:0:1::/64 as if it were directly connected, Router A will see the destination disappear, but it will not see the network topology change. The state of the topology seems to be in one failure domain, while the state of reachability seems to be in another, overlapping, failure domain. This appearance is, in fact, a reflection of reality. Failure domains can—and do—overlap in this way all the time. There’s nothing wrong with overlapping failure domains, so long as you recognize they exist, and therefore you actually look (and plan) for them.
Finally, consider what happens if some link attached to Router A fails. Unless routes are being intentionally leaked into the Level 1 flooding domain at Router B, Router C won’t see any changes to the network, either in topology or reachability. After all, Router C is just depending on Router B’s attached bit to build a default route it uses to reach any destination outside the local flooding domain. This means failure domains can be asymmetric. What breaks a failure domain for one router doesn’t always break it for another. Again, this is okay, so long as you’re aware of this situation, and recognize it when and where it happens.
So given these caveats, the definition of a failure domain above seems to work well. We can refine it a little, but the general idea of a failure domain as a set of devices that will (or must) react to a change in the state of the network is a good place to start.
So far, in our investigation of the design mindset, we’ve—
- Observed, specifically asking what, applying questions about state, surface, and optimization in our examination of the network as it’s actually deployed.
- Oriented, asking why, really focusing in on the questions around what we’re optimizing, and how that drives state and surface in the design.
- Decided by matching technology to requirement, and making certain we really think through and justify our decisions.
We also considered the problem of interaction surfaces in some detail along the way. This week I want to wrap this little series up by considering the final step in design, act. Yes, you finally get to actually buy some stuff, rack it up, cable it, and then get to the fine joys of configuring it all up to see if it works. But before you do… A couple of points to consider.
It’s important, when acting, to do more than just, well, act. It’s right at this point that it’s important to be metacognitive—to think about what we’re thinking about. Or, perhaps, to consider the process of what we’re doing as much as actually doing it. To give you two specific instances…
First, when you’re out there configuring all that new stuff you’ve been unpacking, racking/stacking, and cabling, are you thinking about how to automate what you’re doing? If you have to do it more than once, it’s probably a candidate for at least thinking about automating. If you have to do it several hundred times, you should have spent that time automating it in the first place. But don’t think only about automation—there’s nothing wrong with modifying your environment to make your work faster and more efficient. I have sets of customized tools, macros, and workflows I’ve built in common software like MS Word and Corel Draw that I’ve used, modified, and carried from version to version over the years. It might take me several hours to build a new ribbon in a word processor, or to write a short script that does something simple and specific—but spending that time, more often than not, pays itself back many times over as I move through getting things done.
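As a small illustration of the “if you do it more than once, script it” point, here is a minimal sketch in Python that renders interface configuration stanzas from a template. The CLI syntax, interface names, and addresses are all illustrative assumptions, not any particular vendor’s:

```python
from string import Template

# Hypothetical interface stanza; the syntax here is illustrative only
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $desc\n"
    " ip address $addr\n"
)

def render_interfaces(interfaces):
    """Render one config stanza per interface instead of typing each by hand."""
    return "\n".join(
        INTERFACE_TEMPLATE.substitute(**intf) for intf in interfaces
    )

config = render_interfaces([
    {"name": "Ethernet1", "desc": "uplink to spine1", "addr": "10.0.0.1/31"},
    {"name": "Ethernet2", "desc": "uplink to spine2", "addr": "10.0.0.3/31"},
])
```

Typing two stanzas by hand is faster than writing this script; typing two hundred is not, and the template also guarantees every stanza is consistent.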
In other words, there is more to acting than just acting. You need to observe what you’re doing, describe it as a process, and then treat it as a process. As Deming once said—If you can’t describe what you are doing as a process, you don’t know what you’re doing.
Second, are you really thinking about what you’ll need to measure for the next round of observation? This is a huge problem in our data driven world—
Being data driven is important, but we can get so lost in doing what we’re doing that we forget what we actually set out to do. We get caught up in the school of fish, and lose sight of the porpoise. Remember this: when you’re acting, always think about what you’re going to be doing next, which is observing. As you act, think about what you’re going to need to observe, and why.