Gall’s Law and the Network

In Systemantics: How Systems Really Work and How They Fail, John Gall says:

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

In the software development world, this is called Gall’s Law (even though Gall himself never calls it a law) and is applied to organizations and software systems. How does this apply to network design and engineering? The best place to begin in answering this question is to understand what, precisely, Gall is arguing for; there is more here than what is visible on the surface.

What does it mean to start with a simple system? It is, first of all, an argument for underspecification. This runs counter to the way we instinctively want to design systems. We want to begin by discovering all the requirements (problems to be solved and constraints), then move into an orderly discussion of all the possible solutions and sets of solutions, then into an orderly discussion of an overall architecture, then into a nice UML chart showing all the interaction surfaces and how they work, and … finally … into building or buying the individual components.

This is beautiful on paper, but it does not often work in real life. What Gall is arguing for is building a small, simple system first that only solves some subset of the problems. Once that is done, add onto the core system until you get to a solution that solves the whole problem set. The initial system, then, needs to be underspecified. The specification for the initial system does not need to be “complete”; it just needs to be “good enough to get started.”

If this sounds something like agile development, that’s because it is something like agile development.

This is also the kind of thinking that has been discussed on the history of networking series (listen to this episode on the origins of DNS with Paul Mockapetris as an example). There are a number of positive aspects to this way of building systems. First, you solve problems in small enough chunks to see real progress. Second, as you solve each problem (or part of the problem), you are creating a usable system that can be deployed and tested and solves a specific problem. Third, you are more likely to “naturally” modularize a system if you build it in pieces. Once some smaller piece is in production, it is almost always going to be easier to build another small piece than to add new functionality to the existing piece and redeploy the result.

How can this be applied to network design and operations?

The most obvious answer is to build the network in chunks, starting with the simple things first. For instance, if you are testing a new network design, focus on building just a campus or data center fabric, rather than trying to replace the entire existing network with a new one. This use of modularization can be extended to use cases beyond topologies within the network, however. You could allow multiple overlays to co-exist, each one solving a specific problem, in the data center.

This latter example of multiple overlays, however, shows how and where this kind of strategy can go wrong. In building multiple overlays you might be tempted to build multiple kinds of overlays by using different kinds of control planes, or different kinds of transport protocols. This kind of ad-hoc building can fit well within the agile mindset but can result in a system that is flat-out unmaintainable. I have been in two-day meetings where the agenda was just to go over every network-management-related application currently deployed in the network. A printed copy of the spreadsheet, one tool per line, came out to tens of pages. This is agile gone wildly wrong, driving unnecessary complexity.
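One way to keep this kind of drift visible is to check the overlay inventory against the rule that multiple overlays should still be the same kind of overlay. What follows is only a minimal sketch in Python; the Overlay record, its fields, the find_overlay_drift function, and the sample overlay names are all hypothetical, and a real inventory would come from whatever source of truth the network already has.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlay:
    """Hypothetical inventory record for one overlay."""
    name: str
    control_plane: str   # for instance "bgp-evpn"
    encapsulation: str   # for instance "vxlan"


def find_overlay_drift(overlays):
    """Return the attributes on which the overlays disagree.

    The result maps each attribute name to {value: [overlay names]};
    an empty dict means every overlay is the same kind.
    """
    drift = {}
    for attribute in ("control_plane", "encapsulation"):
        by_value = defaultdict(list)
        for overlay in overlays:
            by_value[getattr(overlay, attribute)].append(overlay.name)
        if len(by_value) > 1:
            drift[attribute] = dict(by_value)
    return drift


# Illustrative data only: two overlays of one kind, one of another.
overlays = [
    Overlay("tenant-a", "bgp-evpn", "vxlan"),
    Overlay("tenant-b", "bgp-evpn", "vxlan"),
    Overlay("storage", "static", "gre"),
]
drift = find_overlay_drift(overlays)
for attribute, groups in drift.items():
    print(attribute, groups)
```

A check like this does not prevent anyone from adding a new overlay; it only makes the moment when a second kind of overlay appears hard to miss.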

Another problem with this kind of development model, particularly in network engineering, is that it is easy to ignore lateral interaction surfaces, particularly among modules that do not seem to interact. For instance, IS-IS and BGP are both control planes, and hence seem to fit at the same “layer” in the design. Since they are lateral modules, each one providing different kinds of reachability information, it is easy to forget they also interact with one another.
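One concrete instance of this interaction is next-hop resolution: a BGP route can only be used if its next hop is reachable, and in many designs that reachability comes from the IGP. The sketch below models this in Python under simplifying assumptions; the function names, the dictionary layout, and the documentation-range addresses are all made up for illustration, and a real router's resolution logic involves far more than a longest-prefix lookup.

```python
import ipaddress


def resolve_next_hop(next_hop, igp_prefixes):
    """Return the longest IGP prefix covering next_hop, or None."""
    address = ipaddress.ip_address(next_hop)
    networks = [ipaddress.ip_network(prefix) for prefix in igp_prefixes]
    matches = [network for network in networks if address in network]
    if not matches:
        return None
    return max(matches, key=lambda network: network.prefixlen)


def usable_bgp_routes(bgp_routes, igp_prefixes):
    """Keep only the BGP routes whose next hop resolves through the IGP."""
    return {
        prefix: next_hop
        for prefix, next_hop in bgp_routes.items()
        if resolve_next_hop(next_hop, igp_prefixes) is not None
    }


# Hypothetical state: loopbacks learned through IS-IS, prefixes through BGP.
isis_prefixes = ["192.0.2.1/32", "192.0.2.2/32"]
bgp_routes = {
    "198.51.100.0/24": "192.0.2.1",
    "203.0.113.0/24": "192.0.2.3",  # next hop not present in IS-IS
}

before = usable_bgp_routes(bgp_routes, isis_prefixes)
# Withdraw one loopback from IS-IS and the BGP route that depended on it goes too.
after = usable_bgp_routes(bgp_routes, ["192.0.2.2/32"])
print(before)
print(after)
```

Nothing in the BGP data changes between the two calls, yet the set of usable routes does, which is exactly the kind of lateral dependency that is easy to miss when each module is built and tested on its own.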

Gall’s Law, like all laws in the engineering world, can be a good rule of thumb, so long as you keep up a system-level view of the network and maintain discipline around a basic set of rules (such as “don’t use different kinds of overlays, even if you use multiple overlays”).