The paper we are looking at in this post is tangential to the world of network engineering, rather than being directly targeted at network engineering. The thesis of On Understanding Software Agility—A Social Complexity Point of View, is that at least some elements of software development are a wicked problem, and hence need to be managed through complexity. The paper sets the following criteria for complexity—
- Interaction: made up of a lot of interacting systems
- Autonomy: subsystems are largely autonomous within specified bounds
- Emergence: global behavior is unpredictable, but can be explained in subsystem interactions
- Lack of equilibrium: events prevent the system from reaching a state of equilibrium
- Nonlinearity: small events cause large output changes
- Self-organization: self-organizing response to disruptive events
- Co-evolution: the system and its environment adapt to one another
It’s pretty clear network design and operation would fit into the 7 points made above; the control plane, transport protocols, the physical layer, hardware, and software are all subsystems of an overall system. Between these subsystems, there is clearly interaction, and each subsystem acts autonomously within bounds. The result is a set of systemic behaviors that cannot be predicted from examining the system itself. The network design process is, itself, also a complex system, just as software development is.
Trying to establish computing as an engineering discipline led people to believe that managing projects in computing is also an engineering discipline. Engineering is for the most part based on Newtonian mechanics and physics, especially in terms of causality. Events can be predicted, responses can be known in advance and planning and optimize for certain attributes is possible from the outset. Effectively, this reductionist approach assumes that the whole is the sum of the parts, and that parts can be replaced wherever and whenever necessary to address problems. This machine paradigm necessitates planning everything in advance because the machine does not think. This approach is fundamentally incapable of dealing with the complexity and change that actually happens in projects.
In a network, some simple input into one subsystem of the network can cause major changes in the overall system state. The question is: how should engineers deal with this situation? One solution is to try to nail each system down more precisely, such as building a “single source of truth,” and imposing that single view of the world onto the network. The theory is to change control, ultimately removing the lack of equilibrium, and hence reducing complexity. Or perhaps we can centralize the control plane, moving all the complexity into a single point in the network, making it manageable. Or maybe we can automate all the complexity out of the network, feeding the network “intent,” and having, as a result, a nice clean network experience.
Good luck slaying the complexity dragon in any of these ways. What, then, is the solution? According to Pelrine, the right solution is to replace our Newtonian view of software development with a different model. To make this shift, the author suggests moving to the complexity quadrant of the Cynefin framework of complexity.
The correct way to manage software development, according to Pelrine, is to use a probe/sense/respond model. You probe the software through testing iteratively as you build it, sensing the result, and then responding to the result through modification, etc.
Application to network design
The same process is actually used in the design of networks, once beyond the greenfield stage or initial deployment. Over time, different configurations are tested to solve specific problems, iteratively solving problems while also accruing complexity and ossifying. The problem with network designs is much the same as it with software projects—the resulting network is not ever “torn down,” and rethought from the ground up. The result seems to be that networks will become more complex over time, until they either fail, or they are replaced because the business fails, or some other major event occurs. There needs to be some way to combat this—but how?
The key counter to this process is modularization. By modularizing, which is containing complexity within bounds, you can build using the probe/sense/respond model. There will still be changes in the interaction between the modules within the system, but so long as there the “edges” are held constant, the complexity can be split into two domains: within the module and outside the module. Hence the rule of thumb: separate complexity from complexity.
The traditional hierarchical model, of course, provides one kind of separation by splitting control plane state into smaller chunks, potentially at the cost of optimal traffic flow (remember the state/optimization/surface trilemma here, as it describes the trade-offs). Another form of separation is virtualization, with the attendant costs of interaction surfaces and optimization. A third sort of separation is to split policy from reachability, which attempts to preserve optimization at the cost of interaction surfaces. Disaggregation provides yet another useful separation between complex systems, separating software from hardware, and (potentially) the control plane from the network operating system, and even the monitoring software from the network operating system.
These types of modularization can be used together, of course; topological regions of the network can be separated via control plane state choke points, while either virtualization or splitting policy from reachability can be used within a module to separate complexity from complexity within a module. The amount and kind of separations deployed is entirely dependent on specific requirements as well as the complexity of the overall network. The more complex the overall network is, the more kinds of separation that should be deployed to contain complexity into manageable chunks where possible.
Each module, then, can be replaced with a new one, so long as it provides the same set of services, and any changes in the edge are manageable. Each module can be developed iteratively, by making changes (probing), sensing (measuring the result), and then adjusting the module according to whether or not it fits the requirements. This part would involve using creative destruction (the chaos monkey) as a form of probing, to see how the module and system react to controlled failures.
Nice Theory, but So What?
This might all seem theoretical, but it is actually extremely practical. Getting out of the traditional model of network design, where the configuration is fixed, there is a single source of truth for the entire network, the control plane is tied to the software, the software is tied to the hardware, and policy is tied to the control plane, can open up new ways to build massive networks against very complex requirements while managing the complexity and the development and deployment processes. Shifting from a mindset of controlling complexity by nailing everything down to a single state, and to a mindset of managing complexity by finding logical separation points, and building in blocks, then “growing” each module using the appropriate process, whether iterative or waterfall.
Even is scale is not the goal of your network—you “only” have a couple of hundred network devices, say—these principles can still be applied. First, complexity is not really about scale; it is about requirements. A car is not really any less complex than a large truck, and a motor home (or camper) is likely more complex than either. The differences are not in scale, but in requirements. Second, these principles still apply to smaller networks; the primary question is which forms of separation to deploy, rather than whether complexity needs to be separated from complexity.
Moving to this kind of design model could revolutionize the thinking of the network engineering world.