Applying Software Agility to Network Design

The paper we are looking at in this post is tangential to network engineering rather than directly targeted at it. The thesis of Pelrine’s On Understanding Software Agility: A Social Complexity Point of View is that at least some elements of software development are a wicked problem, and hence need to be managed as a complexity problem. The paper sets the following criteria for complexity:

  • Interaction: made up of a lot of interacting systems
  • Autonomy: subsystems are largely autonomous within specified bounds
  • Emergence: global behavior is unpredictable, but can be explained in subsystem interactions
  • Lack of equilibrium: events prevent the system from reaching a state of equilibrium
  • Nonlinearity: small events cause large output changes
  • Self-organization: self-organizing response to disruptive events
  • Co-evolution: the system and its environment adapt to one another

It’s pretty clear network design and operation fit all seven of these criteria; the control plane, transport protocols, the physical layer, hardware, and software are all subsystems of an overall system. Between these subsystems, there is clearly interaction, and each subsystem acts autonomously within bounds. The result is a set of systemic behaviors that cannot be predicted from examining any one subsystem on its own. The network design process is itself a complex system, just as software development is.

Trying to establish computing as an engineering discipline led people to believe that managing projects in computing is also an engineering discipline. Engineering is for the most part based on Newtonian mechanics and physics, especially in terms of causality. Events can be predicted, responses can be known in advance, and planning and optimizing for certain attributes is possible from the outset. Effectively, this reductionist approach assumes that the whole is the sum of the parts, and that parts can be replaced wherever and whenever necessary to address problems. This machine paradigm necessitates planning everything in advance because the machine does not think. This approach is fundamentally incapable of dealing with the complexity and change that actually happens in projects.

In a network, some simple input into one subsystem can cause major changes in the overall system state. The question is: how should engineers deal with this situation? One solution is to try to nail each system down more precisely, such as by building a “single source of truth” and imposing that single view of the world onto the network. The theory is that controlling change will ultimately remove the lack of equilibrium, and hence reduce complexity. Or perhaps we can centralize the control plane, moving all the complexity into a single point in the network, making it manageable. Or maybe we can automate all the complexity out of the network, feeding the network “intent,” and having, as a result, a nice clean network experience.

Good luck slaying the complexity dragon in any of these ways. What, then, is the solution? According to Pelrine, the right solution is to replace our Newtonian view of software development with a different model. To make this shift, the author suggests moving to the complex domain of the Cynefin framework.

The correct way to manage software development, according to Pelrine, is to use a probe/sense/respond model. You probe the software by testing it iteratively as you build it, sense the result, and then respond to the result by modifying the software.

Application to Network Design

The same process is actually used in the design of networks once they are beyond the greenfield stage or initial deployment. Over time, different configurations are tested to solve specific problems, iteratively solving problems while also accruing complexity and ossifying. The problem with network designs is much the same as it is with software projects: the resulting network is never “torn down” and rethought from the ground up. The result seems to be that networks become more complex over time, until they fail, are replaced because the business fails, or some other major event occurs. There needs to be some way to combat this, but how?

The key counter to this process is modularization. By modularizing, which means containing complexity within bounds, you can build using the probe/sense/respond model. There will still be changes in the interaction between the modules within the system, but so long as the “edges” are held constant, the complexity can be split into two domains: within the module and outside the module. Hence the rule of thumb: separate complexity from complexity.

The traditional hierarchical model, of course, provides one kind of separation by splitting control plane state into smaller chunks, potentially at the cost of optimal traffic flow (remember the state/optimization/surface trilemma here, as it describes the trade-offs). Another form of separation is virtualization, with the attendant costs of interaction surfaces and optimization. A third sort of separation is to split policy from reachability, which attempts to preserve optimization at the cost of interaction surfaces. Disaggregation provides yet another useful separation between complex systems, separating software from hardware, and (potentially) the control plane from the network operating system, and even the monitoring software from the network operating system.
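As a rough illustration of the first kind of separation, the sketch below uses Python's standard `ipaddress` module to show how aggregating at a module edge shrinks the control plane state the rest of the network carries. The prefixes and the idea of a single "module" holding them are assumptions made up for this example; the sketch shows only the state side of the trade-off, not the possible loss of optimal paths.

```python
import ipaddress

# Hypothetical prefixes inside one module (for example, a single pod or region).
module_prefixes = [
    ipaddress.ip_network("10.1.0.0/26"),
    ipaddress.ip_network("10.1.0.64/26"),
    ipaddress.ip_network("10.1.0.128/26"),
    ipaddress.ip_network("10.1.0.192/26"),
]

# What the rest of the network would carry if the module edge aggregates
# these prefixes into the shortest covering set.
advertised = list(ipaddress.collapse_addresses(module_prefixes))

print(len(module_prefixes), "routes inside the module")
print(len(advertised), "route advertised outside:", advertised[0])
```

Anything that happens to an individual /26 inside the module is hidden behind the aggregate, which is what lets the inside and the outside change independently.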

These types of modularization can be used together, of course; topological regions of the network can be separated via control plane state choke points, while either virtualization or splitting policy from reachability can be used within a module to separate complexity from complexity. The number and kinds of separation deployed depend entirely on specific requirements and the complexity of the overall network. The more complex the overall network is, the more kinds of separation should be deployed to contain complexity in manageable chunks where possible.

Each module, then, can be replaced with a new one, so long as it provides the same set of services and any changes in the edge are manageable. Each module can be developed iteratively by probing (making changes), sensing (measuring the result), and responding (adjusting the module according to whether or not it fits the requirements). This would include using creative destruction (the chaos monkey) as a form of probing, to see how the module and system react to controlled failures.
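To make the loop concrete, here is a minimal sketch of probe/sense/respond applied to a single module. Everything in it is hypothetical: the `Module` class, the uplink counts, and the capacity requirement are invented for illustration, and a real probe would be a controlled failure in a lab or production network rather than a function call.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Hypothetical network module: a number of equal-capacity uplinks."""
    uplinks: int
    gbps_per_uplink: int = 100


def probe(module: Module) -> Module:
    """Controlled failure: return a copy of the module with one uplink removed."""
    return Module(max(module.uplinks - 1, 0), module.gbps_per_uplink)


def sense(degraded: Module) -> int:
    """Measure what survives the failure (here, remaining capacity in Gb/s)."""
    return degraded.uplinks * degraded.gbps_per_uplink


def respond(module: Module) -> Module:
    """Adjust the design: add one uplink."""
    return Module(module.uplinks + 1, module.gbps_per_uplink)


def iterate(module: Module, required_gbps: int, max_rounds: int = 10) -> Module:
    """Repeat probe/sense/respond until the module survives a single failure."""
    for _ in range(max_rounds):
        surviving = sense(probe(module))
        if surviving >= required_gbps:
            return module
        module = respond(module)
    raise RuntimeError("requirement not met within max_rounds")


design = iterate(Module(uplinks=2), required_gbps=300)
print(design)  # Module(uplinks=4, gbps_per_uplink=100)
```

The loop only touches the inside of the module; as long as the services offered at the edge stay the same, the rest of the network does not need to know how many rounds it took.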

Nice Theory, but So What?

This might all seem theoretical, but it is actually extremely practical. In the traditional model of network design, the configuration is fixed, there is a single source of truth for the entire network, the control plane is tied to the software, the software is tied to the hardware, and policy is tied to the control plane. Getting out of this model can open up new ways to build massive networks against very complex requirements while managing the complexity and the development and deployment processes. This means shifting from a mindset of controlling complexity by nailing everything down to a single state to a mindset of managing complexity by finding logical separation points, building in blocks, and then “growing” each module using the appropriate process, whether iterative or waterfall.

Even if scale is not the goal of your network (you “only” have a couple of hundred network devices, say), these principles can still be applied. First, complexity is not really about scale; it is about requirements. A car is not really any less complex than a large truck, and a motor home (or camper) is likely more complex than either. The differences are not in scale, but in requirements. Second, these principles still apply to smaller networks; the primary question is which forms of separation to deploy, rather than whether complexity needs to be separated from complexity.

Moving to this kind of design model could revolutionize the thinking of the network engineering world.

Liskov Substitution and Modularity in Network Design

Furthering the thoughts I’ve put into the forthcoming book on network complexity…

One of the hardest things for designers to wrap their heads around is the concept of unintended consequences. One of the definitional points of complexity in any design is the problem of “push button on right side, weird thing happens over on the left side, and there’s no apparent connection between the two.” This is often just a result of the complexity problem in its base form — the unsolvable triangle (fast/cheap/quality — choose two). The problem is that we often don’t see the third leg of the triangle.

The Liskov substitution principle is one of the mechanisms coders use to manage complexity in object-oriented design. The general idea is this: suppose I build an object that describes rectangles. This object can hold the width and the height of the rectangle, and it can return the area of the rectangle. Now, assume I build another object called “square” that inherits from the rectangle object, but forces the width and height to be the same (a square is a type of rectangle that has all equal sides, after all). This all seems perfectly normal, right?

Now let’s say I do this:

  • declare a new square object
  • set the width to 10
  • set the height to 5
  • read the area

What’s the answer going to be? Most likely 25 — because the order of operations set the height after the width, and internally the object sets the width and height to be equal, so the last value input into either field wins.
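The steps above can be sketched in a few lines of Python. The class and method names here are assumptions chosen for illustration; they are one plausible way to write the rectangle and square objects described, not the only one.

```python
class Rectangle:
    """Holds an independent width and height and reports the area."""

    def __init__(self, width=0, height=0):
        self._width = width
        self._height = height

    def set_width(self, width):
        self._width = width

    def set_height(self, height):
        self._height = height

    def area(self):
        return self._width * self._height


class Square(Rectangle):
    """Inherits from Rectangle but forces width and height to stay equal."""

    def set_width(self, width):
        self._width = self._height = width

    def set_height(self, height):
        self._width = self._height = height


def resize_and_measure(shape):
    """Code written against Rectangle: set width, then height, then read area."""
    shape.set_width(10)
    shape.set_height(5)
    return shape.area()


print(resize_and_measure(Rectangle()))  # 50
print(resize_and_measure(Square()))     # 25, the last value set wins
```

The same caller gets two different answers depending on which object it was handed, even though a square "is a" rectangle.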

What’s the problem? Isn’t this what I should expect? The confusion is this: the square class is based on the rectangle class, so which behavior wins? Code written for a rectangle expects width and height to be independent, so the result is pushing a button over here and ending up with an unexpected result over there. Taking this one step further, what if you modified the rectangle class to include depth, and then added a function that returns volume? A user might expect the square class to represent a perfectly formed cube (all sides equal), based on its behavior in the past, but that’s not what is going to happen. The solution, from a coding perspective, is to build a new class that underlies both the square and the rectangle: find a more fundamental construct, and use that as a foundation.
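One way to sketch that more fundamental construct is an abstract base that promises only what every shape can honor. The `Shape` name and the choice to expose nothing but `area()` are assumptions for this example; other foundations are possible.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Hypothetical foundation: the only promise is that a shape has an area."""

    @abstractmethod
    def area(self) -> float:
        ...


class Rectangle(Shape):
    def __init__(self, width: float, height: float):
        self.width = width
        self.height = height

    def area(self) -> float:
        return self.width * self.height


class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:
        return self.side ** 2


def total_area(shapes) -> float:
    """Depends only on the Shape contract, so any Shape can be substituted."""
    return sum(shape.area() for shape in shapes)


print(total_area([Rectangle(10, 5), Square(5)]))  # 75
```

Because `Square` no longer claims to be a `Rectangle`, nothing written against `Rectangle` can be surprised by it, and both remain interchangeable wherever only a `Shape` is required.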

In general, you want to find a foundation which will not change no matter what you build on it — in other words, you want to find a foundation that, when substituted for another foundation in the future, will not modify the objects sitting on top of the foundation.

Hopefully, you’ve tracked me this far. I know this is a bit abstract, but it comes back to network design in an important way. The simplest place to see this is in the data center, where you have an underlay and an overlay. To apply Liskov’s substitution principle here, you could say, “I want to build a physical underlay that will allow me to change it in the future without impacting the overlay.” Or, “I want to be able to change the overlay without impacting how the applications run on the fabric.” Now — take this concept and apply it to the entire network, wide area to data center fabric.

You should always strive to build a physical infrastructure that can be replaced without impacting the control plane. You should also strive to build a control plane that can be replaced without impacting the operation of the applications running on the network. Just like you should be able to replace the physical layer under IP, and not impact the operation of TCP on top in any meaningful way.
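The same idea can be sketched for an underlay and overlay. The code below is a toy model under heavy assumptions: the `Underlay` interface, the two fabric classes, and the single `reachable()` call are all invented to stand in for the much richer interaction surface a real network has.

```python
from typing import Protocol


class Underlay(Protocol):
    """The narrow interaction surface the overlay is allowed to depend on."""

    def reachable(self, src: str, dst: str) -> bool:
        ...


class SpineLeafFabric:
    """Hypothetical underlay: every attached leaf can reach every other leaf."""

    def __init__(self, leaves):
        self._leaves = set(leaves)

    def reachable(self, src: str, dst: str) -> bool:
        return src in self._leaves and dst in self._leaves


class RoutedRing:
    """Hypothetical replacement underlay with a different internal structure."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def reachable(self, src: str, dst: str) -> bool:
        return src in self._nodes and dst in self._nodes


class Overlay:
    """Builds tunnels using only what the Underlay interface promises."""

    def __init__(self, underlay: Underlay):
        self._underlay = underlay

    def build_tunnels(self, endpoints):
        return [
            (a, b)
            for i, a in enumerate(endpoints)
            for b in endpoints[i + 1:]
            if self._underlay.reachable(a, b)
        ]


endpoints = ["leaf1", "leaf2", "leaf3"]
before = Overlay(SpineLeafFabric(endpoints)).build_tunnels(endpoints)
after = Overlay(RoutedRing(endpoints)).build_tunnels(endpoints)
print(before == after)  # True
```

The overlay produces the same tunnels on either underlay because it never reaches past the one method both underlays provide; every extra thing the overlay learns about the underlay makes that substitution harder.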

Now, the real world is always messier than the virtual worlds we build in our minds. Abstractions are always going to leak, and the interaction surface between any pair of underlying and overlying layers is always going to be deeper and broader than you think when you first look at the problem. None of this negates the end goals, however. Keep the interaction surfaces in a design shallow and narrow, and think through “what happens if I replace this piece with a new one later on?”

Hierarchical and modular design, by the way, already operate on these sorts of principles (in theory). They’re just rules of thumb, or design patterns, laid on top of the more foundational concepts. The closer we get to the foundational principles in play, the more we can take this sort of thinking and apply it along every interaction surface in a design, and the more we can move from black art to science in designing networks that work.