Information technology design often follows a common sort of process. First you gather the business requirements. While you do your best to gather all the requirements, you know you will miss some, so you assume the project in hand will change over time. As such, you build some “slop” into the schedule and scaling numbers to account for possible changes, based on prior experience.
Over at the ACM blog, there is a terrific article about software design that has direct application to network design and architecture.
What do monkeys and clubs have to do with software or network design? The primary point of interaction is security. The club you intend to make your network operator’s life easier is also a club an attacker can use to break into your network, or damage its operation. Clubs are just that way. If you think of the collection of tools as not just tools, but also as an attack surface, you can immediately see the correlation between the available tools and the attack surface. One way to increase security is to reduce the attack surface, and one way to reduce the attack surface is to reduce the number of tools, or the size of the club.
The best way to reduce the attack surface of a piece of software is to remove any unnecessary code.
Consider this: the components of any network are actually made up of code. So to translate this to the network engineering world, you can say:
The best way to reduce the attack surface of a network is to remove any unnecessary components.
What kinds of components? Routing protocols, transport protocols, and quality of service mechanisms come immediately to mind, but the number and kind of overlays, the number and kind of virtual networks might be further examples.
There is another issue here that is not security related specifically, but rather resilience related. When you think about network failures, you probably think of bugs in the code, failed connectors, failed hardware, and other such causes. The reality is far different, however—the primary cause of network failures in real life is probably user error in the form of misconfiguration (or misconfiguration spread across a thousand routers through the wonders of DevOps!). The Mean Time Between Mistakes (MTBM) is a much larger deal than most realize. Giving the operator too many knobs to solve a single problem is the equivalent of giving the monkey a club.
Simplicity in network design has many advantages—including giving the monkey a smaller club.
What are your thoughts on how network design itself can be automated and validated? Also, with intent-based networking, at some stage the network should look into itself and adjust to meet design goals or best practices, or alternatively suggest the design itself, in a greenfield situation for example. APSTRA seems to be moving in this direction.
The answer to this question, as always, is—how many balloons fit in a bag? 🙂 I think it depends on what you mean when you use the term design. If we are talking about the overlay, or traffic engineering, or even quality of service, I think we will see a rising trend towards using machine learning in network environments to help solve those problems. I am not convinced machine learning can solve these problems, in the sense of leaving humans out of the loop, but humans could set the parameters up, let the neural network learn the flows, and then let the machine adjust things over time. I tend to think this kind of work will be pretty narrow for a long time to come.
There will be stumbling blocks here that need to be solved. For instance, if you introduce a new application into the network, do you need to re-teach the machine learning network? Or can you somehow make some adjustments? Or are you willing to let the new application underperform while the neural network adjusts? There are no clear answers to these questions, and yet we are going to need clear answers to them before we can really start counting on machine learning in this way.
If, on the other hand, you think of design as figuring out what the network topology should look like in the first place, or what kind of bandwidth you might need to build into the physical topology and where, I think machine learning can provide hints, but it is not going to be able to “design” a network in this way. There is too much intent involved here. For instance, in your original question, you noted the network can “look into itself” and “make adjustments” to better “meet the original design goals.” I’m not certain those “original design goals” are ever going to come from machine learning.
If this sounds like a wishy-washy answer, that’s because it is, in the end… It is always hard to make predictions of this kind. I’m just working off of what I know of machine learning today, compared to what I understand of the multi-variable problem of network design, which is then mushed into the almost infinite possibilities of business requirements.
In this short video I work through two kinds of design, or two different ways of designing a network. Which kind of designer are you? Do you see one as better than the other? Which would you prefer to be, and which are you right now?
“No, I wouldn’t do that, it will make the failure domain too large…”
“We need to divide this failure domain up…”
Okay, great: we all know we need to use failure domains, because without them our networks will be unstable, too complex, and all that stuff, right? But what, precisely, is a failure domain? It seems to have something to do with aggregation, because just about every network design book in the world says things like, “aggregating routes breaks up failure domains.” It also seems to have something to do with flooding domains in link state protocols, because we’re often informed that you need to put in flooding domain boundaries to break up large failure domains. Maybe these two things contain a clue: what is common between flooding domain boundaries and aggregating reachability information? Both hide information.
But how does hiding information create failure domain boundaries?
If Router B is aggregating 2001:db8:0:1::/64 and 2001:db8:0:2::/64 to 2001:db8::/61, then changes in the more specific routes will be hidden from Router A. This hiding of information means a failure of one of these two more specific routes does not cause Router A to recalculate what it knows about reachability in the network. Hence a failure at 2001:db8:0:1::/64 doesn’t impact Router A, which means Router A is in a different failure domain than 2001:db8:0:1::/64. Based on this, we can venture a simple definition:
A failure domain is any group of devices that will share state when the network topology changes.
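To make the aggregation example concrete, here is a minimal Python sketch using the standard `ipaddress` module. The function name and the rule that Router B keeps advertising the aggregate while at least one covered route is still reachable are my assumptions for illustration; real implementations differ in how they originate aggregates.

```python
import ipaddress

# Prefixes taken from the example in the text.
AGGREGATE = ipaddress.ip_network("2001:db8::/61")
ROUTE_1 = ipaddress.ip_network("2001:db8:0:1::/64")
ROUTE_2 = ipaddress.ip_network("2001:db8:0:2::/64")


def advertised_to_router_a(reachable):
    """Return what Router B advertises toward Router A.

    Assumption: the aggregate is advertised as long as at least one
    covered more specific route is still reachable.
    """
    covered = [net for net in reachable if net.subnet_of(AGGREGATE)]
    return {AGGREGATE} if covered else set()


before = advertised_to_router_a({ROUTE_1, ROUTE_2})
after = advertised_to_router_a({ROUTE_2})  # 2001:db8:0:1::/64 has failed

# Router A sees the same advertisement either way, so it has nothing
# to recalculate: the failure stays inside the other failure domain.
print(before == after)
```

In this sketch the set Router A receives is identical before and after the failure, which is the information hiding the definition relies on.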
This definition doesn’t seem to work all the time, though. For example, what if the metric of the 2001:db8::/61 aggregate at Router B depends on the highest-cost more specific route among the routes covered (or hidden)? If the aggregate metric is taken from the 2001:db8:0:1::/64 route attached to Router C, then when that link fails, the aggregate cost will also change, and Router A will need to recalculate reachability. This situation, however, doesn’t change our definition of what a failure domain is; it just alerts us that failure domains can “leak” information if they’re not constructed carefully. In fact, we can trace this back to the law of leaky abstractions: hiding information is just a form of abstraction, and all abstractions leak information in some way to at least one other subsystem within the larger system.
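The leak can be sketched in a few lines of Python. The metric values are invented for illustration, and the policy of taking the aggregate metric from the highest-cost covered route is the assumption described above, not the behavior of any particular implementation.

```python
def aggregate_metric(component_metrics):
    """Assumed policy: the aggregate inherits the highest metric among
    the reachable routes it covers (None if nothing is reachable)."""
    if not component_metrics:
        return None
    return max(component_metrics.values())


# Hypothetical metrics for the two more specific routes.
components = {
    "2001:db8:0:1::/64": 20,  # attached to Router C
    "2001:db8:0:2::/64": 10,
}

before = aggregate_metric(components)

# The link to 2001:db8:0:1::/64 fails and the route is withdrawn.
del components["2001:db8:0:1::/64"]
after = aggregate_metric(components)

# The prefix Router A sees is unchanged, but its metric is not, so
# Router A still has to recalculate: the abstraction has leaked.
router_a_recalculates = before != after
print(before, after, router_a_recalculates)
```

One common way to avoid this kind of leak is to configure a fixed metric on the aggregate, so that nothing inside the aggregated region can change what Router A sees.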
Another, harder, example, might be that of the flooding domain boundary in a link state protocol. Assume, for a moment, that Router A is in Level 2, Routers C and D are in Level 1, and Router B is in both Level 1 and Level 2. Further assume no route aggregation is taking place. What will happen when 2001:db8:0:1::/64 fails? As Router B is advertising 2001:db8:0:1::/64 as if it were directly connected, Router A will see the destination disappear, but it will not see the network topology change. The state of the topology seems to be in one failure domain, while the state of reachability seems to be in another, overlapping, failure domain. This appearance is, in fact, a reflection of reality. Failure domains can—and do—overlap in this way all the time. There’s nothing wrong with overlapping failure domains, so long as you recognize they exist, and therefore you actually look (and plan) for them.
Finally, consider what happens if some link attached to Router A fails. Unless routes are being intentionally leaked into the Level 1 flooding domain at Router B, Router C won’t see any changes to the network, either in topology or reachability. After all, Router C is just depending on Router B’s attached bit to build a default route it uses to reach any destination outside the local flooding domain. This means failure domains can be asymmetric. What breaks a failure domain for one router doesn’t always break it for another. Again, this is okay, so long as you’re aware of this situation, and recognize it when and where it happens.
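The overlapping and asymmetric cases can be put side by side in a toy model. This is only a sketch of the reasoning above, not a simulation of IS-IS: the level assignments come from the example, while the function name, the `leak_l2_into_l1` flag, and the two kinds of state are names I made up for illustration.

```python
# Level membership from the example: A is Level 2 only, B is in both
# levels, C and D are Level 1 only.
LEVELS = {"A": {2}, "B": {1, 2}, "C": {1}, "D": {1}}


def impact(event_level, leak_l2_into_l1=False):
    """Return, per router, which kinds of state change when something
    fails inside the given level's flooding domain.

    Assumptions: no aggregation at Router B, and Level 2 routes only
    reach Level 1 routers when leaking is explicitly enabled.
    """
    result = {}
    for router, levels in LEVELS.items():
        changes = set()
        if event_level in levels:
            # Same flooding domain: topology and reachability both change.
            changes.update({"topology", "reachability"})
        elif event_level == 1 and 2 in levels:
            # Level 1 reachability is carried into Level 2 by Router B.
            changes.add("reachability")
        elif event_level == 2 and leak_l2_into_l1:
            changes.add("reachability")
        result[router] = changes
    return result


print(impact(1)["A"])  # reachability only: overlapping failure domains
print(impact(2)["C"])  # empty set: the boundary is asymmetric
```

Router A still reacts to a Level 1 failure through reachability, while Router C does not react at all to a Level 2 failure unless leaking is turned on.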
So given these caveats, the definition of a failure domain above seems to work well. We can refine it a little, but the general idea of a failure domain as a set of devices that will (or must) react to a change in the state of the network is a good place to start.
So far, in our investigation of the design mindset, we’ve—
- Observed, specifically asking what, applying questions about state, surface, and optimization in our examination of the network as it’s actually deployed.
- Oriented, asking why, really focusing in on the questions around what we’re optimizing, and how that drives state and surface in the design.
- Decided by matching technology to requirement, and making certain we really think through and justify our decisions.
We also considered the problem of interaction surfaces in some detail along the way. This week I want to wrap this little series up by considering the final step in design, act. Yes, you finally get to actually buy some stuff, rack it up, cable it, and then get to the fine joys of configuring it all up to see if it works. But before you do… A couple of points to consider.
It’s important, when acting, to do more than just, well, act. It’s right at this point that it’s important to be metacognitive, to think about what we’re thinking about. Or, perhaps, to consider the process of what we’re doing as much as actually doing it. To give you two specific instances…
First, when you’re out there configuring all that new stuff you’ve been unpacking, racking/stacking, and cabling, are you thinking about how to automate what you’re doing? If you have to do it more than once, then it’s probably a candidate for at least thinking about automating. If you have to do it several hundred times, then you should have spent that time automating it in the first place. But don’t think only of automation; there’s nothing wrong with modifying your environment to make your production faster and more efficient. I have sets of customized tool sets, macros, and workflows I’ve built in common software like MS Word and Corel Draw that I’ve used, modified, and carried from version to version over the years. It might take me several hours to build a new ribbon in a word processor, or write a short script that does something simple and specific, but spending that time, more often than not, pays itself back many times over as I move through getting things done.
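As one small example of the kind of short script I mean, here is a sketch that renders repetitive interface configuration from a table of values using Python’s standard `string.Template`. The configuration syntax, interface names, descriptions, and addresses are all made up for illustration and are not any particular vendor’s CLI.

```python
from string import Template

# Illustrative configuration stanza; not a specific vendor's syntax.
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    " description $description\n"
    " ipv6 address $address\n"
)


def render_interfaces(rows):
    """Render one configuration stanza per row of interface data."""
    return "".join(INTERFACE_TEMPLATE.substitute(row) for row in rows)


# Hypothetical data; in practice this might come from a spreadsheet
# or an inventory system.
rows = [
    {"name": "eth0", "description": "uplink", "address": "2001:db8:0:1::1/64"},
    {"name": "eth1", "description": "servers", "address": "2001:db8:0:2::1/64"},
]

config = render_interfaces(rows)
print(config)
```

Writing something like this takes longer than typing two stanzas by hand, but not longer than typing several hundred.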
In other words, there is more to acting than just acting. You need to observe what you’re doing, describe it as a process, and then treat it as a process. As Deming once said—If you can’t describe what you are doing as a process, you don’t know what you’re doing.
Second, are you really thinking about what you’ll need to measure for the next round of observation? This is a huge problem in our data driven world—
Being data driven is important, but we can get so lost in doing what we’re doing that we forget what we actually set out to do. We get caught up in the school of fish, and lose sight of the porpoise. Remember this: when you’re acting, always think about what you’re going to be doing next, which is observing. As you work, think about what you’re going to need to observe, and why.
Before talking about the final point in the network design mindset, act, I want to answer an excellent question from the comments on the last post in this series: what is surface?
The concept of interaction surfaces is difficult to grasp primarily because it covers such a wide array of ideas. Let me try to clarify by giving a specific example. Assume you have a single function that—
- Accepts two numbers as input
- Adds them
- Multiplies the resulting sum by 100
- Returns the result
This single function can be considered a subsystem in some larger system. Now assume you break this single function into two functions, one of which does the addition, and the other of which does the multiplication. You’ve created two simpler functions (each one only does one thing), but you’ve created an interaction surface between the two functions—you’ve created two interacting subsystems within the system where there only used to be one. This is a really simple example, I know, but consider a few more that might help.
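A minimal Python sketch of the single function and the two-function split might look like this; the function names are hypothetical.

```python
def add_then_scale(a, b):
    """The original single subsystem: add two numbers, multiply by 100."""
    return (a + b) * 100


def add(a, b):
    """First of the two simpler subsystems: addition only."""
    return a + b


def scale(total):
    """Second subsystem: multiplication only.

    It now depends on whatever add() hands it. That hand-off is the
    interaction surface that did not exist before the split.
    """
    return total * 100


# Both designs produce the same result, but the second has two parts
# that must agree on what is passed between them.
print(add_then_scale(2, 3), scale(add(2, 3)))
```

Each function is simpler on its own, but a change to what `add` returns (a different type, say) now has to be coordinated with `scale`. The examples that follow make the same trade at network scale.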
- The routing information carried in OSPF is split up into external routes being carried in BGP, and internal routes being carried in OSPF. You’ve gone from one system with more state to two systems with less state, but you’ve created an interaction surface between the two protocols—they must now work together to build a complete forwarding table.
- A single set of hosts with different access policies are split onto multiple virtual topologies on the same physical network. You’ve simplified the amount of state in filtering, but you’ve created an interaction surface between the two virtual topologies, between the two topologies and the control plane, and you’ve exposed new shared risk groups where a single physical failure can cause multiple logical ones. Hence you’ve traded state in one control plane for interaction surfaces between multiple control planes.
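The shared risk in the second example can be sketched as a simple mapping from physical links to the virtual topologies riding over them. The link and topology names are invented for illustration.

```python
# Hypothetical mapping: which virtual topologies use each physical link.
PHYSICAL_LINKS = {
    "link-1": {"virtual-red", "virtual-blue"},
    "link-2": {"virtual-red"},
}


def affected_topologies(failed_link):
    """Return the virtual topologies that fail when one physical link fails."""
    return PHYSICAL_LINKS.get(failed_link, set())


def shared_risk_links():
    """Return the physical links whose failure hits more than one
    virtual topology, that is, the shared risk groups."""
    return {
        link
        for link, topologies in PHYSICAL_LINKS.items()
        if len(topologies) > 1
    }


print(affected_topologies("link-1"))
print(shared_risk_links())
```

A table like this is only a starting point, but even a rough one makes the new interaction surface visible: a single failure of `link-1` shows up as two logical failures.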
Even two routers communicating within a single control plane can be considered an interaction surface. This breadth of definition is what makes it so very difficult to define what an interaction surface is. To understand how interaction surfaces cause technical debt, I want to point you to a recent paper on machine learning and technical debt.
In this paper, we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate. At a system-level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.
Most systems are designed for a specific “world,” or set of circumstances at a specific point in time. As this “world” changes (over time), subsystems are sheared off and replaced, requirements are changed for each individual subsystem, and external interfaces the original designer counted on are changed and/or replaced to meet updated requirements.
Interaction surfaces aren’t a bad thing; they help us divide and conquer in any given problem space, from modeling to implementation. At the same time, interaction surfaces are all too easy to introduce without thought, hence their deep connection to technical debt.
Next time, I’ll (hopefully) finish this series on the design mindset.