There is no enterprise, there is no service provider; there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument: that hyperscalers’ problems are not your problems, and that the technologies and solutions providers use are fundamentally different from what enterprises require.
Let me try to recap some of the arguments I’ve heard used against my assertion.
The theory that enterprise and service provider networks require completely different technologies and implementations is often grounded in scale. Service provider networks are so large that they simply must use different solutions—solutions that you cannot apply to any network running at a smaller scale.
The problem with this line of thinking is it throws the baby out with the bathwater. Google is using automation to run their network? Well, then… you shouldn’t use automation because Google’s problems are not your problems. Microsoft is deploying 100g Ethernet over fiber? Then clearly enterprise networks should be using Token Ring or ARCnet because… Microsoft’s problems are not your problems.
The usual answer is—“I’m not saying we shouldn’t take good ideas when we see them, but we shouldn’t design networks the way someone else does just because.” I don’t see how this clarifies the solution, though—when is it a good idea or a bad one? What is our criterion to decide what to adopt and what not to adopt? Simply saying “X’s problems aren’t your problems” doesn’t really give me any actionable information—or at least I’m not getting it if it’s buried in there someplace.
Instead, maybe, just maybe, we are looking at this all wrong. Maybe there is some other way to classify networks that will help us see the problem set better.
I don’t think networks are undifferentiated. I think the enterprise/service provider/hyperscaler divide is not helpful in understanding how different networks are … different, and how to correctly identify an environment and build to it. Reading a classic paper in software design this week, Programs, Life Cycles, and Laws of Software Evolution, brought all this to mind. In writing this paper, Meir Lehman was facing many of the same classification problems, just in software development rather than in building networks.
Rather than saying “enterprise software is different than service provider software” (an assertion absolutely no-one makes) or even “commercial software is different than private software, and developers working in these two areas cannot use the same tools and techniques,” Lehman posits there are three kinds of software systems. He calls these S-Programs, in which the problem and solution can be fully specified; P-Programs, in which the problem can be fully specified, but the program can only be partially specified because of complexity and scale; and E-Programs, where the program itself becomes part of the world it models. Lehman thinks most software will move towards S-Program status as time moves on, something that hasn’t happened (the reasons are out of scope for this already-too-long blog post).
But the classification is useful. For S-Programs, the inputs and outputs can be fully specified, full-on testing can take place before the software is deployed, and lifecycle management is largely about making the software more fully conform to its original conditions. Maybe there are S-Networks, too? Single-purpose networks which are aimed at fulfilling one well-defined thing, and only that thing. Lehman talks about learning how to break larger problems into smaller ones so the S-Problems can be dealt with separately. Is this anything different than separating out the basic problem of providing IP connectivity in a DC fabric underlay, or even providing basic IP connectivity in a transit or campus network, treating it as a separate module with fairly well-defined design goals and measurements?
Lehman talks about P-Programs, where the problem is largely definable, but the solutions end up being more heuristic. Isn’t this similar to a traffic engineering overlay, where we largely know what the goals are, but we don’t necessarily know what specific solution is going to be needed at any moment, and the complete set of solutions is just too large to initially calculate? What about E-Programs, where the software becomes a part of the world it models? Isn’t this like the intent-based stuff we’ve been talking about in networking for going on 30 years now?
Looking at it another way, isn’t it possible that some networks are largely just S-Networks? And others are largely E-Networks? And that these classifications have nothing to do with whether the network is being built by what we call an “enterprise” or a “service provider?” Isn’t it possible that S-Networks should probably all use the same basic sort of structure and largely be classified as a “commodity,” while E-Networks will all be snowflakes, and largely classified as having high business importance?
Just like I don’t think the OSI model is particularly helpful in teaching and understanding networks any longer, I don’t find the enterprise/service provider/hyperscaler model very useful in building and operating networks. The enterprise/service provider divide tends to artificially limit the transfer of ideas between networks that could use them, and to artificially “hype up” some networks while degrading others, largely based on perceptions of scale.
Scale != complexity. It’s not about service providers and enterprises. It doesn’t matter if Google’s problems are not your problems; borrowing from the hyperscalers is not a “bad thing.” It’s just a “thing.” Think clearly about the problem set, understand the problem set, and borrow liberally. There is no such thing as a “service provider technology,” nor is there any such thing as an “enterprise technology.” There are problems, and there are solutions. To be an engineer is to connect the two.
What is the best way to build a large-scale network, in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possibly can everywhere you can. Or they might say use hierarchical design. These answers are, however, variants of the single word: abstraction.
This response came to mind when I was reading an article on clean code this last week (it’s amazing how often software architecture overlaps with network architecture):
Once we learn how to create abstractions, it is tempting to get high on that ability, and pull abstractions out of thin air whenever we see repetitive code. After a few years of coding, we see repetition everywhere — and abstracting is our new superpower. If someone tells us that abstraction is a virtue, we’ll eat it. And we’ll start judging other people for not worshipping “cleanliness”.
I have been teaching network design for many, many years. I co-authored my first book on network design, Advanced IP Network Design, with Don Slice and Alvaro Retana; it was published in 1999, and it typically takes about a year to write a book, so we probably started working on it in the middle of 1998. The entire object of that book was to teach hierarchical network design, which relies on modularization through aggregation and summarization to separate complexity from complexity (though I didn’t really use this wording until many years later) in order to break up failure domains.
It has been twenty-two years since Don, Alvaro, and I wrote that book—and hierarchical network design is still as relevant today as it was then. But in the last 22 years, I think I’ve learned just a little more about network design.
Among the things I’ve picked up in that 22 years is this one: if you haven’t found the tradeoffs, you haven’t looked hard enough. Or perhaps there is no such thing as a free lunch. Abstraction is a superpower, and it can make your network a lot cleaner, even when you’re using it correctly (not using it to paper over complexity). But building the perfectly clean network can mean reducing the agility of the design to the point of fragility. For instance, in the article linked above, Dan Abramov notes changing requirements made his “clean revision” of the code much more complex—a classic sign of fragility.
Perhaps an example would be helpful here. If you think of RIP as a link state protocol with summarization (abstraction of topology) at every hop, given you understand how link state and distance-vector protocols work, you can probably quickly grasp what you have gained by summarizing at every hop—and what you have lost.
You should still use abstraction to break up failure domains. You should still use abstraction to separate complexity from complexity. But you should use abstraction like you would any other tool. You should decide the best places and times to use abstraction after understanding the whole system.
For instance—a lot of people really insist on aggregating routing information in their data center fabric, especially in the underlay control plane. Why? The underlay is a constrained routing domain with known properties. Aggregation in this environment can cause routing black holes and unpredictable traffic flow behavior—both of which require added complexity to “work around.” If there is another solution available, it might be best to use it.
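A minimal sketch of how such a black hole can arise, using Python's standard ipaddress module. Everything named here (the prefixes, the spine and leaf names, and the helper functions) is hypothetical and chosen only for illustration; a real underlay has far more moving parts. The idea is that a spine that has lost its link to one leaf still advertises the covering aggregate, so another leaf keeps sending it traffic that it can no longer deliver.

```python
import ipaddress

# Hypothetical aggregate covering every leaf prefix in the fabric.
AGGREGATE = ipaddress.ip_network("10.0.0.0/16")


def spine_routes(live_leaf_links):
    """Specific routes a spine holds: one per leaf it can still reach."""
    return {ipaddress.ip_network(prefix): leaf
            for leaf, prefix in live_leaf_links.items()}


def lookup(routes, destination):
    """Longest-prefix match; returns the next hop, or None if nothing matches."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    return routes[max(matches, key=lambda net: net.prefixlen)]


def usable_spines(spines, destination, aggregated):
    """Spines a leaf believes can reach destination, from what each advertises."""
    addr = ipaddress.ip_address(destination)
    chosen = []
    for name, routes in spines.items():
        # With aggregation, a spine advertises only the aggregate as long as
        # it holds at least one component route.
        advertised = [AGGREGATE] if (aggregated and routes) else list(routes)
        if any(addr in net for net in advertised):
            chosen.append(name)
    return chosen


def black_holed(spines, destination, aggregated):
    """Spines the leaf would use that have no specific route to destination."""
    return [name for name in usable_spines(spines, destination, aggregated)
            if lookup(spines[name], destination) is None]


# The spine1-to-leaf2 link has failed, so spine1 only holds leaf3's prefix.
after_failure = {
    "spine1": spine_routes({"leaf3": "10.0.3.0/24"}),
    "spine2": spine_routes({"leaf2": "10.0.2.0/24", "leaf3": "10.0.3.0/24"}),
}

print(black_holed(after_failure, "10.0.2.10", aggregated=True))   # ['spine1']
print(black_holed(after_failure, "10.0.2.10", aggregated=False))  # []
```

Without aggregation the sending leaf simply stops using spine1 for that destination; with aggregation it has no way to know, which is the extra complexity someone then has to work around.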
At the same time, I see a lot of people insisting BGP is the only option for data center underlays, or that it is the simplest option because you can use a single protocol for the underlay and overlay. This, in my opinion, is wrong, as well—because it does not properly separate two different parts of the network, each with their own purpose, into separate failure domains.
Rather than looking at a network and saying, “we can abstract here, so we should abstract here,” you should look at a network and say, “what are the modules here, and what purposes do they serve?” Once you know that, you can start thinking about when and where abstraction makes sense.
To paraphrase Dan, don’t be a clean network design zealot. Clean network design is not a goal. It’s a good guide when you don’t understand the network; such guides are often useful, but they are guides rather than rules.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton, the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit:
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy of aggregation and core. If you’ve built your network with a decent modular design, you buy enough new routers (or switches, but let’s use routers here) to build out a new aggregation module, the additional firewalls and other middleboxes you need, and the additional line cards to scale the core up. You unit test everything you can in the lab, understanding that you will not be able to fully test in the production network until you arrange a maintenance window. If you’re automating things, you build (and potentially test) the scripts; if you are smart, you will test these scripts in a virtual environment before using them.
You arrange the maintenance window, install the hardware, and … run the scripts. If it works, you go to bed, take a long nap, and get back to work doing “normal maintenance stuff” the next day. Of course, it rarely works, so you preposition some energy bars, make certain you have daycare plans, and put the vendor’s tech support number on speed dial.
What’s wrong with this picture? Well, many things, but primarily: this is not engineering. Was there any thought put into how to test beyond the individual unit level? Is there any way to test realistic traffic flows while connecting the new module to the network without impacting the rest of the network’s operation? Is there any real rollback plan in case things go wrong? Can there be?
In “modern” network design, none of these things tend to exist because they cannot exist. They cannot exist because we have not truly learned to do design life-cycles or truly modular designs. In the software world, if you don’t do modular design, it’s either because you didn’t think it through, or because you thought it through and decided the trade-off just wasn’t worth it. In the networking world, we play around the edges of resilient, modular designs, but networking folks don’t tend to know the underlying technologies—and how they work—well enough to understand how to divide a problem into modules correctly, and the interfaces between those modules.
Let’s consider the same example, but with some engineering principles applied. Instead of a traditional two-layer hierarchy, you have a single-SKU spine and leaf fabric with clearly defined separation between the fabric and pods, clearly defined underlay and overlay protocols, etc. Now you can build a pod and test it against a “fake fabric” before attaching it to the production fabric, including any required automation. Then you can connect the pod to the production fabric and bring up just the underlay protocol, testing the entire underlay before pushing the overlay out to the edge. Then you can push the overlay to the edge and test that before putting any workload on the new pod. Then you can test fake load on the new pod before pushing production traffic onto the pod…
Each of these tests, other than the initial test against a lab environment, can take place on the production network with little or no risk to the entire system. You’re not physically modifying current hardware (except plugging in new cables!), so it’s easy to roll changes back. You know the lower layer parts work before putting the higher layer parts in place. Because the testing happens on the real network, these are canaries rather than traditional “certification” style tests. Because you have real modularization, you can fail fast without causing major harm to any system. Because you are doing things in stages, you can build tests that determine clean and correct operation before moving to the next stage.
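The staged approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Stage class, the run_stages function, and the stage names are hypothetical, and the apply, verify, and rollback callables stand in for whatever automation and canary tests you actually run. The point is the shape of the process: apply one stage, verify it, and if verification fails, undo everything applied so far in reverse order before anything higher in the stack is touched.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    apply: Callable[[], None]      # make the change for this stage
    verify: Callable[[], bool]     # canary test: is the stage operating cleanly?
    rollback: Callable[[], None]   # undo the change for this stage


def run_stages(stages: List[Stage]) -> Tuple[bool, Optional[str]]:
    """Apply stages in order; on the first failed check, roll back in reverse."""
    applied: List[Stage] = []
    for stage in stages:
        stage.apply()
        applied.append(stage)
        if not stage.verify():
            for done in reversed(applied):
                done.rollback()
            return False, stage.name
    return True, None


# Hypothetical stand-ins: each stage only records what it did.
log: List[str] = []


def make_stage(name: str, healthy: bool) -> Stage:
    return Stage(
        name=name,
        apply=lambda: log.append(f"apply {name}"),
        verify=lambda: healthy,
        rollback=lambda: log.append(f"rollback {name}"),
    )


stages = [
    make_stage("underlay", healthy=True),
    make_stage("overlay", healthy=False),   # simulate a failed overlay test
    make_stage("synthetic-load", healthy=True),
    make_stage("production-traffic", healthy=True),
]

result = run_stages(stages)
print(result)  # (False, 'overlay')
print(log)
```

In this run the overlay check fails, so the overlay and then the underlay are rolled back, and no load of any kind is ever placed on the new pod.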
This is an engineered solution—thought has been put into proper modules, how those modules connect, what information is carried across those modules, etc. Doing this sort of work requires knowing more than how to configure—or automate—a set of protocols based on what a vendor tells you to do. Doing this sort of work requires understanding what failure looks like at each point in the cycle and deciding whether to fail out or fix it.
It may not meet the “formal” process mathematicians might prefer, but neither is it the “move fast and break stuff” attitude many see in “the Valley.” It is fail fast, but not fail foolishly. And it’s where we need to move to retain the title of “engineer” and not lose the confidence of the businesses who pay us to build networks that work.
How many 9’s is your network? How about your service provider’s? Now, to ask the not-so-obvious question: why do you care? Does the number of 9’s actually describe the reliability of the network? According to Jeffrey Mogul and John Wilkes, nines are not enough. The question is this: while the paper was written for commercial relationships and cloud providers, is it something you can apply to running your own network? Let’s dive into the meat of the paper and find out.
While 5 9’s is normally given as a form of Service Level Agreement (SLA), there are two other measures of reliability a network operator needs to consider—the Service Level Objective (SLO), and the Service Level Indicator (SLI). The SLO defines a set of expectations about the level of service; internal SLO’s define “trigger points” where actions should be taken to prevent an external SLO from failing. For instance, if the external SLO says no more than 2% of the traffic will be dropped on this link, the internal SLO might say if more than 1% of the traffic on this link is dropped, you need to act. The SLA, on the other hand, says if more than 2% of the traffic on this link is dropped, the operator will rebate (some amount) to the customer. The SLI says this is how I am going to measure the percentage of packets dropped on this link.
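To make the three terms concrete, here is a small sketch built around the link example above. The 1% and 2% thresholds come from that example; the function names, constant names, and return strings are hypothetical, and the measurement interval is simply assumed to be whatever period the counters cover.

```python
# Thresholds from the example in the text; the names are hypothetical.
INTERNAL_SLO_MAX_DROP = 0.01   # act before the external promise is at risk
EXTERNAL_SLO_MAX_DROP = 0.02   # what the customer has been told to expect


def drop_ratio(packets_sent: int, packets_dropped: int) -> float:
    """SLI: fraction of packets dropped on the link over one interval."""
    if packets_sent <= 0:
        return 0.0
    return packets_dropped / packets_sent


def evaluate_interval(packets_sent: int, packets_dropped: int) -> str:
    """Compare the measured SLI against the internal and external SLOs."""
    sli = drop_ratio(packets_sent, packets_dropped)
    if sli > EXTERNAL_SLO_MAX_DROP:
        # The SLA is what attaches a consequence to missing the external SLO.
        return "external SLO missed: SLA rebate owed"
    if sli > INTERNAL_SLO_MAX_DROP:
        return "internal SLO tripped: act now"
    return "within objectives"


print(evaluate_interval(1000, 5))    # within objectives
print(evaluate_interval(1000, 15))   # internal SLO tripped: act now
print(evaluate_interval(1000, 25))   # external SLO missed: SLA rebate owed
```

Notice how much is hidden inside drop_ratio: which counters feed it, how often they are sampled, and over what interval they are summed are all SLI decisions, and they determine whether the SLO and SLA mean anything.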
Splitting these three concepts apart helps reveal what is wrong with the entire 5 9’s way of thinking, because it enables you to ask questions like—can my telemetry system measure and report on the amount of traffic dropped on this link? Across what interval should this SLI apply? If I combine all the SLI’s across my entire network, what does the monitoring system need to look like? Can I support the false positives likely to occur with such a monitoring system?
These questions might be obvious, of course, but there are more non-obvious ones, as well. For instance, how do my internal and external SLO’s correlate to my SLI’s? Measuring the amount of traffic dropped on a link is pretty simple (in theory). Measuring something like “this application will not perform at less than 50% capacity because of network traffic” is going to be much, much harder.
The point Mogul and Wilkes make in this paper is that we just need to rethink the way we write SLO’s and their resulting SLA’s to be more realistic—in particular, we need to think about whether or not the SLI’s we can actually measure and act on can cash the SLO and SLA checks we’re writing. This means we probably need to expose more, rather than less, of the complexity of the network itself—even though this cuts against the grain of the current move towards abstracting the network down to “ports and packets.” To some degree, the consumer of networking services is going to need to be more informed if we are to build realistic SLA’s that can be written and kept.
How does this apply to the “average enterprise network engineer?” At first glance, it might seem like this paper is strongly oriented towards service providers, since there are definite contracts, products, etc., in play. If you squint your eyes, though, you can see how this would apply to the rest of the world. The implicit promise you make to an application developer or owner that their application will, in fact, run on the network with little or no performance degradation is, after all, an SLO. Your yearly review examining how well the network has met the needs of the organization is an SLA of sorts.
The kind of thinking represented here, if applied within an organization, could turn the conversation about whether to out- or in-source on its head. Rather than talking about the 5 9’s some cloud provider is going to offer, it opens up discussions about how and what to measure, even within the cloud service, to understand the performance being offered, and how more specific and nuanced results can be measured against a fuller picture of value added.
This is a short paper—but well worth reading and considering.
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while there are situations that are a zero-sum game, not all situations are. What we need is a way to “cut the middle” to realistically appraise each situation and realistically decide what the tradeoffs might be.
This is where the state/optimization/surface (SOS) model comes into play. You’ll find this model described in several of my books alongside some thoughts on complexity theory (see the second chapter here, for instance, or here), but I don’t spend a lot of time discussing how to apply this concept. The answer lies in the intersection between looking for tradeoffs and the SOS model.
TL;DR version: the SOS model tells you where you should look for tradeoffs.
Take the time-worn example of route aggregation, which improves the operation of a network by reducing the “blast radius” of changes in reachability. Combining aggregation with summarization (as is almost always the intent), it reduces the “blast radius” for changes in the network topology as well. The way aggregation and summarization reduce the “blast radius” is simple: if you define a failure domain as the set of devices which must somehow react to a change in the network (the correct way to define a failure domain, by the way), then aggregation and summarization reduce the failure domain by hiding changes in one part of the network from devices in some other part of the network.
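The definition of a failure domain given above can be turned into a small experiment. The sketch below is an illustration only: the pod names, the prefixes, and the three helper functions are hypothetical, each pod stands in for the set of devices inside it, and every pod is assumed to hear whatever every other pod advertises. It builds each pod's routing table before and after a single prefix is lost, then reports which pods saw their table change, which is to say which pods had to react.

```python
import ipaddress


def advertised(prefixes, aggregate):
    """What a pod's edge advertises outward: the aggregate if one is
    configured and any component is still alive, otherwise the specifics."""
    if aggregate is not None and prefixes:
        return {aggregate}
    return set(prefixes)


def routing_tables(pods, aggregates):
    """Table per pod: its own specifics plus what every other pod advertises."""
    tables = {}
    for name, prefixes in pods.items():
        table = set(prefixes)
        for other, other_prefixes in pods.items():
            if other != name:
                table |= advertised(other_prefixes, aggregates.get(other))
        tables[name] = table
    return tables


def failure_domain(before, after, aggregates):
    """Pods whose tables differ across the change, so whose devices must react."""
    old = routing_tables(before, aggregates)
    new = routing_tables(after, aggregates)
    return sorted(name for name in old if old[name] != new[name])


net = ipaddress.ip_network
before = {
    "pod1": {net("10.1.1.0/24"), net("10.1.2.0/24")},
    "pod2": {net("10.2.1.0/24")},
    "pod3": {net("10.3.1.0/24")},
}
# One prefix inside pod1 becomes unreachable.
after = dict(before, pod1={net("10.1.1.0/24")})

print(failure_domain(before, after, aggregates={}))
# ['pod1', 'pod2', 'pod3']
print(failure_domain(before, after, aggregates={"pod1": net("10.1.0.0/16")}))
# ['pod1']
```

With the aggregate in place, the change stays inside pod1; the other pods never see it, which is exactly the reduction in "blast radius" described above.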
Note: the depth of the failure domain is relevant, as well, but not often discussed; this is related to the depth of an interaction surface, but since this is merely a blog post . . .
According to SOS, route aggregation (and topology summarization) is a form of abstraction, which means it is a way of controlling state. If we control state, we should see a corresponding tradeoff in interaction surfaces, and a corresponding tradeoff in some form of optimization. Given these two pointers, we can search for your tradeoffs. Let’s start with interaction surfaces.
Observe aggregation is normally manually configured; this is an interaction surface. The human-to-device interaction surface now needs to account for the additional work of designing, configuring, maintaining, and troubleshooting around aggregation; these things add complexity to the network. Further, the routing protocol must also be designed to support aggregation and summarization, so the design of the protocol must also be more complex. This added complexity is often going to come in the form of . . . additional interaction surfaces, such as the conversion of a not-so-stubby external to a standard external in OSPF, or something similar.
Now let’s consider optimization. Controlling failure domains allows you to build larger, more stable networks—this is an increase in optimization. At the same time, aggregation removes information from the control plane, which can cause some traffic to take a suboptimal path (if you want examples of this, look at the books referenced above). Traffic taking a suboptimal path is a decrease in optimization. Finally, building larger networks means you are also building a more complex network—so we can see the increase in complexity here, as well.
Experience is often useful in helping you have more specific places to look for these sorts of things, of course. If you understand the underlying problems and solutions (hint, hint), you will know where to look more quickly. If you understand common implementations and the weak points of each of those implementations, you will be able to quickly pinpoint an implementation’s weak points. History might not repeat itself, but it certainly rhymes.
I have spent many years building networks, protocols, and software. I have never found a situation where the SOS model, combined with a solid knowledge of the underlying problems and solutions (or perhaps technologies and implementations used to solve these problems), has led me astray in being able to quickly find the tradeoffs so I could see, and then analyze, them.
If you are looking for a good resolution for 2020 still (I know, it’s a bit late), you can’t go wrong with this one: this year, I will focus on making the networks and products I work on truly simpler. Now, before you pull Tom’s take out on me—
There are those that would say that what we’re doing is just hiding the complexity behind another layer of abstraction, which is a favorite saying of Russ White. I’d argue that we’re not hiding the complexity as much as we’re putting it back where it belongs – out of sight. We don’t need the added complexity for most operations.
Three things: First, complex solutions are always required for hard problems. If you’ve ever listened to me talk about complexity, you’ve probably seen this quote on a slide someplace—
[C]omplexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.
You cannot solve hard problems (complex problems) without complex solutions. In fact, a lot of the complexity we run into in our everyday lives is a result of saying “this is too complex, I’m going to build something simpler.” (Here I’m thinking of a blog post I read last year that said “when we were building containers, we looked at routing and realized how complex it was… so we invented something simpler…” which, of course, turned out to be more complex than dynamic routing!)
Second, abstraction can be used the right way to manage complexity, and it can be used the wrong way to obfuscate or mask complexity. The second great source of complexity and system failure in our world is we don’t abstract complexity so much as we obfuscate it.
Third, abstraction is not a zero-sum game. If you haven’t found the tradeoffs, you haven’t looked hard enough. This is something expressed through the state/optimization/surface triangle, which you should know at this point.
Returning to the top of this post, the point is this: Using abstraction to manage complexity is fine. Obfuscation of complexity is not. Papering over complexity “just because I can” never solves the problem, any more than sweeping dirt under the rug, or papering over the old paint without bothering to fix the wall first.
We need to go beyond just figuring out how to make the user interface simpler, more “intent-driven,” automated, or whatever it is. We need to think of the network as a system, rather than as a collection of bits and bobs that we’ve thrown together across the years. We need to think about the modules horizontally and vertically, think about how they interact, understand how each piece works, understand how each abstraction leaks, and be able to ask hard questions.
For each module, we need to understand how things work well enough to ask is this the right place to divide these two modules? We should be willing to rethink our abstraction lines, the placement of modules, and how things fit together. Sometimes moving an abstraction point around can greatly simplify a design while increasing optimal behavior. Other times it’s worth it to reduce optimization to build a simpler mouse trap. But you cannot know the answer to this question until you ask it. If you’re sweeping complexity under the rug because… well, that’s where it belongs… then you are doing yourself and the organization you work for a disservice, plain and simple. Whatever you sweep under the rug of obfuscation will grow and multiply. You don’t want to be around when it crawls back out from under that rug.
For each module, we need to learn how to ask is this the right level and kind of abstraction? We need to learn to ask does the set of functions this module is doing really “hang together,” or is this just a bunch of cruft no-one could figure out what to do with, so they shoved it all in a black box and called it done?
Above all, we need to learn to look at the network as a system. I’ve been harping on this for so long, and yet I still don’t think people understand what I am saying a lot of times. So I guess I’ll just have to keep saying it. 😊
The problems networks are designed to solve are hard; therefore, networks are going to be complex. You cannot eliminate complexity, but you can learn to minimize and control it. Abstraction within a well-thought-out system is a valid and useful way to control complexity, and so is understanding how and where to create the modules at whose edges abstraction can take place.
Don’t obfuscate. Think systemically, think about the tradeoffs, and abstract wisely.