Complaining about how slow the IETF is, or how single vendors dominate the standards process, is almost a by-game in the world of network engineering going back to the very beginning. It is one thing to complain; it is another to understand the structure of the problem and make practical suggestions about how to fix it. Join us at the Hedge as Andrew Alston, Tom Ammon, and Russ White reveal some of the issues, and brainstorm how to fix them.
There is no enterprise, there is no service provider; there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument: that hyperscalers’ problems are not your problems, and that the technologies and solutions providers use are fundamentally different from what enterprises require.
Let me try to recap some of the arguments I’ve heard used against my assertion.
The theory that enterprise and service provider networks require completely different technologies and implementations is often grounded in scale. Service provider networks are so large that they simply must use different solutions—solutions that you cannot apply to any network running at a smaller scale.
The problem with this line of thinking is it throws the baby out with the bathwater. Google is using automation to run their network? Well, then… you shouldn’t use automation because Google’s problems are not your problems. Microsoft is deploying 100G Ethernet over fiber? Then clearly enterprise networks should be using Token Ring or ARCnet because… Microsoft’s problems are not your problems.
The usual answer is—“I’m not saying we shouldn’t take good ideas when we see them, but we shouldn’t design networks the way someone else does just because.” I don’t see how this clarifies the solution, though—when is it a good idea or a bad one? What is our criterion to decide what to adopt and what not to adopt? Simply saying “X’s problems aren’t your problems” doesn’t really give me any actionable information—or at least I’m not getting it if it’s buried in there someplace.
Instead, maybe, just maybe, we are looking at this all wrong. Maybe there is some other way to classify networks that will help us see the problem set better.
I don’t think networks are undifferentiated; I think the enterprise/service provider/hyperscaler divide is not helpful for understanding how different networks are … different, and how to correctly identify an environment and build to it. Reading a classic paper in software design this week, Programs, Life Cycles, and Laws of Software Evolution, brought all this to mind. In writing this paper, Meir Lehman was facing many of the same classification problems, just in software development rather than in building networks.
Rather than saying “enterprise software is different than service provider software” (an assertion absolutely no one makes) or even “commercial software is different than private software, and developers working in these two areas cannot use the same tools and techniques,” Lehman posits there are three kinds of software systems. He calls these S-Programs, in which the problem and solution can be fully specified; P-Programs, in which the problem can be fully specified, but the program can only be partially specified because of complexity and scale; and E-Programs, where the program itself becomes part of the world it models. Lehman thinks most software will move towards S-Program status as time moves on, something that hasn’t happened (the reasons are out of scope for this already-too-long blog post).
But the classification is useful. For S-Programs, the inputs and outputs can be fully specified, full-on testing can take place before the software is deployed, and lifecycle management is largely about making the software more fully conform to its original conditions. Maybe there are S-Networks, too? Single-purpose networks which are aimed at fulfilling one well-defined thing, and only that thing. Lehman talks about learning how to break larger problems into smaller ones so the S-Problems can be dealt with separately. Is this anything different than separating out the basic problem of providing IP connectivity in a DC fabric underlay, or even providing basic IP connectivity in a transit or campus network, treating it as a separate module with fairly well-defined design goals and measurements?
Lehman talks about P-Programs, where the problem is largely definable, but the solutions end up being more heuristic. Isn’t this similar to a traffic engineering overlay, where we largely know what the goals are, but we don’t necessarily know what specific solution is going to be needed at any moment, and the complete set of solutions is just too large to initially calculate? What about E-Programs, where the software becomes a part of the world it models? Isn’t this like the intent-based stuff we’ve been talking about in networking for going on 30 years now?
Looking at it another way, isn’t it possible that some networks are largely just S-Networks? And others are largely E-Networks? And that these classifications have nothing to do with whether the network is being built by what we call an “enterprise” or a “service provider”? Isn’t it possible that S-Networks should probably all use the same basic sort of structure and largely be classified as a “commodity,” while E-Networks will all be snowflakes, and largely classified as having high business importance?
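To make the classification concrete, here is a minimal Python sketch of how network modules might be sorted into these three classes. Everything in it is an assumption made for illustration: the names `NetworkClass`, `NetworkModule`, and `classify`, the three boolean attributes, and the choice to treat any problem that cannot be fully specified as an E-Network. Lehman’s paper classifies programs, not networks, and does not define any such procedure.

```python
from dataclasses import dataclass
from enum import Enum


class NetworkClass(Enum):
    S = "S-Network"  # problem and solution can both be fully specified
    P = "P-Network"  # problem can be specified, solution is heuristic
    E = "E-Network"  # the network becomes part of the world it models


@dataclass(frozen=True)
class NetworkModule:
    """Hypothetical description of one module of a network."""
    name: str
    problem_fully_specified: bool
    solution_fully_specified: bool
    embedded_in_environment: bool = False


def classify(module: NetworkModule) -> NetworkClass:
    """Map a module onto the S/P/E classes described in the text."""
    if module.embedded_in_environment:
        return NetworkClass.E
    if module.problem_fully_specified and module.solution_fully_specified:
        return NetworkClass.S
    if module.problem_fully_specified:
        return NetworkClass.P
    # Assumption: a problem that cannot be pinned down is treated as E.
    return NetworkClass.E


underlay = NetworkModule("DC fabric underlay", True, True)
te_overlay = NetworkModule("traffic engineering overlay", True, False)
intent = NetworkModule("intent-based system", False, False, True)

for module in (underlay, te_overlay, intent):
    print(f"{module.name}: {classify(module).value}")
```

Note that nothing in the sketch asks who owns the network or how large it is; the classification depends only on how well the problem and the solution can be specified.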
Just like I don’t think the OSI model is particularly helpful in teaching and understanding networks any longer, I don’t find the enterprise/service provider/hyperscaler model very useful in building and operating networks. The enterprise/service provider divide tends to artificially limit idea transfer when ideas want to be transferred, and artificially “hype up” some networks while degrading others, largely based on perceptions of scale.
Scale != complexity. It’s not about service providers and enterprises. It doesn’t matter if Google’s problems are not your problems; borrowing from the hyperscalers is not a “bad thing.” It’s just a “thing.” Think clearly about the problem set, understand the problem set, and borrow liberally. There is no such thing as a “service provider technology,” nor is there any such thing as an “enterprise technology.” There are problems, and there are solutions. To be an engineer is to connect the two.
CHINOG is a regional network operators group that meets in Chicago once a year. For this episode of the Hedge, Jason Gooley joins us to talk about the origins of CHINOG, the challenges involved in running a small conference, some tips for those who would like to start a conference of this kind, and thoughts on the importance of community in the network engineering world.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton and the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit:
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy, aggregation and core. If you’ve built your network with a decent modular design, you buy enough new routers (or switches, but let’s use routers here) to build out a new aggregation module, the additional firewalls and other middleboxes you need, and the additional line cards to scale the core up. You unit test everything you can in the lab, understanding that you will not be able to fully test in the production network until you arrange a maintenance window. If you’re automating things, you build (and potentially test) the scripts; if you are smart, you will test these scripts in a virtual environment before using them.
You arrange the maintenance window, install the hardware, and … run the scripts. If it works, you go to bed, take a long nap, and get back to work doing “normal maintenance stuff” the next day. Of course, it rarely works, so you preposition some energy bars, make certain you have daycare plans, and put the vendor’s tech support number on speed dial.
What’s wrong with this picture? Well, many things, but primarily: this is not engineering. Was there any thought put into how to test beyond the individual unit level? Is there any way to test realistic traffic flows while connecting the new module to the network without impacting the rest of the network’s operation? Is there any real rollback plan in case things go wrong? Can there be?
In “modern” network design, none of these things tend to exist because they cannot exist. They cannot exist because we have not truly learned to do design life-cycles or truly modular designs. In the software world, if you don’t do modular design, it’s either because you didn’t think it through, or because you thought it through and decided the trade-off just wasn’t worth it. In the networking world, we play around the edges of resilient, modular designs, but networking folks don’t tend to know the underlying technologies—and how they work—well enough to understand how to divide a problem into modules correctly, and the interfaces between those modules.
Let’s consider the same example, but with some engineering principles applied. Instead of a traditional two-layer hierarchy, you have a single-SKU spine and leaf fabric with clearly defined separation between the fabric and pods, clearly defined underlay and overlay protocols, etc. Now you can build a pod and test it against a “fake fabric” before attaching it to the production fabric, including any required automation. Then you can connect the pod to the production fabric and bring up just the underlay protocol, testing the entire underlay before pushing the overlay out to the edge. Then you can push the overlay to the edge and test that before putting any workload on the new pod. Then you can test fake load on the new pod before pushing production traffic onto the pod…
Each of these tests, other than the initial test against a lab environment, can take place on the production network with little or no risk to the entire system. You’re not physically modifying current hardware (except plugging in new cables!), so it’s easy to roll changes back. You know the lower layer parts work before putting the higher layer parts in place. Because the testing happens on the real network, these are canaries rather than traditional “certification” style tests. Because you have real modularization, you can fail fast without causing major harm to any system. Because you are doing things in stages, you can build tests that determine clean and correct operation before moving to the next stage.
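The staged approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Stage`, `run_staged_rollout`, and `fake_stage` are hypothetical names, the `apply`, `verify`, and `rollback` callables stand in for whatever automation and canary tests a real network would use, and rolling back every completed stage on the first failure is just one possible policy (fixing the problem in place is another).

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Stage:
    name: str
    apply: Callable[[], None]     # make this stage's change
    verify: Callable[[], bool]    # canary test: True means clean operation
    rollback: Callable[[], None]  # undo only this stage's change


def run_staged_rollout(stages: List[Stage]) -> List[str]:
    """Apply stages in order, testing each before moving to the next.

    On the first failed test, roll back every applied stage in reverse
    order and stop. Returns a log of what happened.
    """
    log: List[str] = []
    applied: List[Stage] = []
    for stage in stages:
        stage.apply()
        applied.append(stage)
        if stage.verify():
            log.append(f"ok: {stage.name}")
            continue
        log.append(f"failed: {stage.name}")
        for done in reversed(applied):
            done.rollback()
            log.append(f"rolled back: {done.name}")
        break
    return log


def fake_stage(name: str, healthy: bool, deployed: Set[str]) -> Stage:
    """Hypothetical stand-in for real automation, used only for the demo."""
    return Stage(
        name=name,
        apply=lambda: deployed.add(name),
        verify=lambda: healthy,
        rollback=lambda: deployed.discard(name),
    )


deployed: Set[str] = set()
stages = [
    fake_stage("underlay", True, deployed),
    fake_stage("overlay", True, deployed),
    fake_stage("synthetic load", False, deployed),
    fake_stage("production traffic", True, deployed),
]
for line in run_staged_rollout(stages):
    print(line)
```

In this run the synthetic load test fails, so production traffic is never moved and the earlier stages are unwound in reverse order. The failure is contained because each stage has its own test and its own way back.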
This is an engineered solution—thought has been put into proper modules, how those modules connect, what information is carried across those modules, etc. Doing this sort of work requires knowing more than how to configure—or automate—a set of protocols based on what a vendor tells you to do. Doing this sort of work requires understanding what failure looks like at each point in the cycle and deciding whether to fail out or fix it.
It may not meet the “formal” process mathematicians might prefer, but neither is it the “move fast and break stuff” attitude many see in “the Valley.” It is fail fast, but not fail foolishly. And it’s where we need to move to retain the title of “engineer” and not lose the confidence of the businesses who pay us to build networks that work.
In this episode of the Hedge, Tom Ammon and Russ White are joined by Ivan Pepelnjak of ipSpace.net to talk about being old, knowing about how things are going to break before they do, and being negative. Along the way, we discuss the IETF, open source, and many other aspects of the world of network engineering.
I failed to include the right categories the first time, so this didn’t make it into the podcast catcher feeds correctly…
Network engineering and operations are both “mental work” that can largely be done remotely, but while working remotely is great in many ways, it is also often fraught with problems. In this episode of the Hedge, Roland Dobbins joins Tom and Russ to discuss the ins and outs of working remotely, including some strategies we have found effective at removing many of the negative aspects.
For any field of study, there are some mental habits that will make you an expert over time. Whether you are an infrastructure architect, a network designer, or a network reliability engineer, what are the habits of mind those involved in the building and operation of networks follow that mark out expertise?
Experts involve the user
Experts don’t just listen to the user, they involve the user. This means taking the time to teach the developer or application owner how their applications interact with the network, showing them how their applications either simplify or complicate the network, and the impact of these decisions on the overall network.
Experts think about data
Rather than applications. What does the data look like? How does the business use the data? Where does the data need to be, when does it need to be there, how often does it need to go, and what is the cost of moving it? What might be in the data that can be harmful? How can I protect the data while at rest and in flight?
Experts think in modules, surfaces, and protocols
Devices and configurations can, and should, change over time. The way a problem is broken up into modules and the interaction surfaces (interfaces) between those modules can be permanent. Choosing the wrong protocol means choosing a different protocol to solve every problem, leading to accretion of complexity, ossification, and ultimately brittleness. Break the problem up right the first time, and choose the protocols carefully, and let the devices and configurations follow.
Choosing devices first is like selecting the hammer you’re going to use to build a house, and then selecting the design and materials used in the house based on what you can use the hammer for.
Experts think about tradeoffs
State, optimization, and surface is an ironclad tradeoff. If you increase state, you increase complexity while also increasing optimization. If you increase surfaces through abstraction, you are both increasing and decreasing state, which has an impact both on complexity and optimization. All nontrivial abstractions leak. Every time you move data you are facing the speed of serialization, queueing, and light, and hence you are dealing with the choice between consistency, availability, and partition tolerance.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
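As a rough worked example of two of those limits, the Python sketch below computes serialization delay (frame size divided by link rate) and propagation delay (distance divided by signal speed). The function names are hypothetical, the signal speed in fiber is taken as roughly two-thirds of the speed of light in a vacuum (about 2 × 10^8 m/s, an approximation), and queueing delay is left out because it depends on load.

```python
# Assumption: light in optical fiber travels at roughly 2/3 of c.
SPEED_IN_FIBER_M_PER_S = 2.0e8


def serialization_delay_s(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire at the given link rate."""
    return frame_bytes * 8 / link_bps


def propagation_delay_s(distance_m: float,
                        speed_m_per_s: float = SPEED_IN_FIBER_M_PER_S) -> float:
    """Time for a signal to travel the given distance."""
    return distance_m / speed_m_per_s


# A 1500-byte frame on a 10 Gb/s link, sent across 1000 km of fiber.
ser = serialization_delay_s(1500, 10e9)
prop = propagation_delay_s(1_000_000)

print(f"serialization: {ser * 1e6:.2f} us")  # serialization: 1.20 us
print(f"propagation:   {prop * 1e3:.2f} ms")  # propagation:   5.00 ms
```

Under these assumptions the propagation delay is thousands of times larger than the serialization delay, and buying a faster link only shrinks the smaller number. Distance is a cost that cannot be optimized away, which is why keeping widely separated copies of data consistent always has a price.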
Experts focus on the essence
Every problem has an essential core—something you are trying to solve, and a reason for solving it. Experts know how to divide between the essential and the nonessential. Experts think about what they are not designing, and what they are not trying to accomplish, as well as what they are. This doesn’t mean the rest isn’t there, it just means it’s not quite in focus all the time.
Experts are mentally stimulated to simulate
Labs are great—but moving beyond the lab and thinking about how the system works as a whole is better. Experts mentally simulate how the data moves, how the network converges, how attackers might try to break in, and other things besides.
Experts look around
Interior designers go to famous spaces to see how others have designed before them. Building designers walk through cities and famous buildings to see how others have designed before them. The more you know about how others have designed, the more you know about the history of networks, the more of an expert you will be.
Experts reshape the problem space
Experts are unafraid to think about the problem in a different way, to say “no,” and to try solutions that have not been tried before. Best common practice is a place to start, not a final arbiter of all that is good and true. Experts do not fall for the “is/ought” fallacy.
Experts treat problems as opportunities
Whether the problem is a mistake or a failure, or even a little bit of both, every problem is an opportunity to learn how the system works, and how networks work in general.