This week is very busy for me, so rather than writing a single long, post, I’m throwing together some things that have been sitting in my pile to write about for a long while.

From Dalton Sweeny:

A physicist loses half the value of their physics knowledge in just four years whereas an English professor would take over 25 years to lose half the value of the knowledge they had at the beginning of their career. . . Software engineers with a traditional computer science background learn things that never expire with age: data structures, algorithms, compilers, distributed systems, etc. But most of us don’t work with these concepts directly. Abstractions and frameworks are built on top of these well studied ideas so we don’t have to get into the nitty-gritty details on the job (at least most of the time).

This is precisely the way network engineering is. There is value in the kinds of knowledge that expire, such as individual product lines, etc.—but the closer you are to the configuration, the more ephemeral the knowledge is. This is one of the entire points of rule 11 is your friend. Learn the foundational things that make learning the ephemeral things easier. There are only four problems (really) in moving data from one place to another. There are only around four solutions for each of those problems. Each of those solutions is bounded into a small set (again, about four for each) sub-solutions, or ways of implementing the solution, etc.

I’m going to spend some time talking about this in the Livelesson I’m currently recording, so watch this space for an announcement sometime early next year about publication.

From Ivan P:

What I’m pointing out in this rant is the reverse reasoning along the lines “vendor X is doing something, which confirms there’s a real market need for it”. I’ve been in IT too long, and seen how the startup/VC sausage is made, to believe that fairy tale… and even when it’s true, it doesn’t necessarily imply that you need whatever vendor X is selling.

There are two ways to look at this. Either vendors should lead the market in building solutions, or they should follow whatever the customer wants. From my perspective, one of the problems we have right now is everything is a massive mish-mash of these two things. The operator’s design team thinks of a neat way to do X, and then promises the account team a big check if its implemented. It doesn’t matter that X could be solved some other way that might be simpler, etc.—all that matters is the check. In this case, the vendor stops challenging the customer to build things better, and starts just acting like a commodity provider, rather than an innovative partner.

The interaction between the customer and the vendor needs to be more push-pull than it currently is—right now, it seems like either the operator simply dictates terms to the vendor, or the vendor pretty much dictates architecture to the operator. We need to find a middle ground. The vendor does need to have a solid solution architecture, but the architecture needs to be flexible, as well, and the blocks used to build that architecture need to be usable in ways not anticipated by the vendor’s design folks.

On the other hand, we need to stop chasing features. This isn’t just true of vendors, this is true of operators as well. You get feature lists because that’s what you ask for. Often, operators ask for feature lists because that’s the easiest thing to measure, or because they already have a completely screwed up design they are trying to brownfield around. The worst is—“we have this brownfield that we just can’t get rid of, so we want to build yet another overlay on top, which will make everything simpler.” After about the twentieth overlay a system crash becomes a matter of when rather than if.

Complaining about how slow the IETF is, or how single vendors dominate the standards process, is almost a by-game in the world of network engineering going back to the very beginning. It is one thing to complain; it is another to understand the structure of the problem and make practical suggestions about how to fix it. Join us at the Hedge as Andrew Alston, Tom Ammon, and Russ White reveal some of the issues, and brainstorm how to fix them.

download

There is no enterprise, there is no service provider—there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument—that hyperscaler’s problems are not your problems, the technologies and solutions providers user are fundamentally different than what enterprises require.

Let me try to recap some of the arguments I’ve heard used against my assertion.

The theory that enterprise and service provider networks require completely different technologies and implementations is often grounded in scale. Service provider networks are so large that they simply must use different solutions—solutions that you cannot apply to any network running at a smaller scale.

The problem with this line of thinking is it throws the baby out with the bathwater. Google is using automation to run their network? Well, then… you shouldn’t use automation because Google’s problems are not your problems. Microsoft is deploying 100g Ethernet over fiber? Then clearly enterprise networks should be using Token Ring or ARCnet because… Microsoft’s problems are not your problems.

The usual answer is—“I’m not saying we shouldn’t take good ideas when we see them, but we shouldn’t design networks the way someone else does just because.” I don’t see how this clarifies the solution, though—when is it a good idea or a bad one? What is our criterion to decide what to adopt and what not to adopt? Simply saying “X’s problems aren’t your problems” doesn’t really give me any actionable information—or at least I’m not getting it if it’s buried in there someplace.

Instead—maybe—just maybe—we are looking at this all wrong. Maybe there is some other way classify networks that will help us see the problem set better.

I don’t think networks are undifferentiated—I think the enterprise/service provider/hyerpscaler divide is not helpful to understand how different networks are … different, and how to correctly identify an environment and build to it. Reading a classic paper in software design this week—Programs, Life Cycles, and Laws of Software Evolution—brought all this to mind. In writing this paper, Meir Lehman was facing many of the same classification problems, just in software development rather than in building networks.

Rather than saying “enterprise software is different than service provider software”—an assertion absolutely no-one makes—or even “commercial software is different than private software, and developers working in these two areas cannot use the same tools and techniques,” Lehman posits there are three kinds of software systems. He calls these S-Programs, in which the problem and solution can be fully specified; P-Programs, in which the problem can be fully specified, but the program can only be partially specified because of complexity and scale; and E-Programs, where the program itself become part of the world it models. Lehman thinks most software will move towards S-Program status as time moves on—something that hasn’t happened (the reasons are out of scope for this already-too-long-blog-post).

But the classification is useful. For S-Programs, the inputs and outputs can be fully specified, full-on testing can take place before the software is deployed, and lifecycle management is largely about making the software more fully conform to its original conditions. Maybe there are S-Networks, too? Single-purpose networks which are aimed at fulfilling on well-defined thing, and only that thing. Lehman talks about learning how to breaking larger problems into smaller one so the S-Problems can be dealt with separately—is this anything different than separating out the basic problem of providing IP connectivity in a DC fabric underlay, or even providing basic IP connectivity in a transit or campus network, treating it as a separate module with fairly well design goals and measurements?

Lehman talks about P-Programs, where the problem is largely definable, but the solutions end up being more heuristic. Isn’t this similar to a traffic engineering overlay, where we largely know what the goals are, but we don’t necessarily know what specific solution is going to needed at any moment, and the complete set of solutions is just too large to initially calculate? What about E-Programs, where the software becomes a part of the world it models? Isn’t this like the intent-based stuff we’ve been talking about networking for going one 30 years now?

Looking at it another way, isn’t it possible that some networks are largely just S-Networks? And others are largely E-Networks? And that these classifications have nothing to do with whether the network is being built by what we call an “enterprise” or a “service provider?” Isn’t is possible that S-Networks should probably all use the same basic sort of structure and largely be classified as a “commodity,” while E-Networks will all be snowflakes, and largely classified as having high business importance?

Just like I don’t think the OSI model is particularly helpful in teaching and understanding networks any longer, I don’t find the enterprise/service/hyperscaler model very useful in building and operating networks. The service enterprise/service provider divide tends to artificially limit idea transfer when it wants to be transferred, and artificially “hype up” some networks while degrading others—largely based on perceptions of scale.

Scale != complexity. It’s not about service providers and enterprises. It doesn’t matter if Google’s problems are not your problems; borrowing from the hyperscale is not a “bad thing.” It’s just a “thing.” Think clearly about the problem set, understand the problem set, and borrow liberally. There is no such thing as a “service provider technology,” nor is there any such thing as an “enterprise technology.” There are problems, and there are solutions. To be an engineer is to connect the two.

CHINOG is a regional network operators group that meets in Chicago once a year. For this episode of the Hedge, Jason Gooley joins us to talk about the origins of CHINOG, the challenges involved in running a small conference, some tips for those who would like to start a conference of this kind, and thoughts on the importance of community in the network engineering world.

download

One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton about the time she spent working on software development for the Apollo space program, and the lessons she learned about software development there. To wit—

Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.

Sounds simple in theory—but it is not in practice.

Let’s take, as an example, replacing some of the capacity in your data center designed on a rather traditional two-layer hierarchy, aggregation, and core. If you’ve built your network with a decent modular design, you buy enough new routers (or switches—but let’s use routers here) to build out a new aggregation module, the additional firewalls and other middleboxes you need, and the additional line cards to scale the core up. You unit test everything you can in the lab, understanding that you will not be able to fully test in the product network until you arrange a maintenance window. If you’re automating things, you build (and potentially test) the scripts—if you are smart, you will test these scripts in a virtual environment before using them.

You arrange the maintenance window, install the hardware, and … run the scripts. If it works, you go to bed, take a long nap, and get back to work doing “normal maintenance stuff” the next day. Of course, it rarely works, so you preposition some energy bars, make certain you have daycare plans, and put the vendor’s tech support number on speed dial.

What’s wrong with this picture? Well, many things, but primarily: this is not engineering. Was there any thought put into how to test beyond the individual unit level? Is there any way to test realistic traffic flows while connecting the new module to the network without impacting the rest of the network’s operation? Is there any real rollback plan in case things go wrong? Can there be?

In “modern” network design, none of these things tend to exist because they cannot exist. They cannot exist because we have not truly learned to do design life-cycles or truly modular designs. In the software world, if you don’t do modular design, it’s either because you didn’t think it through, or because you thought it through and decided the trade-off just wasn’t worth it. In the networking world, we play around the edges of resilient, modular designs, but networking folks don’t tend to know the underlying technologies—and how they work—well enough to understand how to divide a problem into modules correctly, and the interfaces between those modules.

Let’s consider the same example, but with some engineering principles applied. Instead of a traditional two-layer hierarchy, you have a single-SKU spine and leaf fabric with clearly defined separation between the fabric and pods, clearly defined underlay and overlay protocols, etc. Now you can build a pod and test it against a “fake fabric” before attaching it to the production fabric, including any required automation. Then you can connect the pod to the production fabric and bring up just the underlay protocol, testing the entire underlay before pushing the overlay out to the edge. Then you can push the overlay to the edge and test that before putting any workload on the new pod. Then you can test fake load on the new pod before pushing production traffic onto the pod…

Each of these tests, other than the initial test against a lab environment, can take place on the production network with little or no risk to the entire system. You’re not physically modifying current hardware (except plugging in new cables!), so it’s easy to roll changes back. You know the lower layer parts work before putting the higher layer parts in place. Because the testing happens on the real network, these are canaries rather than traditional “certification” style tests. Because you have real modularization, you can fail fast without causing major harm to any system. Because you are doing things in stages, you can build tests that determine clean and correct operation before moving to the next stage.

This is an engineered solution—thought has been put into proper modules, how those modules connect, what information is carried across those modules, etc. Doing this sort of work requires knowing more than how to configure—or automate—a set of protocols based on what a vendor tells you to do. Doing this sort of work requires understanding what failure looks like at each point in the cycle and deciding whether to fail out or fix it.

It may not meet the “formal” process mathematicians might prefer, but neither is it the “move fast and break stuff” attitude many see in “the Valley.” It is fail fast, but not fail foolishly. And its where we need to move to retain the title of “engineer” and not lose the confidence of the businesses who pay us to build networks that work.

I failed to include the right categories the first time, so this didn’t make it into the podcast catcher feeds correctly…

Network engineering and operations are both “mental work” that can largely be done remotely—but working remote is not only great in many ways, it is also often fraught with problems. In this episode of the Hedge, Roland Dobbins joins Tom and Russ to discuss the ins and outs of working remote, including some strategies we have found effective at removing many of the negative aspects.

download