Used to Mean… Now Means…
sarcasm warning—take the following post with a large grain of salt
A thousand years from now, when someone is writing the history of computer networks, one thing they will notice—at least I think they will—is how we tend to reduce our language so as many terms as possible have precisely the same meaning. They might attribute this to marketing, or the hype cycle, or… but whatever the cause this is clearly a trend in the networking world. Some examples might be helpful, so … forthwith, the reduced terminology of the networking world.
Software Defined Networking (SDN): Used to mean a standardized set of interfaces that enabled open access to the forwarding hardware. Came to mean some form of control plane centralization. Now means automated configuration and management of network devices, centralized control planes, traffic engineering, and just about anything else that seems remotely related to these.
Fabric: Used to mean a regular, non-planar, repeating network topology with scale-out characteristics. Now means any vaguely hierarchical topology (not a ring) with a lot of links.
DevOps: Used to mean applying software development processes to the configuration, operation, and troubleshooting of server and network devices. Now means the same thing as SDN.
Clos: Used to mean a three-stage fabric in which every device in a prior stage is connected to every device in the next stage, all devices have the same number of ports, all traffic is east/west, and the whole has scale-out characteristics. Now means the same thing as fabric, and is spelled CLOS because—aren’t all four letter words abbreviations? Now external links are commonly attached to the “core” of the Clos, because… well, it kind of looks hierarchical, after all.
Hierarchical Design: Used to mean a network design with a modular layered design, and specific functions tied to each layer of the network. Generally there were two or three layers, with clear failure domain separation through aggregation and summarization of control plane information. Now means the same thing as fabric.
Cloud: Used to mean the centralization and abstraction of resources to support agile development strategies. Now means… well… the meaning is cloudy at this time, but generally applied to just about anything. Will probably end up meaning the same thing as DevOps, SDN, and fabric.
Network Topology: Used to mean a description of the interconnection system used in building a network. Some kinds of topologies were hub-and-spoke, ring, partial mesh, Clos, Benes, butterfly, full mesh, etc. Now means the same as fabric.
Routing Protocol: Used to mean the protocol, including the semantics and algorithm or heuristic, used to calculate the set of loop-free paths through a network. Includes instances such as IS-IS, EIGRP, and OSPF. Now means BGP, as this is the only protocol used in any production network (except SDN).
Router: Used to mean a device that determines the next hop to which the packet should be forwarded using the layer 3 address, replacing the layer 2 header in the process of forwarding the packet. Now means the same thing as a switch.
Switch: Used to mean a device that determined the port through which a packet should be forwarded based on the layer 2 header, and did not modify the packet in the process. Now means any device that forwards packets; has generally replaced “router.”
Security: Used to mean thinking through attack surfaces, understanding protocols and their operation, and how to build a system that is difficult to attack. Now means inserting a firewall into the network.
We used to have a rich set of terms we could use to describe different kinds of topologies, devices, and ways of building networks. We seem to want to insist on merging as many terms as possible so they all mean the same thing; we are quickly reducing ourselves to fabric, switch, SDN, and cloud to describe everything.
Which makes me wonder sometimes—what are they teaching in network engineering classes nowadays?
Lessons Learned from the Robustness Principle
The Internet, and networking protocols more broadly, were grounded in a few simple principles. For instance, there is the end-to-end principle, which argues the network should be a simple fat pipe that does not modify data in transit. Many of these principles have tradeoffs—if you haven’t found the tradeoffs, you haven’t looked hard enough—and not looking for them can result in massive failures at the network and protocol level.
Another principle networking is grounded in is the Robustness Principle, which states: “Be liberal in what you accept, and conservative in what you send.” In protocol design and implementation, this means you should accept the widest range of inputs possible without negative consequences. A recent draft, however, challenges the robustness principle—draft-iab-protocol-maintenance.
According to the authors, the basic premise of the robustness principle lies in the problem of updating older software for new features or fixes at the scale of an Internet sized network. The general idea is a protocol designer can set aside some “reserved bits,” using them in a later version of the protocol, and not worry about older implementations misinterpreting them—new meanings of old reserved bits will be silently ignored. In a world where even a very old operating system, such as Windows XP, is still widely used, and people complain endlessly about forced updates, it seems like the robustness principle is on solid ground in this regard.
The argument against this in the draft is that implementing the robustness principle allows a protocol to degenerate over time. Older implementations are not removed from service because they still work, implementations are not updated in a timely manner, and the protocol tends to accumulate an ever-increasing amount of “dead code” in the form of older expressions of data formats. Given an infinite amount of time, an infinite number of versions of any given protocol will be deployed. As a result, the protocol can and will break in an infinite number of ways.
The logic of the draft is something along the lines of: old ways of doing things should be removed from protocols which are actively maintained in order to unify and simplify the protocol. At least for actively maintained protocols, reliance on the robustness principle should be toned down a little.
Given the long list of examples in the draft, the authors make a good case.
There is another side to the argument, however. The robustness principle is not “just” about keeping older versions of software working “at scale.” All implementations, no matter how good their quality, have defects (or rather, unintended features). Many of these defects involve failing to release or initialize memory, failing to bounds check inputs, and other similar oversights. A common way to find these errors is to fuzz test code—throw lots of different inputs at it to see if it spits up an error or crash.
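As a sketch of how fuzz testing works—using a toy, hypothetical length-prefixed parser rather than any real protocol implementation—you throw thousands of random byte strings at the code and check that it only ever fails in well-defined ways:

```python
import random

def parse_length_prefixed(data: bytes) -> bytes:
    # Toy format: first byte is a payload length, the rest is the payload
    if len(data) < 1:
        raise ValueError("empty input")
    n = data[0]
    if len(data) - 1 < n:
        raise ValueError("truncated payload")
    return data[1:1 + n]

random.seed(1)
for _ in range(10_000):
    blob = bytes(random.randrange(256) for _ in range(random.randrange(64)))
    try:
        parse_length_prefixed(blob)
    except ValueError:
        pass  # a defined error is fine; a crash, hang, or overrun is a defect
```

A fuzzer for real code would use coverage-guided tooling rather than pure random input, but the principle is the same: inputs the programmer never imagined are exactly the ones that expose missing bounds checks.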
The robustness principle runs deeper than infinite versions—it also helps implementations deal with defects in “the other end” that generate bad data. The robustness principle, then, can help keep a network running even in the face of an implementation defect.
Where does this leave us? Abandoning the robustness principle is clearly not a good thing—while the network might end up being more correct, it might also end up simply not running. Ever. The Internet is an interlocking system of protocols, hardware, and software; the robustness principle is the lubricant that makes it all work at all.
Clearly, then, there must be some sort of compromise position that will work. Perhaps a two pronged attack might work. First, don’t discard errors silently. Instead, build logging into software that catches all errors, regardless of how trivial they might seem. This will generate a lot of data, but we need to be clear on the difference between instrumenting something and actually paying attention to what is instrumented. Instrumenting code so that “unknown input” can be caught and logged periodically is not a bad thing.
Second, perhaps protocols need some way to end of life older versions. Part of the problem with the robustness principle is it allows an infinite number of versions of a single protocol to exist in the same network. Perhaps the IETF and other standards organizations should rethink this, explicitly taking older ways of doing things out of specs on a periodic basis. A draft that says “you shouldn’t do this any longer,” or “this is no longer in accordance with the specification,” would not be a bad thing.
For the more “average” network engineer, this discussion around the robustness principle should lead to some important lessons, particularly as we move ever more deeply into an automated world. Be clear about versioning of APIs and components. Deprecate older processes when they should no longer be used.
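In code, being clear about versioning can be as simple as making the supported, deprecated, and retired sets explicit—the version numbers below are hypothetical:

```python
SUPPORTED = {2, 3}    # versions fully in service (hypothetical)
DEPRECATED = {1}      # still accepted, but loudly, with removal in view

def negotiate(version: int) -> int:
    if version in SUPPORTED:
        return version
    if version in DEPRECATED:
        print(f"warning: API version {version} is deprecated; plan the upgrade")
        return version
    raise ValueError(f"API version {version} is no longer supported")
```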
Control your technical debt, or it will control you.
The End of Specialization?
There is a rule in sports and music about practice—the 10,000 hour rule—which says that if you want to be an expert on something, you need ten thousand hours of intentional practice. The corollary to this rule is: if you want to be really good at something, specialize. In colloquial language, you cannot be both a jack of all trades and a master of one.
Translating this to the network engineering world, we might say something like: it takes 10,000 hours to really know the full range of products from vendor x and how to use them. Or perhaps: only after you have spent 10,000 hours of intentional study and practice in building data center networks will you know how to build these things. We might respond to this challenge by focusing our studies and time in one specific area, gaining one series of certifications, learning one vendor’s gear, or learning one specific kind of work (such as design or troubleshooting).
This line of thinking, however, should immediately raise two questions. First, is it true? Anecdotal evidence for this kind of thinking seems to abound; we have all heard of the child prodigy who spent their entire life focusing on a single sport. We also all know people who have “paper skills” instead of “real skills;” we often attribute this to their not having done enough lab work, or not having put in the hours configuring, troubleshooting, and working on the piece of gear in question. Second, is it healthy for the person, or for the organization the person works for?
To make matters worse, we often see this show up in the job hunting process. The manager wants someone who can “hit the ground running” on this project, using this piece of equipment, and they want them on board and working tomorrow. In response, we see job descriptions and recruiting drives for specific skill sets, down to individual hardware and software.
I recently ran across two articles that push back on this kind of thinking, arguing the 10,000 hour rule way of learning does not work. In the sports article, what turns out to be the most effective learning strategy—broad sampling across many activities early on—often looks just like falling behind.
Another recent article pointed out that deep expertise seems to be losing its sway in many workplaces. The author spends time aboard a new United States Navy littoral ship, a class designed to operate with much smaller crews—one-half to one-third of a comparably sized ship staffed in the traditional way. How do these ships operate? By cross-training crew members to do many different tasks.
One of the interesting things this latter article points out is that the ability to do many different tasks requires fluid intelligence, which is a completely different set of skills from crystallized intelligence. Fluid intelligence, it seems, becomes stronger over time, peaking much later in life. While the article does not discuss how to develop the kind of fluid intelligence that will serve you well later in life, when it overtakes your narrower skill sets, it makes sense that building a broader set of skills over time is a more likely path than following the 10,000 hour rule.
There is, however, one question neither author spends much time discussing: if you are not focusing on learning one thing, then how, and on what, should you spend your learning time? For the top athletes in the sports article, it seems they spent a lot of time in different kinds of physical activity. There was an area of focus, but not the kind of narrow focus we normally associate with being excellent at one sport. In the same way, the sailors in the second article were all focused on a broader area—anything required to run a ship. Again, there is focus, but not the kind of narrow focus you might expect on more traditionally staffed ships, where one set of sailors focuses on working the lines while another focuses on navigating. The focus is still there, then—it is just a broader focus.
Why and how does this work? My guess is it works because the skills you learn in dancing, for instance, will help you learn better footwork in boxing and other sports (an example given in the sports article linked above). The skill you learn in handling the lines will help you understand the lay and movement of the boat in ways that are helpful in navigation. These skills, in other words, are somewhat adjacent.
But these skills are more than adjacent. Many of them are also basic, or theoretical, in ways we do not value in the network engineering world. The point I often hear made is: I don’t care about how BGP really works, so long as I can write a script that configures it, and I can troubleshoot it when it breaks. Or: I actually work on vendor x model 1234 all day, what I really need to know to be effective is how to configure it… when I need to replace that piece of gear, I will learn the next one so I can keep doing my job.
My point is this: this way of building skills, this way of working, does not “work” in the long term. There will come a point in your life, and in the life of your company, when point skills will weaken and lose their importance. The research, and experience, shows the better way to learn is to take on the long game, to learn the theory, and to practice the theory in many different settings, rather than focusing too deeply on one thing.
The Floating Point Fix
Floating point is not something many network engineers think about. In fact, when I first started digging into routing protocol implementations in the mid-1990s, I discovered one of the tricks you needed to remember when trying to replicate a router’s metric calculation was to always round down. EIGRP, like most of the rest of Cisco’s IOS at the time, was written for processors that did not support floating point operations. The silicon and processing time costs were just too high.
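The round-down trick is visible if you sketch the classic (narrow-metric) EIGRP composite with the default K values, using integer division throughout:

```python
def eigrp_metric(min_bw_kbps: int, total_delay_usec: int) -> int:
    # Classic EIGRP composite with default K values:
    #   metric = 256 * (10^7 / min_bandwidth_kbps + delay_in_tens_of_usec)
    # Floor (integer) division everywhere—"always round down"
    return 256 * (10_000_000 // min_bw_kbps + total_delay_usec // 10)

# A T1 link: 1544 kbps, 20,000 microseconds of total delay
print(eigrp_metric(1544, 20_000))  # 2169856, the familiar T1 metric
```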
What brings all this to mind is a recent article on the problems with floating point performance over at The Next Platform by Michael Feldman.
For those who have not spent a lot of time in the coding world, a floating point number is one that has some number of digits after the decimal point. While integers are fairly easy to represent and calculate with in the binary arithmetic processors use, floating point numbers are much more difficult, because many of them cannot be represented exactly in binary. The number of bits available to represent the number makes a very large difference in accuracy. For instance, if you try to store the number 101.1 in a float, you will find the number stored is actually 101.099998. To store 101.1 more accurately, you need a double, which is twice as long as a float.
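You can see this directly by round-tripping a value through IEEE 754 single precision with Python’s standard struct module:

```python
import struct

def as_float32(x: float) -> float:
    # Pack into 4-byte IEEE 754 single precision, then unpack again
    return struct.unpack('<f', struct.pack('<f', x))[0]

print(as_float32(101.1))  # roughly 101.099998..., not 101.1
print(101.1)              # the 64-bit double is closer, but still inexact
```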
Okay—this all might be fascinating, but who cares? Scientists, mathematicians, and … network engineers do, as a matter of fact. First, carrying around double floats to store numbers with higher precision means a lot more network traffic. Second, when you start looking at timestamps and large amounts of telemetry data, the efficiency and accuracy of number storage become a rather big deal.
Okay, so the current floating point storage format, IEEE 754, is inaccurate and rather inefficient. What should be done about this? According to the article, John Gustafson, a computer scientist, has been pushing for the adoption of a replacement called posits. Quoting the article once again:
It does this by using a denser representation of real numbers. So instead of the fixed-sized exponent and fixed-sized fraction used in IEEE floating point numbers, posits encode the exponent with a variable number of bits (a combination of regime bits and the exponent bits), such that fewer of them are needed, in most cases. That leaves more bits for the fraction component, thus more precision.
Did you catch why this is more efficient? Because it uses a variable length field. In other words, posits replace a fixed field structure (like the one originally used in OSPFv2) with a variable length field (like those used in IS-IS). While you must spend some bits in the format to carry the length information, the amount of "unused space" in the current fixed formats overwhelms the space spent, resulting in an improvement in accuracy. Further, many numbers that require a double today can be carried in the size of a float. Not only does using a TLV-like format increase accuracy, it also increases efficiency.
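The tradeoff is the same one any TLV encoding makes; a minimal, illustrative TLV codec (not any particular protocol’s format) looks like this:

```python
import struct

def encode_tlv(ftype: int, value: bytes) -> bytes:
    # One byte of type, one byte of length, then the variable-size value
    return struct.pack('BB', ftype, len(value)) + value

def decode_tlvs(data: bytes) -> list:
    fields, i = [], 0
    while i < len(data):
        ftype, length = data[i], data[i + 1]
        fields.append((ftype, data[i + 2:i + 2 + length]))
        i += 2 + length
    return fields

msg = encode_tlv(1, b'\x0a') + encode_tlv(2, b'\x01\x02\x03\x04')
print(decode_tlvs(msg))
```

The type and length bytes are the overhead; the payoff is that a one-byte value no longer has to occupy the space of the largest value the format must ever carry.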
From the perspective of the State/Optimization/Surface (SOS) tradeoff, there should be some increase in complexity somewhere in the overall system—if you have not found the tradeoffs, you have not looked hard enough. Indeed, what we find is there is an increase in the amount of state being carried in the data channel itself; there is additional state, and additional code that knows how to deal with this new way of representing numbers.
It's always interesting to find situations in other information technology fields where discussions parallel to discussions in the networking world are taking place. Many times, you can see people encountering the same design tradeoffs we see in network engineering and protocol design.
Design Intelligence from the Hourglass Model
Over at the Communications of the ACM, Micah Beck has an article up about the hourglass model. While the math is quite interesting, I want to focus on transferring the observations from the realm of protocol and software systems development to network design. Specifically, start with the concept and terminology, which is very useful. Taking a typical design, such as this—

The first key point made in the paper is this—
The thin waist of the hourglass is a narrow straw through which applications can draw upon the resources that are available in the less restricted lower layers of the stack.
A somewhat obvious point to be made here is applications can only use services available in the spanning layer, and the spanning layer can only build those services out of the capabilities of the supporting layers. If fewer applications need to be supported, or the applications deployed do not require a lot of “fancy services,” a weaker spanning layer can be deployed. Based on this, the paper observes—
The balance between more applications and more supports is achieved by first choosing the set of necessary applications N and then seeking a spanning layer sufficient for N that is as weak as possible. This scenario makes the choice of necessary applications N the most directly consequential element in the process of defining a spanning layer that meets the goals of the hourglass model.
Beck calls the weakest possible spanning layer to support a given set of applications the minimally sufficient spanning layer (MSSL). There is one thing that seems off about this definition, however—the correlation between the number of applications supported and the strength of the spanning layer. There are many cases where a network supports thousands of applications, and yet the network itself is quite simple. There are many other cases where a network supports just a few applications, and yet the network is very complex. It is not the number of applications that matter, it is the set of services the applications demand from the spanning layer.
Based on this, we can change the definition slightly: an MSSL is the weakest spanning layer that can provide the set of services required by the applications it supports. This might seem intuitive or obvious, but it is often useful to work these kinds of intuitive things out, so they can be expressed more precisely when needed.
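Rendered loosely as code—the service names here are invented for illustration—the revised definition reads: among the spanning layers that cover the required services, pick the weakest.

```python
def minimally_sufficient(spanning_layers, required):
    # Candidates are layers whose service set covers what applications need;
    # the MSSL is the weakest of these (fewest services exposed)
    candidates = [layer for layer in spanning_layers if required <= layer]
    return min(candidates, key=len) if candidates else None

layers = [
    {"best-effort"},                                   # too weak
    {"best-effort", "multicast"},                      # just enough
    {"best-effort", "multicast", "traffic-steering"},  # stronger than needed
]
print(minimally_sufficient(layers, {"best-effort", "multicast"}))
```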
First lesson: the primary driver in network complexity is application requirements. To make the network simpler, you must reduce the requirements applications place on the network.
There are, however, several counter-intuitive cases here. For instance, TCP is designed to emulate (or abstract) a circuit between two hosts—it creates what appears to be a flow controlled, error free channel with no drops on top of IP, which has no flow control, and drops packets. In this case, the spanning layer (IP), or the wasp waist, does not support the services the upper layer (the application) requires.
In order to make this work, TCP must add a lot of complexity that would normally be handled by one of the supporting layers—in fact, TCP might, in some cases, recreate capabilities available in one of the supporting layers but hidden by the spanning layer. There are, as you might have guessed, tradeoffs in this neighborhood. Not only are the mechanisms TCP must use more complex than the ones some supporting layer might have used, TCP represents a leaky abstraction—the underlying connectionless service cannot be completely hidden.
Take another instance more directly related to network design. Suppose you aggregate routing information at every point where you possibly can. Or perhaps you are using BGP route reflectors to manage configuration complexity and route counts. In most cases, this will mean information is flowing through the network suboptimally. You can re-optimize the network, but not without introducing a lot of complexity. Further, you will probably always have some form of leaky abstraction to deal with when abstracting information out of the network.
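The information loss is easy to demonstrate with Python’s standard ipaddress module—the prefixes are made up, but the effect is general:

```python
import ipaddress

specifics = [ipaddress.ip_network(p) for p in ("10.1.1.0/24", "10.1.2.0/24")]
aggregate = ipaddress.ip_network("10.1.0.0/22")

# The aggregate covers both real destinations...
assert all(s.subnet_of(aggregate) for s in specifics)

# ...but it also advertises reachability to prefixes we do not actually have
extra = [n for n in aggregate.subnets(new_prefix=24) if n not in specifics]
print(extra)  # the /24s the aggregate claims but the network cannot reach
```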
Second lesson: be careful when stripping information out of the spanning layer in order to simplify the network. There will be tradeoffs, and sometimes you end up with more complexity than what you started with.
A second counter-intuitive case is that of adding complexity to the supporting layers in order to ultimately simplify the spanning layer. It seems, on the model presented in the paper, that adding more services to the spanning layer will always end up adding more complexity to the entire system. MPLS and Segment Routing (SR), however, show this is not always true. If you need traffic steering, for instance, it is easier to implement MPLS or SR in the support layer rather than trying to emulate their services at the application level.
Third lesson: sometimes adding complexity in a lower layer can simplify the entire system—although this might seem to be counter-intuitive from just examining the model.
The bottom line: complexity is driven by applications (top down), but understanding the full stack, and where interactions take place, can open up opportunities for simplifying the overall system. The key is thinking through all parts of the system carefully, using effective mental models to understand how they interact (interaction surfaces), and considering the optimization tradeoffs you will make by shifting state to different places.
DORA, DevOps, and Lessons for Network Engineers
DevOps Research and Assessment (DORA) released their 2018 Accelerate report on the state of DevOps at the end of 2018; I’m a little behind in my reading, so I just got around to reading it, and trying to figure out how to apply their findings to the infrastructure (networking) side of the world.
DORA found organizations that outsource entire functions, such as building an entire module or service, tend to perform more poorly than organizations that outsource by integrating individual developers into existing internal teams (page 43). It is surprising companies still think outsourcing entire functions is a good idea, given the many years of experience the IT world has with the failures of this model. Outsourced components, it seems, too often become a bottleneck in the system, especially as contracts constrain your ability to react to real-world changes. Beyond this, outsourcing an entire function not only moves the work to an outside organization, but also the expertise. Once you have lost critical mass in an area, and any opportunity for employees to learn about that area, you lose control over that aspect of your system.
DORA also found a correlation between faster delivery of software and reduced Mean Time To Repair (MTTR) (page 19). On the surface, this makes sense. Shops that deliver software continuously are bound to have faster, more regularly exercised processes in place for developing, testing, and rolling out a change. Repairing a fault or failure requires change; anything that improves the speed of rolling out a change is going to drive MTTR down.
Organizations that emphasize monitoring and observability tended to perform better than others (page 55). This has major implications for network engineering, where telemetry and management are often “bolted on” as an afterthought, much like security. This is clearly not optimal, however—telemetry and network management need to be designed and operated like any other application. Data sources, stores, presentation, and analysis need to be segmented into separate services, so new services can be tried out on top of existing data, and new sources can feed into existing services. Network designers need to think about how telemetry will flow through the management system, including where and how it will originate, and what it will be used for.
These observations about faster delivery and observability should drive a new way of thinking about failure domains; while failure domains are often primarily thought of as reducing the “blast radius” when a router or link fails, they serve two much larger roles. First, failure domain boundaries are good places to gather telemetry because this is where information flows through some form of interaction surface between two modules. Information gathered at a failure domain boundary will not tend to change as often, and it will often represent the operational status of the entire module.
Second, well-placed failure domain boundaries can be used to stake out areas where “new things” can be put into operation with some degree of confidence. If a network has well-designed failure domain boundaries, it is much easier to deploy new software, hardware, and functionality in a controlled way. This enables a more agile view of network operations, including the ability to roll out changes incrementally through a canary process, and to use processes like chaos monkey to understand and correct unexpected failure modes.
Another interesting observation is the j-curve of adoption (page 3):

This j-curve shows the “tax” of building the underlying structures needed to move from a less automated state to a more automated one. Keith’s Law operates in part because of this j-curve. Do not be discouraged if it seems to take a lot of work to make small amounts of progress in many stages of system development—the results will come later.
The bottom line: it might seem like a report about software development is too far outside the realm of network engineering to be useful—but the reality is network engineers can learn a lot about how to design, build, and operate a network from software engineers.
Cloudy with a chance of lost context
One thing we often forget in our drive to “more” is this—context matters. Ivan hit on this just a few days ago with a post about the impact of controller failures in software-defined networks.
What is the context in terms of a network controller? To broaden the question—what is the context for anything? The context, in philosophical terms, is the telos, the final end, or the purpose of the system you are building. Is the point of the network to provide neat gee-whiz capabilities, or is it to move data in a way that creates value for the business?
Ivan’s article reminded me of another article I read recently, by Bob Frankston, about the larger world of cloud, IoT, and context problems.
We are currently building networks with the presupposition that cloud based services are more secure, more agile, and more reliable than we can build locally. Somehow we think “the cloud” will never—can never—fail.
But of course, cloud services can fail. We know this because they have. Office 365 has failed a number of times, including this one. Azure failed at least once in 2014 due to operator error. Dyn was taken down by a DDoS attack not so long ago. Google cloud failed not too long ago, as well. Back in 2016, Salesforce suffered a major outage.
The point is not that we should not use cloud services. The point is that systems should be designed so any form of failure in a centralized component should not cause local services, at least at some minimal level, to fail. And yet, how many networks today rely on cloud-based services to operate at all, much less correctly? Would your company’s email still be delivered if your cloud based anti-spam software service failed? Would your company’s building locks work if the cloud based security service you use goes down?
If they would not, do you have a contingency plan in place for when these services do fail? Can you open the doors manually? Do you have a way to turn off the spam filtering, or to manually configure the network to “get by” until service can be restored? Do you know which data flows must be supported in the case of an outage of this sort, or what the impact on the entire system will be? Do you have a plan to restore the network and its services to their optimal state once the external services they rely on are available again?
A larger point can be made here: when pulling services into the network itself, we often rely on unexamined presuppositions, and we often leave the system with unexamined and little-understood risks.
Context matters—what is it you are trying to do? Do you really need to do it this way? What happens if the centralized or cloudy system you rely on fails? It’s not that we should not use or deploy cloud-based systems—but we should try to expose and understand the failure modes and risks we are building into our networks.
