The End of Specialization?

There is a rule in sports and music about practice—the 10,000 hour rule—which says that if you want to be an expert on something, you need ten thousand hours of intentional practice. The corollary to this rule is: if you want to be really good at something, specialize. In colloquial language, you cannot be both a jack of all trades and a master of one.

Translating this to the network engineering world, we might say something like: it takes 10,000 hours to really know the full range of products from vendor x and how to use them. Or perhaps: only after you have spent 10,000 hours of intentional study and practice in building data center networks will you know how to build these things. We might respond to this challenge by focusing our studies and time in one specific area, gaining one series of certifications, learning one vendor’s gear, or learning one specific kind of work (such as design or troubleshooting).

This line of thinking, however, should immediately raise two questions. First, is it true? Anecdotal evidence for it seems to abound; we have all heard of the child prodigy who spent their entire life focusing on a single sport. We also all know people who have “paper skills” instead of “real skills;” the explanation we usually reach for is that they have not done enough lab work, or have not put in the hours configuring, troubleshooting, and working on the piece of gear in question. Second, is it healthy, either for the person or for the organization the person works for?

To make matters worse, we often see this show up in the job hunting process. The manager wants someone who can “hit the ground running” on this project, using this piece of equipment, and they want them on board and working tomorrow. In response, we see job descriptions and recruiting drives for specific skill sets, down to individual hardware and software.

I recently ran across two articles that push back on this 10,000 hour rule way of learning. From the first:

Over time, as I delved further into studies about learning and specialisation, I came across more and more evidence that it takes time to develop personal and professional range – and that there are benefits to doing so. I discovered research showing that highly credentialed experts can become so narrow-minded that they actually get worse with experience, even while becoming more confident (a dangerous combination). And I was stunned when cognitive psychologists I spoke with led me to an enormous and too-often ignored body of work demonstrating that learning itself is best done slowly to accumulate lasting knowledge, even when that means performing poorly on tests of immediate progress. That is, the most effective learning looks inefficient – it looks like falling behind.

Re-read that last sentence—what turns out to be the most effective learning strategy often looks just like falling behind. Another recent article pointed out that deep expertise seems to be losing its sway in many workplaces. The author spends time around the new United States Navy littoral combat ships, which are designed to operate with much smaller crews—one-half to one-third the crew of a comparably sized ship staffed in the traditional way. How do these ships operate? By cross training crew members to be able to do many different tasks.

One of the interesting things this latter article points out is that this ability to do many different tasks requires fluid intelligence, which is a completely different set of skills from crystallized intelligence. Fluid intelligence, it seems, becomes stronger over time, peaking much later in life. While the article does not discuss how to develop the kind of fluid intelligence that will serve you well later in life, when this kind of thinking overtakes your narrower skill sets, it makes sense that building a broader set of skills over time is a more likely path than following the 10,000 hour rule.

There is, however, one question neither author spends much time discussing: if you are not focusing on learning one thing, then how, and on what, should you focus your learning? For the top athletes in the sports article, it seems they spent a lot of time on many different kinds of physical activity. There was an area of focus, but it was not the kind of narrow focus we normally associate with being excellent at one sport. In the same way, the sailors in the second article were all focused on a broader area—anything required to run a ship. Again, there is focus, but not the kind of narrow focus you might expect on a more traditionally staffed ship, where one set of sailors focuses on working the lines while another focuses on navigation. The focus is still there, then—it is just a broader focus.

Why and how does this work? My guess is it works because the skills you learn in dancing, for instance, will help you learn better footwork in boxing and other sports (an example given in the sports article linked above). The skill you learn in handling the lines will help you understand the lay and movement of the boat in ways that are helpful in navigation. These skills, in other words, are somewhat adjacent.

But these skills are more than adjacent. Many of them are also basic, or theoretical, in ways we do not value in the network engineering world. The point I often hear made is: I don’t care about how BGP really works, so long as I can write a script that configures it, and I can troubleshoot it when it breaks. Or: I actually work on vendor x model 1234 all day; what I really need to know to be effective is how to configure it… when I need to replace that piece of gear, I will learn the next one so I can keep doing my job.

My point is this: this way of building skills, this way of working, does not “work” in the long term. There will come a point in your life, and in the life of your company, when point skills weaken and lose their importance. The research, and experience, show the better way to learn is to play the long game, to learn the theory, and to practice the theory in many different settings, rather than focusing too deeply on one thing.

Disaggregation and Business Value

I recently spoke at CHINOG on the business value of disaggregation, and participated in a panel on getting involved in the IETF. If you’re interested in these two talks, the videos are linked below.

The Floating Point Fix

Floating point is not something many network engineers think about. In fact, when I first started digging into routing protocol implementations in the mid-1990s, I discovered one of the tricks you needed to remember when trying to replicate a router’s metric calculation was to always round down. EIGRP, like most of the rest of Cisco’s IOS at the time, was written for processors that did not support floating point operations. The silicon and processing time costs were just too high.
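
As an illustration, here is a minimal sketch of the classic EIGRP composite metric computed with integer-only arithmetic, assuming the default K values (K1 = K3 = 1, K2 = K4 = K5 = 0); integer division naturally truncates, which is exactly the “always round down” behavior:

```python
def eigrp_metric(min_bw_kbps: int, total_delay_usec: int) -> int:
    """Classic EIGRP composite metric, default K values, integer-only.

    Integer division (//) truncates, matching the "always round down"
    behavior of routers without floating point hardware.
    """
    scaled_bw = 10_000_000 // min_bw_kbps   # bandwidth term: 10^7 / slowest link
    scaled_delay = total_delay_usec // 10   # delay is carried in tens of usec
    return 256 * (scaled_bw + scaled_delay)

# A T1 (1544 kbps) path with 20,000 usec of total delay:
print(eigrp_metric(1544, 20_000))  # 2169856; 10**7 // 1544 floors to 6476
```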

What brings all this to mind is a recent article on the problems with floating point performance over at The Next Platform by Michael Feldman. According to the article:

While most programmers use floating point indiscriminately anytime they want to do math with real numbers, because of certain limitations in how these numbers are represented, performance and accuracy often leave something to be desired.

For those who have not spent a lot of time in the coding world, a floating point number is one that has some number of digits after the decimal point. While integers are fairly easy to represent and calculate over in the binary processors use, floating point numbers are much more difficult, because most decimal fractions have no exact binary representation. The number of bits you have available to represent the number makes a very large difference in accuracy. For instance, if you try to store the number 101.1 in a float, you will find the number stored is actually 101.099998. To store 101.1 with more accuracy, you need a double, which is twice as long as a float.
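
You can see this in a few lines of Python by forcing 101.1 through a 32-bit float and a 64-bit double with the standard struct module:

```python
import struct

# Round-trip 101.1 through IEEE 754 single (32-bit) and double (64-bit)
# precision; only the wider format has enough fraction bits to look right.
as_float  = struct.unpack('<f', struct.pack('<f', 101.1))[0]
as_double = struct.unpack('<d', struct.pack('<d', 101.1))[0]

print(f"32-bit float : {as_float:.6f}")   # 101.099998
print(f"64-bit double: {as_double:.6f}")  # 101.100000
```

Note that even the double is not exact; it is just close enough that the error hides below the printed precision.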

Okay—this might all be fascinating, but who cares? Scientists, mathematicians, and … network engineers do, as a matter of fact. First, carrying around double floats to store numbers with higher precision means a lot more network traffic. Second, when you start looking at timestamps and large amounts of telemetry data, the efficiency and accuracy of number storage becomes a rather big deal.

Okay, so the current floating point storage format, called IEEE 754, is inaccurate and rather inefficient. What should be done about this? According to the article, John Gustafson, a computer scientist, has been pushing for the adoption of a replacement called posits. Quoting the article once again:

It does this by using a denser representation of real numbers. So instead of the fixed-sized exponent and fixed-sized fraction used in IEEE floating point numbers, posits encode the exponent with a variable number of bits (a combination of regime bits and the exponent bits), such that fewer of them are needed, in most cases. That leaves more bits for the fraction component, thus more precision.

Did you catch why this is more efficient? Because it uses a variable length field. In other words, posits replace a fixed field structure (like what was originally used in OSPFv2) with a variable length field (like what is used in IS-IS). While you must eat some space in the format to carry the length, the amount of “unused space” in current fixed formats far outweighs the space spent on the length, resulting in an improvement in accuracy. Further, many numbers that require a double today can be carried in the size of a float. Not only does using a TLV format increase accuracy, it also increases efficiency.
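
To make the variable length field idea concrete, here is a toy decoder for small posits, based on the format description quoted above; it ignores rounding and treats the NaR (not a real) pattern as a NaN. This is a sketch for intuition, not a conforming implementation:

```python
def decode_posit(bits: int, nbits: int = 8, es: int = 1) -> float:
    """Decode an nbits-wide posit with es exponent bits (toy version).

    The regime is a run of identical bits after the sign bit; its
    length (variable!) sets a coarse scale, and whatever bits remain
    go to the fraction. Small-magnitude numbers thus get more
    fraction bits than an equivalent fixed-layout float.
    """
    mask = (1 << nbits) - 1
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float('nan')                  # NaR: "not a real"
    sign = (bits >> (nbits - 1)) & 1
    if sign:
        bits = (-bits) & mask                # posits negate in two's complement

    # Regime: run of identical bits, terminated by the opposite bit.
    first = (bits >> (nbits - 2)) & 1
    run, i = 0, nbits - 2
    while i >= 0 and ((bits >> i) & 1) == first:
        run, i = run + 1, i - 1
    k = run - 1 if first else -run
    i -= 1                                   # skip the terminating bit

    # Exponent: up to es bits (missing bits read as zero).
    exp = 0
    for _ in range(es):
        exp = (exp << 1) | ((bits >> i) & 1 if i >= 0 else 0)
        i -= 1

    # Fraction: everything left over.
    frac_bits = max(0, i + 1)
    frac = bits & ((1 << frac_bits) - 1)

    useed = 1 << (1 << es)                   # useed = 2^(2^es)
    value = (useed ** k) * (2 ** exp) * (1 + frac / (1 << frac_bits))
    return -value if sign else value

print(decode_posit(0b01000000))  # 1.0
print(decode_posit(0b01110000))  # longer regime -> 16.0 with es=1
```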

From the perspective of the State/Optimization/Surface (SOS) tradeoff, there should be some increase in complexity somewhere in the overall system—if you have not found the tradeoffs, you have not looked hard enough. Indeed, what we find is an increase in the amount of state carried in the data channel itself, along with additional code that knows how to deal with this new way of representing numbers.

It's always interesting to find situations in other information technology fields where discussions parallel to discussions in the networking world are taking place. Many times, you can see people encountering the same design tradeoffs we see in network engineering and protocol design.

Design Intelligence from the Hourglass Model

Over at the Communications of the ACM, Micah Beck has an article up about the hourglass model. While the math is quite interesting, I want to focus on transferring the observations from the realm of protocol and software systems development to network design. Specifically, start with the concept and the terminology, which are very useful.

The first key point made in the paper is this—

The thin waist of the hourglass is a narrow straw through which applications can draw upon the resources that are available in the less restricted lower layers of the stack.

A somewhat obvious point to be made here is that applications can only use services available in the spanning layer, and the spanning layer can only build those services out of the capabilities of the supporting layers. If fewer applications need to be supported, or the applications do not require a lot of “fancy services,” a weaker spanning layer will do. Based on this, the paper observes—

The balance between more applications and more supports is achieved by first choosing the set of necessary applications N and then seeking a spanning layer sufficient for N that is as weak as possible. This scenario makes the choice of necessary applications N the most directly consequential element in the process of defining a spanning layer that meets the goals of the hourglass model.

Beck calls the weakest possible spanning layer to support a given set of applications the minimally sufficient spanning layer (MSSL). There is one thing that seems off about this definition, however—the correlation between the number of applications supported and the strength of the spanning layer. There are many cases where a network supports thousands of applications, and yet the network itself is quite simple. There are many other cases where a network supports just a few applications, and yet the network is very complex. It is not the number of applications that matter, it is the set of services the applications demand from the spanning layer.

Based on this, we can change the definition slightly: an MSSL is the weakest spanning layer that can provide the set of services required by the applications it supports. This might seem intuitive or obvious, but it is often useful to work these kinds of intuitive things out, so they can be expressed more precisely when needed.
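
Stated a bit more formally (my notation, not Beck’s), if $S(a)$ is the set of services application $a$ demands, then for a set of necessary applications $N$:

$$\mathrm{MSSL}(N) \;=\; \min_{\preceq}\;\Bigl\{\, L \;:\; \bigcup_{a \in N} S(a) \subseteq \mathrm{services}(L) \,\Bigr\}$$

where $\preceq$ orders candidate spanning layers from weaker to stronger. The minimization runs over the services demanded, not over how many applications happen to be in $N$.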

First lesson: the primary driver in network complexity is application requirements. To make the network simpler, you must reduce the requirements applications place on the network.

There are, however, several counter-intuitive cases here. For instance, TCP is designed to emulate (or abstract) a circuit between two hosts—it creates what appears to be a flow controlled, error-free channel with no drops on top of IP, which has no flow control and drops packets. In this case, the spanning layer (IP), or the wasp waist, does not support the services the upper layer (the application) requires.

In order to make this work, TCP must add a lot of complexity that would normally be handled by one of the supporting layers—in fact, TCP might, in some cases, recreate capabilities available in one of the supporting layers, but hidden by the spanning layer. There are, as you might have guessed, tradeoffs in this neighborhood. Not only are the mechanisms TCP must use more complex than the ones some supporting layer might have used, TCP represents a leaky abstraction—the underlying connectionless service cannot be completely hidden.
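
As a small illustration of the kind of complexity TCP takes on, here is a toy stop-and-wait reliability layer over UDP: sequence numbers, acknowledgments, a retransmit timer. This is a sketch of the general technique, not TCP itself; the destination is whatever your peer uses.

```python
import socket
import struct

def reliable_send(sock: socket.socket, dest: tuple, payload: bytes,
                  seq: int, timeout: float = 0.5, retries: int = 5) -> None:
    """Send one segment stop-and-wait style over an unreliable channel.

    Everything here, from the sequence number to the ACK wait, the
    timer, and the retransmission loop, is complexity added above the
    spanning layer to fake a circuit on top of a lossy datagram service.
    """
    segment = struct.pack('!I', seq) + payload   # 4-byte sequence number
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(segment, dest)
        try:
            ack, _ = sock.recvfrom(4)
            if struct.unpack('!I', ack)[0] == seq:
                return                            # acknowledged; done
        except socket.timeout:
            continue                              # lost data or lost ACK: resend
    raise ConnectionError(f"no ACK for segment {seq}")
```

Even this toy leaks the abstraction: a lost ACK triggers a retransmission the receiver must recognize as a duplicate.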

Take another instance more directly related to network design. Suppose you aggregate routing information at every point where you possibly can. Or perhaps you are using BGP route reflectors to manage configuration complexity and route counts. In most cases, this will mean traffic flows through the network suboptimally. You can re-optimize the network, but not without introducing a lot of complexity. Further, you will probably always have some form of leaky abstraction to deal with when abstracting information out of the network.
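
A quick way to see the information loss, using Python’s standard ipaddress module and made-up prefixes: once two exit routers advertise only the summary, nothing downstream can tell which exit is closer to a particular /24.

```python
import ipaddress

# Four specifics collapse into one covering summary; the summary is
# smaller (less state), but the per-prefix detail is gone (less optimal).
specifics = [ipaddress.ip_network(p) for p in
             ('10.1.0.0/24', '10.1.1.0/24', '10.1.2.0/24', '10.1.3.0/24')]

print(list(ipaddress.collapse_addresses(specifics)))
# [IPv4Network('10.1.0.0/22')] -- four routes become one
```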

Second lesson: be careful when stripping information out of the spanning layer in order to simplify the network. There will be tradeoffs, and sometimes you end up with more complexity than you started with.

A second counter-intuitive case is that of adding complexity to the supporting layers in order to ultimately simplify the spanning layer. It seems, on the model presented in the paper, that adding more services to the spanning layer will always end up adding more complexity to the entire system. MPLS and Segment Routing (SR), however, show this is not always true. If you need traffic steering, for instance, it is easier to implement MPLS or SR in the support layer rather than trying to emulate their services at the application level.

Third lesson: sometimes adding complexity in a lower layer can simplify the entire system—although this might seem to be counter-intuitive from just examining the model.

The bottom line: complexity is driven by applications (top down), but understanding the full stack, and where interactions take place, can open up opportunities for simplifying the overall system. The key is thinking through all parts of the system carefully, using effective mental models to understand how they interact (interaction surfaces), and considering the optimization tradeoffs you make by shifting state to different places.

The Hedge


Since leaving the Network Collective as a co-host, I have been thinking about what to do “next” in the podcast space. While I will continue recording the History of Networking series until I either run out of guests or energy, it seems like an opportune time to build something new.

Hence—the Hedge, a new podcast published here at rule11.tech.

Why the Hedge? Because the people I work with in the engineering world remind me a lot of hedgehogs. If you scare them, they curl up into a ball of spikes and hiss. Because we always have problems with people hogging the (h)edge of the network. Because the (h)edge of the network and business is important. Because a hedge is a great place to gather just to have a conversation about what is going on in the world.

You can add your own bad hedge jokes here, because I’m out for the moment.

How will the Hedge be different? There will be no ads, sponsors, or memberships. Each episode will be under 40 minutes, which means you should be able to listen while working out, cooking dinner, eating lunch, or… well, whatever it is you can do in 30-40 minutes. There will be no planned topics, just people talking about whatever they care about.

How often will the Hedge be published? Hopefully (!) every two weeks, starting in the middle of July or so.

Will there be a feed for this thing? Sure, here on rule11.tech—and I’ll work on getting the feed into a couple of podcast directories once the first episode is published.

Who will be on the Hedge? Tom Ammon and Eyvonne Sharp will be on quite often, but other than that… who do you think should be on the Hedge?

DORA, DevOps, and Lessons for Network Engineers

DevOps Research and Assessment (DORA) released their 2018 Accelerate report on the state of DevOps at the end of 2018; I’m a little behind in my reading, so I just got around to reading it, and trying to figure out how to apply their findings to the infrastructure (networking) side of the world.

DORA found organizations that outsource entire functions, such as building an entire module or service, tend to perform more poorly than organizations that outsource by integrating individual developers into existing internal teams (page 43). It is surprising companies still think outsourcing entire functions is a good idea, given the many years of experience the IT world has with the failures of this model. Outsourced components, it seems, too often become a bottleneck in the system, especially as contracts constrain your ability to react to real-world changes. Beyond this, outsourcing an entire function not only moves the work to an outside organization, but also the expertise. Once you have lost critical mass in an area, and any opportunity for employees to learn about that area, you lose control over that aspect of your system.

DORA also found a correlation between faster delivery of software and reduced Mean Time To Repair (MTTR) (page 19). On the surface, this makes sense. Shops that deliver software continuously are bound to have faster, more regularly exercised processes in place for developing, testing, and rolling out a change. Repairing a fault or failure requires change; anything that improves the speed of rolling out a change is going to drive MTTR down.

Organizations that emphasize monitoring and observability tended to perform better than others (page 55). This has major implications for network engineering, where telemetry and management are often “bolted on” as an afterthought, much like security. This is clearly not optimal, however—telemetry and network management need to be designed and operated like any other application. Data sources, stores, presentation, and analysis need to be segmented into separate services, so new services can be tried out on top of existing data, and new sources can feed into existing services. Network designers need to think about how telemetry will flow through the management system, including where and how it will originate, and what it will be used for.
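
A minimal sketch of that segmentation, with illustrative names only: one source fans samples out through queues, so a new analysis service can subscribe to existing data without touching the collector or the store.

```python
import queue
import threading
import time

consumers: list[queue.Queue] = []

def subscribe() -> queue.Queue:
    """Register a new service on top of the existing telemetry stream."""
    q: queue.Queue = queue.Queue()
    consumers.append(q)
    return q

def publish(sample: dict) -> None:
    for q in consumers:                 # fan out; services stay independent
        q.put(sample)

def store_service(q: queue.Queue) -> None:
    while True:
        print("store:", q.get())        # stand-in for a time-series database

def alert_service(q: queue.Queue) -> None:
    while True:
        sample = q.get()
        if sample["ifInErrors"] > 100:  # stand-in for real analysis
            print("alert:", sample["device"])

threading.Thread(target=store_service, args=(subscribe(),), daemon=True).start()
threading.Thread(target=alert_service, args=(subscribe(),), daemon=True).start()

publish({"device": "rtr1", "ifInErrors": 250})
time.sleep(0.2)                         # let the daemon threads drain
```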

These observations about faster delivery and observability should drive a new way of thinking about failure domains; while failure domains are often primarily thought of as reducing the “blast radius” when a router or link fails, they serve two much larger roles. First, failure domain boundaries are good places to gather telemetry because this is where information flows through some form of interaction surface between two modules. Information gathered at a failure domain boundary will not tend to change as often, and it will often represent the operational status of the entire module.

Second, well-placed failure domain boundaries can be used to stake out areas where “new things” can be put into operation with some degree of confidence. If a network has well-designed failure domain boundaries, it is much easier to deploy new software, hardware, and functionality in a controlled way. This enables a more agile view of network operations, including the ability to roll out changes incrementally through a canary process, and to use processes like Chaos Monkey to understand and correct unexpected failure modes.

Another interesting observation is the j-curve of adoption (page 3).

This j-curve shows the “tax” of building the underlying structures needed to move from a less automated state to a more automated one. Keith’s Law:

In a complex system, the cumulative effect of a large number of small optimizations is externally indistinguishable from a radical leap.

…operates in part because of this j-curve. Do not be discouraged if it seems to take a lot of work to make small amounts of progress in many stages of system development—the results will come later.

The bottom line: it might seem like a report about software development is too far outside the realm of network engineering to be useful—but the reality is network engineers can learn a lot about how to design, build, and operate a network from software engineers.