Tradeoffs Come in Threes

On a Spring 2019 walk in Beijing I saw two street sweepers at a sunny corner. They were beat-up looking and grizzled but probably younger than me. They’d paused work to smoke and talk. One told a story; the other’s eyes widened and then he laughed so hard he had to bend over, leaning on his broom. I suspect their jobs and pay were lousy and their lives constrained in ways I can’t imagine. But they had time to smoke a cigarette and crack a joke. You know what that’s called? Waste, inefficiency, a suboptimal outcome. Some of the brightest minds in our economy are earnestly engaged in stamping it out. They’re winning, but everyone’s losing. —Tim Bray

This, in a nutshell, is what is often wrong with our design thinking in the networking world today. We want things to be efficient, wringing the last little dollar, and the last little bit of bandwidth, out of everything.

This is also, however, a perfect example of the problem of triads and tradeoffs. In the case of the street sweeper, we might thing, “well, we could replace those folks sitting around smoking a cigarette and cracking jokes with a robot, making things much more efficient.” We might notice the impact on the street sweeper’s salaries—but after all, it’s a boring job, and they are better off doing something else anyway, right?

We’re actually pretty good at finding, and “solving” (for some meaning of “solving,” of course), these kinds of immediately obvious tradeoffs. It’s obvious the street sweepers are going to lose their jobs if we replace them with a robot. What might not be so obvious is the loss of the presence of a person on the street. That’s a pair of eyes who can see when a child is being taken by someone who’s not a family member, a pair of ears that can hear the rumble of a car that doesn’t belong in the neighborhood, a pair of hands that can help someone who’s fallen, etc.

This is why these kinds of tradeoffs always come in (at least) threes.

Let’s look at the street sweepers in terms of the SOS triad. Replacing the street sweepers with a robot or machine certainly increases optimization. According to the triad, though, increasing optimization in one area should result in some increase in complexity someplace, and some loss of optimization in other places.

What about surfaces? The robot must be managed, and it must interact with people and vehicles on the street—which means people and vehicles must also interact with the robot. Someone must build and maintain the robot, so there must be some sort of system, with a plethora of interaction surfaces, to make this all happen. So yes, there may be more efficiency, but there are now more interaction surfaces to deal with now, too. These interaction surfaces increase complexity.

What about state? In a sense, there isn’t much change in state other than moving it—purely in terms of sweeping the street, anyway. The sweeper and the robot must both understand when and how to sweep the street, etc., so the state doesn’t seem to change much here.

On the other hand, that extra set of eyes and ears, that extra mind, that is no longer on the street in a personal way represents a loss of state. The robot is an abstraction of the person who was there before, and abstraction always represents a loss of state in some way. Whether this loss of state decreases the optimal handling of local neighborhood emergencies is probably a non-trivial problem to consider.

The bottom line is this—when you go after efficiency, you need to think in terms of efficiency of what, rather than efficiency as a goal-in-itself. That’s because there is no such thing as “efficiency-in-itself,” there is only something you are making more efficient—and a lot of things you potentially making less efficient.

Automate your network, certainly, or even buy a system that solves “all the problems.” But remember there are tradeoffs—often a large number of tradeoffs you might not have thought about—and those tradeoffs have consequences.

It’s not “if you haven’t found the tradeoff, you haven’t looked hard enough…” Don’t stop at one. It’s “if you haven’t found the tradeoffs, you haven’t looked hard enough.” It’s a plural for a reason.

Learning from the Post-Mortem

Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.

The common pattern seems to be setting aside a one hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now—fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.

In a few months’ time, the same team will be in the same room, draw the same drawings, and say the same things all over again. At least that is the way it seems to me. If there is an effective post-mortem process in use by a company someplace, I do not think I have seen it.

From the article—

Are we missing anything in this prevalent rinse-and-repeat cycle of how the industry generally addresses incidents that could be helpful? Put another way: As we experience incidents, work through them, and deal with their aftermath, if we set aside incident-specific, and therefore fundamentally static, remediation items, both in technology and process, are we learning anything else that would be useful in addressing and responding to incidents? Can we describe that knowledge? And if so, how would we then make use of it to leverage past pain and improve future chances at success?

I tend to think, from the few times I have seen network post-mortems performed, that the reason they do not work well is because we slip into the same appliance/configuration frame of mind so quickly. We want to understand what configuration was entered incorrectly, or what defect should be reported back to the vendor, rather than thinking about organizational and process changes. The smaller the detail, the safer the conclusions, after all—aim small, miss big, is what we say in the shooting world.

We focus so much on mean time to innocence, and how to create a practically perfect process that will never fail, that we fail to do the one thing we should be doing: learning.

Okay, so enough whining—what can be done about this situation? A few practical suggestions come to mind. These are not, of course, well-thought-out solutions, but rather, perhaps, “part of the solution.”

Rather than trying to figure out the root cause, spend that precious hour of post-mortem time mapping out three distinct workflows. The first should be the process that set up the failure. What drove the installation of this piece of hardware or software? What drove the deployment of this protocol? How did we get to the place where this failure had that effect? Once this is mapped out, see if there is anything in that process, or even in the political drivers and commitments made during that process, that could or should be modified to really change the way technology is deployed in your network.

The second process you should map out is the steps taken to detect the problem. Dwell time is a huge problem in modern networks—the time between a failure occurring and being detected. You should constantly focus on bringing dwell time down while paying close attention to the collateral damage of false positives. Mapping out how this failure was detected, and where it should have been caught sooner, can help improve telemetry systems, ultimately decreasing MTTR.

The third, and final, workflow you map out should be the troubleshooting process itself. People rarely map out their troubleshooting process for later reference, but this little trick I learned from way back in tube-type electronics days used to save me hours of time in the field. As you troubleshoot, make a flow chart. Record what you checked, why you checked it, how you checked it, and what you learned from the check. This flowchart, or workflow, is precious material in the post-mortem process. What can you instrument, or make easier to find, to reduce troubleshooting time in the next go-round? How can you traverse the network and find the root cause faster next time? These are crucial questions you can only answer with the use of a troubleshooting workflow.

I don’t know if you already do post-mortems or not, or how valuable you think they are—but I would suggest they can be, and are, quite useful. So long as you get out of the narrows and focus on systems and workflows. Aim small, miss big—but aim big and you’ll either hit the target or, at worst, miss small.

Quitting Certifications: When?

At what point in your career do you stop working towards new certifications?

Daniel Dibb’s recent post on his blog is, I think, an excellent starting point, but I wanted to add a few additional thoughts to the answer he gives there.

Daniel’s first question is how do you learn? Certifications often represent a body of knowledge people who have a lot of experience believe is important, so they often represent a good guided path to holistically approaching a new body of knowledge. In the professional learning world this would be called a ready-made mental map. There is a counterargument here—certifications are often created by vendors as a marketing tool, rather than as something purely designed for the betterment of the community, or the dissemination of knowledge. This doesn’t mean, however, that certifications are “evil.” It just means you need to evaluate each certification on its own merits.

As an aside, I’ve been trying to start a non-vendor-specific certification for the last two years but have been struggling to find a group of people with the energy and excitement required to make it happen. To some degree, the reason certifications are vendor-based is because we, as a community, don’t do a good job at building them.

The second series of questions relate to your position—would a certification give you a bonus, help you get a new position, or give credibility to the company you work for? These are all valid questions requiring self-reflection around what you hope to achieve materially by working through the certification.

The final set of questions Daniel poses relate to whether a certification would give you what might be called authority in the network engineering world. Certifications, seen in this way, are a form of transitive trust. There are two components here—the certification blueprint tells you about the body of knowledge, and the certifying authority tells you about the credibility of the process. Given these two things, someone with a certification is saying “someone you trust has said I have this knowledge, so you should trust I have this knowledge as well.” The certification acts as a transit between you and the certified person, transferring some amount of your trust in the certifying organization to the person you are talking to.

There are other ways to build this kind of trust, of course. For instance, if you blog, or run a podcast, or are a frequent guest contributor, or have a lot of code on git, or have written some books, etc. In these cases, the trust is no longer transitive but direct—you can see a person has a body of knowledge because they have (at least to some degree) exposed that knowledge publicly.

All of these reasons are fine and good—but I think there is another point to think about in this discussion: what are you saying to the community? Once you stop “doing” certifications, you can be saying one of two things. The first is certifications are useless. If all the reasons for getting a certification above are true, then telling someone “certifications are useless” is not a good thing. Those who don’t care about certifications should rather take the position first, do no harm.

A stronger position would be to carefully evaluate existing certifications and help guide folks desiring a certification down a good path rather than a bad one. Which certifications are primarily vendor marketing programs? Which are technically sound? If you are at the point where you are no longer going to pursue certifications, these are questions you should be able to answer.

An even stronger position would be—if you’re at the point where you do not think you need to be certified, where you have a “body of knowledge” that allows people to directly trust your work, then perhaps you should also be at a point where you are helping guide the development of certifications in some way.

Certifications—whether to get them or not, and when to stop caring about them—are rather more nuanced than many in the networking world make out. There are valid reasons for, and valid reasons against—and in general, I think we need to do better a developing and policing certifications in order to build a stronger community.

Learning from Failure at Scale

One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.

To produce the study, the authors took data from Facebook’s ticket logging system over 6 years, from 2011 through 2018. They used language-based systems to classify each event based on severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted and tried to understand the results. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).

Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should lower than a manually operated network. That 13% of failures are still accounted for by misconfiguration shows even the best automation program cannot eliminate failures from misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures are not given, this seems to provide both an idea of how much improvement automation can create, as well as a sort of “cap” on how much improvement operators can expect by automating.

If misconfiguration causes 13% of all failures, and software defects cause 12%, then 25% of all failures are caused by human error. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator created code and operator configuration, the percentage of failure seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive it down more than a couple of percentage points.

Another interesting finding here is larger networks increase the time humans take to resolve incidents. As the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find a problem and resolve it. This is, on the other hand, likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.

Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.

Whither Cyber-Insurance?

Note: I’m off in the weeds a little this week thinking about cyber-insurance because of a paper that landed in one of my various feeds—while this isn’t something we often think about as network operators, it does impact the overall security of the systems we build.

When you go to the doctor for a yearly checkup, do you think about health or insurance? You probably think about health, but the practice of going to the doctor for regular checkups began because of large life insurance companies in the United States. These companies began using statistical methods to make risk, or to build actuarial tables they could use to set the premiums properly. Originally, life insurance companies relied on the “hunches” of their salesmen, combined with some checking by people in the “back office,” to determine the correct premium. Over time, they developed networks of informers in local communities, such as doctors, lawyers, and even local politicians, who could describe the life of anyone in their area, providing the information the company needed to set premiums correctly.

Over time, however, statistical methods came into play, particularly relying on an initial visit with a doctor. The information these insurance companies gathered, however, gave them insight into what habits increased or decreased longevity—they decided they should use this information to help shape people’s lives so they would live longer, rather than just using it to discover the correct premiums. To gather more information, and to help people live better lives, life insurance companies started encouraging yearly doctor visits, even setting up non-profit organizations to support the doctors who gave these examinations. Thus was born the yearly doctor’s visit, the credit rating agencies, and a host of other things we take for granted in modern life.

You can read about the early history of life insurance and its impact on society in How Our Days Became Numbered.

What does any of this have to do with networks? Only this—we are in much the same position in the cyber-insurance market right now as the life insurance market in the late 1800s through the mid-1900s—insurance agents interview a company and make a “hunch bet” on how much to charge the company for cyber-insurance. Will cyber-insurance ever mature to the same point as life insurance? According to a recent research paper, the answer is “probably not.”  Why not?

First, legal restrictions will not allow a solution such as the one imposed by payment processors. Second, there does not seem to be a lot of leverage in cyber-insurance premiums. The cost of increasing security is generally much higher than any possible premium discount, making it cheaper for companies just to pay the additional premium than to improve their security posture. Third, there is no real evidence tying the use of specific products to reductions in security breaches. Instead, network and data security tend to be tied to practices rather than products, making it harder for an insurer to precisely specify what a company can and should to improve their posture.

Finally, the largest problem is measurement. What does it look like for a company to “go to the doctor” regularly? Does this mean regular penetration tests? Standardizing penetration tests is difficult, and it can be far too easy to counter pentests without improving the overall security posture. Like medical care in the “early days,” there is no way to know you have gathered enough information on the population to know if you correctly understand the kinds of things that improve “health”—but there is no way to compel reporting (much less accurate reporting), nor is there any way to compel insurance companies to share the information they have about cyber incidents.

Will cyber-insurance exist as a “separate thing” in the future? The authors largely answer in the negative. The pressures of “race to the bottom,” providing maximal coverage with minimal costs (which they attribute to the structure of the cyber-insurance market), combined with lack of regulatory clarity and inaccurate measurements, will probably end up causing cyber-insurance to “fold into” other kinds of insurance.

Whether this is a positive or negative result is a matter of conjecture—the legacy of yearly doctor’s visits and public health campaigns is not universally “good,” after all.