Privacy for Providers

While this talk is titled Privacy for Providers, it really applies to just about every network operator. This is meant to open a conversation on the topic, rather than to provide definitive answers. I start by looking at some of the kinds of information network operators work with, and whether this information can or should be considered “private.” In the second part of the talk, I work through some of the things network operators might want to consider when handling private information.

Hedge 101: In Situ OAM

Understanding the flow of a packet is difficult in modern networks, particularly data center fabrics with their wide fanout and high ECMP counts. At the same time, solving this problem is becoming increasingly important as quality of experience becomes the dominant measure of the network. A number of vendor-specific solutions are being developed to solve this problem. In this episode of the Hedge, Frank Brockners and Shwetha Bhandari join Alvaro Retana and Russ White to discuss the in-situ OAM work currently in progress in the IPPM WG of the IETF.


The Hedge 44: Pete Lumbis and Open Source

Open source software is everywhere, it seems—and yet it’s nowhere at the same time. Everyone is talking about it, but how many people and organizations are actually using it? Pete Lumbis at NVIDIA joins Tom Ammon and Russ White to discuss the many uses and meanings of open source software in the networking world.


Whither Cyber-Insurance?

Note: I’m off in the weeds a little this week thinking about cyber-insurance because of a paper that landed in one of my various feeds—while this isn’t something we often think about as network operators, it does impact the overall security of the systems we build.

When you go to the doctor for a yearly checkup, do you think about health or insurance? You probably think about health, but the practice of going to the doctor for regular checkups began because of large life insurance companies in the United States. These companies began using statistical methods to measure risk, or to build actuarial tables they could use to set the premiums properly. Originally, life insurance companies relied on the “hunches” of their salesmen, combined with some checking by people in the “back office,” to determine the correct premium. Over time, they developed networks of informers in local communities, such as doctors, lawyers, and even local politicians, who could describe the life of anyone in their area, providing the information the company needed to set premiums correctly.

Eventually, however, statistical methods came into play, particularly relying on an initial visit with a doctor. The information these insurance companies gathered gave them insight into what habits increased or decreased longevity, and they decided they should use this information to help shape people’s lives so they would live longer, rather than just using it to discover the correct premiums. To gather more information, and to help people live better lives, life insurance companies started encouraging yearly doctor visits, even setting up non-profit organizations to support the doctors who gave these examinations. Thus were born the yearly doctor’s visit, the credit rating agencies, and a host of other things we take for granted in modern life.

You can read about the early history of life insurance and its impact on society in How Our Days Became Numbered.

What does any of this have to do with networks? Only this: the cyber-insurance market is in much the same position right now as the life insurance market was from the late 1800s through the mid-1900s. Insurance agents interview a company and make a “hunch bet” on how much to charge the company for cyber-insurance. Will cyber-insurance ever mature to the same point as life insurance? According to a recent research paper, the answer is “probably not.” Why not?

First, legal restrictions will not allow a solution such as the one imposed by payment processors. Second, there does not seem to be a lot of leverage in cyber-insurance premiums. The cost of increasing security is generally much higher than any possible premium discount, making it cheaper for companies just to pay the additional premium than to improve their security posture. Third, there is no real evidence tying the use of specific products to reductions in security breaches. Instead, network and data security tend to be tied to practices rather than products, making it harder for an insurer to precisely specify what a company can and should do to improve its posture.

Finally, the largest problem is measurement. What does it look like for a company to “go to the doctor” regularly? Does this mean regular penetration tests? Standardizing penetration tests is difficult, and it can be far too easy to counter pentests without improving the overall security posture. As with medical care in the “early days,” there is no way to know whether you have gathered enough information on the population to correctly understand the kinds of things that improve “health.” Worse, there is no way to compel reporting (much less accurate reporting), nor is there any way to compel insurance companies to share the information they have about cyber incidents.

Will cyber-insurance exist as a “separate thing” in the future? The authors largely answer in the negative. The pressures of “race to the bottom,” providing maximal coverage with minimal costs (which they attribute to the structure of the cyber-insurance market), combined with lack of regulatory clarity and inaccurate measurements, will probably end up causing cyber-insurance to “fold into” other kinds of insurance.

Whether this is a positive or negative result is a matter of conjecture—the legacy of yearly doctor’s visits and public health campaigns is not universally “good,” after all.

Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is probably not true. Given the law of large numbers, there must be some number of grey failures in some percentage of smaller networks simply because there are so many of them. What is interesting about grey failures is there is so little study in this area; since these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.
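A rough way to see why the sheer number of smaller networks matters is sketched below in Python. It assumes, purely for illustration, that each network independently has some small probability of harboring a grey failure; the probability and the network counts are hypothetical values of mine, not measurements from any study.

```python
def p_at_least_one(p_single: float, n_networks: int) -> float:
    """Probability that at least one of n independent networks has a grey failure."""
    return 1.0 - (1.0 - p_single) ** n_networks


def expected_affected(p_single: float, n_networks: int) -> float:
    """Expected number of networks with a grey failure."""
    return p_single * n_networks


# Hypothetical per-network probability; chosen only to show the shape of the curve.
p = 0.001
for n in (100, 10_000, 1_000_000):
    print(n, round(p_at_least_one(p, n), 4), expected_affected(p, n))
```

Even when the per-network probability is tiny, the chance that no small network anywhere is affected shrinks quickly as the number of networks grows.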

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey failures can convert from one form to another, or rather a single grey failure can express itself in many different ways. This is one of the reasons these kinds of failures can be difficult to trace and repair. For instance, a single link that drops 5% of the traffic will impact different applications at different times, depending on variations in flow startup and ECMP hashing. Another interesting effect of grey failures is that a single failure can cascade across multiple systems. The example given in the document is a fan that fails in a way that increases vibration while running less efficiently. The hardware management software may well increase the speed of the fan in order to compensate, increasing the fan’s vibration. This vibration, in turn, causes a nearby hard drive to fail more quickly. The hard drive may, in fact, end up being replaced on a regular basis without anyone ever thinking to check nearby fans to see if they are causing drives in this particular slot to fail more frequently.
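To make the lossy link example concrete, here is a minimal Python sketch. It assumes eight equal-cost links, a SHA-256 hash standing in for whatever hash a real switch would use, and a made-up set of 200 flows; none of these details come from the paper, and the function names are hypothetical.

```python
import hashlib
import random


def ecmp_link(flow, num_links):
    """Pick a link by hashing the flow's 5-tuple (a stand-in for a real ECMP hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def simulate(flows, num_links, bad_link, drop_rate, packets_per_flow, seed=0):
    """Return {flow: packets_lost} when one link silently drops a fraction of packets."""
    rng = random.Random(seed)
    losses = {}
    for flow in flows:
        lost = 0
        if ecmp_link(flow, num_links) == bad_link:
            # Only packets hashed onto the faulty link are exposed to the drops.
            lost = sum(1 for _ in range(packets_per_flow) if rng.random() < drop_rate)
        losses[flow] = lost
    return losses


# Hypothetical flows: same endpoints, different source ports.
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(200)]
losses = simulate(flows, num_links=8, bad_link=3, drop_rate=0.05, packets_per_flow=1000)
affected = [f for f, lost in losses.items() if lost > 0]
print(f"{len(affected)} of {len(flows)} flows saw loss")
```

Only the flows that happen to hash onto the faulty link see any loss, and a change in source port moves a flow on or off that link, which is why the same fault can look like an intermittent problem in many unrelated applications.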

The authors make a number of suggestions for finding and resolving these long-tail errors in a large-scale system. They argue vendors should unmask errors if they occur frequently enough. Further, they argue the nature of grey failures requires operators to troubleshoot and repair these failures in the running system. Operators, then, need to build systems with monitoring that can be refined when needed to chase down grey failures in the operational environment. This also means operators need to spend time troubleshooting in the production environment before jumping to a lab, or assuming that a problem that cannot be reproduced is not really a problem at all. A third suggestion made here is to broaden fuzz testing to include grey failures; intentionally injecting failures is a tried and true method for understanding how a system works, so this is solid advice in general.
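One way to picture injecting grey failures, rather than hard failures, is a wrapper that occasionally slows a call down or silently loses its result. The Python sketch below is my own illustration under those assumptions; the class name, parameters, and the `send` function are hypothetical and are not taken from the paper or from any particular testing tool.

```python
import random
import time


class GreyFaultInjector:
    """Wrap a callable so it sometimes runs slowly or silently drops its result."""

    def __init__(self, func, slow_prob=0.0, slow_seconds=0.0, drop_prob=0.0,
                 seed=None, sleep=time.sleep):
        self._func = func
        self._rng = random.Random(seed)
        self._sleep = sleep
        self.slow_prob = slow_prob
        self.slow_seconds = slow_seconds
        self.drop_prob = drop_prob
        self.calls = self.slowed = self.dropped = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        roll = self._rng.random()
        if roll < self.drop_prob:
            # Silent loss: no exception is raised, the result just never arrives.
            self.dropped += 1
            return None
        if roll < self.drop_prob + self.slow_prob:
            # Fail-slow: the call still succeeds, only later than usual.
            self.slowed += 1
            self._sleep(self.slow_seconds)
        return self._func(*args, **kwargs)


def send(packet):
    """Hypothetical operation under test."""
    return f"ack:{packet}"


# A no-op sleep keeps the demonstration fast; use the default time.sleep for real delays.
flaky_send = GreyFaultInjector(send, slow_prob=0.05, slow_seconds=0.2,
                               drop_prob=0.01, seed=42, sleep=lambda s: None)
results = [flaky_send(i) for i in range(1000)]
print(flaky_send.calls, flaky_send.slowed, flaky_send.dropped)
```

Running a system's normal tests against a wrapper like this shows whether monitoring notices a component that is degraded but never fully down.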

What is not mentioned in the document is that many of these failures are a result of increasing system complexity. The example of the fan and hard drive, for instance, is really an instance of a hidden interaction surface; it is simply a result of placing multiple complex systems close to one another without considering how they might interact in unexpected ways. There is another important lesson here in learning how to look for and see unexpected interaction surfaces, and understanding how these surfaces can impact system operation.

Complexity, ultimately, is not only the enemy of security, but also the enemy of consistent system operation and mean time to repair.

Reduce, reuse, and consider complexity in system design.