Hedge 101: In Situ OAM

Understanding the flow of a packet is difficult in modern networks, particularly data center fabrics with their wide fanout and high ECMP counts. At the same time, solving this problem is becoming increasingly important as quality of experience becomes the dominant measure of the network. A number of vendor-specific solutions are being developed to solve this problem. In this episode of the Hedge, Frank Brockners and Shwetha Bhandari join Alvaro Retana and Russ White to discuss the in-situ OAM work currently in progress in the IPPM WG of the IETF.


The Hedge 44: Pete Lumbis and Open Source

Open source software is everywhere, it seems—and yet it’s nowhere at the same time. Everyone is talking about it, but how many people and organizations are actually using it? Pete Lumbis at NVIDIA joins Tom Ammon and Russ White to discuss the many uses and meanings of open source software in the networking world.


Whither Cyber-Insurance?

Note: I’m off in the weeds a little this week thinking about cyber-insurance because of a paper that landed in one of my various feeds—while this isn’t something we often think about as network operators, it does impact the overall security of the systems we build.

When you go to the doctor for a yearly checkup, do you think about health or insurance? You probably think about health, but the practice of going to the doctor for regular checkups began because of large life insurance companies in the United States. These companies began using statistical methods to measure risk, or to build actuarial tables they could use to set premiums properly. Originally, life insurance companies relied on the “hunches” of their salesmen, combined with some checking by people in the “back office,” to determine the correct premium. Over time, they developed networks of informers in local communities, such as doctors, lawyers, and even local politicians, who could describe the life of anyone in their area, providing the information the company needed to set premiums correctly.

Eventually, statistical methods came into play, relying in particular on an initial visit with a doctor. The information these insurance companies gathered also gave them insight into which habits increased or decreased longevity, and they decided they should use this information to help shape people’s lives so they would live longer, rather than just using it to discover the correct premiums. To gather more information, and to help people live better lives, life insurance companies started encouraging yearly doctor visits, even setting up non-profit organizations to support the doctors who gave these examinations. Thus were born the yearly doctor’s visit, the credit rating agencies, and a host of other things we take for granted in modern life.

You can read about the early history of life insurance and its impact on society in How Our Days Became Numbered.

What does any of this have to do with networks? Only this: the cyber-insurance market is in much the same position right now as the life insurance market was from the late 1800s through the mid-1900s. Insurance agents interview a company and make a “hunch bet” on how much to charge the company for cyber-insurance. Will cyber-insurance ever mature to the same point as life insurance? According to a recent research paper, the answer is “probably not.” Why not?

First, legal restrictions will not allow a solution such as the one imposed by payment processors. Second, there does not seem to be a lot of leverage in cyber-insurance premiums. The cost of increasing security is generally much higher than any possible premium discount, making it cheaper for companies just to pay the additional premium than to improve their security posture. Third, there is no real evidence tying the use of specific products to reductions in security breaches. Instead, network and data security tend to be tied to practices rather than products, making it harder for an insurer to precisely specify what a company can and should do to improve its posture.

Finally, the largest problem is measurement. What does it look like for a company to “go to the doctor” regularly? Does this mean regular penetration tests? Standardizing penetration tests is difficult, and it can be far too easy to counter pentests without improving the overall security posture. As with medical care in the “early days,” there is no way to know whether you have gathered enough information on the population to correctly understand the kinds of things that improve “health.” Worse, there is no way to compel reporting (much less accurate reporting), nor is there any way to compel insurance companies to share the information they have about cyber incidents.

Will cyber-insurance exist as a “separate thing” in the future? The authors largely answer in the negative. The pressures of “race to the bottom,” providing maximal coverage with minimal costs (which they attribute to the structure of the cyber-insurance market), combined with lack of regulatory clarity and inaccurate measurements, will probably end up causing cyber-insurance to “fold into” other kinds of insurance.

Whether this is a positive or negative result is a matter of conjecture—the legacy of yearly doctor’s visits and public health campaigns is not universally “good,” after all.

Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is probably not true. Given the sheer number of smaller networks, some percentage of them must be experiencing grey failures. What is interesting about grey failures is how little they have been studied. Because these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance, by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018. https://www.usenix.org/conference/fast18/presentation/gunawi.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey failures can convert from one form to another, or rather a single grey failure can express itself in many different ways. This is one of the reasons these kinds of failures can be difficult to trace and repair. For instance, a single link that drops 5% of the traffic will impact different applications at different times, depending on variations in flow startup and ECMP hashing. Another interesting effect of grey failures is that a single failure can cascade across multiple systems. The example given in the document is a fan that fails in a way that increases vibration while making it run less efficiently. The hardware management software may well increase the fan’s speed to compensate, increasing the fan’s vibration further. This vibration, in turn, causes a nearby hard drive to fail more quickly. The hard drive may, in fact, end up being replaced on a regular basis without anyone ever thinking to check nearby fans to see if they are causing drives in this particular slot to fail more frequently.
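The lossy-link example above can be made concrete with a small simulation. This is only a sketch: the hash (SHA-256 over a five-tuple), the addresses, the port numbers, and the choice of eight links with link 3 as the lossy one are all assumptions made for illustration, and real switches use their own vendor-specific hash functions. What it shows is that which flows cross the bad link depends on details, such as the ephemeral source port, that change from one connection to the next.

```python
import hashlib

def ecmp_link(flow, n_links):
    """Pick an egress link by hashing the five-tuple (a stand-in for a real ECMP hash)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

def flows_on_link(flows, n_links, bad_link):
    """Return the flows whose hash places them on the given link."""
    return [f for f in flows if ecmp_link(f, n_links) == bad_link]

def make_flows(app_port, count, first_src_port):
    """Hypothetical flows: same endpoints and service port, consecutive source ports."""
    return [("10.0.0.1", "10.0.1.1", 6, first_src_port + i, app_port)
            for i in range(count)]

N_LINKS = 8   # assumed ECMP fanout
BAD_LINK = 3  # hypothetical link silently dropping 5% of its packets

for app, port in (("web", 443), ("db", 5432)):
    for start in (40000, 50000):
        flows = make_flows(port, 100, start)
        hit = flows_on_link(flows, N_LINKS, BAD_LINK)
        print(f"{app}, source ports from {start}: "
              f"{len(hit)} of {len(flows)} flows cross the lossy link")
```

Each application sees a different subset of its flows affected, and the subset changes as source ports change, which is one reason a single faulty link can look like several unrelated, intermittent application problems.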

The authors make a number of suggestions for finding and resolving these long-tail errors in a large-scale system. They argue vendors should unmask errors if they occur frequently enough. Further, they argue the nature of grey failures requires operators to troubleshoot and repair these failures in the system while it is operating. Operators, then, need to build systems with monitoring that can be refined when needed to chase down grey failures in the operational environment. This also means operators need to spend time troubleshooting in the production environment before jumping to a lab, or assuming that a problem that cannot be reproduced is not really a problem at all. A third suggestion made here is to broaden fuzz testing to include grey failures; intentionally injecting failures is a tried and true method for understanding how a system works, so this is solid advice in general.
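The suggestion to inject grey failures deliberately might be sketched along the following lines. Nothing here comes from the paper: the `degraded` wrapper, the `find_slow` check, the disk names, and the thresholds are hypothetical, and latencies are simulated numbers rather than real measurements. The idea is simply to make one component slow rather than dead, and then see whether the monitoring notices.

```python
import random
import statistics

def degraded(op, slowdown, probability, rng):
    """Wrap op so some calls report a latency multiplied by slowdown.

    op returns a simulated latency in milliseconds; nothing actually sleeps.
    """
    def wrapper():
        latency = op()
        if rng.random() < probability:
            latency *= slowdown
        return latency
    return wrapper

def find_slow(devices, samples, factor):
    """Flag devices whose median latency exceeds factor times the fleet median."""
    medians = {name: statistics.median(op() for _ in range(samples))
               for name, op in devices.items()}
    fleet_median = statistics.median(medians.values())
    return sorted(name for name, m in medians.items() if m > factor * fleet_median)

rng = random.Random(7)  # fixed seed so the run is repeatable

def healthy():
    # Simulated healthy device: between 1.0 and 1.2 ms per operation.
    return 1.0 + rng.random() * 0.2

fleet = {f"disk{i}": healthy for i in range(3)}
# Inject a fail-slow fault: disk3 still works, but every operation is 10x slower.
fleet["disk3"] = degraded(healthy, slowdown=10, probability=1.0, rng=rng)

print(find_slow(fleet, samples=50, factor=3))  # ['disk3']
```

Lowering `probability` turns the fault into an intermittent one, which a median-based check like this may well miss; experiments of that kind show where the monitoring needs to be refined.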

What is not mentioned in the document is that many of these failures are a result of increasing system complexity. The example of the fan and hard drive, for instance, is really an instance of a hidden interaction surface; it is simply a result of placing multiple complex systems close to one another without considering how they might interact in unexpected ways. There is another important lesson here in learning how to look for and see unexpected interaction surfaces, and understanding how these surfaces can impact system operation.

Complexity, ultimately, is not only the enemy of security, but also the enemy of consistent system operation and mean time to repair.

Reduce, reuse, and consider complexity in system design.

The Revenge of the Ancillaries

Have you ever tried to make water flow in a specific direction? Maybe you have some particularly muddy spot in your yard, so you dig a small ditch and think, “the water will now flow from here to there, and the muddy spot won’t be so muddy the next time it rains.” Then it rains, and the water goes a completely different direction, or overflows the little channel you’ve dug, making things worse. The most effective way to channel water, of course, is to put it in pipes—but this doesn’t always seem to work, either.

The next time you think about shadow IT in your organization, think of these pipes, and how the entire system of IT must look to a user in your organization. For instance, I have had corporate laptops where you must enter two or three passwords to boot the laptop, provided by departments that require you to use your corporate laptop for everything, and with security rules forbidding the use of any personal software on the corporate laptop. I have even had company issued laptops on which you could not modify the position of icons on the desktop, change the menu items in any piece of software, or modify the software in any way. Why? Because… information security … making the job of the help desk easier (so they can close cases faster) … getting you to focus on your job, instead of social media …

One of two things is going to happen in this kind of situation: people are going to find a way around the rules, or they are going to minimize the amount of time they spend working. The pipe is either going to drain or burst.

This is what Sumit Rana has called the revenge of the ancillaries:

In building a given function—say, an order form for a brain MRI—the design choices were more political than technical: administrative staff and doctors had different views about what should be included. The doctors were used to having all the votes. But Epic had arranged meetings to try to adjudicate these differences. Now the staff had a say (and sometimes the doctors didn’t even show), and they added questions that made their jobs easier but other jobs more time-consuming. Questions that doctors had routinely skipped now stopped them short, with “field required” alerts.

This is a form of the tragedy of the commons. It seems fine for you to put requirements on someone else that make your life easier; it only takes a few more seconds for them, and the requirement seems quite reasonable. But if no one is looking at the complete system, the system itself becomes too complex to use, and people start saying things like, “I’d really like to do this for you, but the system won’t let me.” Ever heard that one?

Now let’s apply this to networking. Suppose you have some process for connecting servers to the network. This process involves going to house security, who imposes a long checklist on the connection, then to budgeting, who wants to know precisely what the server will be used for, why, and how long, and then to someone in O/S compliance, who wants to know what operating system will be used, and why, and then to DevOps, who wants to ensure the deployment of these servers is properly automated, and…

No single requirement is a big deal. None of them really take a lot of time. But combined, the process is so difficult that the user finally just pulls out a credit card and expenses a virtual machine on some public cloud service. Then you end up with production stuff running in a public cloud service with no controls at all.

Underlying some of this is the problem of complexity. If you have ten different monitoring platforms, pushing new hardware and software into place becomes a gauntlet no one wants to run. If, on the other hand, you have one centralized data store, coupled with a myriad of tools to push information into, and retrieve information from, that one data store, you can allow system developers to choose whatever method works best to push and pull information. Marshaling the data becomes the largest issue, and the APIs into and out of the data store become the biggest decision to make, rather than selecting the suite of applications used to run telemetry.
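A minimal sketch of the single data store idea, assuming an in-memory store and invented names throughout (`TelemetryStore`, `push`, `query`, and the two collector functions are illustrative, not any real product’s API). The point is that the only thing every tool has to agree on is the shape of the data and the two calls into and out of the store.

```python
import time
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    source: str       # device or tool that produced the value
    metric: str       # for example "if_in_errors"
    value: float
    timestamp: float  # seconds since the epoch

class TelemetryStore:
    """One shared store; every tool pushes to and reads from the same API."""

    def __init__(self):
        self._samples = defaultdict(list)

    def push(self, source, metric, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._samples[metric].append(Sample(source, metric, value, ts))

    def query(self, metric, source=None, since=0.0):
        return [s for s in self._samples.get(metric, [])
                if (source is None or s.source == source) and s.timestamp >= since]

# Two hypothetical collectors built by different teams, both using the same push call.
def polling_collector(store):
    store.push("leaf1", "if_in_errors", 12, timestamp=100.0)

def streaming_collector(store):
    store.push("leaf2", "if_in_errors", 0, timestamp=101.0)
    store.push("leaf1", "if_in_errors", 15, timestamp=102.0)

store = TelemetryStore()
polling_collector(store)
streaming_collector(store)

# Any consumer (dashboard, alerting, capacity planning) reads through the same query call.
for sample in store.query("if_in_errors", source="leaf1"):
    print(sample.timestamp, sample.value)
```

Swapping one collector for another, or adding a new consumer, does not touch anything else in the system so long as the push and query calls stay stable.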

Having an internal cloud model, with clear rules about when a virtual server will be deactivated and archived in some way, perhaps with manual process review on objection, might be a good idea. One of the nice things about virtualization is it allows many of the security, usage, and other rules to just be implemented without any sort of process. If you want people to build applications that use IP as their primary point of contact, rather than Ethernet addresses, make IP addresses easy to get, and layer 2 connections harder. Channeling works; containment does not.

Let me repeat this one more time for emphasis: you can channel users, but you cannot contain them.

Rules need to be truly reasonable, with an eye to the system as a whole, rather than focusing on individual snippets. Documentation must be easy to find, and a clear process for working around any rule must be well explained. Rules need to be examined from time to time to see what percentage of the population is simply ignoring them, or working around them, why, and how things might be changed to be better.

Ultimately, people cannot be contained in a pipe. Not that you really want to—people in pipes don’t produce or create. It’s not a good place to be.