OPERATIONS

Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is unlikely to be true. Given the sheer number of smaller networks, the law of large numbers suggests at least some percentage of them are carrying grey failures at any given time. What is interesting about grey failures is how little study there is in this area. These errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” when someone doing unrelated work in a surrounding system performs an unintentional repair (for instance, by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018. https://www.usenix.org/conference/fast18/presentation/gunawi.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey failures can convert from one form to another; or rather, a single grey failure can express itself in many different ways. This is one of the reasons these kinds of failures can be difficult to trace and repair. For instance, a single link that drops 5% of the traffic passing over it will impact different applications at different times, depending on variations in flow startup and ECMP hashing. Another interesting effect of grey failures is that a single failure can cascade across multiple systems. The example given in the document is a fan that fails in a way that increases vibration while running less efficiently. The hardware management software may well increase the speed of the fan to compensate, increasing the vibration further. This vibration, in turn, causes a nearby hard drive to fail more quickly. The hard drive may, in fact, end up being replaced on a regular basis without anyone ever thinking to check whether the nearby fan is causing that particular drive slot to fail hardware more frequently.
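To make the ECMP point a little more concrete, here is a rough sketch (not from the paper) of how a single lossy link inside an ECMP group touches some flows and not others. The path count, hash function, loss rate, and addresses are all illustrative assumptions; the point is that only flows hashing onto the degraded path see any loss, so the failure appears to move around as flows and ephemeral ports change.

# Rough sketch: one degraded link in an ECMP group only affects the flows
# that happen to hash onto it. All values below are illustrative.
import random
import zlib

PATHS = 4            # assumed ECMP fan-out
BAD_PATH = 2         # the path whose link silently drops packets
LOSS_RATE = 0.05     # the 5% loss described above

def ecmp_path(flow):
    """Pick a path from the flow's 5-tuple, roughly as an ECMP hash would."""
    return zlib.crc32(repr(flow).encode()) % PATHS

def loss_for(flow, packets=1000):
    """Return the fraction of packets lost for this flow."""
    if ecmp_path(flow) != BAD_PATH:
        return 0.0
    dropped = sum(1 for _ in range(packets) if random.random() < LOSS_RATE)
    return dropped / packets

# Eight flows that differ only by ephemeral source port.
flows = [("10.0.0.1", "10.0.1.1", 6, sport, 443) for sport in range(49152, 49160)]
for flow in flows:
    print(f"sport {flow[3]} -> path {ecmp_path(flow)}, {loss_for(flow):.1%} loss")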

The authors make a number of suggestions for finding and resolving these long-tail errors in a large-scale system. They argue vendors should unmask errors if they occur frequently enough. Further, they argue the nature of grey failures requires operators to troubleshoot and repair these failures in the operational system. Operators, then, need to build systems with monitoring that can be refined when needed to chase down grey failures in the operational environment. This also means operators need to spend time troubleshooting in the production environment before jumping to a lab, or assuming that a problem that cannot be reproduced is not really a problem at all. A third suggestion made here is to broaden fuzz testing to include grey failures; intentionally injecting failures is a tried and true method for understanding how a system works, so this is solid advice in general.
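As a sketch of what the fuzz-testing suggestion might look like in practice, the wrapper below injects fail-slow behavior rather than outright failure: a call occasionally completes far more slowly than it should, the way a degraded disk or NIC would. The probabilities, delays, and function names are illustrative assumptions, not anything specified in the paper.

# Minimal fail-slow injection sketch: the component still works, it just
# occasionally works much more slowly than it should.
import random
import time
from functools import wraps

def fail_slow(probability=0.05, delay_s=0.25):
    """Randomly delay calls to simulate a fail-slow (grey) component."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_s)   # slow, not broken: the grey failure
            return func(*args, **kwargs)
        return wrapper
    return decorate

@fail_slow(probability=0.05, delay_s=0.25)
def write_block(buf: bytes) -> int:
    # Stand-in for a real storage or network write in the system under test.
    return len(buf)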

What is not mentioned in the document is that many of these failures are a result of increasing system complexity. The example of the fan and hard drive, for instance, is really an instance of a hidden interaction surface; it is simply a result of placing multiple complex systems close to one another without considering how they might interact in unexpected ways. There is another important lesson here in learning how to look for and see unexpected interaction surfaces, and understanding how these surfaces can impact system operation.

Complexity, ultimately, is not only the enemy of security, but also the enemy of consistent system operation and mean time to repair.

Reduce, reuse, and consider complexity in system design.

The Revenge of the Ancillaries

Have you ever tried to make water flow in a specific direction? Maybe you have some particularly muddy spot in your yard, so you dig a small ditch and think, “the water will now flow from here to there, and the muddy spot won’t be so muddy the next time it rains.” Then it rains, and the water goes a completely different direction, or overflows the little channel you’ve dug, making things worse. The most effective way to channel water, of course, is to put it in pipes—but this doesn’t always seem to work, either.

The next time you think about shadow IT in your organization, think of these pipes, and how the entire system of IT must look to a user in your organization. For instance, I have had corporate laptops that required two or three passwords just to boot, issued by departments that require you to use the corporate laptop for everything, with security rules forbidding the use of any personal software on it. I have even had company-issued laptops on which you could not modify the position of icons on the desktop, change the menu items in any piece of software, or modify the software in any way. Why? Because… information security … making the job of the help desk easier (so they can close cases faster) … getting you to focus on your job, instead of social media …

One of two things is going to happen in this kind of situation: people are going to find a way around the rules, or they are going to minimize the amount of time they spend working. The pipe is either going to drain or burst.

This is what Sumit Rama has called the revenge of the ancillaries:

In building a given function—say, an order form for a brain MRI—the design choices were more political than technical: administrative staff and doctors had different views about what should be included. The doctors were used to having all the votes. But Epic had arranged meetings to try to adjudicate these differences. Now the staff had a say (and sometimes the doctors didn’t even show), and they added questions that made their jobs easier but other jobs more time-consuming. Questions that doctors had routinely skipped now stopped them short, with “field required” alerts.

This is a form of the tragedy of the commons. It seems fine to put requirements on someone else that make your life easier; each one only takes them a few more seconds and seems quite reasonable on its own. But if no one is looking at the complete system, the system itself becomes too complex to use, and people start saying things like, “I’d really like to do this for you, but the system won’t let me.” Ever heard that one?

Now let’s apply this to networking. Suppose you have some process for connecting servers to the network. This process involves going to in-house security, who imposes a long checklist on the connection; then to budgeting, who wants to know precisely what the server will be used for, why, and for how long; then to someone in O/S compliance, who wants to know what operating system will be used, and why; and then to DevOps, who wants to ensure the deployment of these servers is properly automated, and…

No single requirement is a big deal. None of them really take a lot of time. But combined, the process is so difficult that the user finally just pulls out a credit card and expenses a virtual machine on some public cloud service. Then you end up with production stuff running in a public cloud service with no controls at all.

Underlying some of this is the problem of complexity. If you have ten different monitoring platforms, pushing new hardware and software into place becomes a gauntlet no one wants to run. If, on the other hand, you have one centralized data store, coupled with a myriad of tools to push information into, and retrieve information from, that one data store, you can allow system developers to choose whatever method works best to push and pull information. Marshaling the data becomes the largest issue, and choosing the APIs into and out of the data store becomes the biggest decision to make, rather than selecting the suite of applications used to run telemetry.
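As a sketch of the “one data store, many tools” idea, the fragment below puts a single push/pull interface in front of a shared record store; any collector can push through it and any consumer can pull from it, so the choice of tools stops being the architectural decision. The record shape and method names are assumptions made purely for illustration.

# Minimal sketch: one telemetry store, one small API on each side of it.
from dataclasses import dataclass, field
from time import time
from typing import Any

@dataclass
class TelemetryStore:
    records: list[dict[str, Any]] = field(default_factory=list)

    def push(self, source: str, metric: str, value: Any) -> None:
        """Any collector (SNMP poller, gNMI subscriber, syslog parser) calls this."""
        self.records.append(
            {"ts": time(), "source": source, "metric": metric, "value": value}
        )

    def pull(self, metric: str) -> list[dict[str, Any]]:
        """Any consumer (dashboard, alerting, capacity model) calls this."""
        return [r for r in self.records if r["metric"] == metric]

store = TelemetryStore()
store.push("edge-router-1", "interface.errors", 42)
print(store.pull("interface.errors"))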

Having an internal cloud model, with clear rules about when a virtual server will be deactivated and archived in some way, perhaps with a manual review process for objections, might be a good idea. One of the nice things about virtualization is it allows many of the security, usage, and other rules to simply be implemented without any sort of process. If you want people to build applications that use IP addresses as their primary point of contact, rather than Ethernet addresses, make IP addresses easy to get and layer 2 connections harder to get. Channeling works; containment does not.

Let me repeat this one more time for emphasis: you can channel users, but you cannot contain them.

Rules need to be truly reasonable, with an eye to the system as a whole rather than to individual pieces. Documentation must be easy to find, and a clear process for working around any rule must be well explained. Rules need to be examined from time to time to see what percentage of the population is simply ignoring them or working around them, why, and how things might be changed for the better.

Ultimately, people cannot be contained in a pipe. Not that you really want to—people in pipes don’t produce or create. It’s not a good place to be.

Reaction: The Importance of Open APIs

Over at CIMI, Tom Nolle considers whether the open API is a revolution or a cynical trap. The line of argument primarily relates to accessing functions in a Virtual Network Function (VNF), which is, in turn, related to Network Function Virtualization (NFV). The broader point is made in this line:

One important truth about an API is that it effectively imposes a structure on the software on either side of it. If you design APIs to join two functional blocks in a diagram of an application, it’s likely that the API will impose those blocks on the designer.

This is true: if you design the API first, it will necessarily shape the flow of information between the different system components, and even determine, at least to some degree, the structure of the software modules on either side of the API. For instance, if you deploy network appliances from a single vendor, your workflow for building packet filters will be similar across all devices. Add a second vendor into the mix, however, and you might find the way packet filters are described and deployed is completely different, requiring a per-device module that moves from intent to implementation.
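A minimal sketch of such a per-device “intent to implementation” module might look like the following; the vendor names and configuration syntax below are invented purely for illustration.

# One abstract intent, one small driver per vendor. Vendor syntax is made up.
from dataclasses import dataclass

@dataclass
class FilterIntent:
    name: str
    src_prefix: str
    dst_port: int

class VendorADriver:
    def render(self, intent: FilterIntent) -> str:
        return (
            f"access-list {intent.name} deny tcp {intent.src_prefix} "
            f"any eq {intent.dst_port}"
        )

class VendorBDriver:
    def render(self, intent: FilterIntent) -> str:
        return (
            f"set firewall filter {intent.name} term 1 "
            f"from source-address {intent.src_prefix} port {intent.dst_port}"
        )

intent = FilterIntent("block-legacy-app", "10.1.2.0/24", 8080)
for driver in (VendorADriver(), VendorBDriver()):
    print(driver.render(intent))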

While this problem will always exist, there is another useful way of looking at the problem. Rather than seeing the API as a set of calls and a set of data structures, you can break things up into the grammar and the dictionary. The grammar is the context in which words are placed so they relate to other words, and the dictionary is the meaning of the words. We often think of an API as being almost purely dictionary; if I push this data structure to the device, then something happens; in grammatical terms, the verb is implied in the subject and object pair you feed to the device.

Breaking things up in this way allows you to see the problem in a different way. There is no particular reason the dictionary or the grammar must be implied. Rather than being built into the API, they can be carried through the API. An example would be really helpful here.

In the old/original OSPF, all fields were fixed length. There was no information about what any particular field was, because the information being carried was implied by the location of the data in the packet. In this case, the grammar determined the dictionary, both of which had to be coded into the OSPF implementation. The grammar and dictionary are essentially carried from one implementation to another through the OSPF standards or specifications. IS-IS, on the other hand, carries all its information in TLVs, which means at least some information required to interpret the data is carried alongside the data itself. This additional information is called metadata.
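A small sketch may help show the difference. The first parser below reads a fixed-format header, where meaning comes entirely from position; the second walks a TLV stream, where each value is preceded by a type and length describing it, and quietly skips types it does not recognize. The field layouts are simplified illustrations, not the actual OSPF or IS-IS encodings.

# Fixed-format parsing versus TLV parsing (layouts are illustrative only).
import struct

def parse_fixed(buf: bytes):
    """Position implies meaning: the parser must already know the layout."""
    version, kind, length, router_id = struct.unpack("!BBHI", buf[:8])
    return {"version": version, "type": kind, "length": length,
            "router_id": router_id}

def parse_tlvs(buf: bytes):
    """Each value is preceded by its type and length (the metadata)."""
    out, offset = {}, 0
    while offset < len(buf):
        t, l = buf[offset], buf[offset + 1]
        value = buf[offset + 2 : offset + 2 + l]
        offset += 2 + l
        if t == 1:
            out["hostname"] = value.decode()
        # Unknown types are simply skipped, which is what makes extension cheap.
    return out

print(parse_fixed(struct.pack("!BBHI", 2, 1, 24, 0x0A000001)))
print(parse_tlvs(bytes([1, 4]) + b"rtr1" + bytes([99, 2, 0, 0])))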

There are a couple of tradeoffs here (if you haven’t found the tradeoffs, you haven’t looked hard enough). First, the software on each end of the connection that reads and interprets the information is going to be much more complex. Second, the amount of data carried on the wire increases, because you are carrying not only the data but also the metadata. The upside is that IS-IS is easier to extend; implementations can ignore TLVs they don’t understand, new TLVs with new metadata can be added, and so on.

If you want to understand this topic more deeply, this is the kind of thing Computer Networking Problems and Solutions discusses in detail.

In the world of network configuration and management, an example of this kind of system is YANG, which intentionally carries metadata alongside the data itself. In this way, the dictionary is, in a sense, carried with the data. There will always be different words for different objects, of course, but translators can be built to allow one device to talk to another.
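As a rough illustration of data carried with its names attached, the sketch below takes instance data serialized as JSON (as it might be under a YANG model) and translates it into a second naming scheme with nothing more than a name-to-name mapping. The module and leaf names are invented for the example, not drawn from any published model.

# Because the names travel with the data, translation is a dictionary problem.
import json

vendor_a_config = {
    "vendor-a-interfaces:interface": {
        "name": "eth0",
        "ipv4-address": "192.0.2.1/24",
        "enabled": True,
    }
}

# Hypothetical mapping between the two vendors' leaf names.
A_TO_B = {
    "name": "if-name",
    "ipv4-address": "address",
    "enabled": "admin-up",
}

def translate(config: dict) -> dict:
    leaves = config["vendor-a-interfaces:interface"]
    return {"vendor-b-config:port": {A_TO_B[k]: v for k, v in leaves.items()}}

print(json.dumps(translate(vendor_a_config), indent=2))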

There is still the problem of flow, or grammar, which can make it difficult to configure two devices from one set of primitives. “I want to filter packets from this application” can still be expressed by a different process on two different vendor implementations. However, the ability to translate the dictionary part of the problem between devices is a major step forward in solving the problem of building software that will work across multiple devices.

This is why YANG, JSON, and their associated ecosystems really matter.