Grey Failure Lessons Learned

Grey Failures in the Real World

Most “smaller scale” operators probably believe they are not impacted by grey failures, but this is probably not true. Given the law of large numbers, there must be some number of grey failures in some percentage of smaller networks simply because there are so many of them. What is interesting about grey failures is there is so little study in this area; since these errors can exist in a network for years without being discovered, they are difficult to track down and repair, and they are often “fixed” by someone randomly doing things in surrounding systems that end up performing an “unintentional repair” (for instance by resetting some software state through a reboot). It is interesting, then, to see a group of operators collating the grey failures they have seen across a number of larger scale networks.

Gunawi, Haryadi S., Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, et al. “Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems,” 1–14, 2018.

Some interesting results of the compilation are covered in a table early in the document. One of these is that grey failures can convert from one form to another, or rather a single grey failure can express itself in many different ways. This is one of the reasons these kinds of failures can be difficult to trace and repair. For instance, a single link that drops 5% of the traffic will impact different applications at different times, depending on variations in flow startup and ECMP hashing. Another interesting effect of grey failures is a single failure can cascade across multiple systems. The example given in the document is a fan that fails in a way to increase vibration while running less efficiently. The hardware management software may well increase the run speed of the fan higher in order to compensate, increasing the fan’s vibration. This vibration, in turn, causes a nearby hard drove to fail more quickly. The hard drive may, in fact, end up being replaced on a regular basis without anyone ever thinking to check nearby fans to see if they are causing this particular hard drive slot to fail hardware more frequently.

The authors make a number of suggestions for finding and resolving these long-tail errors in a large-scale system. They argue vendors should unmask errors if they occur frequently enough. Further, they argue the nature of grey failures require operators to troubleshoot and repair these failures in the operating system. Operators, then, need to build systems with monitoring that can be refined when needed to chase down grey failures in the operational environment. This also means operators need to spend time troubleshooting in the production environment before jumping to a lab, or assuming that a problem that cannot be reproduced is not really a problem at all. A third suggestion made here is to broaden fuzz testing to include grey failures; intentionally injecting failures is a tried and true method for understanding how a system works, so this is solid advice in general.

What is not mentioned in the document is that many of these failures are a result of increasing system complexity. The example of the fan and hard drive, for instance, is really an instance of a hidden interaction surface; it is simply a result of placing multiple complex systems close to one another without considering how they might interact in unexpected ways. There is another important lesson here in learning how to look for and see unexpected interaction surfaces, and understanding how these surfaces can impact system operation.

Complexity, ultimately, is not only the enemy of security, but also the enemy of consistent system operation and mean time to repair.

Reduce, reuse, and consider complexity in system design.