Random Thoughts on Grey Failures and Scale

I have used the example of increasing paths to the point where the control plane converges more slowly, impacting convergence, hence increasing the Mean Time to Repair, to show that too much redundancy can actually reduce overall network availability. Many engineers I’ve talked to balk at this idea, because it seems hard to believe that adding another link could, in fact, impact routing protocol convergence in such a way. I ran across a paper a while back that provides a different kind of example about the trade-off around redundancy in a network, but I never got around to actually reading the entire paper and trying to figure out how it fits in.


In Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, the authors argue that one of the main problems with building a cloud system is with grey failures—when a router fails only some of the time, or drops (or delays) only some small percentage of the traffic. The example given is—

  • A single service must collect information from many other services on the network to complete a particular operation
  • Each of these information collection operations represent a single transaction carried across the network
  • The more transactions there are, the more likely it is that every path in the network is going to be used
  • If every path in the network is used, every grey failure is going to have an impact on the performance of the application
  • The more devices there are physically, the more likely at least one device is going to exhibit a grey failure

In the paper, the fan out is assumed to be the number of transactions over the number of routers (packet forwarding devices). The lower the number of routers, the more likely it is that every one of them will be used to handle a large number of flows, but the more routers there are, the more likely it is that one of them will experience some kind of grey failure. There is some point at which there are enough flows over enough devices that the application will always be impacted by at least one grey failure, harming its performance. When this point is reached, the system is going to start exhibiting performance loss that is going to be very hard to understand, much less troubleshoot and repair.

Maybe an example will be helpful here. Say a particular model of router has a 10% chance of hitting a bug where the packet processing pipeline fails, and it takes some number of milliseconds to recover. Now, look at the following numbers—

  1. 1 flow over 2 routers; the application has a 50% chance of using one path or the other
  2. 10 flows over 2 routers; the application has close to a 100% chance of using every path
  3. 10 flows over 100 routers; the application has a 10% chance of using any given path path
  4. 10,000 flows over 1,000 routers; the application has close to a 100% chance of using every path

If you treat the number of flows as the numerator, and the number of paths as the denominator of a simple fraction, so long as the number of flows “swamps” the number of paths, the chances of every path in the network being used is very high. Now consider what happens if 1% of the routers produced and shipped will have the grey failure. Some very (probably not perfect) back of an envelope numbers, corresponding to the four bullets above—

  1. There is a .5% chance of the application being impacted by the grey failure
  2. There is a 2% chance of the application being impacted by the grey failure
  3. There is a 1% chance of the application being impacted by the grey failure
  4. There is a close to 100% chance of the application being impacted by the grey failure

This last result is somewhat surprising. At some point, this system—the application and network combined—will cross a threshold where the grey failure will always impact the operation of the application. But when you build a network of 1000 devices, and introduce massive equal cost multipath in the network, the problem becomes almost impossible to trace down and fix. Which of the 1000 devices has the grey failure? If the failure is transient, this is the proverbial broken connector in the box that has been sitting in the corner of the lab for the last 20 years problem.

How can this problem be solved? The paper in hand suggests that the only real solutions are more and more accurate measurements. The problem with this solution is, of course, that measurements generate data, and data must be… carried over the network, which makes the measurement data itself subject to the grey failures. What has been discovered so far, though, is this: some combination of careful measurement, combined with carefully testing across a wide variety of workload profiles, and mixed together in some data analytics, can find these sorts of problems.

Returning to the original point of this post—is this ultimately another instance of larger amounts of state causing the network to converge more slowly? No. But it still seems to be another case of more redundancy, both in the network and the way the application is written, opening holes where lower availability becomes a reality.