Research: Lessons from Evolve or Die

Google runs what is probably one of the largest networks in the world. Because of this, network engineers often have two sorts of reactions to anything Google publishes, or does. The first is “my network is not that big, nor that complicated, so I don’t really care what Google is doing.” This is the “you are not a hyperscaler” (YANAH) reaction. The second, and probably more common, reaction is: whatever Google is doing must be good, so I should do the same thing. A healthier reaction than either of these is to examine these papers, and the work done by other hyperscalers, to find the common techniques they are applying to large-scale networks, and then see where they might be turned into, or support, common network design principles. This is the task before us today in looking at a paper published in 2016 by Google called Evolve or Die: High Availability Design Principles Drawn from Google’s Network Infrastructure.

The first part of this paper discusses the basic Google architecture, including a rough layout of the kinds of modules they deploy, the module generations, and the interconnectivity between those modules. This is useful background information for understanding the remainder of the paper, but it need not detain us in this post. Section 3.2, Challenges, is the first section of immediate interest to the “average” network engineer. This section outlines three specific challenges engineers at Google have found in building a highly available network at scale.

The first of these is scale and heterogeneity. Shared fate is not only a problem when multiple virtual links cross a single piece of physical infrastructure; shared fate is also a problem when you consider using a single implementation of a protocol or operating system across an entire network. A single kind of event can trigger a defect in every instance of an implementation, causing a network-wide failure. To counter this, most companies will intentionally purchase equipment from multiple vendors. An interesting aside mentioned in this paper is that the stretching out of hardware and software over life cycles also provides a form of heterogeneity, and hence a counter to the shared fate of a monoculture.

While heterogeneity is a good thing from a shared fate perspective (because it prevents a shared response across all devices to a single problem or state), good network designers always ask “what is the tradeoff?” The tradeoff is the second challenge in the document, device management complexity. The more kinds of devices you deploy in order to provide heterogeneity, the more ways in which you must manage devices. One counter to this is to separate the control plane from the forwarding plane, deploying some form of Software Defined Network (SDN). As the paper notes, however, management platforms simply have not kept up with the changes needed to support centralized control planes.

A first point to learn and apply, then, is this: intentionally using multiple vendors, and multiple generations of equipment, is a good thing in terms of preventing a single event from impacting every device in the network. There is a tradeoff in network management, however, that must be considered. Planning for multiple generations of devices and designs, as well as multiple implementations, is something that should be done early in the design process. One way to do this is to focus on technologies first, then implementations, rather than following the normal design process of purchasing a device, and then figuring out what technologies to deploy on that device to solve a specific problem.

The next interesting section of this paper is Section 6, Root Cause Categories. The chart is hard to read, but it seems to show a few hardware and software failures mixed with a lot of misconfiguration and other human failures. These numbers tend to show the real-world impact of creating a heterogeneous network. Complexity ramps up, and human mistakes follow in the wake of this complexity.

Section 7, High Availability Principles, is the final section of the paper, and holds some very interesting lessons drawn from the research. The first principle listed is use defense in depth. As defined in this paper, the idea of defense in depth is really breaking apart failure domains through modularization and information hiding.

A second point made in the paper is develop fallback strategies. This is another side of failure domains that we often don’t consider in network design: understanding how the network will react to failure, and planning for those reactions. Network engineers plan for the “best case” far too often, fail to consider how traffic and load will shift in the case of a failure, and hence misunderstand the complexities of capacity planning. Most of the time, we seem to assume a failure means traffic will simply not flow; the reality is that IP itself is designed to push the traffic through on any available path. Being able to “see” the converged state of the network, and how convergence will happen, is an important skill for this sort of planning. To get there, you have to understand how the protocols really work.
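To make the capacity planning point concrete, here is a minimal sketch of checking how load shifts after a single link failure. It is not taken from the paper; it assumes a deliberately simple model in which a set of parallel links carries one aggregate demand, split evenly across whatever links survive (roughly what equal-cost multipath would do). The link names, capacities, and the function names are all hypothetical.

```python
def utilization_after_failure(capacities, demand, failed):
    """Per-link utilization when `failed` links are down.

    Assumption: demand is split evenly across the surviving links,
    regardless of their capacity.
    """
    survivors = {name: cap for name, cap in capacities.items()
                 if name not in failed}
    if not survivors:
        raise ValueError("no surviving links to carry the demand")
    share = demand / len(survivors)
    return {name: share / cap for name, cap in survivors.items()}


def overloaded_single_failures(capacities, demand, threshold=1.0):
    """Map each single-link failure to the survivors it pushes past threshold."""
    if len(capacities) < 2:
        raise ValueError("need at least two links to model a failure")
    problems = {}
    for link in capacities:
        util = utilization_after_failure(capacities, demand, {link})
        hot = {name: u for name, u in util.items() if u > threshold}
        if hot:
            problems[link] = hot
    return problems


# Hypothetical example: two large links and one smaller one.
links = {"A": 100.0, "B": 100.0, "C": 40.0}
demand = 90.0

print(utilization_after_failure(links, demand, set()))
print(overloaded_single_failures(links, demand))
```

With every link up, the small link looks fine. Losing either large link, however, shifts enough traffic onto the small link to exceed its capacity, which is exactly the kind of result a “best case” plan misses. A real study would model the actual routing protocol and topology rather than an even split.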

Another principle mentioned in the paper is update network elements consistently. Another problem I often see in network deployments is the idea that we should leave things alone if they are working. This generally means only upgrading boxes that need to be upgraded, and not ripping out old things when we have the chance to do so. Software revisions should be kept as consistent as possible throughout the entire network. Consistency across software revisions does not mean giving up heterogeneity, however; while software should be consistent, systems should be heterogeneous.
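One way to read “consistent software, heterogeneous systems” is that each platform in the network should run a single software revision, even though there are several platforms. The sketch below checks an inventory for that property. The inventory format, device names, platform labels, and version strings are all made up for illustration.

```python
from collections import defaultdict


def version_drift(inventory):
    """Return platforms that run more than one software version.

    `inventory` is a list of dicts with hypothetical keys
    "name", "platform", and "version".
    """
    versions = defaultdict(set)
    for device in inventory:
        versions[device["platform"]].add(device["version"])
    # Keep only the platforms where more than one version is deployed.
    return {platform: sorted(found)
            for platform, found in versions.items()
            if len(found) > 1}


# Hypothetical inventory: two platforms (heterogeneous systems),
# one of which has drifted across software revisions.
inventory = [
    {"name": "edge1", "platform": "vendor-a-os", "version": "7.2.1"},
    {"name": "edge2", "platform": "vendor-a-os", "version": "7.2.1"},
    {"name": "core1", "platform": "vendor-b-os", "version": "4.10"},
    {"name": "core2", "platform": "vendor-b-os", "version": "4.8"},
]

print(version_drift(inventory))
```

Here the network keeps two platforms, so a defect in one implementation does not hit every device, while the report flags only the platform whose devices have drifted apart.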

The Fail Open section goes beyond what might be considered “standard” design and operational practices into more interesting ideas. For instance, require positive and negative means requiring both a list of devices that will be impacted by a change, and a list of devices that will not be impacted by a change. This might be a lot of work, but in some cases it might help catch inconsistencies in the planning stages of a change. Preserve the data plane means keeping forwarding information in place even after the control plane has failed, while traffic is being drained off of a link. This is similar in principle to graceful restart, only taken to the next level. Of course, doing this assumes failure modes have been considered, and that cascading failures will not result.
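The positive and negative lists can be checked against each other mechanically. The following is a minimal sketch of that idea as described here, not Google's tooling: given a device inventory, a list of devices a change should touch, and a list it should not touch, it reports devices that appear on both lists, devices on neither, and devices that are not in the inventory at all. The function name and device names are hypothetical.

```python
def check_change_scope(inventory, impacted, not_impacted):
    """Cross-check the positive and negative device lists for a change."""
    inventory = set(inventory)
    impacted = set(impacted)
    not_impacted = set(not_impacted)
    return {
        # Listed as both impacted and not impacted: a contradiction.
        "in_both": sorted(impacted & not_impacted),
        # In the inventory but on neither list: nobody thought about it.
        "unaccounted": sorted(inventory - impacted - not_impacted),
        # On a list but missing from the inventory: stale or mistyped name.
        "unknown": sorted((impacted | not_impacted) - inventory),
    }


# Hypothetical change plan with one of each kind of mistake.
report = check_change_scope(
    inventory=["r1", "r2", "r3", "r4"],
    impacted=["r1", "r2"],
    not_impacted=["r2", "r3", "r9"],
)
print(report)
```

A plan is internally consistent only when all three lists in the report are empty; anything else is worth resolving before the change window rather than during it.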

This is an interesting paper for network engineers looking for the practical application of mostly common design principles in large-scale networks. It also shows that the problems at scale are often not all that different from those at any other scale; only the level of detail and planning becomes that much more important.