Research: Lessons from Evolve or Die

Google runs what is probably one of the largest networks in the world. Because of this, network engineers often have two sorts of reactions to anything Google publishes or does. The first is “my network is not that big, nor that complicated, so I don’t really care what Google is doing.” This is the “you are not a hyperscaler” (YANAH) reaction. The second, and probably more common, reaction is: whatever Google is doing must be good, so I should do the same thing. A healthier reaction than either of these is to examine these papers, and the work done by other hyperscalers, to find the common techniques they are applying to large-scale networks, and then see where those techniques might be turned into, or support, common network design principles. This is the task before us today in looking at a paper published by Google in 2016 called Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.

The first part of this paper discusses the basic Google architecture, including a rough layout of the kinds of modules they deploy, the module generations, and the interconnectivity between those modules. This is useful background information for understanding the remainder of the paper, but it need not detain us in this post. Section 3.2, Challenges, is the first section of immediate interest to the “average” network engineer. This section outlines three specific challenges engineers at Google have found in building a highly available network at scale.

The first of these is scale and heterogeneity. Shared fate is not only a problem when multiple virtual links cross a single piece of physical infrastructure; shared fate is also a problem when a single implementation of a protocol or operating system is used across an entire network. A single kind of event can trigger a defect in every instance of an implementation, causing a network-wide failure. To counter this, most companies will intentionally purchase equipment from multiple vendors. An interesting aside mentioned in this paper is that stretching hardware and software out over long life cycles also provides a form of heterogeneity, and hence a counter to the shared fate of a monoculture.

While heterogeneity is a good thing from a shared fate perspective (because it prevents a shared response across all devices to a single problem or state), good network designers always ask “what is the tradeoff?” The tradeoff is the second challenge in the document, device management complexity. The more kinds of devices you deploy in order to provide heterogeneity, the more ways in which you must manage devices. One counter to this is to separate the control plane from the forwarding plane, deploying some form of Software Defined Network (SDN). As the paper notes, however, management platforms simply have not kept up with the changes needed to support centralized control planes.

A first point to learn and apply, then, is this: intentionally using multiple vendors, and multiple generations of equipment, is a good thing in terms of preventing a single event from impacting every device in the network. There is a tradeoff in network management, however, that must be considered. Planning for multiple generations of devices and designs, as well as multiple implementations, is something that should be done early in the design process. One way to do this is to focus on technologies first, then implementations, rather than following the normal design process of purchasing a device and then figuring out what technologies to deploy on that device to solve a specific problem.

The next interesting section of this paper is section 6, Root Cause Categories. The chart is hard to read, but it seems to show a few hardware and software failures mixed with a lot of misconfiguration and other human failures. These numbers tend to show the real-world impact of building a heterogeneous network: complexity ramps up, and human mistakes follow in the wake of that complexity.

Section 7, High Availability Principles, is the final section of the paper, and holds some very interesting lessons drawn from the research. The first principle listed is use defense in depth. As defined in this paper, the idea of defense in depth is really breaking apart failure domains through modularization and information hiding.

A second point made in the paper is develop fallback strategies. This is another side of failure domains we often don’t consider in network design: understanding how the network will react to failure, and planning for those reactions. Network engineers plan for the “best case” far too often, failing to consider how traffic and load will shift in the case of a failure, and hence misunderstanding the complexities of capacity planning. Most of the time, we seem to assume a failure means traffic will simply not flow; the reality is that IP itself is designed to push the traffic through on any available path. Being able to “see” the converged state of the network, and how convergence will happen, is an important skill for this sort of planning. To get there, you have to understand how the protocols really work.
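To make the capacity planning point concrete, here is a minimal sketch, in Python, of the kind of “what happens when this link fails” check being described. The three-link topology, the capacities and loads, and the assumption that all displaced traffic lands on a single predeclared backup path are made up for illustration; real convergence depends on the actual metrics, ECMP behavior, and routing protocol in use.

```python
# A minimal failure-case capacity check over a made-up topology.
# Each link declares the backup path the routing protocol would
# converge onto if the link fails (a simplifying assumption).

# capacity and load in Gbps, keyed by hypothetical link names
links = {
    "core1-core2": {"capacity": 100, "load": 55, "backup": ["core1-agg1", "agg1-core2"]},
    "core1-agg1":  {"capacity": 40,  "load": 12, "backup": []},
    "agg1-core2":  {"capacity": 40,  "load": 18, "backup": []},
}

def check_failure(failed_link):
    """Shift the failed link's load onto its backup path and report post-failure utilization."""
    shifted = links[failed_link]["load"]
    report = []
    for hop in links[failed_link]["backup"]:
        post_load = links[hop]["load"] + shifted
        utilization = post_load / links[hop]["capacity"]
        report.append((hop, post_load, utilization))
    return report

for hop, post_load, util in check_failure("core1-core2"):
    status = "OVERLOADED" if util > 1.0 else "ok"
    print(f"{hop}: {post_load} Gbps ({util:.0%}) {status}")
```

Even this toy example shows the point: the traffic does not stop flowing when core1-core2 fails, it lands on the backup path and pushes both remaining links well past their capacity.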

Another principle mentioned in the paper is update network elements consistently. Another problem I often see in network deployments is the idea that we should leave things alone if they are working. This generally means only upgrading boxes that need to be upgraded, and not ripping out old things when we have the chance to do so. Instead, software revisions should be kept as consistent as possible throughout the entire network. Consistency across software revisions does not mean giving up heterogeneity, however; while software should be consistent, systems should be heterogeneous.
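As a rough illustration of what “consistent within a platform, heterogeneous across platforms” can look like operationally, here is a small Python sketch that audits a device inventory for software drift. The inventory, platform names, and version strings are hypothetical; in practice the data would come from whatever management or inventory system is already in place.

```python
from collections import defaultdict

# Hypothetical inventory; in practice this would be pulled from the
# management system rather than hard-coded.
inventory = [
    {"name": "spine1", "platform": "vendorA-chassis", "version": "7.0.3"},
    {"name": "spine2", "platform": "vendorA-chassis", "version": "7.0.3"},
    {"name": "leaf1",  "platform": "vendorB-switch",  "version": "4.21.3"},
    {"name": "leaf2",  "platform": "vendorB-switch",  "version": "4.20.1"},  # lagging behind
]

# Collect the versions seen per platform: heterogeneity across platforms
# is intentional, drift within a single platform is what we want to surface.
versions = defaultdict(set)
for device in inventory:
    versions[device["platform"]].add(device["version"])

for platform, seen in versions.items():
    if len(seen) > 1:
        print(f"DRIFT on {platform}: {sorted(seen)}")
    else:
        print(f"{platform}: consistent ({seen.pop()})")
```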

The Fail Open section goes beyond what might be considered “standard” design and operational practices into more interesting ideas. For instance, require positive and negative means requiring both a list of devices that will be impacted by a change and a list of devices that will not be impacted by the change. This might be a lot of work, but in some cases it can help catch inconsistencies in the planning stages of a change. Preserve the data plane means keeping forwarding information in place, even after the control plane has failed, while traffic is drained off of a link. This is similar in principle to graceful restart, only taken to the next level. Of course, doing this assumes failure modes have been considered, and cascading failures will not result.
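A minimal sketch of how such a positive-and-negative check might be automated follows; the device names are hypothetical, and the check simply verifies that the two lists are disjoint and that, together, they account for every device in the inventory.

```python
# Sanity-check a change plan: every device must appear on exactly one
# of the two lists before the change is approved. Names are hypothetical.

all_devices  = {"spine1", "spine2", "leaf1", "leaf2", "border1"}
impacted     = {"spine1", "spine2"}   # devices the change will touch
not_impacted = {"leaf1", "leaf2"}     # devices asserted to be untouched

overlap     = impacted & not_impacted
unaccounted = all_devices - impacted - not_impacted

if overlap:
    print(f"inconsistent plan, listed as both: {sorted(overlap)}")
if unaccounted:
    print(f"no assertion made for: {sorted(unaccounted)}")  # border1 in this example
if not overlap and not unaccounted:
    print("every device accounted for; the change plan is internally consistent")
```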

This is an interesting paper for network engineers looking for the practical application of mostly common design principles in large-scale networks. It also shows that the problems at scale are often not all that different from those at any other scale; only the level of detail and planning required is that much greater.

Off the Cuff: Microsoft Purchases Github

Last week, Eyvonne, Donald, Alistair, and I sat and talked about the recent purchase of Github by Microsoft. Will this be the end of Github as a widely used open source repository, or will we all look back in five years and think “move along, nothing to see here?”

Weekend Reads 061518: A 51% attack materializes

In recent days the nightmare scenario for any cryptocurrency is playing out for Bitcoin Gold, as an attacker has taken control of its blockchain and proceeded to defraud cryptocurrency exchanges. All the Bitcoin Gold in circulation is valued at $786 million, according to data provider Coinmarketcap. Blockchains are designed to be decentralized but when an individual or group acting in concert controls the majority of a blockchain’s processing power, they can tamper with transactions and pave the way for fraud. This is known as a 51% attack.—Joon Ian Wong @Quartz

We have also discovered a new stage 3 module that injects malicious content into web traffic as it passes through a network device. At the time of our initial posting, we did not have all of the information regarding the suspected stage 3 modules. The new module allows the actor to deliver exploits to endpoints via a man-in-the-middle capability (e.g. they can intercept network traffic and inject malicious code into it without the user’s knowledge). With this new finding, we can confirm that the threat goes beyond what the actor could do on the network device itself, and extends the threat into the networks that a compromised network device supports. @Cisco TALOS

It is probably a myth that Bill Gates said “640 KB ought to be enough,” and whether or not he said it the truth is that it has never been enough. Ever since we first started building computers, no amount of memory has ever been enough – and it never will be. Data is a gas, and it rapidly expands to fill any and all available space and then continues to apply direct and significant pressure to the walls of the container. —James Cuff @The Next Platform

Guardicore Labs team has uncovered a traffic manipulation and cryptocurrency mining campaign infecting a wide number of organizations in industries such as finance, education and government. This campaign, dubbed Operation Prowli, spreads malware and malicious code to servers and websites and has compromised more than 40,000 machines in multiple areas of the world. Prowli uses various attack techniques including exploits, password brute-forcing and weak configurations. @Guardicore

A bipartisan trio of lawmakers introduced an amendment to the National Defense Authorization Act pushing back on national security threats posed by Chinese telecom giants Huawei and ZTE. Sen. Tom Cotton (R., Ark.), Senate Minority Leader Chuck Schumer (D., N.Y.), and Sen. Chris Van Hollen (D., Md.) sponsored an amendment that would prohibit U.S. government agencies from purchasing or leasing telecommunications equipment or services from Huawei, ZTE or any of its affiliates or subsidiaries, according to a release. —David Rutz @Free Beacon

In light of the limited DPA or jurisprudential guidance concerning the legitimacy of providing any non-public WHOIS data to any class of third party, third parties are dependent on ad hoc determinations as to whether their legitimate interests are outweighed by privacy rights in any given case. While certain contracted parties appear to be providing limited guidance as to what information they require in order to respond favorably to a data access request (of course with no guarantee of success), the vast majority have not provided any such guidance, and all decisions are made on a case-by-case basis with no transparent or predictable criteria. —Brian Winterfeldt @CircleID

On the ‘net: Clarifying Disaggregation

Software Defined Networks (SDNs), Network Function Virtualization (NFV), white box, and large-scale fabrics built out of small fixed configuration devices are all trends that seem to be “top of mind” right now in the network engineering world. While these four trends appear to be completely different ideas, they are closely related through a single concept: disaggregation. This article will consider the meaning of disaggregation, and then how this single concept applies to the four previously noted movements. @SearchSDN

Short Take: Practical Career Advice

One problem I’ve heard in the past is that much of the career advice given in the networking world is not practical. In this short take, I take this problem on, explaining why it might be more practical than it initially seems.
