Nines are not enough

How many 9’s is your network? How about your service provider’s? Now, to ask the not-so-obvious question—why do you care? Does the number of 9’s actually describe the reliability of the network? According to Jeffery Mogul and John Wilkes, nines are not enough. The question is—while this paper was written for commercial relationships and cloud providers, is it something you can apply to running your own network? Let’s dive into the meat of the paper and find out.

While 5 9’s is normally given as a form of Service Level Agreement (SLA), there are two other measures of reliability a network operator needs to consider—the Service Level Objective (SLO), and the Service Level Indicator (SLI). The SLO defines a set of expectations about the level of service; internal SLO’s define “trigger points” where actions should be taken to prevent an external SLO from failing. For instance, if the external SLO says no more than 2% of the traffic will be dropped on this link, the internal SLO might say if more than 1% of the traffic on this link is dropped, you need to act. The SLA, on the other hand, says if more than 2% of the traffic on this link is dropped, the operator will rebate (some amount) to the customer. The SLI says this is how I am going to measure the percentage of packets dropped on this link.

Splitting these three concepts apart helps reveal what is wrong with the entire 5 9’s way of thinking, because it enables you to ask questions like—can my telemetry system measure and report on the amount of traffic dropped on this link? Across what interval should this SLI apply? If I combine all the SLI’s across my entire network, what does the monitoring system need to look like? Can I support the false positives likely to occur with such a monitoring system?

These questions might be obvious, of course, but there are more non-obvious ones, as well. For instance—how do my internal and external SLO’s correlate to my SLI’s? Measuring the amount of traffic dropped on a link is pretty simple (in theory). Measuring something like this application will not perform at less than 50% capacity because of network traffic is going to be much, much harder.

The point Mogul and Wilkes make in this paper is that we just need to rethink the way we write SLO’s and their resulting SLA’s to be more realistic—in particular, we need to think about whether or not the SLI’s we can actually measure and act on can cash the SLO and SLA checks we’re writing. This means we probably need to expose more, rather than less, of the complexity of the network itself—even though this cuts against the grain of the current move towards abstracting the network down to “ports and packets.” To some degree, the consumer of networking services is going to need to be more informed if we are to build realistic SLA’s that can be written and kept.

How does this apply to the “average enterprise network engineer?” At first glance, it might seem like this paper is strongly oriented towards service providers, since there are definite contracts, products, etc.,  in play. If you squint your eyes, though, you can see how this would apply to the rest of the world. The implicit promise you make to an application developer or owner that their application will, in fact, run on the network with little or no performance degradation is, after all, an SLO. Your yearly review examining how well the network has met the needs of the organization is an SLA of sorts.

The kind of thinking represented here, if applied within an organization, could turn the conversation about whether to out- or in-source on its head. Rather than talking about the 5 9’s some cloud provider is going to offer, it opens up discussions about how and what to measure, even within the cloud service, to understand the performance being offered, and how more specific and nuanced results can be measured against a fuller picture of value added.

This is a short paper—but well worth reading and considering.