Cloudy with a chance of lost context
One thing we often forget in our drive toward "more" is this: context matters. Ivan hit on this point just a few days ago with a post about the impact of controller failures in software-defined networks.
What is the context in terms of a network controller? To broaden the question—what is the context for anything? The context, in philosophical terms, is the telos, the final end, or the purpose of the system you are building. Is the point of the network to provide neat gee-whiz capabilities, or is it to move data in a way that creates value for the business?
Ivan’s article reminded me of another article I read recently, by Bob Frankston, about the larger world of cloud, IoT, and context problems. Frankston says:
We are currently building networks on the presupposition that cloud-based services are more secure, more agile, and more reliable than anything we can build locally. Somehow we think "the cloud" will never fail, that it cannot fail.
But of course, cloud services can fail, and we know this because they have. Office 365 has failed a number of times, including this one. Azure failed at least once in 2014 because of operator error. Dyn was taken down by a DDoS attack in 2016. Google Cloud has had its own outages, and Salesforce suffered a major outage back in 2016 as well.
The point is not that we should not use cloud services. The point is that systems should be designed so the failure of a centralized component does not take local services down completely; they should keep operating, at least at some minimal level. And yet, how many networks today rely on cloud-based services to operate at all, much less correctly? Would your company’s email still be delivered if your cloud-based anti-spam service failed? Would your company’s building locks still work if the cloud-based security service you use went down?
If they would not, do you have a contingency plan in place for when these services do fail? Can you open the doors manually? Do you have a way to turn off the spam filtering, or to manually configure the network to "get by" until service can be restored? Do you know which data flows must be supported during an outage of this sort, or what the impact on the entire system will be? And do you have a plan to restore the network and its services to their optimal state once the external services that optimal operation relies on are available again?
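To make the idea of graceful degradation concrete, here is a minimal sketch in Python built around a hypothetical cloud spam-checking endpoint; the service URL and the JSON response shape are assumptions made purely for illustration. The point of the pattern is that a timeout or outage in the cloud dependency degrades the local service (mail is delivered unscanned, to be re-checked later) rather than halting it.

```python
# A minimal sketch of fail-open behavior around a hypothetical cloud-hosted
# spam filter. The endpoint URL and the JSON response shape are assumptions
# made for illustration only.
import json
import urllib.request

CLOUD_FILTER_URL = "https://spam-filter.example.com/check"  # hypothetical

def is_spam(message: str, timeout: float = 2.0) -> bool:
    """Ask the cloud service whether a message is spam.

    If the service is unreachable or slow, fail open: deliver the message
    (and flag it for a later re-scan) rather than blocking all mail because
    a centralized dependency is down.
    """
    request = urllib.request.Request(
        CLOUD_FILTER_URL,
        data=json.dumps({"body": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return bool(json.load(response).get("spam", False))
    except (OSError, ValueError):
        # OSError covers network failures and timeouts; ValueError covers a
        # malformed response. Either way, the local service keeps working.
        return False
```

The same reasoning applies to the building-lock example: the local controller should hold enough state to make a safe decision on its own when the cloud service cannot be reached.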
A larger point can be made here: when we pull cloud-hosted services into the operation of the network itself, we often rely on unexamined presuppositions, and we often leave the system exposed to unexamined, little-understood risks.
Context matters—what is it you are trying to do? Do you really need to do it this way? What happens if the centralized or cloudy system you rely on fails? It’s not that we should not use or deploy cloud-based systems—but we should try to expose and understand the failure modes and risks we are building into our networks.