False Dichotomy

Last week wasn’t a good one for the cause of network engineering. United Airlines grounded flights because of a router failure, the New York Stock Exchange stopped trading for several hours because of a technical problem, and the Wall Street Journal went offline for several hours due to a technical malfunction. How should engineers react to these sorts of large-scale public outages? The first option, of course, is to flail our arms and run out of the room screaming. Panic is a lot of fun when you first engage, but over time it tends to get a little boring, so maybe panic isn’t the right solution here.

Another potential reaction is to jump on the “it’s too complex” bandwagon. Sure, a lot of these systems are very complex; in fact, they’re probably too complex for the actual work they do. Complexity is required to solve hard problems; elegance is choosing the path with the least amount of complexity that will still solve the problem. Far too often in the engineering world, we choose the more complex path because of some imagined requirement that never actually materializes, or because we imagine a world where the solution we’re putting in place today will last forever, and we can just go fishin’ (or some such).

But just as often, we choose solutions because they aren’t what we’re deploying today. In the midst of the mess, a number of people are making the case that “none of this would have happened if we had adopted SDNs a few years back.” Because we distribute the control plane today, a simple fix is to say, “centralize it all, and all these problems will go away.” If we could just get rid of all those hard configurations, and all that distributed code, and put it all in one place where it can be more easily controlled, these sorts of network failures just wouldn’t be possible. Pardon me while I lapse into technical language.

Fiddlesticks. Balderdash. Frogs on stilts.

When I first started working on networks, installing inverse multiplexers to build T1-equivalent lines over really old copper wire (much of it actually cloth-wrapped), there were major outages in the centralized systems that delayed flights, took stock exchanges offline, and caused newspapers to miss publication. In those days, we yelled and screamed about our centralized systems, and complained that compute was tied to storage, which was tied to the network, in a way that made none of them sustainable; we had one huge failure domain, and no way to continue doing business when one part went down. So a lot of individual departments bought PCs and put them in closets where no one could see them, so they could continue doing work when the big computer (the cloud, in today’s terms) went down.

The point is this: Moving to a completely centralized environment isn’t going to be any better than what we’re doing right now. There will still be bugs. There will still be fat-fingered mistakes. Those bugs and mistakes might happen less often, but they’ll likely have a larger impact area. Systems won’t have fewer moving parts; they’ll just look like they do from the outside.

Moving to the opposite point on the pendulum because we ran into a wall on this side of the swing doesn’t mean there won’t be a wall on the other side as well.

Instead, we need to learn to think about what it makes sense to centralize and what it makes sense to distribute: how to split the problems we face into logical chunks and eat them one at a time, rather than assuming the grass is really greener on the other end of the pendulum. As Donnie Savage always says: “In reality, the grass is greener because there’s a drain field.”

The counter to this is to remember: there are no perfect solutions; there are no permanent solutions. Repeat that often, especially when you have a headache, it’s five in the afternoon, and your network has just gone down.
