Reaction: Keith’s Law

Ethan pointed me to this post about complexity and incremental improvement in a Slack message. There are some interesting ideas here, leading in a number of different directions, that might be worth your reading time. The post begins with an explanation of what the author calls “Keith’s Law”—

I am going to paraphrase the version he shared over lunch at the Facebook campus a few years ago and call it Keith’s Law: In a complex system, the cumulative effect of a large number of small optimizations is externally indistinguishable from a radical leap. If you want to do big things in a software-eaten world, it is absolutely crucial that you understand Keith’s Law. —Breaking Smart

The author attributes this to the property of emergence. Given I don’t believe in blind emergence, I would attribute this effect to the intertwined work of many intelligent actors, each producing an effect that at least many of them probably wanted (the improvement of the complex system), and each working in their own sphere without realizing the overall multiplier effect of their individual actions. If that was too long and complicated, perhaps this is shorter and better—

The law of unintended consequences runs both ways.

Many of us call out the law of unintended consequences when bad things happen, but few do when good things happen. The reality, however, is that the law of unintended consequences works in both directions. It is possible for many people pulling in (roughly) the same direction, in pursuit of a combined goal, to align closely enough in their efforts that the cumulative effect is a major leap forward. We expect this sort of thing in a small team with strong leadership and narrowly defined goals—for instance, a sculling team must have precise coordination, combined with individual strength and stamina, in order to win a race.

Technology and companies, however, tend to be larger, more complex systems than a sculling team, with many more goals (or perhaps subgoals under one larger goal). So the instances where real cooperation happens might be, as the author of the piece above says, ten minutes per year. As interesting as this observation might be, there are some other points buried in this piece that can provide broader guidance for designers and architects. Three points in particular—

20/ At smaller scales, one person with god-level visibility into, and comprehension of, the system can keep it all in their head and herd the various knob-turnings/tweakings in a chosen direction.

21/ But in a complex system, where hundreds of people can be doing little things, this does not work. And more communication is not the answer (at least not the whole answer).

22/ The way out is people with strong “finger-tip feeling” (which we’ve discussed before), herding the system, which can turn the uncontrolled random walk into controlled, cumulative gains.
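That last point is worth making concrete. Below is a minimal sketch, entirely my own illustration and not from the original post, comparing an uncoordinated random walk with a lightly “herded” one; every name and number in it is invented. The only claim is statistical: many tiny changes that are pure noise cancel out, while the same changes with a small, consistent bias accumulate.

```python
import random

def tweak_system(steps: int, feel: float) -> float:
    """Accumulate many small knob-turns on one system.

    feel = 0.0 models uncoordinated tweaking (a pure random walk);
    a small positive feel models people with "finger-tip feeling"
    nudging each change slightly toward the goal.
    """
    position = 0.0
    for _ in range(steps):
        # each individual change is tiny and mostly noise
        position += random.gauss(feel, 1.0)
    return position

random.seed(42)
trials, steps = 200, 10_000  # many small optimizations, repeated

uncoordinated = sum(tweak_system(steps, 0.0) for _ in range(trials)) / trials
herded = sum(tweak_system(steps, 0.05) for _ in range(trials)) / trials

print(f"uncoordinated walk ends near:  {uncoordinated:7.1f}")
print(f"lightly herded walk ends near: {herded:7.1f}")  # roughly steps * feel = 500
```

The bias per step is tiny, 0.05 against noise of 1.0, yet over ten thousand small tweaks it produces a move that looks, from the outside, like a radical leap.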

What the original post suggests is that if you can get people who have a good “feel” for the direction things need to go into the right places, and each of them turns the right “knobs” in the right way, the system will eventually move toward the intended goal(s). This, I think, is true, but there’s another piece in here that needs more serious consideration, specifically—

22/ As Keith observed once, most humans can at best understand 1-2 levels of abstraction above/below their home zone. Beyond, you rely on things like metaphor and pop sociology.

And this, I think, is something we don’t often take seriously enough. But to get there, I need to dive into personal experience a little. I’ve worked in many different areas of the network industry, from taking cases and global escalations to large-scale design and developing protocols (and hence products), along with a few other things. In the world of vendors, the system sizes are often “small enough” that some small set of folks can understand the entire system, at least at some level. One accessible instance of this is described in The Soul of a New Machine, which follows a team developing a new minicomputer in the late 1970s. At many large operators, however, no such person (or team of people) exists. It is simply not possible for them to exist because of the levels-of-abstraction rule stated above—there is simply too much for any one person, or even a team of people, to understand.

There are counters to this, of course, something I’ve learned studying hyperscale implementations now that I work for a hyperscaler.

The first is radical simplification. The hyperscale world lives at the extreme end of the rule that a system is not complete until everything that is not needed has been removed. The network must be simple. No, simpler than you think, and even simpler still. There is a direct tradeoff between scale and complexity: to support hundreds of thousands of servers, everything must be as simple as it can possibly be. If scale brings its own complexity, then the answer is not a random walk, but rather to remove complexity in other places until the system is at least understandable within the three-level pattern of abstraction described above (is it any accident the three-layer hierarchical model was so popular in the early years of network design?).

The second is radical separation. One of the solutions to complexity is to separate complexity from complexity, almost a cliché by now. The reality, though, is that hyperscalers apply this lesson between systems and applications, not just between topologies. Separating complexity from complexity relies on abstractions, of course, to break up failure domains, provide choke points, etc.—and abstractions leak. But no matter how leaky the abstraction, to abstract is better than not to abstract.
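As a rough sketch of what that separation can look like, consider the following (hypothetical names throughout, assuming Python 3.9+ for the type hints): the layers above depend only on a narrow interface, so everything behind it forms its own failure domain and can change without rippling outward.

```python
from typing import Protocol

class Fabric(Protocol):
    """The narrow abstraction: layers above see only this."""
    def path(self, src: str, dst: str) -> list[str]: ...

class SimpleClos:
    """One implementation; its internals are a separate failure domain."""
    def __init__(self, spines: list[str]) -> None:
        self.spines = spines

    def path(self, src: str, dst: str) -> list[str]:
        # internal detail: pin each flow to one spine by hashing it;
        # nothing above the Fabric abstraction knows or cares
        spine = self.spines[hash((src, dst)) % len(self.spines)]
        return [src, spine, dst]

def place_workload(fabric: Fabric, src: str, dst: str) -> None:
    # this layer depends only on the abstraction, so the Clos
    # internals can be rewritten with purely local impact
    print(" -> ".join(fabric.path(src, dst)))

place_workload(SimpleClos(["spine1", "spine2"]), "leaf1", "leaf4")
```

The abstraction will leak, as noted above, but the blast radius of any change inside SimpleClos is bounded by the one-method interface, which is the point.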

So if you want the law of unintended consequences to run both ways, it takes more than just “fingertip feeling” in the right places. It also takes breaking up the ground where people are, so they can make their little piece of the puzzle work a little better. And that is not going to happen until they have the freedom to do things differently, which means their piece of the puzzle must be separate enough from the whole for local actions to have local impact. In other words, the basic concepts of network design, such as separating complexity from complexity and simplification, work at organizational scales as well.

Here is a lesson worth considering.