Micromanaging networks considered harmful: on (k)nerd knobs

Nerd Knobs (or as we used to call them in TAC, knerd knobs) are the bane of the support engineer’s life. Well, that and crashes. And customer who call in with a decoded stack trace. Or don’t know where to put the floppy disc that came with the router into the router. But, anyway…

What is it with nerd knobs? Ivan has a great piece up this week on the topic. I think this is the closest he gets to what I think of as the real root cause for nerd knobs —

Instead of using cookie-cutter designs, we prefer to carefully craft unique snowflakes that magically integrate the legacy stuff that should have been dead years ago with the next-generation technologies… and every unique snowflake needs at least a nerd knob or two to make it work.

Greg has a response to Ivan up; again, I think he gets close to the problem with these thoughts —

Most IT managers have lost the ability to recognise technical debt and its impacts … Nerd Knobs are symptoms of much deeper problems/technical debt in the networking market and treat the cause not the symptom.

A somewhat orthogonal article caught my eye, though, that I think explains what is actually going on here with those pesky nerd knobs. The article is really about SQL and the concept of micromanaging software. To give you a flavor (in case you’re too lazy/busy to head over there and read the whole thing) —

So, here’s an analogy that highlights the key difference between what “imperative” languages like Java or Python and “declarative” languages like SQL do to your computation. In Python, say, you specify step-by-step what the computer should do: open the file; read the first line; if the line doesn’t match some requirement, skip it; update the counter; read the next line; update the counter again; if the counter exceeds some value, stop; if the end of file is reached, close the file; return the counter. Code often accumulates like this and builds up into complex business rules that are usually poorly understood. via infoworld

I think this gets to the heart of the nerd knob problem. What’s happening with nerd knobs is it’s easier to tell the system how we want something done than it is to tell the system what we want to do. Think about this way: you install a routing protocol, and you tell it what you want in broad, general terms. Something like, “I want the shortest path between each pair of points in the network.” Then you run into a situation where you need that modified, so you mess around with the metrics some, and get on with your life. Then you run into a situation where you need this flow to go here, and that flow to go there, so you install some policy based routing along the way.

Per link metrics are just the first level of nerd knobs. Policy based routing is just the second. The more precise we want to get, the deeper the nerd knobs go. Want to load share over links that aren’t truly equal cost? Oh, just nerd knob it. Want to send AS’ in the AS path you shouldn’t? Just nerd knob it.

The reality is every nerd knob in routing represents a policy driven by a business requirement expressed as a tweak to the underlying fundamental routing algorithm. As Ivan rightly points out, going to SDNs isn’t going to solve this problem. If anything, it’s going to make it worse. Now, rather than seeing the nerd knob for what it is, a pain in the butt that needs to be explained and dealt with at 2AM when you’re half asleep and the TAC engineer is halfway around the world, it’s going to be “just another line of code.”

This might sound brilliant to someone who hasn’t managed, or dealt with, multi-million line projects and the vagaries of codebase management. Ask someone who has, though, before you get into this. It’s just a different set of problems, not a better set of problems.

The root cause here, though, isn’t nerd knobs. And it’s not business requirements. And it’s not really laziness (most of the time). It’s not even machismo most of the time (though I will admit the natural arrogance of the geek is probably worth studying by some anthropologist somewhere). There are two root causes, really.

First, we, the networking industry, haven’t really thought through what a control plane actually does. Oh, we have the seven layer model with the control plane thrown off to the side, or the claim that there shouldn’t even be a control plane. But this is part of why I think the seven layer model needs to die — because it’s a host focused view of the networking world. End-to-end and dumb as rocks routers are nice to contemplate, but I think we need to admit that even the dumb rocks are a bit more complex than we first thought.

Second, I don’t think we’ve really incorporated complexity into our souls. As someone once told me, “the CAP theorem is just an observer problem!” Or rather, we somehow believe that by making virtual things we can skip all that ugly physical reality stuff. Faster, cheaper, and better are all three available “on tap,” if we can just figure out how to see the problem right. This is nonsense on stilts.

We need to get in here and do some serious thinking about complexity, and how to manage it in network design. We need to do things like think about interaction surfaces, and how to prevent them from becoming so deep and broad as to be unmanageable. As the article on SQL says, from above —

In a world of regulation and increasing interdependencies between organizations, expressing intent independently of implementation means that you can avoid a class of unintended consequences of systems building.

Where have I heard this before? Oh, maybe it’s in that new book on network complexity someplace.

Seriously — I know this is a long rant, so I’ll quit now, but — seriously (!) we need to grow up and start treating the control plane as an engineering problem. Then, and only then, will we get rid of nerd knobs, no matter whether they’re some hidden CLI command, or some “if/then/else” or “goto” statement hidden someplace in the controller code.

P.S. BTW, Greg, I disagree with you about routing protocols. They’ll “go away” for a short while, until we start trying to deal with networks that don’t run on standards based routing protocols. And then we’ll beg for them to come back. We’ll form something like the IETF, and solve all the same problems all over again, convinced that we can do better than that last group of engineers did. Been there. Done that. Got the t-shirt (someplace).