Micromanaging networks considered harmful: on (k)nerd knobs

Nerd Knobs (or as we used to call them in TAC, knerd knobs) are the bane of the support engineer’s life. Well, that and crashes. And customers who call in with a decoded stack trace. Or who don’t know where to put the floppy disc that came with the router into the router. But, anyway…

What is it with nerd knobs? Ivan has a great piece up this week on the topic. I think this is the closest he gets to what I think of as the real root cause for nerd knobs —

Instead of using cookie-cutter designs, we prefer to carefully craft unique snowflakes that magically integrate the legacy stuff that should have been dead years ago with the next-generation technologies… and every unique snowflake needs at least a nerd knob or two to make it work.

Greg has a response to Ivan up; again, I think he gets close to the problem with these thoughts —

Most IT managers have lost the ability to recognise technical debt and its impacts … Nerd Knobs are symptoms of much deeper problems/technical debt in the networking market and treat the cause not the symptom.

A somewhat orthogonal article caught my eye, though, that I think explains what is actually going on here with those pesky nerd knobs. The article is really about SQL and the concept of micromanaging software. To give you a flavor (in case you’re too lazy/busy to head over there and read the whole thing) —

So, here’s an analogy that highlights the key difference between what “imperative” languages like Java or Python and “declarative” languages like SQL do to your computation. In Python, say, you specify step-by-step what the computer should do: open the file; read the first line; if the line doesn’t match some requirement, skip it; update the counter; read the next line; update the counter again; if the counter exceeds some value, stop; if the end of file is reached, close the file; return the counter. Code often accumulates like this and builds up into complex business rules that are usually poorly understood. via infoworld
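To make that contrast concrete, here is a minimal sketch of the same counting job written both ways. Everything in it is made up for illustration: the sample lines, the "starts with ok" requirement, and the function names are my assumptions, not anything from the article. The first function spells out every step; the second hands the intent to an in-memory SQLite database and lets it work out the how.

```python
import sqlite3

# Hypothetical sample data: lines that start with "ok" meet the requirement.
LINES = ["ok 1", "skip", "ok 2", "ok 3", "skip", "ok 4"]


def count_imperative(lines, limit):
    """Say HOW: walk the lines, skip, count, and stop, one step at a time."""
    counter = 0
    for line in lines:
        if not line.startswith("ok"):
            continue  # line doesn't match the requirement, skip it
        counter += 1
        if counter >= limit:
            break  # the counter reached the limit, stop reading
    return counter


def count_declarative(lines, limit):
    """Say WHAT: count matching rows, capped at limit; the engine picks the steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (line TEXT)")
    conn.executemany("INSERT INTO log VALUES (?)", [(l,) for l in lines])
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT 1 FROM log WHERE line LIKE 'ok%' LIMIT ?)",
        (limit,),
    ).fetchone()
    conn.close()
    return count
```

Both return the same number for the same input. The difference is where the step-by-step decisions live: in your code, or in the engine.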

I think this gets to the heart of the nerd knob problem. What’s happening with nerd knobs is that it’s easier to tell the system how we want something done than it is to tell the system what we want to do. Think about it this way: you install a routing protocol, and you tell it what you want in broad, general terms. Something like, “I want the shortest path between each pair of points in the network.” Then you run into a situation where you need that modified, so you mess around with the metrics some, and get on with your life. Then you run into a situation where you need this flow to go here, and that flow to go there, so you install some policy based routing along the way.

Per link metrics are just the first level of nerd knobs. Policy based routing is just the second. The more precise we want to get, the deeper the nerd knobs go. Want to load share over links that aren’t truly equal cost? Oh, just nerd knob it. Want to send AS’ in the AS path you shouldn’t? Just nerd knob it.
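Here is a toy sketch of how those layers pile up. It is not how any real routing protocol is implemented; the four-node topology, the metrics, the flow names, and the `next_hop` helper are all assumptions I invented for illustration. The base intent is a plain shortest-path computation (Dijkstra), the first knob is a tweaked link metric, and the second knob is a per-flow override that ignores the computation entirely.

```python
import heapq


def shortest_path(graph, src, dst):
    """Plain Dijkstra over {node: {neighbor: metric}}; returns (cost, path) or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, metric in graph[node].items():
            if nbr not in seen:
                heapq.heappush(queue, (cost + metric, nbr, path + [nbr]))
    return None


# Hypothetical topology: A reaches D through B (cheap) or through C (expensive).
GRAPH = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 5},
    "D": {"B": 1, "C": 5},
}

# The intent: "I want the shortest path." No knobs.
base = shortest_path(GRAPH, "A", "D")

# Knob level one: raise the A-B metric to push traffic through C instead.
TWEAKED = {node: dict(nbrs) for node, nbrs in GRAPH.items()}
TWEAKED["A"]["B"] = 10
TWEAKED["B"]["A"] = 10
tweaked = shortest_path(TWEAKED, "A", "D")

# Knob level two: policy based routing, a per-flow next hop that bypasses
# the shortest-path result at one node.
POLICY = {("video", "A"): "C"}


def next_hop(graph, node, dst, flow, policy):
    """Return the policy override if one exists, else the shortest-path next hop."""
    if (flow, node) in policy:
        return policy[(flow, node)]
    _, path = shortest_path(graph, node, dst)
    return path[1]
```

On this topology the untouched computation sends A to D through B, the metric tweak moves everything through C, and the policy entry moves only the "video" flow through C. Notice that nothing in the code records why either knob exists, which is rather the point.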

The reality is every nerd knob in routing represents a policy driven by a business requirement expressed as a tweak to the underlying fundamental routing algorithm. As Ivan rightly points out, going to SDNs isn’t going to solve this problem. If anything, it’s going to make it worse. Now, rather than seeing the nerd knob for what it is, a pain in the butt that needs to be explained and dealt with at 2AM when you’re half asleep and the TAC engineer is halfway around the world, it’s going to be “just another line of code.”

This might sound brilliant to someone who hasn’t managed, or dealt with, multi-million line projects and the vagaries of codebase management. Ask someone who has, though, before you get into this. It’s just a different set of problems, not a better set of problems.

The root cause here, though, isn’t nerd knobs. And it’s not business requirements. And it’s not really laziness (most of the time). It’s not even machismo most of the time (though I will admit the natural arrogance of the geek is probably worth studying by some anthropologist somewhere). There are two root causes, really.

First, we, the networking industry, haven’t really thought through what a control plane actually does. Oh, we have the seven layer model with the control plane thrown off to the side, or the claim that there shouldn’t even be a control plane. But this is part of why I think the seven layer model needs to die — because it’s a host focused view of the networking world. End-to-end and dumb as rocks routers are nice to contemplate, but I think we need to admit that even the dumb rocks are a bit more complex than we first thought.

Second, I don’t think we’ve really incorporated complexity into our souls. As someone once told me, “the CAP theorem is just an observer problem!” Or rather, we somehow believe that by making virtual things we can skip all that ugly physical reality stuff. Faster, cheaper, and better are all three available “on tap,” if we can just figure out how to see the problem right. This is nonsense on stilts.

We need to get in here and do some serious thinking about complexity, and how to manage it in network design. We need to do things like think about interaction surfaces, and how to prevent them from becoming so deep and broad as to be unmanageable. As the article on SQL says, from above —

In a world of regulation and increasing interdependencies between organizations, expressing intent independently of implementation means that you can avoid a class of unintended consequences of systems building.

Where have I heard this before? Oh, maybe it’s in that new book on network complexity someplace.

Seriously — I know this is a long rant, so I’ll quit now, but — seriously (!) we need to grow up and start treating the control plane as an engineering problem. Then, and only then, will we get rid of nerd knobs, no matter whether they’re some hidden CLI command, or some “if/then/else” or “goto” statement hidden someplace in the controller code.

P.S. BTW, Greg, I disagree with you about routing protocols. They’ll “go away” for a short while, until we start trying to deal with networks that don’t run on standards based routing protocols. And then we’ll beg for them to come back. We’ll form something like the IETF, and solve all the same problems all over again, convinced that we can do better than that last group of engineers did. Been there. Done that. Got the t-shirt (someplace).

Engineering Lessons, IPv6 Edition

Yes, we really are going to reach a point where the RIRs will run out of IPv4 addresses. As this chart from Geoff’s blog shows —

[Chart: IPv4 address exhaustion]

Why am I thinking about this? Because I ran across a really good article by Geoff Huston over at potaroo about the state of the IPv4 address pool at APNIC. The article is a must read, so stop right here, right click on this link, open it in a new tab, read it, and then come back. I promise this blog isn’t going anyplace while you’re over on Geoff’s site. But my point isn’t to ring the alarm bells on the IPv4 situation. Rather, I’m more interested in how we got here in the first place. Specifically, why has it taken so long for the networking industry to adopt IPv6?

Inertia is a tempting answer, but I’m not certain I buy this as the sole reason for lack of deployment. IPv6 was developed some fifteen years ago; since then we’ve deployed tons of new protocols, tons of new networking gear, and lots of other things. Remember what a cell phone looked like fifteen years ago? In fact, if we had started fifteen years ago with simple dual mode devices, we could easily be fully deployed in IPv6 today. As it is, we’re really just starting now.

We didn’t see a need? Perhaps, but that’s difficult to maintain, as well. When IPv6 was originally developed (remember — fifteen years ago), we all knew there was an addressing problem. I suspect there’s another reason.

I suspect that IPv6, in its original form, tried to boil the ocean, and the result might have been too much change too fast for the networking community to handle in such a fundamental area of the stack. What engineering lessons might we draw from the long time scales around IPv6 deployment?

For those who weren’t in the industry those many years ago, there were several drivers behind IPv6 beyond just the need for more address space. For instance, the entire world exploded with “no more NATs.” In fact, many engineers, to this day, still dislike NATs, and see IPv6 as a “solution” to the NAT “problem.” Mailing lists roiled with long discussions about NAT, security by obscurity (still waiting for someone who strongly believes that obscurity is useless to step onto a modern battlefield with a state of the art armor system painted bright orange), and a thousand other topics. You see, ARP really isn’t all that efficient, so let’s do something a little different and create an entirely new neighbor discovery system. And then there’s that whole fragmentation issue we’ve been dealing with for IPv4 for all these years. And…

Part of the reason it’s taken so long to deploy IPv6, I think, is because it’s not just about expanding the address space. IPv6, for various reasons, has tried to address every potential failing ever found in IPv4.

Don’t miss my point here. The design and engineering decisions made for IPv6 are generally solid. But all of us, and I include myself here, tend to focus too much on building that practically perfect protocol, rather than building something that is “good enough,” along with stretchy spots where obvious changes can be made in the future.

In this specific case, we might have passed over one specific question too easily — how easy will this be to deploy in the real world? I’m not saying there weren’t discussions around this very topic, but the general answer was, “we have fifteen years to deploy this stuff.” And, yet… Here we are fifteen years later, and we’re still trying to convince people to deploy it. Maybe a bit of honest reflection might be useful just about now.

I’m not saying we shouldn’t deploy IPv6. Rather, I’m saying we should try and take a lesson from this — a lesson in engineering process. We needed, and need, IPv6. We probably didn’t need the NAT wars. We needed, and need, IPv6. But we probably didn’t need the wars over fragmentation.

What we, as engineers, tend to do is to build solutions that are complete, total, self contained, and practically perfect. What we, as engineers, should do is build platforms that are flexible, usable, and can support a lot of different needs. Being a perfectionist isn’t just something you say during the interview to that one dumb question about your greatest weakness. Sometimes you (we, really) do need to learn to stop what we’re doing, take a look around, and ask: why are we doing this?