Autonomic, Automated, and Reality
Once the shipping department drops the box off with that new switch, router, or “firewall,” what happens next? You rack it, cable it up, turn it on, and start configuring, right? There are access to controls to configure—SSH, keys, disabling standard accounts, disabling telnet—interface addresses to configure, routing adjacencies to configure, local policies to configure, and… After configuring all of this, you can adjust routing in the network to route around the new device, and then either canary the device “in production” (if you run your network the way it should be run), or find some prearranged maintenance time to bring the new device online and test things out. After all of this, you can leave the new device up and running in the network, and move on to the next task.
Until it breaks.
Then you consult the documentation to remind yourself why it was configured this way, consult the documentation to understand how the application everyone is complaining about not working should work, etc. There are the many hours spent sitting on the console gathering information by running various commands and the output of various logs. Eventually, once you find the problem, you can either replace the right parts, or reconfigure the right bits, and get everything running again.
In the “modern” world (such as it is), we think it’s a huge leap forward to stop configuring devices manually. If we can just automate the configuration of all that “stuff” we have to do at the beginning, after the box is opened and before the device is placed into service, we think we have this whole networking thing pretty well figured out.
Even if you had everything in your network automated, you still haven’t figured this networking thing out.
We need to move beyond automation. Where do we need to move to? It’s not one place, but two. The first is we need to move beyond automation to autonomous operation. As an example, there is a shiny new system that is currently being widely deployed to automate the deployment and management of containers. Part of this system is the automation of connectivity, including routing, between containers. The routing system being deployed as part of this system is essentially statically configured policy-based routing combined with network address translation.
Let me point something out that is not going to be very popular: this is a step backwards in terms of making the system autonomous. Automating static routing information is not a better solution than building a real, dynamic, proactive, autonomic, routing system. It’s not simpler—trust me, I say this as someone who has operated large networks which used automated static routes to do everything.
The “opsification of everything” is neat, but it shouldn’t be our end goal.
Now part of this, I know, is the fault of vendors. Vendors who push EGPs onto data center fabrics because, after all, “the configuration complexity doesn’t matter so long as you can automate it.” The configuration complexity does matter, because configuration complexity belies an underlying protocol complexity, and sets up long and difficult troubleshooting sessions that are completely unnecessary.
The second place we need to move in the networking world? The focus on automation is just another form of focusing on configuration. We abstract the configuration, and we touch a lot more devices at once, but we are still thinking about configuration. The more we think about configuration, the less we think about how the system should work, how it really works, what the gaps are, and how to bridge those gaps. So long as we are focused on the configuration, automated or not, we are not focused on how the network can bring value to the business. The longer we are focused on configuration, the less value we are bringing to the business, and the more likely we are to end up being replaced by … an automated system … no matter how poorly that automated system actually works.
And no, the cloud isn’t going to solve this. Containers aren’t going to solve this. The “automated configuration pattern” is already being repeated in the cloud. As more complex workloads are moved into the cloud, the problems there are only going to get harder. What starts out as a “simple” system using policy-based routing analogs and network address translation configured through an automation server will eventually look complex against the hardest problems we had to solve using T1’s, frame relay circuits, inverse multiplexers, wire down patch panels, and mechanical switch crossbar frames. It’s fun to pretend we don’t need dynamic routing to solve the problems that face the network—at least until you hit hard problems, and have to relearn the lessons of the last 20+ years.
Yes, I know vendors are partly to blame for this. I know that, for a vendor, it’s easier to get people to buy into your CLI, or your entire ecosystem, rather than getting them to think about how to solve the problems your business is handing them.
On the other hand, none of this is going to change from the top down. This is only going to change when the average network engineer starts asking vendors for truly simpler solutions that don’t require reams configuration information. It will change when network engineers get their heads out of the configuration and features, and into the business problems.
Gr8 Article Russ. Would like to add few things:
1. In most decent Org. we still have Planning and Operate as two separate functions even with in IT context.
And by design in most cases Planning don’t get into OPS issues to find how effectively they did their job during planning and design phase. So that feedback loop is missing to begin with.
2. As you mentioned for most people Config automation is where every body start but also ends up with. IMHO part of the problem is most Network Engineers don’t necessarily understand the Software Development Life Cycle and Practices. So just because someone is able to do basic automation or even advance doesn’t mean they get it right. Since :
A. Business doesn’t pay for projects based on personal motivations but based on Business outcomes
B. Without knowing the complete cycle or understanding it may lead to several long term problems while you manage to solve your own problems. For example the individual may end up exposing more surface to security threats unintentionally. Not sure how modern world CISO write policies around compliances and audits to address NetDevOPS type initiatives. (Specially the project lead by individuals)
C. Somehow from Enterprise world perspective, most SDx or IBx solutions are taking Black Box approach. Which is pretty interesting and alarming in my view at the same time.
D. Most vendors are coming up with products that hardly makes any sense. In my view mostly the learnings they had from working with WebScales and trying to push those to enterprises. While some might be good solutions, most of those webscales problems doesn’t apply to an average Enterprise.
So in short it’s not only about doing right things but rather doing things right.
BTW…what’s your take on how many average network engineers can talk the business language and what you recommend to fill this gap ?
HTH…
Evil CCIE
Thanks! I might respond to a few of these in a future post. 🙂