Hedge 183: Mike Bushong on Operational Excellence
What’s next for network engineering? While we normally think of answers to this question in terms of technology, Mike Bushong joins this episode of the Hedge to argue the future is in operations—and operational excellence. Join Mike, Tom, and Russ as we discuss how the importance of operating a network is impacting the design of hardware, software, and networks.
Really interesting episode, thanks gents!
I think the perspective of being concerned first and foremost with delivering meaningful service to other IT groups, and ultimately to our users, is key to developing networking for the next generation. This inevitably means automation – programmatically creating consumption models for network service which feed business requirements regardless of size (you touched on this briefly and yes, complexity is the key).
The way to deliver this isn’t going to be device by device, but through an understanding of distributed system behavior. Each device contributes to the network state as a whole, and while its individual config and state indicate a desire to propagate a behavior, it is the collective interactions of all the devices that need to be understood in order to show that the behavior is actually delivered. This means three things to me:
1) We shouldn’t care about CLI for individual devices but we absolutely must care about how devices interact with others to produce the collective behavior. So no more rote learning of commands for routers and switches, but instead ensure that the next generation of network engineers understands what these things do. Most automation nowadays is just replacing CLI with a GUI or scripted interface to standardise output, but still ends up writing config and policy to those individual devices;
2) Russ hit on the point of verification. We don’t do enough of that but it’s not easy. We need a verification system to be able to understand not just the config but the behavior of devices individually and collectively in order to appreciate whether we’re getting the service we expect. Simply testing that a config line has been pushed during an automation task is not enough, as the collective behavior right across the network can change with one command on one CLI;
3) Our current “monitoring” of devices and telemetry is not enough. We need to use the verification from above to improve how we measure behavior of networks. When I was designing networks, my key outcomes were to make the network supportable while maximising availability of the services that ran over it. While I cared that monitoring told me that a device was down or an interface was overloaded, if my network had enough resilience or bandwidth to cope, then the service was still available, even if compromised.
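To make points 2 and 3 a little more concrete, here is a minimal sketch in Python of checking a service outcome instead of a config line. Everything in it is an assumption made for illustration: the topology, the device names (`leaf1`, `spine1` and so on), and the helper names `reachable` and `single_failure_report` are hypothetical, and the network is modeled as nothing more than a list of links. A real verification system would also need to model routing policy, forwarding state, and capacity.

```python
from collections import deque


def reachable(links, src, dst, down=frozenset()):
    """Return True if dst can be reached from src, ignoring devices in `down`."""
    if src in down or dst in down:
        return False
    # Build an adjacency map from the links that are still usable.
    adjacency = {}
    for a, b in links:
        if a in down or b in down:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search across the surviving topology.
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for peer in adjacency.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return False


def single_failure_report(links, src, dst):
    """For each intermediate device, report whether the service survives its loss."""
    devices = {d for link in links for d in link} - {src, dst}
    return {dev: reachable(links, src, dst, down={dev}) for dev in sorted(devices)}


# Hypothetical two-leaf, two-spine topology (an assumption, not from the episode).
LINKS = [
    ("host-a", "leaf1"),
    ("leaf1", "spine1"),
    ("leaf1", "spine2"),
    ("spine1", "leaf2"),
    ("spine2", "leaf2"),
    ("leaf2", "host-b"),
]

report = single_failure_report(LINKS, "host-a", "host-b")
for device, survives in report.items():
    print(f"{device} down -> service {'available' if survives else 'LOST'}")
```

In this toy topology, losing either spine leaves the service available, while losing either leaf does not. A "device down" alarm therefore means something quite different depending on where the device sits in the collective behavior, which is the distinction that per-device monitoring alone tends to miss.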
Great that we’re having these conversations – would love to hear more!
Regards, Daren.