SDN, AI, and DevOps

According to the recent SONAR report, 52% of respondents reported they are using Software Defined Networking (SDN) tools to automate their networks, while 57% reported they are using network management tools. The report notes “52% may be slightly exaggerated, depending on how one defines SDN…” This leads naturally to the question: what is the difference between SDN and DevOps, and how does AI figure into either or both? SDN, DevOps, and AI describe separate but overlapping movements in the design, deployment, and management of networks. While they are easy to confuse, they have three different origins and meanings.

Software Defined Networking grew out of research efforts to build and deploy experimental control planes, either distributed or centralized. SDN, however, quickly became associated with replacing some or all of the functions of a distributed control plane with a centralized controller, particularly in order to centralize policy related to the control plane, such as traffic engineering. SDN solutions always work through a programmatic interface designed primarily to supply forwarding information to network devices.

Development Operations, or DevOps, is a movement away from human-centered interfaces towards machine-centered interfaces for the deployment, operation, and troubleshooting of networks. DevOps is centered on the deployment, configuration, and management of the entire device, rather than providing the information required to forward traffic. DevOps can use either a programmatic interface, such as one modeled in YANG, or “screen scraping,” to configure and manage network devices.
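The difference between the two approaches can be sketched in a few lines of Python. The CLI output and the JSON document below are invented for illustration (they are not the output of any real device or YANG model), as are the function names; the point is only that screen scraping depends on the layout of text meant for humans, while a structured interface returns named fields.

```python
import json
import re

# Hypothetical CLI output, invented for this sketch.
CLI_OUTPUT = """\
Interface  Status  MTU
eth0       up      1500
eth1       down    9000
"""

# Hypothetical structured reply carrying the same information.
STRUCTURED_REPLY = """
{"interfaces": [
    {"name": "eth0", "status": "up", "mtu": 1500},
    {"name": "eth1", "status": "down", "mtu": 9000}
]}
"""


def scrape_interfaces(text):
    """Screen scraping: recover fields from column-formatted text."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        match = re.match(r"(\S+)\s+(\S+)\s+(\d+)", line)
        if match:
            name, status, mtu = match.groups()
            rows.append({"name": name, "status": status, "mtu": int(mtu)})
    return rows


def parse_interfaces(reply):
    """Programmatic interface: the fields arrive already named and typed."""
    return json.loads(reply)["interfaces"]


print(scrape_interfaces(CLI_OUTPUT) == parse_interfaces(STRUCTURED_REPLY))
```

Both functions produce the same records here, but the scraper breaks as soon as a column is added or reordered, while the structured parser does not care about presentation.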

Finally, Artificial Intelligence, or AI, in the context of computer networks, is focused on the use of data gathered from the network to improve operations, from decreasing the time required to troubleshoot a problem to making the network adapt more quickly to shifting application and business requirements. AI, applied to networks, is narrow in scope, so it is Artificial Narrow Intelligence, or ANI. Real implementations of AI in the networking field are often applications of Machine Learning, or ML; while these two terms are often used interchangeably, they are not quite the same thing.

The following illustration will be useful in understanding the relationship between these three concepts.

 

In the figure, the SDN and DevOps controllers interact with two different aspects of the network devices forwarding traffic; both SDN and DevOps can be deployed in the same network to solve different problems. For instance, DevOps might be used to configure network devices to reach the SDN controller so they can receive the information they need to forward packets. Or the DevOps system might be used to configure a distributed control plane, such as IS-IS, on all the network devices, and also to configure a centralized controller which can override the local decisions of the distributed routing protocol for traffic engineering.
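The second arrangement, where a controller overrides the decisions of a distributed protocol, can be sketched as a simple preference rule. This is a minimal illustration, not how any particular router implements route selection; the prefixes, next-hop names, and the `select_next_hop` function are all hypothetical.

```python
def select_next_hop(prefix, igp_routes, controller_overrides):
    """Return (next_hop, source) for a prefix.

    A controller-supplied entry wins when one exists; otherwise the
    device falls back to what the distributed protocol computed.
    """
    if prefix in controller_overrides:
        return controller_overrides[prefix], "controller"
    if prefix in igp_routes:
        return igp_routes[prefix], "igp"
    return None, "unreachable"


# Routes computed locally by the distributed protocol (for example, IS-IS).
igp_routes = {"10.1.0.0/16": "spine1", "10.2.0.0/16": "spine1"}

# The controller steers one prefix onto a different path for traffic engineering.
controller_overrides = {"10.2.0.0/16": "spine2"}

for prefix in igp_routes:
    print(prefix, select_next_hop(prefix, igp_routes, controller_overrides))
```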

There are some situations where the difference between SDN and DevOps solutions is not obvious. The most common example is that DevOps could be used to configure routing information on each network device, performing the same function as an SDN controller. In this case, what is the difference?

First, an SDN solution is intended specifically to replace the distributed control plane, rather than to configure the entire device. Second, configuration pushed to a device through DevOps is normally persistent; if a device reboots, the configuration pushed through DevOps will be loaded and enabled, impacting the operation of the device. In contrast, any information pushed to a device through an SDN controller would normally be ephemeral; when the device is rebooted, information pushed by the SDN controller will be lost.
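The persistent versus ephemeral distinction can be made concrete with a toy model. The `Device` class below is purely hypothetical, a sketch of the behavior described above rather than any vendor's implementation; the attribute and method names are assumptions chosen for readability.

```python
class Device:
    """Toy model of a network device (hypothetical, for illustration only)."""

    def __init__(self):
        self.startup_config = {}      # saved configuration, survives a reboot
        self.running_config = {}      # rebuilt from startup_config at boot
        self.forwarding_entries = {}  # programmed state, lost on reboot

    def push_config(self, key, value):
        """DevOps-style change: applied now and saved for the next boot."""
        self.running_config[key] = value
        self.startup_config[key] = value

    def program_route(self, prefix, next_hop):
        """SDN-style change: installed in the forwarding state only."""
        self.forwarding_entries[prefix] = next_hop

    def reboot(self):
        """Reload the saved configuration and discard programmed state."""
        self.running_config = dict(self.startup_config)
        self.forwarding_entries = {}


device = Device()
device.push_config("controller_address", "192.0.2.10")  # persistent
device.program_route("10.2.0.0/16", "spine2")           # ephemeral
device.reboot()

print(device.running_config)      # the pushed configuration is still there
print(device.forwarding_entries)  # the controller must program routes again
```

After the reboot the device still knows how to reach its controller, but it has to wait for the controller to supply forwarding information again.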

Finally, AI and self-healing are shown on the right side of this diagram as a way to turn telemetry into actionable input for either the DevOps or the SDN system. The ability of ML models to find and recognize patterns in streams of data means they are well suited to finding new patterns of network behavior and alerting an operator, or to matching current conditions to the past, anticipating future failures or finding an otherwise unnoticed problem.
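As a rough sketch of the "find a new pattern and alert an operator" idea, the following flags telemetry samples that deviate sharply from a recent baseline. It uses a rolling z-score, which is a simple statistical stand-in rather than a trained ML model; the window size, threshold, sample values, and the `find_anomalies` name are all arbitrary assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return the indexes of samples far outside the recent baseline.

    Each sample is compared against the mean and standard deviation of
    the `window` samples that came before it.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical interface utilization samples, in percent.
utilization = [40, 42, 41, 39, 40, 41, 95, 42, 40]

print(find_anomalies(utilization))
```

A real system would feed the flagged samples to an operator, or to the DevOps or SDN system, rather than printing them.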

While SDN, DevOps, and AI overlap, then, they serve different purposes in the realm of network engineering and operations. There are many areas of overlap, but they are also different enough to argue the three terms should be cleanly separated, with each adding a different kind of value to the overall system.

The Hedge 8: Open Source and the Future of Routing Software

Almost every company relies on open source software in some way, which leads to the natural question: how will the heart of the network, routing and switching, be impacted by open source software? In this episode of the Hedge, Sue Hares, Donald Sharp, and Russ White discuss the current and future world of open source routing software. Donald is one of the main drivers of the FRRouting open source routing stack; Russ White is a maintainer on the project and is still deeply involved in commercial routing software; and Sue Hares was deeply involved in the origins of the GateD open source routing stack.


You can find previous episodes of the Hedge here, and you can subscribe to the Hedge on iTunes and other feed services.

Service Provider Tech Doesn’t Apply?

Service provider problems are not your problems. You should not be trying to solve your problems the same way service providers do.

This seems intuitively true—after all, just about everything about a train or a large over-the-road truck (or lorry) is different from a passenger car. If the train is the service provider network and the car is the “enterprise” network, it seems to be obvious the two have very little in common.

Or is it?

What this gets right is that if an operator sells access to their network, or a single application, their network is likely to be built differently than the more general-purpose designs used in organizations that must support a wide range of applications and purposes. These differences are likely to show up in the choice of hardware, how the network is operated, and the kinds of services offered (or not).

What this also gets right is that operators who sell access to their networks, or support a single application, always seem to build at a scale far beyond what more general-purpose networks ever reach. Microsoft and Facebook number their servers in the millions, and single purchase orders include thousands of routers. eBay and LinkedIn number their servers in the hundreds of thousands, and their routers and switches in the tens of thousands. How can a small enterprise network of a few hundred servers be anything like these larger networks?

What this gets wrong is assuming none of the technologies, tools, or attitudes from these larger-scale networks is ever applicable to the smaller networks many engineers encounter on a day-to-day basis.

All those networks with BGP deployed in their data center fabrics are using technology designed primarily for interconnecting intermediate systems on the default-free zone—in other words, for connecting the networks of transit service providers. All those networks with OSPF deployed are using a link state protocol originally designed to provide edge-to-edge reachability in transit service provider networks. All those networks with IS-IS deployed are using a link state protocol originally designed to provide connectivity to large-scale telephony-style networks.

What about transport technologies? The only transport technologies originally designed specifically for “enterprise use” have long since been replaced by optical technologies designed for large-scale provider or “hyperscale” use. Token Ring and ARCnet are long gone, as is the original shared medium Ethernet, replaced by switched Ethernet largely over optical transport. Even current general WiFi is primarily designed for public operator use cases—look at 5G and WiFi 6 and note how public operator requirements have influenced these technologies.

The truth is there is no “pure” enterprise technology; following the dictum that you should not use “service-provider technologies” in your network would leave you with … no network at all.

There is a second realm where this line of argument falls flat, and it’s more important than the question of which technologies to use: the techniques and attitudes learned in the operation of truly large-scale networks hold valuable lessons for all network engineers. Should you use a spine and leaf topology in your data center, rather than a more traditional hierarchical design? The answer has nothing to do with scale, and everything to do with flexibility in design and operational agility. Should you automate your network, even if it’s only ten routers? The answer has nothing to do with what Amazon is doing, and everything to do with how much time you want to spend on configuring and troubleshooting versus responding to real business needs.

Think of it this way: the driver who drives the large over-the-road truck is still going to learn lessons and instincts about driving that will make them a better driver in a minivan.

Come join me at NXTWORK in November to continue the conversation in my master class on building and operating data center fabrics, as I explore how you can apply lessons from the hyperscale world to your network.

The Hedge 7: Leslie Daigle and Internet Invariants

Some things always change, and some things never change. In this episode of the Hedge, Leslie Daigle joins Phill Simonds and Russ White to discuss her research into the things that do not change—and whether or not those things really have changed over the years since her original report for the Internet Society on Internet invariants.


Is it planning… or just plain engineering?

Over at the ECI blog, Jonathan Homa has a nice article about the importance of network planning–

In the classic movie, The Graduate (1967), the protagonist is advised on career choices, “In one word – plastics.” If you were asked by a young person today, graduating with an engineering or similar degree about a career choice in telecommunications, would you think of responding, “network planning”? Well, probably not.

Jonathan describes why this is so–traffic is constantly increasing, and the choice of tools we have to support the traffic loads of today and tomorrow can be classified in two ways: slim and none (as I remember a weather forecaster saying when I “wore a younger man’s shoes”). The problem, however, is not just tools. The network is increasingly seen as a commodity, “pure bandwidth that should be replaceable like memory,” made up of entirely interchangeable parts and pieces, primarily driven by the cost to move a bit across a given distance.

This situation is driving several different reactions in the network engineering world, none of which are really healthy. There is a sense of resignation among people who work on networks. If commodities are driven by price, then the entire life of a network operator or engineer is driven by speed, and speed alone. All that matters is how you can build ever larger networks with ever fewer people–so long as you get the bandwidth you need, nothing else matters.

This is compounded by a simple reality–the networking world has driven itself into the corner of focusing on the appliance–the entire network is appliances running customized software, with little thought about the entire system. Regardless of whether this is because of the way we educate engineers through our college programs and our certifications, this is the reality on the ground level of network engineering. When your skill set is primarily built around configuring and managing appliances, and the world is increasingly making those appliances into commodities, you find yourself in a rather depressing place.

Further, there is a belief that there is no more real innovation to be had–the end of the road is nigh, and things are going to look pretty much like they look right now for the rest of … well, forever.

I want you, as a network engineer, operator, or whatever you call yourself, to look these beliefs in the eye and call them what they are: nonsense on stilts.

The real situation is this: the current “networking industry,” such as it is, has backed itself into a corner. The emphasis on planning Jonathan brings out is valid, but it is just the tip of the proverbial iceberg. There is a hint in this direction in Jonathan’s article in the list of suggestions (or requirements). Thinking across layers, thinking about failure, continuous optimization… these are all… system-level thinking. To put this another way, a railway boxcar might be a commodity, but the railroad system is not. The individual over-the-road truck might be a commodity, and the individual road might not be all that remarkable, but the road system is definitely not a commodity.

The sooner we start thinking outside the appliance as network engineers or operators (or whatever you call yourself), the sooner we will start adding value to the business. This means thinking about algorithms, protocols, and systems–all that “theory stuff” we typically decry as being less than useful–rather than how to configure x on device y. This means thinking about security across the network, rather than as how you configure a firewall. This means thinking about the tradeoffs with implementing security, including what systemic risk looks like, and when the risks are acceptable when trying to accomplish a specific goal, rather than thinking about how to route traffic through a firewall.

If demand is growing, why is the networking world such a depressing place right now? Why do I see lots of people saying things like “there will be no network engineers in enterprises in five years?” Rather than blaming the world, maybe we should start looking at how we are trying to solve the problems in front of us.

Autonomic, Automated, and Reality

Once the shipping department drops the box off with that new switch, router, or “firewall,” what happens next? You rack it, cable it up, turn it on, and start configuring, right? There are access controls to configure (SSH, keys, disabling standard accounts, disabling telnet), interface addresses to configure, routing adjacencies to configure, local policies to configure, and… After configuring all of this, you can adjust routing in the network to route around the new device, and then either canary the device “in production” (if you run your network the way it should be run), or find some prearranged maintenance time to bring the new device online and test things out. After all of this, you can leave the new device up and running in the network, and move on to the next task.

Until it breaks.

Then you consult the documentation to remind yourself why it was configured this way, consult the documentation to understand how the application everyone is complaining about not working should work, etc. There are the many hours spent sitting on the console gathering information by running various commands and reading the output of various logs. Eventually, once you find the problem, you can either replace the right parts, or reconfigure the right bits, and get everything running again.

In the “modern” world (such as it is), we think it’s a huge leap forward to stop configuring devices manually. If we can just automate the configuration of all that “stuff” we have to do at the beginning, after the box is opened and before the device is placed into service, we think we have this whole networking thing pretty well figured out.

Even if you had everything in your network automated, you still haven’t figured this networking thing out.

We need to move beyond automation. Where do we need to move to? It’s not one place, but two. The first is we need to move beyond automation to autonomous operation. As an example, there is a shiny new system that is currently being widely deployed to automate the deployment and management of containers. Part of this system is the automation of connectivity, including routing, between containers. The routing system being deployed as part of this system is essentially statically configured policy-based routing combined with network address translation.

Let me point something out that is not going to be very popular: this is a step backwards in terms of making the system autonomous. Automating static routing information is not a better solution than building a real, dynamic, proactive, autonomic, routing system. It’s not simpler—trust me, I say this as someone who has operated large networks which used automated static routes to do everything.
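The difference shows up as soon as a link fails. The sketch below compares a static next-hop table, computed once and pushed by automation, with next hops recomputed from the surviving topology. It is a minimal Dijkstra shortest-path calculation on a made-up four-node topology, not a routing protocol; the node names, link costs, and the `shortest_path_next_hops` function are hypothetical.

```python
import heapq


def shortest_path_next_hops(links, source):
    """Compute the first hop from `source` toward every reachable node.

    `links` maps (node_a, node_b) -> cost for each working, bidirectional link.
    """
    neighbors = {}
    for (a, b), cost in links.items():
        neighbors.setdefault(a, []).append((b, cost))
        neighbors.setdefault(b, []).append((a, cost))

    best = {source: 0}
    next_hop = {}
    queue = [(0, source, None)]  # (distance, node, first hop used)
    while queue:
        distance, node, first = heapq.heappop(queue)
        if distance > best.get(node, float("inf")):
            continue  # stale entry, a shorter path was already found
        if first is not None:
            next_hop[node] = first
        for peer, cost in neighbors.get(node, []):
            candidate = distance + cost
            if candidate < best.get(peer, float("inf")):
                best[peer] = candidate
                heapq.heappush(queue, (candidate, peer, first or peer))
    return next_hop


# Hypothetical topology: A reaches D through either B or C.
links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 1, ("C", "D"): 2}

# "Automated static routing": computed once, then pushed as configuration.
static_routes = shortest_path_next_hops(links, "A")

# The A-B link fails; a dynamic system recomputes from what is left.
surviving = {link: cost for link, cost in links.items() if link != ("A", "B")}
dynamic_routes = shortest_path_next_hops(surviving, "A")

print("static  next hop to D:", static_routes["D"])   # still points at B
print("dynamic next hop to D:", dynamic_routes["D"])  # moved to C
```

The static table keeps sending traffic toward a neighbor that can no longer be reached until some outside system notices the failure and pushes new configuration; the recomputed table adapts on its own.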

The “opsification of everything” is neat, but it shouldn’t be our end goal.

Now part of this, I know, is the fault of vendors. Vendors who push EGPs onto data center fabrics because, after all, “the configuration complexity doesn’t matter so long as you can automate it.” The configuration complexity does matter, because configuration complexity reflects an underlying protocol complexity, and sets up long and difficult troubleshooting sessions that are completely unnecessary.

The second place we need to move in the networking world? The focus on automation is just another form of focusing on configuration. We abstract the configuration, and we touch a lot more devices at once, but we are still thinking about configuration. The more we think about configuration, the less we think about how the system should work, how it really works, what the gaps are, and how to bridge those gaps. So long as we are focused on the configuration, automated or not, we are not focused on how the network can bring value to the business. The longer we are focused on configuration, the less value we are bringing to the business, and the more likely we are to end up being replaced by … an automated system … no matter how poorly that automated system actually works.

And no, the cloud isn’t going to solve this. Containers aren’t going to solve this. The “automated configuration pattern” is already being repeated in the cloud. As more complex workloads are moved into the cloud, the problems there are only going to get harder. What starts out as a “simple” system using policy-based routing analogs and network address translation configured through an automation server will eventually look complex against the hardest problems we had to solve using T1’s, frame relay circuits, inverse multiplexers, wire down patch panels, and mechanical switch crossbar frames. It’s fun to pretend we don’t need dynamic routing to solve the problems that face the network—at least until you hit hard problems, and have to relearn the lessons of the last 20+ years.

Yes, I know vendors are partly to blame for this. I know that, for a vendor, it’s easier to get people to buy into your CLI, or your entire ecosystem, rather than getting them to think about how to solve the problems their business is handing them.

On the other hand, none of this is going to change from the top down. This is only going to change when the average network engineer starts asking vendors for truly simpler solutions that don’t require reams of configuration information. It will change when network engineers get their heads out of the configuration and features, and into the business problems.