SD-WAN and Multiple Metrics

Ivan has posted a reaction to Ethan, which prompts me to… Okay, let’s start at the beginning. Ethan wrote a nice post on SD-WAN and the “shortest path we always wanted,” covering some of the positive and negative aspects of software defined WAN.

Ivan responded with this post, in which he says several interesting things, prompting some thoughts from yours truly…

Routing in SD-WAN environment is almost trivial…

Depends on what you mean when you say “routing…” If routing here means the discovery of the topology, and computing a best path through a topology, then controller based (centralized) “routing” is almost certainly more complex than distributed routing protocols. If routing here means, “take into consideration a wide swath of policies, including which link is most loaded right now, which link has the shortest queues, and lots of other things, and compute me a best path,” then a controller based centralized system is most likely going to be less complex. Take a gander through my last set of NANOG slides if you want to see where my thinking lies in this area — or read my new book on network complexity if you want a longer explanation.

The question is — which is it you really want, and when? And why can’t these two functions be split up in some logical way?

Could We Do This With Routing Protocols? We could, but that doesn’t mean we should. There have been numerous attempts to include QoS-awareness in traditional routing protocols, from load parameter in IGRP to QoS-based metrics in OSPF, and Cisco’s Multi-Topology Routing.

The bottom-line problem is this: computing a shortest path across more than one metric is NP-complete, so no known algorithm solves it exactly and efficiently as the network grows. I can’t point to the papers off the top of my head, but I’ve read them, and they’re pretty convincing. You essentially have two choices here.

First, you can normalize the various parameters using some sort of blending function. Ever heard of EIGRP K-values? They allow you to change the blend of metrics you’re using to find the shortest path in any given situation. You can emphasize bandwidth, delay, load (utilization), or reliability. There were two flies in the ointment here: people didn’t take the time to learn the ropes, and because EIGRP is a distributed protocol, it’s quite possible to get into a feedback loop between traffic flow and metrics. The second isn’t an unsolvable problem, but because of the first it was never seriously looked into.
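To make the blending idea concrete, here is a minimal Python sketch of the classic (pre-wide-metric) EIGRP composite formula. The function name eigrp_metric and the two example paths are mine, made up for illustration, and the integer rounding only approximates what a real implementation does.

```python
def eigrp_metric(min_bw_kbps, total_delay_usec, load=1, reliability=255,
                 k=(1, 0, 1, 0, 0)):
    """Classic EIGRP composite metric for one path (illustrative sketch).

    min_bw_kbps      -- lowest link bandwidth along the path, in kbit/s
    total_delay_usec -- sum of interface delays along the path, in microseconds
    load             -- 1..255, where 255 means fully loaded
    reliability      -- 1..255, where 255 means fully reliable
    k                -- (K1, K2, K3, K4, K5); the default is K1 = K3 = 1
    """
    k1, k2, k3, k4, k5 = k
    bw = 10**7 // min_bw_kbps        # scaled inverse of the bottleneck bandwidth
    delay = total_delay_usec // 10   # delay is counted in tens of microseconds
    metric = k1 * bw + (k2 * bw) // (256 - load) + k3 * delay
    if k5 != 0:
        # The reliability term is only applied when K5 is non-zero.
        metric = metric * k5 // (k4 + reliability)
    return 256 * metric


# Two hypothetical paths to the same destination.
FAT_SLOW = dict(min_bw_kbps=100_000, total_delay_usec=5_000)
THIN_FAST = dict(min_bw_kbps=10_000, total_delay_usec=100)

DEFAULT_K = (1, 0, 1, 0, 0)      # bandwidth plus delay
DELAY_ONLY_K = (0, 0, 1, 0, 0)   # ignore bandwidth entirely


def preferred(k):
    """Return which of the two hypothetical paths has the lower metric under k."""
    fat = eigrp_metric(**FAT_SLOW, k=k)
    thin = eigrp_metric(**THIN_FAST, k=k)
    return "fat_slow" if fat < thin else "thin_fast"
```

With the default K-values the high-bandwidth path wins; with delay-only K-values the low-delay path wins. Same topology, same measurements, different "shortest" path, which is the whole point of a blending function, and also why it is easy to get wrong.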

Second, you can just build certain metrics into the protocol as “primary” over others. Ever heard of the 14-step BGP bestpath algorithm? Even there, it should really be called a heuristic because it’s not quite atomic: different orders of events can result in different answers. This is, in fact, the common problem with multistep decision processes in a distributed system.
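The order dependence is easy to reproduce in a toy model. The Python sketch below is not the real BGP decision process; it assumes just two steps (compare MED only when both routes come from the same neighbor AS, otherwise prefer the lower IGP cost) and a router that compares each new route against its current best. The Route class, the three routes, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Route:
    name: str
    neighbor_as: int
    med: int
    igp_cost: int


def better(a, b):
    """Two-step comparison: MED within one neighbor AS, then IGP cost."""
    if a.neighbor_as == b.neighbor_as and a.med != b.med:
        return a if a.med < b.med else b
    return a if a.igp_cost <= b.igp_cost else b


def sequential_best(routes):
    """Compare routes one at a time against the current best, in arrival order."""
    best = routes[0]
    for route in routes[1:]:
        best = better(best, route)
    return best


# A beats B on IGP cost, B beats C on IGP cost, C beats A on MED: a cycle.
ROUTES = (
    Route("A", neighbor_as=100, med=10, igp_cost=1),
    Route("B", neighbor_as=200, med=0, igp_cost=2),
    Route("C", neighbor_as=100, med=5, igp_cost=3),
)


def winners_by_order(routes=ROUTES):
    """Map each arrival order (e.g. 'ABC') to the name of the winning route."""
    return {
        "".join(r.name for r in order): sequential_best(order).name
        for order in permutations(routes)
    }
```

Calling winners_by_order() shows every one of the three routes winning under some arrival order, because the pairwise comparison is not transitive once a metric only applies between some pairs of routes.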

The bottom line, as always — if policy is your primary concern, centralize. If fast reachability is your primary concern, distribute. At least until the networking world learns how to layer control planes the way we’ve already layered data planes.