Simpler is Better… Right?
A few weeks ago, I was in the midst of a conversation about EVPNs, how they work, and the use cases for deploying them, when one of the participants exclaimed: “This is so complicated… why don’t we stick with the older way of doing things with multi-chassis link aggregation and virtual chassis devices?” Sometimes it does seem like we create complex solutions when a simpler solution is already available. Since simpler is always better, why not just use the simpler solutions? After all, simpler solutions are easier to understand, which means they are easier to deploy and troubleshoot.
The problem is we too often forget the other side of the simplicity equation—complexity is required to solve hard problems and adapt to demanding environments. While complex systems can be fragile (primarily through ossification), simple solutions can flat out fail just because they can’t cope with changes in their environment.
As an example, consider MLAG. On the surface, MLAG is a useful technology. If you have a server that has two network interfaces but is running an application that only supports a single IP address as a service access point, MLAG is a neat and useful solution. Connect a single (presumably high reliability) server to two different upstream switches through two different network interface cards, which act like one logical Ethernet network. If one of the two network devices, optics, cables, etc., fails, the server still has network connectivity.
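To make the failure behavior concrete, here is a minimal sketch of the idea in Python. It is a toy model, not how any vendor implements MLAG: the class names, the switch names, and the single `up` flag standing in for NIC, optic, cable, and switch health are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    """One physical path from the server to an upstream switch (hypothetical model)."""
    switch: str        # illustrative switch name
    up: bool = True    # False if the NIC, optic, cable, or switch on this path has failed


@dataclass
class LagBundle:
    """The set of links the server treats as one logical Ethernet interface."""
    members: List[Link] = field(default_factory=list)

    def active_links(self) -> List[Link]:
        # Only members that are currently up can carry traffic.
        return [link for link in self.members if link.up]

    def has_connectivity(self) -> bool:
        # The logical interface stays usable while at least one member is up.
        return len(self.active_links()) > 0


if __name__ == "__main__":
    bundle = LagBundle([Link("switch-a"), Link("switch-b")])
    bundle.members[0].up = False          # simulate losing the path through switch-a
    print(bundle.has_connectivity())      # the path through switch-b still carries traffic
```

The point of the sketch is only that the server sees one interface and keeps its single IP address while either path survives. Everything that makes real MLAG hard, such as keeping state synchronized between the two switches, is deliberately left out.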
But MLAG has well-known downsides. There is a little problem with the physical locality of the cables and systems involved. If you have a service split among multiple servers, MLAG is no longer useful. If the upstream switches are widely separated, then you have lots of cabling fun and various problems with jitter and delay to look forward to.
There is also the little problem of MLAG solutions being mostly proprietary. When something fails, your best hope is a clueful technical assistance engineer on the other end of a phone line. If you want to switch vendors for some reason, you have the fun of taking the entire server out of operation for a maintenance window to do the switch. Until then you are locked in to a single vendor, including when something fails.
EVPN, with its ability to attach a single host through multiple virtual Ethernet connections across an IP network, is a lot more complex on the surface. But looks can be deceiving… In the case of EVPN, you see the complexity “upfront,” which means you can (largely) understand what it is doing and how it is doing it. There is no “MLAG black box,” which is useful in cases where you must troubleshoot.
And there will be cases where you will need to troubleshoot.
Further, because EVPN is standards-based and implemented by multiple vendors, you can switch over one virtual connection at a time; you can switch vendors without taking the server down. EVPN is a much more flexible solution overall, opening up possibilities around multihoming across different pods in a butterfly fabric, or allowing the use of default MAC addresses to reduce table sizes.
Virtual chassis systems can often solve some of the same problems as EVPN, but again—you are dealing with a black box. The black box will likely never scale to the same size, and cover the same use cases, as a community-built standard like EVPN.
The bottom line
Sometimes starting with a more complex set of base technologies will result in a simpler overall system. The right system will not avoid complexity, but rather reduce it where possible and contain it where it cannot be avoided. If you find your system is inflexible and difficult to manage, maybe it’s time to go back to the drawing board and start with something a little more complex, or where the complexity is “on the surface” rather than buried in an abstraction. The result might actually be simpler.
I sort of disagree with some of the points you mentioned. In my view, it largely depends on a couple of key considerations.
1. MLAG being proprietary is not necessarily a big problem in my view. While network architects might demand an industry standard, I think it was the vendor ecosystem that pushed back on that ask in the past, both to create stickiness from a business model perspective and to protect the ROI on the R&D investment they made to develop such features.
Also, when MLAG fails, the situation can be handled at different layers, and again it’s up to the network architect how well he or she crafts a given solution. On the other side, it depends on how good your operations staff is and how familiar they are with all the features and functions used in a given network.
I am assuming a perfect enterprise here, one that uses RCA as a method/tool to improve operational and functional metrics and not as a business tool to justify failures.
2. EVPN on the surface does look cool and more complicated, and might be very effective as well, but in itself it doesn’t take away the complexity of the interacting layers fitting together that I mentioned above. So whether the problem is a bad design around EVPN, a bad implementation, a missing skill set, or the purpose of RCA, EVPN doesn’t take away any of those complexities on its own.
Also, it’s been a while since I validated EVPN interoperability across vendors myself. I assume most people would try to avoid those corner cases as much as possible in real life. A few years back it was definitely a mess.
And of course cabling in large data centers is an interesting problem to solve. EOR, MOR, etc. are something that I believe most architects are still not familiar with, either at a deep level or in terms of cost implications. To me, the biggest consideration from a business standpoint is, for the most part, cost.