Have you ever wondered why spine-and-leaf networks are the “standard” for data center networks? While the answer has a lot to do with trial and error, it turns out there is also a mathematical reason the fat-tree spine-and-leaf is is used almost universally. There often is some mathematical reason for the decisions made in engineering, although we rarely explore those reasons. If it seems to work, there is probably a reason.
The fat-tree design is explored in a paper published in 2015 (available here at the ACM, and now added to my “classic papers” page so there is a local copy as well), using a novel technique to not only explore why the spine-and-leaf fat-tree is so flexible, but even what the ideal ratio of network capacity is at each stage. The idea begins with this basic concept: one kind of network topology can be emulated on top of another physical topology. For instance, you can emulate a toroid topology on top of a hierarchical network, or a spine-and-leaf on top of of hypercube, etc. To use terms engineers are familiar with in a slightly different way, let’s call the physical topology the underlay, and the emulated topology the overlay. This is not too far off, as emulated topologies do need to be a virtual construction on top of the underlay, which means… tunnels, of course.
In emulating one kind of topology over another kind of topology, however, you lose some performance. This performance loss factor is quite important, in that it tells you whether or not it is worth building your actual network using the design of the underlay, or using the overlay. If, for instance, you emulate a toroid design on top of a folded Clos, and you discover the performance is half as good as a real, physical toroid, then you might be better off building a toroid rather than a folded Clos.
This might be difficult to grasp at first, so let’s use a specific example. The only topologies most engineers are familiar with are rings, hierarchical networks, partial mesh, full mesh, and some form of spine-and-leaf. Consider the case of a full mesh network versus a ring topology. Suppose you have some application that, through careful testing, you have determined best runs on a full mesh network. Now when you go to build your physical network, you find that you have the choice between building a real full mesh network, which will be very expensive, or a ring network, which will be very cheap. How much performance will you lose in running this application on the ring topology?
The way to discover the answer is to build a virtual full mesh overlay onto the ring topology. Once you do this, it turns out there are ways to determine how much performance loss you will encounter using this combination of underlay and overlay. Now, take this one step farther—what if you decide not to build the overlay, but simply run the application directly on the ring topology? Minus the tunnel encapsulation and de-encapsulation, the resulting loss in performance should be roughly the same,
Given this, you should be able to calculate the performance loss from emulating every kind of topology on top of every other kind of topology. From here, you can determine if there is one topology that can be used as an underlay for every other kind of topology with a minimal amount of performance loss. It turns out there is: the fat-tree spine-and-leaf topology. So an application that might best work on a full mesh network will run on a fat-tree spine-and-leaf topology with minimal performance loss. The same can be said of an application that runs best on a ring topology, or a hierarchical topology, or a partial mesh, or a… Thus, the fat-tree spine-and-leaf can be called a universal topology.
After coming to this point, the authors wonder if it is possible to determine how the bandwidth should be apportioned in a fat-tree spine-and-leaf to produce optimal emulation results across every possible overlay. This would give designers the ability to understand how to achieve the most optimal universal application performance on a single physical topology. Again, there turns out to be an answer to this question: doubling the bandwidth at each stage in the fabric is the best way to create an underlay that will emulate every other kind of overlay with the minimal amount of performance loss. The authors come to this conclusion by comparing a data center network to an on-chip connection network, which has a provable “best” design based on the physical layout of the wires. Translating the physical space into a monetary cost produces the same result as above: doubling the bandwidth at each stage creates the least expensive network with the most capabilities.
The result, then, is this: the fat-tree spine-and-leaf design is the most effective design for most applications, a universal topology. Further, building such a design where there are two neighbors down for every neighbor up, and doubling the bandwidth at each stage, is the most efficient use of resources in building such a network.