Research: P Fat Trees

Link speeds in data center fabrics continue to climb, with 10G, 25G, 40G, and 100G widely available, and 400G promised in just a few short years. What isn’t so obvious is how these higher speeds are being reached. A 100G link, for instance, is really four 25G links bundled as a single link at the physical layer. If the optics are increasing in speed, and the processors are increasing in their ability to switch traffic, why are these higher speed links being built in this way? According to the paper under investigation today, the reason is the speed of the chips that serialize traffic onto and deserialize traffic off the optical medium. The development of the complementary metal–oxide–semiconductor, or CMOS, chips required to build ever faster optical interfaces seems to have stalled out at around 25G, which means faster speeds must be achieved by bundling multiple lower speed links.
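
As a trivial illustration of the bundling arithmetic: with the SerDes stalled at roughly 25G per lane, a faster “link” is really several lanes running in parallel, and the sketch below simply computes how many. The 100G case is the paper’s example; applying the same math to 400G is my extrapolation under that 25G-per-lane premise, not a claim about any particular standard.

```python
# Minimal sketch of the lane-bundling arithmetic described above. The
# 25G-per-lane figure is the paper's premise; the 400G row is an
# extrapolation under that assumption, not a statement about a standard.
import math

LANE_GBPS = 25  # assumed per-lane SerDes limit


def lanes_needed(link_gbps: int) -> int:
    """How many parallel lanes it takes to present one logical link."""
    return math.ceil(link_gbps / LANE_GBPS)


for speed in (100, 400):
    print(f"{speed}G -> {lanes_needed(speed)} x {LANE_GBPS}G lanes")
```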

Mellette, William M., Alex C. Snoeren, and George Porter. “P-FatTree: A Multi-Channel Datacenter Network Topology.” In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 78–84. HotNets ’16. New York, NY, USA: ACM, 2016. https://doi.org/10.1145/3005745.3005746.

The authors then point out that many data center operators have moved towards some form of chassis device in order to reduce the costs of cabling and optics. Chassis devices most often use some form of spine and leaf fabric internally to switch traffic between the input and output ports across a short-run copper fabric, resulting in a switching path within the chassis router that looks something like the following figure.

The internal spine and leaf connecting the switching ASICs is one of the main reasons data center operators move away from chassis devices: the number of hops through the network becomes unpredictable with the addition of these internal fabrics, backpressure and quality of service are essentially unmanageable across this fabric on most devices, and there is little in the way of traffic analysis that can be done on this internal fabric. The authors do not address these problems, however.

Rather, they address the added set of switching ASICs in the spine layer of the internal spine and leaf network. As it turns out, the switching ASICs themselves are a major consumer of power, and generator of heat, in a switch. The authors argue that removing this internal spine layer would greatly reduce the amount of power a fabric requires, as well as the amount of heat it generates.
To do this, they propose unbundling the links attached to each SerDes CMOS chip, exposing them as individual links to the control plane. This would allow the switching path to be shortened to something like the figure below.

Exposing the unbundled links to the external control plane allows each stage of the internal fabric to be treated as another hop in the network, and hence for “normal” ECMP to choose the path through the chassis fabric.
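
To make that concrete, here is a minimal sketch of what exposing the unbundled links buys the control plane: each 25G lane appears as its own equal-cost next hop, and an ordinary flow hash selects among them. The names, the hash choice, and the lane labels below are my own illustrative assumptions, not details from the paper.

```python
# Sketch: once the four lanes are exposed as individual links, "normal"
# ECMP can hash a flow onto one of them. Hash choice and labels are
# illustrative assumptions.
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


def ecmp_pick(flow: Flow, next_hops: Sequence[str]) -> str:
    """Hash the five-tuple and choose one of the equal-cost next hops."""
    key = f"{flow.src_ip}|{flow.dst_ip}|{flow.src_port}|{flow.dst_port}|{flow.proto}"
    digest = hashlib.sha256(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Four unbundled 25G lanes toward the same leaf ASIC, each visible to the
# control plane as a separate equal-cost link.
lanes = ["asic1-lane0", "asic1-lane1", "asic1-lane2", "asic1-lane3"]
flow = Flow("10.0.1.5", "10.0.2.9", 49152, 443, 6)
print(ecmp_pick(flow, lanes))
```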

The authors suggest the four unbundled links attached to a single switching ASIC can each be treated as a member of a different “switching plane,” which, in effect, creates four virtual topologies across the fabric, each carrying one quarter of the total fabric bandwidth. Each virtual topology could run its own control plane, producing four somewhat redundant networks, and the ability to steer traffic onto any given plane at the edge of the network for traffic engineering, policy separation, or any other purpose. The result is a fabric that is more flexible in use, while retaining a fixed hop count through the fabric and reducing the ASIC count in the fabric by around one third.
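
A rough sketch of the switching plane idea, as I read it: the edge assigns each flow to one of four parallel virtual topologies, either by explicit policy (policy separation, traffic engineering) or by hash (load spreading). The class name and policy format below are assumptions for the sake of illustration, not the authors’ interface.

```python
# Sketch of edge plane selection across four virtual topologies. Policy
# format and names are illustrative assumptions, not the paper's design.
from dataclasses import dataclass, field
from typing import Dict, Optional
import zlib

NUM_PLANES = 4  # one plane per unbundled 25G lane


@dataclass
class EdgePlaneSelector:
    # e.g. {"storage-replication": 3} pins a traffic class to plane 3
    policy: Dict[str, int] = field(default_factory=dict)

    def select_plane(self, flow_key: str, traffic_class: Optional[str] = None) -> int:
        """Return the plane (0..3) this flow should be steered onto."""
        if traffic_class is not None and traffic_class in self.policy:
            return self.policy[traffic_class]               # policy separation
        return zlib.crc32(flow_key.encode()) % NUM_PLANES   # default: spread by hash


selector = EdgePlaneSelector(policy={"storage-replication": 3})
print(selector.select_plane("10.0.1.5->10.0.2.9:443"))                          # hashed plane
print(selector.select_plane("10.0.1.5->10.0.2.9:443", "storage-replication"))   # pinned plane
```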

This is an interesting concept, but it would require an entire fabric to be built this way from the ground up; there is little chance of a brownfield deployment of this kind of design. One tradeoff in this kind of design would be the additional control plane state, including assigning four addresses to each host (although this might be mitigated by the clever use of anycast) and maintaining four control planes. Another tradeoff would be the shared risk link groups involved in splitting a single optical fiber and ASIC into four circuits; these aren’t exactly “virtual circuits,” but they share many of the same characteristics.