Research: Service Fabric

Microservices architectures probably will not “take over the world” in the sense of being the right answer for every application, but they are becoming more widespread. Microservices and related “staged” design patterns are ideal for edge-facing applications, where the edge-facing services in particular need to scale quickly across broad geographical regions. Supporting microservices using a standard overlay model can be challenging; somehow the network control plane, container placement/spinup/cleanup, and service discovery must be coordinated. While most networks would treat each of these as a separate problem, service fabrics are designed to interact with, or even replace, each of the systems involved, using a single, unified overlay construct.

Kakivaya, Gopal, Lu Xun, Richard Hasha, Shegufta Bakht Ahsan, Todd Pfleiger, Rishi Sinha, Anurag Gupta, et al. “Service Fabric: A Distributed Platform for Building Microservices in the Cloud.” In Proceedings of the Thirteenth EuroSys Conference, 33:1–33:15. EuroSys ’18. New York, NY, USA: ACM, 2018.

Kakivaya, et al., begin by considering the five major design principles of a service fabric: modular and layered design; self-* properties; decentralized operation; strong consistency; and support for stateful services. They then introduce Microsoft’s Service Fabric (SF) service, which they say has taken more than sixteen years and the work of more than a hundred core software engineers to build. After considering some of the components of SF at a high level, they discuss a single use case; if you do not understand the design and application of the microservices design pattern, this section is a great tutorial to start from. The authors then dive into several components of SF that are particularly interesting for network engineers.

The first of these is the federation subsystem; this allows groups of nodes to be organized into a single federation. Nodes in a federation form themselves into a virtual ring topology regardless of the underlying topology. From a networking perspective, rings have several interesting characteristics.

First, routing through a ring converges more slowly than routing through other topologies; the larger the ring, the slower the convergence. Second, ring topologies tend to form microloops while converging. Third, the addition of a new node does not increase the number of neighbors on any node (each node in a ring has two neighbors regardless of how large the ring is), but the diameter, or the length of the longest shortest path between any two nodes in the network, grows as nodes are added.
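The third property can be checked with a few lines of Python. This is only a sketch of a generic ring graph, not anything taken from SF; the function names `ring_neighbors` and `longest_shortest_path` are made up for this illustration.

```python
from collections import deque


def ring_neighbors(n):
    """Adjacency list for a ring of n nodes labeled 0..n-1 (assumes n >= 3)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def longest_shortest_path(adj):
    """Length, in hops, of the longest shortest path, found by BFS from every node."""
    longest = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        longest = max(longest, max(dist.values()))
    return longest


for n in (4, 8, 16, 32):
    adj = ring_neighbors(n)
    neighbor_count = max(len(nbrs) for nbrs in adj.values())
    print(n, neighbor_count, longest_shortest_path(adj))
```

The neighbor count stays at two for every ring size, while the longest shortest path is half the ring size (rounded down), so it keeps growing as the ring grows.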

Since the rings in SF are primarily used for control plane functions rather than routing (more on this in a minute), the convergence properties of ring topologies in this application really only apply to the speed at which nodes can be inserted into and removed from the ring, rather than to the speed of routing through the ring. Federated rings use a strongly consistent membership model, which means that although a single node might be polled for liveness by multiple other nodes in the ring, only one needs to declare the node down in order to remove it from the ring. Down detection in SF is symmetric; every node is responsible both for monitoring some other set of nodes and for reporting its own liveness to the nodes by which it is being monitored.
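To make the symmetric relationship concrete, here is a minimal sketch. It assumes, purely for illustration, that each node monitors its k nearest neighbors on either side of the ring; the node IDs, the value of k, and the function name `monitoring_sets` are all hypothetical, and the real SF protocol involves leases and arbitration that are not modeled here.

```python
def monitoring_sets(node_ids, k=2):
    """Map each node to the set of nodes it monitors.

    Assumption for this sketch: a node monitors the k nodes on each side
    of it in the ring formed by sorting the node IDs.
    """
    ring = sorted(node_ids)
    n = len(ring)
    table = {}
    for i, node in enumerate(ring):
        peers = set()
        for step in range(1, k + 1):
            peers.add(ring[(i + step) % n])  # successors, wrapping around
            peers.add(ring[(i - step) % n])  # predecessors, wrapping around
        peers.discard(node)  # only matters for very small rings
        table[node] = peers
    return table


watch = monitoring_sets([10, 25, 40, 55, 70, 85], k=2)

# Symmetry check: if a monitors b, then b also monitors a.
symmetric = all(a in watch[b] for a, peers in watch.items() for b in peers)
print(sorted(watch[10]), symmetric)
```

With this neighbor-based choice, the "monitors" and "is monitored by" sets for each node are the same set, which is one simple way to get the symmetric behavior described above.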

How can these federated rings avoid the downsides of routing through a ring topology? Because routed paths do not follow the ring. If a node needs to communicate with another node, it first uses service discovery to determine the IP address of the remote service, then sends traffic directly to that IP address. The traffic between nodes is, then, IP routed. Routing tables are built and maintained through a Distributed Hash Table (DHT). What is a DHT?

A network of five nodes is illustrated here; each node has one or two labeled links attached. While a service mesh would use node or service identifiers instead of links, the principle is the same. Assume two of the nodes in this network are given routing responsibilities; A is to handle routing for all even-numbered addresses, while D is to handle all odd-numbered addresses. This even/odd split is a very primitive form of a hash, which is simply used to split a larger number space into smaller buckets. Smaller buckets are easier to search; spreading the buckets across multiple systems allows each to process and manage a smaller set of table entries.

Hashes are considered in more detail in Computer Networking Problems and Solutions.

If node E wants to reach link (or service) 6, it runs the hashing algorithm used by all the devices (divide by two and take the remainder, in this case), then consults a local table to determine which node it should query for information about how to reach 6. It will discover that the correct node to query, in this simple case, is A. Provided the hashing is set up correctly, this is an efficient way to find and route to individual nodes.
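The even/odd example can be sketched in a few lines of Python. The split between A and D comes from the example above; the placement of links on nodes is invented for this sketch (the illustration's actual placement is not reproduced here), and the names `bucket`, `publish`, and `resolve` are hypothetical.

```python
# Which node is responsible for each hash bucket: A for even, D for odd.
INDEX_OWNER = {0: "A", 1: "D"}

# Each responsible node keeps its own, smaller table of key -> location.
index_tables = {"A": {}, "D": {}}


def bucket(key):
    """The 'hash': the remainder after dividing by two (0 = even, 1 = odd)."""
    return key % 2


def publish(key, attached_to):
    """Record where a link (or service) lives, at the node that owns its bucket."""
    index_tables[INDEX_OWNER[bucket(key)]][key] = attached_to


def resolve(key):
    """Return (node to query, location that node reports) for a key."""
    owner = INDEX_OWNER[bucket(key)]
    return owner, index_tables[owner].get(key)


# Hypothetical placement of links 1..7 on nodes A..E, for illustration only.
for key, node in {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "C", 7: "E"}.items():
    publish(key, node)

owner, location = resolve(6)
print(f"query {owner} for 6; it is attached to {location}")
```

Node E never needs a full table. It only needs the hash function and the small bucket-to-owner map, and each owner holds roughly half of the entries.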

Note this kind of system would suffer from the normal ills of a distributed routing protocol, including the limitations of the CAP theorem. In fact, the authors note that routing in SF is eventually consistent, which means nodes querying for a particular destination can receive stale information, just like in BGP, OSPF, IS-IS, etc.
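A toy example shows how a stale answer can arise. This sketch uses a client-side cache as the source of staleness, which is only one of several ways an eventually consistent system can return old data; the class names, service name, and addresses are all hypothetical and do not reflect SF's actual mechanisms.

```python
class Registry:
    """Authoritative service -> address mapping (hypothetical)."""

    def __init__(self):
        self.entries = {}

    def update(self, service, address):
        self.entries[service] = address


class CachingClient:
    """Client that remembers the first answer it received for each service."""

    def __init__(self, registry):
        self.registry = registry
        self.cache = {}

    def resolve(self, service):
        # Only ask the registry when nothing is cached yet.
        if service not in self.cache:
            self.cache[service] = self.registry.entries[service]
        return self.cache[service]

    def refresh(self, service):
        # Re-query the registry and overwrite the cached answer.
        self.cache[service] = self.registry.entries[service]
        return self.cache[service]


registry = Registry()
registry.update("svc-6", "10.0.0.3")
client = CachingClient(registry)

first = client.resolve("svc-6")       # learns 10.0.0.3
registry.update("svc-6", "10.0.0.4")  # the service moves
stale = client.resolve("svc-6")       # still the old address
fresh = client.refresh("svc-6")       # converges after a refresh
print(first, stale, fresh)
```

Between the move and the refresh, the client would send traffic to an address that no longer hosts the service, which is the same kind of window a router experiences while a routing protocol converges.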

This paper is a terrific introduction to the world of service mesh systems; it is well worth reading if you are interested in this new and emerging kind of overlay.