“We use a nonblocking fabric…”
Probably not. Nonblocking is a word that is thrown around a lot, particularly in the world of spine and leaf fabric design—but, just like calling a Clos a spine and leaf, we tend to misuse the word nonblocking in ways that are unhelpful. Hence, it is time for a short explanation of the two concepts that might help clear up the confusion. To get there, we need a network—preferably a spine and leaf like the one shown below.
Based on the design of this fabric, is it nonblocking? It would certainly seem so at first blush. Assume every link is 10g, just to make the math easy, and ignore the ToR to server links, as these are not technically a part of the fabric itself. Assume the following four 10g flows are set up—
- B through [X1,Y1,Z2] towards A
- C through [X1,Y2,Z2] towards A
- D through [X1,Y3,Z2] towards A
- E through [X1,Y4,Z2] towards A
As there are four different paths between these four servers (B through E) and Z2, which serves as the ToR for A, all 40g of traffic can be delivered through the fabric without dropping or queuing a single packet (assuming, of course, that you can carry the traffic optimally, with no overhead, etc.—or reducing the four 10g flows slightly so they can all be carried over the network in this way). Hence, the fabric appears to be nonblocking.
What happens, however, if F transmits another 10g of traffic towards A at the X4 ToR? Again, even disregarding the link between Z2 and A, the fabric itself cannot carry 50g of data to Z2; the fabric must now block some traffic, either by dropping it, or by queuing it for some period of time. In packet switched networks, this kind of contention is always possible.
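The arithmetic here can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (every link is 10g, and Z2 has one uplink to each of the four spines); the function and variable names are made up for the example.

```python
# Hypothetical model of the fabric in the text: every link is 10g and
# ToR Z2 has one uplink to each of the four spines Y1..Y4.
LINK_GBPS = 10
SPINES_TOWARD_Z2 = ["Y1", "Y2", "Y3", "Y4"]


def blocked_gbps(offered_flows_gbps, uplinks=SPINES_TOWARD_Z2, link_gbps=LINK_GBPS):
    """Return how much offered traffic exceeds the fabric capacity into one ToR."""
    capacity = len(uplinks) * link_gbps
    offered = sum(offered_flows_gbps.values())
    return max(0, offered - capacity)


# The four original flows, then the same four plus the flow from F.
four_flows = {"B": 10, "C": 10, "D": 10, "E": 10}
five_flows = dict(four_flows, F=10)

print(blocked_gbps(four_flows))  # 0
print(blocked_gbps(five_flows))  # 10
```

With the fifth flow added, 10g of the offered traffic has nowhere to go and must be dropped or queued somewhere in the fabric.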
Hence, you can design the fabric and the applications (the entire network-as-a-system) to reduce contention through the intelligent use of traffic engineering and admission policies (such as bandwidth calendaring). You can also manage contention through QoS policies and flow control mechanisms that signal senders to slow down when the network is congested.
But you cannot build a nonblocking packet switched network of any realistic size or scope. You can, of course, build a network that has two hosts, and enough bandwidth to support the maximum bandwidth of both hosts. But when you attach a third host, and then a fourth, etc., building a nonblocking fabric in a packet switched network becomes problematic; it is always possible for two sources to “gang up” on a single destination, overwhelming the capacity of the network.
It is possible, of course, to build a nonblocking fabric, so long as you use a synchronous or circuit switched network. The concept of a nonblocking network, in fact, comes out of the telephone world, where each user must only connect with one other user, and each user has, and uses, a fixed amount of bandwidth. In this world, it is possible to build a true nonblocking network fabric.
In the world of packet switching, the closest we can come is a noncontending network, which means every host can (in theory) send to every other host on the network at full rate. From there, it is up to the layout of the workloads on the fabric, and the application design, to reduce contention to the point where no blocking is taking place.
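One rough way to check for a noncontending design is to compare, at each leaf, the bandwidth facing the hosts with the bandwidth facing the fabric. The sketch below assumes that "noncontending" can be approximated as "no leaf is oversubscribed"; that reading, the function name, and the sample numbers are all assumptions made for illustration.

```python
def is_noncontending(leaves):
    """Return True if no leaf has more host-facing than fabric-facing bandwidth.

    leaves maps a leaf name to (host_facing_gbps, fabric_facing_gbps).
    """
    return all(fabric >= hosts for hosts, fabric in leaves.values())


# Hypothetical leaves: the second set has X1 oversubscribed at 2:1.
balanced = {"X1": (40, 40), "Z2": (40, 40)}
oversubscribed = {"X1": (80, 40), "Z2": (40, 40)}

print(is_noncontending(balanced))        # True
print(is_noncontending(oversubscribed))  # False
```

Passing this check says nothing about blocking; two senders can still converge on one destination, which is why workload placement and application design still matter.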
This is the kind of content that will be available in the new book Ethan and I are writing, which should be published around January of 2018, if the current schedule holds.
While most network engineers do not spend a lot of time thinking about environmentals, like power and cooling, physical space problems are actually one of the major hurdles to building truly large scale data centers. Consider this: a typical 1ru rack mount router weighs in at around 30 pounds, including the power supplies. Centralizing rack power, and removing the sheet metal, can probably reduce this by about 25% (if not more). By extension, centralizing power and removing the sheet metal from an entire data center’s worth of equipment could reduce the weight on the floor by about 10-15%—or rather, allow about 10-15% more equipment to be stacked into the same physical space. Cooling, cabling, and other considerations are similar—even paying for the sheet metal around each box to be formed and shipped adds costs.
What about blade mount systems? Most of these are designed for rather specialized environments, or they are designed for a single vendor’s blades. In the routing space, most of these solutions are actually chassis based systems, which are fraught with problems in large scale data center buildouts. The solution? Some form of open, foundation based standard that can be used by all vendors to build equipment sans the sheet metal with a common power and cooling specification.
Open19 is the hardware side of the disaggregation world; moving to a standardized rack allows users to mix and match compute, storage, and route/switch in a single “chassis,” using a range of vendors, without worrying about power and cooling differentials. Even for smaller buildouts that are interested in moving away from a vendor driven architecture, a standardized chassis format can make a lot of sense.
In this, the last post on DC fabrics as a Segment Routing use case, I mostly want to tie up some final loose ends. I will probably return to SR in the future to discuss other ideas and technical details.
Anyone who keeps up with LinkedIn knows anycast plays a major role in many parts of the infrastructure. This isn’t unique to LinkedIn, though; most DNS implementations and/or providers, as well as just about every large scale public facing web application, also use anycast. Which leads to an obvious question: how would SR work with anycast? The answer turns out to be much simpler than it might appear. The small diagram below might be helpful.
Assume A and B have two copies of a single service running on them, and we want hosts behind F to use one service or the other, just depending on which the routing system happens to route towards first. This isn’t quite the classical case for anycast, as anycast normally involves choosing the closest service, and both of the services in this example are equal distance from the hosts—but this is going to be the case more often than not in a data center. In section 3.4 of draft-ietf-spring-segment-routing, we find—
An IGP-Anycast Segment is an IGP-prefix segment which does not identify a specific router, but a set of routers. The terms “Anycast Segment” or “Anycast-SID” are often used as an abbreviation. An “Anycast Segment” or “Anycast SID” enforces the ECMP-aware shortest-path forwarding towards the closest node of the anycast set. This is useful to express macro-engineering policies or protection mechanisms. An IGP-Anycast Segment MUST NOT reference a particular node. Within an anycast group, all routers MUST advertise the same prefix with the same SID value.
To advertise the anycast service running on both A and B, then, a single label can be assigned to both, and advertised as an Anycast-SID through the entire network. When F receives both, it will treat them (essentially) as a unicast route, and hence it will choose one of the various paths available to reach the service, or (perhaps) load share between them (depending on the implementation). The ideal implementation in a data center fabric would be to load share between the two advertisements.
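To make the load sharing case concrete, here is a minimal Python sketch of how F might resolve an Anycast-SID advertised by both A and B. The topology (two intermediate routers, C and D, with every link at cost 10) is an assumption for illustration, since the original diagram is not reproduced here, and the function names are hypothetical.

```python
import heapq


def ecmp_first_hops(graph, source):
    """Dijkstra that also tracks the first hops on all equal-cost shortest paths.

    Assumes every link cost is positive.
    """
    dist = {source: 0}
    hops = {source: set()}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            first = {nbr} if node == source else hops[node]
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                hops[nbr] = set(first)
                heapq.heappush(heap, (nd, nbr))
            elif nd == dist[nbr]:
                hops[nbr] |= first
    return dist, hops


def anycast_next_hops(graph, source, members):
    """Return (cost, closest members, ECMP next hops) for an anycast set."""
    dist, hops = ecmp_first_hops(graph, source)
    best = min(dist[m] for m in members)
    closest = sorted(m for m in members if dist[m] == best)
    next_hops = sorted(set().union(*(hops[m] for m in closest)))
    return best, closest, next_hops


# Hypothetical topology: F reaches A and B through C and D, all links cost 10.
graph = {
    "F": {"C": 10, "D": 10},
    "C": {"F": 10, "A": 10, "B": 10},
    "D": {"F": 10, "A": 10, "B": 10},
    "A": {"C": 10, "D": 10},
    "B": {"C": 10, "D": 10},
}

print(anycast_next_hops(graph, "F", {"A", "B"}))
# (20, ['A', 'B'], ['C', 'D'])
```

Because both members sit at the same cost, F ends up with two equal next hops for the one Anycast-SID, which is the load sharing behavior described above. Whether a real implementation shares load or picks one path depends on the implementation.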
Strict Shortest Path Flag
In section 3.2 of draft-ietf-isis-segment-routing-extensions we find—
Strict Shortest Path First (SPF) algorithm based on link metric. The algorithm is identical to algorithm 0 but algorithm 1 requires that all nodes along the path will honor the SPF routing decision. Local policy MUST NOT alter the forwarding decision computed by algorithm 1 at the node claiming to support algorithm 1.
What is this about? At any particular hop in the network, some local policy might override the result of the SPF algorithm IS-IS uses to calculate shortest (loop free) paths through the network. In some situations, it might be that this local policy, combined with the SR header information, could create a forwarding loop in the network. To avoid this, the strict flag can be set so the SPF calculation overrides any local policy.
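The difference between the two algorithms can be reduced to one decision at each hop. The sketch below is a simplified model, not an implementation of any router's forwarding logic; the function name and arguments are hypothetical.

```python
def select_next_hop(spf_next_hop, policy_next_hop, algorithm):
    """Choose the next hop for a SID advertised with the given algorithm.

    algorithm 0: SPF, where local policy may override the result.
    algorithm 1: strict SPF, where local policy must not override the result.
    policy_next_hop is None when no local policy applies.
    """
    if algorithm == 1:
        return spf_next_hop
    if algorithm == 0:
        return policy_next_hop if policy_next_hop is not None else spf_next_hop
    raise ValueError(f"unsupported algorithm: {algorithm}")


print(select_next_hop("G", "H", algorithm=0))  # H (local policy wins)
print(select_next_hop("G", "H", algorithm=1))  # G (SPF result is honored)
```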
Using SR without IGP distribution
The main problem an operator might face in deploying SR is in implementing the IGP and/or BGP extensions to carry the SR headers through the network. This could not only expose scaling issues, it could also make the control plane very complex very quickly, perhaps offsetting some of the reasons SR is being deployed. There are several ways you could get around this problem if you were designing an SR implementation in a data center. For instance:
- Carry SR headers in gRPC using YANG models (you should be thinking I2RS about now)
- Carry SR headers in a separate BGP session to each device (for instance, if you use eBGP for primary reachability in the fabric, you could configure an iBGP session to each ToR switch to carry SR headers)
There are many other ways to implement SR without using the IGP extensions specifically in a data center fabric.
SR is an interesting technology, particularly in the data center space. It’s much simpler than other forms of traffic engineering, in fact.
In the last post in this series, I discussed using SR labels to direct traffic from one flow onto, and from other flows off of, a particular path through a DC fabric. Throughout this series, though, I’ve been using node (or prefix) SIDs to direct the traffic. There is another kind of SID in SR that needs to be considered—the adj-sid. Let’s consider the same fabric used throughout this series—
So far, I’ve been describing the green marked path using the node or (loopback) prefix-sids:
[A,F,G,D,E]. What’s interesting is I could describe the same path using adj-sids:
[a->f,f->g,g->d,d->e], where the vector in each hop is described by a single entry on the SR stack. There is, in fact, no difference between the two ways of describing this path, as there is only one link between each pair of routers in the path. This means everything discussed in this series so far could be accomplished with either a set of adj SIDs or a set of node (prefix) SIDs.
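The equivalence between the two descriptions can be shown with a short Python sketch. It only holds under the condition stated in the text, a single link between each pair of routers; the function name is made up for the example.

```python
def node_path_to_adjacencies(node_path):
    """Rewrite a list of node SIDs as the adjacencies between consecutive nodes.

    Only valid when there is exactly one link between each pair of routers.
    """
    return [f"{a.lower()}->{b.lower()}" for a, b in zip(node_path, node_path[1:])]


print(node_path_to_adjacencies(["A", "F", "G", "D", "E"]))
# ['a->f', 'f->g', 'g->d', 'd->e']
```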
Given this, why are both types of SIDs defined? Assume we zoom in a little on the border leaf node in this topology, and find—
- Router E, a ToR on the “other side of the fabric,” wants to send traffic through Y rather than Z
- Router Y does not support SR, or is not connected to the SR control plane running on the fabric
What can Router E do to make certain traffic exits the border leaf towards Router Y, rather than Router Z? Begin with Router A building an adj-sid for each of its connections to the exiting routers:
- A->Y on A1 is given label 1500
- A->Y on A2 is given label 1501
- A->Y ECMP across both links is given label 1502
- A->Z is given label 1503
With Router A advertising this set of labels, E can use one of the following SR stacks:
[H,A,1500] to push traffic towards Y along the A1 link
[H,A,1501] to push traffic towards Y along the A2 link
[H,A,1502] to push traffic towards Y using ECMP across both A1 and A2
This indicates there are two specific places where adj-sids can be useful in combination with node (or prefix) SIDs—
- When the router towards which traffic needs to be directed doesn’t participate in SR; this effectively widens the scope of SR’s effectiveness to one hop beyond the SR deployment itself
- When there is more than one link between two adjacent routers; this allows SR to choose one of a set of links among a group of bundled links, or a pair of parallel links that would normally be used through ECMP
Again, note that the entire deployment developed so far in this series could use either just adj-sids, or just node (prefix) SIDs; both are not required. To gain more finely grained control over the path taken through the fabric, and in situations where SR needs to be extended beyond the reach of the SR deployment itself, both types of SIDs are required.
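The label stacks Router E builds in the border leaf example can be sketched as a simple table lookup on Router A. The labels and the A1 and A2 link names come from the example; the link name for A->Z ("A3") and the function name are hypothetical.

```python
# Adj-sids Router A advertises in the example; "A3" is a made-up name for
# the A->Z link, which the text does not name.
ADJ_SIDS_ON_A = {
    1500: ("Y", ["A1"]),
    1501: ("Y", ["A2"]),
    1502: ("Y", ["A1", "A2"]),  # ECMP across both links
    1503: ("Z", ["A3"]),
}


def egress_at_a(sr_stack):
    """Return (neighbor, links) that A uses for a stack ending in one of its adj-sids."""
    *node_segments, adj_sid = sr_stack
    if not node_segments or node_segments[-1] != "A":
        raise ValueError("the adj-sid is only meaningful once the packet reaches A")
    return ADJ_SIDS_ON_A[adj_sid]


print(egress_at_a(["H", "A", 1500]))  # ('Y', ['A1'])
print(egress_at_a(["H", "A", 1502]))  # ('Y', ['A1', 'A2'])
```

Note that Y never appears as a segment in the stack, which is how the adj-sid reaches one hop beyond the SR deployment.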
This brings up another potential use case in a data center fabric, as well. So far, we’ve not dealt with overlay networks, only the fabric itself. Suppose, however, you have—
Where C, D, E, and F are ToR or leaf switches (routers), and A, B, G, and H are hosts (or switches inside hypervisors). Each of the two differently colored/dashed sets of lines represents a security domain of some type; however, this security domain appears, from IP’s perspective, to be a single subnet. This could be, for instance, a single subnet split into multiple broadcast domains using eVPNs, or simply some set of segments set up for security purposes. In this particular situation, G wants to send traffic to D without F receiving it. Given the servers themselves participate in the SR domain, how could this be accomplished? G could send this traffic with an SR stack telling E to transmit the traffic through the blue interface, rather than the green one. This could be a single SR stack, containing just a single adj-sid telling E which interface to use when forwarding this traffic.
I had hoped to finish this series this week, but it looks like it’s going to take one more week to wrap up a few odds and ends…
A couple of weeks ago, I attended a special segment routing Networking Field Day. This set me to thinking about how I would actually use segment routing in a live data center. As always, I’m not so concerned about the configuration aspects, but rather with what bits and pieces I would (or could) put together to make something useful out of these particular 0’s and 1’s. The fabric below will be used as our example; we’ll work through this in some detail (which is why there is a “first part” marker in the title).
This is a Benes fabric, a larger variation of which you might find in any number of large scale data centers. In this network, there are many paths between A and E; three of them are marked out with red lines to give you the idea. Normally, the specific path taken by any given flow would be selected on a somewhat random basis, using a hash across various packet headers. What if I wanted to pin a particular flow, or set of flows, to the path outlined in green?
Let’s ask a different question first—why would I want to do such a thing? There are a number of reasons, of course, such as pinning an elephant flow to a single path so I can move other traffic around it. Or perhaps I want to move a specific small (mouse) flow onto this path, while somehow preventing other traffic from taking it. Given I have a good reason, though, how could I do this with segment routing?
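For contrast with pinning, the default hash-based path selection can be sketched in Python. This is only an illustration: real switches use hardware hash functions rather than SHA-256, and the flow fields and the list of paths (other than [A,F,G,D,E]) are assumptions made for the example.

```python
import hashlib


def pick_path(flow, paths):
    """Pick one of several equal-cost paths by hashing the flow's header fields."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


# Hypothetical equal-cost paths between A and E.
paths = [
    ["A", "B", "C", "D", "E"],
    ["A", "F", "G", "D", "E"],
    ["A", "F", "G", "H", "E"],
]

# Hypothetical flow: source, destination, protocol, source port, destination port.
flow = ("10.0.0.1", "10.0.1.1", 6, 49152, 443)

print(pick_path(flow, paths))
```

Every packet of a given flow hashes to the same path, but the operator does not get to choose which one; that choice is what segment routing adds.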
Let’s begin at the beginning. Assume I’m running IS-IS on this network, I have MPLS forwarding enabled, but I don’t have any form of label distribution enabled (LDP, BGP on top, etc.). To make my life simple, I’m going to assign a loopback address to each router in the fabric, and either—
- just allow IS-IS to use the IPv6 link local addresses (don’t assign any IPv6 address to the fabric links in an IPv6 only fabric)
- assign some private IPv4 address and configure IS-IS to advertise only passive interfaces, so the fabric link addresses aren’t actually advertised into IS-IS
If you’re uncertain what either of these two options means, you might want to take a run through my recent IS-IS Livelesson to understand IS-IS as a routing protocol better.
With this background, let’s roll segment routing onto this network. Segment routing, in order to allow many different transports, contains the concept of a Segment Identifier, or a SID. The SID is used for many things, a point which can make the segment routing drafts a bit confusing to read. For this particular network, though, we’re going to simplify to two specific kinds of SIDs, because these are the only two we really care about:
- IGP-Prefix Segment, defined in section 3.2 of draft-ietf-spring-segment-routing, and variously called the Prefix-SID, the IGP-Prefix-SID, and the Prefix Segment Identifier in various places. IS-IS carries this in the Prefix Segment Identifier (Prefix-SID Sub-TLV), defined in section 2.1 of draft-ietf-isis-segment-routing-extensions. This sub-TLV can be added to any of the four main “wide metric” TLVs in IS-IS (again, I’ll refer you here to the recent Livelesson if you need to brush up on IS-IS packet formatting).
- IGP-Adjacency Segment, defined in section 3.5 of draft-ietf-spring-segment-routing, and variously called the Adj-SID and the IGP-Adjacency-SID in various other documents. IS-IS carries this in the Adjacency Segment Identifier, defined in section 2.2 of draft-ietf-isis-segment-routing-extensions. This sub-TLV can be carried in any of the IS node reachability TLVs, such as 22, 222, 23, etc.
These SIDs, in the world of MPLS, are actually just MPLS labels. This means you don’t need a separate form of MPLS label distribution if you’re using the IS-IS segment routing extensions; these labels can be carried in IS-IS itself, along with the topology and reachability information.
To get segment routing up and running, I’ll need each router in the network to create two different MPLS labels, AKA SIDs, and advertise them through IS-IS (using the correct sub-TLV, of course)—
- An IGP-Prefix segment for each loopback address.
- An IGP-Adjacency segment for each fabric interface.
This means Router A would create an IGP-Prefix segment for its loopback address, and an IGP-Adjacency segment towards B, F, and its other neighbors.
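The allocation each router performs can be sketched as follows. This is a simplified model: the label values, the loopback address, and the function name are made up for illustration, and it ignores how real implementations carve up label ranges.

```python
from itertools import count


def allocate_sids(router, loopback, neighbors, labels):
    """Allocate one prefix SID for the loopback and one adj-sid per neighbor."""
    return {
        "prefix_sid": {loopback: next(labels)},
        "adj_sids": {f"{router}->{nbr}": next(labels) for nbr in neighbors},
    }


# Hypothetical label pool and loopback address for Router A.
labels = count(16000)
sids_a = allocate_sids("A", "2001:db8::a/128", ["B", "F"], labels)

print(sids_a)
# {'prefix_sid': {'2001:db8::a/128': 16000}, 'adj_sids': {'A->B': 16001, 'A->F': 16002}}
```

Router A would then advertise these values through IS-IS in the sub-TLVs described above.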
There is, in fact, another type of SID described in the segment routing documentation, an IGP-Node-Segment. This actually describes a loopback address for a particular node, and hence describes the device itself. This is discussed in section 2.1 of draft-ietf-isis-segment-routing-extensions as a single flag within the IGP-Prefix segment. In reality, there is no functional difference between a node identifier and a prefix identifier in this case, so there’s no need to spend a lot of time on this here.