Jason Wells, over on LinkedIn, has an article up about the end of MPLS; to wit—
MPLS, according to Akkiraju, is old-hat and inefficient – why should a branch office backhaul to get their cloud data, when Internet connections might be faster – and 100X cheaper? Cisco, in acquiring Viptela, has brought Akkiraju, his company, and his perspective back into the fold, perhaps heralding the beginning of the end of Cisco’s MPLS-based offerings (or at least the beginning of the end of the mindset that they should still have an MPLS-based offering).
To begin—I actually work with Aryaka on occasion, and within the larger SD-WAN world more often (I am a member of the TAB over at Velocloud, for instance). This is decidedly not a post about the usefulness or future of SD-WAN solutions (though I do have opinions there, as you might have guessed). Rather, what I want to point out is that we, in the networking industry, tend to be rather sloppy about our language in ways that are not helpful.
To understand, it is useful to back up a few years and consider other technologies where our terms have become confused, and how this has impacted our ability to have effective discussions. When I first started in the networking world, there were gateways, there were routers, and there were switches. Gateways always looked at something beyond the destination IP address, and sometimes rewrote information in the packet, when forwarding or processing it. Routers always made a forwarding decision based on the IP or network header, and modified the outer headers in the packet (the physical, or lower layer, headers). Specifically, routers always performed a MAC header rewrite and decremented the Time to Live (TTL); switches never touched the packet, either the header or the contents.
There was, in the “old days,” another way in which routers and switches were different. Because routers must look deeper into the packet, and modify the packet, the original routing implementations could only be performed in software. Hence, for many years, routers processed packets in software, and switches processed packets in hardware.
When the original SSE was developed by Cisco, however, the marketing folks wanted to emphasize that the SSE could route in hardware. To convey this, they took one meaning of switching, that switching takes place in hardware while routing takes place in software, and called these new devices layer 3 switches. Over time, of course, the layer 3 switch was reduced to just a switch, so now a switch is “anything that can process packets fast,” and a router is “anything that processes packets slowly.”
The problem with this shift in terminology is that we no longer have a way to express the difference between a device that rewrites the MAC header and one that does not. A Top of Rack (ToR) switch in a data center may, in fact, perform a MAC header rewrite. So when someone says we should put a switch here, you have no idea if they mean a device that does MAC header rewrites or not. Routers are placed in the WAN, while switches are placed in the DC, even though they perform the same functions on the packet. Confusion ensues.
There is no way I can convince people to stop using router and switch interchangeably.
How does this relate to the article quoted above? What does MPLS mean? This is one of those terms that has two distinct meanings. To at least some people, it means a widely used switching and tunneling technology. To others, however, it means a particular kind of service: a layer 3 service running over an MPLS-capable network, operated by a provider. The key point the article is making is that dedicated virtual circuits, provisioned and managed by a provider, are losing ground to SD-WAN solutions. Describing these services as MPLS uses one meaning of the term, but not the other.
The problem is, of course, that many people will read this sort of thing and say, “aha! Providers are getting rid of MPLS in their networks and replacing it with SD-WAN solutions.” This is simply a category error. It is a kind of service that is being replaced, not a technology.
The bottom line. This is nothing more than a call for consistent and clear terminology in the world of network engineering. I know we are fighting marketing machines that are looking for any sort of lever on which to stand, including “we use technology x, which makes our offering better.” This tendency, however, is bad for open and effective discussion in the network engineering community—and we should try to avoid it where we can.
In this, the last post on DC fabrics as a Segment Routing use case, I mostly want to tie up some final loose ends. I will probably return to SR in the future to discuss other ideas and technical details.
Anyone who keeps up with LinkedIn knows anycast plays a major role in many parts of the infrastructure. This isn’t unique to LinkedIn, though; most DNS implementations and providers, as well as just about every large scale public facing web application, also use anycast. Which leads to an obvious question: how would SR work with anycast? The answer turns out to be much simpler than it might appear. The small diagram below might be helpful—
Assume A and B each run a copy of a single service, and we want hosts behind F to use one service or the other, depending simply on which one the routing system happens to route towards first. This isn’t quite the classical case for anycast, as anycast normally involves choosing the closest service, and both of the services in this example are equidistant from the hosts—but this is going to be the case more often than not in a data center. In section 3.4 of draft-ietf-spring-segment-routing, we find—
An IGP-Anycast Segment is an IGP-prefix segment which does not identify a specific router, but a set of routers. The terms “Anycast Segment” or “Anycast-SID” are often used as an abbreviation. An “Anycast Segment” or “Anycast SID” enforces the ECMP-aware shortest-path forwarding towards the closest node of the anycast set. This is useful to express macro-engineering policies or protection mechanisms. An IGP-Anycast Segment MUST NOT reference a particular node. Within an anycast group, all routers MUST advertise the same prefix with the same SID value.
To advertise the anycast service running on both A and B, then, a single label can be assigned to both, and advertised as an Anycast-SID through the entire network. When F receives both, it will treat them (essentially) as a unicast route, and hence it will choose one of the various paths available to reach the service, or (perhaps) load share between them (depending on the implementation). The ideal implementation in a data center fabric would be to load share between the two advertisements.
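The behavior described above can be sketched in a few lines. This is a minimal model, not an implementation: the label values, next hop names, and flow identifiers are all hypothetical, and the hash-based selection simply stands in for whatever ECMP mechanism a real forwarding plane uses.

```python
from hashlib import sha256

# F's label table: the Anycast-SID (2000, hypothetical) resolves to two
# next hops, one toward A and one toward B, exactly as an ordinary
# ECMP unicast route would. The node SIDs resolve to a single next hop.
label_fib = {
    2000: ["next-hop-toward-A", "next-hop-toward-B"],  # Anycast-SID for the service
    1601: ["next-hop-toward-A"],                       # A's node SID
    1602: ["next-hop-toward-B"],                       # B's node SID
}

def forward(label, flow_id):
    """Pick one next hop for a labeled packet, hashing the flow
    identifier so every packet in a flow takes the same path."""
    next_hops = label_fib[label]
    index = int(sha256(flow_id.encode()).hexdigest(), 16) % len(next_hops)
    return next_hops[index]

# Packets in a given flow always resolve to the same service instance.
assert forward(2000, "10.1.1.1:443->10.2.2.2:51514") in label_fib[2000]
```

Because both service instances are equidistant, which one a given flow lands on depends only on the hash, which is exactly the load sharing behavior the ideal data center implementation would want.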
Strict Shortest Path Flag
In section 3.2 of draft-ietf-isis-segment-routing-extensions we find—
Strict Shortest Path First (SPF) algorithm based on link metric. The algorithm is identical to algorithm 0 but algorithm 1 requires that all nodes along the path will honor the SPF routing decision. Local policy MUST NOT alter the forwarding decision computed by algorithm 1 at the node claiming to support algorithm 1.
What is this about? At any particular hop in the network, some local policy might override the result of the SPF algorithm IS-IS uses to calculate shortest (loop free) paths through the network. In some situations, it might be that this local policy, combined with the SR header information, could create a forwarding loop in the network. To avoid this, the strict flag can be set so the SPF calculation overrides any local policy.
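The difference between the two algorithms can be illustrated with a toy next hop selection function. This is a sketch only: the prefix, next hops, and policy table are hypothetical, and real local policy is far richer than a simple override map.

```python
# Next hops computed by the IGP's SPF run (hypothetical values)
spf_next_hop = {"10.0.0.0/24": "G"}

# A local policy (say, policy-based routing) that would override SPF
local_policy = {"10.0.0.0/24": "H"}

def next_hop(prefix, algorithm):
    """Algorithm 0 lets local policy override SPF; algorithm 1
    (strict SPF) must ignore local policy entirely."""
    if algorithm == 1:
        return spf_next_hop[prefix]
    return local_policy.get(prefix, spf_next_hop[prefix])

assert next_hop("10.0.0.0/24", 0) == "H"  # policy wins under algorithm 0
assert next_hop("10.0.0.0/24", 1) == "G"  # strict SPF ignores the policy
```

The strict flag, in other words, guarantees that the path computed by SPF is the path the packet actually takes, closing off the possibility of a policy-induced loop.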
Using SR without IGP distribution
The main problem an operator might face in deploying SR is implementing the IGP and/or BGP extensions needed to carry the SR headers through the network. This could not only expose scaling issues, it could also make the control plane very complex very quickly, perhaps offsetting some of the reasons SR is being deployed in the first place. There are several ways you could get around this problem if you were designing an SR implementation in a data center. For instance—
- Carry SR headers in gRPC using YANG models (you should be thinking I2RS about now)
- Carry SR headers in a separate BGP session to each device (for instance, if you use eBGP for primary reachability in the fabric, you could configure an iBGP session to each ToR switch to carry SR headers)
There are many other ways to implement SR without using the IGP extensions specifically in a data center fabric.
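The common thread in these options is a controller distributing SID-to-label bindings out of band, rather than flooding them in the IGP. A minimal sketch of that idea follows; the binding format (loosely YANG-shaped), the label value, the prefix, and the device agent are all hypothetical, standing in for whatever gRPC or iBGP machinery an actual deployment would use.

```python
# A binding as a controller might express it (all values hypothetical)
binding = {
    "node": "F",
    "sid-type": "prefix-sid",
    "prefix": "192.0.2.6/32",  # documentation prefix, for illustration
    "label": 16006,
}

# Each device keeps its own label table, populated by the controller
# over whatever channel is in use (gRPC, a dedicated iBGP session, ...)
device_label_tables = {"A": {}, "F": {}}

def push_binding(device, b):
    """What a device-side agent might do on receiving a binding."""
    device_label_tables[device][b["label"]] = (b["node"], b["prefix"])

for device in device_label_tables:
    push_binding(device, binding)

# Every device now maps label 16006 to F's loopback prefix
assert device_label_tables["A"][16006] == ("F", "192.0.2.6/32")
```

The point of the sketch is simply that the binding itself is small and simple; the design question is which channel carries it, not what it contains.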
SR is an interesting technology, particularly in the data center space. It’s much simpler than other forms of traffic engineering, in fact.
In the last post in this series, I discussed using SR labels to direct traffic from one flow onto, and from other flows off of, a particular path through a DC fabric. Throughout this series, though, I’ve been using node (or prefix) SIDs to direct the traffic. There is another kind of SID in SR that needs to be considered—the adj-sid. Let’s consider the same fabric used throughout this series—
So far, I’ve been describing the green marked path using the node or (loopback) prefix-sids:
[A,F,G,D,E]. What’s interesting is that I could describe the same path using adj-sids:
[a->f,f->g,g->d,d->e], where the vector at each hop is described by a single entry on the SR stack. There is, in fact, no difference between the two ways of describing this path, as there is only one link between each pair of routers in the path. This means everything discussed in this series so far could be accomplished with either a set of adj-sids or a set of node (prefix) SIDs.
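The equivalence is easy to see in a small model. This sketch uses the adj-sid names from the text; the table mapping each (router, adj-sid) pair to the neighbor reached over that link is hypothetical, and works only because there is a single link between each pair of routers.

```python
# Adj-SID table: (router, adj_sid) -> neighbor reached over that link
adj_sids = {
    ("A", "a->f"): "F",
    ("F", "f->g"): "G",
    ("G", "g->d"): "D",
    ("D", "d->e"): "E",
}

def walk_adj_path(start, stack):
    """Follow a stack of adj-sids hop by hop, recording the nodes visited."""
    node, hops = start, [start]
    for sid in stack:
        node = adj_sids[(node, sid)]
        hops.append(node)
    return hops

# The adj-sid stack traces exactly the node sequence the
# node (prefix) SID description [A,F,G,D,E] would produce.
assert walk_adj_path("A", ["a->f", "f->g", "g->d", "d->e"]) == ["A", "F", "G", "D", "E"]
```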
Given this, why are both types of SIDs defined? Assume we zoom in a little on the border leaf node in this topology, and find—
- Router E, a ToR on the “other side of the fabric,” wants to send traffic through Y rather than Z
- Router Y does not support SR, or is not connected to the SR control plane running on the fabric
What can Router E do to make certain traffic exits the border leaf towards Router Y, rather than Router Z? Begin with Router A building an adj-sid for each of its connections to the external routers—
- A->Y on A1 is given label 1500
- A->Y on A2 is given label 1501
- A->Y ECMP across both links is given label 1502
- A->Z is given label 1503
With Router A advertising this set of labels, E can use the SR stack—
[H,A,1500] to push traffic towards Y along the A1 link
[H,A,1501] to push traffic towards Y along the A2 link
[H,A,1502] to push traffic towards Y using ECMP across both A1 and A2
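Router A's side of this can be sketched as a simple label-to-link table. The labels (1500–1503) follow the example above; the link names, the label for the Z-facing link's interface, and the hash used for the ECMP case are hypothetical.

```python
import zlib

# A's adj-sid table: label -> candidate outbound link(s)
adj_sid_table = {
    1500: ["A1"],        # toward Y on A1 only
    1501: ["A2"],        # toward Y on A2 only
    1502: ["A1", "A2"],  # toward Y, ECMP across both links
    1503: ["Z1"],        # toward Z (hypothetical link name)
}

def pick_link(label, flow_id):
    """Resolve an adj-sid to an outbound link; for the ECMP adj-sid,
    hash the flow so its packets stay on one link."""
    links = adj_sid_table[label]
    if len(links) == 1:
        return links[0]
    return links[zlib.crc32(flow_id.encode()) % len(links)]

assert pick_link(1500, "any-flow") == "A1"
assert pick_link(1502, "flow-x") in ("A1", "A2")
```

E never needs to know the link names themselves; it only needs the labels A advertised, which is the whole point of the adj-sid.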
This indicates there are two specific places where adj-sids can be useful in combination with node (or prefix) SIDs—
- When the router towards which traffic needs to be directed doesn’t participate in SR; this effectively widens the scope of SR’s effectiveness to one hop beyond the SR deployment itself
- When there is more than one link between two adjacent routers; this allows SR to choose one link from a group of bundled links, or from a pair of parallel links that would normally be used through ECMP
Again, note that the entire deployment developed so far in this series could use either just adj-sids, or just node (prefix) SIDs; both are not required. To gain more finely grained control over the path taken through the fabric, and in situations where SR wants to be extended beyond the reach of the SR deployment itself, both types of SIDs are required.
This brings up another potential use case in a data center fabric, as well. So far, we’ve not dealt with overlay networks, only the fabric itself. Suppose, however, you have—
Where C, D, E, and F are ToR or leaf switches (routers), and A, B, G, and H are hosts (or switches inside hypervisors). Each of the two differently colored/dashed sets of lines represents a security domain of some type; however, each security domain appears, from IP’s perspective, to be a single subnet. This could be, for instance, a single subnet split into multiple broadcast domains using eVPNs, or simply some set of segments set up for security purposes. In this particular situation, G wants to send traffic to D without F receiving it. Given that the servers themselves participate in the SR domain, how could this be accomplished? G could send this traffic with an SR stack telling E to transmit the traffic through the blue interface, rather than the green one. This could be a single SR stack, containing just a single adj-sid telling E which interface to use when forwarding this traffic.
I had hoped to finish this series this week, but it looks like it’s going to take one more week to wrap up a few odds and ends…
In the second post in this series, we considered the use of IGP-Prefix segments to carry a flow along a specific path in a data center fabric. Specifically, we looked at pulling the green flow in this diagram—
—along the path [A,F,G,D,E]. Let’s assume this single flow is an elephant flow that we’re trying to separate out from the rest of the traffic crossing the fabric. So—we’ve pulled the elephant flow onto its own path, but this still leaves the other flows to simple ECMP forwarding through the fabric. This means some number of other flows are still going to follow the [A,F,G,D,E] path. The flows that are randomly selected (or rather, selected by the ECMP hash) to follow the same path as the elephant flow are still going to contend with the elephant flow for queue space, etc.
So we need more than just a way to pull an elephant flow onto a specific path. In fact, we also need a way to pull a specific set of flows off a particular path in the ECMP set. Returning to our diagram, assume we want all the traffic other than the elephant flow to be load shared between H and B, and never sent to F. How can we do this with segment routing? This is where a device assigning more than one label comes into play. The process is actually pretty simple from a forwarding perspective.
- H and B are assigned the same label, so A now has two paths with the same label—let’s say we use 1024 just as an example for the label
- When A classifies any traffic into the forwarding class that belongs to this label—essentially anything not belonging to the elephant flow we’re trying to isolate—it assigns this single label (1024)
- A forwards the traffic carrying this label (1024) along the ECMP set formed by the two available paths for the label
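The three steps above can be sketched at Router A as classify, impose, and hash. The elephant flow's 5-tuple, the path names, and the node SIDs are hypothetical; 1024 is the shared label assigned to both H and B in the example.

```python
import zlib

# Hypothetical 5-tuple identifying the elephant flow
ELEPHANT_FLOW = ("10.1.1.5", "10.2.2.9", "tcp", 40000, 9000)
ELEPHANT_STACK = ["F", "G", "D", "E"]  # the pinned path for the elephant flow

# Label 1024 resolves to two equal-cost paths, one through H, one through B
ecmp_paths = {1024: ["toward-H", "toward-B"]}

def classify(five_tuple):
    """Pick the label stack A imposes on this traffic: the pinned path
    for the elephant flow, the shared label 1024 for everything else."""
    if five_tuple == ELEPHANT_FLOW:
        return ELEPHANT_STACK
    return [1024]

def select_path(label, five_tuple):
    """Hash each flow onto one member of the label's ECMP set, so a
    flow's packets all take the same path."""
    paths = ecmp_paths[label]
    return paths[zlib.crc32(repr(five_tuple).encode()) % len(paths)]

mouse = ("10.1.1.7", "10.3.3.3", "tcp", 51515, 80)  # any non-elephant flow
assert classify(ELEPHANT_FLOW) == ELEPHANT_STACK
assert select_path(classify(mouse)[0], mouse) in ecmp_paths[1024]
```

Note that the mouse flows never see F at all: label 1024 only resolves through H and B, which is exactly the isolation we were after.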
The hard part here is not the forwarding, however—it’s the classification of the traffic and the provisioning of the tunnels through the fabric. Let me start with the easier problem first—the provisioning of the labels on the fabric. We could, of course, count on some sort of routing protocol, or something like LDP, to provide each router on the fabric with the correct tunnels. Or each device could be configured with, or self-assign, labels as needed, and the control plane could carry these labels to the other devices running the control plane.
But how should such labels be collected and distributed? BGP-LS, or rather BGP-Link State, is probably the cleanest option using a dynamic routing protocol. If OSPF or IS-IS is used, it will collect the labels on the local device and advertise them through various extensions: the Extended Prefix and Extended Link Opaque LSAs in OSPF, and the extended reachability TLVs (such as the type 22 TLV) in IS-IS. While each router in the fabric could find the label sets this way, there’s no way to bind the label to a forwarding class, a specific type of traffic, or a specific application.
To solve this problem, BGP-LS can be used to gather the labels from the link state protocol and carry them to a controller sitting someplace on the fabric. In a BGP-only fabric, BGP itself can carry the adj-sid labels to the controller directly. The controller can then set the correct policies at each ToR switch to impose the correct SR label stack, pulling the traffic along the correct path through the fabric. All these solutions, of course, assume you’re using adj-sids as well as prefix-sids, which is overkill on a spine and leaf fabric.
There is another problem here, of course—how to get the ToR switch (router A in this instance) to pull the correct flows into the correct label stack. This problem, however, is an implementation issue on the individual ToR switches—there needs to be some way to classify the flow and change the forwarding behavior based on this classification that results in the correct label stack being pushed onto the packet as it’s forwarded by Router A.
Next time, we’ll wrap this series up with further thoughts on other uses for adj-sids, and also on how we could assign labels and impose policies in a way that avoids the (otherwise unneeded) adj-sids.
In the first post we covered a bit of the basics around segment routing in the data center. Let’s return to the first use case to see if we can figure out how we’d actually implement the type of traffic steering needed to segregate mouse and elephant flows. Let’s return to our fabric and traffic flows and think about how we could shape traffic using segment routing.
There are two obvious ways to shape traffic in this way—
The first way would be to impose a label stack that forces traffic along a path that touches, or passes through, each of the devices along the path. In this case, that would mean imposing a path on the traffic originating behind the ToR at A so it must pass through [F,G,D,E]. The flow of traffic through the data center will look something like—
- Somehow classify the traffic as belonging to the flow that should be shaped to follow only the [F,G,D,E] path
- Impose the path as a label stack, so the SR header (really just a label stack in this situation, remember?) will contain [F,G,D,E]
- Forward the packet, with the label, to the next hop in the stack, namely F
- F receives the packet, pops the top label, examines the next label in the SR header (MPLS label stack)
- F uses local forwarding information to send the packet to G
- G receives the packet, pops the top label, examines the next label in the SR header (MPLS label stack)
- G uses local forwarding information to send the packet to D
- D receives the packet, pops the top label, examines the next label in the SR header (MPLS label stack)
- D uses local forwarding information to send the packet to E
- E pops the last label and forwards based on the destination address in the IP header
This is fairly typical operation, just what you might expect to guide the traffic through the fabric.
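The pop-and-forward steps above can be traced in a small model: each router pops the top label, then uses its own local forwarding information to reach the router named by the new top of stack. The topology names match the example; the per-router forwarding tables are hypothetical and list only the entries this path needs.

```python
# Hypothetical local forwarding tables: router -> {destination: next hop}
next_hops = {
    "A": {"F": "F"},
    "F": {"G": "G"},
    "G": {"D": "D"},
    "D": {"E": "E"},
}

def forward_stack(ingress, stack):
    """Trace a packet carrying SR stack `stack` from `ingress`:
    forward toward the top label, then let the receiver pop it."""
    node, hops = ingress, [ingress]
    while stack:
        target = stack[0]
        node = next_hops[node][target]  # local lookup toward the top label
        hops.append(node)
        stack = stack[1:]               # the receiving router pops its label
    return hops

# The stack [F,G,D,E] pulls the packet along exactly that path.
assert forward_stack("A", ["F", "G", "D", "E"]) == ["A", "F", "G", "D", "E"]
```

When the loop ends, E has popped the last label and forwards on the IP header, just as in the final step above.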
The second way is very similar to the first, only it adds an extra set of labels to the SR header (MPLS label stack) to forward the packets through the fabric. Let’s assume the inbound interface on each router in the fabric is assigned a letter/number combination (this would really just be another label in MPLS)—
- The inbound interface on F on the A->F link is labeled F1
- The inbound interface on G on the F->G link is labeled G1
- The inbound interface on D on the G->D link is labeled D1
- The inbound interface on E on the D->E link is labeled E1
Now when A imposes a label on packets being forwarded along this path, it will look like [F1,F,G1,G,D1,D,E1,E]. Notice the prefix segments are still in the SR stack; we’ve just added the interface labels along with the prefix segments. Processing this packet is a little different, as there are now specified interfaces along the path. Each switch is going to use the interface specified, rather than using locally calculated forwarding information.
- Somehow classify the traffic as belonging to the flow that should be shaped to follow only the [F1,F,G1,G,D1,D,E1,E] path
- Impose the path as a label stack, so the SR header (really just a label stack in this situation, remember?) will contain [F1,F,G1,G,D1,D,E1,E]
- Forward the packet out of the specified interface, F1, towards the next hop, which is F
- F will pop the inbound link label, F1, and its local label, F
- F will look up the outbound interface based on the next label, G1, and forward through that interface towards the next hop in the path, G
I’m not going to repeat the entire path here again, because it’s pretty much the same for each device in the path. Note the only difference between these two situations is the addition of the interface label. If it seems like the interface labels add complexity without adding value, you’re right—for this particular topology. The reason interface labels don’t add value here is because there is only one path between any pair of routers.
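The interface-label variant can be traced the same way as the plain node-SID case. Here the stack alternates interface labels (F1, G1, ...) and prefix segments (F, G, ...), and each router uses the interface label to select the link rather than its own computed forwarding information. The label names match the example; the map from interface label to (router, link) is hypothetical.

```python
# interface label -> (router owning the inbound interface, link it sits on)
links = {
    "F1": ("F", "A->F"),
    "G1": ("G", "F->G"),
    "D1": ("D", "G->D"),
    "E1": ("E", "D->E"),
}

def forward_with_interfaces(ingress, stack):
    """Trace a packet whose stack alternates interface and prefix
    labels; return the egress router and the links actually used."""
    node, used_links = ingress, []
    while stack:
        iface, _prefix = stack[0], stack[1]  # interface label, then node label
        next_node, link = links[iface]
        used_links.append(link)              # the specified link is used,
        node = next_node                     # not a locally computed one
        stack = stack[2:]                    # the receiver pops both labels
    return node, used_links

assert forward_with_interfaces("A", ["F1", "F", "G1", "G", "D1", "D", "E1", "E"]) == \
    ("E", ["A->F", "F->G", "G->D", "D->E"])
```

With only one link between each pair of routers, the links chosen here are the same ones the node-SID-only stack would have used, which is exactly why the interface labels add no value on this topology.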
A later post will consider the value of interface labels.