DC Fabric Segment Routing Use Case (3)

In the second post in this series, we considered the use of IGP-Prefix segments to carry a flow along a specific path in a data center fabric. Specifically, we looked at pulling the green flow in the diagram along the path [A,F,G,D,E].

Let’s assume this single flow is an elephant flow that we’re trying to separate out from the rest of the traffic crossing the fabric. We’ve pulled the elephant flow onto its own path, but the other flows are still left to simple ECMP forwarding through the fabric. This means some number of those flows will still follow the [A,F,G,D,E] path. The flows that are randomly selected (or selected by the ECMP hash) to follow the same path as the elephant flow are still going to contend with it for queue space and other resources.
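To make the contention problem concrete, here is a minimal sketch of plain ECMP hashing at A. It assumes A has three uplinks (F, H, and B, as the discussion implies), and it uses a SHA-256 hash over a made-up five-tuple purely for illustration; real switches use their own hardware hash functions, and the flow addresses and the `ecmp_next_hop` name are hypothetical.

```python
import hashlib
from collections import Counter

NEXT_HOPS = ["F", "H", "B"]  # assumption: A's three uplinks, as implied by the text


def ecmp_next_hop(flow, next_hops):
    """Pick a next hop by hashing the five-tuple (illustrative hash, not a vendor's)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


# Hypothetical non-elephant flows: (src_ip, dst_ip, protocol, src_port, dst_port)
flows = [("10.0.0.1", "10.0.1.1", 6, 10000 + i, 443) for i in range(1000)]

# Count how many flows plain ECMP sends to each uplink.
counts = Counter(ecmp_next_hop(flow, NEXT_HOPS) for flow in flows)
print(counts)  # a share of the ordinary flows still lands on F
```

Whatever the exact split turns out to be, nothing in the hash knows about the elephant flow, so some of the ordinary flows end up sharing F with it.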

So we need more than just a way to pull an elephant flow onto a specific path. In fact, we also need a way to pull a specific set of flows off a particular path in the ECMP set. Returning to our diagram, assume we want all the traffic other than the elephant flow to be load shared between H and B, and never sent to F. How can we do this with segment routing? This is where a device assigning more than one label comes into play. The process is actually pretty simple from a forwarding perspective.

  • H and B are assigned the same label, so A now has two paths with the same label—let’s say we use 1024 just as an example for the label
  • When A classifies any traffic into the forwarding class that belongs to this label—essentially anything not belonging to the elephant flow we’re trying to isolate—it assigns this single label (1024)
  • A then forwards the traffic with this label (1024) along the ECMP set formed by the two available paths for the label
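The forwarding steps in this list can be sketched as a small label table on A. This is only a model under stated assumptions: label 1024 comes from the example above, while label 2001, the flow keys, the CRC32 hash, and the `forward` function are all hypothetical stand-ins for whatever the switch actually does.

```python
import zlib

# Hypothetical label table on A: label -> ECMP set of next hops.
LABEL_TABLE = {
    1024: ["H", "B"],  # the shared label from the example: H and B only
    2001: ["F"],       # hypothetical label that steers the elephant flow via F
}


def forward(label, flow_key):
    """Return the next hop for a labeled packet by hashing over the label's ECMP set."""
    paths = LABEL_TABLE[label]
    return paths[zlib.crc32(flow_key.encode()) % len(paths)]


# Hypothetical ordinary flows, all classified into label 1024.
mice = [f"10.0.0.1:{port}->10.0.1.1:443" for port in range(10000, 10100)]
mice_hops = {forward(1024, key) for key in mice}

print(mice_hops)                                      # only H and/or B, never F
print(forward(2001, "10.0.0.9:5000->10.0.1.9:9000"))  # F
```

Because F is simply not a member of the ECMP set for 1024, no hash result can place ordinary traffic on the elephant flow's path.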

The hard part here is not the forwarding, however—it’s the classification of the traffic and the provisioning of the tunnels through the fabric. Let me start with the easier problem first—the provisioning of the labels on the fabric. We could, of course, count on some sort of routing protocol, or something like LDP, to provide each router on the fabric with the correct tunnels. Or each device could be configured with, or self-assign, labels as needed, and the control plane could carry these labels to the other devices running the control plane.

But how should such labels be collected and distributed? BGP-LS (BGP Link-State) is probably the cleanest option using a dynamic routing protocol. If OSPF or IS-IS is used, each router advertises its locally assigned labels through segment routing extensions: opaque LSAs in OSPF, and sub-TLVs of reachability TLVs such as the type 22 TLV in IS-IS. While each router in the fabric could learn the label sets this way, there’s no way to bind a label to a forwarding class, a specific type of traffic, or a specific application.

To solve this problem, BGP-LS can be used to gather the labels from the link state protocol and carry them to a controller sitting someplace on the fabric. In a BGP-only fabric, BGP itself can carry the adj-sid labels to the controller directly. The controller can then set the correct policies at each ToR switch to impose the correct SR label stack to pull the traffic along the correct route in the fabric. All these solutions, of course, assume you’re using adj-sids as well as prefix-sids, which is overkill on a spine and leaf fabric.
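As a rough illustration of the controller's side of this, the sketch below turns an explicit path into a label stack using a table of adj-sids. The table stands in for whatever the controller has learned through BGP-LS or BGP; the label values, the `ADJ_SIDS` name, and the `label_stack_for_path` helper are invented for the example, and a real head end may resolve the first hop locally rather than push a label for it.

```python
# Hypothetical adj-sid database as a controller might assemble it.
# Keys are (local router, neighbor); label values are invented.
ADJ_SIDS = {
    ("A", "F"): 3001,
    ("F", "G"): 3002,
    ("G", "D"): 3003,
    ("D", "E"): 3004,
}


def label_stack_for_path(path, adj_sids):
    """Translate a hop-by-hop path into a stack of adj-sid labels, top of stack first."""
    stack = []
    for local, neighbor in zip(path, path[1:]):
        if (local, neighbor) not in adj_sids:
            raise KeyError(f"no adj-sid learned for {local}->{neighbor}")
        stack.append(adj_sids[(local, neighbor)])
    return stack


elephant_stack = label_stack_for_path(["A", "F", "G", "D", "E"], ADJ_SIDS)
print(elephant_stack)  # [3001, 3002, 3003, 3004]
```

The controller would then hand a stack like this to router A as part of a policy, which is the piece the routing protocol alone cannot provide.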

There is another problem here, of course: how to get the ToR switch (router A in this instance) to pull the correct flows into the correct label stack. This problem, however, is an implementation issue on the individual ToR switches. There needs to be some way to classify the flow and, based on this classification, change the forwarding behavior so the correct label stack is pushed onto the packet as it’s forwarded by router A.
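One way to picture the classification step on router A is a match on the flow's five-tuple that selects which label stack to push. This is a sketch only, not how any particular switch implements it: the `Flow` type, the elephant flow's addresses and ports, and the label values other than 1024 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


# Hypothetical elephant flow and label stacks; 1024 is the shared label from the example.
ELEPHANT = Flow("10.0.0.9", "10.0.1.9", 6, 5000, 9000)
ELEPHANT_STACK = [3001, 3002, 3003, 3004]  # invented labels for the path [A,F,G,D,E]
DEFAULT_STACK = [1024]                     # everything else: ECMP across H and B


def classify(flow):
    """Return the label stack to push for this flow."""
    return list(ELEPHANT_STACK) if flow == ELEPHANT else list(DEFAULT_STACK)


def impose(flow, payload):
    """Model a labeled packet as a dict holding the pushed label stack and the payload."""
    return {"labels": classify(flow), "payload": payload}


other = Flow("10.0.0.1", "10.0.1.1", 6, 10000, 443)
print(impose(ELEPHANT, b"bulk")["labels"])  # [3001, 3002, 3003, 3004]
print(impose(other, b"web")["labels"])      # [1024]
```

In practice the match would more likely be a policy installed by the controller than a hard-coded comparison, but the shape is the same: classify first, then let the classification pick the label stack.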

Next time, we’ll wrap this series up with further thoughts on other uses for adj-sids, and also on how we could assign labels and impose policies in a way that avoids the (otherwise unneeded) adj-sids.