Elephant Flows, Fabrics, and I2RS
The last post in this series on I2RS argues that this interface is designed to augment, rather than replace, the normal, distributed routing protocol. What sort of use case could we construct that would use I2RS in this way? What about elephant flows in data center fabrics? An earlier post considers how to solve the elephant flow problem using segment routing (SR); can elephant flows also be guided using I2RS? The network below will be used to consider this question.
Assume that A hashes a long-lived elephant flow, representing some 50% of the total bandwidth available on any single link in the fabric, onto the link towards F. At the same time, A will hash other flows, represented by the red flow lines, onto each of the three links towards the core of the fabric in pretty much equal proportion. Smaller flows that are hashed onto the A->F link will likely suffer, while flows hashed onto the other two links will not.
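The imbalance described above can be sketched in a few lines of Python. This is only an illustration: the link names other than A->F, the addresses, and the flow rates are made up, and SHA-256 stands in for whatever hash the forwarding hardware really uses.

```python
import hashlib

# Hypothetical link names: only A->F is named in the text.
LINKS = ["A->F", "A->core2", "A->core3"]

def pick_link(flow, links=LINKS):
    """Map a 5-tuple to one link with a stable hash, as per-flow ECMP does."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return links[int.from_bytes(digest[:4], "big") % len(links)]

def link_load(flows, links=LINKS):
    """Sum the offered load (as a fraction of link capacity) per link."""
    load = {link: 0.0 for link in links}
    for flow, rate in flows.items():
        load[pick_link(flow, links)] += rate
    return load

# One elephant at 50% of a link plus 30 mice at 2% each (made-up numbers).
elephant = ("10.0.0.1", "10.1.0.9", 6, 40000, 443)
flows = {elephant: 0.50}
for i in range(30):
    flows[("10.0.0.2", "10.1.0.%d" % (i + 10), 6, 50000 + i, 80)] = 0.02

load = link_load(flows)
for link, used in load.items():
    print(link, round(used, 2))
```

The hash looks only at the flow's headers, never at its size, so whichever link receives the elephant carries its 50% on top of a roughly equal share of the mice.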
This is a particularly bad problem in applications that have been decomposed into microservices, as the various components of the application tend to rely on fairly fixed delay and jitter budgets over the network to keep everything synchronized and running quickly. A single congested link can cause a single microservice to behave poorly, often reverberating throughout the entire application.
So how could I2RS be used to prevent, or resolve, this problem? To begin, there must be a controller connected to the network that understands the network topology, and has fairly accurate information about the loading of each link in the fabric. How timely this information needs to be depends, of course, on how long the elephant flows last, the sensitivity of the other flows, and many other factors, but this can all be a part of the information provided to the controller. Given this information, the controller can:
1. Note the utilization spike on the A->F link
2. Determine which flows should be placed on the A->F link, and which ones should be moved off
3. Override the routes for the elephant flow at A so it will only use the A->F link (that is, pin the flow to the link, so a reshuffling of the ECMP set cannot move it to another link)
4. Override the routes for the flows that need to be moved so they are, in fact, moved
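A minimal sketch of the first two steps, deciding what to pin and what to move, might look like the following. The thresholds, the `FlowStat` record, and the `plan_for_link` function are all hypothetical; a real controller would have to tune these values to the fabric and the applications running over it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowStat:
    dest_prefix: str   # destination the flow is routed towards
    rate: float        # offered load as a fraction of link capacity

def plan_for_link(link, utilization, flows,
                  spike_threshold=0.8, elephant_threshold=0.4):
    """Return a pin/move plan for a congested link, or None if no action.

    Both thresholds are assumptions chosen for illustration only.
    """
    if utilization < spike_threshold:
        return None  # no utilization spike, nothing to do
    elephants = [f for f in flows if f.rate >= elephant_threshold]
    if not elephants:
        return None  # congested, but not because of an elephant flow
    mice = [f for f in flows if f.rate < elephant_threshold]
    return {
        "link": link,
        "pin": [f.dest_prefix for f in elephants],   # keep on this link
        "move": [f.dest_prefix for f in mice],       # shift to other links
    }

stats = [FlowStat("10.1.0.9/32", 0.50), FlowStat("10.1.0.10/32", 0.02)]
plan = plan_for_link("A->F", 0.95, stats)
```

The plan only names destinations; turning it into route overrides is the job of the remaining two steps.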
Step 2 might not be easy to do, but since there is a controller in the network, there is no reason the controller couldn’t provide an API that services can use to inform the controller about new (or upcoming scheduled) elephant flows. Information provided by the service, such as the source address, destination address, source port number, etc., could be correlated with information learned from the individual forwarding devices in the network to trace the flow along the fabric.
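The correlation could be as simple as matching the announced flow's 5-tuple against flow records exported by each device. The sketch below assumes a hypothetical record format, a list of (flow key, egress link) pairs per device, and the device and link names other than A and A->F are invented; nothing here reflects a real export protocol or I2RS model.

```python
from typing import NamedTuple

class FlowKey(NamedTuple):
    src: str
    dst: str
    proto: int
    sport: int
    dport: int

def trace_flow(announced, device_records):
    """Return (device, egress link) pairs for devices that saw the flow.

    device_records maps a device name to a list of (FlowKey, egress_link)
    tuples; this shape is an assumption made for the example.
    """
    hops = []
    for device, records in device_records.items():
        for key, egress_link in records:
            if key == announced:
                hops.append((device, egress_link))
    return hops

# Flow announced by a service through the (hypothetical) controller API.
announced = FlowKey("10.0.0.1", "10.1.0.9", 6, 40000, 443)
other = FlowKey("10.0.0.2", "10.1.0.10", 6, 50000, 80)

device_records = {
    "A": [(announced, "A->F"), (other, "A->core2")],
    "F": [(announced, "F->leaf2")],
    "core2": [(other, "core2->leaf2")],
}

path = trace_flow(announced, device_records)
```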
Step 3, pinning the flow to a single path, is actually fairly simple—so long as the flow is represented by a single destination address. If this is the case, the I2RS controller can simply inject a route towards the elephant flow’s destination with a slightly lower metric, which will break the ECMP group up, and cause all the traffic to follow the single link. What if the elephant flow isn’t the only traffic being transmitted to a particular address? I will set this topic aside for the moment, and return to it in a future post.
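To see why a slightly lower metric is enough, consider a toy route selection that compares metrics only. Real RIBs also weigh the route source (preference or administrative distance), which this sketch ignores, and the next hop names other than F are hypothetical.

```python
def best_paths(routes):
    """Return the next hops tied at the lowest metric (the ECMP set)."""
    lowest = min(metric for _, metric in routes)
    return sorted(nh for nh, metric in routes if metric == lowest)

def pin(routes, next_hop, delta=1):
    """Add a route via next_hop that beats the current best metric."""
    lowest = min(metric for _, metric in routes)
    return routes + [(next_hop, lowest - delta)]

# Routes at A towards the elephant flow's destination: (next hop, metric).
routes = [("F", 10), ("spine2", 10), ("spine3", 10)]

before = best_paths(routes)            # three equal-cost next hops
after = best_paths(pin(routes, "F"))   # only the injected route remains best
```

Before the injection all three next hops share the lowest metric, so traffic is hashed across them. Afterwards the injected route is the only one at the lowest metric, and every flow to that destination follows A->F.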
Step 4 is a harder problem to solve. The main problem with this step is there is no such thing as a “negative route” in any routing protocol, or in any RIB implementation. Lacking a “negative route,” there are two possible solutions to this problem:
- Overwrite each route passing over A->F other than the pinned flow with a route that has a higher cost, so the routes along A->F will fall out of the ECMP group
- Overwrite every route in each ECMP set except the one passing over A->F with a route that has a lower cost, so the routes along A->F will fall out of the ECMP group
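Both options can be compared with a small model of A's routing table. The sketch below is an assumption-laden toy: the RIB is a dict of prefix to (next hop, metric) lists, selection looks at metric only, and the prefixes and next hop names other than F are invented.

```python
def ecmp_set(routes):
    """Return the next hops tied at the lowest metric."""
    lowest = min(metric for _, metric in routes)
    return sorted(nh for nh, metric in routes if metric == lowest)

def raise_hot_link(rib, pinned, hot_nh, delta=1):
    """Option 1: raise the cost of the hot-link route for other prefixes."""
    return {
        prefix: list(routes) if prefix == pinned else
        [(nh, m + delta) if nh == hot_nh else (nh, m) for nh, m in routes]
        for prefix, routes in rib.items()
    }

def lower_other_links(rib, pinned, hot_nh, delta=1):
    """Option 2: lower the cost of every route that avoids the hot link."""
    return {
        prefix: list(routes) if prefix == pinned else
        [(nh, m - delta) if nh != hot_nh else (nh, m) for nh, m in routes]
        for prefix, routes in rib.items()
    }

def overrides_needed(before, after):
    """Count the routes the controller had to overwrite."""
    return sum(1 for prefix in before
               for old, new in zip(before[prefix], after[prefix])
               if old != new)

paths = [("F", 10), ("spine2", 10), ("spine3", 10)]
rib = {p: list(paths) for p in ("10.1.0.9/32", "10.1.1.0/24", "10.1.2.0/24")}
pinned = "10.1.0.9/32"  # the elephant flow's destination

opt1 = raise_hot_link(rib, pinned, "F")
opt2 = lower_other_links(rib, pinned, "F")
```

In this model both options push the other prefixes off A->F, but with three uplinks the first overwrites one route per prefix while the second overwrites two.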
The first of these two solutions is simpler to implement, but neither of them is ideal. The controller must use its knowledge of the routing table to construct a (potentially) large number of routes to move traffic off the link to which the elephant flow is pinned. Some suggestions have been made in this area in the I2RS working group (in other words, work is still in progress in this area).
Note that this entire process doesn’t replace whatever distributed control plane is already operating in the network. The controller could gather all the reachability and topology information and build each device’s RIB completely, but there’s no need to do this. The controller can simply let whatever local protocol is running (most likely BGP in the case of a large scale data center fabric) continue to operate, and override whatever routing information is necessary in specific cases.
In the next post, I will work through another use case, and then this series will start looking at specific models, etc.
A question to clarify. Does the controller use only, or primarily, the utilization of link A-F to trigger the four listed actions? According to the diagram, link D-E will be even more congested than A-F. Should the ultimate goal of the controller be to separate the elephant flow from all the micro flows entering through node A? I think link utilization telemetry would provide useful information when only flows entering through node A are present. In a more realistic scenario, where flows enter and leave the fabric everywhere, e2e performance measurement likely provides more meaningful information that can be used as a trigger to change flow pinning.
There are three things you can do with an elephant flow —
1. Try to divide it up into its component bits, so it’s not an elephant flow. Sometimes you just can’t do this, though.
2. Pin the flow to a single path, but then you should also try to do #3, which is —
3. Keep the other traffic off of (or at least control the use of) the path you just pinned the flow to…
It’s #3 I find the hardest to do — and the least discussed in most of the literature I see around this problem. I hope that makes sense (and it’s always possible I have a slippage between the text and diagram!).
Unicast elephant flows are one aspect to pay attention to; another, slightly harder imo because of a few more considerations, is elephant multicast flows (particularly now, with the 4K UHDT streams starting to kick off)… any thoughts on that one?
I think you could still handle these with segment routing, using the multicast segment — the problem will be hardware support for this one…