BGP and Suboptimal Route Reflection

One of the crucial points in understanding the operation of BGP is the reliance on the AS path to ensure all routes are loop-free. Within a single AS, however, there is no AS path. How, then, can you ensure the path through an AS is loop-free? The original plan was to fully mesh all the BGP speakers in the AS (a full mesh of iBGP speakers)—but building and maintaining a full mesh of iBGP speakers is difficult, so other solutions were quickly designed. The first of these as the BGP Confederation, which allows a set of autonomous systems to look like a single AS from the outside. This solution, however, is also cumbersome, so… the RR was invented.

TL;DR[time-span]

  • BGP RR’s abstract information in a way that can cause suboptimal routing
  • To resolve this suboptimal routing, additional paths are advertised to RRCs by RRs

 

The basic operation of an RR is fairy simple; as new attribute, the cluster list, is added to a route as is passes from client to server. The cluster list contains as list of the clusters the route has passed through, identified by the identifier of the route reflector that “heads” the cluster. If a route is advertised to an iBGP speaker with a cluster ID already on the cluster list, the route is ignored. In a sense, each cluster acts as a “sub-AS,” much like in a confederation, and the cluster ID acts as the sub-AS number. Loop-freeness is guaranteed by making certain the route is not advertised as reachable through the same RR cluster twice, much like eBGP loop-freeness is guaranteed by not advertising a route as reachable through the same AS twice.

Route reflectors emulate the eBGP construction in one other way, as well—only the best path from each cluster’s perspective is advertised. This improves scaling dramatically; with a full mesh of iBGP speakers, every BGP speaker will learn about every path available to reach a destination. If there are ten eBGP peers that can reach 2001:db8:3e8:100::/64, for instance, then every BGP speaker in the AS will know about all ten paths. If route reflectors are used instead of a full mesh of iBGP speakers, the RR server will choose the best exit point from the local AS, and advertise only that one route to each of its clients. An important point to note here: the best path is chosen from the perspective of the RR.

For instance, in this diagram, the best path to 2011:db8:3e8:100::/64 for B is through A. Since B is the route reflector, it will advertise only the path through A to its two clients, which are C and E. Because of this, E will not know about the path through C, even though this is a valid path—and a better path, from E’s perspective.

This is a classic case of aggregation and information hiding; B is removing some available routes towards its route reflector clients, which allows the control plane to scale. According to the state/optimization/surface triad, a reduction in the amount of state should have some effect on the optimization of traffic flow through the network. In this case, E has a suboptimal route.

How can this problem be solved? The most obvious way is to add more information back into the control plane. The solutions offered for this problem vary in which information is added back, and how that information is carried.

The first proposal for solving the suboptimal routing problem was for the RR to send all of its routes to each of its clients. Normally, if a BGP speaker sends a route for a destination, and then follows with another update about this same destination, the receiving speaker will discard the first advertisement. This is called an implicit withdraw. RFC7911 specifies a mechanism for a BGP speaker to carry more than one route for the same destination to a peer by including a Route Identifier (RID) with each route. The function of the route identifier is the same as the route distinguisher in a BGP-VPN deployment—to describe two different routes that happen to share the same destination prefix. By setting a different RID on each route, the reflector can advertise all of its route to its clients. The problem with this solution is that it “undoes” the scaling advantages conferred by the RR; every RRC receives every route. In large networks, the scaling and convergence speed differences can be dramatic.

It seems better to send only part of the routes, but which part? An alternative proposal is for the RR to compute the BGP bestpath from the perspective of each of its RRCs and send only the best route from that RRC’s perspective each RRC. There are several tradeoffs here, as well. First, the RRC must calculate the IGP path from each of its RRC’s in some way; this normally means running SPF from the perspective of each RRC. The RR must then run bestpath for each RRC and send the correct route to each one. Finally, the RRC must keep track of which route it has sent to each RRC, so it can determine when it needs to resend a new route based on changes in the network topology—both for changes in BGP and IGP reachability. This adds a lot of complexity to the BGP RR implementation, and creates a rather deep interaction surface between the IGP and BGP. It also just happens to assume there is an underlying link-state IGP in operation—which may not always be true.

Another option is for the RR to send the bestpath and the second-best path to each RRC. The assumption is one of these two paths will either be optimal, or close to optimal, for all RRCs. This is the solution most implementations have opted for, as it seems to strike a good balance between adding more control plane information, implementation complexity, and solving the suboptimal routing problem. Most implementations will allow the operator to send as many “second-best” paths as they desire to, which allows the operator to fine-tune optimization against state differently in diverse parts of their network.

2 Comments

  1. Michael Sifuentes on 22 October 2018 at 9:19 am

    Good article. I did not see a diagram to help illustrate the point.



    • Russ on 27 October 2018 at 1:46 pm

      Weird — I can’t find anything wrong with the image from here. Can you pm me someplace with the source of the article with the broken link? It would be helpful in figuring out what is going on.