Optimal Route Reflection: Next Hop Self

Recently, I posted a video short take I did on BGP optimal route reflection. A reader wrote in the comments to that post:

…why can’t Router set next hop self to updates to router E and avoid this suboptimal path?

To answer this question, it is best to return to the scene of the suboptimality—

To describe the problem again: A and C are sending the same route to B, which is a route reflector. B selects the best path from its perspective, which is through B, and sends this route to each of its clients. In this case, E will learn the path with a next hop of A, even though the path through C is closer from E’s perspective. In the video, I discuss several ways to solve this problem; one option I do not talk about is allowing B to set the next hop to itself. Would this work?

Before answering the question, however, it is important to make one observation: I have drawn this network with B as a router in the forwarding path. In many networks, the route reflector is a virtual machine, or a *nix host, and is not capable of forwarding the traffic required to self the next-hop to itself. There are many advantages to intentionally removing the route reflector from the forwarding path. So while setting nexthop-self might work in this situation, it will not work in all situations.

But will it work in this situation? Not necessarily. The shortest path, for D, is through C, rather than through A. B setting its next hop to itself is going to draw E’s traffic towards 100::/64 towards itself, which is still the longer path from E’s perspective. So while there are situations where setting nexthop-self will resolve this problem, this particular network is not one of them.

Research: BGP Routers and Parrots

The BGP specification suggests implementations should have three tables: the adj-rib-in, the loc-rib, and the adj-rib-out. The first of these three tables should contain the routes (NLRIs and attributes) transmitted by each of the speaker’s peers. The second table should contain the calculated best paths; these are the routes that will be (or are) installed in the local routing table and used to build a forwarding table. The third table contains the routes which have been sent to each peering speaker. Why three tables? Routing protocols standards are (sometimes—not always) written to provide the maximum clarity to how the protocol works to someone who is writing an implementation. Not every table or process described in the specification is implemented, or implemented the way it is described.

What happens when you implement things in a different way than the specification describes? In the case of BGP and the three RIBs, you can get duplicated BGP updates. What do parrots and BGP have in common describes two situations where the lack of a adj-rib-out can cause duplicate BGP updates to be sent.

David Hauweele, Bruno Quoitin, Cristel Pelsser, and Randy Bush. 2016. “What Do Parrots and BGP Rotuers Have in Common?” Computer Communications Review, July.

The authors of this paper begin by observing BGP updates from a full feed off the default free zone. The configuration of the network, however, is designed to provide not only the feed from a BGP speaker, but also the routes received by a BGP speaker, as shown in the illustration below.

In this figure, all the labeled routers are in separate BGP autonomous systems, and the links represent physical connections as well as eBGP sessions. The three BGP updates received by D are stored in three different logs which are time stamped so they can be correlated. The researchers found two instances where duplicate BGP updates were received at D.

In the first case, the best path at C switches between A and B because of the Multiple Exit Discriminator (MED), but the remainder of the update remains the same. C, however, strips the MED before transmitting the route to D, so D simply sees what appears to be duplicate updates. In the second case, the next hop changes because of an implicit withdraw based on a route change for the previous best path. For instance, C might choose A as the best path, but then A implicitly withdraws its path, leaving the path through B as the best. When this occurs, C recalculates the best path and sends it to D; since the next hop is stripped when C advertises the new route to D, this appears to be a duplicate at D.

In both of these cases, if C had an adj-rib-out, it would find the duplicate advertisement and squash it. However, since C has no record of what it has sent to D in the past, it must send information about all local best path changes to D. While this might seem like a trivial amount of processing, these additional updates can add enough load during link flap situations to make a material difference in processor utilization or speed of convergence.

Why do implementors decide not to include an adj-rib-out in their implementations, or why, when one is provided, do operators disable the adj-rib-out? Primarily because the adj-rib-out consumes local memory; it is cheaper to push the work to a peer than it is to keep local state that might only rarely be used. This is a classic case of reducing the complexity of the local implementation by pushing additional state (and hence complexity) into the overall system. The authors of the paper suggest a better balance might be achieved if implementations kept a small cache of the most recent updates transmitted to an adjacent speaker; this would allow the implementation to reduce memory usage, while also allowing it to prevent repeating recent updates.

BGP Hijacks: Two more papers consider the problem

The security of the global Default Free Zone DFZ) has been a topic of much debate and concern for the last twenty years (or more). Two recent papers have brought this issue to the surface once again—it is worth looking at what these two papers add to the mix of what is known, and what solutions might be available. The first of these—

Demchak, Chris, and Yuval Shavitt. 2018. “China’s Maxim – Leave No Access Point Unexploited: The Hidden Story of China Telecom’s BGP Hijacking.” Military Cyber Affairs 3 (1).

—traces the impact of Chinese “state actor” effects on BGP routing in recent years.

cross posted to CircleID

Whether these are actual attacks, or mistakes from human error for various reasons generally cannot be known, but the potential, at least, for serious damage to companies and institutions relying on the DFZ is hard to overestimate. This paper lays out the basic problem, and the works through a number of BGP hijacks in recent years, showing how they misdirected traffic in ways that could have facilitated attacks, whether by mistake or intentionally. For instance, quoting from the paper—

  • Starting from February 2016 and for about 6 months, routes from Canada to Korean government sites were hijacked by China Telecom and routed through China.
  • On October 2016, traffic from several locations in the USA to a large Anglo-American bank
  • headquarters in Milan, Italy was hijacked by China Telecom to China.
  • Traffic from Sweden and Norway to the Japanese network of a large American news organization was hijacked to China for about 6 weeks in April/May 2017.

What impact could such a traffic redirection have? If you can control the path of traffic while a TLS or SSL session is being set up, you can place your server in the middle as an observer. This can, in many situations, be avoided if DNSSEC is deployed to ensure the certificates used in setting up the TLS session is valid, but DNSSEC is not widely deployed, either. Another option is to simply gather encrypted traffic and either attempt to break the key, or use data analytics to understand what the flow is doing (a side channel attack).

What can be done about these kinds of problems? The “simplest”—and most naïve—answer is “let’s just secure BGP.” There are many, many problems with this solution. Some of them are highlighted in the second paper under review—

Bonaventure, Olivier. n.d. “A Survey among Network Operators on BGP Prefix Hijacking – Computer Communication Review.” Accessed November 3, 2018.

—which illustrates the objections providers have to the many forms of BGP security that have been proposed to this point. The first is, of course, that it is expensive. The ROI of the systems proposed thus far are very low; the cost is high, and the benefit to the individual provider is rather low. There is both a race to perfection problem here, as well as a tragedy of the commons problem. The race to perfection problem is this: we will not design, nor push for the deployment of, any system which does not “solve the problem entirely.” This has been the mantra behind BGPSEC, for instance. But not only is BGPSEC expensive—I would say to the point of being impossible to deploy—it is also not perfect.

The second problem in the ROI space is the tragedy of the commons. I cannot do much to prevent other people from misusing my routes. All I can really do is stop myself and my neighbors from misusing other people’s routes. What incentive do I have to try to make the routing in my neighborhood better? The hope that everyone else will do the same. Thus, the only way to maintain the commons of the DFZ is for everyone to work together for the common good. This is difficult. Worse than herding cats.

A second point—not well understood in the security world—is this: a core point of DFZ routing is that when you hand your reachability information to someone else, you lose control over that reachability information. There have been a number of proposals to “solve” this problem, but it is a basic fact that if you cannot control the path traffic takes through your network, then you have no control over the profitability of your network. This tension can be seen in the results of the survey above. People want security, but they do not want to release the information needed to make security happen. Both realities are perfectly rational!

Part of the problem with the “more strict,” and hence (considered) “more perfect” security mechanisms proposed is simply this: they are not quiet enough. They expose far too much information. Even systems designed to prevent information leakage ultimately leak too much.

So… what do real solutions on the ground look like?

One option is for everyone to encrypt all traffic, all the time. This is a point of debate, however, as it also damages the ability of providers to optimize their networks. One point where the plumbing allegory for networking breaks down is this: all bits of water are the same. Not all bits on the wire are the same.

Another option is to rely less on the DFZ. We already seem to be heading in this direction, if Geoff Huston and other researchers are right. Is this a good thing, or a bad one? It is hard to tell from this angle, but a lot of people think it is a bad thing.

Perhaps we should revisit some of the proposed BGP security solutions, reshaping some of them into something that is more realistic and deployable? Perhaps—but the community is going to let go of the “but it’s not perfect” line of thinking, and start developing some practical, deployable solutions that don’t leak so much information.

Finally, there is a solution Leslie Daigle and I have been tilting at for a couple of years now. Finding a way to build a set of open source tools that will allow any operator or provider to quickly and cheaply build an internal system to check the routing information available in their neighborhood on the ‘net, and mix local policy with that information to do some bare bones work to make their neighborhood a little cleaner. This is a lot harder than “just build some software” for various reasons; the work is often difficult—as Leslie says, it is largely a matter of herding cats, rather than inventing new things.

BGP and Suboptimal Route Reflection

One of the crucial points in understanding the operation of BGP is the reliance on the AS path to ensure all routes are loop-free. Within a single AS, however, there is no AS path. How, then, can you ensure the path through an AS is loop-free? The original plan was to fully mesh all the BGP speakers in the AS (a full mesh of iBGP speakers)—but building and maintaining a full mesh of iBGP speakers is difficult, so other solutions were quickly designed. The first of these as the BGP Confederation, which allows a set of autonomous systems to look like a single AS from the outside. This solution, however, is also cumbersome, so… the RR was invented.


  • BGP RR’s abstract information in a way that can cause suboptimal routing
  • To resolve this suboptimal routing, additional paths are advertised to RRCs by RRs


The basic operation of an RR is fairy simple; as new attribute, the cluster list, is added to a route as is passes from client to server. The cluster list contains as list of the clusters the route has passed through, identified by the identifier of the route reflector that “heads” the cluster. If a route is advertised to an iBGP speaker with a cluster ID already on the cluster list, the route is ignored. In a sense, each cluster acts as a “sub-AS,” much like in a confederation, and the cluster ID acts as the sub-AS number. Loop-freeness is guaranteed by making certain the route is not advertised as reachable through the same RR cluster twice, much like eBGP loop-freeness is guaranteed by not advertising a route as reachable through the same AS twice.

Route reflectors emulate the eBGP construction in one other way, as well—only the best path from each cluster’s perspective is advertised. This improves scaling dramatically; with a full mesh of iBGP speakers, every BGP speaker will learn about every path available to reach a destination. If there are ten eBGP peers that can reach 2001:db8:3e8:100::/64, for instance, then every BGP speaker in the AS will know about all ten paths. If route reflectors are used instead of a full mesh of iBGP speakers, the RR server will choose the best exit point from the local AS, and advertise only that one route to each of its clients. An important point to note here: the best path is chosen from the perspective of the RR.

For instance, in this diagram, the best path to 2011:db8:3e8:100::/64 for B is through A. Since B is the route reflector, it will advertise only the path through A to its two clients, which are C and E. Because of this, E will not know about the path through C, even though this is a valid path—and a better path, from E’s perspective.

This is a classic case of aggregation and information hiding; B is removing some available routes towards its route reflector clients, which allows the control plane to scale. According to the state/optimization/surface triad, a reduction in the amount of state should have some effect on the optimization of traffic flow through the network. In this case, E has a suboptimal route.

How can this problem be solved? The most obvious way is to add more information back into the control plane. The solutions offered for this problem vary in which information is added back, and how that information is carried.

The first proposal for solving the suboptimal routing problem was for the RR to send all of its routes to each of its clients. Normally, if a BGP speaker sends a route for a destination, and then follows with another update about this same destination, the receiving speaker will discard the first advertisement. This is called an implicit withdraw. RFC7911 specifies a mechanism for a BGP speaker to carry more than one route for the same destination to a peer by including a Route Identifier (RID) with each route. The function of the route identifier is the same as the route distinguisher in a BGP-VPN deployment—to describe two different routes that happen to share the same destination prefix. By setting a different RID on each route, the reflector can advertise all of its route to its clients. The problem with this solution is that it “undoes” the scaling advantages conferred by the RR; every RRC receives every route. In large networks, the scaling and convergence speed differences can be dramatic.

It seems better to send only part of the routes, but which part? An alternative proposal is for the RR to compute the BGP bestpath from the perspective of each of its RRCs and send only the best route from that RRC’s perspective each RRC. There are several tradeoffs here, as well. First, the RRC must calculate the IGP path from each of its RRC’s in some way; this normally means running SPF from the perspective of each RRC. The RR must then run bestpath for each RRC and send the correct route to each one. Finally, the RRC must keep track of which route it has sent to each RRC, so it can determine when it needs to resend a new route based on changes in the network topology—both for changes in BGP and IGP reachability. This adds a lot of complexity to the BGP RR implementation, and creates a rather deep interaction surface between the IGP and BGP. It also just happens to assume there is an underlying link-state IGP in operation—which may not always be true.

Another option is for the RR to send the bestpath and the second-best path to each RRC. The assumption is one of these two paths will either be optimal, or close to optimal, for all RRCs. This is the solution most implementations have opted for, as it seems to strike a good balance between adding more control plane information, implementation complexity, and solving the suboptimal routing problem. Most implementations will allow the operator to send as many “second-best” paths as they desire to, which allows the operator to fine-tune optimization against state differently in diverse parts of their network.