Optimal Route Reflection: Next Hop Self

Recently, I posted a video short take I did on BGP optimal route reflection. A reader wrote in the comments to that post:

…why can’t Router set next hop self to updates to router E and avoid this suboptimal path?

To answer this question, it is best to return to the scene of the suboptimality—

To describe the problem again: A and C are sending the same route to B, which is a route reflector. B selects the best path from its perspective, which is through B, and sends this route to each of its clients. In this case, E will learn the path with a next hop of A, even though the path through C is closer from E’s perspective. In the video, I discuss several ways to solve this problem; one option I do not talk about is allowing B to set the next hop to itself. Would this work?

Before answering the question, however, it is important to make one observation: I have drawn this network with B as a router in the forwarding path. In many networks, the route reflector is a virtual machine, or a *nix host, and is not capable of forwarding the traffic required to self the next-hop to itself. There are many advantages to intentionally removing the route reflector from the forwarding path. So while setting nexthop-self might work in this situation, it will not work in all situations.

But will it work in this situation? Not necessarily. The shortest path, for D, is through C, rather than through A. B setting its next hop to itself is going to draw E’s traffic towards 100::/64 towards B, which is still the longer path from E’s perspective. So while there are situations where setting nexthop-self will resolve this problem, this particular network is not one of them.

Weekend Reads 120618

Paris Traceroute allows you to deterministically measure particular paths by setting particular flow IDs — effectively setting combinations in packet headers to be treated consistently by routers on a path. Using this trick to run many traceroutes with different flow IDs to a target may reveal an entire topology, but without extra logic, it may involve tens of thousands of traceroute packets to do so. Many of those packets will be sent over hops that never change, making those packets redundant while also consuming bandwidth and time. —Kevin Vermeulen @APNIC

So, when people talk about serverless, what does it mean? Well, it doesn’t mean servers are GONE. Of course not: That “client-server model” is still the backbone of how things are getting done. Serverless refers to a developer’s ability to code, deploy, and create applications without having to know how to do the rest of it, like rack the servers, patch the operating system, and create container images. —Jen Wike Huger @opensource.com

The average size of Distributed Denial of Service (DDoS) attacks — where attackers seek to take down a website, application, or infrastructure by flooding it with requests — increased by 24% in 2018. In addition, the attack peak sizes have skyrocketed, as the memcached-based attacks that started in February 2018 ushered in the terabit era of attacks with a 1.7 Tbps attack in late February. —Steinthor Bjarnason @APNIC

In 2018, there was a large upswing in attacks using SSDP reflection, where the UDP source port wasn’t 1900 but a random value. To make a long story short, this is due to a bug in a popular Linux library called libupnp, which is included in most customer premises equipment (CPE) used to connect consumers to the Internet. These kinds of attacks are referred to as ‘SSDP diffraction’ to distinguish them from normal SSDP attacks. —Steinthor Bjarnason @APNIC

Many of the attack vectors that are used for DDoS attacks use services that are usually not sent across the Internet. Among those are SSDP, memcached, chargen and others. In addition, protocols like NTP, which are commonly used across the Internet, generate very little traffic and should not be allowed at high volumes. —Steinthor Bjarnason @APNIC

It looks like the smartphone party has come to an end. The slow down which began in the 2013/2014 timeframe has shifted to decline phase with fewer smartphones sold in 2017 as compared to the 2016 numbers. @CircleID

What if a hyperscale rack designer decided not to locally optimize legacy form factors for thermal management, but instead to start over and design a rack based on optimizing thermal efficiency? —Paul Teich @The Next Platform

Research: BGP Routers and Parrots

The BGP specification suggests implementations should have three tables: the adj-rib-in, the loc-rib, and the adj-rib-out. The first of these three tables should contain the routes (NLRIs and attributes) transmitted by each of the speaker’s peers. The second table should contain the calculated best paths; these are the routes that will be (or are) installed in the local routing table and used to build a forwarding table. The third table contains the routes which have been sent to each peering speaker. Why three tables? Routing protocols standards are (sometimes—not always) written to provide the maximum clarity to how the protocol works to someone who is writing an implementation. Not every table or process described in the specification is implemented, or implemented the way it is described.

What happens when you implement things in a different way than the specification describes? In the case of BGP and the three RIBs, you can get duplicated BGP updates. What do parrots and BGP have in common describes two situations where the lack of a adj-rib-out can cause duplicate BGP updates to be sent.

David Hauweele, Bruno Quoitin, Cristel Pelsser, and Randy Bush. 2016. “What Do Parrots and BGP Rotuers Have in Common?” Computer Communications Review, July. http://ccracmsigcomm.info.ucl.ac.be/wp-content/uploads/2016/07/sigcomm-ccr-paper26.pdf.

The authors of this paper begin by observing BGP updates from a full feed off the default free zone. The configuration of the network, however, is designed to provide not only the feed from a BGP speaker, but also the routes received by a BGP speaker, as shown in the illustration below.

In this figure, all the labeled routers are in separate BGP autonomous systems, and the links represent physical connections as well as eBGP sessions. The three BGP updates received by D are stored in three different logs which are time stamped so they can be correlated. The researchers found two instances where duplicate BGP updates were received at D.

In the first case, the best path at C switches between A and B because of the Multiple Exit Discriminator (MED), but the remainder of the update remains the same. C, however, strips the MED before transmitting the route to D, so D simply sees what appears to be duplicate updates. In the second case, the next hop changes because of an implicit withdraw based on a route change for the previous best path. For instance, C might choose A as the best path, but then A implicitly withdraws its path, leaving the path through B as the best. When this occurs, C recalculates the best path and sends it to D; since the next hop is stripped when C advertises the new route to D, this appears to be a duplicate at D.

In both of these cases, if C had an adj-rib-out, it would find the duplicate advertisement and squash it. However, since C has no record of what it has sent to D in the past, it must send information about all local best path changes to D. While this might seem like a trivial amount of processing, these additional updates can add enough load during link flap situations to make a material difference in processor utilization or speed of convergence.

Why do implementors decide not to include an adj-rib-out in their implementations, or why, when one is provided, do operators disable the adj-rib-out? Primarily because the adj-rib-out consumes local memory; it is cheaper to push the work to a peer than it is to keep local state that might only rarely be used. This is a classic case of reducing the complexity of the local implementation by pushing additional state (and hence complexity) into the overall system. The authors of the paper suggest a better balance might be achieved if implementations kept a small cache of the most recent updates transmitted to an adjacent speaker; this would allow the implementation to reduce memory usage, while also allowing it to prevent repeating recent updates.