Disaggregation and Business Value
I recently spoke at CHINOG on the business value of disaggregation, and participated in a panel on getting involved in the IETF. If you’re interested in these two talks, the videos are linked below.
Multicast is, at best, difficult to deploy in large scale networks—PIM sparse and BIDIR are both complex, adding large amounts of state to intermediate devices. In the worst case, there is no apparent way to deploy any existing version of PIM, such as large-scale spine and leaf networks (variations on the venerable Clos fabric). BEIR, described in RFC8279, aims to solve the per-device state of traditional multicast.
In this network, assume A has some packet that needs to be delivered to T, V, and X. A could generate three packets, each one addressed to one of the destinations—but replicating the packet at A is wastes network resources on the A->B link, at least. Using PIM, these three destinations could be placed in a multicast group (a multicast address can be created that describes T, V, and X as a single destination). After this, a reverse shortest path tree can be calculated from each of the destinations in the group towards the source, A, and the correct forwarding state (the outgoing interface list) be installed at each of the routers in the network (or at least along the correct paths). This, however, adds a lot of state to the network.
BIER provides another option.
Returning to the example, if A is the sender, B is called the Bit-Forwarding Ingress Router, or BFIR. When B receives this packet from A, it determines T, V, and X are the correct destinations, and examines its local routing table to determine T is connected to M, V, is connected to N, and X is connected to R. Each of these are called a Bit-Forwarding Egress Router (BFER) in the network; each has a particular bit set aside in a bit field. For instance, if the network operator has set up an 8-bit BIER bit field, M might be bit 3, N might be bit 4, and R might be bit 6.
Given this, A will examine its local routing table to find the shortest path to M, N, and R. It will find the shortest path to M is through D, the shortest path to N is through C, and the shortest path to R is through C. A will then create two copies of the packet (it will replicate the packet). It will encapsulate one copy in a BIER header, setting the third bit in the header, and send the packet on to D. It will encapsulate the second copy into a BIER header, as well, setting the fourth and sixth bits in the header, and send the packet on to C.
D will receive the packet, determine the destination is M (because the third bit is set in the BIER header), look up the shortest path to M in its local routing table, and forward the packet. When M receives the packet, it pops the BIER header, finds T is the final destination address, and forwards the packet.
When C receives the packet, it finds there are two destinations, R and N, based on the bits set in the BIER header. The shortest path to N is through H, so it clears bit 6 (representing R as a destination), and forwards the packet to H. When H receives the packet, it finds N is the destination (because bit 4 is set in the BIER header), looks up N in its local routing table, and forwards the packet along the shortest path towards N. N will remove the BIER header and forward the packet to V. C will also forward the packet to G after clearing bit 4 (representing N as a destination BFER). G will examine the BIER header, find that R is a destination, examine its local routing table for the shortest path to R, and forward the packet. R, on receiving the packet, will strip the BIER header and forward the packet to X.
While the BFIR must know how to translate a single packet into multiple destinations, either through a group address, manual configuration, or some other signaling mechanism, the state at the remaining routers in the network is a simple table of BIER bitfield mappings to BFER destination addresses. It does not matter how large the tree gets, the amount of state is held to the number of BFERs at the network edge. This is completely different from any version of PIM, where the state at every router along the path depends on the number of senders and receivers.
The tradeoff is the hardware of the BIER Forwarding Routers (BFRs) must be able to support looking up the destination based on the bit set in the BIER bitfield in the header. Since the size of the BIER bitfield is variable, this is… challenging. One option is to use an MPLS label stack as a way to express the BIER header, which leverages existing MPLS implementations. Still, though, MPLS label depth is often limited, the ability to forward to multiple destinations is challenging, and the additional signaling required is more than trivial.
BIER is an interesting technology; it seems ideal for use in highly meshed data center fabrics, and even in longer haul and wireless networks where bandwidth is at a premium. Whether or not the hardware challenges can be overcome is a question yet to be answered.
Recent Changes in LSR
Research: BGP Routers and Parrots
The BGP specification suggests implementations should have three tables: the adj-rib-in, the loc-rib, and the adj-rib-out. The first of these three tables should contain the routes (NLRIs and attributes) transmitted by each of the speaker’s peers. The second table should contain the calculated best paths; these are the routes that will be (or are) installed in the local routing table and used to build a forwarding table. The third table contains the routes which have been sent to each peering speaker. Why three tables? Routing protocols standards are (sometimes—not always) written to provide the maximum clarity to how the protocol works to someone who is writing an implementation. Not every table or process described in the specification is implemented, or implemented the way it is described.
What happens when you implement things in a different way than the specification describes? In the case of BGP and the three RIBs, you can get duplicated BGP updates. What do parrots and BGP have in common describes two situations where the lack of a adj-rib-out can cause duplicate BGP updates to be sent.
The authors of this paper begin by observing BGP updates from a full feed off the default free zone. The configuration of the network, however, is designed to provide not only the feed from a BGP speaker, but also the routes received by a BGP speaker, as shown in the illustration below.
In this figure, all the labeled routers are in separate BGP autonomous systems, and the links represent physical connections as well as eBGP sessions. The three BGP updates received by D are stored in three different logs which are time stamped so they can be correlated. The researchers found two instances where duplicate BGP updates were received at D.
In the first case, the best path at C switches between A and B because of the Multiple Exit Discriminator (MED), but the remainder of the update remains the same. C, however, strips the MED before transmitting the route to D, so D simply sees what appears to be duplicate updates. In the second case, the next hop changes because of an implicit withdraw based on a route change for the previous best path. For instance, C might choose A as the best path, but then A implicitly withdraws its path, leaving the path through B as the best. When this occurs, C recalculates the best path and sends it to D; since the next hop is stripped when C advertises the new route to D, this appears to be a duplicate at D.
In both of these cases, if C had an adj-rib-out, it would find the duplicate advertisement and squash it. However, since C has no record of what it has sent to D in the past, it must send information about all local best path changes to D. While this might seem like a trivial amount of processing, these additional updates can add enough load during link flap situations to make a material difference in processor utilization or speed of convergence.
Why do implementors decide not to include an adj-rib-out in their implementations, or why, when one is provided, do operators disable the adj-rib-out? Primarily because the adj-rib-out consumes local memory; it is cheaper to push the work to a peer than it is to keep local state that might only rarely be used. This is a classic case of reducing the complexity of the local implementation by pushing additional state (and hence complexity) into the overall system. The authors of the paper suggest a better balance might be achieved if implementations kept a small cache of the most recent updates transmitted to an adjacent speaker; this would allow the implementation to reduce memory usage, while also allowing it to prevent repeating recent updates.
IPv6 Security Considerations
When rolling out a new protocol such as IPv6, it is useful to consider the changes to security posture, particularly the network’s attack surface. While protocol security discussions are widely available, there is often not “one place” where you can go to get information about potential attacks, references to research about those attacks, potential counters, and operational challenges. In the case of IPv6, however, there is “one place” you can find all this information: draft-ietf-opsec-v6. This document is designed to provide information to operators about IPv6 security based on solid operational experience—and it is a must read if you have either deployed IPv6 or are thinking about deploying IPv6.
The draft is broken up into four broad sections; the first is the longest, addressing generic security considerations. The first consideration is whether operators should use Provider Independent (PI) or Provider Assigned (PA) address space. One of the dangers with a large address space is the sheer size of the potential routing table in the Default Free Zone (DFZ). If every network operator opted for an IPv6 /32, the potential size of the DFZ routing table is 2.4 billion routing entries. If you thought converging on about 800,000 routes is bad, just wait ‘til there are 2.4 billion routes. Of course, the actual PI space is being handed out on /48 boundaries, which makes the potential table size exponentially larger. PI space, then, is “bad for the Internet” in some very important ways.
This document provides the other side of the argument—security is an issue with PA space. While IPv6 was supposed to make renumbering as “easy as flipping a switch,” it does not, in fact, come anywhere near this. Some reports indicate IPv6 re-addressing is more difficult than IPv4. Long, difficult renumbering processes indicate many opportunities for failures in security, and hence a large attack surface. Preferring PI space over PA space becomes a matter of reducing the operational attack surface.
Another interesting question when managing an IPv6 network is whether static addressing should be used for some services, or if all addresses should be dynamically learned. There is a perception out there that because the IPv6 address space is so large, it cannot be “scanned” to find hosts to attack. As pointed out in this draft, there is research showing this is simply not true. Further, static addresses may expose specific servers or services to easy recognition by an attacker. The point the authors make here is that either way, endpoint security needs to rely on actual security mechanisms, rather than on hiding addresses in some way.
Other very useful topics considered here are Unique Local Addresses (ULAs), numbering and managing point-to-point links, privacy extensions for SLAAC, using a /64 per host, extension headers, securing DHCP, ND/RA filtering, and control plane security.
If you are deploying, or thinking about deploying, IPv6 in your network, this is a “must read” document.
On the ‘net: A Riff on RIFT
Today, an update on some compelling projects at IETF 102. Ours guest are Jeff Tantsura and Russ White. We review the following projects to see what’s new and understand what problems they’re solving: RIFT (Routing In Fat Trees), BIER (Bit Indexed Explicit Replication), PPR (Preferred Path Routing), and YANG data modeling. We also look at the state of SD-WAN, which is a bit of the Wild West, to look at standards and interoperability efforts underway. @Packet Pushers
Recent BGP Peering Enhancements
BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.
The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained, and the peering shutdown has the minimal effect possible. The normal way of shutting down a link for maintenance would be to for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain, before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).
Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP has been shut down, whether it coming back up again, how long the shutdown is expected to last, etc.
Graceful Restart (GR) is a long available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router switches to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted through the NOTIFICATION/session restart process. This same processing takes place for a hold timer timeout in BGP.
The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.
Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.
In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise C as the next hop towards the destination (for instance, a route server would not normally advertise itself as the nexthop towards a destination). The problem with this situation is that B might not have the same capabilities as a BGP speaker as A. If B, for instance, cannot forward for IPv6, the situation shown in the illustration would clearly not work.
To resolve this, BGP Next-Hop dependent capabilities allows a speaker to advertise the capabilities of these alternate next hops to peered BGP speakers.