BGP

Is BGP Good Enough?

In a recent podcast, Ivan and Dinesh ask why there is so much interest in running link state protocols on data center fabrics. They begin with this point: if you have fewer than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.

TL;DR

  • BGP is both simple and complex, depending on your perspective
  • BGP is sometimes too much, and sometimes too little for data center fabrics
  • We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer

 
Will these contentions stand up to a rigorous challenge?

I will begin with the last contention: BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more challenging than BGP’s requirement, which is much closer to a distance vector protocol’s: in BGP, you only care whether any pair of speakers has enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. It certainly seems, on this score, like BGP is simpler.

Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.

When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.

There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems, and the most common hard problem complexity is deployed to solve is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?

Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you; they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.
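To see why this case is so hard on BGP, consider path hunting: when a path is withdrawn, each speaker falls back to, and readvertises, progressively longer stale paths before finally giving up. Below is a toy, synchronous model of this behavior; the four-node topology, the names, and the round-based timing are illustrative assumptions, not a model of any particular fabric or implementation.

```python
# Toy synchronous model of BGP path hunting. Each round, every speaker
# recomputes its best path from what its neighbors last advertised; all
# changes are then delivered at once.
neighbors = {
    "A": ("B", "C"),
    "B": ("A", "C", "D"),
    "C": ("A", "B", "D"),
    "D": ("B", "C"),
}
rib_in = {n: {} for n in neighbors}  # per node: neighbor -> AS path or None
best = {n: None for n in neighbors}  # current best path, self included
originator = "D"                     # D originates the prefix

def compute_best(node):
    if node == originator:
        return (node,)
    candidates = [(node,) + path for path in rib_in[node].values()
                  if path is not None and node not in path]  # loop detection
    return min(candidates, key=len, default=None)

def run_to_convergence():
    rounds = 0
    while True:
        changed = {}
        for node in neighbors:
            new_best = compute_best(node)
            if new_best != best[node]:
                best[node] = new_best
                changed[node] = new_best
        if not changed:
            return rounds
        rounds += 1
        print("round", rounds, {n: best[n] for n in sorted(best)})
        for node, path in changed.items():  # deliver updates and withdraws
            for peer in neighbors[node]:
                rib_in[peer][node] = path

run_to_convergence()  # build the initial steady state
originator = None     # withdraw the prefix at D...
run_to_convergence()  # ...and watch B and C readvertise ever longer stale
                      # paths before every speaker finally gives up
```

Running this, the withdrawal takes several rounds of churn—B briefly believes it can reach the destination through C, and C through A and B—before the network quiesces. A link state protocol floods the single topology change and runs SPF once; there is nothing to hunt through.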

Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.

In the center of the illustration above there is a small grey circle representing the core features of BGP. The sections of the ten-sided figure around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see one “slice”: the core functionality and what we are building on top of it. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.

Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.

BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.

When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The eVPN part. The intra-AS parts. The inter-AS parts. All of it. The complexity may appear to be low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is very high. In fact, by any reasonable measure, BGP might have the simplest set of core functions, but it is the most complicated routing protocol in existence.

In other words, complexity is sometimes a matter of perspective. From this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as they understand BGP.

Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?
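For reference, the mechanism WRED uses to break this synchronization is small enough to sketch. The following implements the linear drop-probability ramp at the core of Floyd and Jacobson’s RED (which WRED applies per traffic class); real implementations add a count-based correction omitted here, and the parameters below are purely illustrative.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Probability of dropping an arriving packet, given the average queue depth."""
    if avg_queue < min_th:
        return 0.0                  # below the floor: never drop
    if avg_queue >= max_th:
        return 1.0                  # above the ceiling: drop everything
    # In between, ramp linearly so competing TCP streams back off at
    # different times instead of synchronizing their windows.
    return max_p * (avg_queue - min_th) / (max_th - min_th)

for depth in (10, 25, 35, 45):
    print(depth, red_drop_probability(depth, min_th=20, max_th=40, max_p=0.1))
```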

Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.

Is BGP “good enough?” For some applications, it is. For others, however, it might not be.

You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: if you haven’t found the trade-offs, you haven’t looked hard enough.

Research: Facebook’s Edge Fabric

The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (AS’), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue this is not enough.

Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI: https://doi.org/10.1145/3098822.3098853

Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.

Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.

The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) compute and storage pushed to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit provided through peering across the public ‘net. The problem Facebook is trying to solve is not last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.

The authors begin with the observation that, if left to its own decision process, BGP will evenly distribute traffic across all available peers, even though each peer is actually experiencing a different level of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that used its ability to choose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.

The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be the shortest physical or logical path as well, and hence the path with the lowest end-to-end delay. In the same way, local preference is normally set to prefer peer connections over upstream or transit connections. This should mean traffic will take a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. It should also mean traffic passes through more lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance rather than enhance it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.

How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—

The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?

Note that no solution relying on static metrics for routing traffic will be able to solve these problems. What is required is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
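As a concrete illustration of the layering, here is a minimal sketch in which BGP supplies the set of loop-free egress candidates and a separate measurement layer chooses among them. The peer names, RTT samples, and headroom threshold are invented for the example; they are not taken from the paper.

```python
from statistics import median

def choose_egress(candidates, rtt_samples, headroom):
    """Pick an egress among BGP's loop-free candidates using live data."""
    usable = [p for p in candidates if headroom[p] > 0.10]  # skip congested
    if not usable:
        usable = candidates     # fall back to BGP's unmodified choice
    return min(usable, key=lambda p: median(rtt_samples[p]))

candidates = ["direct_peer", "transit"]  # both loop-free, according to BGP
rtts = {"direct_peer": [40, 42, 95, 90], "transit": [48, 47, 49, 50]}
headroom = {"direct_peer": 0.02, "transit": 0.35}  # fraction of link free

# Static policy (local preference) would pick the direct peer; the
# measurement layer sees it is congested and steers traffic to transit.
print(choose_egress(candidates, rtts, headroom))  # transit
```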

Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.

A few items of note from this research.

First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.

Second, there is a fair amount of additional complexity in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase performance, this is not always true. It likely is for many hyperscalers, whose business relies largely on engagement. Given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.

Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities providers allow their customers to set to shape the direction of traffic flows and optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system—if this kind of wrestling match with performance is valuable in your environment.

Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”

Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops, or that reduces performance in the face of negative feedback loops. Hysteresis—the tendency to cause a performance problem in the process of reacting to a performance problem—must be carefully considered when designing such a system, as well.

The Bottom Line

Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.

Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.

Research: Are We There Yet? RPKI Deployment Considered

The Resource Public Key Infrastructure (RPKI) system is designed to prevent hijacking of routes at their origin AS. If you don’t know how this system works (and it is likely you don’t, because there are only a few deployments in the world), you can review the way the system works by reading through this post here on rule11.tech.

Gilad, Yossi & Cohen, Avichai & Herzberg, Amir & Schapira, Michael & Shulman, Haya. (2017). Are We There Yet? On RPKI’s Deployment and Security. 10.14722/ndss.2017.23123.

The paper under review today examines how widely Route Origin Validation (ROV) based on the RPKI system has been deployed. The authors began by determining which Autonomous Systems (AS’) are definitely not deploying route origin validation. They did this by comparing the routes in the global RPKI database, which is synchronized among all the AS’ deploying the RPKI, to the routes in the global Default Free Zone (DFZ), as seen from 44 different route servers located throughout the world. In comparing these two, they found a set of routes which the RPKI system indicated should be originated from one AS, but were actually being originated from another AS in the default free zone.

Using this information, the researchers then looked for AS’ through which these routes with a mismatched RPKI and global table origin were advertised. If an AS accepted, and then readvertised, routes with mismatched RPKI and global table origins, they marked this AS as one that does not enforce route origin authentication.

A second, similar check was used to find the mirror set of AS’, those that do perform a route origin validation check. In this case, the authors traced the same type of route—those for which the origin AS the route is advertised with does not match the originating AS in the RPKI—and discovered some AS’ will not readvertise such a route. These AS’ apparently do perform a check for the correct route origin information.
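The inference itself is simple to express. The sketch below assumes simplified inputs—a table of ROAs mapping each prefix to its authorized origin AS, and observed routes carrying the origin seen in the DFZ plus the list of ASes that readvertised them; all the numbers are invented for illustration.

```python
roas = {"192.0.2.0/24": 64500}  # prefix -> authorized origin AS

observed_routes = [
    # (prefix, observed origin AS, ASes that readvertised the route)
    ("192.0.2.0/24", 64666, [64510, 64520]),  # origin mismatches the ROA
    ("192.0.2.0/24", 64500, [64530]),         # origin matches
]

non_enforcing = set()
for prefix, origin, propagators in observed_routes:
    authorized = roas.get(prefix)
    if authorized is not None and origin != authorized:
        # Any AS that accepted and readvertised an RPKI-invalid route is
        # marked as not enforcing route origin validation.
        non_enforcing.update(propagators)

print(sorted(non_enforcing))  # [64510, 64520]
```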

The result is that only one of the 20 Internet Service Providers (ISPs) with the largest number of customers performs route origin validation on the routes it receives. Out of the largest 100 ISPs (again based on customer AS count), 22 appear to perform a route origin validation check. These are very low numbers.

To double check these numbers, the researchers surveyed a group of ISPs, and found that very few of them claim to check the routes they receive against the RPKI database. Why is this? When asked, these providers gave two reasons.

First, these providers are concerned about their connectivity being impacted by an RPKI system failure. For instance, it would be easy enough for a company to become involved in a contract dispute with their registry, or with some other organization (two organizations claiming the same AS number, for instance). These kinds of cases could result in many years of litigation, causing a company to effectively lose their connectivity to the global ‘net during the process. This might seem like a minor fear for some, and there might be possible mitigations, but the ‘net is much more statically defined than many people realize, and many operators operate on a razor thin margin. The disruptions caused by such an event could simply put a company out of business.

Second, there is a general perception that the RPKI database is not exactly a “clean” representation of the real world. Since the database is essentially self-reported, there is little incentive to make changes to the database once something in the real world has changed (such as the transfer of address space between organizations). It only takes a small amount of old, stale, or incorrect information to reduce the usefulness of this kind of public database. The authors address this concern by examining the contents of the RPKI, and find that it does, in fact, contain a good bit of incorrect information. They develop a tool to help administrators find this information, but ultimately people must use these kinds of tools.

The point of the paper is that the RPKI system, which is seen as crucial to the security of the global Internet, is not being widely used, and deployment does not appear to be increasing over time. One possible takeaway is the community needs to band together and deploy this technology. Another might be that the RPKI is not a viable solution to the problem at hand for various technical and social reasons—it might be time to start looking for another alternative for solving this problem.

Recent BGP Peering Enhancements

BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.

The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained and the peering shutdown has the minimal effect possible. The normal way of shutting down a link for maintenance would be for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).

Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP session has been shut down, whether it is coming back up again, how long the shutdown is expected to last, etc.

Graceful Restart (GR) is a long-available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router to switch to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted, through the NOTIFICATION/session restart process. The same processing takes place for a hold timer timeout in BGP.

The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.

Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.

In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise C as the next hop towards the destination (for instance, a route server would not normally advertise itself as the nexthop towards a destination). The problem with this situation is that B might not have the same capabilities as a BGP speaker as A. If B, for instance, cannot forward for IPv6, the situation shown in the illustration would clearly not work.

To resolve this, BGP Next-Hop dependent capabilities allows a speaker to advertise the capabilities of these alternate next hops to peered BGP speakers.

RIPE NCC: The Future of BGP Security

I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.

Reaction: DNS Complexity Lessons

Recently, Bert Hubert wrote of a growing problem in the networking world: the complexity of DNS. We have two systems we all use in the Internet, DNS and BGP. Both of these systems appear to be able to handle anything we can throw at them and “keep on ticking.”

This article was crossposted to CircleID.

But how far can we drive the complexity of these systems before they ultimately fail? Bert posted this chart to the APNIC blog to illustrate the problem—

I am old enough to remember when the entire Cisco IOS Software (classic) code base was under 150,000 lines; today, I suspect most BGP and DNS implementations are well over this size. Consider this for a moment—a single protocol implementation that is larger than an entire Network Operating System ten to fifteen years back.

What really grabbed my attention, though, was one of the reasons Bert believes we have these complexity problems—

DNS developers frequently see immense complexity not as a problem but as a welcome challenge to be overcome. We say ‘yes’ to things we should say ‘no’ to. Less gifted developer communities would have to say no automatically since they simply would not be able to implement all that new stuff. We do not have this problem. We’re also too proud to say we find something (too) hard.

How often is this the problem in network design and deployment? “Oh, you want a stretched Ethernet link between two data centers 150 miles apart, and you want an eVPN control plane on top of the stretched Ethernet to support MPLS Traffic Engineering, and you want…” All the while the equipment budget is ringing up numbers in our heads, and the really cool stuff we will be able to play with is building up on the list we are writing in front of us. Then you hear the ultimate challenge—”if you were a real engineer, you could figure out how to do this all with a pair of routers I can buy down at the local office supply store.”

Some problems just do not need to be solved in the current system. Some problems just need to have their own system built for them, rather than reusing the same old stuff because, well, “we can.”

The real engineer is the one who knows how to say “no.”

On the ‘web: The Value of MANRS

Route leaks and Distributed Denial of Service (DDoS) attacks have been in the news a good deal over the last several years; but the average non-transit network operator might generally feel pretty helpless in the face of the onslaught. Perhaps you can buy a DDoS mitigation service or appliance, and deploy the ubiquitous firewall at the edge of your network, but there is not much else to be done, right? Or maybe wait on the Internet at large to “do something” about these problems by deploying some sort of BGP security. But will adopting a “secure edge,” and waiting for someone else to solve the problem, really help?

Section 10 Routing Loops

A (long) time ago, a reader asked me about RFC4456, section 10, which says:

Care should be taken to make sure that none of the BGP path attributes defined above can be modified through configuration when exchanging internal routing information between RRs and Clients and Non-Clients. Their modification could potentially result in routing loops. In addition, when a RR reflects a route, it SHOULD NOT modify the following path attributes: NEXT_HOP, AS_PATH, LOCAL_PREF, and MED. Their modification could potentially result in routing loops.

On first reading, this seems a little strange—how could modifying the next hop, Local Preference, or MED at a route reflector cause a routing loop? While contrived, the following network illustrates the principle.

Note the best path, from an IGP perspective, from C to E is through B, and the best path, from an IGP perspective, from B to D is through C. In this case, a route is advertised over eBGP from F towards E and D. These two eBGP speakers, in turn, advertise the route to their iBGP neighbors, B and C. Both B and C are route reflectors, so they both reflect the route on to A, which advertises the route to some other eBGP speaker outside AS65000 (not shown in the network diagram). In this case, assume the best path (for whatever reason) should be the route learned through D.

What happens if C changes the next hop for the route so it points to E rather than D? This should be fine, at first glance; when E receives traffic for the destination reachable through F, it will use the local eBGP route learned from F directly to forward the traffic. But there is a subtle problem here. Assume A receives both routes, one from B with a next hop of D, and one from C with a next hop of E. A, for whatever reason, chooses the path with a next hop of D. The best path to D, according to the IGP metrics, is through C, so A forwards the traffic to C.

C, however, has been configured to set the next hop to E through a local configuration. The best IGP path to E is through B, so C will forward the traffic towards B to be forwarded to E. B, however, has a next hop towards this destination of D, so when it receives packets destined beyond F in AS65001, it will examine its local routing table for the best path towards D, and find this is through C. Hence, B will forward the traffic to C to be forwarded towards D.

Thus a routing loop is formed because the best IGP path towards the next hop always points through another router with a next hop that points back to the router forwarding the traffic. The problem is B and C have inconsistent bestpaths, such that they each think the bestpath is through one another.
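The loop is easy to see by tracing hop-by-hop forwarding directly. The sketch below encodes the IGP facts given in the example (C reaches E through B, B reaches D through C) along with the modified BGP next hop at C; the specific adjacencies are my assumptions, chosen to be consistent with the text.

```python
bgp_next_hop = {"A": "D", "B": "D", "C": "E"}  # C's next hop was modified

igp_next_hop = {            # router -> {BGP next hop: IGP neighbor toward it}
    "A": {"D": "C", "E": "C"},
    "B": {"D": "C", "E": "E"},  # assumes B is directly adjacent to E
    "C": {"D": "D", "E": "B"},  # assumes C is directly adjacent to D
}

hop, seen = "A", []
while hop not in ("D", "E"):           # D and E can deliver via eBGP
    if hop in seen:
        print("loop:", " -> ".join(seen + [hop]))
        break
    seen.append(hop)
    hop = igp_next_hop[hop][bgp_next_hop[hop]]  # forward toward BGP next hop
else:
    print("delivered via", " -> ".join(seen + [hop]))
```

The trace prints loop: A -> C -> B -> C, which is exactly the B/C ping-pong described above.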

This is, of course, an artifact of overlaying two different control planes, each with their own rules about how to determine a loop free path to any given destination. This sort of problem can arise with any pair of control planes overlaid in this way.

What about MED, Local Preference, or the AS Path? C could modify any of these while reflecting the route to cause E to be chosen as the best exit point locally, while B and A continue to choose D as the best exit point. Any of these, then, can be used to create a routing loop in this topology.

Again, this is a somewhat contrived example, but if a loop can be contrived, then it will likely show up in more complex (and not-so-contrived) networks in the real world. It would be much easier to create a loop with a hierarchical route reflector, or even by causing an inconsistent route advertisement on the AS edge (two different eBGP speakers advertising different paths to a given destination reachable through the local AS).

Flowspec and RFC1998?

In a recent comment, Dave Raney asked:

Russ, I read your latest blog post on BGP. I have been curious about another development. Specifically is there still any work related to using BGP Flowspec in a similar fashion to RFC1998. In which a customer of a provider will be able to ask a provider to discard traffic using a flowspec rule at the provider edge. I saw that these were in development and are similar but both appear defunct. BGP Flowspec-ORF https://www.ietf.org/proceedings/93/slides/slides-93-idr-19.pdf BGP Flowspec Redirect https://tools.ietf.org/html/draft-ietf-idr-flowspec-redirect-ip-02.

This is a good question—to which there are two answers. The first is that this service does exist. While it’s not widely publicized, a number of transit providers do, in fact, offer the ability to send them a flowspec community which will cause them to set a filter on their end of the link. This kind of service is immensely useful for countering Distributed Denial of Service (DDoS) attacks, of course. The problem is such services are expensive. The one provider I have personal experience with charges per prefix, and the cost is high enough to make it much less attractive.

Why would the cost be so high? The same reason a lot of providers do not filter for unicast Reverse Path Forwarding (uRPF) failures at scale—per packet filtering is very performance intensive, sometimes requiring recycling the packet in the ASIC. A line card normally able to support x customers without filtering may only be able to support x/2 customers with filtering. The provider has to pay for additional space, power, and configuration (the flowspec rules must be configured and maintained on the customer facing router). All of these things are costs the provider is going to pass on to their customers. The cost is high enough that I know very few network operators (in fact, none) who will pay for this kind of service.

The second answer is there is another kind of service that is similar to what Dave is asking about. Many DDoS protection services offer their customers the ability to signal a request to the provider to block traffic from a particular source, or to help them manage a DDoS in some other way. This is very similar to the idea of interdomain flowspec, only using a different signaling mechanism. The signaling mechanism, in this case, is designed to allow the provider more leeway in how they respond to the request for help countering the DDoS. This system is called DDoS Open Threats Signaling; you can read more about it at this post I wrote at the ECI Telecom blog. You can also head over to the IETF DOTS WG page, and read through the drafts yourself.

Yes, I do answer reader comments… Sometimes just in email, and sometimes with a post—so comment away, ask questions, etc.

Do We Really Need a New BGP?

From time to time, I run across (yet another) article about why BGP is so bad, and how it needs to be replaced. This one, for instance, is a recent example.

Crossposted at APNIC and CircleID.

It seems the easiest way to solve this problem is finding new people—ones who don’t make mistakes—to work on BGP configuration, build IRR databases, and decide what should be included in BGP. Ivan points out how hopeless a situation this is going to be, however. As Ivan says, you cannot solve people problems with technology. You can hint in the right direction, and you can try to make things a little more sane, and a little less complex, but people cannot be fixed with technology. Given we cannot fix the people problem, would replacing BGP itself really help? Is there anything we could do to make things better?

To understand the answer to these questions, it is important to tear down a major misconception about BGP. The misconception?

BGP is a routing protocol in the same sense as OSPF, IS-IS, or EIGRP.

BGP was not designed to be a routing protocol in the way other protocols were. It was designed to provide a loop free path through a series of independently operated networks, each with its own policy and business goals. In the sense that BGP provides a loop free route to a destination, it provides routing. But the “routing” it provides is largely couched in terms of explicit, rather than implicit, policy (see the note below). Loop free routes are not always the “shortest” path in terms of hop count, or the “lowest cost” path in terms of delay, or the “best available” path in terms of bandwidth, or anything else. This is why BGP relies on the AS Path to prevent loops. We call things “metrics” in BGP in a loose way, but they are really explicit expressions of policy.

Consider this: the primary policies anyone cares about in interdomain routing are these—where do I want this traffic to exit my AS, and where do I want this traffic to enter my AS? The Local Preference is an expression of where traffic to this particular destination should exit this AS. The Multiple Exit Discriminator (MED) is an expression of where this AS would like to receive traffic being forwarded to this destination. Everything else is just a tie breaker. All the rest of the stuff we do to try to influence the path of traffic into and out of an AS, like messing with the AS Path, are hacks. If you can get this pair of “things people really care about” into your head, the BGP bestpath process, and much of the routing that goes on in the DFZ, makes a lot more sense.
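Reduced to these two policies plus a tie breaker, the decision process is almost trivial to write down. The sketch below strips bestpath to exactly that: local preference (my exit policy), then MED (the neighbor’s entrance policy, comparable only among routes from the same neighboring AS), then a tie breaker. Real implementations compare many more attributes, and the field names here are invented.

```python
def narrow(routes, key, prefer_max):
    """Keep only the routes tied on the best value of one attribute."""
    values = [key(r) for r in routes]
    target = max(values) if prefer_max else min(values)
    return [r for r in routes if key(r) == target]

def bestpath(routes):
    # 1. Where do I want this traffic to exit my AS? (local preference)
    routes = narrow(routes, lambda r: r["local_pref"], prefer_max=True)
    # 2. Where does the neighbor want to receive it? (MED, comparable only
    #    among routes learned from the same neighboring AS)
    if len({r["neighbor_as"] for r in routes}) == 1:
        routes = narrow(routes, lambda r: r["med"], prefer_max=False)
    # 3. Everything past this point is a tie breaker.
    return min(routes, key=lambda r: r["igp_cost"])

print(bestpath([
    {"exit": "B", "neighbor_as": 64500, "local_pref": 200, "med": 10, "igp_cost": 5},
    {"exit": "C", "neighbor_as": 64500, "local_pref": 100, "med": 0, "igp_cost": 1},
])["exit"])  # B: exit policy (local preference) trumps MED and IGP cost
```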

It really is that simple.

How does this relate to the problem of replacing BGP? There are several things you could improve about BGP, but automatic metrics are not one of them. There are, in fact, already “automatic metrics” in BGP, but “automatic metrics” like the IGP cost are tie breakers. A tie breaker is a convenient stand-in for what the protocol designer and/or implementor thinks the most natural policy should be. Whether they are right or wrong in a specific situation is a… guess.

What about something like the RPKI? The RPKI is not going to help in most situations where a human makes a mistake in a transit provider. It would help with transit edge failures and hijacks, but these are a different class of problem. You could ask for BGPsec to counter these problems, of course, but BGPsec would likely cause more problems than it solves (I’ve written on this before, here, here, here, here, and here, to start; you can find a lot more on rule11 by following this link).

Given replacing the metrics is not a possibility, and RPKI is only going to get you “so far,” what else can be done? There are, in fact, several practical steps that could be taken.

You could specify that BGP implementations should, by default, only advertise routes if there is some policy configured. Something like, say… RFC8212?

Giving operators more information to understand what they are configuring (perhaps by cleaning up the Internet Routing Registries?) would also be helpful. Perhaps we could build a graph overlay on top of the Default Free Zone (DFZ) so a richer set of policies could be expressed, and policies could be better observed and understood (but you have to convince the transit providers that this would not harm their business before this could happen).

Maybe we could also stop trying to use BGP as the trash can of the Internet, throwing anything we don’t know what else to do with in there. We’ve somehow forgotten the old maxim that a protocol is not done until we have removed everything that is not needed. Now our mantra seems to be “the protocol isn’t done until it solves every problem anyone has ever thought of.” We just keep throwing junk at BGP as if it is the abominable snowman—we assume it’ll bounce when it hits bottom. Guess what: it’s not, and it won’t.

Replacing BGP is not realistic—nor even necessary. Maybe it is best to put it this way:

  • BGP expresses policy
  • Policy is messy
  • Therefore, BGP is messy

We definitely need to work towards building good engineers and good tools—but replacing BGP is not going to “solve” either of these problems.

P.S. I have differentiated between “metrics” and “policy” here—but metrics can be seen as an implicit form of policy. Choosing the highest bandwidth path is a policy. Choosing the path with the shortest hop count is a policy, too. The shortest path (for some meaning of “shortest”) will always be provably loop free, so it is a useful way to always choose a loop free path in the face of simple, uniform, policies. But BGP doesn’t live in the world of simple uniform policies; it lives in the world of “more than one metric.” BGP lives in a world where different policies not only overlap, but directly compete. Computing a path with more than one metric is provably at least bistable, and often completely unstable, no matter what those metrics are.

P.P.S. This article is a more humorous take on finding perfect people.

BGPsec and Reality

From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Inter-Domain Routing (SIDR) working group, is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.

In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.

To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.
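To make the forward-signing idea concrete, here is a toy sketch in which per-AS HMAC keys stand in for the real key pairs and certificates (an assumption made purely for brevity; actual BGPsec uses asymmetric signatures rooted in the RPKI, and a defined wire format). Each AS signs the path so far plus the AS it is sending to, which is what lets a receiver check intent hop by hop.

```python
import hashlib
import hmac

keys = {65000: b"as65000-key", 65001: b"as65001-key"}  # toy per-AS secrets

def sign(as_path, target_as, key):
    """Sign the path so far *plus* the AS this update is being sent to."""
    message = repr((tuple(as_path), target_as)).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

# AS65000 originates the route and advertises it to AS65001:
sig_origin = sign([65000], 65001, keys[65000])
# AS65001 extends the path and advertises it on to AS65003:
sig_transit = sign([65000, 65001], 65003, keys[65001])

# A receiver (H in the diagram) validates each hop: AS65000 really meant
# to send this route to AS65001, and AS65001 really meant to pass it on.
assert sig_origin == sign([65000], 65001, keys[65000])
assert sig_transit == sign([65000, 65001], 65003, keys[65001])
```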

Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.

But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP increases performance is by gathering updates together and sending one update to as many peers as possible, using either a peer group or an update group. Worse yet, every reachable destination—every NLRI—must now be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets as well, which means they must be stored, carried across the wire, etc.

The general assumption in the BGPsec community is that the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved, is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.

But let’s lay this problem aside for a moment, and just assume every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved, right? Not so fast…

Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.

First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?

Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.

Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs out, the advertisement must be replaced. Given there are 800k routes in the default free zone, a timer of 24 hours (which would still make me uncomfortable in security terms) means roughly 800k/24, or around 33,000, replacement updates per hour added to the load of every router in the Internet—on top of the reduced performance noted above.
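The back-of-envelope arithmetic is worth writing out, using the figures above:

```python
routes = 800_000   # routes in the default free zone (the text's figure)
window_hours = 24  # assumed signature lifetime

print(routes / window_hours)           # ~33,333 extra updates per hour
print(routes / (window_hours * 3600))  # ~9.3 updates per second, forever
```

And that is the steady-state cost, before any actual topology change generates real updates.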

Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?

Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customers’ trust, and on their ability to outcompete in the area of peering, to make money. First, signing updates means you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.

Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.

Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, but also the public keys for B and F. This is going to be messy. The second is: by examining the public keys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.

The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even giving BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), and even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no one really wants exposed. The tradeoffs are ultimately unacceptable.

Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.

Reaction: Networks are not cars or cell phones

The network engineering world has long emphasized the longevity of the hardware we buy; I have sat through many vendor presentations where the salesman says “this feature set makes our product future proof! You can buy with confidence knowing this product will not need to be replaced for another ten years…” Over at the Networking Nerd, Tom has an article posted supporting this view of networking equipment, entitled Network Longevity: Think Car, not iPhone.

It seems, to me, that these concepts of longevity have the entire situation precisely backwards. These ideas of “car length longevity” and “future proof hardware” are looking at the network from the perspective of an appliance, rather than from the perspective of a set of services. Let me put this in a little bit of context by considering two specific examples.

In terms of cars, I have owned four in the last 31 years. I owned a Jeep Wrangler for 13 years, a second Jeep Wrangler for 8 years, and a third Jeep Wrangler for 9 years. I have recently switched to a Jeep Cherokee, which I have now been driving for just about a year.

What if I bought network equipment like I buy cars? What sort of router was available 9 years ago? That is 2008. I was still working at Cisco, and my lab, if I remember right, was made up of 7200’s and 2600’s. Younger engineers probably look at those model numbers and see completely different equipment than what I actually had; I doubt many readers of this blog ever deployed 7200’s of the kind I had in my lab in their networks. Do I really want to run a network today on 9 year old hardware? I don’t see how the answer to that question can be “yes.” Why?

First, do you really know what hardware capacity you will need in ten years? Really? I doubt your business leaders can tell you what products they will be creating in ten years beyond a general description, nor can they tell you how large the company will be, who their competitors will be, or what shifts might occur in the competitive landscape.

Hardware vendors try to get around this by building big chassis boxes, and selling blades that will slide into them. But does this model really work? The Cisco 7500 was the current chassis box 9 years ago, I think—even if you could get blades for it today, would it meet your needs? Would you really want to pay the power and cooling for an old 7500 for 9 years because you didn’t know if you would need one or seven slots nine years ago?

Building a hardware platform for ten years of service in a world where two years is too far to predict is like rearranging the chairs on the Titanic. It’s entertaining, perhaps, but it’s pretty pointless entertainment.

Second, why are we not taking the lessons of the compute and storage worlds into our thinking, and learning to scale out, rather than scaling up? We treat our routers like the server folks of yore—add another blade slot and make it go faster. Scale up makes your network do this—

Do you see those grey areas? They are costing you money. Do you enjoy defenestrating money?

These are symptoms of looking at the network as a bunch of wires and appliances, as hardware with a little side of software thrown in.

What about the software? Well, it may be hard to believe, but pretty much every commercial operating system available for routers today is an updated version of software that was available ten years ago. Some, in fact, are more than twenty years old. We don’t tend to see this, because we deploy routers and switches as appliances, which means we treat the software as just another form of hardware. We might deploy ten to fifteen different operating systems in our network without thinking about it—something we would never do in our data centers, or on our desktop computers.

So what this appliance based way of looking at things emphasizes is this: buy enough hardware to last you ten years, and treat the software as fungible—software is a second tier player that is a simple enabler for the expensive bits, the hardware. The problem with this view of things is it simply ignores reality. We need to reverse our thinking.

Software is the actual core of the network, not hardware.

If you look at the entire networking space from a software centric perspective, you can think a lot differently. It doesn’t matter what hardware you buy; what matters is what software it runs. This is the revolutionizing observation of white box, bright box, and disaggregated networking. Hardware is cheap, software is expensive. Hardware is CAPEX, software is OPEX. Hardware only loosely interacts with business and operations; software interacts with both.

The appliance model, and the idea of buying big iron like a car, is hampering the growth and usefulness of networks in real businesses. It is going to take a change to realize that most of us care much less about hardware than software in our daily lives, and to transfer this thinking to the network engineering realm.

It is time for a new way of looking at the network. A router is not a car, nor is it a cell phone. It is a router, and it deserves its own way of looking at value. The value is in connecting the software to the business, and the hardware to the speeds and feeds. These are separate problems which the appliance model ties into a single “thing.” This makes the appliance world bad for businesses, bad for network design, and bad for network engineers.

It’s time to rethink the way we look at network engineering to build networks that are better for business, to adjust our idea of future proof to mean a software based system that can be used across many generations of hardware, while hardware becomes a “just in time” component used and recycled as needs must.

On the ‘web: What’s Wrong with BGP

Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.

You can jump to the original post on Packet Pushers here.

BGP Persistent Oscillation

After Daniel Walton visited the History of Networking at the Network Collective, I went back and poked at BGP permanent route oscillations just to refresh my memory. Since I spent the time, I thought it was worth a post, with some observations. When working with networking problems, it is always wise to begin with a network, so…

For those who are interested, I’m pretty much following RFC3345 in this explanation.

There are two BGP route reflectors here, in two different clusters, labeled A and D. The metric for each link is listed on the links between the RR clients, B, C, and E, and the RRs; the cost of the link between the RRs is 1. A single route, 2001:db8:3e8:100::/64, is being advertised with AS paths of the same length from three different eBGP peering points, each with a different MED. E is receiving the route with a MED of 0, C with a MED of 1, and B with a MED of 10.

Starting with A, walk through one cycle of the persistent oscillation. At A there are two routes—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

When A runs the bestpath calculation, it will determine the best path should be through C, rather than B (because the IGP cost is lower; the MEDs are not compared due to different AS paths), so it will send an update to each of its peers about this change in its best path. This results in D having the following table—

edge MED IGP Cost
E    0   12 (BEST)
C    1   5

When D receives this update from A, it will calculate the best path towards 2001:db8:3e8:100::/64, and choose the path through E, because this path has the lowest MED (the MEDs are compared here because the AS path is the same). On completing the best path calculation, D will send an update to its peers, letting them know about its new best path to the destination, which primarily means A. On receiving this update, A has three routes to the destination in its table—

edge MED IGP Cost
E    0   13
C    1   4
B    10  5 (BEST)

The key point to think through here is why the third route in the table is best. The BGP bestpath process compares each pair of routes, starting with the first pair. So the first two routes are compared, the best of those two is chosen and compared with the third, the best of these two is compared with the fourth, etc., until the end of the table is reached. Here the first two paths are compared, and the path through E wins because of the lower MED (the AS path is the same). When the path through E is compared to the path through B, however, the path through B wins because the IGP metric is lower (the AS path is different).
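Here is that pairwise comparison written out as a minimal sketch, using the MEDs and IGP costs from A’s table above. The neighbor-AS numbers are assumptions, chosen so the pairs the text says share an AS path compare on MED.

```python
from dataclasses import dataclass

@dataclass
class Route:
    edge: str
    neighbor_as: int
    med: int
    igp_cost: int

def better(a: Route, b: Route) -> Route:
    if a.neighbor_as == b.neighbor_as:  # same AS path: compare MEDs
        return a if a.med < b.med else b
    return a if a.igp_cost < b.igp_cost else b  # else the IGP cost decides

table = [
    Route("E", 65001, med=0, igp_cost=13),
    Route("C", 65001, med=1, igp_cost=4),
    Route("B", 65002, med=10, igp_cost=5),
]

best = table[0]
for candidate in table[1:]:
    best = better(best, candidate)

# E beats C on MED; B then beats E on IGP cost; so B wins, even though C
# would beat B on IGP cost if the two were ever compared directly. The
# comparison is not transitive, which is the seed of the oscillation.
print(best.edge)  # B
```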

At this point, A will now send an update to its peers, specifically D, informing D of this change in its best path. Note this update removes any information about the path through C from D’s table, so D only has a partial view of the available paths. As a result, D will have the following in its table—

edge MED IGP Cost
E    0   12
B    10  6 (BEST)

D will select the path through B because the IGP cost is lower, and send an update to A with this new information. This will result in A having the table this process started with—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

What is interesting about this is the removal of information about the path through C from D’s view of the network. Essentially, what is happening here is D is switching between two different views of the network topology, one of which includes B, the other of which includes C. The reason the ADD_PATH extension solves this problem is that A and D both have a full view of every exit point once each BGP speaker sends every route to each destination, rather than just the best path.

This is, in effect, another instance of the inconsistency of a distributed database causing a persistent condition in a control plane. In (loosely!) CAP theorem terms, distributed routing protocols always choose availability (the local device can read the database to calculate loop free paths) and partition tolerance (the database is always copied to every device speaking the protocol) over consistency—eventually, or “not always,” consistent databases will always be the result of such a situation. As A and D read their databases, each of which contains incomplete information about the real state of the network, they will make different decisions about what the best path to the destination in question is. As they each change their views of the topology, they will send updated information to one another, causing the other BGP speaker to recompute its view of the topology, and…

Persistent BGP oscillation is an interesting study in the way consistency impacts distributed routing protocol design and convergence.