BGP

Recent BGP Peering Enhancements

BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.

The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained, and the peering shutdown has the minimal effect possible. The normal way of shutting down a link for maintenance would be to for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain, before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).

Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP has been shut down, whether it coming back up again, how long the shutdown is expected to last, etc.

Graceful Restart (GR) is a long available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router switches to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted through the NOTIFICATION/session restart process. This same processing takes place for a hold timer timeout in BGP.

The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.

Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.

In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise C as the next hop towards the destination (for instance, a route server would not normally advertise itself as the nexthop towards a destination). The problem with this situation is that B might not have the same capabilities as a BGP speaker as A. If B, for instance, cannot forward for IPv6, the situation shown in the illustration would clearly not work.

To resolve this, BGP Next-Hop dependent capabilities allows a speaker to advertise the capabilities of these alternate next hops to peered BGP speakers.

RIPE NCC: The Future of BGP Security

I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.

Reaction: DNS Complexity Lessons

Recently, Bert Hubert wrote of a growing problem in the networking world: the complexity of DNS. We have two systems we all use in the Internet, DNS and BGP. Both of these systems appear to be able to handle anything we can throw at them and “keep on ticking.”

this article was crossposted to CircleID

But how far can we drive the complexity of these systems before they ultimately fail? Bert posted this chart to the APNIC blog to illustrate the problem—

I am old enough to remember when the entire Cisco IOS Software (classic) code base was under 150,000 lines; today, I suspect most BGP and DNS implementations are well over this size. Consider this for a moment—a single protocol implementation that is larger than an entire Network Operating System ten to fifteen years back.

What really grabbed my attention, though, was one of the reasons Bert believes we have these complexity problems—

DNS developers frequently see immense complexity not as a problem but as a welcome challenge to be overcome. We say ‘yes’ to things we should say ‘no’ to. Less gifted developer communities would have to say no automatically since they simply would not be able to implement all that new stuff. We do not have this problem. We’re also too proud to say we find something (too) hard.

How often is this the problem in network design and deployment? “Oh, you want a stretched Ethernet link between two data centers 150 miles apart, and you want an eVPN control plane on top of the stretched Ethernet to support MPLS Traffic Engineering, and you want…” All the while the equipment budget is ringing up numbers in our heads, and the realyl cool stuff we will be able to play with is building up on the list we are writing in front of us. Then you hear the ultimate challenge—”if you were a real engineer, you could figure out how to do this all with a pair of routers I can buy down at the local office supply store.”

Some problems just do not need to be solved in the current system. Some problems just need to have their own system built for them, rather than reusing the same old stuff because, well, “we can.”

The real engineer is the one who knows how to say “no.”

On the ‘web: The Value of MANRS

Route leaks and Distributed Denial of Service (DDoS) attacks have been in the news a good deal over the last several years; but the average non-transit network operator might generally feel pretty helpless in the face of the onslaught. Perhaps you can buy a DDoS mitigation service or appliance, and deploy the ubiquitous firewall at the edge of your network, but there is not much else to be done, right? Or maybe wait on the Internet at large to “do something” about these problems by deploying some sort of BGP security. But will adopting a “secure edge,” and waiting for someone else to solve the problem, really help? @ECI

Section 10 Routing Loops

A (long) time ago, a reader asked me about RFC4456, section 10, which says:

Care should be taken to make sure that none of the BGP path attributes defined above can be modified through configuration when exchanging internal routing information between RRs and Clients and Non-Clients. Their modification could potentially result in routing loops. In addition, when a RR reflects a route, it SHOULD NOT modify the following path attributes: NEXT_HOP, AS_PATH, LOCAL_PREF, and MED. Their modification could potentially result in routing loops.

On first reading, this seems a little strange—how could modifying the next hop, Local Preference, or MED at a route reflector cause a routing loop? While contrived, the following network illustrates the principle.

Note the best path, from an IGP perspective, from C to E is through B, and the best path, from an IGP perspective, from B to D is through C. In this case, a route is advertised over eBGP from F towards E and D. These two eBGP speakers, in turn, advertise the route to their iBGP neighbors, B and C. Both B and C are route reflectors, so they both reflect the route on to A, which advertises the route to some other eBGP speaker outside AS65000 (not shown in the network diagram). In this case, assume the best path (for whatever reason) should be the route learned through D.

What happens if C changes the next hop for the route so it points to E rather than D? This should be fine, at first glance; when E receives traffic for the destination reachable through F, it will use the local eBGP route learned from F directly to forward the traffic. But there is a subtle problem here. Assume A receives both routes, one from B with a next hop of D, and one from C with a next hop of E. A, for whatever reason, chooses the path with a next hop of D. The best path to D, according to the IGP metrics, is through C, so A forwards the traffic to C.

C, however, has been configured to set the next hop to E through a local configuration. The best IGP path to E is through B, so C will forward the traffic towards B to be forwarded to E. B, however, has a next hop towards this destination of D, so when it receives packets destined beyond F in AS65001, it will examine its local routing table for the best path towards D, and find this is through C. Hence, B will forward the traffic to C to be forwarded towards D.

Thus a routing loop is formed because the best IGP path towards the next hop always points through another router with a next hop that points back to the router forwarding the traffic. The problem is B and C have inconsistent bestpaths, such that they each think the bestpath is through one another.

This is, of course, an artifact of overlaying two different control planes, each with their own rules about how to determine a loop free path to any given destination. This sort of problem can arise with any pair of control planes overlaid in this way.

What about MED, Local Preference, or the AS Path? C could modify any of these while reflecting the route to cause E to be chosen as the best exit point locally, while B and A continue to choose D as the best exit point. Any of these, then, can be used to create a routing loop in this topology.

Again, this is a somewhat contrived example, but if a loop can be contrived, then it will likely show up in more complex (and not-so-contrived) networks in the real world. It would be much easier to create a loop with a hierarchical route reflector, or even by causing an inconsistent route advertisement on the AS edge (two different eBGP speakers advertising different paths to a given destination reachable through the local AS).

Flowspec and RFC1998?

In a recent comment, Dave Raney asked:

Russ, I read your latest blog post on BGP. I have been curious about another development. Specifically is there still any work related to using BGP Flowspec in a similar fashion to RFC1998. In which a customer of a provider will be able to ask a provider to discard traffic using a flowspec rule at the provider edge. I saw that these were in development and are similar but both appear defunct. BGP Flowspec-ORF https://www.ietf.org/proceedings/93/slides/slides-93-idr-19.pdf BGP Flowspec Redirect https://tools.ietf.org/html/draft-ietf-idr-flowspec-redirect-ip-02.

This is a good question—to which there are two answers. The first is this service does exist. While its not widely publicized, a number of transit providers do, in fact, offer the ability to send them a flowspec community which will cause them to set a filter on their end of the link. This kind of service is immensely useful for countering Distributed Denial of Service (DDoS) attacks, of course. The problem is such services are expensive. The one provider I have personal experience with charges per prefix, and the cost is high enough to make it much less attractive.

Why would the cost be so high? The same reason a lot of providers do not filter for unicast Reverse Path Forwarding (uRPF) failures at scale—per packet filtering is very performance intensive, sometimes requiring recycling the packet in the ASIC. A line card normally able to support x customers without filtering may only be able to support x/2 customers with filtering. The provider has to pay for additional space, power, and configuration (the flowspec rules must be configured and maintained on the customer facing router). All of these things are costs the provider is going to pass on to their customers. The cost is high enough that I know very few people (in fact, so few as to be 0) network operators who will pay for this kind of service.

The second answer is there is another kind of service that is similar to what Dave is asking about. Many DDoS protection services offer their customers the ability to signal a request to the provider to block traffic from a particular source, or to help them manage a DDoS in some other way. This is very similar to the idea of interdomain flowspec, only using a different signaling mechanism. The signaling mechanism, in this case, is designed to allow the provider more leeway in how they respond to the request for help countering the DDoS. This system is called DDoS Open Threats Signaling; you can read more about it at this post I wrote at the ECI Telecom blog. You can also head over to the IETF DOTS WG page, and read through the drafts yourself.

Yes, I do answer reader comments… Sometimes just in email, and sometimes with a post—so comment away, ask questions, etc.

Do We Really Need a New BGP?

From time to time, I run across (yet another) article about why BGP is so bad, and how it needs to be replaced. This one, for instance, is a recent example.

cross posted at APNIC and CircleID

It seems the easiest way to solvet this problem is finding new people—ones who don’t make mistakes—to work on BGP configuration, building IRR databases, and deciding what should be included in BGP? Ivan points out how hopeless of a situation this is going to be, however. As Ivan says, you cannot solve people problems with technology. You can hint in the right direction, and you can try to make things a little more sane, and a little less complex, but people cannot be fixed with technology. Given we cannot fix the people problem, would replacing BGP itself really help? Is there anything we could do to make things better?

To understand the answer to these questions, it is important to tear down a major misconception about BGP. The misconception?

BGP is a routing protocol in the same sense as OSPF, IS-IS, or EIGRP.

BGP was not designed to be a routing protocol in the way other protocol were. It was designed to provide a loop free path through a series of independently operated networks, each with its own policy and business goals. In the sense that BGP provides a loop free route to a destination, it provides routing. But the “routing” it provides is largely couched in terms of explicit, rather than implicit, policy (see the note below). Loop free routes are not always the “shortest” path in terms of hop count, or the “lowest cost” path in terms of delay, or the “best available” path in terms of bandwidth, or anything else. This is why BGP relies on the AS Path to prevent loops. We call things “metrics” in BGP in a loose way, but they are really explicit expressions of policy.

Consider this: the primary policies anyone cares about in interdomain routing are: where do I want this traffic to exit my AS, and where do I want this traffic to enter my AS? The Local Preference is an expression of where traffic to this particular destination should exit this AS. The Multiple Exit Disciminator (MED) is an expression of where this AS would like to receive traffic being forwarded to this destination. Everything other than these are just tie breakers. All the rest of the stuff we do to try to influence the path of traffic into and out of an AS, like messing with the AS Path, are hacks. If you can get this pair of “things people really care about” into your head, the BGP bestpath process, and much of the routing that goes on in the DFZ, makes a lot more sense.

It really is that simple.

How does this relate to the problem of replacing BGP? There are several things you could improve about BGP, but automatic metrics are not one of them. There are, in fact, already “automatic metrics” in BGP, but “automatic metrics” like the IGP cost are tie breakers. A tie breaker is a convenient stand-in for what the protocol designer and/or implementor thinks the most natural policy should be. Whether or not they are right or wrong in a specific situation is a… guess.

What about something like the RPKI? The RPKI is not going to help in most situations where a human makes a mistake in a transit provider. It would help with transit edge failures and hijacks, but these are a different class of problem. You could ask for BGPsec to counter these problems, of course, but BGPsec would likely cause more problems than it solves (I’ve written on this before, here, here, here, here, and here, to start; you can find a lot more on rule11 by following this link).

Given replacing the metrics is not a possibility, and RPKI is only going to get you “so far,” what else can be done? There are, in fact, several practical steps that could be taken.

You could specify that BGP implementations should, by default, only advertise routes if there is some policy configured. Something like, say… RFC8212?

Giving operators more information to understand what they are configuring (perhaps by cleaning up the Internet Routing Registries?) would also be helpful. Perhaps we could build a graph overlay on top of the Default Free Zone (DFZ) so a richer set of policies could be expressed, and policies could be better observed and understood (but you have to convince the transit providers that this would not harm their business before this could happen).

Maybe we could also stop trying to use BGP as the trash can of the Internet, throwing anything we don’t know what else to do with in there. We’ve somehow forgotten the old maxim that a protocol is not done until we have removed everything that is not needed. Now our mantra seems to be “the protocol isn’t done until it solves every problem anyone has ever thought of.” We just keep throwing junk at BGP as if it is the abominable snowman—we assume it’ll bounce when it hits bottom. Guess what: it’s not, and it won’t.

Replacing BGP is not realistic—nor even necessary. Maybe it is best to put it this way:

  • BGP expresses policy
  • Policy is messy
  • Therefore, BGP is messy

We definitely need to work towards building good engineers and good tools—but replacing BGP is not going to “solve” either of these problems.

P.S. I have differentiated between “metrics” and “policy” here—but metrics can be seen as an implicit form of policy. Choosing the highest bandwidth path is a policy. Choosing the path with the shortest hop count is a policy, too. The shortest path (for some meaning of “shortest”) will always be provably loop free, so it is a useful way to always choose a loop free path in the face of simple, uniform, policies. But BGP doesn’t live in the world of simple uniform policies; it lives in the world of “more than one metric.” BGP lives in a world where different policies not only overlap, but directly compete. Computing a path with more than one metric is provably at least bistable, and often completely unstable, no matter what those metrics are.

P.P.S. This article is a more humorous take on finding perfect people.

BGPsec and Reality

From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Interdomain (SIDR) working group is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.

In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptogrphic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.

To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.

Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.

But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP uses to increase performance is in gathering updates up and sending one update whenever possible using either a peer group or an update group. Worse yet, every reachable destination—NLRI—now must be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets, as well, which means they must be stored, carried across the wire, etc.

The general assumption in the BGPsec community is the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.

But let’s lay aside this problem for a moment, and just assumed every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved right? Not so fast…

Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.

First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?

Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.

Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs down or our, the advertisement must be replaced. Given there are 800k routes in the default free zone, a timer of 24 hours (which would still make me uncomfortable in security terms), there would need to be 800k/24 hours updates per hour added to the load of every router in the Internet. On top of the reduced performance noted above.

Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?

Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customer’s trust, and their ability to out compete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.

Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.

Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, it must find the public key for B and F. This is going to be messy. The second is: By examining the publickeys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.

The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even setting aside BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no-one really wants exposed. The tradeoffs are ultimately unacceptable.

Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.

Reaction: Networks are not cars or cell phones

The network engineering world has long emphasized the longevity of the hardware we buy; I have sat through many vendor presentations where the salesman says “this feature set makes our product future proof! You can buy with confidence knowing this product will not need to be replaced for another ten years…” Over at the Networking Nerd, Tom has an article posted supporting this view of networking equipment, entitled Network Longevity: Think Car, not iPhone.

It seems, to me, that these concepts of longevity have the entire situation precisely backwards. These ideas of “car length longevity” and “future proof hardware” are looking at the network from the perspective of an appliance, rather than from the perspective as a set of services. Let me put this in a little bit of context by considering two specific examples.

In terms of cars, I have owned four in the last 31 years. I owned a Jeep Wrangler for 13 years, a second Jeep Wrangler for 8 years, and a third Jeep Wrangler for 9 years. I have recently switched to a Jeep Cherokee, which I’ve just about reached my first year driving.

What if I bought network equipment like I buy cars? What sort of router was available 9 years ago? That is 2008. I was still working at Cisco, and my lab, if I remember right, was made up of 7200’s and 2600’s. Younger engineers probably look at those model numbers and see completely different equipment than what I actually had; I doubt many readers of this blog ever deployed 7200’s of the kind I had in my lab in their networks. Do I really want to run a network today on 9 year old hardware? I don’t see how the answer to that question can be “yes.” Why?

First, do you really know what hardware capacity you will need in ten years? Really? I doubt your business leaders can tell you what products they will be creating in ten years beyond a general description, nor can they tell you how large the company will be, who their competitors will be, or what shifts might occur in the competitive landscape.

Hardware vendors try to get around this by building big chassis boxes, and selling blades that will slide into them. But does this model really work? The Cisco 7500 was the current chassis box 9 years ago, I think—even if you could get blades for it today, would it meet your needs? Would you really want to pay the power and cooling for an old 7500 for 9 years because you didn’t know if you would need one or seven slots nine years ago?

Building a hardware platform for ten years of service in a world where two years is too far to predict is like rearranging the chairs on the Titanic. It’s entertaining, perhaps, but it’s pretty pointless entertainment.

Second, why are we not taking the lessons of the compute and storage worlds into our thinking, and learning to scale out, rather than scaling up? We treat our routers like the server folks of yore—add another blade slot and make it go faster. Scale up makes your network do this—

Do you see those grey areas? They are costing you money. Do you enjoy defenestrating money?

These are symptoms of looking at the network as a bunch of wires and appliances, as hardware with a little side of software thrown in.

What about the software? Well, it may be hard to believe, but pretty much every commercial operating system available for routers today is an updated version of software that was available ten years ago. Some, in fact, are more than twenty years old. We don’t tend to see this, because we deploy routers and switches as appliances, which means we treat the software as just another form of hardware. We might deploy ten to fifteen different operating systems in our network without thinking about it—something we would never do in our data centers, or on our desktop computers.

So what this appliance based way of looking at things emphasizes is this: buy enough hardware to last you ten years, and treat the software a fungible—software is a second tier player that is a simple enabler for the expensive bits, the hardware. The problem with this view of things is it simply ignores reality. We need to reverse our thinking.

Software is the actual core of the network, not hardware.

If you look at the entire networking space from a software centric perspective, you can think a lot differently. It doesn’t matter what hardware you buy; what matters is what software it runs. This is the revolutionizing observation of white box, bright box, and disaggregated networking. Hardware is cheap, software is expensive. Hardware is CAPEX, software is OPEX. Hardware only loosely interacts with business and operations; software interacts with both.

The appliance model, and the idea of buying big iron like a car, is hampering the growth and usefulness of networks in real businesses. It is going to take a change to realize that most of us care much less about hardware than software in our daily lives, and to transfer this thinking to the network engineering realm.

It is time for a new way of looking at the network. A router is not a car, nor it is a cell phone. It is a router, and it deserves its own way of looking at value. The value is in connecting the software to the business, and the hardware to the speeds and feeds. These are separate problems which the appliance model ties into a single “thing.” This makes the appliance world bad for businesses, bad for network design, and bad for network engineers.

It’s time to rethink the way we look at network engineering to build networks that are better for business, to adjust our idea of future proof to mean a software based system that can be used across many generations of hardware, while hardware becomes a “just in time” component used and recycled as needs must.

On the ‘web: What’s Wrong with BGP

Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.

You can jump to the original post on Packet Pushers here.

BGP Persistent Oscillation

After Daniel Walton visited the History of Networking at the Network Collective, I went back and poked at BGP permanent route oscillations just to refresh my memory. Since I spent the time, I thought it was worth a post, with some observations. When working with networking problems, it is always wise to begin with a network, so…

For those who are interested, I’m pretty much following RFC3345 in this explanation.

There are two BGP route reflectors here, in two different clusters, labeled A and D. The metric for each link is listed on the links between the RR clients, B, C, and E, and the RRs; the cost of the link between the RRs is 1. A single route, 2001:db8:3e8:100::/64 is being advertised in with an AS path of the same length from three different eBGP peering points, each with a different MED. E is receiving the route with a MED of 0, C with a MED of 1, and B with a MED of 10.

Starting with A, walk through one cycle of the persistent oscillation. At A there are two routes—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

When A runs the bestpath calculation, it will determine the best path should be through C, rather than B (because the IGP cost is lower; the MEDs are not compared due to different AS paths), so it will send an update to each of its peers about this change in its best path. This results in D having the following table—

edge MED IGP Cost
E    0   12 (BEST)
C    1   5

When D receives this update from A, it will calculate the best path towards 2001:db8:3e8:100::/64, and choose the path through E because this path has the lowest MED (the MEDs are compared here because the AS path is the same). On completing the best path calculation, E will send an update to its peers, letting them know about its new best path to the destination, which primarily means A. On receiving this update, A has three routes to the destination in its table—

edge MED IGP Cost
E    0   13
C    1   4
B    10  5 (BEST)

The key point to think through here is why the third route in the table is best. The BGP bestpath process compares each pair of routes, starting with the first pair. So the first two routes are compared, the best of those two is chosen and compared with the third, the best of these two is compared with the fourth, etc., until the end of the table is reached. Here the first two paths are compared, and the path through E because of the lower MED (the AS path is the same). When the path through E is compared to the path through B, however, the path through B wins because the IGP metric is lower (the AS path is different).

At this point, A will now send an update to its peers, specifically D, informing D of this change in its best path. Note this update removes any information about the path through C from D’s table, so D only has a partial view of the available paths. As a result, D will have the following in its table—

edge MED IGP Cost
E    0   12
B    10  6 (BEST)

D will select the path through B because the IGP cost is lower, and send an update to A with this new information. This will result in A having the table this process started with—

edge MED IGP Cost
C    1   4
B    10  5 (BEST)

What is interesting about this is the removal of information about the path through C from D’s view of the network. Essentially, what is happening here is D is switching between two different views of the network topology, one of which includes B, the other of which includes C. The reason the ADD_PATH extension solves this problem is that A and D both have a full view of every exit point once each BGP speaker sends every route to each destination, rather than just the best path.

This is, in effect, another instance of the inconsistency of a distributed database causing a persistent condition in a control plane. In (loosely!) CAP theorem terms, distributed routing protocol always choose accessibility (the local device can read the database to calculate loop free paths) and partitioning (the database is always copied to every device speaking the protocol) over consistency—eventually, or “not always,” consistent databases will always be the result of such a situation. As A and D read their databases, each of which contain incomplete information about the real state of the network, they will make different decisions about what the best path to the destination in question is. As they each change their views of the topology, they will send updated information to one another, causing the other BGP speaker to recompute its view of the topology, and…

Persistent BGP oscillation is an interesting study in the way consistency impacts distributed routing protocol design and convergence.

Optimal Route Reflection


There are—in theory—three ways BGP can be deployed within a single AS. You can deploy a full mesh of iBGP peers; this might be practical for a small’ish deployment (say less than 10), but it quickly becomes a management problem in larger, or constantly changing, deployments. You can deploy multiple BGP confederations; creating internal autonomous systems that are invisible to the world because the internal AS numbers are stripped at the real eBGP edge.

The third solution is (probably) the only solution anyone reading this has deployed in a production network: route reflectors. A quick review might be useful to set the stage.

In this diagram, B and E are connected to eBGP peers, each of which is advertising a different destination; F is advertising the 100::64 prefix, and G is advertising the 101::/64 prefix. Assume A is the route reflector, and B,C, D, and E are route reflector clients. What happens when F advertises 100::/64 to B?

  • B receives the route and advertises it through iBGP to A
  • A adds its router ID to the cluster list, and reflect the route to C, D, and E
  • E receives this route and advertises it through its eBGP session towards G
  • C does not advertise 100::/64 towards D, because D is an iBGP peer (not configured as a route reflector)
  • D does not advertise 100::/64 towards C, because C is an iBGP peer (not configured as a route reflector)

Even if D did readvertise the route towards C, and C back towards A, A would reject the route because its router ID is in the cluster list. Although the improper use of route reflectors can get you into a lot of trouble, the usage depicted here is fairly simple. Here A will only have one path towards 100::/64, so it will only have one possible path across which to run the BGP bestpath calculation.

The case of 101::/64 is a little different, however. The oddity here is the link metrics. In this network, A is going to receive two routes towards 101::/64, through D and E. Assuming all other things are equal (such as the local preference), A will choose the path to the speaker within the AS with the lowest IGP metric. Hence A will choose the path through E, advertising this route to B, C, and D. What if A were not a route reflector? If every router within the AS were part of an iBGP full mesh, what would happen? In this case:

  • B would receive three two routes to 101::/64, one from D with an IGP metric of 30, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, B will choose the path through E to reach 101::/64.
  • C would receive two routes to 101::/64, one from D with an IGP metric of 10, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, C will choose the path through D to reach 101::/64.

Inserting the route reflector, A, into the network does not change the best path to 101::/64 from the perspective of B, but it does change C’s best path from D to E. How can the shortest path be restored in the network? The State/Optimization/Surface (SOS) three way trade off tells use there are two possible solutions—either the state removed by the route reflector must be restored into BGP, or some interaction surface needs to be enabled between BGP and some other system in the network that has the information required to restore optimal routing.

The first of these two options, restoring the state removed through route reflection, is represented by two different solutions, one of which can be considered a subset of the other. The first solution is for the route reflector, A, to send all the routes to 101::/64 to every route reflector client. This is called add paths, and is documented in RFC7911. The problem with this solution is the amount of additional state.

A second option is to provide some set of paths beyond the best path to each client, but not the entire set of paths. This solution still attacks the suboptimal problem by adding state that was removed through the reflection process. In this case, however, rather than adding back all the state, a subset of state is added back. The state added back is normally the second best path, which is enough to provide enough information to re-optimize the network, but minimal enough to not overwhelm BGP.

What about the other option—allowing BGP to interact with some other system that has the information required to tell BGP specifically which state will allow the route reflector clients to compute the optimal path through the network? This third solution is described in BGP Optimal Route Reflection (BGP-ORR). To understand this solution, begin by asking: why does removing BGP advertisements from the control plane cause suboptimal routing? The answer to this question is: because the route reflector client does not have all the available routes, it cannot compare the IGP metric of every path in order to determine the shortest path.

In other words, C actually has two paths to 101::/64, one through A and another through D. If C knew about these two paths, it could compare the two IGP costs, through A and through D, and choose the closest exit point out of the AS. What other router in the netwok has all the relevant information? The route reflector—A. If a link state IGP is being used in this network, A can calculate the shortest path from C to both of the potential exit points, D and E. Further, because it is the route reflector, A knows about both of the routes to reach 101::/64. Hence, A can compute the best path as C would compute it, taking into account the IGP metric for both exit points, and send C the route it knows the BGP best path process on C will choose anyway. This is exactly what BGP Optimal Route Reflection (BGP-ORR) describes.

Hopefully this short tour through BGP route reflection, the problem route reflection causes by removing state from the network, and the potential solutions, is useful in understanding the various drafts and solutions being proposed.

Blocking a DDoS Upstream

In the first post on DDoS, I considered some mechanisms to disperse an attack across multiple edges (I actually plan to return to this topic with further thoughts in a future post). The second post considered some of the ways you can scrub DDoS traffic. This post is going to complete the basic lineup of reacting to DDoS attacks by considering how to block an attack before it hits your network—upstream.

The key technology in play here is flowspec, a mechanism that can be used to carry packet level filter rules in BGP. The general idea is this—you send a set of specially formatted communities to your provider, who then automagically uses those communities to create filters at the inbound side of your link to the ‘net. There are two parts to the flowspec encoding, as outlined in RFC5575bis, the match rule and the action rule. The match rule is encoded as shown below—

There are a wide range of conditions you can match on. The source and destination addresses are pretty straight forward. For the IP protocol and port numbers, the operator sub-TLVs allow you to specify a set of conditions to match on, and whether to AND the conditions (all conditions must match) or OR the conditions (any condition in the list may match). Ranges of ports, greater than, less than, greater than or equal to, less than or equal to, and equal to are all supported. Fragments, TCP header flags, and a number of other header information can be matched on, as well.

Once the traffic is matched, what do you do with it? There are a number of rules, including—

  • Controlling the traffic rate in either bytes per second or packets per second
  • Redirect the traffic to a VRF
  • Mark the traffic with a particular DSCP bit
  • Filter the traffic

If you think this must be complicated to encode, you are right. That’s why most implementations allow you to set pretty simple rules, and handle all the encoding bits for you. Given flowspec encoding, you should just be able to detect the attack, set some simple rules in BGP, send the right “stuff” to your provider, and watch the DDoS go away. …right… If you have been in network engineering since longer than “I started yesterday,” you should know by now that nothing is ever that simple.

If you don’t see a tradeoff, you haven’t looked hard enough.

First, from a provider’s perspective, flowspec is an entirely new attack surface. You cannot let your customer just send you whatever flowspec rules they like. For instance, what if your customer sends you a flowspec rule that blocks traffic to one of your DNS servers? Or, perhaps, to one of their competitors? Or even to their own BGP session? Most providers, to prevent these types of problems, will only apply any flowspec initiated rules to the port that connects to your network directly. This protects the link between your network and the provider, but there is little way to prevent abuse if the provider allows these flowspec rules to be implemented deeper in their network.

Second, filtering costs money. This might not be obvious at a single link scale, but when you start considering how to filter multiple gigabits of traffic based on deep packet inspection sorts of rules—particularly given the ability to combine a number of rules in a single flowspec filter rule—filtering requires a lot of resources during the actual packet switching process. There is a limited number of such resources on any given packet processing engine (ASIC), and a lot of customers who are likely going to want to filter. Since filtering costs the provider money, they are most likely going to charge for flowspec, limit which customers can send them flowspec rules (generally grounded in the provider’s perception of the customer’s cluefulness), and even limit the number of flowspec rules that can be implemented at any given time.

There is plenty of further reading out there on configuring and using flowspec, and it is likely you will see changes in the way flowspec is encoded in the future. Some great places to start are—

One final thought as I finish this post off. You should not just rely on technical tools to block a DDoS attack upstream. If you can figure out where the DDoS is coming from, or track it down to a small set of source autonomous systems, you should find some way to contact the operator of the AS and let them know about the DDoS attack. This is something Mara and I will be covering in an upcoming webinar over at ipspace.net—watch for more information on this as we move through the summer.