But how far can we drive the complexity of these systems before they ultimately fail? Bert posted this chart to the APNIC blog to illustrate the problem—
I am old enough to remember when the entire Cisco IOS Software (classic) code base was under 150,000 lines; today, I suspect most BGP and DNS implementations are well over this size. Consider this for a moment—a single protocol implementation that is larger than an entire Network Operating System ten to fifteen years back.
What really grabbed my attention, though, was one of the reasons Bert believes we have these complexity problems—
How often is this the problem in network design and deployment? “Oh, you want a stretched Ethernet link between two data centers 150 miles apart, and you want an eVPN control plane on top of the stretched Ethernet to support MPLS Traffic Engineering, and you want…” All the while, the equipment budget is ringing up numbers in our heads, and the really cool stuff we will be able to play with is building up on the list we are writing in front of us. Then you hear the ultimate challenge—“if you were a real engineer, you could figure out how to do this all with a pair of routers I can buy down at the local office supply store.”
Some problems just do not need to be solved in the current system. Some problems just need to have their own system built for them, rather than reusing the same old stuff because, well, “we can.”
The real engineer is the one who knows how to say “no.”
Nick Russo and I stopped by the Network Collective last week to talk about BGP traffic engineering—and in the process I confused BGP deterministic MED and always compare MED. I’ve embedded the video below.
Care should be taken to make sure that none of the BGP path attributes defined above can be modified through configuration when exchanging internal routing information between RRs and Clients and Non-Clients. Their modification could potentially result in routing loops. In addition, when a RR reflects a route, it SHOULD NOT modify the following path attributes: NEXT_HOP, AS_PATH, LOCAL_PREF, and MED. Their modification could potentially result in routing loops.
On first reading, this seems a little strange—how could modifying the next hop, Local Preference, or MED at a route reflector cause a routing loop? While contrived, the following network illustrates the principle.
Note the best path, from an IGP perspective, from C to E is through B, and the best path, from an IGP perspective, from B to D is through C. In this case, a route is advertised over eBGP from F towards E and D. These two eBGP speakers, in turn, advertise the route to their iBGP neighbors, B and C. Both B and C are route reflectors, so they both reflect the route on to A, which advertises the route to some other eBGP speaker outside AS65000 (not shown in the network diagram). In this case, assume the best path (for whatever reason) should be the route learned through D.
What happens if C changes the next hop for the route so it points to E rather than D? This should be fine, at first glance; when E receives traffic for the destination reachable through F, it will use the local eBGP route learned from F directly to forward the traffic. But there is a subtle problem here. Assume A receives both routes, one from B with a next hop of D, and one from C with a next hop of E. A, for whatever reason, chooses the path with a next hop of D. The best path to D, according to the IGP metrics, is through C, so A forwards the traffic to C.
C, however, has been configured to set the next hop to E through a local configuration. The best IGP path to E is through B, so C will forward the traffic towards B to be forwarded to E. B, however, has a next hop towards this destination of D, so when it receives packets destined beyond F in AS65001, it will examine its local routing table for the best path towards D, and find this is through C. Hence, B will forward the traffic to C to be forwarded towards D.
Thus a routing loop is formed because the best IGP path towards the next hop always points through another router with a next hop that points back to the router forwarding the traffic. The problem is B and C have inconsistent bestpaths, such that they each think the bestpath is through one another.
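The loop can be traced with a few lines of Python—a toy sketch, assuming the IGP and BGP next-hop tables described above (router names are from the diagram; everything else is hypothetical):

```python
# Toy trace of the forwarding described above: each router resolves its
# BGP next hop through its IGP best path toward that next hop.
igp_next_hop = {
    ("A", "D"): "C",  # A's best IGP path toward D is through C
    ("C", "E"): "B",  # C's best IGP path toward E is through B
    ("B", "D"): "C",  # B's best IGP path toward D is through C
}
bgp_next_hop = {"A": "D", "C": "E", "B": "D"}  # each router's chosen BGP next hop

def trace(start, max_hops=8):
    """Follow recursive next-hop resolution; stop when a router repeats."""
    path, router = [start], start
    for _ in range(max_hops):
        router = igp_next_hop[(router, bgp_next_hop[router])]
        if router in path:  # revisiting a router means a forwarding loop
            return path + [router]
        path.append(router)
    return path

print(trace("A"))  # ['A', 'C', 'B', 'C'] -- traffic cycles between B and C
```

The trace makes the inconsistency concrete: B and C each resolve their BGP next hop through the other, so the packet never escapes.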
This is, of course, an artifact of overlaying two different control planes, each with their own rules about how to determine a loop free path to any given destination. This sort of problem can arise with any pair of control planes overlaid in this way.
What about MED, Local Preference, or the AS Path? C could modify any of these while reflecting the route to cause E to be chosen as the best exit point locally, while B and A continue to choose D as the best exit point. Any of these, then, can be used to create a routing loop in this topology.
Again, this is a somewhat contrived example, but if a loop can be contrived, then it will likely show up in more complex (and not-so-contrived) networks in the real world. It would be much easier to create a loop with a hierarchical route reflector, or even by causing an inconsistent route advertisement on the AS edge (two different eBGP speakers advertising different paths to a given destination reachable through the local AS).
Russ, I read your latest blog post on BGP. I have been curious about another development. Specifically, is there still any work related to using BGP Flowspec in a similar fashion to RFC1998, in which a customer of a provider would be able to ask the provider to discard traffic using a flowspec rule at the provider edge? I saw these two drafts, which seem similar, but both appear defunct: BGP Flowspec-ORF https://www.ietf.org/proceedings/93/slides/slides-93-idr-19.pdf and BGP Flowspec Redirect https://tools.ietf.org/html/draft-ietf-idr-flowspec-redirect-ip-02.
This is a good question—to which there are two answers. The first is that this service does exist. While it’s not widely publicized, a number of transit providers do, in fact, offer the ability to send them a flowspec community which will cause them to set a filter on their end of the link. This kind of service is immensely useful for countering Distributed Denial of Service (DDoS) attacks, of course. The problem is such services are expensive. The one provider I have personal experience with charges per prefix, and the cost is high enough to make the service much less attractive.
Why would the cost be so high? The same reason a lot of providers do not filter for unicast Reverse Path Forwarding (uRPF) failures at scale—per packet filtering is very performance intensive, sometimes requiring recycling the packet in the ASIC. A line card normally able to support x customers without filtering may only be able to support x/2 customers with filtering. The provider has to pay for additional space, power, and configuration (the flowspec rules must be configured and maintained on the customer facing router). All of these things are costs the provider is going to pass on to their customers. The cost is high enough that I know very few (in fact, zero) network operators who will pay for this kind of service.
To understand the answer to these questions, it is important to tear down a major misconception about BGP. The misconception?
BGP is a routing protocol in the same sense as OSPF, IS-IS, or EIGRP.
BGP was not designed to be a routing protocol in the way other protocols were. It was designed to provide a loop free path through a series of independently operated networks, each with its own policy and business goals. In the sense that BGP provides a loop free route to a destination, it provides routing. But the “routing” it provides is largely couched in terms of explicit, rather than implicit, policy (see the note below). Loop free routes are not always the “shortest” path in terms of hop count, or the “lowest cost” path in terms of delay, or the “best available” path in terms of bandwidth, or anything else. This is why BGP relies on the AS Path to prevent loops. We call things “metrics” in BGP in a loose way, but they are really explicit expressions of policy.
Consider this: the primary policies anyone cares about in interdomain routing are: where do I want this traffic to exit my AS, and where do I want this traffic to enter my AS? The Local Preference is an expression of where traffic to this particular destination should exit this AS. The Multi-Exit Discriminator (MED) is an expression of where this AS would like to receive traffic being forwarded to this destination. Everything other than these is just a tie breaker. Everything else we do to try to influence the path of traffic into and out of an AS—messing with the AS Path, for instance—is a hack. If you can get this pair of “things people really care about” into your head, the BGP bestpath process, and much of the routing that goes on in the DFZ, makes a lot more sense.
It really is that simple.
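As a rough sketch of this hierarchy—explicit policy first, tie breakers last—consider the following Python fragment. This is a simplified stand-in for the real bestpath process, not any vendor’s implementation; the attribute names are illustrative:

```python
# Hypothetical sketch of the core of the bestpath decision: local
# preference (exit policy) first, MED (the neighbor's entry policy) next,
# and only then tie breakers such as IGP cost.
def better(a, b):
    """Return the preferred of two paths, represented as plain dicts."""
    # Highest local preference wins: this AS's explicit exit policy.
    if a["local_pref"] != b["local_pref"]:
        return a if a["local_pref"] > b["local_pref"] else b
    # Lowest MED wins, but only when both paths come from the same
    # neighboring AS: that AS's explicit entry policy.
    if a["neighbor_as"] == b["neighbor_as"] and a["med"] != b["med"]:
        return a if a["med"] < b["med"] else b
    # Everything from here down is a tie breaker, IGP cost among them.
    return a if a["igp_cost"] <= b["igp_cost"] else b

exit_1 = {"local_pref": 200, "med": 10, "neighbor_as": 65001, "igp_cost": 20}
exit_2 = {"local_pref": 100, "med": 0,  "neighbor_as": 65001, "igp_cost": 5}
assert better(exit_1, exit_2) is exit_1  # policy beats every tie breaker
```

Note that exit_1 wins despite losing on both MED and IGP cost—policy first, tie breakers last.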
How does this relate to the problem of replacing BGP? There are several things you could improve about BGP, but automatic metrics are not one of them. There are, in fact, already “automatic metrics” in BGP, but “automatic metrics” like the IGP cost are tie breakers. A tie breaker is a convenient stand-in for what the protocol designer and/or implementor thinks the most natural policy should be. Whether they are right or wrong in a specific situation is a… guess.
What about something like the RPKI? The RPKI is not going to help in most situations where a human makes a mistake in a transit provider. It would help with transit edge failures and hijacks, but these are a different class of problem. You could ask for BGPsec to counter these problems, of course, but BGPsec would likely cause more problems than it solves (I’ve written on this before, here, here, here, here, and here, to start; you can find a lot more on rule11 by following this link).
Given replacing the metrics is not a possibility, and RPKI is only going to get you “so far,” what else can be done? There are, in fact, several practical steps that could be taken.
You could specify that BGP implementations should, by default, only advertise routes if there is some policy configured. Something like, say… RFC8212?
Maybe we could also stop trying to use BGP as the trash can of the Internet, throwing anything we don’t know what else to do with in there. We’ve somehow forgotten the old maxim that a protocol is not done until we have removed everything that is not needed. Now our mantra seems to be “the protocol isn’t done until it solves every problem anyone has ever thought of.” We just keep throwing junk at BGP as if it is the abominable snowman—we assume it’ll bounce when it hits bottom. Guess what: it’s not, and it won’t.
Replacing BGP is not realistic—nor even necessary. Maybe it is best to put it this way:
BGP expresses policy
Policy is messy
Therefore, BGP is messy
We definitely need to work towards building good engineers and good tools—but replacing BGP is not going to “solve” either of these problems.
P.S. I have differentiated between “metrics” and “policy” here—but metrics can be seen as an implicit form of policy. Choosing the highest bandwidth path is a policy. Choosing the path with the shortest hop count is a policy, too. The shortest path (for some meaning of “shortest”) will always be provably loop free, so it is a useful way to always choose a loop free path in the face of simple, uniform, policies. But BGP doesn’t live in the world of simple uniform policies; it lives in the world of “more than one metric.” BGP lives in a world where different policies not only overlap, but directly compete. Computing a path with more than one metric is provably at least bistable, and often completely unstable, no matter what those metrics are.
From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Inter-Domain Routing (SIDR) working group, is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.
In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.
To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.
Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.
But set this problem aside for a moment, and consider how this actually works. First, before the signatures, B could have sent a single update to multiple peers; after the signatures, each peer must receive its own update. One of the primary ways BGP increases performance is by gathering updates together and sending one update whenever possible, using either a peer group or an update group. Worse yet, every reachable destination—NLRI—must now be carried in its own update. So this means no packing and no peer groups. The signatures themselves must be added to the update packets as well, which means they must be stored, carried across the wire, etc.
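A rough sketch of the packing argument, with made-up numbers—the prefixes, attribute sets, and peer count here are all hypothetical:

```python
# Unsigned, NLRIs sharing one attribute set ride in a single update sent
# to a whole peer group; signed, every NLRI carries a peer-specific
# signature, so each needs its own update, per peer.
from collections import defaultdict

nlris = [
    ("2001:db8:1::/48", "attr-set-1"),
    ("2001:db8:2::/48", "attr-set-1"),  # same attributes: packable
    ("2001:db8:3::/48", "attr-set-2"),
]
peers = 4  # one peer group, unsigned

packed = defaultdict(list)
for prefix, attrs in nlris:
    packed[attrs].append(prefix)

updates_unsigned = len(packed)       # one update per attribute set: 2
updates_signed = len(nlris) * peers  # one per NLRI, per peer: 12
print(updates_unsigned, updates_signed)
```

Even at this toy scale the update count goes from 2 to 12; across a full DFZ table and a realistic peer count, the multiplier is far worse.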
The general assumption in the BGPsec community is that the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved, is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.
But let’s lay aside this problem for a moment, and just assume every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved, right? Not so fast…
Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.
First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?
Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.
Third, you could put a timer in the BGP update, much like a link state update; once the timer runs out, the advertisement must be replaced. Given there are around 800k routes in the default free zone, a timer of 24 hours (which would still make me uncomfortable in security terms) would add roughly 33,000 updates per hour to the load of every router in the Internet—on top of the reduced performance noted above.
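The arithmetic is simple back-of-the-envelope math, using the round numbers above:

```python
# Refresh load added by an expiry timer on signed updates, assuming
# ~800k routes in the default-free zone and a 24-hour timer.
routes = 800_000
timer_hours = 24
updates_per_hour = routes / timer_hours
print(f"{updates_per_hour:,.0f} extra updates per hour at every router")
```

And that is steady-state load, before any actual topology change happens.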
Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?
Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customers’ trust, and their ability to outcompete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.
Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.
Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, but also the public keys for B and F. This is going to be messy. The second is: by examining the public keys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.
The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even granting BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no-one really wants exposed. The tradeoffs are ultimately unacceptable.
Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.
It seems, to me, that these concepts of longevity have the entire situation precisely backwards. These ideas of “car length longevity” and “future proof hardware” are looking at the network from the perspective of an appliance, rather than from the perspective of a set of services. Let me put this in a little bit of context by considering two specific examples.
In terms of cars, I have owned four in the last 31 years. I owned a Jeep Wrangler for 13 years, a second Jeep Wrangler for 8 years, and a third Jeep Wrangler for 9 years. I have recently switched to a Jeep Cherokee, which I’ve just about reached my first year driving.
What if I bought network equipment like I buy cars? What sort of router was available 9 years ago? That is 2008. I was still working at Cisco, and my lab, if I remember right, was made up of 7200’s and 2600’s. Younger engineers probably look at those model numbers and see completely different equipment than what I actually had; I doubt many readers of this blog ever deployed 7200’s of the kind I had in my lab in their networks. Do I really want to run a network today on 9 year old hardware? I don’t see how the answer to that question can be “yes.” Why?
First, do you really know what hardware capacity you will need in ten years? Really? I doubt your business leaders can tell you what products they will be creating in ten years beyond a general description, nor can they tell you how large the company will be, who their competitors will be, or what shifts might occur in the competitive landscape.
Hardware vendors try to get around this by building big chassis boxes, and selling blades that will slide into them. But does this model really work? The Cisco 7500 was the current chassis box 9 years ago, I think—even if you could get blades for it today, would it meet your needs? Would you really want to pay the power and cooling for an old 7500 for 9 years because you didn’t know if you would need one or seven slots nine years ago?
Building a hardware platform for ten years of service in a world where two years is too far to predict is like rearranging the chairs on the Titanic. It’s entertaining, perhaps, but it’s pretty pointless entertainment.
Second, why are we not taking the lessons of the compute and storage worlds into our thinking, and learning to scale out, rather than scaling up? We treat our routers like the server folks of yore—add another blade slot and make it go faster. Scale up makes your network do this—
Do you see those grey areas? They are costing you money. Do you enjoy defenestrating money?
These are symptoms of looking at the network as a bunch of wires and appliances, as hardware with a little side of software thrown in.
What about the software? Well, it may be hard to believe, but pretty much every commercial operating system available for routers today is an updated version of software that was available ten years ago. Some, in fact, are more than twenty years old. We don’t tend to see this, because we deploy routers and switches as appliances, which means we treat the software as just another form of hardware. We might deploy ten to fifteen different operating systems in our network without thinking about it—something we would never do in our data centers, or on our desktop computers.
So what this appliance based way of looking at things emphasizes is this: buy enough hardware to last you ten years, and treat the software as fungible—a second tier player that is a simple enabler for the expensive bits, the hardware. The problem with this view of things is it simply ignores reality. We need to reverse our thinking.
Software is the actual core of the network, not hardware.
If you look at the entire networking space from a software centric perspective, you can think a lot differently. It doesn’t matter what hardware you buy; what matters is what software it runs. This is the revolutionizing observation of white box, bright box, and disaggregated networking. Hardware is cheap, software is expensive. Hardware is CAPEX, software is OPEX. Hardware only loosely interacts with business and operations; software interacts with both.
The appliance model, and the idea of buying big iron like a car, is hampering the growth and usefulness of networks in real businesses. It is going to take a change to realize that most of us care much less about hardware than software in our daily lives, and to transfer this thinking to the network engineering realm.
It is time for a new way of looking at the network. A router is not a car, nor is it a cell phone. It is a router, and it deserves its own way of looking at value. The value is in connecting the software to the business, and the hardware to the speeds and feeds. These are separate problems which the appliance model ties into a single “thing.” This makes the appliance world bad for businesses, bad for network design, and bad for network engineers.
It’s time to rethink the way we look at network engineering to build networks that are better for business, to adjust our idea of future proof to mean a software based system that can be used across many generations of hardware, while hardware becomes a “just in time” component used and recycled as needs must.
Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.
After Daniel Walton visited the History of Networking at the Network Collective, I went back and poked at BGP permanent route oscillations just to refresh my memory. Since I spent the time, I thought it was worth a post, with some observations. When working with networking problems, it is always wise to begin with a network, so…
For those who are interested, I’m pretty much following RFC3345 in this explanation.
There are two BGP route reflectors here, in two different clusters, labeled A and D. The metric for each link is listed on the links between the RR clients, B, C, and E, and the RRs; the cost of the link between the RRs is 1. A single route, 2001:db8:3e8:100::/64, is being advertised with an AS path of the same length from three different eBGP peering points, each with a different MED. E is receiving the route with a MED of 0, C with a MED of 1, and B with a MED of 10.
Starting with A, walk through one cycle of the persistent oscillation. At A there are two routes—
edge MED IGP Cost
C 1 4
B 10 5 (BEST)
When A runs the bestpath calculation, it will determine the best path should be through C, rather than B (because the IGP cost is lower; the MEDs are not compared due to different AS paths), so it will send an update to each of its peers about this change in its best path. This results in D having the following table—
edge MED IGP Cost
E 0 12 (BEST)
C 1 5
When D receives this update from A, it will calculate the best path towards 2001:db8:3e8:100::/64, and choose the path through E because this path has the lowest MED (the MEDs are compared here because the AS path is the same). On completing the best path calculation, D will send an update to its peers, letting them know about its new best path to the destination, which primarily means A. On receiving this update, A has three routes to the destination in its table—
edge MED IGP Cost
E 0 13
C 1 4
B 10 5 (BEST)
The key point to think through here is why the third route in the table is best. The BGP bestpath process compares each pair of routes, starting with the first pair. So the first two routes are compared, the best of those two is chosen and compared with the third, the best of these two is compared with the fourth, etc., until the end of the table is reached. Here the first two paths are compared, and the path through E wins because of the lower MED (the AS path is the same). When the path through E is compared to the path through B, however, the path through B wins because the IGP metric is lower (the AS path is different).
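This pairwise scan can be sketched in a few lines of Python, using the values from the table. The neighbor AS numbers here are assumed labels; all that matters is that E and C share a neighboring AS while B does not:

```python
# A's three paths toward 2001:db8:3e8:100::/64, from the table above.
paths = [
    {"edge": "E", "med": 0,  "igp": 13, "neighbor_as": 65001},
    {"edge": "C", "med": 1,  "igp": 4,  "neighbor_as": 65001},
    {"edge": "B", "med": 10, "igp": 5,  "neighbor_as": 65002},
]

def better(a, b):
    # MED is compared only between paths from the same neighboring AS...
    if a["neighbor_as"] == b["neighbor_as"] and a["med"] != b["med"]:
        return a if a["med"] < b["med"] else b
    # ...otherwise the lower IGP cost wins (a stand-in for the later
    # tie breakers in the full bestpath process).
    return a if a["igp"] <= b["igp"] else b

# Pairwise scan: winner of each comparison meets the next path.
best = paths[0]
for p in paths[1:]:
    best = better(best, p)
print(best["edge"])  # E beats C on MED, then B beats E on IGP cost: B
```

Note the comparison is not transitive: head-to-head, C would beat B on IGP cost. That non-transitivity is exactly why removing C’s path from a speaker’s table can flip its decision, which is the engine of the oscillation.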
At this point, A will now send an update to its peers, specifically D, informing D of this change in its best path. Note this update removes any information about the path through C from D’s table, so D only has a partial view of the available paths. As a result, D will have the following in its table—
edge MED IGP Cost
E 0 12
B 10 6 (BEST)
D will select the path through B because the IGP cost is lower, and send an update to A with this new information. This will result in A having the table this process started with—
edge MED IGP Cost
C 1 4
B 10 5 (BEST)
What is interesting about this is the removal of information about the path through C from D’s view of the network. Essentially, what is happening here is D is switching between two different views of the network topology, one of which includes B, the other of which includes C. The reason the ADD_PATH extension solves this problem is that A and D both have a full view of every exit point once each BGP speaker sends every route to each destination, rather than just the best path.
This is, in effect, another instance of the inconsistency of a distributed database causing a persistent condition in a control plane. In (loosely!) CAP theorem terms, distributed routing protocols always choose availability (the local device can read the database to calculate loop free paths) and partition tolerance (the database is copied to every device speaking the protocol) over consistency—eventually consistent, or “not always consistent,” databases will always be the result of such a choice. As A and D read their databases, each of which contains incomplete information about the real state of the network, they will make different decisions about what the best path to the destination in question is. As they each change their views of the topology, they will send updated information to one another, causing the other BGP speaker to recompute its view of the topology, and…
Persistent BGP oscillation is an interesting study in the way consistency impacts distributed routing protocol design and convergence.
There are—in theory—three ways BGP can be deployed within a single AS. You can deploy a full mesh of iBGP peers; this might be practical for a small’ish deployment (say, fewer than 10 speakers), but it quickly becomes a management problem in larger, or constantly changing, deployments. You can deploy multiple BGP confederations, creating internal autonomous systems that are invisible to the world because the internal AS numbers are stripped at the true eBGP edge.
The third solution is (probably) the only solution anyone reading this has deployed in a production network: route reflectors. A quick review might be useful to set the stage.
In this diagram, B and E are connected to eBGP peers, each of which is advertising a different destination; F is advertising the 100::/64 prefix, and G is advertising the 101::/64 prefix. Assume A is the route reflector, and B, C, D, and E are route reflector clients. What happens when F advertises 100::/64 to B?
B receives the route and advertises it through iBGP to A
A adds its router ID to the cluster list, and reflects the route to C, D, and E
E receives this route and advertises it through its eBGP session towards G
C does not advertise 100::/64 towards D, because D is an iBGP peer (not configured as a route reflector)
D does not advertise 100::/64 towards C, because C is an iBGP peer (not configured as a route reflector)
Even if D did readvertise the route towards C, and C back towards A, A would reject the route because its router ID is in the cluster list. Although the improper use of route reflectors can get you into a lot of trouble, the usage depicted here is fairly simple. Here A will only have one path towards 100::/64, so it will only have one possible path across which to run the BGP bestpath calculation.
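A minimal sketch of the cluster list check, assuming a simple dict-based representation of a route (hypothetical, not any implementation’s data structure):

```python
# A route reflector prepends its cluster ID when reflecting, and rejects
# any incoming route that already carries its own cluster ID.
def reflect(route, cluster_id):
    if cluster_id in route["cluster_list"]:
        return None  # our own cluster ID: the route has looped back
    return {**route, "cluster_list": [cluster_id] + route["cluster_list"]}

route = {"prefix": "100::/64", "cluster_list": []}
reflected = reflect(route, "A")          # A reflects toward C, D, and E
assert reflected["cluster_list"] == ["A"]
assert reflect(reflected, "A") is None   # readvertised back to A: rejected
```

This is the same loop prevention idea as the AS Path in eBGP, applied inside the AS.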
The case of 101::/64 is a little different, however. The oddity here is the link metrics. In this network, A is going to receive two routes towards 101::/64, through D and E. Assuming all other things are equal (such as the local preference), A will choose the path to the speaker within the AS with the lowest IGP metric. Hence A will choose the path through E, advertising this route to B, C, and D. What if A were not a route reflector? If every router within the AS were part of an iBGP full mesh, what would happen? In this case:
B would receive two routes to 101::/64, one from D with an IGP metric of 30, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, B will choose the path through E to reach 101::/64.
C would receive two routes to 101::/64, one from D with an IGP metric of 10, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, C will choose the path through D to reach 101::/64.
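The IGP-metric tiebreaker driving both decisions can be sketched in a few lines of Python, assuming all higher-precedence bestpath attributes (local preference, AS path length, MED, and so on) are equal; the structures are illustrative only.

```python
# Sketch of the IGP-metric tiebreaker in the BGP bestpath process.
def best_path(paths):
    # With all other attributes equal, prefer the lowest IGP metric
    # to the iBGP next hop.
    return min(paths, key=lambda p: p["igp_metric"])

# Metrics from the text: B sees D at 30 and E at 20; C sees D at 10 and E at 20.
b_paths = [{"exit": "D", "igp_metric": 30}, {"exit": "E", "igp_metric": 20}]
c_paths = [{"exit": "D", "igp_metric": 10}, {"exit": "E", "igp_metric": 20}]
print(best_path(b_paths)["exit"])   # E: B's closest exit
print(best_path(c_paths)["exit"])   # D: C's closest exit
```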
Inserting the route reflector, A, into the network does not change the best path to 101::/64 from the perspective of B, but it does change C’s best path from D to E. How can the shortest path be restored in the network? The State/Optimization/Surface (SOS) three-way tradeoff tells us there are two possible solutions—either the state removed by the route reflector must be restored into BGP, or some interaction surface needs to be enabled between BGP and some other system in the network that has the information required to restore optimal routing.
The first of these two options, restoring the state removed through route reflection, is represented by two different solutions, one of which can be considered a subset of the other. The first solution is for the route reflector, A, to send all the routes to 101::/64 to every route reflector client. This is called add paths, and is documented in RFC7911. The problem with this solution is the amount of additional state.
A second option is to provide some set of paths beyond the best path to each client, but not the entire set of paths. This solution still attacks the suboptimal routing problem by adding state that was removed through the reflection process. In this case, however, rather than adding back all the state, a subset of state is added back. The state added back is normally the second best path, which provides enough information to re-optimize the network while remaining minimal enough not to overwhelm BGP.
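A rough sketch of this “best plus backup” selection, with made-up path structures, might look like this:

```python
# Hypothetical "additional paths" selection: advertise the best path plus
# one backup, rather than every known path.
def paths_to_advertise(paths, n=2):
    # Keep only the n lowest-metric paths (best path plus n-1 backups).
    return sorted(paths, key=lambda p: p["igp_metric"])[:n]

paths = [{"exit": "D", "igp_metric": 30},
         {"exit": "E", "igp_metric": 20},
         {"exit": "F", "igp_metric": 50}]
print([p["exit"] for p in paths_to_advertise(paths)])  # ['E', 'D']
```

The tradeoff is visible in the `n` parameter: larger values restore more optimality at the cost of more state on every client.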
What about the other option—allowing BGP to interact with some other system that has the information required to tell BGP specifically which state will allow the route reflector clients to compute the optimal path through the network? This third solution is described in BGP Optimal Route Reflection (BGP-ORR). To understand this solution, begin by asking: why does removing BGP advertisements from the control plane cause suboptimal routing? The answer to this question is: because the route reflector client does not have all the available routes, it cannot compare the IGP metric of every path in order to determine the shortest path.
In other words, C actually has two paths to 101::/64, one through A and another through D. If C knew about these two paths, it could compare the two IGP costs, through A and through D, and choose the closest exit point out of the AS. What other router in the network has all the relevant information? The route reflector—A. If a link state IGP is being used in this network, A can calculate the shortest path from C to both of the potential exit points, D and E. Further, because it is the route reflector, A knows about both of the routes to reach 101::/64. Hence, A can compute the best path as C would compute it, taking into account the IGP metric for both exit points, and send C the route it knows the BGP best path process on C will choose anyway. This is exactly what BGP Optimal Route Reflection (BGP-ORR) describes.
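A toy model of what ORR asks the route reflector to do: the cost matrix below stands in for the RR’s SPF calculation over the link-state database, and the names are invented for illustration.

```python
# Toy BGP-ORR model: the route reflector computes, per client, the best
# exit using that client's IGP costs rather than its own.
igp_cost = {("B", "D"): 30, ("B", "E"): 20,
            ("C", "D"): 10, ("C", "E"): 20}
exits = ["D", "E"]  # the two speakers with a route to 101::/64

def orr_best(client):
    # Pick the exit the client itself would choose, based on its own
    # shortest-path costs (computed by the RR from the LSDB).
    return min(exits, key=lambda x: igp_cost[(client, x)])

print(orr_best("C"))  # D: C's optimal exit, which plain reflection hid
print(orr_best("B"))  # E: unchanged from the plain route reflector case
```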
Hopefully this short tour through BGP route reflection, the problem route reflection causes by removing state from the network, and the potential solutions, is useful in understanding the various drafts and solutions being proposed.
In the first post on DDoS, I considered some mechanisms to disperse an attack across multiple edges (I actually plan to return to this topic with further thoughts in a future post). The second post considered some of the ways you can scrub DDoS traffic. This post is going to complete the basic lineup of reacting to DDoS attacks by considering how to block an attack before it hits your network—upstream.
The key technology in play here is flowspec, a mechanism that can be used to carry packet level filter rules in BGP. The general idea is this—you send a set of specially formatted communities to your provider, who then automagically uses those communities to create filters at the inbound side of your link to the ‘net. There are two parts to the flowspec encoding, as outlined in RFC5575bis, the match rule and the action rule. The match rule is encoded as shown below—
There are a wide range of conditions you can match on. The source and destination addresses are pretty straightforward. For the IP protocol and port numbers, the operator sub-TLVs allow you to specify a set of conditions to match on, and whether to AND the conditions (all conditions must match) or OR the conditions (any condition in the list may match). Ranges of ports, greater than, less than, greater than or equal to, less than or equal to, and equal to are all supported. Fragments, TCP header flags, and a number of other header fields can be matched on, as well.
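As a rough illustration (not the on-the-wire encoding), here is how a flowspec-style operator list with AND/OR bits might be evaluated; the term format is invented for readability.

```python
# Illustrative evaluation of a flowspec-style operator list: each term is
# (and_bit, operator, value); a term ORs with the running result unless
# its and_bit is set, in which case it ANDs.
import operator

OPS = {"eq": operator.eq, "ge": operator.ge, "le": operator.le,
       "gt": operator.gt, "lt": operator.lt}

def match(value, terms):
    result = False
    for and_bit, op, operand in terms:
        term = OPS[op](value, operand)
        result = (result and term) if and_bit else (result or term)
    return result

# Match ports 1000-2000: "ge 1000" AND "le 2000"
terms = [(False, "ge", 1000), (True, "le", 2000)]
print(match(1500, terms))  # True
print(match(2500, terms))  # False
```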
Once the traffic is matched, what do you do with it? There are a number of rules, including—
Controlling the traffic rate in either bytes per second or packets per second
Redirect the traffic to a VRF
Mark the traffic with a particular DSCP bit
Filter the traffic
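The actions above map onto extended communities defined in RFC5575; the type codes below are the real assigned values, while the descriptions are my paraphrase of the action list.

```python
# Flowspec "action" extended communities from RFC 5575, keyed by their
# assigned type codes; descriptions paraphrase the actions listed above.
ACTIONS = {
    0x8006: "traffic-rate: limit in bytes/sec (a rate of 0 drops the traffic)",
    0x8007: "traffic-action: sampling and terminal-action flags",
    0x8008: "redirect: move matching traffic into a VRF (by route target)",
    0x8009: "traffic-marking: rewrite the DSCP value",
}
print(ACTIONS[0x8008])
```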
If you think this must be complicated to encode, you are right. That’s why most implementations allow you to set pretty simple rules, and handle all the encoding bits for you. Given flowspec encoding, you should just be able to detect the attack, set some simple rules in BGP, send the right “stuff” to your provider, and watch the DDoS go away. …right… If you have been in network engineering for longer than “I started yesterday,” you should know by now that nothing is ever that simple.
If you don’t see a tradeoff, you haven’t looked hard enough.
First, from a provider’s perspective, flowspec is an entirely new attack surface. You cannot let your customer just send you whatever flowspec rules they like. For instance, what if your customer sends you a flowspec rule that blocks traffic to one of your DNS servers? Or, perhaps, to one of their competitors? Or even to their own BGP session? Most providers, to prevent these types of problems, will only apply any flowspec initiated rules to the port that connects to your network directly. This protects the link between your network and the provider, but there is little way to prevent abuse if the provider allows these flowspec rules to be implemented deeper in their network.
Second, filtering costs money. This might not be obvious at a single link scale, but when you start considering how to filter multiple gigabits of traffic based on deep packet inspection sorts of rules—particularly given the ability to combine a number of rules in a single flowspec filter rule—filtering requires a lot of resources during the actual packet switching process. There is a limited number of such resources on any given packet processing engine (ASIC), and a lot of customers who are likely going to want to filter. Since filtering costs the provider money, they are most likely going to charge for flowspec, limit which customers can send them flowspec rules (generally grounded in the provider’s perception of the customer’s cluefulness), and even limit the number of flowspec rules that can be implemented at any given time.
There is plenty of further reading out there on configuring and using flowspec, and it is likely you will see changes in the way flowspec is encoded in the future. Some great places to start are—
One final thought as I finish this post off. You should not just rely on technical tools to block a DDoS attack upstream. If you can figure out where the DDoS is coming from, or track it down to a small set of source autonomous systems, you should find some way to contact the operator of the AS and let them know about the DDoS attack. This is something Mara and I will be covering in an upcoming webinar over at ipspace.net—watch for more information on this as we move through the summer.
Your first line of defense to any DDoS, at least on the network side, should be to disperse the traffic across as many resources as you can. Basic math implies that if you have fifteen entry points, and each entry point is capable of supporting 10g of traffic, then you should be able to simply absorb a 100g DDoS attack while still leaving 50g of overhead for real traffic (assuming perfect efficiency, of course—YMMV). Dispersing a DDoS in this way may impact performance—but taking bandwidth and resources down is almost always the wrong way to react to a DDoS attack.
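The math here is simple enough to check directly:

```python
# Back-of-the-envelope version of the dispersal math above.
entry_points = 15
capacity_per_entry_g = 10   # gbit/s per entry point
attack_g = 100              # total DDoS volume in gbit/s

total_capacity = entry_points * capacity_per_entry_g  # 150g total
per_entry_attack = attack_g / entry_points            # ~6.7g per entry
headroom = total_capacity - attack_g                  # 50g for real traffic
print(total_capacity, round(per_entry_attack, 1), headroom)
```

Assuming, as the text notes, perfect dispersal efficiency—real attacks rarely spread themselves so evenly.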
But what if you cannot, for some reason, disperse the attack? Maybe you only have two edge connections, or the size of the DDoS is larger than your total edge bandwidth combined. It is typically difficult to mitigate a DDoS attack, but there is an escalating chain of actions you can take that often prove useful. Let’s deal with local mitigation techniques first, and then consider some fancier methods.
TCP SYN filtering: A lot of DDoS attacks rely on exhausting TCP open resources. If all inbound TCP sessions can be terminated in a proxy (such as a load balancer), the proxy server may be able to screen out half open and poorly formed TCP open requests. Some routers can also be configured to hold TCP SYNs for some period of time, rather than forwarding them on to the destination host, in order to block half open connections. This type of protection can be put in place long before a DDoS attack occurs.
Limiting Connections: It is likely that DDoS sessions will be short lived, while legitimate sessions will be longer lived. The difference may be a matter of seconds, or even milliseconds, but it is often enough to be a detectable difference. It might make sense, then, to prefer existing connections over new ones when resources start to run low. Legitimate users may wait longer to connect when connections are limited, but once they are connected, they are more likely to remain connected. Application design is important here, as well.
Aggressive Aging: In cache based systems, one way to free up depleted resources quickly is to simply age them out faster. The length of time a connection can be held open can often be dynamically adjusted in applications and hosts, allowing connection information to be removed from memory faster when there are fewer connection slots available. Again, this might impact live customer traffic, but it is still a useful technique when in the midst of an actual attack.
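A hypothetical aging policy along these lines might look like the sketch below; the function name, base timeout, and floor are all made up for illustration.

```python
# Hypothetical aggressive-aging policy: scale the idle timeout down as
# free connection slots are depleted.
def idle_timeout(free_slots, total_slots, base=300, floor=5):
    """Return an idle timeout (seconds) proportional to free capacity,
    never dropping below a small floor."""
    return max(floor, base * free_slots / total_slots)

print(idle_timeout(900, 1000))  # 270.0: plenty of slots, near the base
print(idle_timeout(50, 1000))   # 15.0: under pressure, age much faster
```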
Blocking Bogon Sources: While there is a well known list of bogon addresses—address blocks that should never be routed on the global ‘net—these lists should be taken as a starting point, rather than as an ending point. Constant monitoring of traffic patterns on your edge can give you a lot of insight into what is “normal” and what is not. For instance, if your highest rate of traffic normally comes from South America, and you suddenly see a lot of traffic coming from Australia, either you’ve gone viral, or this is the source of the DDoS attack. It isn’t always useful to block all traffic from a region, or a set of source addresses, but it is often useful to use the techniques listed above more heavily on traffic that doesn’t appear to be “normal.”
There are, of course, other techniques you can deploy against DDoS attacks—but at some point, you are just not going to have the expertise or time to implement every possible counter. This is where appliance and service (cloud) based services come into play. There are a number of appliance based solutions out there to scrub traffic coming across your links, such as those made by Arbor. The main drawback to these solutions is they scrub the traffic after it has passed over the link into your network. This problem can often be resolved by placing the appliance in a colocation facility and directing your traffic through the colo before it reaches your inbound network link.
There is one open source DDoS scrubbing option in this realm, as well, which uses a combination of FastNetMon, InfluxDB, Grafana, Redis, Morgoth, and Bird to create a solution you can run locally on a spun-up VM, or even bare metal on a self built appliance wired in between your edge router and the rest of the network (in the DMZ). This option is well worth looking at, if not to deploy, then to better understand how the kind of dynamic filtering performed by commercially available appliances works.
If the DDoS must be stopped before it reaches your edge link, and you simply cannot handle the volume of the attacks, then the best solution might be a cloud based filtering solution. These tend to be expensive, and they also tend to increase latency for your “normal” user traffic in some way. The way these normally work is the DDoS provider advertises your routes, or redirects your DNS address to their servers. This draws all your inbound traffic into their network, where it is scrubbed using advanced techniques. Once the traffic is scrubbed, it is either tunneled or routed back to your network (depending on how it was captured in the first place). Most large providers offer scrubbing services, and there are several public offerings available independent of any upstream you might choose (such as Verisign’s line of services).
Distributed Denial of Service is a big deal—huge pools of Internet of Things (IoT) devices, such as security cameras, are being compromised by botnets and used for large scale DDoS attacks. What are the tools in hand to fend these attacks off? The first misconception is that you can actually fend off a DDoS attack. There is no magical tool you can deploy that will allow you to go to sleep every night thinking, “tonight my network will not be impacted by a DDoS attack.” There are tools and services that deploy various mechanisms that will do the engineering and work for you, but there is no complete solution for DDoS attacks.
One such reaction tool is spreading the attack. In the network below, the network under attack has six entry points.
Assume the attacker has IoT devices scattered throughout AS65002 which they are using to launch an attack. Due to policies within AS65002, the DDoS attack streams are being forwarded into AS65001, and thence to A and B. It would be easy to shut these two links down, forcing the traffic to disperse across five entries rather than two (B, C, D, E, and F). By splitting the traffic among five entry points, it may be possible to simply eat the traffic—each flow is now less than one half the size of the original DDoS attack, perhaps small enough for the servers at these entry points to discard the DDoS traffic.
However—this kind of response plays into the attacker’s hand, as well. Now any customer directly attached to AS65001, such as G, will need to pass through AS65002, from whence the attacker has launched the DDoS, and enter into the same five entry points. How happy do you think the customer at G would be in this situation? Probably not very…
Is there another option? Instead of shutting down these two links, it would make more sense to try to reduce the volume of traffic coming through the links and leave them up. To put it more simply—if the DDoS attack is reducing the total amount of available bandwidth you have at the edge of your network, it does not make a lot of sense to reduce the available amount of bandwidth at your edge in response. What you want to do, instead, is reapportion the traffic coming in to each edge so you have a better chance of allowing the existing servers to simply discard the DDoS attack.
One possible solution is to prepend the AS path of the anycast address being advertised from one of the service instances. Here, you could add one prepend to the route advertisement from C, and check to see if the attack traffic is spread more evenly across the three sites. As we’ve seen in other posts, however, this isn’t always an effective solution (see these three posts). Of course, if this is an anycast service, we can’t really break up the address space into smaller bits. So what else can be done?
There is a way to do this with BGP—using communities to restrict the scope of the routes being advertised by A and B. For instance, you could begin by advertising the routes to the destinations under attack towards AS65001 with the NO_PEER community. Given that AS65002 is a transit AS (assume it is for this exercise), AS65001 would accept the routes from A and B, but would not advertise them towards AS65002. This means G would still be able to reach the destinations behind A and B through AS65001, but the attack traffic would still be dispersed across five entry points, rather than two. There are other mechanisms you could use here; specifically, some providers allow you to set a community that tells them not to advertise a route towards a specific AS, whether that AS is a peer or a customer. You should consult with your provider about this, as every provider uses a different set of communities, formatted in slightly different ways—your provider will probably point you to a web page explaining their formatting.
If NO_PEER does not work, it is possible to use NO_ADVERTISE, which blocks the advertisement of the destinations under attack to any of AS65001’s connections of whatever kind. G may well still be able to use the connections to A and B from AS65001 if it is using a default route to reach the Internet at large.
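As a sketch, the well-known community values involved (NO_EXPORT and NO_ADVERTISE from RFC1997, NOPEER from RFC3765) could be attached to an advertisement like this; the `advertise` function and prefix are purely illustrative, and your provider’s own community scheme will differ.

```python
# Well-known BGP community values (RFC 1997 and RFC 3765).
NO_EXPORT    = 0xFFFFFF01  # do not advertise outside the AS/confederation
NO_ADVERTISE = 0xFFFFFF02  # do not advertise to any peer at all
NO_PEER      = 0xFFFFFF04  # do not re-advertise across bilateral peerings

def advertise(prefix, communities):
    """Illustrative stand-in for building a BGP UPDATE with communities."""
    return {"prefix": prefix, "communities": communities}

update = advertise("2001:db8::/48", [NO_PEER])
print(f"{update['communities'][0]:08x}")  # ffffff04
```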
It is, of course, possible to automate this reaction through a set of scripts—but as always, it is important to keep a short leash on such scripts. Humans need to be alerted to either make the decision to use these communities, or to continue using these communities; it is too easy for a false positive to lead to a real problem.
Of course, this sort of response is also not possible for networks with just one or two connection points to the Internet.
But in all cases, remember that shutting down links in the face of a DDoS is rarely a real solution. You do not want to reduce your available bandwidth when you are under an attack specifically designed to exhaust available bandwidth (or other resources). Rather, if you can, find a way to disperse the attack.
P.S. Yes, I have covered this material before—but I decided to rebuild this post with more in depth information, and to use it to kick off a small series on DDoS protection.
While Flowspec has been around for a while (RFC5575 was published in 2009), deployment across AS boundaries has been somewhat slow. The primary concern in deploying Flowspec is the ability to shoot oneself in the foot; in particular, opening Flowspec to customers can also open an entirely new, and not well understood, attack surface. Often Flowspec is only offered to customers the provider considers technically competent, to reduce this risk, or Flowspec filtering is not offered at all.
A second concern is the simple cost of filtering packets. In theory, ASICs can filter packets based on a variety of parameters cheaply. Theory doesn’t always easily translate to practice, however. It is often the case that filtering packets is not a cheap operation, even in the ASIC. The kind and granularity of the filter, and how it is applied, can make a big difference in the cost of implementation.
Regardless, recent work in Flowspec is quite interesting; particularly the ability to redirect flows, rather than simply filtering them. Of course, the original RFCs did allow for the redirection of flows into a VRF on the local router, but this leaves a good bit to be desired. To make such a system work, you must actually have a VRF into which to redirect traffic; for one-off situations, such as directing attack traffic to a honey pot, building the VRF and populating it can be more work than capturing the traffic is worth. A newer draft, draft-ietf-idr-flowspec-path-redirect, aims to resolve this.
Before getting to the draft specifics, however, it is useful to review the basic concept of Flowspec, particularly for readers who might not be familiar with it. Essentially, Flowspec is a new address family carried in BGP where the reachable address (the Network Layer Reachability Information, or NLRI) actually carries a filter set. Communities attached to the NLRI (as attributes) provide a set of actions the router should take if a packet matches the filter criteria carried in the NLRI.
The filter formatting is a bit challenging to understand, as it is set up by bit and nibble, with the ability to string filters together with AND and OR. For instance:
You can filter based on a TCP destination port number AND a specific TCP flag by stringing together a type 4 filter and a type 9 filter (in that order specifically), setting the AND (a) bit on the second filter in the NLRI to indicate that both filter conditions must be matched
You can filter based on a range of TCP destination port numbers, say ports 1000-2000, using a type 4 filter with a “greater than or equal to” operator combined with a “1000” value, combined with a “less than or equal to” operator combined with a “2000” value
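That port-range example can be sketched at the byte level following the RFC5575 operator layout (the e, a, len, lt, gt, and eq bits); this is a rough illustration, so verify against a real implementation before relying on the exact bits.

```python
# Rough byte-level sketch of a flowspec port component (type 4) matching
# ports 1000-2000, per the RFC 5575 numeric operator byte:
#   e=0x80 (end of list), a=0x40 (AND with previous term),
#   len bits=0x30 (0x10 means a 2-byte value), lt=0x04, gt=0x02, eq=0x01.
def encode_port_range(low, high):
    out = bytearray([4])          # component type 4: port
    out += bytes([0x13])          # gt|eq ("ge"), 2-byte value follows
    out += low.to_bytes(2, "big")
    out += bytes([0xD5])          # e|a|lt|eq ("le", AND, end), 2-byte value
    out += high.to_bytes(2, "big")
    return bytes(out)

print(encode_port_range(1000, 2000).hex())  # 041303e8d507d0
```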
Just about any combination of port numbers, source addresses, destination addresses, protocol types, and flags can be pulled into a single Flowspec match; hence the difficulty in building ASICs that will support the full range of filtering options in a way that is efficient from a switching perspective and inexpensive (remember: pick two).
In the original Flowspec specification, extended communities are used to describe the action a router should take once a packet has matched a particular filter. For instance—
It is 0x8008 that is interesting for the present. This redirect extended community encodes a route target, which in turn identifies a Virtual Routing/Forwarding (VRF) table on the local router. Normally, routes are placed in a particular VRF by attaching a route target community to a route carried in BGP. By matching the target VRF in the Flowspec action with a target VRF in routes carried in BGP, you can dump specific packets into a specific VRF. For instance, if you have two routes to 2001:db8:3e8:100::/64, one of which points to a honeypot box, and the other of which transits your network to a customer’s network, you can—
Advertise a route to the same destination as the customer route (2001:db8:3e8:100::/64 in this case) into a special “honeypot” VRF by attaching some community to the route advertisement
Detect a problem with traffic sourced from some IP address
Advertise the route into the customer’s VRF with a Flowspec rule moving traffic sourced from the attack address and destined to 2001:db8:3e8:100::/64 into the honeypot VRF
Resulting in the traffic being forwarded based on the next hop in the honeypot VRF, rather than the customer’s VRF
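The steps above can be modeled as a toy lookup; the tables, rule format, and next-hop names here are invented purely to show the redirect-to-VRF effect.

```python
# Toy model of redirect-to-VRF: the 0x8008 action carries a route target,
# and forwarding then happens in whichever VRF imported that target.
vrfs = {
    "customer": {"2001:db8:3e8:100::/64": "next-hop-customer"},
    "honeypot": {"2001:db8:3e8:100::/64": "next-hop-honeypot"},
}
# Hypothetical installed Flowspec rule: traffic from the attack source
# is redirected into the honeypot VRF.
flowspec_rule = {"src": "2001:db8:bad::1", "redirect_vrf": "honeypot"}

def lookup(src, dst):
    vrf = flowspec_rule["redirect_vrf"] if src == flowspec_rule["src"] else "customer"
    return vrfs[vrf][dst]

print(lookup("2001:db8:bad::1", "2001:db8:3e8:100::/64"))  # next-hop-honeypot
print(lookup("2001:db8::5", "2001:db8:3e8:100::/64"))      # next-hop-customer
```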
It would be nice to make the reaction in the example given a bit simpler; this is what draft-ietf-idr-flowspec-path-redirect does. Rather than the “action community” in Flowspec encoding a VRF to redirect traffic in to, the “action community” can point into a redirect action table, or point the packet at a specific tunnel. For instance, if you would like to redirect traffic sourced from a particular address and destined to 2001:db8:3e8:100::/64 into a segment path through your network (ending at the same honeypot, to continue the example), you could—
Use the indirection community in the “action community” in the Flowspec advertisement (the actual community string has not been assigned at the time of this writing)
Set the indirection id part of the community to Node ID
This directs the traffic to the first node in the segment. Other options are an AS, an SR anycast ID, an SR multicast ID, and several others. The indirection ID can indicate the packet should be redirected according to a local table; in this case, it is up to the operator to figure out how to get the table created and filled with the right information.
This solution is still moderately complex—but BGP is increasingly complex, and Flowspec adds another layer of complexity into the mix. What this solution does do, however, is to allow operators to combine Flowspec with segment routing, or a simple table of operations, that makes it easier to deploy Flowspec in many situations outside the more traditional VPN types of services. This is particularly true for large “enterprise” operators with a BGP core in their network who would like to use Flowspec over a wider range of applications.