Last week I began discussing why AS Path Prepend doesn’t always affect traffic the way we think it will. Two other observations from the research paper I’m working off of are:
- Adding two prepends will move more traffic than adding a single prepend
- It’s not possible to move traffic incrementally by prepending; when it works, prepending will end up moving most of the traffic from one inbound path to another
A slightly more complex network will help explain these two observations.
Assume AS65000 would like to control the inbound path for 100::/64. I’ve added a link between AS65001 and 65002 here, but we will still find prepending a single AS to the path won’t make much difference in the path used to reach 100::/64. Why?
Because most providers will have a local policy configured—using local preference—that causes them to choose any available customer connection over other paths. AS65001, on receiving the route to 100::/64 from AS65000, will set the local preference so it will prefer this route over any other route, including the one learned from AS65002. So while the cause is a little different in this case than the situation covered in the first post, the result is the same.
We can, of course, prepend twice onto the AS Path rather than once. What impact would that have here? It still won’t impact the traffic originating in 65005 because AS65001 is the only path available towards 100::64 from their perspective. Prepending cannot change anything if there’s only one path.
However, if most of the traffic destined to 100::/64 coming from AS65006, 7, and 8 rather than from AS65005, prepending two times will allow AS65000 to shift the traffic from the path through AS65002 to the path through AS65001. This example might seem a little contrived. Still, it’s pretty similar to networks that have one connection to some local provider (a cable company or something similar) and one connection to a more prominent national or international provider. Any time you are connected to two different providers who have different ranges of connectivity, prepending two autonomous systems on the AS Path will probably be able to shift traffic from one inbound link to another.
What about prepending more than two hops to the AS Path? Each additional prepend going to shift smaller amounts of traffic. It makes sense that increasing the number of prepends doesn’t shift much more because the further away you get from the edge of the Internet, the more fully connected the autonomous systems are, and the more likely you are to run into some other policy that will override the AS Path in determining the best path. The average length of the AS Path in the Internet is around four; prepending more than this normally won’t have much of an effect on traffic flow
The second question above can also be answered by looking at this network. Why can’t you shift traffic incrementally by prepending onto the AS Path? Because the connectivity close to the edge is probably not meshy enough. You can’t shift over just the traffic from one AS or another; you can only shift traffic from the entire set of autonomous systems behind your upstream from one inbound link to another. You can adjust traffic on a per-prefix basis, however, which can be useful for balancing between two inbound links.
What can you do to control inbound traffic with more certainty? Take a look at this older post for thoughts on using communities and de-aggregation to steer traffic.
Just about everyone prepends AS’ to shift inbound traffic from one provider to another—but does this really work? First, a short review on prepending, and then a look at some recent research in this area.
What is prepending meant to do?
Looking at this network diagram, the idea is for AS6500 (each router is in its own AS) to steer traffic through AS65001, rather than AS65002, for 100::/64. The most common method to trying to accomplish this is AS65000 can prepend its own AS number on the AS Path Multiple times. Increasing the length of the AS Path will, in theory, cause a route to be less preferred.
In this case, suppose AS65000 prepends its own AS number on the AS Path once before advertising the route towards AS65001, and not towards AS65002. Assuming there is no link between AS65001 and AS65002, what would we expect to happen? What we would expect is AS65001 will receive one route towards 100::/64 with an AS Path of 2 and use this route. AS65002 will, likewise, receive one route towards 100::/64 with an AS Path of 1 and use this route.
AS65003, however, will receive two routes towards 100::/64, one with an AS Path of 3 through AS65001, and one with an AS Path of 2 through AS65002. All other things being equal (local preference, etc.), AS65003 will prefer the route with the shorter AS Path through AS65002, and select that path to reach 100::/64. AS65004 will only receive one path towards 100::/64, the one through AS65002, because AS65003 will only advertise its best path to AS65004.
The obvious question—how much good does this really do? The only impact on the best path is two hops away, as AS65003, and beyond. The route chosen by AS65001 and AS65002 will not be affected by the prepending.
A recent paper found—
You might expect As Path prepending to have a much more consistent effect on inbound traffic. Why doesn’t it?
What might not be obvious (the danger of simplified diagrams): if autonomous systems directly attached to AS65001 originate most of the traffic destined to 100::/64, no amount of prepending is going to make any difference in the inbound traffic flow. Assume AS5001 has a connection to some cloud service, AS65002 does not have a connection to the same cloud service, and 100::64 is a local server that communicates with this cloud service on a regular basis. Since AS65001 is the only AS transiting traffic from the cloud service to the server located on the 100::/64 subnet, and AS65001 only has one route to 100::/64, you are not going to be able to shift traffic off that single path no matter how many times you prepend.
The first rule of prepending is location matters. You have to know where the traffic you want to shift is originating, and whether or not it can be shifted.
In my next post on this topic, I’ll continue exploring AS path prepending more in light of the results of the research paper above.
Intentionally poisoning BGP routes in the Default-Free Zone (DFZ) would always be a bad thing, right? Actually, this is a fairly common method to steer traffic flows away from and through specific autonomous systems. How does this work, how common is it, and who does this? Jared Smith joins us on this episode of the Hedge to discuss the technique, and his research into how frequently it is used.
BGP is widely used as an IGP in the underlay of modern DC fabrics. This series argues this is not the best long-term solution to the problem of routing in fabrics because BGP is not ideal for this use case. This post will consider the potential harm we are doing to the larger Internet by pressing BGP into a role it was not originally designed to fulfill—an underlay protocol or an IGP.
My last post described the kinds of configuration required to make BGP work on a DC fabric—it turns out that the configuration of each BGP speaker on the fabric is close to unique. It is possible to automate configuring each speaker—but it would be better if we could get closer to autonomic operation.
To move BGP closer to autonomic operation in a DC fabric, there are several things we can do. First, we can allow a BGP speaker to peer with any other BGP speaker it receives an open message from—this is often called promiscuous mode. While each router in the fabric will still need to be configured with the right autonomous system, at least we won’t need to configure the correct peers on each router (including the remote AS).
Note, however, that using this kind of promiscuous peering does come with a set of tradeoffs (if you’re reading this blog, you know there will be tradeoffs). BGP speakers running in promiscuous mode open a large attack surface on the control plane of the network. We can close this attack surface by configuring authentication on all BGP speakers … but we are now adding complexity to reduce complexity. We could also reduce the scope of the attack surface by never permitting BGP to peer beyond a single hop, and then filtering all BGP packets at the fabric edge. Again, just a bit more complexity to manage—but remember that the road to highly fragile and complex systems is always paved with individual steps that never, on their own, seem to add “too much complexity.”
The second thing we can do to move BGP closer to autonomic operation is to advertise routes to every connected peer without any policy configured. This does, again, introduce some tradeoffs, particularly in the realm of security, but let’s leave that aside for the moment.
Assume we can create a version of BGP that has these modifications—it always accepts any peer from any other AS, and it advertises all routes without any policy configured. Put these features behind a single knob which also includes setting the MRAI to 0 or 1, tightens up the dampening parameters, and adjusts a few other things to make BGP work better in a DC fabric.
As an experiment, let’s enable this DC fabric knob on a BGP speaker at the edge of a dual-homed “enterprise customer.” What will happen?
The enterprise network will automatically peer to any speaker that sends an open message—a huge security hole on the open Internet—and it will advertise every route it learns even though there is no policy configured. This second issue—advertising routes with no policy configured—can cause the enterprise network to become a transit between two much larger provider networks, crashing out some small corner of the Internet.
This might seem like a trivial issue. After all, just don’t ever enable the DC fabric knob on an eBGP peering session upstream into the DFZ, or any other “real” internetwork. Sure, and just don’t ever hit the brakes when you mean to hit the accelerator, or the accelerator when you mean to hit the brakes. If I had a dime for every time we “just don’t ever make that mistake …” Well, I wouldn’t be blogging, I’d be relaxing in the sun someplace (okay, I’m not likely to ever stop working to sit around and “relax” all the time, but you get the picture anyway).
Maybe—just maybe—it would really be better overall to use two different protocols for IGP and EGP work. Maybe—just maybe—it’s better not to mix these two different kinds of functions in a single protocol. Not only is the single resulting protocol bound to be really complex (most BGP implementations are now over 100,000 lines of code, after all), but it will end up being really easy to make really bad mistakes.
No tool is omnicompetent. If you found a tool that was, in fact, omnicompetent, it would also be the most dangerous tool in your toolbox.
Before I continue, I want to remind you what the purpose of this little series of posts is. The point is not to convince you to never use BGP in the DC underlay ever again. There’s a lot of BGP deployed out there, and there are lot of tools that assume BGP in the underlay. I doubt any of that is going to change. The point is to make you stop and think!
Why are we deploying BGP in this way? Is this the right long-term solution? Should we, as a community, be rethinking our desire to use BGP for everything? Are we just “following the crowd” because … well … we think it’s what the “cool kids” are doing, or because “following the crowd” is what we always seem to do?
In my last post, I argued that BGP converges much more slowly than the other options available for the DC fabric underlay control plane. The pushback I received was two-fold. First, the overlay converges fast enough; the underlay convergence time does not really factor into overall convergence time. Second, there are ways to fix things.
If the first pushback is always true—the speed of the underlay control plane convergence does not matter—then why have an underlay control plane at all? Why not just use a single, merged, control plane for both underlay and overlay? Or … to be a little more shocking, if the speed at which the underlay control plane converges does not matter, why not just configure the entire underlay using … static routes?
The reason we use a dynamic underlay control plane is because we need this foundational connectivity for something. So long as we need this foundational connectivity for something, then that something is always going to be better if it is faster rather than slower.
The second pushback is more interesting. Essentially—because we work on virtual things rather than physical ones, just about anything can be adapted to serve any purpose. I can, for instance, replace BGP’s bestpath algorithm with Dijkstra’s SPF, and BGP’s packet format with a more straight-forward TLV format emulating a link-state protocol, and then say, “see, now BGP looks just like a link-state protocol … I made BGP work really well on a DC fabric.”
Yes, of course you can do these things. Somewhere along the way we became convinced that we are being really clever when we adapt a protocol to do something it wasn’t designed to do, but I’m not certain this is a good way of going about building reliable systems.
Okay, back to the point … the next reason we should rethink BGP on the DC fabric is because it is complex to configure when its being used as an IGP. In my last post, when discussing the configuration required to make BGP converge, I noted AS numbers and AS Path filters must be laid out in a very specific way, following where each device is located in the fabric. The MRAI must be taken down to some minimum on every device (either 0 or 1 second), and individual peers must be configured.
Further, if you are using a version of BGP that follows the IETF’s BCPs for the protocol, you must configure some sort of filter (generally a permit all) to get a BGP speaker to advertise anything to an eBGP peer. If you’re using iBGP, you need to configure route reflectors and tell BGP to advertise multiple paths.
There are two ways to solve this problem. First, you can automate all this configuration—of course! I am a huge fan of automation. It’s an important tool because it can make your network consistent and more secure.
But I’m also realistic enough to know that adding the complexity of an automation system on top of a too-complex system to make things simpler is probably not a really good idea. To give a visual example, consider the possibility of automatically wiping your mouth while eating soup.
Yes, automation can be taken too far. A good rule of thumb might be: automation works best on systems intentionally designed to be simple enough to automate. In this case, perhaps it would be simpler to just use a protocol more directly designed so solve the problem at hand, rather than trying to automate our way out of the problem.
Second, you can modify BGP to be a better fit for use as an IGP in various ways. This post has already run far too long, however, so … I’ll hold off on talking about this until the next post.
The fist post on this topic considered some basic definitions and the reasons why I am writing this series of posts. The second considered the convergence speed of BGP on a dense topology such as a DC fabric, and what mechanisms we normally use to improve BGP’s convergence speed. This post considers some of the objections to slow convergence speed—convergence speed is not important, and ECMP with high fanouts will take care of any convergence speed issues. The network below will be used for this discussion.
Two servers are connected to this five-stage butterfly: S1 and S2 Assume, for a moment, that some service is running on both S1 and S2. This service is configured in active-active mode, with all data synchronized between the servers. If some fabric device, such as C7, fails, traffic destined to either S1 or S2 across that device will be very quickly (within tens of milliseconds) rerouted through some other device, probably C6, to reach the same destination. This will happen no matter what routing protocol is being used in the underlay control plane—so why does BGP’s convergence speed matter? Further, if these services are running in the overlay, or they are designed to discover failed servers and adjust accordingly, it would seem like the speed at which the underlay converges just does not matter.
Consider, however, the case where the services running on S1 and S2 are both reachable through an eVPN overlay with tunnel tail-ends landing on the ToR switch through which each server connects to the fabric. Applications accessing these services, for this example, either access the service via a layer 2 MAC address or through a single (anycast) IP address representing the service, rather than any particular instance. To make all of this work, there would be one tunnel tail-end landing on A8, and another landing on E8.
Now what happens if A8 fails? For the duration of the underlay control plane convergence the tunnel tail-end at A8 will appear to be reachable to the overlay. Thus the overlay tunnel will remain up and carrying traffic to a black hole on one of the routers adjacent to A8. In the case of a service reachable via anycast, the application can react in one of two ways—it can fail out operations taking place during the underlay’s convergence, or it can wait. Remember that one second is an eternity in the world of customer-facing services, and that BGP can easily take up to one second to converge in this situation.
A rule of thumb for network design—it’s not the best-case that controls network performance, it’s the worst-case convergence.
The convergence speed of the underlay leaks through to the state of the overlay. The questions that should pop into your mind about right now is—can you be certain this kind of situation cannot happen in your current network, can you be certain it will never happen, and can you be certain this will not have an impact on application performance? I don’t see how the answer to those questions can be yes. The bottom line: convergence speed should be left out of the equation when building a DC fabric. There may be times when you control the applications, and hence can push the complexity of dealing with slow convergence to the application developers—but this seems like a vanishingly small number of cases. Further, is pushing solving for slow convergence to the application developer optimal?
My take on the argument that convergence speed doesn’t matter, then, is that it doesn’t hold up under deeper scrutiny.
as I noted when I started this series—I’m not arguing that we should rip BGP out of every DC fabric … instead, what I’m trying to do is to stir up a conversation and to get my readers to think more deeply about their design choices, and how those design choices work out in the real world
In my last post on this topic, I laid out the purpose of this series—to start a discussion about whether BGP is the ideal underlay control plane for a DC fabric—and gave some definitions. Here, I’d like to dive into the reasons to not use BGP as a DC fabric underlay control plane—and the first of these reasons is BGP converges very slowly and requires a lot of help to converge at all.
Examples abound. I’ve seen the results of two testbeds in the last several years where a DC fabric was configured with each router (switch, if you prefer) in a separate AS, and some number of routes pushed into the network. In both cases—one large-scale, the other a more moderately scaled network on physical hardware—BGP simply failed to converge. Why? A quick look at how BGP converges might help explain these results.
Assume we are watching the 110::/64 route (attached to A, on the left side of the diagram), at P. What happens when A loses it’s connection to 110::/64? Assuming every router in this diagram is in a different AS, and the AS path length is the only factor determining the best path at every router.
Watching the route to 110::/64 at P, you would see the route move from G to M as the best path, then from M to K, then from K to N, and then finally completely drop out of P’s table. This is called the hunt because BGP “hunts,” apparently trying every path from the current best path to the longest possible path before finally removing the route from the network entirely. BGP isn’t really “hunting;” this is just an artifact of the way BGP speakers receive, process, and send updates through the network.
If you consider a more complex topology, like a five-stage butterfly fabric, you will find there are many (very many) alternate longer-length paths available for BGP to hunt through on a withdraw. Withdrawing thousands of routes at the same time, combined with the impact of the hunt, can put BGP in a state where it simply never converges.
To get BGP to converge, various techniques must be used. For instance, placing all the routers in the spine so they are in the AS, configuring path filters at ToR switches so they are never used as a transit path, etc. Even when these techniques are used, however, BGP can still require a minute or so to perform a withdraw.
This means the BGP configuration cannot be the same on every device—it is determined by where the device is located—which harms repeatability, the BGP configuration must contain complex filters, and messing up the configuration can bring the entire fabric down.
There are several counters to the problem of slow convergence, and the complex configurations required to make BGP converge more quickly, but this post is pushing against its limit … so I’ll leave these until next time.