When prepend fails, what next? (1)
So you want to load share better on your inbound ‘net links. If you look around the ‘web, it won’t take long to find a site that explains how to configure AS Path Prepending. So the next time you have downtime, you configure it up, turn everything back on, and… Well, it moved some traffic, but not as much as you’d like. So you wait ’til the next scheduled maintenance window and configure a couple of extra prepends into the mix. Now you fire it all back up and… not much happens. Why not? There are a couple of reasons prepending isn’t always that effective—but it primarily has to do with the way the Internet itself tends to be built. Let’s use the figure below as an example network.
You’re sitting at AS65000, and you’re trying to get the traffic to be relatively balanced across the 65001->65000 and the 65004->65000 links. Say you’ve prepended towards AS65001, as that’s the provider sending you more traffic. Assume, for a moment, that AS65003 accepts routes from both AS65001 and AS65004 on an equal basis. When you prepend, you’re causing the route towards your destinations to appear to be longer from AS65003’s perspective. This path will be affected by the first prepend.
But now consider the second prepend: will it have any impact on the traffic flow? AS65003 only has two paths to the destination, one through AS65001 and one through AS65004. It can only choose one of these two paths. If the single prepend worked, a second prepend isn’t going to make any difference. This alerts us to the first problem with prepending: it’s only effective while the prepended path stays within the range of AS Path lengths that actually occur in the network. Adding 256 prepends in this network isn’t going to have any more impact than the first prepend.
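To make the point concrete, here is a minimal Python sketch of AS65003's decision. It is a toy model, not a BGP implementation: it compares AS Path length only, and the `tie_winner` argument is a made-up stand-in for the later tie-breaking steps of the real best-path algorithm. The function names are illustrative, and the topology is assumed to match the description above (AS65003 hears the route once through AS65001 and once through AS65004).

```python
ORIGIN_AS = 65000


def advertised_path(neighbor_as, prepends=0):
    """AS path AS65003 sees for the route learned through neighbor_as."""
    return [neighbor_as] + [ORIGIN_AS] * (1 + prepends)


def best_neighbor(paths, tie_winner):
    """Pick the neighbor with the shortest AS path.

    tie_winner is a hypothetical stand-in for the later tie-breakers
    (origin, MED, router ID, ...) that this sketch does not model.
    """
    shortest = min(len(path) for path in paths.values())
    candidates = [n for n, path in paths.items() if len(path) == shortest]
    return tie_winner if tie_winner in candidates else candidates[0]


def choice_by_prepend_count(max_prepends):
    """Map 'number of prepends sent towards AS65001' to AS65003's choice."""
    choices = {}
    for n in range(max_prepends + 1):
        paths = {
            65001: advertised_path(65001, prepends=n),
            65004: advertised_path(65004),
        }
        # Assume the tie goes to AS65001, the provider carrying more traffic.
        choices[n] = best_neighbor(paths, tie_winner=65001)
    return choices


if __name__ == "__main__":
    for n, neighbor in choice_by_prepend_count(4).items():
        print(f"{n} prepends: AS65003 sends traffic via AS{neighbor}")
```

In this model the choice flips from AS65001 to AS65004 at the first prepend and never changes again, no matter how many more are added.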
If the effectiveness of prepending is related to the overall path length through the network (edge to edge), then we should ask—what is the average path length of the global Internet? As it turns out, there are folks who measure this sort of thing on a regular basis, and have for quite a long time (in terms of Internet time scales)—CAIDA, RIPE, and Potaroo, for instance, all have pretty extensive measurements taken from the Internet Default Free Zone (DFZ) over time. Here is a chart of the average AS Path length in the DFZ since 1998:
As it turns out, the average AS Path length hasn’t changed much in the last eight years, even though the number of routes and the number of connected autonomous systems have dramatically increased over that same time period. The lesson here is that the first AS path prepend is probably going to have the most impact, the second will have a lesser impact, and after that you’re probably just typing for the fun of it.
There are two other reasons prepending can fail.
First, consider the connection between AS65001 and AS65004. We know this is some sort of peering relationship; it could be settlement free, it could have some sort of settlement on it, or—well, who knows? But one thing you can know is that AS65001 is always going to prefer your route from you over your route learned through AS65004. AS65001 is going to configure this preference using LOCAL_PREF, which comes way before your puny little AS Path Prepend. Bottom line? You’re never going to draw traffic across the 65004->65001 link using prepend.
Second, consider AS65002 sitting up in the corner. Once again, note that AS65001 is always going to prefer a route to one of its customers when that route is learned directly from the customer. So, to add one more point to the one above, you’re never going to get the traffic from AS65002 to travel through AS65004 instead of AS65001.
All this to say: if a majority of your traffic is being sourced from the customers of one of your two providers, prepend is going to be useless in redirecting that traffic through the other provider.
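A small Python sketch of AS65001's side of the story shows why. It models only the first two comparisons discussed here (highest LOCAL_PREF, then shortest AS Path) and ignores the rest of the best-path process. The local preference values are assumptions for illustration: 150 for the customer-learned route and 100 for the peer-learned route. The `Route` class and function names are likewise made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    learned_from: int   # neighbor AS the route was learned from
    local_pref: int     # set by AS65001's own inbound policy
    as_path: tuple      # AS path as received


def best_route(routes):
    """Highest LOCAL_PREF wins; AS path length only breaks LOCAL_PREF ties."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


def as65001_choice(prepends, customer_local_pref=150):
    """Return the neighbor AS65001 uses to reach AS65000's prefix."""
    # Route learned directly from the customer, with the customer's prepends.
    direct = Route(65000, customer_local_pref, (65000,) * (1 + prepends))
    # Same prefix learned across the peering link with AS65004.
    via_peer = Route(65004, 100, (65004, 65000))
    return best_route([direct, via_peer]).learned_from


if __name__ == "__main__":
    for n in (0, 1, 5, 20):
        print(f"{n} prepends: AS65001 forwards via AS{as65001_choice(n)}")
    # Only if AS65001 did NOT raise local pref would the prepend matter:
    print("equal local pref, 2 prepends: via AS"
          f"{as65001_choice(2, customer_local_pref=100)}")
```

As long as the provider sets a higher local preference on customer routes, the prepend count never enters the comparison. The last line shows the hypothetical case where it would.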
Now that we know why prepend doesn’t always work, what can we do about it? We’ll save the answer ’til next week’s Design Board.
Thanks for this nice tidbit of knowledge (and for making me really think about this).
I think what most people have in mind when they think about BGP is that path selection is based on AS-path length, which in general holds when we consider our AS as the source of traffic (i.e. how do we go from my AS to Google’s AS).
However, when we talk about load balancing, we need to think the other way around (we are now the destination and not the source), and so we need to consider how traffic originating in another AS would select its path (and how we can influence that).
I’d be curious to know what your thoughts are on prepending in an internal environment (for instance to force all traffic for a DMZ through one side (active-passive) by using prepending) but also how this “rates” in terms of complexity and of course how you solve this use-case.. 🙂
Hi Russ, the local preference you are talking about in AS 65001 works only if you have multiple exit points in 65001 towards 65000, right? But in this case, since there is only one exit point, we can’t influence it; it has to go through 65004. Please correct me if I am wrong.
Avinash, thanks for stopping by! You can only use local pref if there are two exits, correct. But in this case, you’re trying to influence traffic inbound to AS65000, which has two entry points. You could do this by setting the local preference in AS65001, but I’d rather wait ’til the next post to explain this more, rather than explaining in the comments (where most folks won’t see it!).
thank you russ..waiting for your post 🙂
Russ, I had this doubt because in the diagram I see 65001 has only one router, which has two exit points, one towards 65004 and one towards 65000. So can we configure local preference to exit out of the interface connecting to 65000? 🙂 Please let me know.
Got it russ..just tested this out…
IOU2#show ip bgp
BGP table version is 2, local router ID is 18.104.22.168
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
              x best-external, a additional-path, c RIB-compressed,
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
    Network            Next Hop        Metric LocPrf Weight Path
*>  22.214.171.124/24  10.10.10.1           0    150      0 65000 65000 65000 i
*                      126.96.36.199                      0 65004 65000 i
local preference takes precedence..
Realistic solution: Leak longer matches for traffic engineering (for example, split a /16 into two /17s), which might be feasible. Of course, this won’t work for an IPv4 /24. The alternative is to use aggregation to combine the longer matches into a shorter match to negatively bias a path. Both of these implementations are effectively identical. This doesn’t solve all problems, though.
Less realistic solution: Use the BGP cost-community with pre-bestpath POI. This assumes the remote-AS honors said community. This community is unfortunately non-transitive … unless you embed it in AIGP. Then it magically becomes transitive, and functions as you would think, even across AS boundaries (true in IOS-XE at least, and I believe IOS-XR as well, it’s been awhile).
As Nick Russo indicated, you’re going to have to leak longer or shorter prefixes depending on how you want the traffic to flow. AS Prepend is a nice story but unrealistic in the real world. I’ve been doing this for nearly 20 years and I haven’t seen a provider yet who doesn’t local pref their own routes or prefer their own paths regardless of the AS path length. Most of them weight it or use local pref, as you indicated.
That being said, you can use items such as local pref (hopefully there is IBGP between your egress routers). Another way I have done this is BGP conditional route advertisement. This is really good if you own less than a /24. I almost forgot to mention that providers really won’t accept any route smaller than a /24, for example a /25, so that presents another issue. These are all great things to think about. The last is the return traffic going out. You want to make sure that if traffic enters one side, it leaves the very same side. You don’t want asymmetric routing, especially if you have firewalls, because reverse path forwarding would be completely broken. Your firewalls will not be very happy.
The best way that I have got this to work is to have a transit network that sits on the outside between your 2 routers. If they are in different locations, then things like a GRE tunnel can be used. Now you can form an IBGP peering between the two routers, use items such as local pref, and exchange your full tables between the routers for full reachability.
Just my 2 cents
De-aggregation is evil…
You are the one to pay for upgrades…
At least advertise longer prefixes with NO_EXPORT and in combination with the aggregate
Jeff — don’t worry, I’ll get to the other side of the de-agg story before it’s over with… 🙂