BGP Policies (Part 5)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

In this post I’m going to cover AS Path Prepending from the perspective of AS65001 in the following network—

Since the length of the AS Path plays a role in choosing which path to use when forwarding traffic towards a given reachable destination, many (if not most) operators prepend the AS Path when advertising routes to a peer. Thus an AS Path of [65001], when advertised towards AS65003, can become [65001,65001] by adding one prepend, [65001,65001,65001] by adding two prepends, etc. Most BGP implementations allow an operator to prepend as many times as they would like, so it is possible to see twenty, thirty, or even higher numbers of prepends.
Note: The usefulness of prepending is generally restricted to around two or three, as the average length of an AS Path in the global Internet is around 4 hops.

If AS65001 would like traffic destined to 100::/64 to enter from AS65003 rather than AS65002, it can prepend the AS Path at every peering point with AS65002 (A and B) with two hops (sending [65001,65001,65001] to AS65002). If preference, MED, and all other metrics are equal, AS65002 would then prefer the path with the shorter AS Path through AS65003, rather than the path directly into AS65001 (either through A or B).

That all metrics are equal is not likely, however. AS65002 will probably have preference set so routes learned directly from customers (such as AS65001) are selected over routes learned from peers (such as AS65003). The impact of prepending on route selection by directly connected peers is, therefore, uncertain.

Moving one step out in the network, consider the routes received by AS65004 to reach 100::/64. There will be one route along [65002,65001,65001,65001], and another with an AS Path of [65003,65001]. All other things being equal (same preference, etc.), AS65004 will choose to send traffic destined to 100::/64 through AS65003 rather than AS65002. How likely is it all the other BGP metrics will be equal at AS65004? So long as the peering between AS65004, AS65003, and AS65002 are all of the same type, the odds are high—so prepending can help move some (not all) traffic from one inbound link to another.

Because AS Path prepending has variable results over time, operators using this technique often “just try it” to see what the effect will be. There’s no real way to predict how effective prepending any number of times will be in moving traffic from one inbound link to another.

What if AS65001 does not want traffic destined to 100::/64 to traverse AS6505? For instance, suppose AS6506 s on across an ocean, mountain range, or other difficult-to-cross geographic feature. AS65005 crosses this geography via a satellite link, while AS65004 crosses the same geography via an optical cable. Sine optical cable runs can provide better delay and jitter than a satellite link, AS65001 may desire to choose which of these two autonomous systems is traversed to reach 100::/64.

This cannot be directly accomplished using AS Path prepend, as both AS65004 and AS65005 will both receive the same prepended path.

To express this kind of policy, some operators allow their customers to set communities that cause the operator to remotely prepend a given route advertisement. For instance, NTT allows their customers to set a community that will cause NTT to prepend specific routes when those routes are advertised to specific autonomous systems—in this case, AS65001 could add the community 65421:65005 to the advertisement for 100::/64, which would cause NTT to prepend AS65001 when advertising 100::64 to AS65005, and not prepend anything when advertising 100::/64 to AS65004.

This technique is subject to the same caveats as using AS Path prepend locally—it may work in some situations, or it may not—because the local operator does not have visibility into the policies of the operators they are trying to influence.

BGP Policies (Part 4)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

In this post, I’ll cover the first of a few ways to give surrounding autonomous systems a hint about where traffic should enter a network. Note this is one of the most vexing problems in BGP policy, so there will be a lot of notes across the next several posts about why some solutions don’t work all that well, or when they will and won’t work.

There are at least three reasons an operator may want to control the point at which traffic enters their network, including:

  • Controlling the inbound load on each link. It might be important to balance inbound and outbound load to maintain settlement-free peering, or to equally use all available inbound bandwidth, or to ensure the quality of experience is not impacted by overusing a single link.
  • Accounting for geographically dispersed entry points. For instance, while the two entry points into AS65001 might appear to be topologically close, they might be geographically diverse, with one being in South America and the other being in North America.
  • Ensuring flows requiring symmetric paths are properly handled. A common use case is the use of stateful packet filters or port address translators, both of which require inbound and outbound traffic to be routed through a single device.

All these reasons apply to all kinds of network operators, so this section will examine the various techniques used to control traffic entry points from the perspective of AS65001 in the following network—

 

Policies designed to control the point at which traffic enters an operator’s network will often conflict with policies designed to control the point at which traffic exits some other operator’s network. For instance, AS65001’s policy that all traffic destined to 100::/64 enter the network from AS65002 may conflict with AS6500’2 policy that all traffic destined to 100::/64 leave its network by being forwarded to AS65003.

This effect is not just seen between directly connected autonomous systems. For instance, AS65001’s policy that all traffic destined to 100::/64 enter the network through AS65002 may conflict with AS65004’s policy that all traffic to that same destination exit the network by being forwarded to AS65003.

The original intent of BGP policy was the policy of the sender overrides the policy of the receiver, as expressed in the design of the metrics (the multiple exit discriminator, or MED, has a lower priority than the preference). In real deployments, however, exit and entry policies are more fluid and entangled. These relationships will be considered in each of the sections below, each of which describes a different way to influence or control how traffic destined to a single reachable destination.

Let’s begin with the Multiple Exist Discriminator, or MED.

MED is a suggestion or request to neighboring autonomous systems to forward traffic for reachable destination along a particular path. For instance, AS65001 may desire for traffic being sent to 100::/64 be sent to B in the network diagram, rather than to A or through its link to AS65003.

However, the MED is not a transitive attribute of a BGP route. This means that if AS65001 sets the MED so that entry B is preferred, and sends this MED to AS65003, AS65003 will strip (or reset) the MED before advertising 100::/64 to either AS65004 or AS65002.

MED, in this case, would be useful to help AS65002 determine whether to send this traffic to A or B, but not whether to send the traffic to AS65001 or AS65003. AS65002 will, instead, rely on local policy, primarily preference, to determine which exit point to use. If AS65002 determines the best path to 100::/64 is through one of its direct connections to AS65001 (either A or B), and there is no other reason for AS65002 to choose one path over the other, the MED will be used to determined which path to use.

Because AS65003 only has one connection to AS65001, the MED will not impact its bestpath decision at all. Because AS65001’s MED has been reset or stripped in all the routes to 100::/64 AS65004 receives, AS65001’s MED will not play a role in any bestpath decision there, either (AS65002 or AS65003 may set the MED when sending routes to AS65004, which may influence the path AS65004 chooses, but again only when choosing between multiple connections to the same peering AS).

Because MED is only considered nominally useful, it is often stripped off routes when they are received from another AS.

BGP Policies (Part 3)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

There are many reasons an operator might want to select which neighboring AS through which to send traffic towards a given reachable destination (for instance, 100::/64). Each of these examples assumes the AS in question has learned multiple paths towards 100::/64, one from each peer, and must choose one of the two available paths to forward along.

In the following network—

From AS65001’s perspective

Assume AS65001 is some form of content provider, which means it offers some service such as bare metal compute, cloud services, search engines, social media, etc. Customers from AS65006 are connecting to its servers, located on the 100::/64 network, which generates a large amount of traffic returning to the customers.
From the perspective of AS hops, it appears the path from AS65001 to AS65006 is the same length—if this is true, AS65001 does not have any reason to choose one path or another (given there is no measurable performance difference, as in the cases described above from AS65006’s perspective). However, the AS hop count does not accurately describe the geographic distances involved:

  • The geographic distance between 100::/64 and the exit towards AS65003 is very short
  • The geographic distance between AS100::/64 and the exits towards AS65002 is very long
  • The total geographic distance packets travel when following either path is about the same

In this case, AS65001 can either choose to hold on to packets destined to customers in AS65006 for a longer or shorter geographic distance.
While carrying the traffic over a longer geographic distance is more expensive, AS65001 would also like to optimize for the customer’s quality of experience (QoE), which means AS65001 should hold on to the traffic for as long as possible.

Because customers will use AS65001’s services in direct relation to their QoE (the relationship between service usage and QoE is measurable in the real world), AS65001 will opt to carry traffic destined to customers as long as possible—another instance of cold potato routing.
This is normally implemented by setting the preference for all routes equal and relying on the IGP metric part of the BGP bestpath decision process to control the exit point. IGP metrics can then be tuned based on the geographic distance from the origin of the traffic within the network and the exit point closest to the customer.

An alternative, more active, solution would be to have a local controller monitor the performance of individual paths to a given reachable destination, setting the preferences on individual reachable destinations and tuning IGP metrics in near-real-time to adjust for optimal customer experience.
Another alternative is to have a local controller monitor the performance individual paths and use MPLS, segment routing, or some other mechanism to actively engineer or steer the path of traffic through the network.

Some content providers may directly peer with transit and edge providers to reach customers more quickly, to reduce costs, and to increase their control over customer-facing traffic. For instance, if AS65001 is a content provider that transits traffic through [65002,65005] to reach customers in AS65006. To avoid transiting multiple autonomous systems, AS65001 can run a link directly to AS65005.

In some cases, content providers will build long-haul fiber optics (including undersea cable operations, see this site for examples) to avoid transiting multiple autonomous systems.

While the operator can end up paying a lot to build and operate long-haul optical links, this cost is offset is offset by decreasing paying transit providers for high levels of asymmetric traffic flows. Beyond this, content providers can control user experience more effectively the longer they control the user’s traffic. Finally, content providers can gain more information by connecting closer to users, feeding into Kai-Fu Lee’s virtuous cycle.

Note: content providers peering directly with edge providers and through IXPs is one component of the centralization of the Internet.

A failed alternative to the techniques described here was the use of automatic disaggregation at the content provider’s autonomous system borders. For instance, if a customer connected to a server in 100::/64 by sending traffic via the [65003,65001] link, an automated system will examine the routing table to see which route is currently being used to reach the customer’s reachable destination. If traffic forwarded to this customer’s address would normally pass through one of the [65001,65002] links, a local host route is created and distributed into AS65001 to draw this traffic to the exit connected to AS65003.

The theory behind this automatic disaggregation was that the customer will always take the shortest path from their perspective to reach the service. This assumption fails, in practice, however, so this scheme was ultimately abandoned.

BGP Policies (Part 2)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

There are many reasons an operator might want to select which neighboring AS through which to send traffic towards a given reachable destination (for instance, 100::/64). Each of these examples assumes the AS in question has learned multiple paths towards 100::/64, one from each peer, and must choose one of the two available paths to forward along.

In the following network—

From AS65004’s perspective…

Transit providers primarily choose the most optimal exit from their AS to reduce the amount of peering settlement they are paying by using and maintaining settlement-free peering where possible and reducing the amount of time and distance traffic is carried through their network (through hot potato routing, discussed in more detail below).
If, for instance, AS65004 has a paid peering relationship with AS65002, and a contract with AS65003 which is settlement-free so long as the traffic between AS65004 and AS65003 is roughly symmetric. AS65004 has two roughly equal-cost paths (both have the same AS Path length) towards 100::/64. In this situation, AS 65004 is going to direct traffic towards AS65003 to maintain symmetrical traffic flows and direct any remaining traffic towards AS65002.

This kind of balancing is normally done through a controller or network management system that monitors the balance of traffic with AS65003, adjusting the preference of sets of routes to attain the correct balance with AS65003 while reducing the costs of using the link to AS65002 to the minimum possible.

From AS65005’s perspective…

AS65005 can either send traffic originating in AS65001, received from AS65002, and destined to AS65006, to either AS65004—a peer—or AS65006—a customer. The internal path between the entry point for this traffic is longer if the traffic is carried to AS65006, and shorter if the traffic is carried to AS65004. These longer and shorter paths give rise to the concepts of hot and cold potato routing.

If AS65006 is paying AS65005 for transit, AS65005 would normally carry traffic across the longer path to its border with AS65006. This is cold potato routing. AS65005’s reason for choosing this option is to maximize revenue from the customer. First, as the link between AS65005 and AS65006 becomes busier, AS65006 is likely to upgrade the link, generating additional revenue for AS65005. Even if the traffic level is not increasing, steady traffic flow encourages the customer to maintain the link, which protects revenue. Second, AS65005 can control the quality-of-service AS65006 receives by keeping the traffic within its network for as long as possible, improving the customer’s perception of the service they are receiving.
Cold potato routing is normally implemented by setting the preference on routes learned from customers, so these routes are preferred over all routes learned from peers.
If AS6006 is not paying AS65005 for transit, it is to AS65005’s advantage to carry the traffic as short a distance as possible. In this case, although AS65005 is directly connected to AS65006, and the destination is in AS65006, AS65005 will choose to direct the traffic towards its border with AS65004 (because there is a valid route learned for this reachable destination from AS65004).

This is hot potato routing—like the kids’ game, you want to hold on to the traffic for as short an amount of time as possible. Hot potato routing is normally implemented by setting the preference on routes to the same and relying on the IGP metric component of the BGP bestpath decision process to find the closest exit point.

Next week I’ll continue this series on BGP interdomain policies… feel free to leave a comment if you think I’ve explained something incorrectly, etc.

BGP Policies (part 1)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

In the following network—

There are many reasons an operator might want to select which neighboring AS through which to send traffic towards a given reachable destination (for instance, 100::/64). Each of these examples assumes the AS in question has learned multiple paths towards 100::/64, one from each peer, and must choose one of the two available paths to forward along.

Examining this from AS65006’s Perspective …

Assuming AS65006 is an edge operator (commonly called enterprise, but generally just originating and terminating traffic, and never transiting traffic), there are several reasons the operator may prefer one exit point (through an upstream provider), including:

  • An automated system may determine AS65004 has some sort of brownout; in this case, the operator at 65006 has configured the system to prefer the exit through AS65005
  • The traffic destined to 100::/64 may require a class of service (such as video transport) AS65004 cannot support (for instance, because the link between AS65006 and 65005 has low bandwidth, high delay, or high jitter)

The most common way this kind of policy would be implemented is by setting the BGP LOCAL_PREFERENCE (called preference throughout the rest of this document) on routes learned from AS65005 higher than the preference on routes learned from AS65004.

Another common case is AS65006 would prefer to send traffic to AS65005 only when the destination is in an AS directly connected to AS65005 itself, while sending all other traffic through AS65004. This is common when a one provider has good local and poor global coverage, while the other provider has good global but poor local coverage.

For instance, if AS65006 is in a somewhat isolated part of the world, such as some parts of the South Pacific or Central America, there may be a local provider, such as AS65004, that has solid connectivity to most of the other edge operators in the local geographic region but charges a high cost for transiting to the rest of the global Internet. A second provider, such as AS65005, charges less to reach destinations beyond the local geographic region but is relatively expensive to use when sending traffic to other edge operators within the local region.

Preference, by itself, would be difficult to use in this case, because the operator would need to distinguish between geographically local and geographically distant routes. To implement this kind of policy, the operator would accept partial routes from the geographically local provider (AS65004 in this case) and set a high preference on these routes. Partial routes are typically those the local provider learns only from other directly connected autonomous systems, and hence would only include operators in the local geographic region. The operator would then accept full routes, or the entire Internet global routing table, from the second provider (AS65005 in this case) and set a lower preference.

An alternative way to implement geographic preference is using communities. Many transit providers mark individual reachable destinations with information about where the route originated. NTT, for instance, describes their geographic marking here. An operator can create filters using regular expressions to change the preference of a route based on its geographic origin.

This is not a common way to solve the problem because the filtering rules involved can become complex—but it might be deployed if local providers do not offer partial routes for some reason.

Another alternate to implement geographic preference is to use a regular expression filter to set the preference for each reachable destination based on the length of the AS Path. Theoretically, routes originating within the local region should have an AS Path of one or two hops, while those originating outside a region should have longer AS Paths.
This generally does not work for two reasons. First, the average length of an AS Path (after prepending is factored out) is about 4 hops in the entire global Internet—and it is easy to reach four hops even within a local region in some situations. Second, many operators prepend the AS Path to manage inbound entry point preference; these prepended hops must be factored out to use this method.

Next week I’ll continue this series on BGP interdomain policies… feel free to leave a comment if you think I’ve explained something incorrectly, etc.

Quality is (too often) the missing ingredient

Software Eats the World?

I’m told software is going to eat the world very soon now. Everything already is, or will be, software based. To some folks, this sounds completely wonderful, but—leaving aside the privacy issues—I still see an elephant in the room with this vision of the future.

Quality.

Let me give you some recent examples.

First, ceiling fans. Modern ceiling fans, in case you didn’t know, don’t rely on the wall switch and pull chains. Instead, they rely on remote controls. This is brilliant—you can dim the light, change the speed of the fan, etc., from a remote control. No unsightly chains hanging from the ceiling.

Well, it’s brilliant so long as it works. I’ve replaced three of the four ceiling fans in my house. Two of the remote controls have somehow attached themselves to two of the three fans. It’s impossible to control one of the fans without also controlling the other. They sometimes get into this entertaining mode where turning one fan off turns the other one on.

For the third one—the one hanging from a 13-foot ceiling—the remote control sometimes operates one of the other fans, and sometimes the fan its supposed to operate. Most of the time it doesn’t seem to do much of anything.

The fan manufacturer—a large, well-known company—mentions this situation in their instructions and points to a FAQ that doesn’t exist. Searching around online I found instructions for solving this problem that involve unwiring the fans and repeating a set of steps 12 times for each fan to correct the situation. These instructions, needless to say, don’t work.

There is no way to reset the remote, nor the connection between the remote and the fan. There is no way to manually select some dip switch so the remote has a specific fan it talks to. Just some mystical software that’s supposed to work (but doesn’t) and no real instructions on how to resolve the problem. The result will be a multi-hour wait on a customer support line, spending hours of my time to sort the problem out, and the joy of climbing (tall) ladders to unwire and wire ceiling fans in four different rooms.

Thinking through possible problems and building software interfaces that take those situations into account … might be a bit more important than we think they are if software is really going to eat the world.

Second, the retailer’s web site—a large retailer with thousands of physical stores across the United States. Twice I’ve ordered from this site, asking to have the item held in the local store so I can pick it up. The site won’t let you order the item for store pickup unless they have it in stock.

The first time they called me to say they couldn’t find the item I ordered, but they found a “newer model” that was a lot less expensive. It was a lot less expensive because it wasn’t the same item. They never did find the item I originally ordered.

The second time they called me to say they couldn’t find the item I ordered. I asked if they could just ship the item to my house when it’s back in stock. “I’m sorry, our system doesn’t allow us to do that …” Several hours later, they called back to tell me they found it, but they cannot reinstate my order—I must place a new order.

Again, software quality strikes … what should be a simple process just isn’t. There will always be mismatches between the state in software and the state in the real world—but design the system so it’s possible to adapt when this happens, rather than shutting down the process and starting over.

Third, I own a car that has all the “bells and whistles,” including an adaptive cruise control system. There are certain situations, however, where this adaptive control does the wrong thing, producing potentially dangerous results. There is no way to set the car to use the non-adaptive cruise control permanently (I called and waited on the phone for several hours to discover this). You can set the non-adaptive cruise control on a per-use basis by going through set of menus to change the settings … while driving.

Software quality anyone?

Software eats the world might be someone’s ultimate dream—but I suspect that software quality will always be the fly in the ointment. People are not perfect (even in crowds); software is created by people; hence software will always suffer from quality problems.

Maybe a little humility about our ability to make things as complex as we might like because “we can always have software do that bit” would be a good thing—even in the networking world.

Thoughts on Auto Disaggregation and Complexity

Way in the past, the EIGRP team (including me) had an interesting idea–why not aggregate routes automatically as much as possible, along classless bounds, and then deaggregate routes when we could detect some failure was causing a routing black hole? To understand this concept better, consider the network below.

In this network, B and C are connected to four different routers, each of which is advertising a different subnet. In turn, B and C are aggregating these four routes into 2001:db8:3e8:10::/60, and advertising this aggregate towards A. From a control plane state perspective, this is a major win. The obvious gain is that the amount of state is reduced from four routes to one. The less obvious gain is A doesn’t need to know about any changes in the state for the four destinations aggregated into the /60. Depending on how often these links change state, the reduction in the rate of change is, perhaps, more important than the reduction in the amount of control plane state.

We always know there will be a tradeoff when reducing state; what is the tradeoff here? If C somehow loses its connection to one of the four routers, say the router advertising 11::/64, C’s 10::/60 aggregate will not change. Since A thinks C still has a route to every subnet within 10::/60, it will continue sending traffic destined to addresses in the 11::/64 towards both B and C. C will not have a route towards these destinations, so it will drop the traffic.

We have a routing black hole.

for more information on aggregation in networks, take a look at my livelesson on abstraction in computer networks

This much is pretty simple. The harder part is figuring out to eliminate this routing black hole. Our first choice is to just not aggregate these routes. While you might be cringing right now, this isn’t such a bad option in many networks. We often underestimate the amount of state and the speed of state change modern routing protocols running on modern processors can support. I’ve seen networks running IS-IS in a single flooding domain with tens of thousands of routes and thousands of nodes running “in the wild.” I’ve seen IS-IS networks with thousands of nodes and hundreds of thousands of routes running in lab environments. These networks still converge.

But what if we really think we need to reduce the amount and speed of state, so we really need to aggregate these routes?

One solution that has been proposed a number of times through the years is auto disaggregation.

In this case, suppose D somehow realizes C cannot reach one of the components of a shared aggregate route. D could simply stop advertising the aggregate, advertising each of the components instead. The question here might be: is this a good idea? Looking at this from the perspective of the SOS triad, the aggregation replaced four routes with a single route. In the auto disaggregation case, the single route change is replaced by four route changes. The amount of state is variable, and in some cases the rate of change in state is actually higher than without the aggregation.

So…

I don’t hold that auto disaggregation is either good nor bad—it just presents a different set of challenges to the network designer. Instead of designing for average rates of change and given table sizes, you can count on much smaller tables, but you might find there are times when the rate of change is dramatically higher than you expect. A good question to ask, before deploying this kind of technology, might be: can I forsee a chain of events that will cause a high enough rate of state change that auto disaggregation is actually more destabilizing than just not summarizing at all in this network?

A real danger with auto disaggregation, by the way, is using summarization to dramatically reduce table sizes without understanding how a goldilocks failure (what we used to call in telco a mother’s day event, or perhaps a black swan) can cascade into widespread failures. If you’re counting on particular devices in your network only have a dozen or two dozen table entries, but just the right set of failures can cause them to have several thousand entries because of auto disaggregation, what kinds of failures modes should you anticipate? Can you anticipate or mitigate this kind of problem?

The idea of automatically summarizing and disaggregating routes is an interesting study in complexity, state, and optimization. It’s a good brain exercise in thinking through what-if situations, and carefully thinking about when and where to deploy this kind of thing.

What do you think about this idea? When would you deploy it, where, and why? When and where would you be cautious about deploying this kind of technology?