DESIGN – Page 2 – rule 11 reader

Revisiting BGP Convergence

My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.

What about the much more controlled environment of a data center fabric?

Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.

These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.

Start with A losing its connection to 101::/64—

T1: A withdraws its route from B and C
T2: B withdraws its route from D and E, C withdraws its route from F and G
T3: D and E withdraw their routes from H, F and G withdraw their routes from K
T4: H and K withdraw their routes from L

Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.

What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.

What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.

If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.

Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.

L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.

What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.

Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.

the bottom line

There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.

Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.

I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …

Posted in DESIGN, TECH, WRITTEN

BGP Policies (Part 2)

At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.

There are many reasons an operator might want to select which neighboring AS through which to send traffic towards a given reachable destination (for instance, 100::/64). Each of these examples assumes the AS in question has learned multiple paths towards 100::/64, one from each peer, and must choose one of the two available paths to forward along.

In the following network—

From AS65004’s perspective…

Transit providers primarily choose the most optimal exit from their AS to reduce the amount of peering settlement they are paying by using and maintaining settlement-free peering where possible and reducing the amount of time and distance traffic is carried through their network (through hot potato routing, discussed in more detail below).
If, for instance, AS65004 has a paid peering relationship with AS65002, and a contract with AS65003 which is settlement-free so long as the traffic between AS65004 and AS65003 is roughly symmetric. AS65004 has two roughly equal-cost paths (both have the same AS Path length) towards 100::/64. In this situation, AS 65004 is going to direct traffic towards AS65003 to maintain symmetrical traffic flows and direct any remaining traffic towards AS65002.

This kind of balancing is normally done through a controller or network management system that monitors the balance of traffic with AS65003, adjusting the preference of sets of routes to attain the correct balance with AS65003 while reducing the costs of using the link to AS65002 to the minimum possible.

From AS65005’s perspective…

AS65005 can either send traffic originating in AS65001, received from AS65002, and destined to AS65006, to either AS65004—a peer—or AS65006—a customer. The internal path between the entry point for this traffic is longer if the traffic is carried to AS65006, and shorter if the traffic is carried to AS65004. These longer and shorter paths give rise to the concepts of hot and cold potato routing.

If AS65006 is paying AS65005 for transit, AS65005 would normally carry traffic across the longer path to its border with AS65006. This is cold potato routing. AS65005’s reason for choosing this option is to maximize revenue from the customer. First, as the link between AS65005 and AS65006 becomes busier, AS65006 is likely to upgrade the link, generating additional revenue for AS65005. Even if the traffic level is not increasing, steady traffic flow encourages the customer to maintain the link, which protects revenue. Second, AS65005 can control the quality-of-service AS65006 receives by keeping the traffic within its network for as long as possible, improving the customer’s perception of the service they are receiving.
Cold potato routing is normally implemented by setting the preference on routes learned from customers, so these routes are preferred over all routes learned from peers.
If AS6006 is not paying AS65005 for transit, it is to AS65005’s advantage to carry the traffic as short a distance as possible. In this case, although AS65005 is directly connected to AS65006, and the destination is in AS65006, AS65005 will choose to direct the traffic towards its border with AS65004 (because there is a valid route learned for this reachable destination from AS65004).

This is hot potato routing—like the kids’ game, you want to hold on to the traffic for as short an amount of time as possible. Hot potato routing is normally implemented by setting the preference on routes to the same and relying on the IGP metric component of the BGP bestpath decision process to find the closest exit point.

Next week I’ll continue this series on BGP interdomain policies… feel free to leave a comment if you think I’ve explained something incorrectly, etc.

Posted in DESIGN, TECH, WRITTEN

BGP Policies (part 1)

In the following network—

Examining this from AS65006’s Perspective …

Assuming AS65006 is an edge operator (commonly called enterprise, but generally just originating and terminating traffic, and never transiting traffic), there are several reasons the operator may prefer one exit point (through an upstream provider), including:

An automated system may determine AS65004 has some sort of brownout; in this case, the operator at 65006 has configured the system to prefer the exit through AS65005
The traffic destined to 100::/64 may require a class of service (such as video transport) AS65004 cannot support (for instance, because the link between AS65006 and 65005 has low bandwidth, high delay, or high jitter)

The most common way this kind of policy would be implemented is by setting the BGP LOCAL_PREFERENCE (called preference throughout the rest of this document) on routes learned from AS65005 higher than the preference on routes learned from AS65004.

Another common case is AS65006 would prefer to send traffic to AS65005 only when the destination is in an AS directly connected to AS65005 itself, while sending all other traffic through AS65004. This is common when a one provider has good local and poor global coverage, while the other provider has good global but poor local coverage.

For instance, if AS65006 is in a somewhat isolated part of the world, such as some parts of the South Pacific or Central America, there may be a local provider, such as AS65004, that has solid connectivity to most of the other edge operators in the local geographic region but charges a high cost for transiting to the rest of the global Internet. A second provider, such as AS65005, charges less to reach destinations beyond the local geographic region but is relatively expensive to use when sending traffic to other edge operators within the local region.

Preference, by itself, would be difficult to use in this case, because the operator would need to distinguish between geographically local and geographically distant routes. To implement this kind of policy, the operator would accept partial routes from the geographically local provider (AS65004 in this case) and set a high preference on these routes. Partial routes are typically those the local provider learns only from other directly connected autonomous systems, and hence would only include operators in the local geographic region. The operator would then accept full routes, or the entire Internet global routing table, from the second provider (AS65005 in this case) and set a lower preference.

An alternative way to implement geographic preference is using communities. Many transit providers mark individual reachable destinations with information about where the route originated. NTT, for instance, describes their geographic marking here. An operator can create filters using regular expressions to change the preference of a route based on its geographic origin.

This is not a common way to solve the problem because the filtering rules involved can become complex—but it might be deployed if local providers do not offer partial routes for some reason.

Another alternate to implement geographic preference is to use a regular expression filter to set the preference for each reachable destination based on the length of the AS Path. Theoretically, routes originating within the local region should have an AS Path of one or two hops, while those originating outside a region should have longer AS Paths.
This generally does not work for two reasons. First, the average length of an AS Path (after prepending is factored out) is about 4 hops in the entire global Internet—and it is easy to reach four hops even within a local region in some situations. Second, many operators prepend the AS Path to manage inbound entry point preference; these prepended hops must be factored out to use this method.

Next week I’ll continue this series on BGP interdomain policies… feel free to leave a comment if you think I’ve explained something incorrectly, etc.

Posted in DESIGN, TECH, WRITTEN

Quality is (too often) the missing ingredient

Software Eats the World?

I’m told software is going to eat the world very soon now. Everything already is, or will be, software based. To some folks, this sounds completely wonderful, but—leaving aside the privacy issues—I still see an elephant in the room with this vision of the future.

Quality.

Let me give you some recent examples.

First, ceiling fans. Modern ceiling fans, in case you didn’t know, don’t rely on the wall switch and pull chains. Instead, they rely on remote controls. This is brilliant—you can dim the light, change the speed of the fan, etc., from a remote control. No unsightly chains hanging from the ceiling.

Well, it’s brilliant so long as it works. I’ve replaced three of the four ceiling fans in my house. Two of the remote controls have somehow attached themselves to two of the three fans. It’s impossible to control one of the fans without also controlling the other. They sometimes get into this entertaining mode where turning one fan off turns the other one on.

For the third one—the one hanging from a 13-foot ceiling—the remote control sometimes operates one of the other fans, and sometimes the fan its supposed to operate. Most of the time it doesn’t seem to do much of anything.

The fan manufacturer—a large, well-known company—mentions this situation in their instructions and points to a FAQ that doesn’t exist. Searching around online I found instructions for solving this problem that involve unwiring the fans and repeating a set of steps 12 times for each fan to correct the situation. These instructions, needless to say, don’t work.

There is no way to reset the remote, nor the connection between the remote and the fan. There is no way to manually select some dip switch so the remote has a specific fan it talks to. Just some mystical software that’s supposed to work (but doesn’t) and no real instructions on how to resolve the problem. The result will be a multi-hour wait on a customer support line, spending hours of my time to sort the problem out, and the joy of climbing (tall) ladders to unwire and wire ceiling fans in four different rooms.

Thinking through possible problems and building software interfaces that take those situations into account … might be a bit more important than we think they are if software is really going to eat the world.

Second, the retailer’s web site—a large retailer with thousands of physical stores across the United States. Twice I’ve ordered from this site, asking to have the item held in the local store so I can pick it up. The site won’t let you order the item for store pickup unless they have it in stock.

The first time they called me to say they couldn’t find the item I ordered, but they found a “newer model” that was a lot less expensive. It was a lot less expensive because it wasn’t the same item. They never did find the item I originally ordered.

The second time they called me to say they couldn’t find the item I ordered. I asked if they could just ship the item to my house when it’s back in stock. “I’m sorry, our system doesn’t allow us to do that …” Several hours later, they called back to tell me they found it, but they cannot reinstate my order—I must place a new order.

Again, software quality strikes … what should be a simple process just isn’t. There will always be mismatches between the state in software and the state in the real world—but design the system so it’s possible to adapt when this happens, rather than shutting down the process and starting over.

Third, I own a car that has all the “bells and whistles,” including an adaptive cruise control system. There are certain situations, however, where this adaptive control does the wrong thing, producing potentially dangerous results. There is no way to set the car to use the non-adaptive cruise control permanently (I called and waited on the phone for several hours to discover this). You can set the non-adaptive cruise control on a per-use basis by going through set of menus to change the settings … while driving.

Software quality anyone?

Software eats the world might be someone’s ultimate dream—but I suspect that software quality will always be the fly in the ointment. People are not perfect (even in crowds); software is created by people; hence software will always suffer from quality problems.

Maybe a little humility about our ability to make things as complex as we might like because “we can always have software do that bit” would be a good thing—even in the networking world.

Posted in CULTURE, DESIGN, WRITTEN

Thoughts on Auto Disaggregation and Complexity

Way in the past, the EIGRP team (including me) had an interesting idea–why not aggregate routes automatically as much as possible, along classless bounds, and then deaggregate routes when we could detect some failure was causing a routing black hole? To understand this concept better, consider the network below.

In this network, B and C are connected to four different routers, each of which is advertising a different subnet. In turn, B and C are aggregating these four routes into 2001:db8:3e8:10::/60, and advertising this aggregate towards A. From a control plane state perspective, this is a major win. The obvious gain is that the amount of state is reduced from four routes to one. The less obvious gain is A doesn’t need to know about any changes in the state for the four destinations aggregated into the /60. Depending on how often these links change state, the reduction in the rate of change is, perhaps, more important than the reduction in the amount of control plane state.

We always know there will be a tradeoff when reducing state; what is the tradeoff here? If C somehow loses its connection to one of the four routers, say the router advertising 11::/64, C’s 10::/60 aggregate will not change. Since A thinks C still has a route to every subnet within 10::/60, it will continue sending traffic destined to addresses in the 11::/64 towards both B and C. C will not have a route towards these destinations, so it will drop the traffic.

We have a routing black hole.

for more information on aggregation in networks, take a look at my livelesson on abstraction in computer networks

This much is pretty simple. The harder part is figuring out to eliminate this routing black hole. Our first choice is to just not aggregate these routes. While you might be cringing right now, this isn’t such a bad option in many networks. We often underestimate the amount of state and the speed of state change modern routing protocols running on modern processors can support. I’ve seen networks running IS-IS in a single flooding domain with tens of thousands of routes and thousands of nodes running “in the wild.” I’ve seen IS-IS networks with thousands of nodes and hundreds of thousands of routes running in lab environments. These networks still converge.

But what if we really think we need to reduce the amount and speed of state, so we really need to aggregate these routes?

One solution that has been proposed a number of times through the years is auto disaggregation.

In this case, suppose D somehow realizes C cannot reach one of the components of a shared aggregate route. D could simply stop advertising the aggregate, advertising each of the components instead. The question here might be: is this a good idea? Looking at this from the perspective of the SOS triad, the aggregation replaced four routes with a single route. In the auto disaggregation case, the single route change is replaced by four route changes. The amount of state is variable, and in some cases the rate of change in state is actually higher than without the aggregation.

So…

I don’t hold that auto disaggregation is either good nor bad—it just presents a different set of challenges to the network designer. Instead of designing for average rates of change and given table sizes, you can count on much smaller tables, but you might find there are times when the rate of change is dramatically higher than you expect. A good question to ask, before deploying this kind of technology, might be: can I forsee a chain of events that will cause a high enough rate of state change that auto disaggregation is actually more destabilizing than just not summarizing at all in this network?

A real danger with auto disaggregation, by the way, is using summarization to dramatically reduce table sizes without understanding how a goldilocks failure (what we used to call in telco a mother’s day event, or perhaps a black swan) can cascade into widespread failures. If you’re counting on particular devices in your network only have a dozen or two dozen table entries, but just the right set of failures can cause them to have several thousand entries because of auto disaggregation, what kinds of failures modes should you anticipate? Can you anticipate or mitigate this kind of problem?

The idea of automatically summarizing and disaggregating routes is an interesting study in complexity, state, and optimization. It’s a good brain exercise in thinking through what-if situations, and carefully thinking about when and where to deploy this kind of thing.

What do you think about this idea? When would you deploy it, where, and why? When and where would you be cautious about deploying this kind of technology?

Posted in DESIGN, WRITTEN

Hedge 108: In Defense of Boring Technology with Andrew Wertkin

Engineers (and marketing folks) love new technology. Watching an engineer learn or unwrap some new technology is like watching a dog chase a squirrel—the point is not to catch the squirrel, it’s just that the chase is really fun. Join Andrew Wertkin (from BlueCat Networks), Tom Ammon, and Russ White as we discuss the importance of simple, boring technologies, and moderating our love of the new.

download

Posted in AUDIO, DESIGN, HEDGE

Hedge 105: Johan Gustawsson and Changing Provider Architectures

Many service providers have the feeling that they “didn’t do anything wrong, but somehow we still lost.” How are providers reacting to the massive changes in the networking field, and how are they trying to regain their footing so they can move into the coming decades better positioned to compete? Join Johan Gustawsson, Tom Ammon, and Russ White as we discuss the impact of merchant silicon and changing applications on the architecture of service providers.

download

You can read Johan’s post on this topic here.

Posted in AUDIO, DESIGN, HEDGE