A few weeks ago, Daniel posted a piece about using different underlay and overlay protocols in a data center fabric. He says:
There is nothing wrong with running BGP in the overlay but I oppose to the argument of it being simpler.
One of the major problems we often face in network engineering—and engineering more broadly—is confusing that which is simple with that which has lower complexity. Simpler things are not always less complex. Let me give you a few examples, all of which are going to be controversial.
When OSPF was first created, it was designed to be a simpler and more efficient form of IS-IS. Instead of using TLVs to encode data, OSPF used fixed-length fields. To process the contents of a TLV, you need to build a case/switch construction where each possible type a separate bit of code. You must count off the correct length for the type of data, or (worse) read a length field and count out where you are in the stream.
Fixed-length fields are just much easier to process. You build a structure matching the layout of the fixed-length fields in memory, then point this structure at the packet contents in-memory. From there, you can just use the structure’s contents to directly access the data.
Over time, however, as new requirements have been pushed into IGPs, OSPF has become much more complex while IS-IS has remained relatively constant (in terms of complexity). IS-IS went through a bit of a mess when transitioning from narrow to wide metrics, but otherwise the IS-IS we use today is the same protocol we used when I first started working on networks (back in the early 1990s).
OSPF’s simplicity, in the end, did not translate into a less complex protocol.
Another example is the way we transport data in BGP. A lot of people do not know that BGP’s original design allowed for carrying information other than straight reachability in the protocol. BGP speakers can negotiate multiple sessions, with each session carrying a different kind of information. Rather than using this mechanism, however, BGP has consistently been extended using address families—because it is simpler to create a new address family than it is to define a new kind of data parallel with address families.
This has resulted in AFs that are all over the place, magic numbers, and all sorts of complexity. The AF solution is simpler, but ultimately more complex.
Returning to Daniel’s example, running a single protocol for underlay and overlay is simpler, while running two different protocols is less simple. However, I’ve observed—many times—that running different protocols for underlay and overlay is less complex.
Why? Daniel mentions a couple of reasons, such as each protocol has a separate purpose, and we’re pushing features into BGP to make it serve the role of an IGP (which is, in the end, going to cause some major outages—if it hasn’t already).
Consider this: is it easier to troubleshoot infrastructure reachability separately from vrf reachability? The answer is obvious—yes! What about security? Is it easier to secure a fabric when the underlay never touches any attached workload? Again—yes!
We get this tradeoff wrong all the time. A lot of times this is because we are afraid of what we do not know. Ten years ago I struggled to convince large operators to run BGP in their networks. Today no-one runs anyone other than BGP—and they all say “but we don’t have anyone who knows OSPF or IS-IS.” I’ve no idea what happened to old-fashioned network engineering. Do people really only have one “protocol slot” in their brains? Can people really only ever learn one protocol?
Or maybe we’ve become so fixated on learning features that we no longer no protocols?
I don’t know the answer to these questions, but I will say this—over the years I’ve learned that simpler is not always less complex.
We don’t often do a post-mortem on the development and deployment of new protocols … but here at the Hedge we’re going to brave these deep waters to discuss some of the lessons we can learn from the development and deployment of IPv6, especially as they apply to design and deployment cycles in the “average network” (if there is such at thing). Join us as James Harr, Tom Ammon, and Russ White consider the lessons we can learn from IPv6’s checkered history.
From the question pile: Route servers (as opposed to route reflectors) don’t change anything about a BGP route when re-advertising it to a peer, whether iBGP or eBGP. Why don’t route servers cause routing loops (or other problems) in a BGP network?
Route servers are often used by Internet Exchange Points (IXPs) to distribute routes between connected BGP speakers. BGP route servers
- Don’t change anything about a received BGP route when advertising the route to its peers (other BGP speakers)
- Don’t install routes received through BGP into the local routing table
Shouldn’t using route servers in a network—pontentially, at least—cause routing loops or other BGP routing issues? Maybe a practical example will help.
Assume b, e, and s are all route servers in their respective networks. Starting at the far left, a receives some route, 101::/64, and sends it on to b,, which then sends the unmodified route to c. When c receives traffic destined to 101::/64, what will happen? Regardless of whether these routers are running iBGP or eBGP, b will not change the next hop, so when c receives the route, a is still the next hop. If there’s no underlying routing protocol, c won’t know how to reach A, so it will ignore the route and drop the traffic. Even if there is an underlying routing protocol, c’s route to 101::/64’s route passes through b, and b isn’t installing any routing information learned from BGP into its local routing table (because it’s a route server). b is going to drop traffic destined to 101::/64.
We can solve this simple problem by adding a new link between the two clients of the route server, as shown in the center diagram. Here, d sends 101::/64 to e, which then sends the unchanged route to g. Since g has a direct connection to d, we can assume g will send traffic destined to 101::/64 directly to d, where it will be forwarded to the destination. Why wouldn’t d and g peer directly instead of counting on e to carry routes between them? In most cases this kind of indirect peering is done to increase network scale. If there are thousand routes like d and g, it will be simpler for them all to peer to e than to build a full mesh of connections.
Why not use a route reflector rather than a route server in this situation? Route reflectors can only be used to carry routes between iBGP peers. If d, e, and g are all in different autonomous systems, route reflectors cannot be used to solve this problem.
But this brings us back to the original question—route reflectors use the cluster list to prevent loops within an AS (the cluster list is similar in form and function to the AS path carried between autonomous systems, but it uses router ID’s rather than AS numbers to describe the path)?
If you have multiple route servers connected to one another you can, in fact, form routing loops.
In this network, a is sending 101::/64 to b, which is then sending the route, unmodified, to e. Because of some local policy, e is choosing the path through a, which means e forwards traffic destined to 101::/64 to c. At the same time, e is advertising 101::/64 to b, which is then sending the route (unmodified) to a, and a is choosing the path through c. In this case, a permanent (persistent) routing loop is formed through the control plane, primarily because no single BGP speaker has a complete view of the topology. The two route servers, by hiding the real path to 101::/64, makes is possible to form a routing loop.
The deploy route servers without forming these kinds of loops—
- BGP speakers learning routes from route servers should be directly connected—there should not be destinations reachable via some “hidden” intermediate hop
- Route servers should send all the routes they learn from clients; they should not use bestpath to choose which routes to send to clients
These restrictions prevent routing loops from forming when deploying route servers—but they also restrict the use of route servers to situations like carrying routes between BGP speakers connected to a single fabric.
Cisco filed a patent some time back describing a method to prevent routing loops when using BGP route servers; it makes interesting reading for folks who want to dive a little deeper.
While RFC9199 (are we really in the 9000’s?) is targeted at large-scale DNS deployments–specifically root zone operators–so it might seem the average operator won’t find a lot of value here.
This is, however, far from the truth. Every lesson we’ve learned in deploying large-scale DNS root servers applies to any other large-scale user-facing service. Internally deployed DNS recursive servers are an obvious instance, but the lessons here might well apply to a scheduling, banking, or any other multi-user application accessed from a lot of places by a lot of different users. There are some unique points in DNS, such as the relatively slower pace of database synchronization across nodes, but the network-side lessons can still be useful for a lot of applications.
What are those lessons?
First, using anycast dramatically improves performance for these kinds of services. For those who aren’t familiar with the concept, anycase turns an IP address into a service identifier. Any host with a copy (or instance) or a given service advertises the same address, causing the routing table to choose the (topologically) closest instance of the service. If you’re using anycast, traffic destined to your service will automatically be forwarded to the closest server running the application, providing a kind of load sharing among multiple instances through routing. If there are instances in New York, California, France, and Taipei, traffic from users in North Carolina will be routed to New York and traffic from users in Singapore will be routed to Taipei.
You can think of an anycast address something like a cell tower; users within a certain desintance will be “captured” by a particular instance. The more copies of the service you deploy, the smaller the geographic region the service will support. Hence you can control the number of users using a particular copy of the service by controlling the number and location of service copies.
To understand where and how to deploy service instances, create anycast catchment maps. Again, just like a wifi signal coverage map, or a cellphone tower coverage map, it’s important to understand which users will be directed to which instances. Using a catchment map will help you decide where new instances need to be deployed, which instances need the fastest links and hardware, etc. The RIPE ATLAS pobes and looking glass servers are good ways to start building such a map. If the application supports a large number of users, you might be able to convince the application developer to include some sort of geographic information in requests to help build these maps.
Third, when deploying service instance, pay as much attention to routing and connectivity as you do the number of instances deployed. As the authors note, sometimes eight instances will provide the same level of service as several thousand instances. The connectivity available into each instance of the service–bandwidth, delay, availability, etc.–still has a huge impact on service speed.
Fourth, reduce the speed at which the database needs to be synchronized where possible. Not every piece of information needs to be synchronized at the same rate. The less data being synchronized, the more consistent the view from multiple users is going to be.
RFC9199 is well worth reading, even for the average network engineer.
One of the many reasons engineers should work for a vendor, consulting company, or someone other than a single network operator at some point in their career is to develop a larger view of network operations. What are common ways of doing things? What are uncommon ways? In what ways is every network broken? Over time, if you see enough networks, you start seeing common themes and ideas. Just like history, networks might not always be the same, but the problems we all encounter often rhyme. Ken Celenza joins Tom Ammon, Eyvonne Sharp, and Russ White to discuss these common traits—ten things I know about your network.
My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.
What about the much more controlled environment of a data center fabric?
Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.
These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.
Start with A losing its connection to 101::/64—
- T1: A withdraws its route from B and C
- T2: B withdraws its route from D and E, C withdraws its route from F and G
- T3: D and E withdraw their routes from H, F and G withdraw their routes from K
- T4: H and K withdraw their routes from L
Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.
What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.
What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.
If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.
Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.
L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.
What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.
Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.
the bottom line
There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.
Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.
I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …
At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.
There are many reasons an operator might want to select which neighboring AS through which to send traffic towards a given reachable destination (for instance, 100::/64). Each of these examples assumes the AS in question has learned multiple paths towards 100::/64, one from each peer, and must choose one of the two available paths to forward along.
In the following network—
From AS65004’s perspective…
Transit providers primarily choose the most optimal exit from their AS to reduce the amount of peering settlement they are paying by using and maintaining settlement-free peering where possible and reducing the amount of time and distance traffic is carried through their network (through hot potato routing, discussed in more detail below).
If, for instance, AS65004 has a paid peering relationship with AS65002, and a contract with AS65003 which is settlement-free so long as the traffic between AS65004 and AS65003 is roughly symmetric. AS65004 has two roughly equal-cost paths (both have the same AS Path length) towards 100::/64. In this situation, AS 65004 is going to direct traffic towards AS65003 to maintain symmetrical traffic flows and direct any remaining traffic towards AS65002.
This kind of balancing is normally done through a controller or network management system that monitors the balance of traffic with AS65003, adjusting the preference of sets of routes to attain the correct balance with AS65003 while reducing the costs of using the link to AS65002 to the minimum possible.
From AS65005’s perspective…
AS65005 can either send traffic originating in AS65001, received from AS65002, and destined to AS65006, to either AS65004—a peer—or AS65006—a customer. The internal path between the entry point for this traffic is longer if the traffic is carried to AS65006, and shorter if the traffic is carried to AS65004. These longer and shorter paths give rise to the concepts of hot and cold potato routing.
If AS65006 is paying AS65005 for transit, AS65005 would normally carry traffic across the longer path to its border with AS65006. This is cold potato routing. AS65005’s reason for choosing this option is to maximize revenue from the customer. First, as the link between AS65005 and AS65006 becomes busier, AS65006 is likely to upgrade the link, generating additional revenue for AS65005. Even if the traffic level is not increasing, steady traffic flow encourages the customer to maintain the link, which protects revenue. Second, AS65005 can control the quality-of-service AS65006 receives by keeping the traffic within its network for as long as possible, improving the customer’s perception of the service they are receiving.
Cold potato routing is normally implemented by setting the preference on routes learned from customers, so these routes are preferred over all routes learned from peers.
If AS6006 is not paying AS65005 for transit, it is to AS65005’s advantage to carry the traffic as short a distance as possible. In this case, although AS65005 is directly connected to AS65006, and the destination is in AS65006, AS65005 will choose to direct the traffic towards its border with AS65004 (because there is a valid route learned for this reachable destination from AS65004).
This is hot potato routing—like the kids’ game, you want to hold on to the traffic for as short an amount of time as possible. Hot potato routing is normally implemented by setting the preference on routes to the same and relying on the IGP metric component of the BGP bestpath decision process to find the closest exit point.
Next week I’ll continue this series on BGP interdomain policies… feel free to leave a comment if you think I’ve explained something incorrectly, etc.