WRITTEN
Simple or Complex?
A few weeks ago, Daniel posted a piece about using different underlay and overlay protocols in a data center fabric. He says:
There is nothing wrong with running BGP in the overlay but I oppose to the argument of it being simpler.
One of the major problems we often face in network engineering—and engineering more broadly—is confusing that which is simple with that which has lower complexity. Simpler things are not always less complex. Let me give you a few examples, all of which are going to be controversial.
When OSPF was first created, it was designed to be a simpler and more efficient form of IS-IS. Instead of using TLVs to encode data, OSPF used fixed-length fields. To process the contents of a TLV, you need to build a case/switch construction where each possible type a separate bit of code. You must count off the correct length for the type of data, or (worse) read a length field and count out where you are in the stream.
Fixed-length fields are just much easier to process. You build a structure matching the layout of the fixed-length fields in memory, then point this structure at the packet contents in-memory. From there, you can just use the structure’s contents to directly access the data.
Over time, however, as new requirements have been pushed into IGPs, OSPF has become much more complex while IS-IS has remained relatively constant (in terms of complexity). IS-IS went through a bit of a mess when transitioning from narrow to wide metrics, but otherwise the IS-IS we use today is the same protocol we used when I first started working on networks (back in the early 1990s).
OSPF’s simplicity, in the end, did not translate into a less complex protocol.
Another example is the way we transport data in BGP. A lot of people do not know that BGP’s original design allowed for carrying information other than straight reachability in the protocol. BGP speakers can negotiate multiple sessions, with each session carrying a different kind of information. Rather than using this mechanism, however, BGP has consistently been extended using address families—because it is simpler to create a new address family than it is to define a new kind of data parallel with address families.
This has resulted in AFs that are all over the place, magic numbers, and all sorts of complexity. The AF solution is simpler, but ultimately more complex.
Returning to Daniel’s example, running a single protocol for underlay and overlay is simpler, while running two different protocols is less simple. However, I’ve observed—many times—that running different protocols for underlay and overlay is less complex.
Why? Daniel mentions a couple of reasons, such as each protocol has a separate purpose, and we’re pushing features into BGP to make it serve the role of an IGP (which is, in the end, going to cause some major outages—if it hasn’t already).
Consider this: is it easier to troubleshoot infrastructure reachability separately from vrf reachability? The answer is obvious—yes! What about security? Is it easier to secure a fabric when the underlay never touches any attached workload? Again—yes!
We get this tradeoff wrong all the time. A lot of times this is because we are afraid of what we do not know. Ten years ago I struggled to convince large operators to run BGP in their networks. Today no-one runs anyone other than BGP—and they all say “but we don’t have anyone who knows OSPF or IS-IS.” I’ve no idea what happened to old-fashioned network engineering. Do people really only have one “protocol slot” in their brains? Can people really only ever learn one protocol?
Or maybe we’ve become so fixated on learning features that we no longer no protocols?
I don’t know the answer to these questions, but I will say this—over the years I’ve learned that simpler is not always less complex.
Route Servers and Loops
From the question pile: Route servers (as opposed to route reflectors) don’t change anything about a BGP route when re-advertising it to a peer, whether iBGP or eBGP. Why don’t route servers cause routing loops (or other problems) in a BGP network?
Route servers are often used by Internet Exchange Points (IXPs) to distribute routes between connected BGP speakers. BGP route servers
- Don’t change anything about a received BGP route when advertising the route to its peers (other BGP speakers)
- Don’t install routes received through BGP into the local routing table
Shouldn’t using route servers in a network—pontentially, at least—cause routing loops or other BGP routing issues? Maybe a practical example will help.
Assume b, e, and s are all route servers in their respective networks. Starting at the far left, a receives some route, 101::/64, and sends it on to b,, which then sends the unmodified route to c. When c receives traffic destined to 101::/64, what will happen? Regardless of whether these routers are running iBGP or eBGP, b will not change the next hop, so when c receives the route, a is still the next hop. If there’s no underlying routing protocol, c won’t know how to reach A, so it will ignore the route and drop the traffic. Even if there is an underlying routing protocol, c’s route to 101::/64’s route passes through b, and b isn’t installing any routing information learned from BGP into its local routing table (because it’s a route server). b is going to drop traffic destined to 101::/64.
We can solve this simple problem by adding a new link between the two clients of the route server, as shown in the center diagram. Here, d sends 101::/64 to e, which then sends the unchanged route to g. Since g has a direct connection to d, we can assume g will send traffic destined to 101::/64 directly to d, where it will be forwarded to the destination. Why wouldn’t d and g peer directly instead of counting on e to carry routes between them? In most cases this kind of indirect peering is done to increase network scale. If there are thousand routes like d and g, it will be simpler for them all to peer to e than to build a full mesh of connections.
Why not use a route reflector rather than a route server in this situation? Route reflectors can only be used to carry routes between iBGP peers. If d, e, and g are all in different autonomous systems, route reflectors cannot be used to solve this problem.
But this brings us back to the original question—route reflectors use the cluster list to prevent loops within an AS (the cluster list is similar in form and function to the AS path carried between autonomous systems, but it uses router ID’s rather than AS numbers to describe the path)?
If you have multiple route servers connected to one another you can, in fact, form routing loops.
In this network, a is sending 101::/64 to b, which is then sending the route, unmodified, to e. Because of some local policy, e is choosing the path through a, which means e forwards traffic destined to 101::/64 to c. At the same time, e is advertising 101::/64 to b, which is then sending the route (unmodified) to a, and a is choosing the path through c. In this case, a permanent (persistent) routing loop is formed through the control plane, primarily because no single BGP speaker has a complete view of the topology. The two route servers, by hiding the real path to 101::/64, makes is possible to form a routing loop.
The deploy route servers without forming these kinds of loops—
- BGP speakers learning routes from route servers should be directly connected—there should not be destinations reachable via some “hidden” intermediate hop
- Route servers should send all the routes they learn from clients; they should not use bestpath to choose which routes to send to clients
These restrictions prevent routing loops from forming when deploying route servers—but they also restrict the use of route servers to situations like carrying routes between BGP speakers connected to a single fabric.
Cisco filed a patent some time back describing a method to prevent routing loops when using BGP route servers; it makes interesting reading for folks who want to dive a little deeper.
RFC9199: Lessons in Large-scale Service Deployment
While RFC9199 (are we really in the 9000’s?) is targeted at large-scale DNS deployments–specifically root zone operators–so it might seem the average operator won’t find a lot of value here.
This is, however, far from the truth. Every lesson we’ve learned in deploying large-scale DNS root servers applies to any other large-scale user-facing service. Internally deployed DNS recursive servers are an obvious instance, but the lessons here might well apply to a scheduling, banking, or any other multi-user application accessed from a lot of places by a lot of different users. There are some unique points in DNS, such as the relatively slower pace of database synchronization across nodes, but the network-side lessons can still be useful for a lot of applications.
What are those lessons?
First, using anycast dramatically improves performance for these kinds of services. For those who aren’t familiar with the concept, anycase turns an IP address into a service identifier. Any host with a copy (or instance) or a given service advertises the same address, causing the routing table to choose the (topologically) closest instance of the service. If you’re using anycast, traffic destined to your service will automatically be forwarded to the closest server running the application, providing a kind of load sharing among multiple instances through routing. If there are instances in New York, California, France, and Taipei, traffic from users in North Carolina will be routed to New York and traffic from users in Singapore will be routed to Taipei.
You can think of an anycast address something like a cell tower; users within a certain desintance will be “captured” by a particular instance. The more copies of the service you deploy, the smaller the geographic region the service will support. Hence you can control the number of users using a particular copy of the service by controlling the number and location of service copies.
To understand where and how to deploy service instances, create anycast catchment maps. Again, just like a wifi signal coverage map, or a cellphone tower coverage map, it’s important to understand which users will be directed to which instances. Using a catchment map will help you decide where new instances need to be deployed, which instances need the fastest links and hardware, etc. The RIPE ATLAS pobes and looking glass servers are good ways to start building such a map. If the application supports a large number of users, you might be able to convince the application developer to include some sort of geographic information in requests to help build these maps.
Third, when deploying service instance, pay as much attention to routing and connectivity as you do the number of instances deployed. As the authors note, sometimes eight instances will provide the same level of service as several thousand instances. The connectivity available into each instance of the service–bandwidth, delay, availability, etc.–still has a huge impact on service speed.
Fourth, reduce the speed at which the database needs to be synchronized where possible. Not every piece of information needs to be synchronized at the same rate. The less data being synchronized, the more consistent the view from multiple users is going to be.
RFC9199 is well worth reading, even for the average network engineer.
Learning to Ride
Have you ever taught a kid to ride a bike? Kids always begin the process by shifting their focus from the handlebars to the pedals, trying to feel out how to keep the right amount of pressure on each pedal, control the handlebars, and keep moving … so they can stay balanced. During this initial learning phase, the kid will keep their eyes down, looking at the pedals, the handlebars, and . . . the ground.
After some time of riding, though, managing the pedals and handlebars are embedded in “muscle memory,” allowing them to get their head up and focus on where they’re going rather than on the mechanical process of riding. After a lot of experience, bike riders can start doing wheelies, or jumps, or off-road riding that goes far beyond basic balance.
Network engineer—any kind of engineering, really—is the same way.
At first, you need to focus on what you are doing. How is this configured? What specific output am I looking for in this show command? What field do I need to use in this data structure to automate that? Where do I look to find out about these fields, defects, etc.?
The problem is—it is easy to get stuck at this level, focusing on configurations, automation, and the “what” of things.
You’re not going to be able to get your head up and think about the longer term—the trail ahead, the end-point you’re trying to reach—until you commit these things to muscle memory.
The point, with technology, is learning to stop focusing on the pedals, the handlebars, and the ground, and start focusing on the goal—whether its nailing this jump or conquering this trail or making it there.
Transitioning is often hard, of course, but its just like riding a bike. You won’t make the transition until you trust your muscle memory a bit at a time.
Learning the theory of how and why things work the way they are is a key point in this transition. Configuration is just the intersection of “how this works” with “what am I trying to do…” If you know how (and why) protocols work, and you know what you’re trying to do, configuration and automation will become a matter of asking the right questions.
Learn the theory, and riding the bike will become second nature—rather than something you must focus on constantly.
On Building a Personal Brand
How do you balance loyalty to yourself and loyalty to the company you work for?
This might seem like an odd question, but it’s an important component of work/life balance many of us just don’t think about any longer because, as Pete Davis says in Dedicated, we live in a world of infinite browsing. We’re afraid of sticking to one thing because it might reduce our future options. If we dedicate ourselves to something bigger than ourselves, then we might lose control of our direction. In particular, we should not dedicate ourselves to any single company, especially for too long. As a recent (excellent!) blog post over at the ACM says:
The idea that we should control our own destiny, never getting lost in anything larger than ourselves, is ubitiquos like water is to a fish. We don’t question it. We don’t argue. It is just true. We assume there are three people who are going to look after “me:” me, myself, and I.
I get it. Honestly, I do. I’ve been there more times than I want to think about. I was the scapegoat in an argument between people far above my pay grade early in my career, causing much angst and pain. I’ve been laid off,—I cared about a company that simply didn’t care about me. Most recently, the family I’d dedicated more than twenty years of my life to ended through a divorce.
I can see why you might ask yourself hard questions about dedicating yourself to anything or anyone.
The problem, as Pete Davis points out, is that the human person was not designed for the kind of digital nomad life represented by the phrase “live for yourself.” We can try to substitute an online community. We can try to replace community with a string of novel experiences. But the truth is it will eventually catch up with you. When you’re young it’s hard to see how it will ever catch up with you, but it will.
Returning to the top—the author of the ACM article advises balancing between dedicating yourself to a company and dedicating yourself to your career. This is wise advice, but it leaves me wondering “how?” Let me lay out some thoughts here. They may not be all of the answer, but they will, I hope, point in the right direction.
First, resist seeing these two choices as orthogonal. They might be at odds in some companies—there are publishers who want your content to build their brand, and they specifically work at preventing you from building your brand. There are companies that explicitly want to own “your whole professional life.” They don’t want you blogging, going to conferences to speak, etc. Avoid these companies.
Instead, find companies that understand your personal brand is an asset to the company. Having a lot of people with strong personal brands in a company makes the company stronger, not weaker. People with strong brands will form communities around themselves. This community is a pool of people from which to recruit top-flight talent. This community allows them to collect new ideas that can be directly applied to problems in the organization. People with strong personal brands will have greater influence when they walk into a room to meet with a customer, a supplier, or just about anyone else. A company full of people with strong personal brands is stronger than one where everyone is faceless, consumed by/hiding behind the company logo.
Second, learn to manage your time effectively. I understand it’s possible to spend so much time building your brand that you don’t get your job done. As an individual, you need to be sensitive to this and learn how to manage your time effectively.
Third, seek out the win/win. Don’t think of every situation through the lens of “it’s either my brand or my employer’s.” There may be times when you cannot do something because, while it would help your brand immensely, it would harm your company’s. There may be times where you need to have a delicate discussion with your manager because you’ve been asked to do something that would be great for the company but would harm your brand. There is almost always a win/win, you just have to find it.
Fourth, seek out a community that’s not attached to work and dedicate yourself to it. Find something larger than yourself. A community that’s not tied to work will be your lifeline when things go wrong.
Finally, expect to get hurt. I know I have (an old saying in my community—never trust a man who doesn’t walk with a limp). You can be the nicest, humblest person in the world. Someone is still going to take advantage of you. In fact, the nicer and humbler you are—the more you care, the more likely it is people are going to take advantage of you. I am amazed at how much people seem to enjoy hurting one another when they believe there won’t be any consequences.
But … if you expect your life to be perfect, you were born in the wrong world. Build up the mental reserves to deal with this. Build a community that will help carry you through. There is nothing better than sitting down and sharing your hurt over a cup of coffee with a good friend (except I don’t drink coffee).
I get it—the world has moved into a YOLO/FOMO phase. If you don’t “grab it,” and right now! you risk missing something really important. We pile up alternative possibilities in our minds, wondering what might have happened if we’d chosen otherwise. We have deep angst over our personal brand, overthinking the concept to the point of diminishing returns.
The solution, though, is not to draw into yourself, to become self-centered. The solution is to find the balance, seek the win/win, dedicate yourself to something bigger than yourself, and find the right way to build your personal brand.
Revisiting BGP Convergence
My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.
What about the much more controlled environment of a data center fabric?
Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.
These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.
Start with A losing its connection to 101::/64—
- T1: A withdraws its route from B and C
- T2: B withdraws its route from D and E, C withdraws its route from F and G
- T3: D and E withdraw their routes from H, F and G withdraw their routes from K
- T4: H and K withdraw their routes from L
Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.
What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.
What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.
If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.
Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.
L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.
What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.
Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.
the bottom line
There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.
Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.
I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …
BGP Policy (Part 7)
At the most basic level, there are only three BGP policies: pushing traffic through a specific exit point; pulling traffic through a specific entry point; preventing a remote AS (more than one AS hop away) from transiting your AS to reach a specific destination. In this series I’m going to discuss different reasons for these kinds of policies, and different ways to implement them in interdomain BGP.
In this post—the last post in this series—I’m going to cover do not transit options from the perspective of AS65001 in the following network—
There are cases where an operator does not traffic to be forwarded to them through some specific AS, whether directly connected or multiple hops away. For instance, AS65001 and AS65005 might be operated by companies in politically unfriendly nations. In this case, AS65001 may be legally required to reject traffic that has passed through the nation in which AS65005 is located. There are at least three mechanisms in BGP that are used, in different situations, to enforce this kind of policy.
Do Not Advertise Communities (Provider Specific)
Many providers supply communities a customer can use to block the advertisement of their routes to a particular AS. For instance, if AS65002 were NTT, according to the NTT customer communities site, if AS65001 advertises 100::/64 with the community 65500:65005, NTT would advertise 100::/64 to all its other peers, but not to AS65005.
Note: NTT is not AS65002; this is only used as an illustration of using a community to block advertisement to a peer’s peer.
The operator at AS65001 might reasonably expect that blocking AS65002 from advertising 100::/64 to AS65005 will block all traffic traveling through AS65005—but the vagaries of the global Internet routing table may well cause traffic to be forwarded through AS65005 anyway in some instances.
If AS65006 has a default route pointing to AS65005, traffic destined to 100::/64 may still be forwarded to AS65005. If AS65005 happens to have a covering aggregate route, or learned of the route via AS65004, it might still carry traffic destined for 100::/64.
It is almost impossible to block all traffic to a given reachable destination from being forwarded through a given autonomous system.
AS Path Injection
An alternate, widely used mechanism is to intentionally inject an AS Path loop when advertising a route to prevent some AS from accepting the route. For instance, AS65001 might advertise 100::/64 with the AS Path [65005,65001] to AS65002. AS65005 would then reject this advertisement because the local AS is already in the AS Path.
While this might appear to “break the rules” of BGP, the reality is the AS Path was never really intended to be a “true record” of the path of an “update” (in fact, there is no such thing as an “update” that travels from one router to the next—the “update” is constructed at each hop based on local tables). This technique is problematic in providing “path security” in BGP, but it does not intrinsically break any BGP rules.
Note: For more information about this technique, refer to this episode of the Hedge.
Again, note it is almost impossible to block all traffic to a given reachable destination from being forwarded through a given autonomous system.
Do Not Advertise Communities (Well Known)
Three further well-known communities, although they are not widely used, are worth considering.
When a route is marked with NO-PEER, the AS should only advertise the route to its customers and never its peers. For instance, if AS65001 advertises 100::/64 to AS65003 with NO-PEER, AS65003 will advertise the destination to AS6507 and AS65008 (assuming these are customers), and not to AS65002 or AS65004 (because both of these autonomous systems transit traffic to and from AS65003).
When a route is marked with NO-EXPORT, the AS should not advertise the reachable destination to any other AS. For instance, if AS65001 advertises 100::/64 to AS65003 with NO-EXPORT, AS65003 will not advertise this reachable destination to any other AS, including AS65007, AS65008, AS65002, or AS65004.
When a route is marked with NO-ADVERTISE, the receiving BGP speaker should not advertise the route to any other BGP speaker, including internal and external connections.