BGP and Suboptimal Route Reflection

One of the crucial points in understanding the operation of BGP is the reliance on the AS path to ensure all routes are loop-free. Within a single AS, however, there is no AS path. How, then, can you ensure the path through an AS is loop-free? The original plan was to fully mesh all the BGP speakers in the AS (a full mesh of iBGP speakers)—but building and maintaining a full mesh of iBGP speakers is difficult, so other solutions were quickly designed. The first of these as the BGP Confederation, which allows a set of autonomous systems to look like a single AS from the outside. This solution, however, is also cumbersome, so… the RR was invented.


  • BGP RR’s abstract information in a way that can cause suboptimal routing
  • To resolve this suboptimal routing, additional paths are advertised to RRCs by RRs


The basic operation of an RR is fairy simple; as new attribute, the cluster list, is added to a route as is passes from client to server. The cluster list contains as list of the clusters the route has passed through, identified by the identifier of the route reflector that “heads” the cluster. If a route is advertised to an iBGP speaker with a cluster ID already on the cluster list, the route is ignored. In a sense, each cluster acts as a “sub-AS,” much like in a confederation, and the cluster ID acts as the sub-AS number. Loop-freeness is guaranteed by making certain the route is not advertised as reachable through the same RR cluster twice, much like eBGP loop-freeness is guaranteed by not advertising a route as reachable through the same AS twice.

Route reflectors emulate the eBGP construction in one other way, as well—only the best path from each cluster’s perspective is advertised. This improves scaling dramatically; with a full mesh of iBGP speakers, every BGP speaker will learn about every path available to reach a destination. If there are ten eBGP peers that can reach 2001:db8:3e8:100::/64, for instance, then every BGP speaker in the AS will know about all ten paths. If route reflectors are used instead of a full mesh of iBGP speakers, the RR server will choose the best exit point from the local AS, and advertise only that one route to each of its clients. An important point to note here: the best path is chosen from the perspective of the RR.

For instance, in this diagram, the best path to 2011:db8:3e8:100::/64 for B is through A. Since B is the route reflector, it will advertise only the path through A to its two clients, which are C and E. Because of this, E will not know about the path through C, even though this is a valid path—and a better path, from E’s perspective.

This is a classic case of aggregation and information hiding; B is removing some available routes towards its route reflector clients, which allows the control plane to scale. According to the state/optimization/surface triad, a reduction in the amount of state should have some effect on the optimization of traffic flow through the network. In this case, E has a suboptimal route.

How can this problem be solved? The most obvious way is to add more information back into the control plane. The solutions offered for this problem vary in which information is added back, and how that information is carried.

The first proposal for solving the suboptimal routing problem was for the RR to send all of its routes to each of its clients. Normally, if a BGP speaker sends a route for a destination, and then follows with another update about this same destination, the receiving speaker will discard the first advertisement. This is called an implicit withdraw. RFC7911 specifies a mechanism for a BGP speaker to carry more than one route for the same destination to a peer by including a Route Identifier (RID) with each route. The function of the route identifier is the same as the route distinguisher in a BGP-VPN deployment—to describe two different routes that happen to share the same destination prefix. By setting a different RID on each route, the reflector can advertise all of its route to its clients. The problem with this solution is that it “undoes” the scaling advantages conferred by the RR; every RRC receives every route. In large networks, the scaling and convergence speed differences can be dramatic.

It seems better to send only part of the routes, but which part? An alternative proposal is for the RR to compute the BGP bestpath from the perspective of each of its RRCs and send only the best route from that RRC’s perspective each RRC. There are several tradeoffs here, as well. First, the RRC must calculate the IGP path from each of its RRC’s in some way; this normally means running SPF from the perspective of each RRC. The RR must then run bestpath for each RRC and send the correct route to each one. Finally, the RRC must keep track of which route it has sent to each RRC, so it can determine when it needs to resend a new route based on changes in the network topology—both for changes in BGP and IGP reachability. This adds a lot of complexity to the BGP RR implementation, and creates a rather deep interaction surface between the IGP and BGP. It also just happens to assume there is an underlying link-state IGP in operation—which may not always be true.

Another option is for the RR to send the bestpath and the second-best path to each RRC. The assumption is one of these two paths will either be optimal, or close to optimal, for all RRCs. This is the solution most implementations have opted for, as it seems to strike a good balance between adding more control plane information, implementation complexity, and solving the suboptimal routing problem. Most implementations will allow the operator to send as many “second-best” paths as they desire to, which allows the operator to fine-tune optimization against state differently in diverse parts of their network.

BGP Security: A Gentle Reminder that Networking is Business

At NANOG on the Road (NotR) in September of 2018, I participated in a panel on BGP security—specifically the deployment of Route Origin Authentication (ROA), with some hints and overtones of path validation by carrying signatures in BGP updates (BGPsec). This is an area I have been working in for… 20 years? … at this point, so I have seen the argument develop across these years many times, and in many ways. What always strikes me about this discussion, whenever and wherever it is aired, is the clash between business realities and the desire for “someone to do something about routing security in the DFZ, already!” What also strikes me about these conversations it the number of times very fundamental concepts end up being explained to folks who are “new to the problem.”


  • BGP security is a business problem first, and a technology problem second
  • Signed information is only useful insofar as it is maintained
  • The cost of deployment must be lower than the return on that cost
  • Local policy will always override global policy—as it should
  • The fear of losing business is a stronger motivator than gaining new business


Part of the problem here is solutions considered “definitive and final” have been offered, the operator community has rejected them for many years, and yet these same solutions are put on the table year after year—like the perennial fruit cake made by someone’s great great aunt in the mists of Christmastime history that has been regifted so many times no-one really remembers where it came from, nor what sorts of fruit it actually contains.

cross posted at CircleID

The business reality, in terms of BGP security, is simple. To deploy some sort of check on the global routing table, at point must be reached where it costs more to not deploy it than to deploy it. This simply business reality is something network designers and architects beat their heads against every day. The solution can be the neatest solution in the world. It might even shop for the ingredients, mix the cookie dough in perfect proportions, bake the cookies, and then transport them to the proper location for the perfect amount of enjoyment from just the right people (insert Goldilocks here, perhaps). But none of this will ever matter if there is no financial upside, or if the financial risks are greater than the financial gains.

So, some hopefully helpful business realities.

Signed information is no more useful than unsigned information if it is not kept up to date. It is great to get everyone out in full force to build a cryptographically secured database of who owns what prefixes and AS numbers. It is wonderful to find a way to distribute that information throughout the ‘net so people can use it as another tool to determine whether or not to accept a route, or what weight to place on that route. People might even spend a bit of time building this database, just because they believe it is good for the community.

The problem is not day one. It is day two, and then day two thousand. What motivation is there to keep the information in this database up-to-date? Unless there is some—and here I mean financial motivation—the database will lose its effectiveness over time. At some point, when the error rate reaches some number (around 30% seems to be about right), people will simply stop trusting it. When the error rate gets high enough, the tools will stop being used, and the ‘net will revert to its old self.

The cost of deployment and operation must at least be close to the gains from deployment. At this point, there is no financial gain any one company can see from deploying anything in the realm of BGP security, so the cost of deployment and maintenance must be close to zero. While there are folks (including me) trying to reduce the cost as close to zero as possible, we are not there yet, and I do not know if we will ever be there.

Local policy will always override global policy. The literature of BGP security is replete with statements like: “if the route meets this criteria, the BGP speaker MUST drop it.” Good luck. The Internet is a confederation of independent companies, each of which runs their network in a way that they believe will make the most money at the lowest cost possible. One of the ways this happens is that people tune their local policies to charge their adjacent autonomous systems as much money as possible, while reducing their OPEX and CAPEX as much as possible. There will always be money to be made in the grey space around local tuning of policies for optimal traffic flow. Hence local policy will always win over what any database anyplace might say.

To give a specific example: assume you run a network, and you have peered with another operator in multiple places for many years. You know the other operator’s routes well, as they have not changed for many years. One day, you receive a route from this operator in which everything looks correct, but the route is not contained in this outside database of “correct routes.”

Noting this route is a route you have received from this very same operator for many years, are you going to drop it because it’s not right in some database, or are you going to use it given your past standing and relationship?

The fear of losing business is always the strongest motivator. Which leads to the next issue. It does not matter how wonderful your network is if you have a high customer churn rate. The most certain way to have a high churn rate is to place your customer’s experience in the hands of someone else. Such as a communally managed database, perhaps. This is another reason why local policy will always win over remote policy—the local provider is handed checks by customers, not the community at large. They have more incentive to keep their customers happy than the community.

You cannot secure things you do not tell anyone about. This final one is probably not as obvious as the others, but it is just as important as any other item on this list. There are many backdoor arrangements and sealed contracts in the provider world. People transit traffic without telling anyone else that traffic is being transited. Some people are customers of others only in the event of a massive failure someplace else in their network, but do not want anyone to know about this.

All of these arrangements are perfectly legitimate and legal in their respective jurisdictions. But you cannot secure something that no-one knows about. The more information that is hidden in a system, the harder it is to validate the information that exists is correct.

The bottom line is this: BGP security, like most networking problems, is not a technology problem. BGP security is, at its heart, a business problem. The lesson is here not just for security, but for network engineering in general. Business is the bottom line, not technology.

Research: Facebook’s Edge Fabric

The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (AS’), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue this is not enough.

Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI:

Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.

Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.

The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) compute and edge pushed to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit provided through peering across the public ‘net. The problem Facebook is trying to solve is not the last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.

The authors begin with the observation that if left to its own decision process, BGP will evenly distribute traffic across all available peers, even though each peer is actually different levels of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that used their ability to choose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Based Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.

The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be shortest physical or logical path, as well, and hence the path with the lowest end-to-end time. In the same way, local preference is normally set to prefer peer connections rather than upstream or transit connections. This should mean traffic will take a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. This should result in traffic passing through less lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance, rather than enhancing it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.

How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—

The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?

Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).

Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.

A few items of note from this research.

First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.

Second, there is a fair amount of additional complexity in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase complexity, this is not always true. It likely is for many hyperscalers, who’s business relies largely on engagement. Given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.

Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities operators providers allow their customers to use to shape the direction of traffic flows to optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system—if this kind of wrestling match with performance is valuable in your environment.

Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”

Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops or reduces performance in the face of negative feedback loops. Hysteresis, the tendency to cause a performance problem in the process of reacting to a performance problem, must be carefully considered when designing such as system, as well.

The Bottom Line

Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.

Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.

Research: Robustness in Complex Systems

While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like there is no single point of failure. This common answer, however, does not go “far enough” in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the time to converge can be reduced enough to offset the higher path availability.

In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.

This post will look at a paper that examines robustness more deeply, Robustness in Complexity Systems,” by Steven Gribble. While this is an older paper—it was written in 2000—it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster based Distributed Data Structure (DDS). A more convenient way for readers to think of this is as a distributed database. Several problems discovered when building and deploying this DDS are considered, including—

  • A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After the record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coallesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is their initial “guess” at a a good timer was ultimately an order of a magnitude too small, causing some of the nodes to “fall behind” other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to “take up the slack,” causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in a distributed system.
  • A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.

Gribble draws several lessons from these, and other, outages in the system.

First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.

Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”

Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measureable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.

Fourth, Gribbles says adaptivitiy must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.

This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.

RIPE NCC: The Future of BGP Security

I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.

Why is the Feasibility Condition Less Than?

A reader recently emailed me with this question: Why isn’t the condition for a Feasible Successor set to less than (<), rather than less than of equal (<=), in EIGRP? It certainly seems, as noted in the email, that this rules out a lot of possible possible loop free alternate paths. The network below will be used to illustrate.

First, assume all links are cost of 1 except D->C, which is cost of 2. Here D will choose B as the Successor, and the FC will be set to 2. The RD of C will be 1, so C will be an FS. Now consider two failures. The first failure is D->B. D will immediately reroute to the FS, which is C, without changing the FC. This works, because C’s cost to 100::/64 via D is 4, much higher than it’s cost to 100::64 along C->A. Now consider what happens if A->100::/64 fails. If the timing of the query “works right,” C and B will be notified first, then finally D. Even if D is somehow notified before C, and D switches to C as its FS, the traffic is dropped, rather than looped—so all is happy.

Now change the situation a little. Assume the A->C link is cost of 2, and the remaining links are 1. Now assume you make the FS condition <=, rather than "just" <. From D's perspective, C is still an FS. From C's perspective, D is also an FS. Suppose the B->D link fails. D switches to C, who’s path is intact; this works. Assume the C->A link fails. C switches to D, who’s path is intact; this works.

Finally, assume the A->100::/64 link fails. If the timing is just right, D and C will receive the update with this failure at the same moment, and will both switch to their FS’s. Now you have a loop. How long will this loop last? Until C and D can do a diffusing update — probably around 250ms or less… But if you count the outside computation time, it’s the SIA timer, which is around 10 minutes in more recent versions. Hence, “just the wrong circumstances” can cause up to a 10 minute loop. Not good.

The bottom line is this: any time you have a situation where two routers can end up pointing at one another as their local FS, you have a ring of some number of hops. If the final destination is outside the ring, some member of the ring must be the point at which traffic leaves the ring, and moves towards the destination. If the link connecting the ring to the destination fails, the update carrying the information about the loss of connectivity to the destination must travel around the ring in both directions.

When the update reaches the point at which routing would normally split horizon—the “point at which the waterfall splits,” so-to-speak, it will like reach both sides of the split horizon point at a close enough interview to cause a loop in the forwarding tables. This situation causes microloops in a link state protocol, but microloops are often resolved quickly, and hence tend to be tolerable. In a distance vector protocol, like EIGRP, the length of time the microloop can exist can be much longer—ultimately it depends on the speed at which a distributed computation can take place (because the computation is not local to each node), and, failing that, the amount of time the network can remain in an unstable state before “something is done about it.”

This, by the way, is why I am always opposed to increasing the SIA timer in EIGRP above the “factory defaults.” The SIA timer is essentially the amount of time you are willing to allow your network to remain unconverged in the worst case, and hence either dropping or looping traffic.

The Universal Fat Tree

Have you ever wondered why spine-and-leaf networks are the “standard” for data center networks? While the answer has a lot to do with trial and error, it turns out there is also a mathematical reason the fat-tree spine-and-leaf is is used almost universally. There often is some mathematical reason for the decisions made in engineering, although we rarely explore those reasons. If it seems to work, there is probably a reason.

The fat-tree design is explored in a paper published in 2015 (available here at the ACM, and now added to my “classic papers” page so there is a local copy as well), using a novel technique to not only explore why the spine-and-leaf fat-tree is so flexible, but even what the ideal ratio of network capacity is at each stage. The idea begins with this basic concept: one kind of network topology can be emulated on top of another physical topology. For instance, you can emulate a toroid topology on top of a hierarchical network, or a spine-and-leaf on top of of hypercube, etc. To use terms engineers are familiar with in a slightly different way, let’s call the physical topology the underlay, and the emulated topology the overlay. This is not too far off, as emulated topologies do need to be a virtual construction on top of the underlay, which means… tunnels, of course.

In emulating one kind of topology over another kind of topology, however, you lose some performance. This performance loss factor is quite important, in that it tells you whether or not it is worth building your actual network using the design of the underlay, or using the overlay. If, for instance, you emulate a toroid design on top of a folded Clos, and you discover the performance is half as good as a real, physical toroid, then you might be better off building a toroid rather than a folded Clos.

This might be difficult to grasp at first, so let’s use a specific example. The only topologies most engineers are familiar with are rings, hierarchical networks, partial mesh, full mesh, and some form of spine-and-leaf. Consider the case of a full mesh network versus a ring topology. Suppose you have some application that, through careful testing, you have determined best runs on a full mesh network. Now when you go to build your physical network, you find that you have the choice between building a real full mesh network, which will be very expensive, or a ring network, which will be very cheap. How much performance will you lose in running this application on the ring topology?

The way to discover the answer is to build a virtual full mesh overlay onto the ring topology. Once you do this, it turns out there are ways to determine how much performance loss you will encounter using this combination of underlay and overlay. Now, take this one step farther—what if you decide not to build the overlay, but simply run the application directly on the ring topology? Minus the tunnel encapsulation and de-encapsulation, the resulting loss in performance should be roughly the same,

Given this, you should be able to calculate the performance loss from emulating every kind of topology on top of every other kind of topology. From here, you can determine if there is one topology that can be used as an underlay for every other kind of topology with a minimal amount of performance loss. It turns out there is: the fat-tree spine-and-leaf topology. So an application that might best work on a full mesh network will run on a fat-tree spine-and-leaf topology with minimal performance loss. The same can be said of an application that runs best on a ring topology, or a hierarchical topology, or a partial mesh, or a… Thus, the fat-tree spine-and-leaf can be called a universal topology.

After coming to this point, the authors wonder if it is possible to determine how the bandwidth should be apportioned in a fat-tree spine-and-leaf to produce optimal emulation results across every possible overlay. This would give designers the ability to understand how to achieve the most optimal universal application performance on a single physical topology. Again, there turns out to be an answer to this question: doubling the bandwidth at each stage in the fabric is the best way to create an underlay that will emulate every other kind of overlay with the minimal amount of performance loss. The authors come to this conclusion by comparing a data center network to an on-chip connection network, which has a provable “best” design based on the physical layout of the wires. Translating the physical space into a monetary cost produces the same result as above: doubling the bandwidth at each stage creates the least expensive network with the most capabilities.

The result, then, is this: the fat-tree spine-and-leaf design is the most effective design for most applications, a universal topology. Further, building such a design where there are two neighbors down for every neighbor up, and doubling the bandwidth at each stage, is the most efficient use of resources in building such a network.

OSPF Topology Transparent Zones

Anyone who has worked with OSPF for any length of time has at least heard of areas—but perhaps before diving into Topology Transparent Zones (TTZs), a short review is in order.

In this diagram, routers A and B are in area 0, routers C and D are Area Border Routers (ABRs), and routers E, F, G, H, and K are all in area 1. The ABRs, C and D, do not advertise the existence of E, F, G, H, or K to the routers in area 0, nor the links to or between any of those routers. Any reachable destinations in area 1 are advertised using a em>summary LSA, or a type 3 LSA, towards A and B. From the perspective of A and B, 100::/64 and 101::/64 would be advertised by C and D as directly connected destinations, using the cost from C and D to each of these two destinations, based on a summary LSA.

What if you wanted to place H and K in their own area, with G as an ABR, behind the existing area 1? You cannot do this in OSPF using any form of a standard flooding domain, or area. There is no way to take information about a destination out of a summary LSA and place it in another summary LSA without risking the formation of a routing loop. Once the topology information is lost, which OSPF relies on to prevent the formation of routing loops, there is no way to gain it back.

Why might you want to do this? The primary reasons for stacking multiple areas in this way relate to mobile ad-hoc networks; you cannot control the way routers might be connected together, nor where flooding domain boundaries might be drawn. Assuming you have a requirement, you would be out of luck if the only tool you have is a standard OSPF area.

However, you can stack areas using TTZs. What is a TTZ? It is a flooding domain that does not appear to exist to the routers around it—or rather, connected to either side of the TTZ. In the illustration below, routers C, D, E, F, and G have been configured so they form a TTZ.

Note that when the network is configured this way, routers A, B, H, and K all believe they are in a single flooding domain. From the perspective of these four routers, C, D, and G appear to be directly connected by a set of point-to-point links. Traffic sent from B to 101::/64, for instance, would pass into the TTZ at D, through the TTZ, and back out the other side at G. This traffic would be forwarded normally through the TTZ, as the routers within the TTZ have a full view of the network topology, as well as reachable destinations.

The destination within the TTZ, 100::/64, would be advertised by each TTZ edge router as if it were a directly connected destination. Hence G, C, and D would each advertise 100::/64 as a connected interface in their router LSA, or their type 1 advertisement. This allows routers outside the TTZ to reach destinations within the TTZ, without revealing the internal topology of the TTZ.

Information about the topology within the TTZ is carried in a set of opaque LSAs which are not forwarded outside the TTZ. This allows the TTZ to maintain consistent internal state, without burdening the rest of the network with internal topology information, or topology changes.

TTZ’s are a rather narrowly focused solution; it is not likely you will see these used in an OSPF network near you any time soon. On the other hand, they are an interesting, experimental, addition to OSPF that can be used to solve a set of edge cases.

You can read about TTZs more in RFC8099.

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

  • EIGRP receives a set of queries to which it must respond
  • EIGRP allocates memory for each packet to build the responses
  • Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
  • EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
  • EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.

There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Responding to Readers: Questions on Microloops

Two different readers, in two different forums, asked me some excellent questions about some older posts on mircoloops. Unfortunately I didn’t take down the names or forums when I noted the questions, but you know who you are! For this discussion, use the network show below.

In this network, assume all link costs are one, and the destination is the 100::/64 Ipv6 address connected to A at the top. To review, a microloop will form in this network when the A->B link fails:

  1. B will learn about the link failure
  2. B will send an updated router LSP or LSA towards D, with the A->B link removed
  3. At about the same time, B will recalculate its best path to 100::/64, so its routing and forwarding tables now point towards D as the best path
  4. D, in the meantime, receives the updated information, runs SPF, and installs the new routing information into its forwarding table, with the new path pointing towards E

Between the third and fourth steps, B will be using D as its best path, while D is using B as its best path. Hence the microloop. The first question about microloops was—

Would BFD help prevent the microloop (or perhaps make it last a shorter period of time)?

Consider what happens if we have BFD running between A and B in this network. While A and B discover the failure faster—perhaps a lot faster—none of the other timing points change. No matter how fast A and B discover the link failure, B is still going to take some time to flood the change to the topology to D, and D is going to take some time to compute a new set of shortest paths, and install them into its local routing and forwarding tables.

Essentially, if you look at convergence as a four step process—

  1. Discovery
  2. Reporting
  3. Calculation
  4. Installation

Then you can see the microloop forms because different routers are “doing” steps 3 and 4 at different times. Discovery, which is what BFD is aimed at, is not going to change this dynamic. The second question was—

Can microloops occur on a link coming up?

For this one, let’s start in a different place. Assume, for a moment, that the A->B link is down. Now someone goes in and configures the A->B link to make it operational. At the moment the link comes up, B’s shortest path to 100::/64 is through D. When the new link comes up, it will learn about the new link, calculated a new shortest path tree, and then install the new route through A towards 100::/64. E will also need to calculate a new route, using B as its next hop.

The key point to consider is this: who tells E about this new path to the destination? It is B. To form a microloop, we need D to install a route through B towards 100::/64 before B does. This is theoretically possible in this situation, but unlikely, because D is dependent on B for information about this new path. Things would need to be pretty messed up for B to learn about the new path first, but not recalculate its shortest path tree and install the route before D can. So—while it is possible, it is not likely.

Thanks for sending these terrific questions in.

Tags: |

Thoughts on Open/R

Since Facebook has released their Open/R routing platform, there has been a lot of chatter around whether or not it will be a commercial success, whether or not every hyperscaler should use the protocol, whether or not this obsoletes everything in routing before this day in history, etc., etc. I will begin with a single point.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

Design is about tradeoffs. Protocol design is no different than any other design. Hence, we should expect that Open/R makes some tradeoffs. I know this might be surprising to some folks, particularly in the crowd that thinks every new routing system is going to be a silver bullet that solved every problem from the past, that the routing singularity has now occurred, etc. I’ve been in the world of routing since the early 1990’s, perhaps a bit before, and there is one thing I know for certain: if you understand the basics, you would understand there is no routing singularity, and there never will be—at least not until someone produces a quantum wave routing protocol.

Ther reality is you always face one of two choices in routing: build a protocol specifically tuned to a particular set of situations, which means application requirements, topologies, etc., or build a general purpose protocol that “solves everything,” at some cost. BGP is becoming the latter, and is suffering for it. Open/R is an instance of the former.

Which means the interesting question is: what are they solving for, and how? Once you’ve answered this question, you can then ask: would this be useful in my network?

A large number of the points, or features, highlighted in the first blog post are well known routing constructions, so we can safely ignore them. For instance: IPv6 link local only, graceful restart, draining and undraining nodes, exponential backoff, carrying random information in the protocol, and link status monitoring. These are common features of many protocols today, so we don’t need to discuss them. There are a couple of interesting features, however, worth discussing.

Dynamic Metrics. EIGRP once had dynamic metrics, and they were removed. This simple fact always makes me suspicious when I see dynamic metrics touted as a protocol feature. Looking at the heritage of Open/R, however, dynamic metrics were probably added for one specific purpose: to support wireless networks. This functionality is, in fact, provided through DLEP, and supported in OLSR, MANET extended OSPF, and a number of other MANET control planes. Support DLEP and dynamic metrics based on radio information was discussed at the BABEL working group at the recent Singapore IETF, in fact, and the BABEL folks are working on integration dynamic metrics for wireless. So this feature not only makes sense in the wireless world, it’s actually much more widespread than might be apparent if you are looking at the world from an “Enterprise” point of view.

But while this is useful, would you want this in your data center fabric? I’m not certain you would. I would argue dynamic metrics are actually counter productive in a fabric. What you want, instead, is basic reacbility provided by the distributed control plane (routing protocol), and some sort of controller that sits on top using an overlay sort of mechanism to do traffic engineering. You don’t want this sort of policy stuff in a routing protocol in a contained envrionment like a fabric.

Which leads us to our second point: The API for the controller. This is interesting, but not strictly new. Openfabric, for instance, already postulates such a thing, and the entire I2RS working group in the IETF was formed to build such an interface (though it has strayed far from this purpose, as usual with IETF working groups). The really interesting thing, though, is this: this southbound interface is built into the routing protocol itself. This design decision makes a lot of sense in a wireless network, but, again, I’m not certain it does in a fabric.

Why not? It ties the controller architecture, including the southbound interface, to the routing protocol. This reduced component flexibility, which means it is difficult to replace one piece without replacing the other. If you wanted to replace the basic functionality of Open/R without replacing the controller architecture at some point int he future, you must hack your way around this problem. In a monolithic system like Facebook, this might be okay, but in most other network environments, it’s not. In other words, this is a rational decision for Open/R, but I’m not certain it can, or should, be generalized.

This leads to a third observation: This is a monolithic architecture. While in most implementations, there is a separate RIB, FIB, and interface into the the forwarding hardware, Open/R combines all these things into a single system. In any form of Linux based network operating system, for instance, the routing processes install routes into Zebra, which then installs routes into the kernel and notifies processes about routes through the Forwarding Plane Manager (FPM). Some external process (switchd in Cumulus Linux, SWSS in SONiC), then carry this routing information into the hardware.

Open/R, from the diagrams in the blog post, pushes all of this stuff, including the southbound interface from the controller, into a different set of processes. The traditional lines are blurred, which means the entire implemention acts as a single “thing.” You are not going to take the BGP implementation from snaproute or FR Routing and run it on top of Open/R without serious modification, nor are you going to run Open/R on ONL or SONiC or Cumulus Linux without serious modification (or at least a lot of duplication of effort someplace).

This is probably an intentional decision on the part of Open/R’s designers—it is designed to be an “all in one solution.” You RPM it to a device, with nothing else, and it “just works.” This makes perfect sense in the wrieless environment, particularly for Facebook. Whether or not it makes perfect sense in a fabric depends—does this fit into the way you manage boxes today? Do you plan on using boxex Faebook will support, or roll your own drivers as needed for different chipsets, or hope the SAI support included in Open/R is enough? Will you ever need segment routing, or some other capability? How will those be provided for in the Open/R model, given it is an entire stack, and does not interact with any other community efforts?

Finally, there are a number of interesting points that are not discussed in the publicly available information. For instance, this controller—what does it look like? What does it do? How would you do traffic engineering with this sytem? Segment routing, MPLS—none of the standard ways of providing virtualization are mentioned at all. Dynamic metrics simply are not enough in a fabric. How is the flooding of information actually done? In the past, I’ve been led to believe this is based on ZeroMQ—is this still true? How optimal is ZeroMQ for flooding information? What kind of telemetry can you get out of this, and is it carried in the protocol, or in a separate system? I assume they want to carry telemtry as opaque information flooded by the protocol, but does it really make sense to do this?

Overall, Open/R is interesting. It’s a single protocol designed to opperate optimally in a small range of environments. As such, it has some interesting features, and it makes some very specific design choices. Are those design choices optimal for more general cases, or even other specific problem spaces? I would argue the architecture, in particular, is going to be problematic in terms of long term maintenance and growth. This can modified over time, of course, but then we are left with a collection of ideas that are available in many other protocols, making the idea much less interesting.

Is it interesting? Yes. Is it the routing singularity? No. As engineers, we should take it for what it is worth—a chance to see how other folks are solving the problems they are facing in day-to-day operation, and thinking about how some of those lessons might be applied in our own world. I don’t think the folks at Facebook would argue any more than this, either.

A glance back at the looking glass: Will IP really take over the world?

In 2003, the world of network engineering was far different than it is today. For instance, EIGRP was still being implemented on the basis of its ability to support multi-protocol routing. SONET, and other optical technologies, were just starting to come into their own, and all optical switching was just beginning to be considered for large scale deployment. What Hartley says of history holds true when looking back at what seems to be a former age: “The past is a foreign country; they do things differently there.”

Cross posted at CircleID

In the midst of this change, the Association for Computing Machinery (the ACM) published a paper entitled “Will IP really take over the world (of communications)?” This paper, written during the ongoing discussion within the engineering community over whether packet switching or circuit switching is the “better” technology, provides a lot of insight into the thinking of the time. Specifically, as the author say, the belief that IP is better:

…is based on our collective belief that packet switching is inherently superior to circuit switching because of the efficiencies of statistical multiplexing, and the ability of IP to route around failures. It is widely assumed that IP is simpler than circuit switching, and should be more economical to deploy and manage.

Several of the reasons given are worth considering in light of the intervening years between the paper being written and today. Section 2 of the paper suggest four myths that need to be considering in the world of IP based packet switching.

The first of these myths is that IP is more efficient. Here the authors point out an early argument used by packet switching advocates: when a circuit is idle, reserved bandwidth is unused. Because packets does not reserve bandwidth in this way, packet switching makes more efficient use of the available resources. The authors counter this argument by observing packets switched networks are not that heavily utilized, and the exploring why this might be so. The reasons they give are that packet switched networks become unstable if they are overutilized, and the desire to drive delay down as much as possible. The authors say:

But simply reducing the average link utilization will not be enough to make users happy. For a typical user to experience low utilization, the variance of the network utilization needs to be low, too. Reducing variations in link utilization is hard; today we lack effective techniques to do it. It might be argued that the problem will be solved by research efforts on traffic management, congestion control, and multipath routing. But to-date, despite these problems being understood for many years, effective measures are yet to be introduced.

The second of these myths is that IP is stable. Here the authors point out the many potential problems with packet switched networks, particularly in the area of in band signaling.

The third myth the authors consider is that IP is simpler. Here the argument is that circuit switched networks have fewer moving parts, each of which can be more regimented in their implementation. This means circuit switched networks are generally simpler than their packet switched counterparts.

The fourth myth the authors consider is that real time traffic will be able to run over IP based networks. They argue that because of the various quality of service problems, near real time traffic, such as voice and video, can never be carried over an IP network.

What did the authors do get right, and what did they get wrong? Looking back over the intervening years, some of these problems seem to have been largely solved. For instance, between research into new quality of service solutions, virtualization technologies, and research into and deployment of new methods of managing traffic flows over a packet switched network (for instance, as implemented in QUIC), have largely eased the problems the authors discuss.

Other problems in their list seem to have simply fallen by the wayside. For instance, the authors are largely correct on the stability of IP routing. The Internet, as it exists today, is not really stable. The default free zone, the Internet core, never does really converge. But, on the other hand, it turns out that it might not be all that important. So long as reachability exists, and the applications running over the network can deal with the variable delay, then it does not really matter. It is clear that IP networks can be used for near real time traffic at this point, as well.

The third myth is, perhaps, the most interesting—IP is more complex. Circuit switching generally relies on a centralized control plane, rather than a distributed one. Is this truly simpler? Advocates of Software Defined Networks have argued recently that it is, but I am not convinced. The problem here, it seems to me, is the conflation of reachability and policy into a single “thing” that must be solved using a single subsystem of the network as a system. I tend to think that we will have a much healthier discussion around control plane design and operation once we split the various purposes the control plane serves up into different pieces, and discuss each one as a separate problem and solution space.

What did the authors get right? They were right in surmising that circuit switched networks, particularly in the area of optical switching, would continue to play a role in the global Internet. In fact, you could argue that packet and circuit switched networks often live side-by-side in the global Internet today, with many long haul and metropolitan area links being good examples of circuit switched networks underlying a packet switched overlay. MPLS, of course, also provides a sort of hybrid mode between circuit and packet switching which enables many of the solutions deployed today.

Much as we might like to look back in derision at papers such as this, the reality is the authors were accurately pointing out many of the problems with packet switched networks, some of which still have not been solved today. These objections should not be taken as a relic of the past, but rather as a pointer to what yet remains to be solved, where problems may crop up again, and, finally, where we might look for solutions that may, ultimately, solve the problem of carrying data at scale.

BGPsec and Reality

From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec—while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Interdomain (SIDR) working group is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.

In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptogrphic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.

To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.

Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.

But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP uses to increase performance is in gathering updates up and sending one update whenever possible using either a peer group or an update group. Worse yet, every reachable destination—NLRI—now must be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets, as well, which means they must be stored, carried across the wire, etc.

The general assumption in the BGPsec community is the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.

But let’s lay aside this problem for a moment, and just assumed every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved right? Not so fast…

Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.

First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?

Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.

Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs down or our, the advertisement must be replaced. Given there are 800k routes in the default free zone, a timer of 24 hours (which would still make me uncomfortable in security terms), there would need to be 800k/24 hours updates per hour added to the load of every router in the Internet. On top of the reduced performance noted above.

Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?

Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customer’s trust, and their ability to out compete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.

Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.

Okay—the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, it must find the public key for B and F. This is going to be messy. The second is: By examining the publickeys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.

The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even setting aside BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no-one really wants exposed. The tradeoffs are ultimately unacceptable.

Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.