Is BGP Good Enough?

In a recent podcast, Ivan and Dinesh ask why there is a lot of interest in running link state protocols on data center fabrics. They begin with this point: if you have less than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.


  • BGP is both simple and complex, depending on your perspective
  • BGP is sometimes too much, and sometimes too little for data center fabrics
  • We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer

Will these contentions stand up to a rigorous challenge?

I will begin with the last contention first—BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more challenging than BGP’s requirement, which is very distance-vector like; in BGP you only care if any pair of speakers have enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. It certainly seems, on this score, like BGP is simpler.

Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.

When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.

There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems. The most common of these hard problems is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?

Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you, they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.

Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.

In the center of the illustration above there is a small grey circle representing the core features of BGP. The sections of the ten sided figure around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see the one “slice,” the core functionality, and what we are building on top. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.

Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.

BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.

When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The eVPN part. The intra-AS parts. The inter-AS parts. All of it. The apparent complexity may appear to be low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is very complex. In fact, by any reasonable measure, BGP might have the simplest set of core functions, but it is the most complicated routing protocol in existence.

In other words, complexity is sometimes a matter of perspective. In this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as BGP.

Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?

Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.

Is BGP “good enough?” For some applications, it is. For others, however, it might not be.

You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: If you haven’t found the trade-offs, you haven’t looked hard enough.

Research: Facebook’s Edge Fabric

The Internet has changed dramatically over the last ten years; more than 70% of the traffic over the Internet is now served by ten Autonomous Systems (AS’), causing the physical topology of the Internet to be reshaped into more of a hub-and-spoke design, rather than the more familiar scale-free design (I discussed this in a post over at CircleID in the recent past, and others have discussed this as well). While this reshaping might be seen as a success in delivering video content to most Internet users by shortening the delivery route between the server and the user, the authors of the paper in review today argue this is not enough.

Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, and Hongyi Zeng. 2017. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’17). ACM, New York, NY, USA, 418-431. DOI:

Why is this not enough? The authors point to two problems in the routing protocol tying the Internet together: BGP. First, they state that BGP is not capacity-aware. It is important to remember that BGP is focused on policy, rather than capacity; the authors of this paper state they have found many instances where the preferred path, based on BGP policy, is not able to support the capacity required to deliver video services. Second, they state that BGP is not performance-aware. The selection criteria used by BGP, such as MED and Local Pref, do not correlate with performance.

Based on these points, the authors argue traffic needs to be routed more dynamically, in response to capacity and performance, to optimize efficiency. The paper presents the system Facebook uses to perform this dynamic routing, which they call Edge Fabric. As I am more interested in what this study reveals about the operation of the Internet than the solution Facebook has proposed to the problem, I will focus on the problem side of this paper. Readers are invited to examine the entire paper at the link above, or here, to see how Facebook is going about solving this problem.

The paper begins by examining the Facebook edge; as edges go, Facebook’s is fairly standard for a hyperscale provider. Facebook deploys Points of Presence, which are essentially private Content Delivery Network (CDN) compute and edge pushed to the edge, and hence as close to users as possible. To provide connectivity between these CDN nodes and their primary data center fabrics, Facebook uses transit provided through peering across the public ‘net. The problem Facebook is trying to solve is not the last mile connectivity, but rather the connectivity between these CDN nodes and their data center fabrics.

The authors begin with the observation that if left to its own decision process, BGP will evenly distribute traffic across all available peers, even though each peer is actually experiencing a different level of congestion. This is not a surprising observation. In fact, there was at least one last mile provider that used their ability to choose an upstream based on congestion in near real time. This capability was similar to the concept behind Performance Based Routing (PfR), developed by Cisco, which was then folded into DMVPN, and thus became part of the value play of most Software Defined Wide Area Network (SD-WAN) solutions.

The authors then note that BGP relies on rough proxies to indicate better performing paths. For instance, the shortest AS Path should, in theory, be the shortest physical or logical path, as well, and hence the path with the lowest end-to-end time. In the same way, local preference is normally set to prefer peer connections rather than upstream or transit connections. This should mean traffic will take a shorter path through a peer connected to the destination network, rather than a path up through a transit provider, then back down to the connected network. This should result in traffic passing through more lightly loaded last mile provider networks, rather than more heavily used transit provider networks. The authors present research showing these policies can often harm performance, rather than enhancing it; sometimes it is better to push traffic to a transit peer, rather than to a directly connected peer.

How often are destination prefixes constrained by BGP into a lower performing path? The authors provide this illustration—

The percentage of impacted destination prefixes is, by Facebook’s measure, high. But what kind of solution might be used to solve this problem?

Note that no solution that uses static metrics for routing traffic will be able to solve these problems. What is required, if you want to solve these problems, is to measure the performance of specific paths to given destinations in near real time, and somehow adjust routing to take advantage of higher performance paths regardless of what the routing protocol metrics indicate. In other words, the routing protocol needs to find the set of possible loop-free paths, and some other system must choose which path among this set should be used to forward traffic. This is a classic example of the argument for layered control planes (such as this one).
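A minimal sketch of that split, under stated assumptions: the routing protocol supplies the candidate paths in its own preference order, and a separate function chooses among them using measurements. The path names, the measured values, and the 90% utilization ceiling are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical egress label
    bgp_rank: int       # position in BGP's own preference order, 0 = best
    rtt_ms: float       # measured round-trip time
    utilization: float  # measured load as a fraction of capacity


def choose_egress(candidates, max_utilization=0.9):
    """Pick the lowest-RTT candidate that has headroom.

    If every candidate is over the ceiling, fall back to BGP's own order.
    """
    usable = [c for c in candidates if c.utilization < max_utilization]
    if not usable:
        return min(candidates, key=lambda c: c.bgp_rank)
    return min(usable, key=lambda c: (c.rtt_ms, c.bgp_rank))


# BGP policy prefers the peer, but the peer link is nearly full.
paths = [
    Candidate("peer", 0, 48.0, 0.97),
    Candidate("transit-1", 1, 35.0, 0.40),
    Candidate("transit-2", 2, 41.0, 0.55),
]
print(choose_egress(paths).name)  # prints transit-1
```

The routing protocol still decides which paths are loop free; the overlay only reorders paths the protocol has already accepted.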

Facebook’s solution to this problem is to overlay an SDN’ish solution on top of BGP. Their solution does not involve tunneling, like many SD-WAN solutions. Rather, they adjust the BGP metrics in near real time based on current measured congestion and performance. The paper goes on to describe their system, which only uses standard BGP metrics to steer traffic onto higher performance paths through the ‘net.

A few items of note from this research.

First, note that many of the policies set up by providers are not purely shorthand for performance; they actually represent a price/performance tradeoff. For instance, the use of local preference to send traffic to peers, rather than transits, is most often an economic decision. Providers, particularly edge providers, normally configure settlement-free peering with peers, and pay for traffic sent to an upstream transit provider. Directing more traffic at an upstream, rather than a peer, can have a significant financial impact. Hyperscalers, like Facebook, don’t often see these financial impacts, as they are purchasing connectivity from the provider. Over time, forcing providers to use more expensive links for performance reasons could increase cost, but in this situation the costs are not immediately felt, so the cost/performance feedback loop is somewhat muted.

Second, there is a fair amount of additional complexity in pulling this bit of performance out of the network. While it is sometimes worth adding complexity to increase performance, this is not always true. It likely is for many hyperscalers, whose business relies largely on engagement. Given there is a directly provable link between engagement and speed, every bit of performance makes a large difference. But this is simply not true of all networks.

Third, you can replicate this kind of performance-based routing in your network by creating a measurement system. You can then use the communities providers allow their customers to use to shape the direction of traffic flows to optimize traffic performance. This might not work in all cases, but it might give you a fair start on a similar system, if this kind of wrestling match with performance is valuable in your environment.

Another option might be to use an SD-WAN solution, which should have the measurement and traffic shaping capabilities “built in.”

Fourth, there is a real possibility of building a system that fails in the face of positive feedback loops or reduces performance in the face of negative feedback loops. Hysteresis, the tendency to cause a performance problem in the process of reacting to a performance problem, must be carefully considered when designing such a system, as well.
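One common guard against this kind of oscillation is to move traffic only when the alternative is better by some margin. A minimal sketch, in which the path names, the latency samples, and the 5 ms margin are all hypothetical:

```python
def next_path(current, rtt_ms, margin_ms=5.0):
    """Stay on the current path unless another is better by more than margin_ms."""
    best = min(rtt_ms, key=rtt_ms.get)
    if rtt_ms[current] - rtt_ms[best] > margin_ms:
        return best
    return current


samples = [
    {"a": 30.0, "b": 32.0},
    {"a": 33.0, "b": 31.0},  # b is slightly better: not enough to move
    {"a": 45.0, "b": 31.0},  # b is clearly better: move
    {"a": 29.0, "b": 31.0},  # a is slightly better again: stay on b
]

path = "a"
history = []
for sample in samples:
    path = next_path(path, sample)
    history.append(path)

print(history)  # prints ['a', 'a', 'b', 'b']
```

Without the margin, the second and fourth samples would each trigger a move, and the act of moving traffic could itself shift the measurements back the other way.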

The Bottom Line

Statically defined metrics in dynamic control planes cannot provide optimal performance in near real time. Building a system that can involves a good bit of additional complexity—complexity that is often best handled in a layered control plane.

Are these kinds of tools suitable for a network other than Facebook? In the right situation, the answer is clearly yes. But heed the tradeoffs. If you haven’t found the tradeoff, you haven’t looked hard enough.

Recent BGP Peering Enhancements

BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.

The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained, and the peering shutdown has the minimal effect possible. The normal way of shutting down a link for maintenance would be for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain, before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).

Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP session has been shut down, whether it is coming back up again, how long the shutdown is expected to last, etc.

Graceful Restart (GR) is a long available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router to switch to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted through the NOTIFICATION/session restart process. This same processing takes place for a hold timer timeout in BGP.

The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.

Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.

In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise B as the next hop towards the destination (for instance, a route server would not normally advertise itself as the next hop towards a destination). The problem with this situation is that B might not have the same capabilities as a BGP speaker as A. If B, for instance, cannot forward for IPv6, the situation shown in the illustration would clearly not work.

To resolve this, BGP Next-Hop dependent capabilities allows a speaker to advertise the capabilities of these alternate next hops to peered BGP speakers.

RIPE NCC: The Future of BGP Security

I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.

Do We Really Need a New BGP?

From time to time, I run across (yet another) article about why BGP is so bad, and how it needs to be replaced. This one, for instance, is a recent example.

cross posted at APNIC and CircleID

It seems the easiest way to solve this problem is finding new people (ones who don’t make mistakes) to work on BGP configuration, building IRR databases, and deciding what should be included in BGP? Ivan points out how hopeless a situation this is going to be, however. As Ivan says, you cannot solve people problems with technology. You can hint in the right direction, and you can try to make things a little more sane, and a little less complex, but people cannot be fixed with technology. Given we cannot fix the people problem, would replacing BGP itself really help? Is there anything we could do to make things better?

To understand the answer to these questions, it is important to tear down a major misconception about BGP. The misconception?

BGP is a routing protocol in the same sense as OSPF, IS-IS, or EIGRP.

BGP was not designed to be a routing protocol in the way other protocols were. It was designed to provide a loop free path through a series of independently operated networks, each with its own policy and business goals. In the sense that BGP provides a loop free route to a destination, it provides routing. But the “routing” it provides is largely couched in terms of explicit, rather than implicit, policy (see the note below). Loop free routes are not always the “shortest” path in terms of hop count, or the “lowest cost” path in terms of delay, or the “best available” path in terms of bandwidth, or anything else. This is why BGP relies on the AS Path to prevent loops. We call things “metrics” in BGP in a loose way, but they are really explicit expressions of policy.

Consider this: the primary policies anyone cares about in interdomain routing are: where do I want this traffic to exit my AS, and where do I want this traffic to enter my AS? The Local Preference is an expression of where traffic to this particular destination should exit this AS. The Multiple Exit Discriminator (MED) is an expression of where this AS would like to receive traffic being forwarded to this destination. Everything other than these is just a tie breaker. All the rest of the stuff we do to try to influence the path of traffic into and out of an AS, like messing with the AS Path, are hacks. If you can get this pair of “things people really care about” into your head, the BGP bestpath process, and much of the routing that goes on in the DFZ, makes a lot more sense.
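A deliberately simplified sketch of this ordering, assuming hypothetical routes and AS numbers: local preference decides first, AS path length and MED follow, and the IGP cost acts only as a tie breaker. Real implementations run more steps than this (origin, eBGP over iBGP, router ID, and so on), and the MED is compared here only between routes learned from the same neighboring AS.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Route:
    via: str         # hypothetical exit point name
    local_pref: int  # where this AS wants traffic to exit
    as_path: tuple   # neighboring AS first
    med: int         # where the neighboring AS wants traffic to enter
    igp_cost: int    # tie breaker only


def better(a, b):
    """Return the preferred of two routes under a simplified decision process."""
    if a.local_pref != b.local_pref:
        return a if a.local_pref > b.local_pref else b
    if len(a.as_path) != len(b.as_path):
        return a if len(a.as_path) < len(b.as_path) else b
    # MED is only comparable between routes from the same neighboring AS.
    if a.as_path[:1] == b.as_path[:1] and a.med != b.med:
        return a if a.med < b.med else b
    return a if a.igp_cost <= b.igp_cost else b


def best_path(routes):
    return reduce(better, routes)


routes = [
    Route("exit-1", 100, (65001, 65010), 50, 10),
    Route("exit-2", 200, (65002, 65020, 65010), 0, 30),
    Route("exit-3", 100, (65001, 65010), 20, 40),
]

# Local preference wins even though exit-2 has the longest AS path
# and a higher IGP cost than exit-1.
print(best_path(routes).via)  # prints exit-2

# With equal local preference, the neighbor's MED decides before the IGP cost.
print(best_path([routes[0], routes[2]]).via)  # prints exit-3
```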

It really is that simple.

How does this relate to the problem of replacing BGP? There are several things you could improve about BGP, but automatic metrics are not one of them. There are, in fact, already “automatic metrics” in BGP, but “automatic metrics” like the IGP cost are tie breakers. A tie breaker is a convenient stand-in for what the protocol designer and/or implementor thinks the most natural policy should be. Whether or not they are right or wrong in a specific situation is a… guess.

What about something like the RPKI? The RPKI is not going to help in most situations where a human makes a mistake in a transit provider. It would help with transit edge failures and hijacks, but these are a different class of problem. You could ask for BGPsec to counter these problems, of course, but BGPsec would likely cause more problems than it solves (I’ve written on this before, here, here, here, here, and here, to start; you can find a lot more on rule11 by following this link).

Given replacing the metrics is not a possibility, and RPKI is only going to get you “so far,” what else can be done? There are, in fact, several practical steps that could be taken.

You could specify that BGP implementations should, by default, only advertise routes if there is some policy configured. Something like, say… RFC8212?
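The behavior RFC8212 asks for can be sketched in a few lines. This is an illustration of the idea, not of any implementation; the function name, the route strings, and the policy callable are hypothetical.

```python
def routes_to_advertise(routes, export_policy=None):
    """Return the routes to send to an eBGP peer.

    With no export policy configured, nothing is advertised.
    """
    if export_policy is None:
        return []
    return [route for route in routes if export_policy(route)]


routes = ["2001:db8:1::/48", "2001:db8:2::/48"]

# No policy configured: nothing leaves the AS by accident.
print(routes_to_advertise(routes))

# An explicit policy: only the first prefix is exported.
print(routes_to_advertise(routes, lambda r: r == "2001:db8:1::/48"))
```

The point of the default is that a forgotten policy fails closed rather than leaking a full table to a neighbor.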

Giving operators more information to understand what they are configuring (perhaps by cleaning up the Internet Routing Registries?) would also be helpful. Perhaps we could build a graph overlay on top of the Default Free Zone (DFZ) so a richer set of policies could be expressed, and policies could be better observed and understood (but you have to convince the transit providers that this would not harm their business before this could happen).

Maybe we could also stop trying to use BGP as the trash can of the Internet, throwing anything we don’t know what else to do with in there. We’ve somehow forgotten the old maxim that a protocol is not done until we have removed everything that is not needed. Now our mantra seems to be “the protocol isn’t done until it solves every problem anyone has ever thought of.” We just keep throwing junk at BGP as if it is the abominable snowman—we assume it’ll bounce when it hits bottom. Guess what: it’s not, and it won’t.

Replacing BGP is not realistic—nor even necessary. Maybe it is best to put it this way:

  • BGP expresses policy
  • Policy is messy
  • Therefore, BGP is messy

We definitely need to work towards building good engineers and good tools—but replacing BGP is not going to “solve” either of these problems.

P.S. I have differentiated between “metrics” and “policy” here—but metrics can be seen as an implicit form of policy. Choosing the highest bandwidth path is a policy. Choosing the path with the shortest hop count is a policy, too. The shortest path (for some meaning of “shortest”) will always be provably loop free, so it is a useful way to always choose a loop free path in the face of simple, uniform, policies. But BGP doesn’t live in the world of simple uniform policies; it lives in the world of “more than one metric.” BGP lives in a world where different policies not only overlap, but directly compete. Computing a path with more than one metric is provably at least bistable, and often completely unstable, no matter what those metrics are.
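A small illustration of bistability, under stated assumptions: this is the two-node arrangement often called DISAGREE in the stable paths literature, not something taken from a running network. Nodes 1 and 2 can each reach destination 0 directly, but each prefers to go through the other. Which of the two stable outcomes appears depends only on which node happens to decide first.

```python
# Each node ranks its permitted paths to destination 0, most preferred first.
prefs = {
    1: [(1, 2, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
}


def best(node, chosen):
    """Most preferred path that is consistent with the next hop's current choice."""
    for path in prefs[node]:
        next_hop = path[1]
        if next_hop == 0 or chosen.get(next_hop) == path[1:]:
            return path
    return None


def converge(order, rounds=10):
    """Let nodes choose in the given order until nothing changes."""
    chosen = {}
    for _ in range(rounds):
        before = dict(chosen)
        for node in order:
            chosen[node] = best(node, chosen)
        if chosen == before:
            return chosen
    return None  # did not settle within the round limit


print(converge([1, 2]))  # node 1 goes direct, node 2 goes through 1
print(converge([2, 1]))  # node 2 goes direct, node 1 goes through 2
```

Both results are stable and loop free, and neither node can be said to have chosen wrongly; the policies simply admit more than one answer.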

P.P.S. This article is a more humorous take on finding perfect people.

On the ‘web: What’s Wrong with BGP

Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.

You can jump to the original post on Packet Pushers here.

Optimal Route Reflection

There are—in theory—three ways BGP can be deployed within a single AS. You can deploy a full mesh of iBGP peers; this might be practical for a small’ish deployment (say less than 10), but it quickly becomes a management problem in larger, or constantly changing, deployments. You can deploy multiple BGP confederations; creating internal autonomous systems that are invisible to the world because the internal AS numbers are stripped at the real eBGP edge.

The third solution is (probably) the only solution anyone reading this has deployed in a production network: route reflectors. A quick review might be useful to set the stage.

In this diagram, B and E are connected to eBGP peers, each of which is advertising a different destination; F is advertising the 100::/64 prefix, and G is advertising the 101::/64 prefix. Assume A is the route reflector, and B, C, D, and E are route reflector clients. What happens when F advertises 100::/64 to B?

  • B receives the route and advertises it through iBGP to A
  • A adds its router ID to the cluster list, and reflects the route to C, D, and E
  • E receives this route and advertises it through its eBGP session towards G
  • C does not advertise 100::/64 towards D, because D is an iBGP peer (not configured as a route reflector)
  • D does not advertise 100::/64 towards C, because C is an iBGP peer (not configured as a route reflector)

Even if D did readvertise the route towards C, and C back towards A, A would reject the route because its router ID is in the cluster list. Although the improper use of route reflectors can get you into a lot of trouble, the usage depicted here is fairly simple. Here A will only have one path towards 100::/64, so it will only have one possible path across which to run the BGP bestpath calculation.
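The cluster list check can be sketched as follows, assuming a route is a plain dictionary and using A's name as its identifier in the cluster list. The dictionary layout and function name are hypothetical.

```python
def reflect(route, reflector_id):
    """Reflect a route, or return None if this reflector is already in the cluster list."""
    if reflector_id in route["cluster_list"]:
        return None  # the route has looped back to this reflector: reject it
    return {**route, "cluster_list": [reflector_id] + route["cluster_list"]}


# B advertises 100::/64 to A through iBGP.
from_b = {"prefix": "100::/64", "next_hop": "B", "cluster_list": []}

# A reflects the route to its other clients, adding itself to the cluster list.
reflected = reflect(from_b, "A")
print(reflected["cluster_list"])  # prints ['A']

# If the reflected route ever came back to A, A would reject it.
print(reflect(reflected, "A"))  # prints None
```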

The case of 101::/64 is a little different, however. The oddity here is the link metrics. In this network, A is going to receive two routes towards 101::/64, through D and E. Assuming all other things are equal (such as the local preference), A will choose the path to the speaker within the AS with the lowest IGP metric. Hence A will choose the path through E, advertising this route to B, C, and D. What if A were not a route reflector? If every router within the AS were part of an iBGP full mesh, what would happen? In this case:

  • B would receive two routes to 101::/64, one from D with an IGP metric of 30, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, B will choose the path through E to reach 101::/64.
  • C would receive two routes to 101::/64, one from D with an IGP metric of 10, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, C will choose the path through D to reach 101::/64.
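The two cases can be compared directly. In this sketch the IGP costs for B and C are the ones given above; the costs from A to D and E are assumptions, chosen only so that A prefers E as the text describes.

```python
igp_cost = {
    "A": {"D": 20, "E": 10},  # assumed: the text says only that A prefers E
    "B": {"D": 30, "E": 20},
    "C": {"D": 10, "E": 20},
}
exits = ["D", "E"]


def closest_exit(router):
    """The exit point with the lowest IGP cost from this router."""
    return min(exits, key=lambda e: igp_cost[router][e])


# Full mesh: every router sees both routes and picks its own closest exit.
full_mesh = {router: closest_exit(router) for router in ("B", "C")}

# Route reflection: A picks once, and the clients only ever see A's choice.
reflected_exit = closest_exit("A")
with_reflector = {router: reflected_exit for router in ("B", "C")}

print(full_mesh)       # prints {'B': 'E', 'C': 'D'}
print(with_reflector)  # prints {'B': 'E', 'C': 'E'}
```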

Inserting the route reflector, A, into the network does not change the best path to 101::/64 from the perspective of B, but it does change C’s best path from D to E. How can the shortest path be restored in the network? The State/Optimization/Surface (SOS) three way trade off tells us there are two possible solutions: either the state removed by the route reflector must be restored into BGP, or some interaction surface needs to be enabled between BGP and some other system in the network that has the information required to restore optimal routing.

The first of these two options, restoring the state removed through route reflection, is represented by two different solutions, one of which can be considered a subset of the other. The first solution is for the route reflector, A, to send all the routes to 101::/64 to every route reflector client. This is called add paths, and is documented in RFC7911. The problem with this solution is the amount of additional state.

A second option is to provide some set of paths beyond the best path to each client, but not the entire set of paths. This solution still attacks the suboptimal problem by adding state that was removed through the reflection process. In this case, however, rather than adding back all the state, a subset of state is added back. The state added back is normally the second best path, which is enough to provide enough information to re-optimize the network, but minimal enough to not overwhelm BGP.
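The trade between state and optimization can be seen in a short sketch. C's IGP costs come from the example above; the reflector's own costs are assumed, and the function names are hypothetical.

```python
rr_cost = {"D": 20, "E": 10}      # assumed IGP costs from the reflector, A
client_cost = {"D": 10, "E": 20}  # C's IGP costs, from the example


def reflect_best(n):
    """The paths the reflector sends: its own n best, ranked by its own IGP cost."""
    return sorted(rr_cost, key=rr_cost.get)[:n]


def client_choice(received):
    """The client runs its own comparison, but only over what it received."""
    return min(received, key=client_cost.get)


for n in (1, 2):
    sent = reflect_best(n)
    print(n, sent, client_choice(sent))
# prints:
# 1 ['E'] E
# 2 ['E', 'D'] D
```

With one path, C is stuck with the reflector's view; with two, C carries twice the state for this prefix and recovers its closest exit.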

What about the other option—allowing BGP to interact with some other system that has the information required to tell BGP specifically which state will allow the route reflector clients to compute the optimal path through the network? This third solution is described in BGP Optimal Route Reflection (BGP-ORR). To understand this solution, begin by asking: why does removing BGP advertisements from the control plane cause suboptimal routing? The answer to this question is: because the route reflector client does not have all the available routes, it cannot compare the IGP metric of every path in order to determine the shortest path.

In other words, C actually has two paths to 101::/64, one through A and another through D. If C knew about these two paths, it could compare the two IGP costs, through A and through D, and choose the closest exit point out of the AS. What other router in the network has all the relevant information? The route reflector—A. If a link state IGP is being used in this network, A can calculate the shortest path from C to both of the potential exit points, D and E. Further, because it is the route reflector, A knows about both of the routes to reach 101::/64. Hence, A can compute the best path as C would compute it, taking into account the IGP metric for both exit points, and send C the route it knows the BGP best path process on C will choose anyway. This is exactly what BGP Optimal Route Reflection (BGP-ORR) describes.
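As a rough illustration of the idea, the sketch below picks the closest exit twice: once using IGP costs as seen from the route reflector, and once using costs as seen from the client. The router names follow the text; the cost values, the igpCost type, and the closestExit function are assumptions made up for this example, and real BGP best path selection considers many more attributes than the IGP cost.

```go
package main

import "fmt"

// igpCost maps an exit router name to the IGP cost of reaching it from one
// particular vantage point. All values used below are hypothetical.
type igpCost map[string]int

// closestExit returns the exit with the lowest IGP cost from one vantage
// point. Ties are broken by name so the result is deterministic.
func closestExit(exits []string, cost igpCost) string {
	best := ""
	for _, e := range exits {
		c, ok := cost[e]
		if !ok {
			continue // this exit is not reachable from the vantage point
		}
		if best == "" || c < cost[best] || (c == cost[best] && e < best) {
			best = e
		}
	}
	return best
}

func main() {
	exits := []string{"D", "E"} // both exits advertise 101::/64

	fromA := igpCost{"D": 20, "E": 10} // the reflector's own view
	fromC := igpCost{"D": 5, "E": 15}  // computed by A from the LSDB, rooted at C

	fmt.Println("plain reflection sends C the exit:", closestExit(exits, fromA))
	fmt.Println("ORR sends C the exit:", closestExit(exits, fromC))
}
```

With these made-up costs, the reflector's own view selects E, while the view rooted at C selects D, which is the route an ORR-style reflector would send to C.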

Hopefully this short tour through BGP route reflection, the problem route reflection causes by removing state from the network, and the potential solutions, is useful in understanding the various drafts and solutions being proposed.

I2RS and Remote Triggered Black Holes

In our last post, we looked at how I2RS is useful for managing elephant flows on a data center fabric. In this post, I want to cover a use case for I2RS that is outside the data center, along the network edge—remote triggered black holes (RTBH). Rather than looking directly at the I2RS use case, however, it’s better to begin by looking at the process for creating, and triggering, RTBH using “plain” BGP. Assume we have the small network illustrated below—


In this network, we’d like to be able to trigger B and C to drop traffic sourced from 2001:db8:3e8:101::/64 inbound into our network (the cloudy part). To do this, we need a triggering router—we’ll use A—and some configuration on the two edge routers—B and C. We’ll assume B and C have up and running eBGP sessions to D and E, which are located in another AS. We’ll begin with the edge devices, as the configuration on these devices provides the setup for the trigger. On B and C, we must configure—

  • Unicast RPF; loose mode is okay. With loose RPF enabled, any packet sourced from an address whose route points to a null destination in the routing table will be dropped.
  • A route to some destination not used in the network pointing to null0. To make things a little simpler we’ll point a route to 2001:db8:1:1::1/64, a route that doesn’t exist anyplace in the network, to null0 on B and C.
  • A pretty normal BGP configuration.

The triggering device is a little more complex. On Router A, we need—

  • A route map that—
    • matches some tag in the routing table, say 101
    • sets the next hop of routes with this tag to 2001:db8:1:1::1/64
    • sets the local preference to some high number, say 200
  • redistribution of static routes into BGP, filtered through the route map as described.

With all of this in place, we can trigger a black hole for traffic sourced from 2001:db8:3e8:101::/64 by configuring a static route at A, the triggering router, that points at null0, and has a tag of 101. Configuring this static route will—

  • install a static route into the local routing table at A with a tag of 101
  • this static route will be redistributed into BGP
  • since the route has a tag of 101, it will have a local preference of 200 set, and the next hop set to 2001:db8:1:1::1/64
  • this route will be advertised via iBGP to B and C through normal BGP processing
  • when B receives this route, it will choose it as the best path to 2001:db8:3e8:101::/64, and install it in the local routing table
  • since the next hop on this route is set to 2001:db8:1:1::1/64, and 2001:db8:1:1::1/64 points to null0 as a next hop, uRPF will be triggered, dropping all traffic sourced from 2001:db8:3e8:101::/64 at the AS edge
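The effect of the steps above on an edge router can be sketched as a toy routing table with a loose uRPF check. This is a simplified model, not how any particular router implements uRPF: the route type, the lookup and looseRPFPass functions, the default route, and the eth0 interface name are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// route is a simplified RIB entry: it points either at an interface (such as
// null0) or at a next-hop address that needs a further lookup.
type route struct {
	prefix  netip.Prefix
	nextHop netip.Addr // zero value when the route points at an interface
	iface   string
}

// lookup returns the longest-prefix match for addr, if there is one.
func lookup(rib []route, addr netip.Addr) (route, bool) {
	var best route
	found := false
	for _, r := range rib {
		if r.prefix.Contains(addr) && (!found || r.prefix.Bits() > best.prefix.Bits()) {
			best, found = r, true
		}
	}
	return best, found
}

// looseRPFPass reports whether a packet from src passes loose uRPF in this
// model: the source needs a route, and that route must not resolve to null0.
func looseRPFPass(rib []route, src netip.Addr) bool {
	r, ok := lookup(rib, src)
	for depth := 0; ok && depth < 4; depth++ { // bound the recursive resolution
		if r.iface == "null0" {
			return false
		}
		if !r.nextHop.IsValid() {
			return true // resolved to a real interface
		}
		r, ok = lookup(rib, r.nextHop)
	}
	return false
}

// edgeRIB builds a hypothetical table for edge router B, before or after the
// trigger route from A has been installed.
func edgeRIB(triggered bool) []route {
	rib := []route{
		{prefix: netip.MustParsePrefix("2001:db8:1:1::/64"), iface: "null0"},
		{prefix: netip.MustParsePrefix("::/0"), iface: "eth0"}, // assumed default route
	}
	if triggered {
		rib = append(rib, route{
			prefix:  netip.MustParsePrefix("2001:db8:3e8:101::/64"),
			nextHop: netip.MustParseAddr("2001:db8:1:1::1"),
		})
	}
	return rib
}

// blockedSrc is a host inside the prefix being black holed.
var blockedSrc = netip.MustParseAddr("2001:db8:3e8:101::10")

func main() {
	fmt.Println("before trigger, packet passes:", looseRPFPass(edgeRIB(false), blockedSrc))
	fmt.Println("after trigger, packet passes:", looseRPFPass(edgeRIB(true), blockedSrc))
}
```

The only thing that changes between the two calls is the presence of the route learned from A; the null0 route and the uRPF check were already in place on the edge router.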

It’s possible to have regional, per neighbor, or other sorts of “scoped” black hole routes by using different routes pointing to null0 on the edge routers. These are “magic numbers,” of course—you must have a list someplace that tells you which route causes what sort of black hole event at your edge, etc.

Note—this is a terrific place to deploy a DevOps sort of solution. Instead of using an appliance sort of router for the triggering router, you could run a handy copy of Cumulus or snaproute in a VM, and build scripts that build the correct routes in BGP, including a small table in the script that allows you to say something like “black hole 2001:db8:3e8:101::/64 on all edges,” or “black hole 2001:db8:3e8:101::/64 on all peers facing provider X,” etc. This could greatly simplify the process of triggering RTBH.
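A minimal sketch of the "small table in the script" idea might look like the following. It only computes a description of the route the triggering router would originate; it does not talk to any real router or BGP implementation. The scope names, the second next-hop address, and the trigger type are hypothetical.

```go
package main

import (
	"fmt"
	"net/netip"
)

// scopeNextHop is the "magic number" table: each scope maps to the next hop
// that the matching set of edge routers has pointed at null0.
// The scope names and addresses here are made up for this example.
var scopeNextHop = map[string]netip.Addr{
	"all-edges":  netip.MustParseAddr("2001:db8:1:1::1"),
	"provider-x": netip.MustParseAddr("2001:db8:1:2::1"),
}

// trigger describes the route the triggering router should originate.
type trigger struct {
	Prefix    netip.Prefix
	NextHop   netip.Addr
	LocalPref uint32
}

// blackHole translates "black hole this prefix in this scope" into a trigger.
func blackHole(prefix, scope string) (trigger, error) {
	p, err := netip.ParsePrefix(prefix)
	if err != nil {
		return trigger{}, fmt.Errorf("bad prefix %q: %w", prefix, err)
	}
	nh, ok := scopeNextHop[scope]
	if !ok {
		return trigger{}, fmt.Errorf("unknown scope %q", scope)
	}
	// 200 matches the local preference used in the route map described above.
	return trigger{Prefix: p.Masked(), NextHop: nh, LocalPref: 200}, nil
}

func main() {
	tr, err := blackHole("2001:db8:3e8:101::/64", "all-edges")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("originate %v next-hop %v local-pref %d\n", tr.Prefix, tr.NextHop, tr.LocalPref)
}
```

Keeping the table in one place is the point: the list of magic numbers lives in the script rather than in someone's memory.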

Now, as a counter, we can look at how this might be triggered using I2RS. There are two possible solutions here. The first is to configure the edge routers as before, using “magic number” next hops pointed at the null0 interface to trigger loose uRPF. In this case, an I2RS controller can simply inject the correct route directly into the routing table of each edge eBGP speaker to block the traffic. There would only need to be one such route; the complexity of choosing which peers the traffic should be black holed on could be contained in a script at the controller, rather than dispersed throughout the entire network. This allows RTBH to be triggered on a per edge eBGP speaker basis with no additional configuration on any individual edge router.

Note the dynamic protocol isn’t being replaced in any way. We’re still receiving our primary routing information from BGP, including all the policies available in that protocol. What we’re doing, though, is removing one specific policy point out of BGP and moving it into a controller, where it can be more closely managed, and more easily automated. This is, of course, the entire point of I2RS—to augment, rather than replace, dynamic routing used as the control plane in a network.

Another option, for those devices that support it, is to inject a route that explicitly filters packets sourced from 2001:db8:3e8:101::/64 directly into the RIB using the filter based RIB model. This is a more direct method, if the edge devices support it.

Either way, the I2RS process is simpler than using BGP to trigger RTBH. It gathers as much of the policy as possible into one place, where it can be automated and managed in a more precise, fine grained way.

snaproute Go BGP Code Dive (10): Moving to Open Confirm

In the last post on this topic, we traced how snaproute’s BGP code moved to the open state. At the end of that post, the speaker encodes an open message using packet, _ := bgpOpenMsg.Encode(), and then sends it. What we should be expecting next is for an open message from the new peer to be received and processed. Receiving this open message will be an event, so what we’re going to need to look for is someplace in the code that processes the receipt of an open message. All the way back in the fifth post of this series, we actually unraveled this chain, and found this is the call chain we’re looking for—

  • func (st *OpenSentState) processEvent()
  • st.fsm.StopConnectRetryTimer()
  • bgpMsg := data.(*packet.BGPMessage)
  • if st.fsm.ProcessOpenMessage(bgpMsg) {
    • st.fsm.sendKeepAliveMessage()
    • st.fsm.StartHoldTimer()
    • st.fsm.ChangeState(NewOpenConfirmState(st.fsm)) }

I don’t want to retrace all those steps here, but the call to func (st *OpenSentState) processEvent() (around line 444 in fsm.go) looks correct. The call in question must be a call to a function that processes an event while the peer is in the open state. This call seems to satisfy both requirements. There is a large switch statement in this function; let’s see if we can sort out what a few of these do to get a general sense of what is in this switch.

  • case BGPEventManualStop: this covers the case where the operator manually deconfigures or otherwise stops the BGP process, or the formation of this specific peer
  • case BGPEventAutoStop: this covers the case where the BGP process is brought down for some automatically generated reason; for instance, this (probably) covers the case where the BGP process is shut down because the system itself is going down
  • case BGPEventHoldTimerExp: when the peer was moved into the open state, the hold timer was configured and started running; if the hold timer expires before an open message is received from the peer, then a notification is sent and the peer is pushed back to idle state
  • case BGPEventTcpConnFails: if the TCP socket reports that the connection has failed, the peer is cleared and set back to active state
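To get a feel for the shape of this switch, here is a heavily simplified sketch of an OpenSent event handler. It is not the snaproute code: the event and state names are stand-ins, the transitions for the two stop events follow RFC 4271 rather than anything traced in this post, and the failure branch and default branch are assumptions.

```go
package main

import "fmt"

// event is a stand-in for the BGP FSM events discussed above.
type event int

const (
	eventManualStop event = iota
	eventAutoStop
	eventHoldTimerExp
	eventTCPConnFails
	eventBGPOpen
	eventOther
)

// state is a stand-in for the BGP FSM states.
type state string

const (
	stateIdle        state = "idle"
	stateActive      state = "active"
	stateOpenSent    state = "opensent"
	stateOpenConfirm state = "openconfirm"
)

// openSentNext returns the state a peer in OpenSent moves to for an event.
// openOK says whether the received open message was acceptable; it only
// matters for eventBGPOpen.
func openSentNext(ev event, openOK bool) state {
	switch ev {
	case eventManualStop, eventAutoStop, eventHoldTimerExp:
		return stateIdle
	case eventTCPConnFails:
		return stateActive
	case eventBGPOpen:
		if openOK {
			return stateOpenConfirm
		}
		return stateIdle // assumption: a rejected open sends the peer back to idle
	default:
		return stateOpenSent // simplification: other events are ignored here
	}
}

func main() {
	fmt.Println("open accepted:", openSentNext(eventBGPOpen, true))
	fmt.Println("hold timer expired:", openSentNext(eventHoldTimerExp, false))
	fmt.Println("TCP connection failed:", openSentNext(eventTCPConnFails, false))
}
```

A real implementation also stops and starts timers, sends notifications, and tears down connections in each branch; the sketch keeps only the state transitions.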

The particular bit of code in this switch we’re interested in is—

case BGPEventBGPOpen:
  bgpMsg := data.(*packet.BGPMessage)
  if st.fsm.ProcessOpenMessage(bgpMsg) {

Well, this doesn’t look so bad, right? Just a few short lines of code. 🙂

st.fsm.StopConnectRetryTimer() is pretty obvious, so I won’t spend a lot of time here. The peer is now connected, so there’s no reason to keep running the timer that causes events when the timer expires.

bgpMsg := data.(*packet.BGPMessage) might not be so obvious at first. In order to reach this state, the local peer has received a packet of some type. The contents of that packet must somehow be processed to actually form the peering relationship. This line of code just creates a new variable called bgpMsg and assigns the received packet to this variable. The := operator is specific to Go, so it’s probably worth pausing for a second to explain.

Typing is a method a programming language uses to control memory usage, catch errors in the code during the compilation process, etc. If you define a new variable that is supposed to hold a whole number, or a number without a floating point component (the fractional part after the decimal point), and assign it the value 2, you might do something like this in C—

int a_number;
a_number = 2;

Go does things a little differently, placing the name of the variable before the type, like this—

var a_number int
a_number = 2

The first line is considered the variable declaration, while the second is the variable assignment. These are normally two separate steps. But in Go, there is a shortcut to this process. You can declare the variable and assign a value in one step, like this—

a_number := 2

How does the compiler know what kind or type of variable a_number is? By looking at the value assigned. In this case, the coder has declared a variable called bgpMsg, and assigned it the value of the contents of the open message just received in one step. The type assertion on the right-hand side, data.(*packet.BGPMessage), is what tells the compiler the type of bgpMsg.
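The following runnable sketch puts the two declaration forms next to a type assertion like the one in the snaproute code. The bgpMessage type is a hypothetical stand-in for packet.BGPMessage, and the variable is written aNumber because Go identifiers cannot contain hyphens. The sketch uses the two-value form of the assertion, which reports failure instead of panicking when the value has a different type.

```go
package main

import "fmt"

// bgpMessage is a hypothetical stand-in for packet.BGPMessage.
type bgpMessage struct {
	msgType uint8
}

// asBGPMessage uses the two-value ("comma ok") form of a type assertion, so a
// value of the wrong type yields ok == false rather than a panic.
func asBGPMessage(data interface{}) (*bgpMessage, bool) {
	msg, ok := data.(*bgpMessage)
	return msg, ok
}

func main() {
	var aNumber int // declaration
	aNumber = 2     // assignment

	another := 2 // declared and assigned in one step; the compiler infers int

	fmt.Printf("%T %T\n", aNumber, another)

	// The event handler receives its payload as an empty interface, so the
	// concrete type has to be asserted before the fields can be used.
	var data interface{} = &bgpMessage{msgType: 1} // 1 is the BGP OPEN type code
	if msg, ok := asBGPMessage(data); ok {
		fmt.Println("message type:", msg.msgType)
	}

	_, ok := asBGPMessage("not a message")
	fmt.Println("string accepted:", ok)
}
```

The single-value form used in the snaproute excerpt, data.(*packet.BGPMessage), panics if data holds anything else, which is a reasonable choice when the caller guarantees the type.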

Next time, we’ll look at how this information is actually processed. ’til then, happy coding.

On the ‘net: BGP—the most successful virus

This Weekly Show episode was recorded live at IETF 96 in Berlin in July 2016. Greg Ferro and several guests discuss the state of routing protocols such as BGP, and explore different approaches to routing, like Facebook’s Open/R initiative. They also debate issues around telemetry, network disaggregation, and whether enterprises should participate in the IETF to influence vendor product development.

Listen to the podcast over at Packet Pushers


snaproute Go BGP Code Dive (8): Moving to Open

Last week we left off with our BGP peer in connect state after looking through this code, around line 261 of fsm.go in snaproute’s Go BGP implementation—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
    case BGPEventConnRetryTimerExp:

What we want to do this week is pick up our BGP peering process, and figure out what the code does next. In this particular case, the next step in the process is fairly simple to find, because it’s just another case in the switch statement in (st *ConnectState) processEvent—

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:

This looks like the right place—we’re looking at events that occur while in the connect state, and the result seems to be sending an open message. Before we move down this path, however, I’d like to be certain I’m chasing the right call chain, or logical thread. How can I do this? This code is called when (st *ConnectState) processEvent is called with an event called BGPEventTcpCrAcked or BGPEventTcpConnConfirmed. Let’s chase down where these events might come from to see if this really is the next step in the call chain we’re trying to chase.

Note: Sometimes it’s easier to chase from the end result back towards the caller, and sometimes it’s not. There’s no way to know which is which until you have more experience in chasing through code. It takes time and practice to build these sorts of skills up, just like many other skills—but in chasing through code, you’re not only learning the protocols better, you’re also learning how to code better.

To find what we’re looking for, we can search through the project files for some instance of BGPEventTcpCrAcked, which seems to be the result of receiving an ACK for a TCP session initiated by BGP. We find a few places in fsm.go, as always, but most of them are using the event, rather than causing (or throwing) it—

272: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
371: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
475: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
592: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
709: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:

Until we get to this one—

case inConnCh := <-fsm.inConnCh:

What does this do? This is a little complex, but let’s try to work through it. When starting a new peer, a port was cloned on which to send TCP packets to the peer. Since the port is cloned to a port the main FSM function is watching—(fsm *FSM) StartFSM()—the main FSM function is going to be notified of any inbound TCP packets received on the local device. When one specific sort of packet is received, an acknowledgement in a new TCP session, the main FSM function is called, resulting in case inConnCh := <-fsm.inConnCh: being called. This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked.

If you followed that, you know this verifies what it looked like in the first place—the code above is, in fact, the correct code to process the next phase of peering. The call chain looks something like this—

  • (fsm *FSM) StartFSM() is watching the TCP ports for any new packets
  • When (fsm *FSM) StartFSM() receives a new TCP ACK, it falls through to case inConnCh := <-fsm.inConnCh: in the switch statement
  • This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked
  • (st *ConnectState) processEvent falls through to the case statement case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed, which then calls the correct functions to move beyond connect state
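The call chain above can be sketched with a toy FSM. In Go, a case that receives from a channel, such as case inConnCh := <-fsm.inConnCh:, lives inside a select statement running in its own goroutine. Everything in this sketch (the fsm struct, its fields, the event name, and runOnce) is a simplified assumption rather than the snaproute implementation.

```go
package main

import "fmt"

// fsm is a toy model of a per-peer state machine.
type fsm struct {
	state    string
	peer     string
	inConnCh chan string   // delivers newly established connections
	stopCh   chan struct{} // closed to stop the loop
	done     chan struct{} // closed when the loop has exited
}

// processEvent plays the role of (st *ConnectState) processEvent.
func (f *fsm) processEvent(ev string) {
	if f.state == "connect" && ev == "TcpCrAcked" {
		f.state = "opensent" // the real code also sends the open message here
	}
}

// start plays the role of StartFSM: it waits on channels and turns whatever
// arrives into events for the current state.
func (f *fsm) start() {
	defer close(f.done)
	for {
		select {
		case conn := <-f.inConnCh:
			f.peer = conn
			f.processEvent("TcpCrAcked")
		case <-f.stopCh:
			return
		}
	}
}

// runOnce delivers one connection to a peer in connect state and returns the
// state the peer ends up in.
func runOnce() string {
	f := &fsm{
		state:    "connect",
		inConnCh: make(chan string),
		stopCh:   make(chan struct{}),
		done:     make(chan struct{}),
	}
	go f.start()
	f.inConnCh <- "conn-to-peer" // unbuffered: returns once start has received it
	close(f.stopCh)
	<-f.done // wait for the loop to exit before reading the state
	return f.state
}

func main() {
	fmt.Println("state after connection:", runOnce())
}
```

The important part is the direction of control: nothing calls the connect state's handler directly; the loop turns a channel receive into an event, and the current state decides what that event means.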

It’s okay if you have to read all of that several times—FSMs (Finite State Machines—remember?) can be very difficult to follow. This means we need to chase down each of these functions to find out how this implementation of BGP actually moves beyond the connect state—

  • st.fsm.StopConnectRetryTimer()
  • st.fsm.SetPeerConn(data)
  • st.fsm.sendOpenMessage()
  • st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime, st.fsm.neighborConf.RunningConf.KeepaliveTime)
  • st.fsm.StartHoldTimer()
  • st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

It’s pretty obvious what StopConnectRetryTimer does—it stops BGP from continuing to try to connect to this peer. Since the peer has acknowledged the initial TCP packet, we shouldn’t keep trying to send it initial TCP packets. SetPeerConn is a bit harder—

func (fsm *FSM) SetPeerConn(data interface{}) {
  if fsm.peerConn != nil {
  pConnDir := data.(PeerConnDir)
  fsm.peerConn = NewPeerConn(fsm, pConnDir.connDir, pConnDir.conn)
  go fsm.peerConn.StartReading()

This just does some general logging (which I’ve removed for clarity), and then tells the main process (through the FSM call) to start reading packets off this new peer’s data structure. I’m not going to dive into these functions deeply here.

Next time, we’ll look at the four remaining functions, as these are where the action really is from a BGP perspective.

snaproute Go BGP Code Dive (7): Moving to Connect

In last week’s post, we looked at how snaproute’s implementation of BGP in Go moves into trying to connect to a new peer—we chased down the connectRetryTimer to see what it does, but we didn’t fully work through what the code does when actually moving to connect. To jump back into the code, this is where we stopped—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
    case BGPEventConnRetryTimerExp:

When the connectRetryTimer timer expires, it is not only restarted, but a new connection to the peer is attempted through st.fsm.InitiateConnToPeer(). This, then, is the next stop on the road to figuring out how this implementation of BGP brings up a peer. Before we get there, though, there’s an oddity here that needs to be addressed. If you look through the BGP FSM code, you will only find this call to initiate a connection to a peer in a few places. There is this call, and then one other call, here—

func (st *ConnectState) enter() {

The rest of the instances of InitiateConnToPeer() are related to the definition of the function. This raises the question: why wouldn’t you just call this function directly when moving to connect? In other words, why not call it directly, rather than by setting a timer and calling it when the timer wakes up? One of the prime points of coding coherently is to provide consistent entry and exit points into specific states. The more ways you can enter a state within an FSM, the more confusing the FSM gets, the easier it is to make mistakes when modifying the FSM, and the harder it is to troubleshoot problems with the FSM. If you can construct a code path that funnels every way to get into a single state through a single call, the code will ultimately be easier to understand and maintain.

Now let’s look at what st.fsm.InitiateConnToPeer() actually does—

func (fsm *FSM) InitiateConnToPeer() {
  if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast) {
    fsm.logger.Info("Unknown neighbor address")
  remote := net.JoinHostPort(fsm.pConf.NeighborAddress.String(), config.BGPPort)
  local := ""

  if strings.TrimSpace(fsm.pConf.UpdateSource) != "" {
    local = net.JoinHostPort(strings.TrimSpace(fsm.pConf.UpdateSource), "0") 
  if fsm.outTCPConn == nil {
    fsm.outTCPConn = NewOutTCPConn(fsm, fsm.outConnCh, fsm.outConnErrCh)
    go fsm.outTCPConn.ConnectToPeer(fsm.connectRetryTime, remote, local)

I’ve removed the logging code for clarity—I’ll be removing the logging code consistently throughout this series.

The first step is to determine if we have a valid, reachable peer IP address. This is taken care of by—

if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast)

If the neighbor address is the same as the IPv4 broadcast address (net.IPv4bcast, which Go’s net package defines as 255.255.255.255), then we don’t have a valid peer address. At this point, we just log the event and fail the attempt to connect to this peer. If we have a valid address to peer to, we need to build the addresses for the two ends of the TCP session. Remember that TCP is a stateful connection, and each end of the session is identified by an address and a port. This is why there are two calls to net.JoinHostPort, one for the remote end (the neighbor address and the BGP port) and one for the local end (the update source, if one is configured, with port 0, which lets the operating system pick the source port).

Now that we have the remote and local addresses, we can actually open a TCP connection (NewOutTCPConn) and then try to open the peering session (ConnectToPeer).

You can find the ConnectToPeer code in fsm/conn.go around line 175; the code is somewhat low level, so we won’t spend any time going through it here. Just taking a quick look shows that it essentially calls o.Connect, which then tries to open a new TCP session to the IP address specified.

Assuming this connection is actually opened, we have successfully moved the peer from idle to connect. We’ll tie up some loose ends in the next installment, and then consider the process of moving beyond connect state.

snaproute Go BGP Code Dive (2)

Now that you have a copy of BGP in Go on your machine—it’s tempting to jump right in to asking the code questions, but it’s important to get a sense of how things are structured. In essence, you need to build a mental map of how the functionality of the protocol you already know is related to specific files and structure, so you can start in the right place when finding out how things work. Interacting with an implementation from the initial stages of process bringup, building data structures, and the like isn’t all that profitable. Rather, asking questions of the code is an iterative/interactive process.

Take what you know, form an impression of where you might look to find the answer to a particular question, and, in the process of finding the answer, learn more about the protocol, which will help you find answers more quickly in the future.

So let’s poke around the directory structure a little, think about how BGP works, and think about what might be where. To begin, what might we call the basic functions of BGP? Let me take a shot at a list (if you see things you think should be on here, feel free to leave a comment—you might think of something I don’t, or we might have different ideas about what these should be, etc.):

  • Handle peering sessions
  • Receive updates
  • Run bestpath
  • Install routes into local tables
  • Install routes into the Routing Information Base (RIB)

Each of these can be broken down into a lot of other pieces and parts, but we don’t want to go too deep here for the moment—we’re really trying to guess how the basic functions of the protocol align with directories and files in the actual code. Essentially—if I want to know how this particular implementation of BGP handles peering, where would I look? Now, let’s glance at the actual contents of SnapRoute’s Go BGP implementation, and see what we can figure out—can we match any functions to directories?


Some of the things here I can guess at just from experience, like (note I’m not going to verify this stuff, and I might be wrong in some cases, but that’s okay, we’re just taking a first stab at figuring out where things might be)—

  • api—which means Application Programming Interface. Probably a set of files that declare function calls and the like into other applications.
  • flexswitch—since FlexSwitch is the actual name of the project, this probably contains files related to the overall routing engine SnapRoute is creating/maintaining. I would expect to find interfaces and interprocess communication to other processes in the same project, or something like that.
  • fsm—means Finite State Machine. A routing protocol can be described as a set of states, with specific events that cause the protocol to shift from one state to another. For instance, when a BGP peer shifts from active to idle, this is a state change. The FSM would be considered the “heart” of the protocol in many ways.
  • ovs—means Open vSwitch. This is probably interfaces to OVSDB, which allows this version of BGP to run within the OpenSwitch project.
  • rpc—means remote procedure call.

Another good place to look is in the /docs directory, which sometimes has useful information about how the code is structured. In this particular case, there is a diagram in the /docs directory that shows a basic overview of the code.


From this we can gather that the neighbor, FSM, and BGP RIB are considered three different modules in the code base. We can also infer there is an external database that holds the BGP tables and configuration, accessed through the Thrift RPC. The server module is interesting; we’ll have to watch for this as we start asking the code specific questions, to figure out what this might be used for. I’ll give you a hint up front, and say this is a pretty common structure for just about every piece of software that is driven by events.

That’s enough poking around for this post; we’ll look at some tools next, and then start into actually asking the code questions.

Getting to the point of dual homing

I wonder how many times I’ve seen this sort of diagram across the many years I’ve been doing network design?


It’s usually held up as an example of how clever the engineer running the network is about resilience. “You see,” the diagram asserts, “I’m smart enough to purchase connectivity from two providers, rather than one.”

Can I point something out? Admittedly it might not be all that obvious from the diagram, but… Reality is just about as likely to squish your network connectivity like a bug on a windshield as it is any other network. Particularly if both of these connections are in the same regional area. The tricky part is knowing, of course, what a “regional area” might happen to mean for any particular provider.

The problem with this design is very basic, and tied to the concept of shared risk link groups. But let me start someplace a little simpler than that—with the basic, and important, point that putting fiber in the ground, and maintaining fiber that’s in the ground, is expensive. Unless you live in Greenland, fiber can be physically buried pretty easily (fiber in Greenland is generally buried with dynamite by a blasting crew, or through conduit that’s bolted to the surface of the ubiquitous rock). But it’s not the burying that costs a lot of money—it’s the politics.

To bury a cable, you must get a right of way. Getting a right of way could well be very expensive in any given city. I remember encountering one particular situation where the land under consideration was owned, in theory, by a railroad. Well, it was close enough to an old station that it must have been. But it took several years of looking through old piles of paper to find the correct paper trail and figure out who, precisely, actually owned the land in a legally provable way. This is not a task for the faint of heart.

What has this to do with the image above? A lot, actually. It’s so expensive to install last mile fiber that providers often share this last mile. To explain, let’s look at a small picture, just below, that might be helpful.


This is the way many providers actually build their last mile. There is (normally a pair of) fiber ring(s), with a set of ROADMs at key locations in the region (ROADM actually means “randomly dropping all de traffic that matters,” but don’t tell anyone, it’s a secret). When a customer is connected to the network, they are assigned a lightwave on the fiber that carries their traffic, from the customer edge device, over a virtual layer 2 circuit (generally point-to-point, but not always), to a central office or exchange point. Here the different lightwaves are split up and handed to different providers through good old fashioned routing. One provider normally owns the fiber, and other providers lease wavelengths, or bandwidth, etc., to reach customers in the region.

Looking at this second image, you might be able to see what the problem is with the first. It’s possible—actually probable, in fact I’ve seen it happen in real life—that a single backhoe fade within the same region will take out both providers’ circuits at the same time.

The problem here isn’t really the lack of diversity. Rather, it’s that the lack of diversity is hidden through the magical abstraction of virtualization. Two logical circuits that share the same fate because they both run on the same physical media, by the way, are called a Shared Risk Link Group (SRLG). Providers aren’t likely to tell you when you’re at risk from this sort of problem for several reasons.

First, telling you who leased fiber from whom is bad business. Second, they may not actually know enough about their competitors to point this problem out. Third, it’s really in their business interest to try to convince you not to do this, but rather to buy all your upstream from them.
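If you could get the physical path information, checking for shared fate would be a simple set intersection. The sketch below assumes each logical circuit can be tagged with the physical resources it rides on; the circuit type, the provider names, and the resource labels are all hypothetical, and in practice obtaining this data is the hard part.

```go
package main

import "fmt"

// circuit lists the physical resources a logical circuit depends on.
type circuit struct {
	provider string
	risks    []string
}

// sharedRisks returns the physical resources two circuits have in common.
// A non-empty result means the circuits belong to a shared risk link group.
func sharedRisks(a, b circuit) []string {
	seen := make(map[string]bool, len(a.risks))
	for _, r := range a.risks {
		seen[r] = true
	}
	var shared []string
	for _, r := range b.risks {
		if seen[r] {
			shared = append(shared, r)
			seen[r] = false // report each shared resource once
		}
	}
	return shared
}

func main() {
	first := circuit{provider: "provider-1", risks: []string{"metro-ring-1", "co-east"}}
	second := circuit{provider: "provider-2", risks: []string{"metro-ring-1", "co-west"}}

	if shared := sharedRisks(first, second); len(shared) > 0 {
		fmt.Println("not diverse, shared:", shared)
	} else {
		fmt.Println("physically diverse")
	}
}
```

Two circuits bought from two different providers still fail together here, because both ride the same metro ring.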

So—what can you do about this?

If you’re going to connect to two providers, try to do so in two different regions. This is often difficult, as you don’t really know where the regions are, and connecting two sites that provide backup for one another across multiple regional rings can be a challenge for geographical reasons.

One alternative here is to connect to a local exchange point (an IXP), and from their fabric to the various providers. While the IXP will likely lease their circuits from others, they will have a much better idea of where the cables physically run, and how to provide diverse circuits (but only if you know what you’re asking for).

Another alternative is to simply stick with a single provider, and insist on physical diversity in any resilient links. This plays into the provider’s hand of trying to get you to buy from a single source, but it gets around the problem of trying to figure out what cable is where, and who uses what (information you’re not generally going to be able to find anyway), and puts it on the shoulders of the provider—who does know, at least for their network.

The next time you think you’ve solved the resilience problem by quickly and easily dual homing, remember shared risk, and remember to look for the deeper problem that’s been hidden away through an abstraction—an abstraction that far too often is leaky.

When prepend fails, what next? (3)

We began this short series with a simple problem—what do you do if your inbound traffic across two Internet facing links is imbalanced? In other words, how do you do BGP load balancing? The first post looked at problems with AS Path prepend, while the second looked at de-aggregating and using communities to modify the local preference within the upstream provider’s network.

There is one specific solution I want to discuss a bit more before I end this little series: de-aggregation. Advertising longer prefixes is the “big hammer” of routing; you should always be careful when advertising more specifics. The Default Free Zone (DFZ) is much like the “commons” of an old village. No-one actually “owns” the routing table in the global Internet, but everyone benefits from it. De-aggregating doesn’t really cost you anything, but it does cost everyone else something. It’s easy enough to inject another route into the routing table, but remember the longer prefix you inject shows up everywhere in the world. You’re fixing your problem by taking up some small amount of memory in every router that’s connected to the DFZ in the world. If everyone de-aggregates, everyone has to buy larger routers and more memory. Including you.

There is a fine line between using a commonly held resource and abusing a commonly held resource. If everyone abuses the commons because it “does not cost them anything,” what results is the tragedy of the commons. Once a set of commons are ruined, it’s very difficult to recover the original intent and trust relationships that caused the commons in the first place. So before you de-aggregate, you should think about whether or not it is really necessary.

Is this really necessary? Does it really matter if your two inbound links are not balanced? There may be financial reasons why it does matter, such as the costs of the two links, or the cost of bursting over a set level on one of the two links. These are certainly considerations, but it might make more sense to modify the sizing of the available links rather than putting a technical solution in place that will need to be managed and maintained.

Remember everything you configure will eventually break, and everything that breaks results in a call at 2AM. Think through the options you have available before putting an optimization in place.

Are there ways I can limit the damage to the commons?


Returning to our original network, is it possible to de-aggregate in a way that pulls traffic from AS65001 into AS65004, but doesn’t impact the table size of anyone these two providers are connected to? Most providers do, in fact, allow you to not only send a community to set the local preference within their AS, but also to block the advertisement of any particular route to their peers. You might need to play around with these communities a bit to understand the relationship between the community and inbound traffic flow; for instance, what impact will blocking the advertisement of a more specific to the transit peers of one upstream have, versus blocking the route to some set of customers connected to the provider? There is no way for you to directly know how and where the provider is connected, so you can either work directly with the provider to sort out what to advertise where while reducing your global impact, or you might just need to play around with different combinations to see what works best.

Is my peering the right peering? Another option is to think through who you are peering with. Assume, for a moment, that you are peering with one more regional provider, and one more global provider. In this case, your customer base is going to play a large role in which provider sends you more traffic.

For instance, if you are a regional bank, or health care provider, most of your customers are going to be connected to a regional provider (rather than a tier 1), and hence you are likely to receive most of your traffic on the regional provider’s link. If, however, your business is more global, the regional provider is not going to send you a lot of traffic—mostly just people who happen to be accessing your network from within your region. In this case, the imbalance between the two inbound links should be expected.

An observation: if this is so, maybe it is better to peer with two providers that will bring you closer to your customers. If your customers are global, maybe it’s better to peer with two providers at the national or global level, rather than one global and one regional—and the other way around. Perhaps it is better to balance your inbound traffic by carefully considering who your customers are, and how to best reach them, than it is to try and play engineering tricks to draw equal amounts of traffic over the networks of two completely different kinds of providers.

The bottom line is this: the engineering solution is the last solution you should reach for. I know—we are all engineers here, and there’s nothing quite like getting under a heavy load and solving it with a nice, long set of configuration commands that make you feel like you spent your money well in buying that big hunk of iron racked up in the demarc.

But real engineering begins when you ask the background questions, and really understand the problem.