Do We Really Need a New BGP?

From time to time, I run across (yet another) article about why BGP is so bad, and how it needs to be replaced. This one, for instance, is a recent example.

cross posted at APNIC and CircleID

It seems the easiest way to solve this problem would be finding new people—ones who don’t make mistakes—to work on BGP configuration, build IRR databases, and decide what should be included in BGP. Ivan points out how hopeless a situation this is going to be, however. As Ivan says, you cannot solve people problems with technology. You can hint in the right direction, and you can try to make things a little more sane, and a little less complex, but people cannot be fixed with technology. Given we cannot fix the people problem, would replacing BGP itself really help? Is there anything we could do to make things better?

To understand the answer to these questions, it is important to tear down a major misconception about BGP. The misconception?

BGP is a routing protocol in the same sense as OSPF, IS-IS, or EIGRP.

BGP was not designed to be a routing protocol in the way other protocols were. It was designed to provide a loop free path through a series of independently operated networks, each with its own policy and business goals. In the sense that BGP provides a loop free route to a destination, it provides routing. But the “routing” it provides is largely couched in terms of explicit, rather than implicit, policy (see the note below). Loop free routes are not always the “shortest” path in terms of hop count, or the “lowest cost” path in terms of delay, or the “best available” path in terms of bandwidth, or anything else. This is why BGP relies on the AS Path to prevent loops. We call things “metrics” in BGP in a loose way, but they are really explicit expressions of policy.

Consider this: the primary policies anyone cares about in interdomain routing are: where do I want this traffic to exit my AS, and where do I want this traffic to enter my AS? The Local Preference is an expression of where traffic to this particular destination should exit this AS. The Multi-Exit Discriminator (MED) is an expression of where this AS would like to receive traffic being forwarded to this destination. Everything other than these two is just a tie breaker. All the rest of the stuff we do to try to influence the path of traffic into and out of an AS, like messing with the AS Path, are hacks. If you can get this pair of “things people really care about” into your head, the BGP bestpath process, and much of the routing that goes on in the DFZ, makes a lot more sense.

It really is that simple.

How does this relate to the problem of replacing BGP? There are several things you could improve about BGP, but automatic metrics are not one of them. There are, in fact, already “automatic metrics” in BGP, but “automatic metrics” like the IGP cost are tie breakers. A tie breaker is a convenient stand-in for what the protocol designer and/or implementor thinks the most natural policy should be. Whether or not they are right or wrong in a specific situation is a… guess.

What about something like the RPKI? The RPKI is not going to help in most situations where a human makes a mistake in a transit provider. It would help with transit edge failures and hijacks, but these are a different class of problem. You could ask for BGPsec to counter these problems, of course, but BGPsec would likely cause more problems than it solves (I’ve written on this before, here, here, here, here, and here, to start; you can find a lot more on rule11 by following this link).

Given replacing the metrics is not a possibility, and RPKI is only going to get you “so far,” what else can be done? There are, in fact, several practical steps that could be taken.

You could specify that BGP implementations should, by default, only advertise routes if there is some policy configured. Something like, say… RFC8212?

Giving operators more information to understand what they are configuring (perhaps by cleaning up the Internet Routing Registries?) would also be helpful. Perhaps we could build a graph overlay on top of the Default Free Zone (DFZ) so a richer set of policies could be expressed, and policies could be better observed and understood (but you have to convince the transit providers that this would not harm their business before this could happen).

Maybe we could also stop trying to use BGP as the trash can of the Internet, throwing anything we don’t know what else to do with in there. We’ve somehow forgotten the old maxim that a protocol is not done until we have removed everything that is not needed. Now our mantra seems to be “the protocol isn’t done until it solves every problem anyone has ever thought of.” We just keep throwing junk at BGP as if it is the abominable snowman—we assume it’ll bounce when it hits bottom. Guess what: it’s not, and it won’t.

Replacing BGP is not realistic—nor even necessary. Maybe it is best to put it this way:

  • BGP expresses policy
  • Policy is messy
  • Therefore, BGP is messy

We definitely need to work towards building good engineers and good tools—but replacing BGP is not going to “solve” either of these problems.

P.S. I have differentiated between “metrics” and “policy” here—but metrics can be seen as an implicit form of policy. Choosing the highest bandwidth path is a policy. Choosing the path with the shortest hop count is a policy, too. The shortest path (for some meaning of “shortest”) will always be provably loop free, so it is a useful way to always choose a loop free path in the face of simple, uniform, policies. But BGP doesn’t live in the world of simple uniform policies; it lives in the world of “more than one metric.” BGP lives in a world where different policies not only overlap, but directly compete. Computing a path with more than one metric is provably at least bistable, and often completely unstable, no matter what those metrics are.

P.P.S. This article is a more humorous take on finding perfect people.

On the ‘web: What’s Wrong with BGP

Our guests are Russ White, a network architect at LinkedIn; and Sue Hares, a consultant and chair of the Inter-Domain Routing Working Group at the IETF. They discuss the history of BGP, the original problems it was intended to solve, and what might change. This is an informed and wide-ranging conversation that also covers whitebox, software quality, and more. Thanks to Huawei, which covered travel and accommodations to enable the Packet Pushers to attend IETF 99 and record some shows to spread the news about IETF projects and initiatives.

You can jump to the original post on Packet Pushers here.

Optimal Route Reflection

There are—in theory—three ways BGP can be deployed within a single AS. You can deploy a full mesh of iBGP peers; this might be practical for a small’ish deployment (say less than ten routers), but it quickly becomes a management problem in larger, or constantly changing, deployments. You can deploy BGP confederations, creating internal autonomous systems that are invisible to the world because the internal AS numbers are stripped at the true eBGP edge.

The third solution is (probably) the only solution anyone reading this has deployed in a production network: route reflectors. A quick review might be useful to set the stage.

In this diagram, B and E are connected to eBGP peers, each of which is advertising a different destination; F is advertising the 100::/64 prefix, and G is advertising the 101::/64 prefix. Assume A is the route reflector, and B, C, D, and E are route reflector clients. What happens when F advertises 100::/64 to B?

  • B receives the route and advertises it through iBGP to A
  • A adds its router ID to the cluster list, and reflects the route to C, D, and E
  • E receives this route and advertises it through its eBGP session towards G
  • C does not advertise 100::/64 towards D, because D is an iBGP peer (not configured as a route reflector)
  • D does not advertise 100::/64 towards C, because C is an iBGP peer (not configured as a route reflector)

Even if D did readvertise the route towards C, and C back towards A, A would reject the route because its router ID is in the cluster list. Although the improper use of route reflectors can get you into a lot of trouble, the usage depicted here is fairly simple. Here A will only have one path towards 100::/64, so it will only have one possible path across which to run the BGP bestpath calculation.

The case of 101::/64 is a little different, however. The oddity here is the link metrics. In this network, A is going to receive two routes towards 101::/64, through D and E. Assuming all other things are equal (such as the local preference), A will choose the path to the speaker within the AS with the lowest IGP metric. Hence A will choose the path through E, advertising this route to B, C, and D. What if A were not a route reflector? If every router within the AS were part of an iBGP full mesh, what would happen? In this case:

  • B would receive two routes to 101::/64, one from D with an IGP metric of 30, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, B will choose the path through E to reach 101::/64.
  • C would receive two routes to 101::/64, one from D with an IGP metric of 10, and a second from E with an IGP metric of 20. Assuming all other path attributes are equal, C will choose the path through D to reach 101::/64.

Inserting the route reflector, A, into the network does not change the best path to 101::/64 from the perspective of B, but it does change C’s best path from D to E. How can the shortest path be restored in the network? The State/Optimization/Surface (SOS) three way trade off tells us there are two possible solutions—either the state removed by the route reflector must be restored into BGP, or some interaction surface needs to be enabled between BGP and some other system in the network that has the information required to restore optimal routing.

The first of these two options, restoring the state removed through route reflection, is represented by two different solutions, one of which can be considered a subset of the other. The first solution is for the route reflector, A, to send all the routes to 101::/64 to every route reflector client. This is called add paths, and is documented in RFC7911. The problem with this solution is the amount of additional state.

A second option is to provide some set of paths beyond the best path to each client, but not the entire set of paths. This solution still attacks the suboptimal problem by adding state that was removed through the reflection process. In this case, however, rather than adding back all the state, a subset of state is added back. The state added back is normally the second best path, which is enough to provide enough information to re-optimize the network, but minimal enough to not overwhelm BGP.

What about the other option—allowing BGP to interact with some other system that has the information required to tell BGP specifically which state will allow the route reflector clients to compute the optimal path through the network? This third solution is described in BGP Optimal Route Reflection (BGP-ORR). To understand this solution, begin by asking: why does removing BGP advertisements from the control plane cause suboptimal routing? The answer to this question is: because the route reflector client does not have all the available routes, it cannot compare the IGP metric of every path in order to determine the shortest path.

In other words, C actually has two paths to 101::/64, one through A and another through D. If C knew about these two paths, it could compare the two IGP costs, through A and through D, and choose the closest exit point out of the AS. What other router in the network has all the relevant information? The route reflector—A. If a link state IGP is being used in this network, A can calculate the shortest path from C to both of the potential exit points, D and E. Further, because it is the route reflector, A knows about both of the routes to reach 101::/64. Hence, A can compute the best path as C would compute it, taking into account the IGP metric for both exit points, and send C the route it knows the BGP best path process on C will choose anyway. This is exactly what BGP Optimal Route Reflection (BGP-ORR) describes.

Hopefully this short tour through BGP route reflection, the problem route reflection causes by removing state from the network, and the potential solutions, is useful in understanding the various drafts and solutions being proposed.

I2RS and Remote Triggered Black Holes

In our last post, we looked at how I2RS is useful for managing elephant flows on a data center fabric. In this post, I want to cover a use case for I2RS that is outside the data center, along the network edge—remote triggered black holes (RTBH). Rather than looking directly at the I2RS use case, however, it’s better to begin by looking at the process for creating, and triggering, RTBH using “plain” BGP. Assume we have the small network illustrated below—


In this network, we’d like to be able to trigger B and C to drop traffic sourced from 2001:db8:3e8:101::/64 inbound into our network (the cloudy part). To do this, we need a triggering router—we’ll use A—and some configuration on the two edge routers—B and C. We’ll assume B and C have up and running eBGP sessions to D and E, which are located in another AS. We’ll begin with the edge devices, as the configuration on these devices provides the setup for the trigger. On B and C, we must configure—

  • Unicast RPF; loose mode is okay. With loose uRPF enabled, any packet sourced from an address whose route points to a null destination in the routing table will be dropped.
  • A route to some destination not used in the network pointing to null0. To make things a little simpler, we’ll point a route for 2001:db8:1:1::1/64, a destination that doesn’t exist anyplace in the network, at null0 on B and C.
  • A pretty normal BGP configuration.

The triggering device is a little more complex. On Router A, we need—

  • A route map that—
    • matches some tag in the routing table, say 101
    • sets the next hop of routes with this tag to 2001:db8:1:1::1/64
    • sets the local preference to some high number, say 200
  • redistribute from static routes into BGP filtered through the route map as described.

With all of this in place, we can trigger a black hole for traffic sourced from 2001:db8:3e8:101::/64 by configuring a static route at A, the triggering router, that points at null0, and has a tag of 101. Configuring this static route will—

  • install a static route into the local routing table at A with a tag of 101
  • this static route will be redistributed into BGP
  • since the route has a tag of 101, it will have a local preference of 200 set, and the next hop set to 2001:db8:1:1::1/64
  • this route will be advertised via iBGP to B and C through normal BGP processing
  • when B receives this route, it will choose it as the best path to 2001:db8:3e8:101::/64, and install it in the local routing table
  • since the next hop on this route is set to 2001:db8:1:1::1/64, and 2001:db8:1:1::1/64 points to null0 as a next hop, uRPF will be triggered, dropping all traffic sourced from 2001:db8:3e8:101::/64 at the AS edge

It’s possible to have regional, per neighbor, or other sorts of “scoped” black hole routes by using different routes pointing to null0 on the edge routers. These are “magic numbers,” of course—you must have a list someplace that tells you which route causes what sort of black hole event at your edge, etc.

Note—this is a terrific place to deploy a DevOps sort of solution. Instead of using an appliance sort of router for the triggering router, you could run a handy copy of Cumulus or snaproute in a VM, and build scripts that build the correct routes in BGP, including a small table in the script that allows you to say something like “black hole 2001:db8:3e8:101::/64 on all edges,” or “black hole 2001:db8:3e8:101::/64 on all peers facing provider X,” etc. This could greatly simplify the process of triggering RTBH.

Now, as a counter, we can look at how this might be triggered using I2RS. There are two possible solutions here. The first is to configure the edge routers as before, using “magic number” next hops pointed at the null0 interface to trigger loose uRPF. In this case, an I2RS controller can simply inject the correct route at each edge eBGP speaker to block the traffic directly into the routing table at each device. There would only need to be one such route; the complexity of choosing which peers the traffic should be black holed on could be contained in a script at the controller, rather than dispersed throughout the entire network. This allows RTBH to be triggered on a per edge eBGP speaker basis with no additional configuration on any individual edge router.

Note the dynamic protocol isn’t being replaced in any way. We’re still receiving our primary routing information from BGP, including all the policies available in that protocol. What we’re doing, though, is removing one specific policy point out of BGP and moving it into a controller, where it can be more closely managed, and more easily automated. This is, of course, the entire point of I2RS—to augment, rather than replace, dynamic routing used as the control plane in a network.

Another option, for those devices that support it, is to inject a route that explicitly filters packets sourced from 2001:db8:3e8:101::/64 directly into the RIB using the filter based RIB model. This is a more direct method, if the edge devices support it.

Either way, the I2RS process is simpler than using BGP to trigger RTBH. It gathers as much of the policy as possible into one place, where it can be automated and managed in a more precise, fine grained way.

snaproute Go BGP Code Dive (10): Moving to Open Confirm

In the last post on this topic, we traced how snaproute’s BGP code moved to the open state. At the end of that post, the speaker encodes an open message using packet, _ := bgpOpenMsg.Encode(), and then sends it. What we should be expecting next is for an open message from the new peer to be received and processed. Receiving this open message will be an event, so what we’re going to need to look for is someplace in the code that processes the receipt of an open message. All the way back in the fifth post of this series, we actually unraveled this chain, and found this is the call chain we’re looking for—

  • func (st *OpenSentState) processEvent()
  • st.fsm.StopConnectRetryTimer()
  • bgpMsg := data.(*packet.BGPMessage)
  • if st.fsm.ProcessOpenMessage(bgpMsg) {
    • st.fsm.sendKeepAliveMessage()
    • st.fsm.StartHoldTimer()
    • st.fsm.ChangeState(NewOpenConfirmState(st.fsm)) }

I don’t want to retrace all those steps here, but the call to func (st *OpenSentState) processEvent() (around line 444 in fsm.go) looks correct. The call in question must be a call to a function that processes an event while the peer is in the open state. This call seems to satisfy both requirements. There is a large switch statement in this function; let’s see if we can sort out what a few of these do to get a general sense of what is in this switch.

  • case BGPEventManualStop: this covers the case where the operator manually deconfigures or otherwise stops the BGP process, or the formation of this specific peer
  • case BGPEventAutoStop: this covers the case where the BGP process is brought down for some automatically generated reason; for instance, this (probably) covers the case where the BGP process is shut down because the system itself is going down
  • case BGPEventHoldTimerExp: when the peer was moved into the open state, the hold timer was configured and started running; if the hold timer expires before an open message is received from the peer, then a notification is sent and the peer is pushed back to idle state
  • case BGPEventTcpConnFails: if the TCP socket reports that the connection has failed, the peer is cleared and set back to active state

The particular bit of code in this switch we’re interested in is—

case BGPEventBGPOpen:
  bgpMsg := data.(*packet.BGPMessage)
  if st.fsm.ProcessOpenMessage(bgpMsg) {

Well, this doesn’t look so bad, right? Just a few short lines of code. 🙂

st.fsm.StopConnectRetryTimer() is pretty obvious, so I won’t spend a lot of time here. The peer is now connected, so there’s no reason to keep running the timer that causes events when the timer expires.

bgpMsg := data.(*packet.BGPMessage) might not be so obvious at first. In order to reach this state, the local peer has received a packet of some type. The contents of that packet must somehow be processed to actually form the peering relationship. This line of code just creates a new variable called bgpMsg and assigns the received packet to this variable. The := operator is specific to go, so it’s probably worth pausing for a second to explain.

Typing is a method a programming language uses to control memory usage, catch errors in the code during the compilation process, etc. If you define a new variable that is supposed to hold a whole number, or a number without a floating point component (the fractional part after the decimal point), and assign it the value 2, you might do something like this in C—

int a_number;
a_number = 2;

go does things a little differently, placing the name of the variable before the type, like this—

var a_number int
a_number = 2

The first line is considered the variable declaration, while the second is the variable assignment. These are normally two separate steps. But in go, there is a shortcut to this process. You can declare the variable and assign a value in one step, like this—

a_number := 2

How does the compiler know what kind or type of variable a_number is? By looking at the value assigned. Back in the BGP code, this is what is happening: the coder has declared a variable called bgpMsg, and assigned it the value of the contents of the open message just received, in one step.

Next time, we’ll look at how this information is actually processed. ’til then, happy coding.

On the ‘net: BGP—the most successful virus

This Weekly Show episode was recorded live at IETF 96 in Berlin in July 2016. Greg Ferro and several guests discuss the state of routing protocols such as BGP, and explore different approaches to routing, like Facebook’s Open/R initiative. They also debate issues around telemetry, network disaggregation, and whether enterprises should participate in the IETF to influence vendor product development.

Listen to the podcast over at Packet Pushers


snaproute Go BGP Code Dive (8): Moving to Open

Last week we left off with our BGP peer in connect state after looking through what this code, around line 261 of fsm.go in snaproute’s Go BGP implementation—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
    case BGPEventConnRetryTimerExp:

What we want to do this week is pick up our BGP peering process, and figure out what the code does next. In this particular case, the next step in the process is fairly simple to find, because it’s just another case in the switch statement in (st *ConnectState) processEvent

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:

This looks like the right place—we’re looking at events that occur while in the connect state, and the result seems to be sending an open message. Before we move down this path, however, I’d like to be certain I’m chasing the right call chain, or logical thread. How can I do this? This code is called when (st *ConnectState) processEvent is called with an event called BGPEventTcpCrAcked or BGPEventTcpConnConfirmed. Let’s chase down where these events might come from to see if this really is the next step in the call chain we’re trying to chase.

Note: Sometimes it’s easier to chase from the end result back towards the caller, and sometimes it’s not. There’s no way to know which is which until you have more experience in chasing through code. It takes time and practice to build these sorts of skills up, just like many other skills—but in chasing through code, you’re not only learning the protocols better, you’re also learning how to code better.

To find what we’re looking for, we can search through the project files for some instance of BGPEventTcpCrAcked, which seems to be the result of receiving an ACK for a TCP session initiated by BGP. We find a few places in fsm.go, as always, but most of them are using the event, rather than causing (or throwing) it—

272: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
371: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
475: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
592: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
709: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:

Until we get to this one—

case inConnCh := <-fsm.inConnCh:

What does this do? This is a little complex, but let’s try to work through it. When starting a new peer, a port was cloned on which to send TCP packets to the peer. Since the port is cloned to a port the main FSM function is watching—(fsm *FSM) StartFSM()—the main FSM function is going to be notified of any inbound TCP packets received on the local device. When one specific sort of packet is received, an acknowledgement in a new TCP session, the main FSM function is called, resulting in case inConnCh := <-fsm.inConnCh: being called. This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked.

If you followed that, you know this verifies what it looked like in the first place—the code above is, in fact, the correct code to process the next phase of peering. The call chain looks something like this—

  • (fsm *FSM) StartFSM() is watching the TCP ports for any new packets
  • When (fsm *FSM) StartFSM() receives a new TCP ACK, it falls through to case inConnCh := <-fsm.inConnCh: in the switch statement
  • This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked
  • (st *ConnectState) processEvent falls through to the case statement case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed, which then calls the correct functions to move beyond connect state

It’s okay if you have to read all of that several times—FSMs (Finite State Machines—remember?) can be very difficult to follow. This means we need to chase down each of these functions to find out how this implementation of BGP actually moves beyond the connect state—

  • st.fsm.StopConnectRetryTimer()
  • st.fsm.SetPeerConn(data)
  • st.fsm.sendOpenMessage()
  • st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime, st.fsm.neighborConf.RunningConf.KeepaliveTime)
  • st.fsm.StartHoldTimer()
  • st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

It’s pretty obvious what StopConnectRetryTimer does—it stops BGP from continuing to try to connect to this peer. Since the peer has acknowledged the initial TCP packet, we shouldn’t keep trying to send it initial TCP packets. SetPeerConn is a bit harder—

func (fsm *FSM) SetPeerConn(data interface{}) {
  if fsm.peerConn != nil {
    return // a peer connection already exists
  }
  pConnDir := data.(PeerConnDir)
  fsm.peerConn = NewPeerConn(fsm, pConnDir.connDir, pConnDir.conn)
  go fsm.peerConn.StartReading()
}

This just does some general logging (which I’ve removed for clarity), and then tells the main process (through the FSM call) to start reading packets off this new peer’s data structure. I’m not going to dive into these functions deeply here.

Next time, we’ll look at the four remaining functions, as these are where the action really is from a BGP perspective.

snaproute Go BGP Code Dive (7): Moving to Connect

In last week’s post, we looked at how snaproute’s implementation of BGP in Go moves into trying to connect to a new peer—we chased down the connectRetryTimer to see what it does, but we didn’t fully work through what the code does when actually moving to connect. To jump back into the code, this is where we stopped—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
    case BGPEventConnRetryTimerExp:

When the connectRetryTimer timer expires, it is not only restarted, but a new connection to the peer is attempted through st.fsm.InitiateConnToPeer(). This, then, is the next stop on the road to figuring out how this implementation of BGP brings up a peer. Before we get there, though, there’s an oddity here that needs to be addressed. If you look through the BGP FSM code, you will only find this call to initiate a connection to a peer in a few places. There is this call, and then one other call, here—

func (st *ConnectState) enter() {

The rest of the instances of InitiateConnToPeer() are related to the definition of the function. This raises the question: why wouldn’t you just call this function directly when moving to connect, rather than by setting a timer and calling it when the timer wakes up? One of the prime points of coding coherently is to provide consistent entry and exit points into specific states. The more ways you can enter a state within an FSM, the more confusing the FSM gets, the easier it is to make mistakes when modifying the FSM, and the harder it is to troubleshoot problems with the FSM. If you can construct a code path that funnels every way to get into a single state through a single call, the code will ultimately be easier to understand and maintain.

Now let’s look at what st.fsm.InitiateConnToPeer() actually does—

func (fsm *FSM) InitiateConnToPeer() {
  if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast) {
    fsm.logger.Info("Unknown neighbor address")
    return
  }
  remote := net.JoinHostPort(fsm.pConf.NeighborAddress.String(), config.BGPPort)
  local := ""

  if strings.TrimSpace(fsm.pConf.UpdateSource) != "" {
    local = net.JoinHostPort(strings.TrimSpace(fsm.pConf.UpdateSource), "0")
  }
  if fsm.outTCPConn == nil {
    fsm.outTCPConn = NewOutTCPConn(fsm, fsm.outConnCh, fsm.outConnErrCh)
    go fsm.outTCPConn.ConnectToPeer(fsm.connectRetryTime, remote, local)
  }
}

I’ve removed the logging code for clarity—I’ll be removing the logging code consistently throughout this series.

The first step is to determine if we have a valid, reachable peer IP address. This is taken care of by—

if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast)

If the neighbor address is the same as the IPv4 broadcast address ( which is what net.IPv4bcast represents), then we don’t have a valid peer address. At this point, we just log the event and fail the attempt to connect to this peer. If we have a valid address to peer to, we need to build the data structures that will hold the TCP state. Remember that TCP is a stateful connection, which means we not only need to keep track of our local state, but we also need to keep track of the window and other information for the remote TCP peer. This is why there are two sets of calls to net.JoinHostPort, one for the local state, and one for the remote state.

Now that we have someplace to store the remote and local state, we can actually open a TCP connection (NewOutTCPConn) and then try to open the peering session (ConnectToPeer).

You can find the ConnectToPeer code in fsm/conn.go around line 175; the code is somewhat low level, so we won’t spend any time going through it here. Just taking a quick look shows that it essentially calls o.Connect, which then tries to open a new TCP session to the IP address specified.

Assuming this connection is actually opened, we have successfully moved the peer from idle to connect. We’ll tie up some loose ends in the next installment, and then consider the process of moving beyond connect state.

snaproute Go BGP Code Dive (2)

Now that you have a copy of BGP in Go on your machine, it’s tempting to jump right in and start asking the code questions—but it’s important to get a sense of how things are structured first. In essence, you need to build a mental map of how the functionality of the protocol you already know relates to specific files and structures, so you can start in the right place when finding out how things work. Interacting with an implementation from the initial stages of process bringup, building data structures, and the like isn’t all that profitable. Rather, asking questions of the code is an iterative, interactive process.

Take what you know, form an impression of where you might look to find the answer to a particular question, and, in the process of finding the answer, learn more about the protocol, which will help you find answers more quickly in the future.

So let’s poke around the directory structure a little, think about how BGP works, and think about what might be where. To begin, what might we call the basic functions of BGP? Let me take a shot at a list (if you see things you think should be on here, feel free to leave a comment—you might think of something I don’t, or we might have different ideas about what these should be, etc.):

  • Handle peering sessions
  • Receive updates
  • Run bestpath
  • Install routes into local tables
  • Install routes into the Routing Information Base (RIB)

Each of these can be broken down into a lot of other pieces and parts, but we don’t want to go too deep here for the moment; we’re really trying to guess how the basic functions of the protocol align with directories and files in the actual code. Essentially: if I want to know how this particular implementation of BGP handles peering, where would I look? Now, let’s glance at the actual contents of SnapRoute’s Go BGP implementation, and see what we can figure out. Can we match any functions to directories?


Some of the things here I can guess at just from experience (note I’m not going to verify this stuff, and I might be wrong in some cases, but that’s okay; we’re just taking a first stab at figuring out where things might be):

  • api—which means Application Programming Interface. Probably a set of files that declare function calls and the like into other applications.
  • flexswitch—since FlexSwitch is the actual name of the project, this probably contains files related to the overall routing engine SnapRoute is creating/maintaining. I would expect to find interfaces and interprocess communication to other processes in the same project, or something like that.
  • fsm—means Finite State Machine. A routing protocol can be described as a set of states, with specific events that cause the protocol to shift from one state to another. For instance, when a BGP peer shifts from active to idle, this is a state change. The FSM would be considered the “heart” of the protocol in many ways.
  • ovs—means Open Virtual Switch. This is probably interfaces to OVSDB, which allows this version of BGP to run on the OpenSwitch project.
  • rpc—means remote procedure call.
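To make the FSM idea concrete, here is a toy sketch of a state machine driven by events. This is a deliberately simplified subset of the real BGP FSM (which has six states: Idle, Connect, Active, OpenSent, OpenConfirm, Established), not the SnapRoute implementation:

```go
package main

import "fmt"

// A heavily simplified subset of the BGP FSM, for illustration only.
type state int

const (
	idle state = iota
	connecting
	active
	established
)

func (s state) String() string {
	return [...]string{"idle", "connect", "active", "established"}[s]
}

type event int

const (
	manualStart event = iota
	tcpFails
	sessionUp
)

// next encodes a few transitions: events drive the protocol from one
// state to another, which is why the FSM is the "heart" of BGP.
func next(s state, e event) state {
	switch {
	case s == idle && e == manualStart:
		return connecting
	case s == connecting && e == tcpFails:
		return active
	case s == connecting && e == sessionUp:
		return established
	}
	return s // unhandled events leave the state alone
}

func main() {
	s := idle
	for _, e := range []event{manualStart, tcpFails} {
		s = next(s, e)
	}
	fmt.Println(s) // the peer fell back from connect to active
}
```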

Another good place to look is in the /docs directory, which sometimes has useful information about how the code is structured. In this particular case, there is a diagram in the /docs directory that shows a basic overview of the code.


From this we can gather that the neighbor, FSM, and BGP RIB are considered three different modules in the code base. We can also infer there is an external database that holds the BGP tables and configuration, accessed through the Thrift RPC. The server module is interesting; we’ll have to watch for this as we start asking the code specific questions, to figure out what it might be used for. I’ll give you a hint up front, and say this is a pretty common structure for just about every piece of software that is driven by events.

That’s enough poking around for this post; we’ll look at some tools next, and then start into actually asking the code questions.

Getting to the point of dual homing

I wonder how many times I’ve seen this sort of diagram across the many years I’ve been doing network design?


It’s usually held up as an example of how clever the engineer running the network is about resilience. “You see,” the diagram asserts, “I’m smart enough to purchase connectivity from two providers, rather than one.”

Can I point something out? Admittedly it might not be all that obvious from the diagram, but reality is just about as likely to squish your network connectivity like a bug on a windshield as it is any other network’s. Particularly if both of these connections are in the same regional area. The tricky part is knowing, of course, what a “regional area” might happen to mean for any particular provider.

The problem with this design is very basic, and tied to the concept of shared risk link groups. But let me start someplace a little simpler than that: with the basic, and important, point that putting fiber in the ground, and maintaining fiber that’s in the ground, is expensive. Unless you live in Greenland, fiber can be physically buried pretty easily (fiber in Greenland is generally buried by a blasting crew using dynamite, or run through conduit that’s bolted to the surface of the ubiquitous rock). But it’s not the burying that costs a lot of money; it’s the politics.

To bury a cable, you must get a right of way, and getting a right of way can be very expensive in any given city. I remember encountering one particular situation where the land under consideration was owned, in theory, by a railroad. Well, it was close enough to an old station that it must have been. But it took several years of looking through old piles of paper to find the correct paper trail and figure out who, precisely, actually owned the land in a legally provable way. This is not a task for the faint of heart.

What has this to do with the image above? A lot, actually. Because it’s so expensive to install last mile fiber, providers often share this last mile. To explain, let’s look at a small picture, just below, that might be helpful.


This is the way many providers actually build their last mile. There is (normally a pair of) fiber ring(s), with a set of ROADMs at key locations in the region (ROADM actually means “randomly dropping all de traffic that matters,” but don’t tell anyone, it’s a secret). When a customer is connected to the network, they are assigned a lightwave on the fiber that carries their traffic from the customer edge device, over a virtual layer 2 circuit (generally point-to-point, but not always), to a central office or exchange point. Here the different lightwaves are split up and handed off to different providers through good old fashioned routing. One provider normally owns the fiber, and other providers lease wavelengths, bandwidth, and the like to reach customers in the region.

Looking at this second image, you might be able to see what the problem is with the first. It’s possible (actually probable; in fact, I’ve seen it happen in real life) that a single backhoe fade within the region will take out both providers’ circuits at the same time.

The problem here isn’t really the lack of diversity. Rather, it’s that the lack of diversity is hidden through the magical abstraction of virtualization. Two logical circuits that share the same fate because they both run over the same physical media are, by the way, called a Shared Risk Link Group (SRLG). Providers aren’t likely to tell you when you’re at risk from this sort of problem, for several reasons.

First, telling you who leased fiber from whom is bad business. Second, they may not actually know enough about their competitors to point this problem out. Third, it’s really in their business interest to try to convince you not to do this, but rather to buy all your upstream from them.

So—what can you do about this?

If you’re going to connect to two providers, try to do so in two different regions. This is often difficult, as you don’t really know where the regions are, and connecting two sites that provide backup for one another across multiple regional rings can be a challenge for geographical reasons.

One alternative here is to connect to a local exchange point (an IXP), and then connect through the IXP’s fabric to the various providers. While the IXP will likely lease its circuits from others, it will have a much better idea of where the cables physically run, and how to provide diverse circuits (but only if you know what you’re asking for).

Another alternative is to simply stick with a single provider, and insist on physical diversity in any resilient links. This plays into the provider’s hand of trying to get you to buy from a single source, but it gets around the problem of trying to figure out what cable is where, and who uses what (information you’re not generally going to be able to find anyway), and puts it on the shoulders of the provider—who does know, at least for their network.

The next time you think you’ve solved the resilience problem by quickly and easily dual homing, remember shared risk, and remember to look for the deeper problem that’s been hidden away through an abstraction—an abstraction that far too often is leaky.

When prepend fails, what next? (3)

We began this short series with a simple problem—what do you do if your inbound traffic across two Internet facing links is imbalanced? In other words, how do you do BGP load balancing? The first post looked at problems with AS Path prepend, while the second looked at de-aggregating and using communities to modify the local preference within the upstream provider’s network.

There is one specific solution I want to discuss a bit more before I end this little series: de-aggregation. Advertising longer prefixes is the “big hammer” of routing; you should always be careful when advertising more specifics. The Default Free Zone (DFZ) is much like the “commons” of an old village. No-one actually “owns” the routing table in the global Internet, but everyone benefits from it. De-aggregating doesn’t really cost you anything, but it does cost everyone else something. It’s easy enough to inject another route into the routing table, but remember the longer prefix you inject shows up everywhere in the world. You’re fixing your problem by taking up some small amount of memory in every router that’s connected to the DFZ in the world. If everyone de-aggregates, everyone has to buy larger routers and more memory. Including you.
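The cost is easy to quantify: each extra bit of prefix length doubles the number of more specifics it takes to cover the aggregate. A quick sketch of the arithmetic:

```go
package main

import "fmt"

// moreSpecifics returns how many routes of length specLen it takes to
// cover an aggregate of length aggLen: 2^(specLen-aggLen).
func moreSpecifics(aggLen, specLen int) int {
	return 1 << uint(specLen-aggLen)
}

func main() {
	// De-aggregating a single /22 into /24s puts four routes into the
	// DFZ where one used to be; a /20 into /24s puts in sixteen.
	fmt.Println(moreSpecifics(22, 24))
	fmt.Println(moreSpecifics(20, 24))
}
```

Multiply those small numbers by tens of thousands of operators making the same choice, and the pressure on DFZ table size becomes obvious.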

There is a fine line between using a commonly held resource and abusing a commonly held resource. If everyone abuses the commons because it “does not cost them anything,” what results is the tragedy of the commons. Once a commons is ruined, it’s very difficult to recover the original intent and trust relationships that created it in the first place. So before you de-aggregate, you should think about whether or not it is really necessary.

Is this really necessary? Does it really matter if your two inbound links are not balanced? There may be financial reasons why it does matter, such as the costs of the two links, or the cost of bursting over a set level on one of the two links. These are certainly considerations, but it might make more sense to modify the sizing of the available links rather than putting a technical solution in place that will need to be managed and maintained.

Remember everything you configure will eventually break, and everything that breaks results in a call at 2AM. Think through the options you have available before putting an optimization in place.

Are there ways I can limit the damage to the commons?


Returning to our original network, is it possible to de-aggregate in a way that pulls traffic from AS65001 into AS65004, but doesn’t impact the table size of anyone these two providers are connected to? Most providers do, in fact, allow you not only to send a community to set the local preference within their AS, but also to block the advertisement of any particular route to their peers. You might need to play around with these communities a bit to understand the relationship between the community and inbound traffic flow; for instance, what impact will blocking the advertisement of a more specific to the transit peers of one upstream have, versus blocking the route to some set of customers connected to the provider? There is no way for you to directly know how and where the provider is connected, so you can work directly with the provider to sort out what to advertise where while reducing your global impact, or you might just need to play around with different combinations to see what works best.

Is my peering the right peering? Another option is to think through who you are peering with. Assume, for a moment, that you are peering with one more-regional provider and one more-global provider. In this case, your customer base is going to play a large role in which provider sends you more traffic.

For instance, if you are a regional bank, or health care provider, most of your customers are going to be connected to a regional provider (rather than a tier 1), and hence you are likely to receive most of your traffic on the regional provider’s link. If, however, your business is more global, the regional provider is not going to send you a lot of traffic—mostly just people who happen to be accessing your network from within your region. In this case, the imbalance between the two inbound links should be expected.

An observation: if this is so, maybe it is better to peer with two providers that will bring you closer to your customers. If your customers are global, maybe it’s better to peer with two providers at the national or global level, rather than one global and one regional—and the other way around. Perhaps it is better to balance your inbound traffic by carefully considering who your customers are, and how to best reach them, than it is to try and play engineering tricks to draw equal amounts of traffic over the networks of two completely different kinds of providers.

The bottom line is this: the engineering solution is the last solution you should reach for. I know, we are all engineers here, and there’s nothing quite like getting under a hard problem and solving it with a nice, long set of configuration commands that makes you feel like you spent your money well in buying that big hunk of iron racked up next to the demarc.

But real engineering begins when you ask the background questions, and really understand the problem.

When prepend fails, what next? (1)

So you want to load share better on your inbound ‘net links. If you look around the ‘web, it won’t take long to find a site that explains how to configure AS Path Prepending. So the next time you have downtime, you configure it up, turn everything back on, and… Well, it moved some traffic, but not as much as you’d like. So you wait ’til the next scheduled maintenance window and configure a couple of extra prepends into the mix. Now you fire it all back up and… not much happens. Why not? There are a couple of reasons prepending isn’t always that effective—but it primarily has to do with the way the Internet itself tends to be built. Let’s use the figure below as an example network.


You’re sitting at AS65000, and you’re trying to get the traffic to be relatively balanced across the 65001->65000 and the 65004->65000 links. Say you’ve prepended towards AS65001, as that’s the provider sending you more traffic. Assume, for a moment, that AS65003 accepts routes from both AS65001 and AS65004 on an equal basis. When you prepend, you’re causing the route towards your destinations to appear to be longer from AS65003’s perspective. This path will be affected by the first prepend.

But now consider the second prepend—will it have any impact on the traffic flow? AS65003 only has two paths to the destination, one through AS65001 and one through AS65004. It can only choose one of these two paths. If the single prepend worked, a second prepend isn’t going to make any difference. This alerts us to the first problem with prepending: it’s only effective within the realistic parameters of the AS Path length. Adding 256 prepends in this network isn’t going to have any more impact than the first prepend.
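You can see why the second prepend is wasted with a little arithmetic on path lengths. In this toy sketch (the AS numbers are from the figure; AS65003’s selection rule is reduced to “shorter AS Path wins, ties go to the existing best path”), the first prepend flips the decision and further prepends change nothing:

```go
package main

import "fmt"

// shorterPath models AS65003's choice between its two paths: it
// prefers the shorter AS Path, breaking ties toward the first
// argument (the path through AS65001).
func shorterPath(viaAS65001, viaAS65004 []uint32) string {
	if len(viaAS65001) <= len(viaAS65004) {
		return "via AS65001"
	}
	return "via AS65004"
}

// prepend adds n copies of asn to the path; only the resulting
// length matters for this comparison.
func prepend(path []uint32, asn uint32, n int) []uint32 {
	out := path
	for i := 0; i < n; i++ {
		out = append([]uint32{asn}, out...)
	}
	return out
}

func main() {
	viaA := []uint32{65001, 65000} // path through AS65001
	viaB := []uint32{65004, 65000} // path through AS65004

	fmt.Println(shorterPath(viaA, viaB))                      // no prepend: tie
	fmt.Println(shorterPath(prepend(viaA, 65000, 1), viaB))   // one prepend flips the choice
	fmt.Println(shorterPath(prepend(viaA, 65000, 256), viaB)) // 256 prepends: same result
}
```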

If the effectiveness of prepending is related to the overall path length through the network (edge to edge), then we should ask—what is the average path length of the global Internet? As it turns out, there are folks who measure this sort of thing on a regular basis, and have for quite a long time (in terms of Internet time scales)—CAIDA, RIPE, and Potaroo, for instance, all have pretty extensive measurements taken from the Internet Default Free Zone (DFZ) over time. Here is a chart of the average AS Path length in the DFZ since 1998:


As it turns out, the average AS Path length hasn’t changed much in the last eight years, even though the number of routes and the number of connected autonomous systems have dramatically increased over that same time period. The lesson here is the first AS Path prepend is probably going to have the most impact, the second will have a lesser impact, and after that you’re probably just typing for the fun of it.

There are two other reasons prepending can fail.

First, consider the connection between AS65001 and AS65004. We know this is some sort of peering relationship; it could be settlement free, it could have some sort of settlement on it, or, well, who knows? But one thing you can know is that AS65001 is always going to prefer the route learned directly from you over the same route learned through AS65004. AS65001 is going to configure this preference using LOCAL_PREF, which is evaluated well before your puny little AS Path prepend. Bottom line? You’re never going to draw traffic across the 65004->65001 link using prepend.
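To see why, sketch the two bestpath steps in play here. This is a toy model, not any router’s actual implementation; the point is simply that LOCAL_PREF is compared before AS Path length, so no amount of prepending changes the outcome:

```go
package main

import "fmt"

type route struct {
	localPref int // set by local policy; higher wins
	asPathLen int // length of the AS Path; shorter wins
}

// better applies the two bestpath steps discussed here: higher
// LOCAL_PREF first, and only then shorter AS Path.
func better(a, b route) bool {
	if a.localPref != b.localPref {
		return a.localPref > b.localPref
	}
	return a.asPathLen < b.asPathLen
}

func main() {
	fromCustomer := route{localPref: 200, asPathLen: 1} // learned directly from you
	viaPeer := route{localPref: 100, asPathLen: 1}      // your route via AS65004

	fmt.Println(better(fromCustomer, viaPeer)) // true: customer route wins

	// Prepending only grows asPathLen; it never touches LOCAL_PREF,
	// so even an absurd prepend cannot flip the decision.
	fromCustomer.asPathLen += 256
	fmt.Println(better(fromCustomer, viaPeer)) // still true
}
```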

Second, consider AS65002 sitting up in the corner. Once again, note that AS65001 is always going to prefer routes learned from its customers. So to add one more point to the one above, you’re never going to get the traffic from AS65002 to travel through AS65004 instead of AS65001.

All this to say: if a majority of your traffic is being sourced from one of your two providers’ customers, prepend is going to be useless in redirecting that traffic through the other provider.

Now that we know why prepend doesn’t always work, what can we do about it? We’ll save the answer ’til next week’s Design Board.

Securing BGP: A Case Study (8)

Throughout the last several months, I’ve been building a set of posts examining securing BGP as a sort of case study around protocol and/or system design. The point of this series of posts isn’t to find a way to secure BGP specifically, but rather to look at the kinds of problems we need to think about when building such a system. The interplay between technical and business requirements is wide and deep. In this post, I’m going to summarize the requirements drawn from the last seven posts in the series.

Don’t try to prove things you can’t. This might feel like a bit of an “anti-requirement,” but the point is still important. In this case, we can’t prove which path traffic will actually flow along. We also can’t enforce policies, specifically “don’t transit this AS;” the best we can do is provide information and let other operators make a local decision about what to follow and what not to follow. In the larger sense, it’s important to understand what can, and what can’t, be solved, or rather what the practical limits of any solution might be, as close to the beginning of the design phase as possible.

In the case of securing BGP, I can, at most, validate three pieces of information:

  • That the origin AS in the AS Path matches the owner of the address being advertised.
  • That the AS Path in the advertisement is a valid path, in the sense that each pair of autonomous systems in the AS Path are actually connected, and that no-one has “inserted themselves” in the path silently.
  • The policies of each pair of autonomous systems along the path towards one another. This is completely voluntary information, of course, and cannot be enforced in any way if it is provided, but more information provided will allow for stronger validation.
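The first of these checks is simple enough to sketch. Here is a toy version; the table stands in for what an RPKI ROA roughly asserts, and all the names are illustrative rather than any real API:

```go
package main

import "fmt"

// authorizedOrigin is a toy origin-validation table: which AS is
// authorized to originate which prefix. The prefixes and AS numbers
// are documentation examples, not real assignments.
var authorizedOrigin = map[string]uint32{
	"192.0.2.0/24":    65000,
	"198.51.100.0/24": 65001,
}

// originValid checks the first validation above: the origin AS (the
// last AS in the AS Path) must match the registered owner of the
// advertised prefix.
func originValid(prefix string, asPath []uint32) bool {
	want, ok := authorizedOrigin[prefix]
	if !ok || len(asPath) == 0 {
		return false
	}
	return asPath[len(asPath)-1] == want
}

func main() {
	fmt.Println(originValid("192.0.2.0/24", []uint32{65004, 65001, 65000})) // true: correct origin
	fmt.Println(originValid("192.0.2.0/24", []uint32{65004, 65666}))        // false: hijacked origin
}
```

Note this check alone says nothing about the rest of the path, which is why the second and third validations above exist.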

There is a fine balance between centralized and distributed systems. There are two things that can be centralized or distributed in terms of BGP security: how ownership is claimed over resources, and how the validation information is carried to each participating AS. In the case of ownership, the tradeoff is between having a widely trusted third party validate ownership claims and having a third party who can shut down an entire business. In the case of distributing the information, there is a tradeoff between the consistency and the accessibility of the validation information. These are points on which reasonable people can disagree, and hence are probably areas where a successful system must have a good deal of flexibility.

Cost is a major concern. There are a number of costs that need to be considered when determining which solution is best for securing BGP, including—

  • Physical equipment costs. The most obvious cost is the physical equipment required to implement each solution. For instance, any solution that requires providers to replace all their edge routers is simply not going to be acceptable.
  • Process costs. Any solution that requires a lot of upkeep and maintenance is going to be cast aside very quickly. Good intentions are overruled by the tyranny of the immediate about 99.99% of the time.

Speed is also a cost that can be measured in business terms; if increasing security decreases the speed of convergence, providers who deploy security are at a business disadvantage relative to their competitors. The speed of convergence must be on the order of Internet level convergence today.

Information costs are a particularly important issue. There are at least three kinds of information that can leak out of any attempt to validate BGP, each of them related to connectivity—

  • Specific information about peering, such as how many routers interconnect two autonomous systems, where interconnections are, and how interconnection points are related to one another.
  • Publicly verifiable claims about interconnection. Many providers argue there is a major difference between connectivity information that can be observed and connectivity information that is claimed.
  • Publicly verifiable information about business relationships. Virtually every provider considers it important not to release at least some information about their business relationships with other providers and customers.

While there is some disagreement in the community over each of these points, it’s clear that releasing the first of these is almost always going to be unacceptable, while the second and third are more situational.

With these requirements in place, it’s time to look at a couple of proposed systems to see how they measure up.

Securing BGP: A Case Study (7)

In the last post on this series on securing BGP, I considered a couple of extra questions around business problems that relate to BGP. This time, I want to consider the problem of convergence speed in light of any sort of BGP security system. The next post (to provide something of a road map) should pull all the requirements side together into a single post, so we can begin working through some of the solutions available. Ultimately, as this is a case study, we’re after a set of tradeoffs for each solution, rather than a final decision about which solution to use.

The question we need to consider here is: should the information used to provide validation for BGP be somewhat centralized, or fully distributed? The CAP theorem tells us that there are a range of choices here, with the two extreme cases being—

  • A single copy of the database we’re using to provide validation information which is always consistent
  • Multiple disconnected copies of the database we’re using to provide validation which is only intermittently consistent

Between these two extremes there are a range of choices (reducing all possibilities to these two extremes is, in fact, a misuse of the CAP theorem). To help understand this as a range of tradeoffs, take a look at the chart below—


The further we go to the right along this chart:

    • the more copies of the database there are in existence; more copies means more devices that must receive and update a local copy, which means slower convergence.
    • the slower the connectivity between the copies of the database.

In complexity model terms, both of these relate to the interaction surface; slower and larger interaction surfaces face their tradeoff in the amount and speed of state that can be successfully (or quickly) managed in a control plane (hence the tradeoffs we see in the CAP theorem are directly mirrored in the complexity model). Given this, what is it we need out of a system used to provide validation for BGP? Let’s set up a specific situation that might help answer this question.

Assume, for a moment, that your network is under some sort of distributed denial of service (DDoS) attack. You call up some sort of DDoS mitigation provider, and they say something like “just transfer your origin validation to us, so we can advertise the route without it being black holed; we’ll scrub the traffic and transfer the clean flows back to you through a tunnel.” Now ask this: how long are you willing to wait before the DDoS protection takes place? Two or three days? A day? Hours? Minutes? If you can locate that amount of time along the chart above, then you can get a sense of the problem we’re trying to solve.

To put this in different terms: any system that provides BGP validation information must converge at roughly the speed of BGP itself.

So—why not just put the information securing BGP in BGP itself, so that routing converges at the same speed as the validation information? This implies every edge device in my network must handle cryptographic processing to verify the validation information. There are some definite tradeoffs to consider here, but we’ll leave those to the evaluation of proposed solutions.

Before leaving this post and beginning on the process of wrapping up the requirements around securing BGP (to be summarized in the next post), one more point needs to be considered. I’ll just state the point here, because the reason for this requirement should be pretty obvious.

Injecting validation information into the routing system should expose no more information about the peering structure of my AS than can be inferred through data mining of publicly available information. For instance, today I can tell that AS65000 is connected to AS65001. I can probably infer something about their business relationship, as well. What I cannot tell, today, is how many actual eBGP speakers connect the two autonomous systems, nor can I infer anything about the location of those connection points. Revealing this information could lead to some serious security and policy problems for a network operator.

Reaction: BGP convergence, divergence & the ‘net

Let’s have a little talk about BGP convergence.

We tend to make a number of assumptions about the Internet, and sometimes these assumptions don’t always stand up to critical analysis. . . . On the Internet anyone can communicate with anyone else – right? -via APNIC

Geoff Huston’s recent article on the reality of Internet connectivity—no, everyone cannot connect to everyone—prompted a range of reactions from various folks I know.

For instance, BGP is broken! After all, any routing protocol that can’t provide basic reachability to every attached destination must be broken, right? The problem with this statement is it assumes BGP is, at core, a routing protocol. To set the record straight, BGP is not, at heart, a routing protocol in the traditional sense of the term. BGP is a system used to describe bilateral peering arrangements between independent parties in a way that provides loop free reachability information. The primary focus of BGP is not loop free reachability, but policy.

After all, BGP convergence is a big deal, right? Part of the problem here is that we use BGP as a routing protocol in some situations (for instance, on data center fabrics), so we have a hard time adjusting our thinking to the original peering policy based focus it was designed for. In the larger ‘net, it’s not a bug that some destinations are unreachable from some sources. It’s an expression of policy, and hence it’s a feature. There are certainly times when such policies are unintentional, but unintentional/unplanned policy is policy just the same as intentional/planned policy is.

We shouldn’t declare BGP broken for doing something it’s supposed to do.

There’s another point here, as well: Some networks never converge. And that’s okay. This is, perhaps, even harder for network engineers to get their heads around. I’ve spent twenty years making sure networks converge quickly, as loop free as possible, with as little chance for failure as possible, and using the least number of resources possible. But every network in the world doesn’t always have to converge to a single view of the topology and reachability. Really!

The problem here is the difference between the micro and macro views of the world. The ’net doesn’t converge for two reasons.

First, there’s that pesky policy problem again. Policy, in the real world, never converges. There are always contradictory policies, and policies will often form bistable states. This is maddening, of course, to the mind of an engineer, but it’s just reality intruding on our little bubble. Bubbles are, after all, meant to be burst.

Second, there’s that whole CAP theorem thing in there someplace. Not many people understand the application of CAP to routing, so I’m stuffing a post or two on this onto my todo list, but just remember: you can choose a Consistent database, a database that is Accessible by every reader/user all the time, or a database that can be Partitioned (but not all three at once). If you think about it, routing protocols are readable by every network device all the time, and they are partitioned among all the routers/intermediate systems in the network. Which means... they aren’t going to be consistent.

As in, if you feed a routing protocol enough changes often enough, it won’t ever converge, because its eventual consistency will always be catching up with reality. This is just the way the world is built; piling all the SDN unicorn magic in the world into routing isn’t going to solve this one, folks. On a network the size of the Internet, someone, somewhere, is always going to be changing something. This cripples BGP convergence; the ’net never converges.

In the history of ideas, perhaps BGP shouldn’t float to the top as one of the most brilliant (and Tony and Yaakov would probably even agree with you)—but it has, on the other hand, been one of the most successful. It’s just tight enough to work often enough to rely on the connectivity as described, and it’s just loose enough to allow policies to be injected where they need to be. No such system is ever going to be “perfect.”

We could beat our heads against a wall trying, of course, but even virtual reality has physical limitations.