snaproute Go BGP Code Dive (12): Moving to Established

In last week’s post, the new BGP peer we’re tracing through the snaproute BGP code moved from open to openconfirmed by receiving, and processing, the open message. In processing the open message, the list of AFIs this peer will support was built, the hold timer set, and the hold timer started. The next step is to move to established. RFC 4271, around page 70, describes the process as—

If the local system receives a KEEPALIVE message (KeepAliveMsg (Event 26)), the local system:
 - restarts the HoldTimer and
 - changes its state to Established.

In response to any other event (Events 9, 12-13, 20, 27-28), the local system:
 - sends a NOTIFICATION with a code of Finite State Machine Error,
 - sets the ConnectRetryTimer to zero,
 - releases all BGP resources,
 - drops the TCP connection,
 - increments the ConnectRetryCounter by 1,
 - (optionally) performs peer oscillation damping if the DampPeerOscillations attribute is set to TRUE, and
 - changes its state to Idle.

 

For a bit of review (this series has run long enough that you might have forgotten how the state machine works), the snaproute code is written as a state machine. The BGP peer must move through a series of states, and each state is represented by its own processEvent function in the fsm.go file. As the peer moves from one state to another, a function call “moves the pointer” from the current state to the next one, so any event that occurs is handled by a different function depending on the current state. I know this is rather difficult to follow, but what this means, in practical terms, is that if the underlying TCP session is acknowledged or confirmed while the peer is in connect state, the following code from around line 272 in fsm.go is executed:

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
 st.fsm.StopConnectRetryTimer()
 st.fsm.SetPeerConn(data)
 st.fsm.sendOpenMessage()
 st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime,
  st.fsm.neighborConf.RunningConf.KeepaliveTime)
 st.fsm.StartHoldTimer()
 st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

However, if this same event occurs (the underlying TCP session is acknowledged or confirmed) while the peer is in openconfirm state, a different set of code is executed, from around line 593 in fsm.go:

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
 st.fsm.HandleAnotherConnection(data)

This is a general characteristic of any FSM: the event is matched against the current state to determine what action to take next. With all of this in mind, any event received while the peer is in openconfirm state will be processed by func (st *OpenConfirmState) processEvent, which is around line 558 in fsm.go. This code consists of a switch statement, which looks like this:

func (st *OpenConfirmState) processEvent(event BGPFSMEvent, data interface{}) {
 switch event {
  case BGPEventManualStop:
   ....
  case BGPEventAutoStop:
   ....
  case BGPEventHoldTimerExp:
   ....
  case BGPEventKeepAliveTimerExp:
   ....
  case BGPEventTcpConnValid: // Supported later
  case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed: // Collision Detection... needs work
   ....
  case BGPEventTcpConnFails, BGPEventNotifMsg:
   ....
  case BGPEventBGPOpen: // Collision Detection... needs work
  case BGPEventHeaderErr, BGPEventOpenMsgErr:
   ....
  case BGPEventOpenCollisionDump:
   ....
  case BGPEventNotifMsgVerErr:
   ....
  case BGPEventKeepAliveMsg:
   .... 
  case BGPEventConnRetryTimerExp, BGPEventDelayOpenTimerExp, BGPEventIdleHoldTimerExp,
   ....
  }
}

 

I’ve cut out the actions taken in each case to make it easier to see the structure of the entire switch statement in one sweep. Most of these options are actually error conditions that take exactly the same steps. Let’s look at one to see what it does—

case BGPEventHoldTimerExp:
 st.fsm.SendNotificationMessage(packet.BGPHoldTimerExpired, 0, nil)
 st.fsm.StopConnectRetryTimer()
 st.fsm.ClearPeerConn()
 st.fsm.StopConnToPeer()
 st.fsm.IncrConnectRetryCounter()
 st.fsm.ChangeState(NewIdleState(st.fsm))

 

If the hold timer expires while the peer is in openconfirmed state—

  • A notification is sent by SendNotificationMessage; this will tell the peer that the session is being torn down, so the two speakers can have synchronized state
  • The connect retry timer is stopped, so the local BGP speaker will not try to reconnect until the peer has passed through the idle state; this prevents any problems that might result from stepping outside the BGP state machine
  • The peer connection is cleared; this just empties the various data structures associated with the peer, so old information isn’t carried into a new peering session
  • The peering connection is stopped by StopConnToPeer
  • The connection retry counter is incremented, which allows the operator to see how many times this peer has been torn down and restarted
  • The state of the peer is changed to idle

This set of actions only changes slightly from state to state; if you search for this set of steps, you’re likely to find it at least a few dozen times throughout fsm.go.
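Because the same teardown steps recur so often, it can help to see them gathered into a single helper. The following is only a sketch under stated assumptions: toyFSM, record, and teardown are hypothetical names, not part of the snaproute code, and the sketch records the order of the steps rather than touching a real connection.

```go
package main

import "fmt"

// toyFSM is a hypothetical stand-in for the real FSM; it only records
// the order in which the teardown steps run.
type toyFSM struct {
	state        string
	retryCounter int
	steps        []string
}

func (f *toyFSM) record(step string) { f.steps = append(f.steps, step) }

// teardown gathers the repeated error-handling sequence into one place.
// The notification reason is the only part that varies between callers.
func (f *toyFSM) teardown(reason string) {
	f.record("notify:" + reason)  // mirrors SendNotificationMessage
	f.record("stop-retry-timer")  // mirrors StopConnectRetryTimer
	f.record("clear-peer-conn")   // mirrors ClearPeerConn
	f.record("stop-conn-to-peer") // mirrors StopConnToPeer
	f.retryCounter++              // mirrors IncrConnectRetryCounter
	f.state = "Idle"              // mirrors ChangeState(NewIdleState(...))
}

func main() {
	f := &toyFSM{state: "OpenConfirm"}
	f.teardown("hold-timer-expired")
	fmt.Println(f.state, f.retryCounter, f.steps)
}
```

Whether a helper like this would actually be an improvement in fsm.go is a judgment call; repeating the steps inline keeps each case readable on its own, at the cost of having many places to update.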

There is one other interesting point about this code worth mentioning. The folks at snaproute apparently haven’t implemented peer collision detection, as evidenced by the comments in the code itself. For instance—


  case BGPEventTcpConnValid: // Supported later
  case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed: // Collision Detection... needs work
   ....
  case BGPEventTcpConnFails, BGPEventNotifMsg:
   ....
  case BGPEventBGPOpen: // Collision Detection... needs work

Each of these three events—receiving a new TCP connection towards a peer that is already in openconfirmed state, or receiving an open message from a peer that is already in openconfirmed state— represents an event that should not take place. What should the snaproute code do here? According to section 6.8 of RFC4271, it should—

Unless allowed via configuration, a connection collision with an existing BGP connection that is in the Established state causes closing of the newly created connection.

So when they eventually fill this bit of code in, you can be pretty certain what the actual code will do: it will reset the peering session in a way that’s similar to the other error code already present. The bit of code that’s interesting in the context of moving from openconfirmed to established is around line 627 in fsm.go:

case BGPEventKeepAliveMsg:
 st.fsm.StartHoldTimer()
 st.fsm.ChangeState(NewEstablishedState(st.fsm))

 

The actual processing to move from openconfirmed to established is simple: if the local peer receives a keep alive message while in the openconfirmed state, move the peer to established.
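The state-by-state dispatch described in this post can be reduced to a very small sketch. This is not the snaproute code; the type and event names below (fsm, state, evKeepAliveMsg, and so on) are hypothetical, and only two events are modeled. It is meant to show how the same handle call reaches a different processEvent depending on the current state.

```go
package main

import "fmt"

type event int

const (
	evKeepAliveMsg event = iota
	evHoldTimerExp
)

// state is the interface every FSM state implements.
type state interface {
	name() string
	processEvent(f *fsm, ev event)
}

// fsm holds a pointer to the current state; events are always
// delivered to whatever state is current at the time.
type fsm struct{ current state }

func (f *fsm) changeState(s state) { f.current = s }
func (f *fsm) handle(ev event)     { f.current.processEvent(f, ev) }

type idleState struct{}
type openConfirmState struct{}
type establishedState struct{}

func (idleState) name() string                  { return "Idle" }
func (idleState) processEvent(f *fsm, ev event) {}

func (openConfirmState) name() string { return "OpenConfirm" }
func (openConfirmState) processEvent(f *fsm, ev event) {
	switch ev {
	case evKeepAliveMsg:
		f.changeState(establishedState{}) // the transition traced above
	case evHoldTimerExp:
		f.changeState(idleState{}) // error path: back to Idle
	}
}

func (establishedState) name() string { return "Established" }
func (establishedState) processEvent(f *fsm, ev event) {
	if ev == evHoldTimerExp {
		f.changeState(idleState{})
	}
}

func main() {
	f := &fsm{current: openConfirmState{}}
	f.handle(evKeepAliveMsg)
	fmt.Println(f.current.name()) // Established
}
```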

As we’ve reached established state, the next step is to understand how updates are received and processed for this new peer.

snaproute Go BGP Code Dive (11): Moving to Open Confirm

In the last post in this series, we began considering the bgp code that handles the open message that begins moving a new peer to open confirmed state. This is the particular bit of code of interest—

case BGPEventBGPOpen:
  st.fsm.StopConnectRetryTimer()
  bgpMsg := data.(*packet.BGPMessage)
  if st.fsm.ProcessOpenMessage(bgpMsg) {
    st.fsm.sendKeepAliveMessage()
    st.fsm.StartHoldTimer()
    st.fsm.ChangeState(NewOpenConfirmState(st.fsm))
  }

We looked at how this code assigns the contents of the received packet to bgpMsg; now we need to look at how this information is actually processed. bgpMsg is passed to st.fsm.ProcessOpenMessage() in the next line. This call is preceded by the st.fsm, which means this function is going to be found in the FSM, which means fsm.go. Indeed, func (fsm *FSM) ProcessOpenMessage... is around line 1172 in fsm.go—

func (fsm *FSM) ProcessOpenMessage(pkt *packet.BGPMessage) bool {
 body := pkt.Body.(*packet.BGPOpen)

 if uint32(body.HoldTime) < fsm.holdTime {
  fsm.SetHoldTime(uint32(body.HoldTime), uint32(body.HoldTime/3))
 }

 if body.MyAS == fsm.Manager.gConf.AS {
  fsm.peerType = config.PeerTypeInternal
 } else {
  fsm.peerType = config.PeerTypeExternal
 }

 afiSafiMap := packet.GetProtocolFromOpenMsg(body)
 for protoFamily, _ := range afiSafiMap {
  if fsm.neighborConf.AfiSafiMap[protoFamily] {
   fsm.afiSafiMap[protoFamily] = true
  }
 }

 return fsm.Manager.receivedBGPOpenMessage(fsm.id, fsm.peerConn.dir, body)
}

There are three “sections” in this function, each one takes care of a different thing. The first section—

if uint32(body.HoldTime) < fsm.holdTime {
 fsm.SetHoldTime(uint32(body.HoldTime), uint32(body.HoldTime/3))
}

This is fairly simple; it compares the received hold time with the locally configured hold time, setting the final hold time to the lower of these two numbers. This is in line with the most recent BGP specification (RFC 4271), section 4.2, which states—

This 2-octet unsigned integer indicates the number of seconds the sender proposes for the value of the Hold Timer. Upon receipt of an OPEN message, a BGP speaker MUST calculate the value of the Hold Timer by using the smaller of its configured Hold Time and the Hold Time received in the OPEN message.
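The rule in the RFC text can be written as a small standalone function. This is a sketch, not the snaproute implementation: negotiateHoldTime is a hypothetical name, and the keepalive interval of one third of the hold time is taken from the ratio used in the snippet above.

```go
package main

import "fmt"

// negotiateHoldTime returns the hold time both speakers will use (the
// smaller of the two values) and a keepalive interval of one third of it.
func negotiateHoldTime(configured, received uint32) (hold, keepalive uint32) {
	hold = configured
	if received < hold {
		hold = received
	}
	return hold, hold / 3
}

func main() {
	hold, keepalive := negotiateHoldTime(180, 90)
	fmt.Println(hold, keepalive) // 90 30
}
```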

The second section of this code is a little more confusing—

if body.MyAS == fsm.Manager.gConf.AS {
 fsm.peerType = config.PeerTypeInternal
} else {
 fsm.peerType = config.PeerTypeExternal
}

This obviously somehow sets the type of peer, internal (iBGP) or external (eBGP), but how precisely does this work? The if statement is the crucial point here; if the condition is true, then the first branch is executed, which sets the peer type to iBGP. If the condition evaluates as false, the second branch is executed, setting the peer type to eBGP.

Note the difference between = and ==. In both C and Go, = assigns the value or the contents of the variable on the right side of the = to the variable on the left side. The == operator compares the two values. In C, the result is 1 if the values (or contents of the two variables) are the same and 0 if they do not match; in Go, the result is the boolean true or false.

The if statement itself is comparing body.MyAS to fsm.Manager.gConf.AS; what do these contain? body.MyAS is an element of the body structure, which is taken from the packet contents at the beginning of the function by the line body := pkt.Body.(*packet.BGPOpen). body.MyAS is, then, the AS number of the remote peer. On the other hand, fsm.Manager.gConf.AS is taken from the local fsm state, in particular the configuration state for the local peer process. Given these two definitions, these lines of code make sense; if the local and remote AS match, then the neighbor type should be set to iBGP. If they don’t match, then the neighbor type should be set to eBGP.
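The comparison can be isolated into a few lines. In this sketch, peerType, peerTypeInternal, peerTypeExternal, and classifyPeer are hypothetical names standing in for the config package values used by snaproute.

```go
package main

import "fmt"

type peerType int

const (
	peerTypeInternal peerType = iota // iBGP
	peerTypeExternal                 // eBGP
)

// classifyPeer compares the AS number from the received open message
// with the locally configured AS number.
func classifyPeer(localAS, remoteAS uint32) peerType {
	if remoteAS == localAS { // == compares; a single = would be an assignment
		return peerTypeInternal
	}
	return peerTypeExternal
}

func main() {
	fmt.Println(classifyPeer(65000, 65000) == peerTypeInternal) // true
	fmt.Println(classifyPeer(65000, 65001) == peerTypeExternal) // true
}
```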

The final section of code is the most complex of the three—

afiSafiMap := packet.GetProtocolFromOpenMsg(body)
for protoFamily, _ := range afiSafiMap {
 if fsm.neighborConf.AfiSafiMap[protoFamily] {
  fsm.afiSafiMap[protoFamily] = true
 }
}

The first line of code here grabs the address families (AFIs)/subaddress families (SAFIs) supported by the peer, as reported in its open message, and places them into a map. The second line of code, for protoFamily, _ := range afiSafiMap {, walks through each protocol family the peer advertised, checking each one to see if it’s also present in the locally configured map for this neighbor (fsm.neighborConf.AfiSafiMap). If a particular AFI/SAFI appears in both, it is set to true in fsm.afiSafiMap, which will serve as an indicator to any other process interacting with this particular peer which specific AFIs/SAFIs are supported.
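The loop amounts to taking the intersection of two sets. A minimal sketch, assuming string keys for readability (the real code uses its own protocol family type, and negotiateFamilies is a hypothetical name):

```go
package main

import "fmt"

// negotiateFamilies returns the families present in both maps.
func negotiateFamilies(local, peer map[string]bool) map[string]bool {
	common := make(map[string]bool)
	for family := range peer {
		if local[family] { // a missing key reads as false
			common[family] = true
		}
	}
	return common
}

func main() {
	local := map[string]bool{"ipv4-unicast": true, "ipv6-unicast": true}
	peer := map[string]bool{"ipv4-unicast": true, "ipv4-multicast": true}
	fmt.Println(negotiateFamilies(local, peer)) // map[ipv4-unicast:true]
}
```

Note that indexing a Go map with a key that is not present returns the zero value, which is false for bool; this is what lets the if statement work without a separate existence check.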

At this point, the open message received from the new peer has been processed. Once ProcessOpenMessage finishes, it will return to the main FSM, and to the remainder of the switch statement above.

st.fsm.sendKeepAliveMessage() will now send the first BGP keepalive to this new peer; as there is no timer for sending keepalive messages set at this point, and there is no way to tell how long processing the open message has taken, the safest thing to do is to send this first keepalive message.

st.fsm.StartHoldTimer() will now start a hold timer. If this timer expires, the peer will be brought down. This is something we’ll look at later, when we consider the error conditions the various bits of code might encounter, and the expiration (waking up) of various timers set along the way.

Finally, st.fsm.ChangeState(NewOpenConfirmState(st.fsm)) sets the current state to open confirm, bringing us one step closer to exchanging databases, and transitioning this new peer into the normal state for BGP neighbors.

We’ll consider the next step in this process in the next code dive.

snaproute Go BGP Code Dive (10): Moving to Open Confirm

In the last post on this topic, we traced how snaproute’s BGP code moved to the open state. At the end of that post, the speaker encodes an open message using packet, _ := bgpOpenMsg.Encode(), and then sends it. What we should be expecting next is for an open message from the new peer to be received and processed. Receiving this open message will be an event, so what we’re going to need to look for is someplace in the code that processes the receipt of an open message. All the way back in the fifth post of this series, we actually unraveled this chain, and found this is the call chain we’re looking for—

  • func (st *OpenSentState) processEvent()
  • st.fsm.StopConnectRetryTimer()
  • bgpMsg := data.(*packet.BGPMessage)
  • if st.fsm.ProcessOpenMessage(bgpMsg) {
    • st.fsm.sendKeepAliveMessage()
    • st.fsm.StartHoldTimer()
    • st.fsm.ChangeState(NewOpenConfirmState(st.fsm)) }

I don’t want to retrace all those steps here, but the call to func (st *OpenSentState) processEvent() (around line 444 in fsm.go) looks correct. The call in question must be a call to a function that processes an event while the peer is in the open state. This call seems to satisfy both requirements. There is a large switch statement in this function; let’s see if we can sort out what a few of these do to get a general sense of what is in this switch.

  • case BGPEventManualStop: this covers the case where the operator manually deconfigures or otherwise stops the BGP process, or the formation of this specific peer
  • case BGPEventAutoStop: this covers the case where the BGP process is brought down for some automatically generated reason; for instance, this (probably) covers the case where the BGP process is shut down because the system itself is going down
  • case BGPEventHoldTimerExp: when the peer was moved into the open state, the hold timer was configured and started running; if the hold timer expires before an open message is received from the peer, then a notification is sent and the peer is pushed back to idle state
  • case BGPEventTcpConnFails: if the TCP socket reports that the connection has failed, the peer is cleared and set back to active state

The particular bit of code in this switch we’re interested in is—

case BGPEventBGPOpen:
  st.fsm.StopConnectRetryTimer()
  bgpMsg := data.(*packet.BGPMessage)
  if st.fsm.ProcessOpenMessage(bgpMsg) {
    st.fsm.sendKeepAliveMessage()
    st.fsm.StartHoldTimer()
    st.fsm.ChangeState(NewOpenConfirmState(st.fsm))
  }

Well, this doesn’t look so bad, right? Just a few short lines of code. 🙂

st.fsm.StopConnectRetryTimer() is pretty obvious, so I won’t spend a lot of time here. The peer is now connected, so there’s no reason to keep running the timer that causes events when the timer expires.

bgpMsg := data.(*packet.BGPMessage) might not be so obvious at first. In order to reach this state, the local peer has received a packet of some type. The contents of that packet must somehow be processed to actually form the peering relationship. This line of code just creates a new variable called bgpMsg and assigns the received packet to this variable. The := operator is specific to go, so it’s probably worth pausing for a second to explain.

Typing is a method a programming language uses to control memory usage, catch errors in the code during the compilation process, etc. If you define a new variable that is supposed to hold a whole number, or a number without a floating point component (the fractional part after the decimal point), and assign it the value 2, you might do something like this in C—

int aNumber;
aNumber = 2;

Go does things a little differently, placing the name of the variable before the type, like this:

var aNumber int
aNumber = 2

The first line is considered the variable declaration, while the second is the variable assignment. These are normally two separate steps. But in Go, there is a shortcut to this process. You can declare the variable and assign a value in one step, like this:

aNumber := 2

How does the compiler know what kind or type of variable aNumber is? By looking at the value assigned. In this case, the coder has declared a variable called bgpMsg, and assigned it the value of the contents of the open message just received in one step. The type of bgpMsg comes from the type assertion on the right side, data.(*packet.BGPMessage), so bgpMsg is a *packet.BGPMessage.
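To see both forms of declaration next to the type assertion used in the handler, here is a small sketch. bgpMessage and asMessage are hypothetical stand-ins, not snaproute types; the sketch also uses the two-value form of the assertion, which reports failure through a boolean rather than panicking.

```go
package main

import "fmt"

// bgpMessage is a hypothetical stand-in for packet.BGPMessage.
type bgpMessage struct{ msgType uint8 }

// asMessage performs the same kind of assertion as the handler, but in
// the two-value form: an unexpected type yields ok == false, not a panic.
func asMessage(data interface{}) (*bgpMessage, bool) {
	msg, ok := data.(*bgpMessage)
	return msg, ok
}

func main() {
	var a int // declaration
	a = 2     // assignment
	b := 2    // declaration and assignment in one step; b is inferred to be int
	fmt.Printf("%T %T\n", a, b) // int int

	msg, ok := asMessage(&bgpMessage{msgType: 1})
	fmt.Println(msg.msgType, ok) // 1 true
}
```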

Next time, we’ll look at how this information is actually processed. ’Til then, happy coding.

snaproute Go BGP Code Dive (9): Moving to Open

In the last session of snaproute BGP code dive—number 8, in fact— I started looking at how snaproute’s BGP moves from connect to open. This is the chain of calls from that post—

  • st.fsm.StopConnectRetryTimer()
  • st.fsm.SetPeerConn(data)
  • st.fsm.sendOpenMessage()
  • st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime, st.fsm.neighborConf.RunningConf.KeepaliveTime)
  • st.fsm.StartHoldTimer()
  • st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

The past post covered the first two steps in this process, so this post will begin with the third step, st.fsm.sendOpenMessage(). Note the function call has st.fsm. in front, so this is a method called on the FSM that belongs to this state. Each FSM that is spun up (think of them as threads, or even processes, if you must, to get this concept in your head, even though they’re not) carries its own state, and the method operates on that state. When reading the code to sort out how it works, this doesn’t have much practical impact, other than telling us the sendOpenMessage function we’re looking for is going to be in the FSM file. The function is located around line 1233 in fsm.go:

func (fsm *FSM) sendOpenMessage() {
  optParams := packet.ConstructOptParams(uint32(fsm.pConf.LocalAS), fsm.neighborConf.AfiSafiMap,
               fsm.neighborConf.RunningConf.AddPathsRx, fsm.neighborConf.RunningConf.AddPathsMaxTx)
  bgpOpenMsg := packet.NewBGPOpenMessage(fsm.pConf.LocalAS, uint16(fsm.holdTime), 
                fsm.gConf.RouterId.To4().String(), optParams) 
  packet, _ := bgpOpenMsg.Encode()
  num, err := (*fsm.peerConn.conn).Write(packet)
  if err != nil {
    return
  }
}

Each of these calls is fairly straightforward—

  • optParams := packet.ConstructOptParams builds a data structure which can be used to build an open packet. Notice the construction involves such things as the AFI and AS number; these make sense if we're trying to build an open message to send to a new peer.
  • bgpOpenMsg := packet.NewBGPOpenMessage takes the data structure just built, adds some additional information, and actually builds a packet. If we were to look at this function, we'd find it takes this information and inserts it into the right TLVs, in the right order, to build a BGP open message.
  • packet, _ := bgpOpenMsg.Encode takes the set of TLVs and pushes any necessary headers, etc., onto the data structure to make it into an actual packet.
  • num, err := (*fsm.peerConn.conn).Write(packet) writes the packet to the TCP stream opened up way back in the fifth episode of this long running serial.
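To make the encode-then-write sequence concrete, here is a sketch that builds the fixed part of an OPEN message by hand, following the layout in RFC 4271 sections 4.1 and 4.2, and writes it to a buffer. It is not the snaproute encoder: encodeOpen is a hypothetical function, no optional parameters are included, and a bytes.Buffer stands in for the TCP connection.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeOpen builds a BGP OPEN message with no optional parameters:
// a 19-octet header followed by a 10-octet OPEN body.
func encodeOpen(myAS, holdTime uint16, routerID [4]byte) []byte {
	const headerLen, openBodyLen = 19, 10
	buf := make([]byte, headerLen+openBodyLen)
	for i := 0; i < 16; i++ {
		buf[i] = 0xff // marker: 16 octets of all ones
	}
	binary.BigEndian.PutUint16(buf[16:18], uint16(len(buf))) // total message length
	buf[18] = 1                                              // message type 1 = OPEN
	buf[19] = 4                                              // BGP version 4
	binary.BigEndian.PutUint16(buf[20:22], myAS)             // My Autonomous System
	binary.BigEndian.PutUint16(buf[22:24], holdTime)         // proposed hold time
	copy(buf[24:28], routerID[:])                            // BGP identifier
	buf[28] = 0                                              // optional parameters length
	return buf
}

func main() {
	var conn bytes.Buffer // stands in for the TCP connection to the peer
	n, err := conn.Write(encodeOpen(65000, 90, [4]byte{192, 0, 2, 1}))
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println(n) // 29
}
```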

Note that the result of the Write call is checked to make certain the packet was actually written to the TCP socket. If the packet is not written, an error is logged (I've removed the logging code as always to make following the actual chain of events simpler), and the function returns early. Now that the packet is written, and hence in flight to the new peer, what happens? The next call is st.fsm.SetHoldTime, around line 1107 of fsm.go:

func (fsm *FSM) SetHoldTime(holdTime uint32, keepaliveTime uint32) {
  if holdTime < 0 || (holdTime > 0 && holdTime < 3) {
    return
  }

  fsm.holdTime = holdTime
  fsm.keepAliveTime = keepaliveTime
}

This code is fairly simple; it just checks to make certain the hold time for this new peer isn't being set to 1 or 2 seconds (a value of 0 is allowed), and then sets the hold time as configured. Because holdTime is an unsigned integer, the holdTime < 0 test can never be true. Once the hold time is set, st.fsm.StartHoldTimer() actually starts the hold timer, around line 1120 of fsm.go:

func (fsm *FSM) StartHoldTimer() {
  if fsm.holdTime != 0 {
    fsm.holdTimer.Reset(time.Duration(fsm.holdTime) * time.Second)  
  }
}
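The two functions work as a pair: one validates and stores the hold time, the other arms the timer only when the hold time is nonzero. A rough sketch of that pairing follows; holdState, newHoldState, and the boolean return values are assumptions added here so the behavior can be observed, and are not part of the snaproute code.

```go
package main

import (
	"fmt"
	"time"
)

type holdState struct {
	holdTime  uint32
	holdTimer *time.Timer
}

// newHoldState creates the timer in a stopped state so that nothing
// fires until startHoldTimer arms it.
func newHoldState() *holdState {
	t := time.NewTimer(time.Hour)
	t.Stop()
	return &holdState{holdTimer: t}
}

// setHoldTime rejects 1 and 2 seconds; 0 is accepted and means that
// no hold timer will be started.
func (h *holdState) setHoldTime(seconds uint32) bool {
	if seconds > 0 && seconds < 3 {
		return false
	}
	h.holdTime = seconds
	return true
}

// startHoldTimer arms the timer unless the hold time is zero, and
// reports whether the timer was armed.
func (h *holdState) startHoldTimer() bool {
	if h.holdTime == 0 {
		return false
	}
	h.holdTimer.Reset(time.Duration(h.holdTime) * time.Second)
	return true
}

func main() {
	h := newHoldState()
	rejected := h.setHoldTime(2)
	accepted := h.setHoldTime(90)
	armed := h.startHoldTimer()
	h.holdTimer.Stop()
	fmt.Println(rejected, accepted, armed) // false true true
}
```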

Finally, st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm)) simply sets the state machine for the peer that was just sent an open packet to the open state. What happens after the peer reaches the open state? We'll have to wait 'til next week to see how the snaproute BGP code moves from open to open confirmed, and then to established.

snaproute Go BGP Code Dive (8): Moving to Open

Last week we left off with our BGP peer in connect state after looking through this code, around line 261 of fsm.go in snaproute’s Go BGP implementation:

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
  ....
    case BGPEventConnRetryTimerExp:
      st.fsm.StopConnToPeer()
      st.fsm.StartConnectRetryTimer()
      st.fsm.InitiateConnToPeer()
....

What we want to do this week is pick up our BGP peering process, and figure out what the code does next. In this particular case, the next step in the process is fairly simple to find, because it’s just another case in the switch statement in (st *ConnectState) processEvent

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
  st.fsm.StopConnectRetryTimer()
  st.fsm.SetPeerConn(data)
  st.fsm.sendOpenMessage()
  st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime,
    st.fsm.neighborConf.RunningConf.KeepaliveTime)
  st.fsm.StartHoldTimer()
  st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))
....

This looks like the right place—we’re looking at events that occur while in the connect state, and the result seems to be sending an open message. Before we move down this path, however, I’d like to be certain I’m chasing the right call chain, or logical thread. How can I do this? This code is called when (st *ConnectState) processEvent is called with an event called BGPEventTcpCrAcked or BGPEventTcpConnConfirmed. Let’s chase down where these events might come from to see if this really is the next step in the call chain we’re trying to chase.

Note: Sometimes it’s easier to chase from the end result back towards the caller, and sometimes it’s not. There’s no way to know which is which until you have more experience in chasing through code. It takes time and practice to build these sorts of skills up, just like many other skills—but in chasing through code, you’re not only learning the protocols better, you’re also learning how to code better.

To find what we’re looking for, we can search through the project files for some instance of BGPEventTcpCrAcked, which seems to be the result of receiving an ACK for a TCP session initiated by BGP. We find a few places in fsm.go, as always, but most of them are using the event, rather than causing (or throwing) it—

272: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
371: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
475: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
592: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
709: case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:

Until we get to this one—

case inConnCh := <-fsm.inConnCh:

What does this do? This is a little complex, but let’s try to work through it. When starting a new peer, a port was cloned on which to send TCP packets to the peer. Since the port is cloned to a port the main FSM function is watching—(fsm *FSM) StartFSM()—the main FSM function is going to be notified of any inbound TCP packets received on the local device. When one specific sort of packet is received, an acknowledgement in a new TCP session, the main FSM function is called, resulting in case inConnCh := <-fsm.inConnCh: being called. This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked.

If you followed that, you know this verifies what it looked like in the first place—the code above is, in fact, the correct code to process the next phase of peering. The call chain looks something like this—

  • (fsm *FSM) StartFSM() is watching the TCP ports for any new packets
  • When (fsm *FSM) StartFSM() receives a new TCP ACK, it lands in case inConnCh := <-fsm.inConnCh: in the statement that waits on the FSM’s channels
  • This, in turn, calls (st *ConnectState) processEvent with BGPEventTcpCrAcked
  • (st *ConnectState) processEvent matches the case statement case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed, which then calls the correct functions to move beyond connect state
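The chain above can be imitated with a few channels. This is a sketch of the general pattern only: runLoop, runOnce, and the event name string are hypothetical, and the real StartFSM watches more channels than the two shown here.

```go
package main

import "fmt"

type event string

// runLoop waits on its channels the way an FSM main loop might: each
// channel is one source of events, and whichever is ready is translated
// into an FSM event and handed to the current state's handler.
func runLoop(inConnCh <-chan string, closeCh <-chan struct{}, handle func(event, string)) {
	for {
		select {
		case conn := <-inConnCh:
			handle("TcpConnConfirmed", conn)
		case <-closeCh:
			return
		}
	}
}

// runOnce feeds a single connection into the loop and returns what the
// handler saw.
func runOnce(conn string) []string {
	inConnCh := make(chan string)
	closeCh := make(chan struct{})
	done := make(chan struct{})
	var seen []string

	go func() {
		runLoop(inConnCh, closeCh, func(ev event, c string) {
			seen = append(seen, string(ev)+" from "+c)
		})
		close(done)
	}()

	inConnCh <- conn // unbuffered: returns once the loop has received it
	close(closeCh)   // ask the loop to stop
	<-done           // wait until the loop has returned before reading seen
	return seen
}

func main() {
	fmt.Println(runOnce("192.0.2.1:179")) // [TcpConnConfirmed from 192.0.2.1:179]
}
```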

It’s okay if you have to read all of that several times; FSMs (Finite State Machines, remember?) can be very difficult to follow. This means we need to chase down each of these functions to find out how this implementation of BGP actually moves beyond the connect state:

  • st.fsm.StopConnectRetryTimer()
  • st.fsm.SetPeerConn(data)
  • st.fsm.sendOpenMessage()
  • st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime, st.fsm.neighborConf.RunningConf.KeepaliveTime)
  • st.fsm.StartHoldTimer()
  • st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

It’s pretty obvious what StopConnectRetryTimer does—it stops BGP from continuing to try to connect to this peer. Since the peer has acknowledged the initial TCP packet, we shouldn’t keep trying to send it initial TCP packets. SetPeerConn is a bit harder—

func (fsm *FSM) SetPeerConn(data interface{}) {
  if fsm.peerConn != nil {
    return
  }
  pConnDir := data.(PeerConnDir)
  fsm.peerConn = NewPeerConn(fsm, pConnDir.connDir, pConnDir.conn)
  go fsm.peerConn.StartReading()
}

This just does some general logging (which I’ve removed for clarity), and then tells the main process (through the FSM call) to start reading packets off this new peer’s data structure. I’m not going to dive into these functions deeply here.

Next time, we’ll look at the four remaining functions, as these are where the action really is from a BGP perspective.

snaproute Go BGP Code Dive (7): Moving to Connect

In last week’s post, we looked at how snaproute’s implementation of BGP in Go moves into trying to connect to a new peer—we chased down the connectRetryTimer to see what it does, but we didn’t fully work through what the code does when actually moving to connect. To jump back into the code, this is where we stopped—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  switch event {
  ....
    case BGPEventConnRetryTimerExp:
      st.fsm.StopConnToPeer()
      st.fsm.StartConnectRetryTimer()
      st.fsm.InitiateConnToPeer()
....

When the connectRetryTimer timer expires, it is not only restarted, but a new connection to the peer is attempted through st.fsm.InitiateConnToPeer(). This, then, is the next stop on the road to figuring out how this implementation of BGP brings up a peer. Before we get there, though, there’s an oddity here that needs to be addressed. If you look through the BGP FSM code, you will only find this call to initiate a connection to a peer in a few places. There is this call, and then one other call, here—

func (st *ConnectState) enter() {
  ....
  st.fsm.AcceptPeerConn()
  st.fsm.InitiateConnToPeer()
}

The rest of the instances of InitiateConnToPeer() are related to the definition of the function. This raises the question: why wouldn’t you just call this function directly when moving to connect? In other words, why not call it directly, rather than by setting a timer and calling it when the timer wakes up? One of the prime points of coding coherently is to provide consistent entry and exit points into specific states. The more ways you can enter a state within an FSM, the more confusing the FSM gets, the easier it is to make mistakes when modifying the FSM, and the harder it is to troubleshoot problems with the FSM. If you can construct a code path that funnels every way to get into a single state through a single call, the code will ultimately be easier to understand and maintain.

Now let’s look at what st.fsm.InitiateConnToPeer() actually does—

func (fsm *FSM) InitiateConnToPeer() {
  if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast) {
    fsm.logger.Info("Unknown neighbor address")
    return
  }
  remote := net.JoinHostPort(fsm.pConf.NeighborAddress.String(), config.BGPPort)
  local := ""

  if strings.TrimSpace(fsm.pConf.UpdateSource) != "" {
    local = net.JoinHostPort(strings.TrimSpace(fsm.pConf.UpdateSource), "0") 
  }
  if fsm.outTCPConn == nil {
    fsm.outTCPConn = NewOutTCPConn(fsm, fsm.outConnCh, fsm.outConnErrCh)
    go fsm.outTCPConn.ConnectToPeer(fsm.connectRetryTime, remote, local)
  }
}

I’ve removed the logging code for clarity—I’ll be removing the logging code consistently throughout this series.

The first step is to determine if we have a valid, reachable peer IP address. This is taken care of by—

if bytes.Equal(fsm.pConf.NeighborAddress, net.IPv4bcast)

If the neighbor address is the same as the IPv4 limited broadcast address (net.IPv4bcast, which is 255.255.255.255), then we don’t have a valid peer address. At this point, we just log the event and fail the attempt to connect to this peer. If we have a valid address to peer to, we need to build the addresses for the two ends of the TCP connection. net.JoinHostPort combines a host and a port into a single address string; the remote address is the neighbor address combined with the BGP port, and the local address, which is only built if an update source is configured, is the update source combined with port 0, which conventionally leaves the choice of source port to the operating system.

Now that we have the remote and local addresses, we can create the structure that tracks the outbound TCP connection (NewOutTCPConn) and then try to open the peering session (ConnectToPeer).
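The address handling can be tried out in isolation, without opening a connection. In this sketch dialAddrs is a hypothetical function, the neighbor is passed as a string for convenience, and 179 is written out as the well-known BGP port rather than read from a config package.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// dialAddrs returns the remote and local address strings for a peer, or
// ok == false when the neighbor address is unparsable or is the IPv4
// limited broadcast address.
func dialAddrs(neighbor, updateSource string) (remote, local string, ok bool) {
	ip := net.ParseIP(neighbor)
	if ip == nil || ip.Equal(net.IPv4bcast) {
		return "", "", false
	}
	remote = net.JoinHostPort(ip.String(), "179")
	if src := strings.TrimSpace(updateSource); src != "" {
		local = net.JoinHostPort(src, "0") // port 0: let the OS pick the source port
	}
	return remote, local, true
}

func main() {
	remote, local, ok := dialAddrs("192.0.2.1", " 198.51.100.7 ")
	fmt.Println(remote, local, ok) // 192.0.2.1:179 198.51.100.7:0 true
}
```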

You can find the ConnectToPeer code in fsm/conn.go around line 175; the code is somewhat low level, so we won’t spend any time going through it here. Just taking a quick look shows that it essentially calls o.Connect, which then tries to open a new TCP session to the IP address specified.

Assuming this connection is actually opened, we have successfully moved the peer from idle to connect. We’ll tie up some loose ends in the next installment, and then consider the process of moving beyond connect state.

snaproute Go BGP Code Dive (6): Starting a Peer

In our last post on BGP code, we unraveled the call chain snaproute’s Go BGP implementation uses to bring a peer up. Let’s look at this call chain a bit more to see if we can figure out what it actually does—or rather, how it actually works. I’m going to skip the actual beginning of the FSM itself, and just move to the first state, looking at how the FSM is designed to move from state to state. The entire thing kicks off here—

func (st *IdleState) processEvent(event BGPFSMEvent, data interface{}) {
  st.logger.Info(fmt.Sprintln("Neighbor:", st.fsm.pConf.NeighborAddress, "FSM:", st.fsm.id,
    "State: Idle Event:", BGPEventTypeToStr[event]))
  switch event {
    case BGPEventManualStart, BGPEventAutoStart:
      st.fsm.SetConnectRetryCounter(0)
      st.fsm.StartConnectRetryTimer()
      st.fsm.ChangeState(NewConnectState(st.fsm))
  ....
}

What we need to do is chase down each of these three calls to figure out what they actually do. The first is simple: it just sets a retry counter (connectRetryCounter) to 0, indicating we haven’t tried to restart this peer at all. In other words, this is the first attempt to move from idle to a full peering relationship. This counter is primarily used for telemetry, which means it’s a counter used to show you, the user, how many times this peering relationship has been attempted. The second call resets the timer, connectRetryTimer, so it will fire after connectRetryTime seconds:

func (fsm *FSM) StartConnectRetryTimer() {
  fsm.connectRetryTimer.Reset(time.Duration(fsm.connectRetryTime) * time.Second)
}

Looking for what this is normally set to leads us to—

const BGPConnectRetryTime uint16 = 120 // seconds

Let’s chase this retry timer a bit more, to see if we can figure out what happens when it expires (or rather wakes up). The timer itself, as we can see from the timer definition above, is called connectRetryTimer. Searching for this in the code reveals 50 instances, but only one looks like it actually does something. This one instance is in the main BGP FSM function, around line 913 in fsm.go, func (fsm *FSM) StartFSM().

This function is a large switch, a fairly common construction used to react to a large set of events, or to process one of a number of different packet types, TLVs, etc. To understand what’s happening here, we need to spend a minute thinking through what a switch actually does. If you’re looking at the code, you can see it looks something like this (in a general form)—

switch (x) {
  case (1) 
    do something
    return
  case (2)
    do something else
  case (3)
    do this other thing
}

In C and the languages that borrow its switch, the switch statement tells the program to evaluate x against each case statement. The first time it finds a match for x, the code is executed from that point forward, running on into the following cases until something (a return or a break) stops it. This last bit is important to understand; if x==2 in the example above, do something else and do this other thing are both executed. If x==1, only do something is executed, and then the program returns, which just means it returns to the calling function.

If you think this is a bit confusing, it is—so you need to be careful when reading switch statements to make certain you understand where the processing ends.
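The fall-through behavior described above can be reproduced in Go, with one caveat: a Go switch only runs on into the next case when the fallthrough keyword is written explicitly. The function below is an illustrative sketch, not anything from the snaproute code; it records which steps run for a given x.

```go
package main

import "fmt"

// run mimics the general-form switch sketched above. Go continues into the
// next case only where fallthrough is written out.
func run(x int) []string {
	var steps []string
	switch x {
	case 1:
		steps = append(steps, "do something")
		return steps
	case 2:
		steps = append(steps, "do something else")
		fallthrough
	case 3:
		steps = append(steps, "do this other thing")
	}
	return steps
}

func main() {
	fmt.Println(run(1)) // [do something]
	fmt.Println(run(2)) // [do something else do this other thing]
	fmt.Println(run(3)) // [do this other thing]
}
```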

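The snaproute code uses a closely related Go construction, a select inside a for loop, which looks like this (again in a general form):

for {
  select {
    case <-channelA:
      do something
    case <-channelB:
      do something else
      return
  }
}

The format is different, and so are a couple of details: each case in a select waits on a channel rather than comparing a value, and Go runs only the matching case, never continuing into the next one. The for loop around the select simply goes back and waits for the next event. The basic idea is the same, though: find the case that matches and run its code. Don’t let the way languages express logical constructions mess you up when reading code; if you understand the basic sorts of looping and other constructions (which you can learn in any language, pretty much), you can often decipher what any construction intends in any language. This is another one of those rule 11 things.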

We’ve spent a good bit of time just understanding what we’re looking at; now let’s at least look at what this timer expiring (remember, waking up) actually does. Looking at the switch—

for {
  select {
    case <-fsm.connectRetryTimer.C:
      fsm.ProcessEvent(BGPEventConnRetryTimerExp, nil)
....

When this timer expires, it sends on its channel, connectRetryTimer.C, and the select loop in StartFSM, which is waiting on that channel, wakes up and runs the matching bit of code. In this case, the select lands on case <-fsm.connectRetryTimer.C:, which calls ProcessEvent(BGPEventConnRetryTimerExp, nil). There are, as usual, a number of different places where this event is handled, because each state has its own processEvent function. Which one should we look at? Since we are moving from Idle to Connect, we care about what happens if the timer expires while we're in Connect state. We can figure this out by searching for BGPEventConnRetryTimerExp, then scooting up a bit in the code to see what function we're in each time it appears. The one we're interested in is around line 261:

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  st.logger.Info(fmt.Sprintln("Neighbor:", st.fsm.pConf.NeighborAddress, "FSM:", st.fsm.id, "State: Connect Event:",
                 BGPEventTypeToStr[event]))
  switch event {
  ....
    case BGPEventConnRetryTimerExp:
      st.fsm.StopConnToPeer()
      st.fsm.StartConnectRetryTimer()
      st.fsm.InitiateConnToPeer()
  ....

So if the peer is in connect state, and the connect retry timer wakes up (or expires), then the current connection attempt is stopped, the connect retry timer is restarted, and the local BGP process attempts to start the connection over again.
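Pulling the pieces together, the sketch below shows the general shape of a state machine where the same event lands in different code depending on the current state. It is a toy model under stated assumptions: miniFSM, idleState, connectState, and the event names are invented for this example, and the actions are recorded as strings rather than touching timers or sockets. It also ignores anything the real code does when a state is entered.

```go
package main

import "fmt"

type fsmEvent int

const (
	evManualStart fsmEvent = iota
	evConnRetryTimerExp
)

// state is whatever the FSM currently points at; each state handles events
// in its own way.
type state interface {
	processEvent(f *miniFSM, ev fsmEvent)
}

// miniFSM records actions instead of touching the network.
type miniFSM struct {
	current state
	log     []string
}

func (f *miniFSM) do(action string) { f.log = append(f.log, action) }

// processEvent hands the event to whichever state is current.
func (f *miniFSM) processEvent(ev fsmEvent) { f.current.processEvent(f, ev) }

type idleState struct{}

func (idleState) processEvent(f *miniFSM, ev fsmEvent) {
	if ev == evManualStart {
		f.do("reset retry counter")
		f.do("start retry timer")
		f.current = connectState{} // "move the pointer" to the next state
	}
}

type connectState struct{}

func (connectState) processEvent(f *miniFSM, ev fsmEvent) {
	if ev == evConnRetryTimerExp {
		f.do("stop connection attempt")
		f.do("start retry timer")
		f.do("initiate connection")
	}
}

func main() {
	f := &miniFSM{current: idleState{}}
	f.processEvent(evConnRetryTimerExp) // ignored while idle in this sketch
	f.processEvent(evManualStart)
	f.processEvent(evConnRetryTimerExp) // now handled by connectState
	fmt.Println(f.log)
}
```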

*whew*—that's enough for one week of digging around in the code—we've covered a lot of ground here!