snaproute Go BGP Code Dive (12): Moving to Established

In last week’s post, the new BGP peer we’re tracing through the snaproute BGP code moved from opensent to openconfirmed by receiving, and processing, the open message. In processing the open message, the list of AFIs this peer will support was built, and the hold timer was set and started. The next step is to move to established. RFC 4271, around page 70, describes the process as:

If the local system receives a KEEPALIVE message (KeepAliveMsg (Event 26)), the local system:
 - restarts the HoldTimer and
 - changes its state to Established.

In response to any other event (Events 9, 12-13, 20, 27-28), the local system:
 - sends a NOTIFICATION with a code of Finite State Machine Error,
 - sets the ConnectRetryTimer to zero,
 - releases all BGP resources,
 - drops the TCP connection,
 - increments the ConnectRetryCounter by 1,
 - (optionally) performs peer oscillation damping if the DampPeerOscillations attribute is set to TRUE, and
 - changes its state to Idle.
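
The two paragraphs quoted above can be read as a small transition table. What follows is a minimal sketch in Go, not snaproute's code: the function openConfirmNext and the State type are hypothetical names chosen for illustration, the event numbers are the ones RFC 4271 assigns, and the optional peer oscillation damping step is left out. Events the quoted text does not cover are reported as unhandled rather than guessed at.

```go
package main

import "fmt"

// State is a hypothetical type for this sketch; the names follow RFC 4271.
type State string

const (
	Idle        State = "Idle"
	OpenConfirm State = "OpenConfirm"
	Established State = "Established"
)

// EventKeepAliveMsg is Event 26 in RFC 4271.
const EventKeepAliveMsg = 26

// fsmErrorEvents holds the events the quoted text treats as FSM errors
// while the peer is in OpenConfirm: 9, 12-13, 20, 27-28.
var fsmErrorEvents = map[int]bool{9: true, 12: true, 13: true, 20: true, 27: true, 28: true}

// openConfirmNext returns the next state and the actions to take for one
// event received in OpenConfirm. handled is false for any event the quoted
// text does not describe.
func openConfirmNext(event int) (next State, actions []string, handled bool) {
	switch {
	case event == EventKeepAliveMsg:
		return Established, []string{"restart HoldTimer"}, true
	case fsmErrorEvents[event]:
		return Idle, []string{
			"send NOTIFICATION (Finite State Machine Error)",
			"set ConnectRetryTimer to zero",
			"release all BGP resources",
			"drop the TCP connection",
			"increment ConnectRetryCounter",
		}, true
	default:
		return OpenConfirm, nil, false
	}
}

func main() {
	for _, ev := range []int{26, 27, 10} {
		next, actions, handled := openConfirmNext(ev)
		fmt.Println(ev, next, handled, actions)
	}
}
```

Writing the transition as a pure function makes it easy to compare against the RFC text line by line; the real code interleaves these actions with timers and sockets, as the excerpts later in this post show.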

For a bit of review (because this series is running so long, you might forget how the state machine works), the snaproute code is written as a state machine. The BGP peer must go through a series of states, each state being represented by its own processEvent function in the fsm.go file. As the peer moves from one state to another, a function call “moves the pointer” from the current state to the next one, so that any event which occurs will call a different function, based on the current state. I know this is rather difficult to follow, but what this means, in practical terms, is that if the underlying TCP session is acknowledged or confirmed while the peer is in connect state, the following code from around line 272 in fsm.go is executed:

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
 st.fsm.StopConnectRetryTimer()
 st.fsm.SetPeerConn(data)
 st.fsm.sendOpenMessage()
 st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime,
  st.fsm.neighborConf.RunningConf.KeepaliveTime)
 st.fsm.StartHoldTimer()
 st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))

However, if this same event occurs (the underlying TCP session is acknowledged or confirmed) while the peer is in openconfirm state, a different set of code is executed, from around line 593 in fsm.go:

case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
 st.fsm.HandleAnotherConnection(data)
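
The two excerpts above show one event reaching two different handlers. Here is a minimal, self-contained sketch of that pattern, under the assumption that each state is a type with its own processEvent method and that the FSM holds a reference to the current one. The types below (Event, State, FSM, stub) are trimmed-down stand-ins invented for this example; they are not the definitions in fsm.go, and the actions are recorded as strings rather than performed.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for BGPFSMEvent.
type Event int

const (
	TcpConnConfirmed Event = iota
	KeepAliveMsg
)

// State is implemented once per FSM state; each has its own processEvent.
type State interface {
	Name() string
	processEvent(f *FSM, ev Event)
}

// FSM holds the "pointer" to the current state plus a log of actions taken.
type FSM struct {
	state State
	log   []string
}

// ChangeState moves the pointer, so later events reach a different handler.
func (f *FSM) ChangeState(s State) { f.state = s }

// ProcessEvent dispatches to whichever state is current.
func (f *FSM) ProcessEvent(ev Event) { f.state.processEvent(f, ev) }

// stub stands in for states whose handlers are left out of this sketch.
type stub struct{ name string }

func (s stub) Name() string           { return s.name }
func (stub) processEvent(*FSM, Event) {}

type ConnectState struct{}

func (ConnectState) Name() string { return "Connect" }
func (ConnectState) processEvent(f *FSM, ev Event) {
	if ev == TcpConnConfirmed {
		f.log = append(f.log, "send OPEN")
		f.ChangeState(stub{"OpenSent"})
	}
}

type OpenConfirmState struct{}

func (OpenConfirmState) Name() string { return "OpenConfirm" }
func (OpenConfirmState) processEvent(f *FSM, ev Event) {
	switch ev {
	case TcpConnConfirmed:
		// Same event as above, different action, no state change.
		f.log = append(f.log, "handle another connection")
	case KeepAliveMsg:
		f.log = append(f.log, "start hold timer")
		f.ChangeState(stub{"Established"})
	}
}

func main() {
	a := &FSM{state: ConnectState{}}
	a.ProcessEvent(TcpConnConfirmed)
	fmt.Println(a.state.Name(), a.log)

	b := &FSM{state: OpenConfirmState{}}
	b.ProcessEvent(TcpConnConfirmed)
	fmt.Println(b.state.Name(), b.log)
}
```

The caller never checks which state the peer is in; it hands the event to the FSM, and the current state object decides what to do with it.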

This is a general characteristic of any FSM: the event is matched against the current state to determine what action to take next. With all of this in mind, any event received while the peer is in openconfirm state will be processed by func (st *OpenConfirmState) processEvent, which is around line 558 in fsm.go. This code consists of a switch statement, which looks like this:

func (st *OpenConfirmState) processEvent(event BGPFSMEvent, data interface{}) {
 switch event {
  case BGPEventManualStop:
   ....
  case BGPEventAutoStop:
   ....
  case BGPEventHoldTimerExp:
   ....
  case BGPEventKeepAliveTimerExp:
   ....
  case BGPEventTcpConnValid: // Supported later
  case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed: // Collision Detection... needs work
   ....
  case BGPEventTcpConnFails, BGPEventNotifMsg:
   ....
  case BGPEventBGPOpen: // Collision Detection... needs work
  case BGPEventHeaderErr, BGPEventOpenMsgErr:
   ....
  case BGPEventOpenCollisionDump:
   ....
  case BGPEventNotifMsgVerErr:
   ....
  case BGPEventKeepAliveMsg:
   .... 
  case BGPEventConnRetryTimerExp, BGPEventDelayOpenTimerExp, BGPEventIdleHoldTimerExp,
   ....
  }
}

I’ve cut out the actions taken in each case to make it easier to see the structure of the entire switch statement in one sweep. Most of these options are actually error conditions that take exactly the same steps. Let’s look at one to see what it does—

case BGPEventHoldTimerExp:
 st.fsm.SendNotificationMessage(packet.BGPHoldTimerExpired, 0, nil)
 st.fsm.StopConnectRetryTimer()
 st.fsm.ClearPeerConn()
 st.fsm.StopConnToPeer()
 st.fsm.IncrConnectRetryCounter()
 st.fsm.ChangeState(NewIdleState(st.fsm))

If the hold timer expires while the peer is in openconfirmed state—

  • A notification is sent by SendNotificationMessage; this will tell the peer that the session is being torn down, so the two speakers can have synchronized state
  • The connect retry timer is stopped, so the local BGP speaker will not try to reconnect until the peer has passed through the idle state; this prevents any problems that might result from stepping outside the BGP state machine
  • The peer connection is cleared; this just empties the various data structures associated with the peer, so old information isn’t carried into a new peering session
  • The peering connection is stopped by StopConnToPeer
  • The connection retry counter is incremented, which allows the operator to see how many times this peer has been torn down and restarted
  • The state of the peer is changed to idle

This set of actions only changes slightly from state to state; if you search for this set of steps, you’re likely to find it at least a few dozen times throughout fsm.go.
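
To make the order of the teardown steps concrete, here is a small sketch that replays them against a fake FSM which only records what was called. The peerFSM type and the teardown helper are hypothetical; snaproute repeats these calls inline in each case rather than through a helper like this one, and the real methods act on timers and sockets instead of appending to a slice. The notification code is passed as a plain string here, which is also an assumption of the sketch.

```go
package main

import "fmt"

// peerFSM is a hypothetical stand-in that records calls instead of touching
// timers or sockets; the method names mirror the excerpt above.
type peerFSM struct {
	calls        []string
	retryCounter int
	state        string
}

func (f *peerFSM) record(name string) { f.calls = append(f.calls, name) }

func (f *peerFSM) SendNotificationMessage(code string) { f.record("notify:" + code) }
func (f *peerFSM) StopConnectRetryTimer()              { f.record("stop-connect-retry-timer") }
func (f *peerFSM) ClearPeerConn()                      { f.record("clear-peer-conn") }
func (f *peerFSM) StopConnToPeer()                     { f.record("stop-conn-to-peer") }

func (f *peerFSM) IncrConnectRetryCounter() {
	f.retryCounter++
	f.record("incr-connect-retry-counter")
}

func (f *peerFSM) ChangeState(s string) {
	f.state = s
	f.record("state:" + s)
}

// teardown runs the six steps in the same order as the hold timer case.
func teardown(f *peerFSM, code string) {
	f.SendNotificationMessage(code) // tell the peer why the session is ending
	f.StopConnectRetryTimer()       // no reconnect attempt until after idle
	f.ClearPeerConn()               // drop per-connection data
	f.StopConnToPeer()              // stop the peering connection
	f.IncrConnectRetryCounter()     // visible to the operator
	f.ChangeState("Idle")
}

func main() {
	f := &peerFSM{state: "OpenConfirm"}
	teardown(f, "HoldTimerExpired")
	for _, c := range f.calls {
		fmt.Println(c)
	}
}
```

Note that the notification goes out first, while the connection to the peer is still up; the connection is only cleared and stopped afterwards.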

There is one other interesting point about this code worth mentioning. The folks at snaproute apparently haven’t implemented peer collision detection, as evidenced by the comments in the code itself. For instance—


  case BGPEventTcpConnValid: // Supported later
  case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed: // Collision Detection... needs work
   ....
  case BGPEventTcpConnFails, BGPEventNotifMsg:
   ....
  case BGPEventBGPOpen: // Collision Detection... needs work

Each of these three events (a TCP connection being acknowledged or confirmed for a peer that is already in openconfirmed state, or an open message arriving from a peer that is already in openconfirmed state) represents an event that should not normally take place. What should the snaproute code do here? Section 6.8 of RFC 4271 says:

Unless allowed via configuration, a connection collision with an existing BGP connection that is in the Established state causes closing of the newly created connection.
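
The sentence quoted above covers a peer that is already in Established. For a peer that is still in OpenConfirm, the same section of the RFC resolves the collision by comparing the two BGP Identifiers as 4-octet unsigned integers: if the local identifier is lower, the existing connection is closed and the new one is kept; otherwise the new connection is closed. A minimal sketch of just that comparison follows. The function names idToUint32 and closeExisting are hypothetical and do not come from snaproute, and the identifiers are assumed to be given in dotted-quad form.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// idToUint32 converts a dotted-quad BGP Identifier into a 4-octet unsigned
// integer so two identifiers can be compared numerically.
func idToUint32(id string) (uint32, error) {
	ip := net.ParseIP(id).To4()
	if ip == nil {
		return 0, fmt.Errorf("invalid BGP identifier %q", id)
	}
	return binary.BigEndian.Uint32(ip), nil
}

// closeExisting reports whether the connection already in OpenConfirm should
// be closed in favour of the newly created one. Per RFC 4271 section 6.8,
// that is the case only when the local identifier is lower than the remote.
func closeExisting(localID, remoteID uint32) bool {
	return localID < remoteID
}

func main() {
	local, err := idToUint32("192.0.2.1")
	if err != nil {
		fmt.Println(err)
		return
	}
	remote, err := idToUint32("192.0.2.2")
	if err != nil {
		fmt.Println(err)
		return
	}
	if closeExisting(local, remote) {
		fmt.Println("close the existing connection, keep the new one")
	} else {
		fmt.Println("close the new connection, keep the existing one")
	}
}
```

Because both speakers run the same comparison on the same pair of identifiers, they agree on which of the two connections survives without exchanging anything further.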

So when they eventually fill this bit of code in, you can be pretty certain what the actual code will do: it will tear down a connection in a way that’s similar to the other error code already present. The bit of code that’s interesting in the context of moving from openconfirmed to established is around line 627 in fsm.go:

case BGPEventKeepAliveMsg:
 st.fsm.StartHoldTimer()
 st.fsm.ChangeState(NewEstablishedState(st.fsm))

The actual processing to move from openconfirmed to established is simple: if the local peer receives a keepalive message while in the openconfirmed state, the hold timer is started again and the peer is moved to established.

As we’ve reached established state, the next step is to understand how updates are received and processed for this new peer.