snaproute Go BGP Code Dive (5): Starting a Peer

Last time we looked at the snaproute BGP code, we discovered the peer bringup process is a finite state machine. With this in mind, let’s try to unravel the state machine into a set of calls, beginning from our original starting point, a debug message that prints on the screen when a new peering relationship is established. The key word in the debug message was ConnEstablished, which led to:

func (fsm *FSM) ConnEstablished() {
  fsm.logger.Info(fmt.Sprintln("Neighbor:", fsm.pConf.NeighborAddress, "FSM", fsm.id, "ConnEstablished - start"))
  fsm.Manager.fsmEstablished(fsm.id, fsm.peerConn.conn)
  fsm.logger.Info(fmt.Sprintln("Neighbor:", fsm.pConf.NeighborAddress, "FSM", fsm.id, "ConnEstablished - end"))
}

From here, we searched for calls to ConnEstablished, and found—

func (fsm *FSM) ChangeState(newState BaseStateIface) {
...
  if oldState == BGPFSMEstablished && fsm.State.state() != BGPFSMEstablished {
    fsm.ConnBroken()
  } else if oldState != BGPFSMEstablished && fsm.State.state() == BGPFSMEstablished {
    fsm.ConnEstablished()
  }
}
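
The check in ChangeState only fires on the edge into or out of the established state, not on every state change. As a minimal sketch of that idea (the toyFSM type, its log field, and the three state constants are hypothetical stand-ins, not the snaproute types), something like this shows the behavior:

```go
package main

import "fmt"

// State is a hypothetical stand-in for the BGPFSM* state constants.
type State int

const (
	Idle State = iota
	OpenConfirm
	Established
)

// toyFSM is a stripped-down, hypothetical FSM. It records which
// callback would have run instead of calling into a manager.
type toyFSM struct {
	state State
	log   []string
}

// changeState records a callback only when the Established boundary
// is crossed, mirroring the if/else-if in the excerpt above.
func (f *toyFSM) changeState(next State) {
	old := f.state
	f.state = next
	if old == Established && next != Established {
		f.log = append(f.log, "ConnBroken")
	} else if old != Established && next == Established {
		f.log = append(f.log, "ConnEstablished")
	}
}

func main() {
	f := &toyFSM{state: Idle}
	f.changeState(OpenConfirm) // no callback: Established not involved
	f.changeState(Established) // entering Established
	f.changeState(Established) // no callback: already Established
	f.changeState(Idle)        // leaving Established
	fmt.Println(f.log)         // prints [ConnEstablished ConnBroken]
}
```

This is why searching for ConnEstablished leads to exactly one place: every transition funnels through the same function, and the callback depends only on the old and new states.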

Looking for ChangeState leads us to a lot of different calls, but only one appears to relate to establishing a new peer: the call that passes in a newly created established state. This, in turn, leads to:

func (st *OpenConfirmState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
  ...
  case BGPEventKeepAliveMsg:
    st.fsm.StartHoldTimer()
    st.fsm.ChangeState(NewEstablishedState(st.fsm))
  ...
}

…and hence to processEvent, for which there are a ton of calls. The name of the function, however, and the way it’s called, imply we’ve just landed on the tail end of a finite state machine (FSM), which we can trace back by looking at the pointer receiver in front of processEvent. Let’s trace in more detail from here. First, we need to look for OpenConfirmState, which is the receiver type in the function above. What’s actually happening here is that when a peer is in OpenConfirm and some event occurs, func (st *OpenConfirmState) processEvent(event BGPFSMEvent, ... is called to handle the event. What we want to do is figure out how the state machine gets to OpenConfirm. Searching for OpenConfirmState yields a number of calls, but the interesting one is:

func (st *OpenSentState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
  case BGPEventBGPOpen:
    ....
    st.fsm.ChangeState(NewOpenConfirmState(st.fsm))
  }
}
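
The pattern in these excerpts, where each state is its own type with its own processEvent method, can be sketched in isolation. This is a minimal illustration only: the machine type, the lowercase state types, and the two event constants are hypothetical, and the real code does considerably more work in each case.

```go
package main

import "fmt"

// event is a hypothetical stand-in for BGPFSMEvent.
type event int

const (
	evOpen event = iota
	evKeepAlive
)

// state is what every state type must implement; the machine only
// ever talks to the current state through this interface.
type state interface {
	name() string
	processEvent(e event)
}

// machine holds the current state (hypothetical, not the snaproute FSM).
type machine struct{ cur state }

func (m *machine) changeState(s state) { m.cur = s }

type openSent struct{ m *machine }

func (s *openSent) name() string { return "OpenSent" }
func (s *openSent) processEvent(e event) {
	if e == evOpen {
		s.m.changeState(&openConfirm{m: s.m})
	}
}

type openConfirm struct{ m *machine }

func (s *openConfirm) name() string { return "OpenConfirm" }
func (s *openConfirm) processEvent(e event) {
	if e == evKeepAlive {
		s.m.changeState(&established{m: s.m})
	}
}

type established struct{ m *machine }

func (s *established) name() string         { return "Established" }
func (s *established) processEvent(e event) {} // nothing to do in this sketch

func main() {
	m := &machine{}
	m.cur = &openSent{m: m}
	for _, e := range []event{evOpen, evKeepAlive} {
		m.cur.processEvent(e) // dispatches on the current state's type
		fmt.Println(m.cur.name())
	}
}
```

Because the same event is handled differently depending on which type currently sits in the machine, searching for a state's name is an effective way to find the transitions into it.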

To back up one more step in the FSM, we need to search for OpenSentState. This again produces a number of results, but the interesting one is—

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
  ....
  case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
    ....
    st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))
  }
}

To back up one more step, we need to find ConnectState, which lands here—

func (st *IdleState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
    case BGPEventManualStart, BGPEventAutoStart:
      st.fsm.SetConnectRetryCounter(0)
      st.fsm.StartConnectRetryTimer()
      st.fsm.ChangeState(NewConnectState(st.fsm))

Then we search for IdleState, which lands here, in the transition a manual stop triggers from the connect state back to idle:

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
    case BGPEventManualStop:
      st.fsm.StopConnToPeer()
      st.fsm.SetConnectRetryCounter(0)
      st.fsm.StopConnectRetryTimer()
      st.fsm.ChangeState(NewIdleState(st.fsm))

This transition only loops back to idle, so to back up one more step we need to search for ConnectState again, which has two results of interest. The first is:

func (st *ActiveState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
    ....
    case BGPEventConnRetryTimerExp:
      st.fsm.StartConnectRetryTimer()
      st.fsm.ChangeState(NewConnectState(st.fsm))

This call moves into the connect state when the connect retry timer expires, so it isn’t going to take us any further back toward the start of the process “from nothing.” We need to look at the other call:

func (st *IdleState) processEvent(event BGPFSMEvent, data interface{}) {
  ....
  switch event {
    case BGPEventManualStart, BGPEventAutoStart:
      st.fsm.SetConnectRetryCounter(0)
      st.fsm.StartConnectRetryTimer()
      st.fsm.ChangeState(NewConnectState(st.fsm))

So we need to look for IdleState. Here we run into another, similar, situation—IdleState is called from a lot of different places. Again, what we want is the call that moves us from something prior to idle state, rather than a call that deals with errors. As it turns out, there is only one such call—

func (fsm *FSM) StartFSM() {

  if fsm.State == nil {
    fsm.logger.Info(fmt.Sprintln("Neighbor:", fsm.pConf.NeighborAddress, "FSM:", fsm.id,
    "Start state is not set... starting the state machine in IDLE state"))
    fsm.State = NewIdleState(fsm)
  }

  fsm.State.enter()

This is not like the other calls we’ve been chasing, in that it uses neither ChangeState nor a processEvent method on a state receiver. Rather, this looks like the actual starting point of the FSM itself, entered when a new neighbor is discovered (something we won’t be dealing with in this series).
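
The nil check in StartFSM amounts to "if nobody has set a state yet, begin in idle, then run that state's entry hook." A minimal sketch of just that step, assuming hypothetical machine and idle types (the entered flag exists only so the effect is visible):

```go
package main

import "fmt"

// state is a hypothetical interface with only the entry hook.
type state interface {
	name() string
	enter()
}

// idle records that its entry hook ran.
type idle struct{ entered bool }

func (s *idle) name() string { return "Idle" }
func (s *idle) enter()       { s.entered = true }

// machine is a hypothetical stand-in for the FSM struct.
type machine struct{ cur state }

// start defaults to the idle state when none is set, then enters it.
func (m *machine) start() {
	if m.cur == nil {
		m.cur = &idle{}
	}
	m.cur.enter()
}

func main() {
	m := &machine{}
	m.start()
	fmt.Println(m.cur.name())
}
```

Note that the state is assigned directly here rather than through a change-state call, which matches why this entry point does not show up when searching for ChangeState.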

Now that we’ve unwound the call chain, we can work back through it and build a simpler version with just raw function calls to understand the process of bringing up a peer. Starting from the beginning (and hence reversing the order in which we’ve discovered the call chain)—

  • StartFSM()
  • func (st *IdleState) processEvent()
    • st.fsm.SetConnectRetryCounter(0)
    • st.fsm.StartConnectRetryTimer()
    • st.fsm.ChangeState(NewConnectState(st.fsm))
  • func (st *ConnectState) processEvent()
    • st.fsm.StopConnToPeer()
    • st.fsm.SetConnectRetryCounter(0)
    • st.fsm.StopConnectRetryTimer()
    • st.fsm.ChangeState(NewIdleState(st.fsm))
  • func (st *IdleState) processEvent()
    • st.fsm.SetConnectRetryCounter(0)
    • st.fsm.StartConnectRetryTimer()
    • st.fsm.ChangeState(NewConnectState(st.fsm))
  • func (st *ConnectState) processEvent()
    • st.fsm.StopConnectRetryTimer()
    • st.fsm.SetPeerConn(data)
    • st.fsm.sendOpenMessage()
    • st.fsm.SetHoldTime(st.fsm.neighborConf.RunningConf.HoldTime, st.fsm.neighborConf.RunningConf.KeepaliveTime)
    • st.fsm.StartHoldTimer()
    • st.BaseState.fsm.ChangeState(NewOpenSentState(st.BaseState.fsm))
  • func (st *OpenSentState) processEvent()
    • st.fsm.StopConnectRetryTimer()
    • bgpMsg := data.(*packet.BGPMessage)
    • if st.fsm.ProcessOpenMessage(bgpMsg) {
      st.fsm.sendKeepAliveMessage()
      st.fsm.StartHoldTimer()
      st.fsm.ChangeState(NewOpenConfirmState(st.fsm))
      }
  • func (st *OpenConfirmState) processEvent()
    • st.fsm.StartHoldTimer()
    • st.fsm.ChangeState(NewEstablishedState(st.fsm))
  • func (fsm *FSM) ChangeState()
  • func (fsm *FSM) ConnEstablished()
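
The chain above can also be written as a transition table. The sketch below is a simplification under stated assumptions: the string-typed state and event names are shortened, hypothetical versions of the identifiers in the excerpts, only the transitions shown in this post are listed, and the OpenSent step is treated as unconditional even though the real code first checks ProcessOpenMessage.

```go
package main

import (
	"fmt"
	"strings"
)

// state and event are hypothetical string types used for readability.
type state string
type event string

// transitions lists only the moves seen in the excerpts in this post.
var transitions = map[state]map[event]state{
	"Idle":        {"ManualStart": "Connect", "AutoStart": "Connect"},
	"Connect":     {"TcpCrAcked": "OpenSent", "TcpConnConfirmed": "OpenSent", "ManualStop": "Idle"},
	"Active":      {"ConnRetryTimerExp": "Connect"},
	"OpenSent":    {"BGPOpen": "OpenConfirm"}, // real code also requires ProcessOpenMessage to succeed
	"OpenConfirm": {"KeepAliveMsg": "Established"},
}

// walk applies events in order and returns every state visited.
func walk(start state, events []event) []string {
	path := []string{string(start)}
	cur := start
	for _, e := range events {
		next, ok := transitions[cur][e]
		if !ok {
			continue // events with no listed transition are ignored in this sketch
		}
		cur = next
		path = append(path, string(cur))
	}
	return path
}

func main() {
	happy := []event{"ManualStart", "TcpConnConfirmed", "BGPOpen", "KeepAliveMsg"}
	fmt.Println(strings.Join(walk("Idle", happy), " -> "))
	// prints Idle -> Connect -> OpenSent -> OpenConfirm -> Established
}
```

Feeding in ManualStart, ManualStop, ManualStart reproduces the Idle, Connect, Idle, Connect detour that appears in the list.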

In the next post in this series, we’ll start looking at what some of these functions do—and then we’ll look at some of the “byways” representing error conditions to see how they work.