snaproute Go BGP Code Dive (6): Starting a Peer

In our last post on BGP code, we unraveled the call chain snaproute’s Go BGP implementation uses to bring a peer up. Let’s look at this call chain a bit more to see if we can figure out what it actually does—or rather, how it actually works. I’m going to skip the actual beginning of the FSM itself, and just move to the first state, looking at how the FSM is designed to move from state to state. The entire thing kicks off here—

func (st *IdleState) processEvent(event BGPFSMEvent, data interface{}) {
  st.logger.Info(fmt.Sprintln("Neighbor:", st.fsm.pConf.NeighborAddress, "FSM:", st.fsm.id,
    "State: Idle Event:", BGPEventTypeToStr[event]))
  switch event {
  case BGPEventManualStart, BGPEventAutoStart:
    st.fsm.SetConnectRetryCounter(0)
    st.fsm.StartConnectRetryTimer()
    st.fsm.ChangeState(NewConnectState(st.fsm))
....
}

What we need to do is chase down each of these three calls to figure out what they actually do. The first is simple: it sets a retry counter (connectRetryCounter) to 0, indicating we haven’t yet tried to restart this peer. In other words, this is the first attempt to move from idle to a full peering relationship. This counter is primarily used for telemetry; it shows you, the user, how many times this peering relationship has been attempted. The second call resets connectRetryTimer so it will fire after connectRetryTime seconds:

func (fsm *FSM) StartConnectRetryTimer() {
  fsm.connectRetryTimer.Reset(time.Duration(fsm.connectRetryTime) * time.Second)
}

Looking for what this is normally set to leads us to—

const BGPConnectRetryTime uint16 = 120 // seconds

Let’s chase this retry timer a bit more, to see if we can figure out what happens when it expires (or rather wakes up). The timer itself, as we can see from the timer definition above, is called connectRetryTimer. Searching for this in the code reveals 50 instances, but only one looks like it actually does something. This one instance is in the main BGP FSM function, around line 913 in fsm.go, func (fsm *FSM) StartFSM().

This function is a large switch, a fairly common construction used to react to a large set of events, or to process one of a number of different packet types, TLVs, etc. To understand what’s happening here, we need to spend a minute thinking through what a switch actually does. If you’re looking at the code, you can see it looks something like this (in a general form)—

switch (x) {
  case (1):
    do something
    return
  case (2):
    do something else
  case (3):
    do this other thing
}

In C and similar languages, the switch statement tells the program to evaluate x against each case statement. The first time it finds a match for x, the code is executed from that point forward. This last bit is important to understand; if x==2 in the example above, do something else and do this other thing are both executed, because nothing stops execution from falling through into the next case. If x==1, only do something is executed, and then the program returns, which just means it returns to the calling function.

If you think this is a bit confusing, it is—so you need to be careful when reading switch statements to make certain you understand where the processing ends.

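Go handles this a little differently. A Go switch does not fall through: only the matching case runs, unless that case ends with an explicit fallthrough statement. The construction in StartFSM is also not strictly a switch, but a select wrapped in an endless for loop. A select looks like a switch, but each case is a channel operation rather than a value to compare against; the select waits until one of those channels is ready, then runs that one case. In general form:

for {
  select {
    case v := <-channelA:
      do something with v
    case <-channelB:
      do something else
      return
  }
}

The format is different, but the idea of picking one branch out of many is the same. Don’t let the way languages express logical constructions mess you up when reading code; if you understand the basic sorts of looping and other constructions (which you can learn in any language, pretty much), you can often decipher what any construction intends in any language. This is another one of those rule 11 things.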

We’ve spent a good bit of time just understanding what we’re looking at; now let’s at least look at what this timer expiring (remember, waking up) actually does. Looking at the switch—

for {
  select {
    case <-fsm.connectRetryTimer.C:
      fsm.ProcessEvent(BGPEventConnRetryTimerExp, nil)
....

When this timer expires, it sends a value on its channel, connectRetryTimer.C, and the loop in StartFSM picks it up. The select lands on case <-fsm.connectRetryTimer.C:, which calls ProcessEvent(BGPEventConnRetryTimerExp, nil). There are, as usual, a number of different places in the code that deal with this event. Which one should we look at? Since we have moved from Idle to Connect, we're going to care about what happens if we're in the Connect state. We can figure this out by looking at each place BGPEventConnRetryTimerExp shows up, then scooting up a bit in the code to see what function we're in. The one we're interested in is around line 261:

func (st *ConnectState) processEvent(event BGPFSMEvent, data interface{}) {
  st.logger.Info(fmt.Sprintln("Neighbor:", st.fsm.pConf.NeighborAddress, "FSM:", st.fsm.id, "State: Connect Event:",
                 BGPEventTypeToStr[event]))
  switch event {
  ....
    case BGPEventConnRetryTimerExp:
      st.fsm.StopConnToPeer()
      st.fsm.StartConnectRetryTimer()
      st.fsm.InitiateConnToPeer()
  ....

So if the peer is in the Connect state, and the connect retry timer wakes up (or expires), then the connection process is stopped, the connect retry timer is restarted, and the local BGP process attempts to start the connection over again.

*whew*—that's enough for one week of digging around in the code—we've covered a lot of ground here!