snaproute Go BGP Code Dive (4): Starting a Peer

In the last three episodes of this series, we discussed getting a copy of SnapRoute’s BGP code using Git, we looked at the basic structure of the project, and then we did some general housekeeping. At this point, I’m going to assume you have the tools you need installed, and you’re ready to follow along as we ask the code questions about how BGP actually works.

Now, let’s start with a simple question: how does BGP bring a new peer up?

It seems like we should be able to look for some file that’s named something with peering in it, but, looking at the files in the project, there doesn’t seem to be any such thing.

ls-go-bgp

Hmmm… Given it’s not obvious where to start, what do we do? There are a number of options, of course, but three options stand out as the easiest.

First, you can just poke around the code for a bit to see if you find anything that looks like it might be what you’re looking for. This is not, generally, for the faint of heart. Over time, as you become more familiar with the way coders tend to structure things, this method will work more often than not, but for now, let’s look for an easier way to find a starting point.

Second, you could compile the code, setting breakpoints in the TCP code just before TCP sends packets off to the correct process for handling, fire a BGP packet at the box using a simulator, and see where the code actually takes you. This also isn’t for the faint of heart, so let’s see if we can think of something simpler.

Third, you can do it the way I normally do. Run the code as a process and turn on debugging, generally in an emulator so you can connect peers, etc. Now, connect a peer, and capture a few debug messages. Shut everything down and return to your code directory, then search the code for one or more of the debug messages you just captured. This should lead you to the code you’re looking for. In this case, I run into a message that looks something like this—

Neighbor: x.x.x.x FSM xx ConnEstablished - start

There is one word in here that is pretty odd, ConnEstablished, and hence it will probably yield results if I search for it. I’m going to resort to Atom here, as I don’t want to get into grep on the command line, but a search across the entire project shows me:

go-bgp-conn-est

Hmmm… The most interesting of these is the one where the message is actually printed on the console, which is line 1319 in fsm.go. Popping into this file, I find the following—

func (fsm *FSM) ConnEstablished() {
  fsm.logger.Info(fmt.Sprintln("Neighbor:", fsm.pConf.NeighborAddress, "FSM", fsm.id, "ConnEstablished - start"))
  fsm.Manager.fsmEstablished(fsm.id, fsm.peerConn.conn)
  fsm.logger.Info(fmt.Sprintln("Neighbor:", fsm.pConf.NeighborAddress, "FSM", fsm.id, "ConnEstablished - end"))
}

Now if I find where this function is called, I can find out where, in the code, neighbors are actually established. This process of tracing back from the end point to figure out what’s actually happening can be a bit tedious, but until you’re more familiar with the basic structure of the code, it’s often the only choice you’re going to have.

As it turns out, the name of the function we need to find is ConnEstablished. We can repeat our original search to find out where this function is actually called (see the image above, as it’s the same search). There is only one call to this function, found in the same file—

func (fsm *FSM) ChangeState(newState BaseStateIface) {
  ....
  } else if oldState != BGPFSMEstablished && fsm.State.state() == BGPFSMEstablished {
    fsm.ConnEstablished()
  }
}

You might notice there are a number of calls to PeerConnEstablished, as well—and we would simplify our search by jumping directly to that call—but for the moment let’s take the long way around by tracing back one step at a time.

Looking at the code, we find there are a number of calls to ChangeState, but the one that’s interesting is here—

st.fsm.ChangeState(NewEstablishedState(st.fsm))

Which is around line 611 in fsm.go, for those who are trying to follow along. This particular call is interesting among all the other calls because it is the only one that mentions the state we’re looking for, established state. We can figure out where to look next by going to the top of the function in which this line of code is called, which is—

func (st *OpenConfirmState) processEvent(event BGPFSMEvent, data interface{}) {
  st.logger.Info(fmt.Sprintln("Neighbor:", st.fsm.pConf.NeighborAddress, "FSM:", st.fsm.id,
    "State: OpenConfirm Event:", BGPEventTypeToStr[event]))

  ....

Now we’ve run into something odd—the function name is literally processEvent. This seems a little generic. In fact, if we search the code for processEvent, we’re going to find hundreds of instances of this function call. It looks like we’re lost in the weeds, doesn’t it? Not necessarily…

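If you’ll notice, just before the function name, there’s a set of parentheses containing (st *OpenConfirmState). In Go this is a method receiver: it attaches processEvent to the OpenConfirmState type and hands the function a pointer to the state it is working on. This is, in fact, what I would call in C a call by reference, something that’s rather common in building a finite state machine like this in code. Let me explain…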
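To make the (st *OpenConfirmState) part concrete, here is a minimal sketch of a Go method with a pointer receiver. The struct field and the event string are hypothetical stand-ins for illustration only; the real type in fsm.go carries different data and takes different arguments.

```go
package main

import "fmt"

// OpenConfirmState is a hypothetical stand-in for the state type in fsm.go;
// the eventsSeen field is invented for this example.
type OpenConfirmState struct {
	eventsSeen int
}

// processEvent is a method. The receiver (st *OpenConfirmState) binds it to
// the type, and because st is a pointer, changes made here are visible to
// the caller.
func (st *OpenConfirmState) processEvent(event string) {
	st.eventsSeen++
	fmt.Println("OpenConfirm handling", event)
}

func main() {
	st := &OpenConfirmState{}
	st.processEvent("KeepAlive")
	st.processEvent("KeepAlive")
	fmt.Println("events seen:", st.eventsSeen)
}
```

Because the receiver is part of the method's identity, several types can each carry a method named processEvent without clashing, which is why a plain text search for the name returns so many hits.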

A finite state machine is normally a flow chart that shows each possible state the system can be in, how it can enter that state, and how it can exit the state. Sometimes this FSM is represented in text form, where the state is listed, possible inputs are listed, and the resulting state is given for each possible input in this particular state. For instance, the BGP specification contains such an FSM, as shown below:

8.1.4.  TCP Connection-Based Events

Event 14: TcpConnection_Valid

Definition: Event indicating the local system reception of a TCP connection request with a valid source IP address, TCP port, destination IP address, and TCP Port.  The definition of invalid source and invalid destination IP address is determined by the implementation.

BGP's destination port SHOULD be port 179, as defined by IANA.

TCP connection request is denoted by the local system receiving a TCP SYN.

Status: Optional

Optional Attribute Status:

1) The TrackTcpState attribute SHOULD be set to TRUE if this event occurs.

Event 15: Tcp_CR_Invalid

Definition: Event indicating the local system reception of a TCP connection request with either an invalid source address or port number, or an invalid destination address or port number.

When we run into something like processEvent in a file called fsm, we’re probably looking at a finite state machine broken up into a set of functions, each of which represents a single state, and each of which performs the right actions to move from the current state to a new state in the FSM. I know this is difficult to grok, so let me give you a more visual representation.

bgp-cd-fsm

State A is where we begin… This state would be represented as a single function in the source code. When State A is reached, this function is called, and, depending on the input, the function for State A will either call State B’s function, or State C’s function. This chain of events will continue until the final state is reached, and the FSM either enters a steady state, or exits. What tends to be confusing about this process is that these functions might not, in fact, call one another. Instead, what generally happens is the function for State A will be called, which will result in State C being the new state. The program will exit and wait for another event. When this next event occurs, the application will send this new event to the function for the current state, State C, which will then process the event, leaving the process in state D, for instance. This process/move to a new state/exit/wait cycle happens until the state reaches a steady state, or until the process ends.
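The handle/move/wait cycle described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the states A through D follow the diagram, but the event names ("x", "y", "z") and the transition table are made up, and nothing here is taken from the SnapRoute code.

```go
package main

import "fmt"

type state string
type event string

// transitions is a hypothetical table: for each state, which event leads
// to which new state. State A can move to B or C; State C can move to D.
var transitions = map[state]map[event]state{
	"A": {"x": "B", "y": "C"},
	"C": {"z": "D"},
}

// step handles a single event in the current state and returns the new
// state. An event with no entry in the table leaves the state unchanged.
func step(cur state, ev event) state {
	if next, ok := transitions[cur][ev]; ok {
		return next
	}
	return cur
}

func main() {
	// A buffered channel stands in for whatever delivers events.
	events := make(chan event, 2)
	events <- "y"
	events <- "z"
	close(events)

	cur := state("A")
	// Wait for an event, handle it in the current state, then wait again.
	for ev := range events {
		next := step(cur, ev)
		fmt.Printf("%s --%s--> %s\n", cur, ev, next)
		cur = next
	}
}
```

Note that step never calls itself or another state's handler. The loop owns the current state and simply hands each new event to it, which is the point made above about the state functions not calling one another.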

Instead of calling each function a different name, this code is built with the same function name in each state structure. Each state is represented by a structure, and each structure has a function that is called when an event happens while the FSM is in that particular state. If you’re in state A and an event occurs, you call (*stateA) processEvent. If you’re in state B and an event occurs, you call (*stateB) processEvent. There is one structure for each state, and a single function to handle events while in that state.
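Here is a rough sketch of that one-structure-per-state layout, assuming a small interface that every state satisfies. The names stateIface, machine, changeState, and dispatch are hypothetical and chosen for this example; the real code has its own types (BaseStateIface and FSM appear in the snippets above) with different signatures.

```go
package main

import "fmt"

// stateIface is a hypothetical interface that every state implements.
type stateIface interface {
	processEvent(m *machine, event string)
	name() string
}

// machine holds whichever state is current.
type machine struct{ current stateIface }

func (m *machine) changeState(s stateIface) { m.current = s }

// dispatch hands the event to the current state's processEvent.
func (m *machine) dispatch(event string) { m.current.processEvent(m, event) }

type stateA struct{}
type stateB struct{}

func (*stateA) name() string { return "A" }

// In state A, only the "go" event causes a transition.
func (*stateA) processEvent(m *machine, event string) {
	if event == "go" {
		m.changeState(&stateB{})
	}
}

func (*stateB) name() string { return "B" }

// In state B, only the "back" event causes a transition.
func (*stateB) processEvent(m *machine, event string) {
	if event == "back" {
		m.changeState(&stateA{})
	}
}

func main() {
	m := &machine{current: &stateA{}}
	for _, ev := range []string{"go", "ignored", "back"} {
		before := m.current.name()
		m.dispatch(ev)
		fmt.Printf("%s + %q -> %s\n", before, ev, m.current.name())
	}
}
```

The caller never needs to know which state is active. It always calls the same method name, and the receiver type decides which copy of processEvent actually runs.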

This means we’re not going to be able to just jump back function by function to trace what happens. Instead of tracing the functions, we’re going to need to trace the state by looking at the function within each state that deals with events. Lucky for us, the current state is contained right there in the function call—(st *OpenConfirmState). What we’re going to need to do, then, is trace back the successive states by looking at how we get to OpenConfirmState, and then how we get to the state that gets us to OpenConfirmState, etc. Along the way, we’re going to see precisely how a new peer is brought up in this version of BGP. We’ll start tracing these states next time.