So, software is eating the world—and you thought this was going to make things simpler, right? If you haven’t found the tradeoffs, you haven’t looked hard enough. I should trademark that or something! 🙂 While code quality and supply chain security are common concerns, there are a lot of little “side trails” organizations do not tend to think about. One such trail was recently covered in a paper on underhanded code, which is code designed to pass a standard review but which can be used to harm the system later on. For instance, you might see at some spot—

if (buffer_size=REALLYLONGDECLAREDVARIABLENAMEHERE) {
/* do some stuff here */
} /* end of if */

Can you spot what the problem might be? In C, = (assignment) is different from == (comparison). Which should it really be here? Even astute reviewers can easily miss this kind of detail—not least because it could be an intentional construction. Using a strongly typed language, like Rust, can help prevent this kind of thing (listen to this episode of the Hedge for more information on Rust), but nothing beats having really good code formatting rules, even if they are apparently arbitrary, for catching these things.
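Go, the language used in the code dives later on, removes this particular trap entirely: assignment is a statement rather than an expression, so an accidental = inside an if is a compile-time error. Here is a minimal sketch—the function and variable names are invented for illustration:

```go
package main

import "fmt"

// checkBuffer reports whether size matches the configured limit.
// In Go, writing "if size = limit" would not compile, because
// assignment is a statement rather than an expression, so the
// classic C mistake above cannot slip through review.
func checkBuffer(size, limit int) bool {
	return size == limit // comparison (==), not assignment (=)
}

func main() {
	fmt.Println(checkBuffer(1500, 1500)) // true
	fmt.Println(checkBuffer(64, 1500))   // false
}
```

Strong typing and stricter grammar do not catch every underhanded construction, but they do shrink the space of “looks right, does wrong” code a reviewer has to worry about.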

The paper above lists these—

  • Use syntax highlighting and typefaces that clearly distinguish characters. You should be able to easily tell the difference between a lowercase l and a 1.
  • Require all comments to be on separate lines. This is actually pretty hard in C, however.
  • Prettify code into a standard format not under the attacker’s control.
  • Use compiler warnings and static analysis.
  • Forbid unneeded dangerous constructions.
  • Use runtime memory corruption detection.
  • Use fuzzing.
  • Watch your test coverage.

Not all of these are directly applicable for the network engineer dealing with automation, but they do provide some good pointers, or places to start. A few more…

Yoda assignments are named after Yoda’s habit of placing the subject after the verb—“succeed you will…” It’s not technically wrong in terms of grammar, but it is just hard enough to understand that it makes you listen carefully and think a bit harder. In software development, the convention is that the variable receiving the assignment (or being tested) goes on the left, and the value goes on the right. Reversing these is a Yoda assignment; it’s technically correct—and in C it is sometimes done deliberately, because a mistyped if (5 = x) will not compile while if (x = 5) will—but it is harder to read.
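A quick sketch in Go (the names here are invented): both orderings compile and behave identically, but the Yoda version forces the reader to mentally swap the operands.

```go
package main

import "fmt"

const maxRetries = 3

// Conventional order: the variable under test on the left.
func shouldStop(retries int) bool {
	return retries >= maxRetries
}

// Yoda order: the constant first. Identical behavior, but the
// reader has to flip the comparison around to make sense of it.
func shouldStopYoda(retries int) bool {
	return maxRetries <= retries
}

func main() {
	for r := 0; r <= 4; r++ {
		fmt.Println(r, shouldStop(r), shouldStopYoda(r))
	}
}
```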

Arbitrary standardization is useful when there are many options that ultimately result in the same outcome. Don’t let options proliferate just because you can.

Use macros!

There are probably plenty more, but this is an area where we really are not paying attention right now.

The word on the street is that everyone—especially network engineers—must learn to code. A conversation with a friend and an article passing through my RSS reader brought this to mind once again—so once more into the breach. Part of the problem here is that we seem to have a knack for asking the wrong question. When we look at network engineer skill sets, we often think about the ability to configure a protocol or set of features, and then the ability to quickly troubleshoot those protocols or features using a set of commands or techniques.

This is, in some sense, what various certifications have taught us—we have reached the expert level when we can configure a network quickly, or when we can prove we understand a product line. There is, by the way, a point of truth in this. If you claim your expertise is with a particular vendor’s gear, then it is true that you must be able to configure and troubleshoot on that vendor’s gear to be an expert. There is also a problem of how to test for networking skills without actually implementing something, and how to implement things without actually configuring them. This is a problem we are discussing in the new “certification” I’ve been working on, as well.

This is also, in some sense, what the hiring processes we use have taught us. Computers like to classify things in clear and definite ways. The only clear and definite way to classify networking skills is by asking questions like “what protocols do you understand how to configure and troubleshoot?” It is, it seems, nearly impossible to test design or communication skills in a way that can be easily placed on a resume.

Coding, I think, is one of those skills that is easy to appear to measure accurately, and it’s also something the entire world insists is the “coming thing.” No coding skills, no job. So it’s easy to ask the easy question—what languages do you know, how many lines of code have you written, etc. But again, this is the wrong question (or these are the wrong questions).

What is the right question? In terms of coding skills, more along the lines of something like, “do you know how to build and use tools to solve business problems?” I phrase it this way because the one thing I have noticed about every really good coder I have known is they all spend as much time building tools as they do building shipping products. They build tools to test their code, or to modify the code they’ve already written en masse, etc. In fact, the excellent coders I know treat functions like tools—if they have to drive a nail twice, they stop and create a hammer rather than repeating the exercise with some other tool.

So why is coding such an important skill to gain and maintain for the network engineer? This paragraph seems to sum it up nicely for me—

“Coding is not the fundamental skill,” writes startup founder and ex-Microsoft program manager Chris Granger. What matters, he argues, is being able to model problems and use computers to solve them. “We don’t want a generation of people forced to care about Unicode and UI toolkits. We want a generation of writers, biologists, and accountants that can leverage computers.”

It’s not the coding that matters, it’s “being able to model problems and use computers to solve them.” This is the essence of tool building or engineering—seeing the problem, understanding the problem, and then thinking through (sometimes by trial and error) how to build a tool that will solve the problem in a consistent, easy-to-manage way. I fear that network engineers are simply taking their configuration-centered habits and automating them to make configuration and troubleshooting faster. We end up asking “how do I make the configuration of this network faster,” rather than “what business problem am I trying to solve?”

To make effective use of the coding skills we’re telling everyone to learn, we need to go back to basics and understand the problems we’re trying to solve—and the set of possible solutions we can use to solve those problems. Seen this way, the routing protocol becomes “just another tool,” just like a function call, that can be used to solve a specific set of problems—instead of a set of configuration lines that we invoke like a magic incantation to make things happen.

Coding skills are important—but they require the right mindset if we’re going to really gain the sorts of efficiencies we think are possible.

The Internet, and networking protocols more broadly, were grounded in a few simple principles. For instance, there is the end-to-end principle, which argues the network should be a simple fat pipe that does not modify data in transit. Many of these principles have tradeoffs—if you haven’t found the tradeoffs, you haven’t looked hard enough—and not looking for them can result in massive failures at the network and protocol level.

Another principle networking is grounded in is the Robustness Principle, which states: “Be liberal in what you accept, and conservative in what you send.” In protocol design and implementation, this means you should accept the widest range of inputs possible without negative consequences. A recent draft, however, challenges the robustness principle—draft-iab-protocol-maintenance.

According to the authors, the basic premise of the robustness principle lies in the problem of updating older software for new features or fixes at the scale of an Internet sized network. The general idea is a protocol designer can set aside some “reserved bits,” using them in a later version of the protocol, and not worry about older implementations misinterpreting them—new meanings of old reserved bits will be silently ignored. In a world where even a very old operating system, such as Windows XP, is still widely used, and people complain endlessly about forced updates, it seems like the robustness principle is on solid ground in this regard.
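To make the reserved-bits idea concrete, here is a hedged sketch in Go; the one-octet flags layout is invented for illustration, not taken from any real protocol. An old receiver masks off the bits it understands and silently ignores the rest, so a newer sender can begin using a reserved bit without breaking it.

```go
package main

import "fmt"

// Hypothetical one-octet flags field: bit 0 is defined in "version 1"
// of an imaginary protocol; bits 1 through 7 are reserved.
const (
	flagAck      byte = 0x01 // defined in v1
	knownV1Flags byte = 0x01 // the only bits a v1 receiver understands
)

// parseFlagsV1 models a robust v1 receiver: it keeps the bits it
// knows about and silently ignores anything in the reserved bits.
func parseFlagsV1(flags byte) byte {
	return flags & knownV1Flags
}

func main() {
	// A v2 sender sets a newly defined bit (0x02) alongside ACK.
	fromNewSender := byte(0x03)
	seen := parseFlagsV1(fromNewSender)
	fmt.Println(seen == flagAck) // true: the reserved bit was quietly dropped
}
```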

The argument against this in the draft is that implementing the robustness principle allows a protocol to degenerate over time. Older implementations are not removed from service because they still work, implementations are not updated in a timely manner, and the protocol tends to accumulate an ever-increasing amount of “dead code” in the form of older expressions of data formats. Given an infinite amount of time, an infinite number of versions of any given protocol will be deployed. As a result, the protocol can and will break in an infinite number of ways.

The logic of the draft is something along the lines of: old ways of doing things should be removed from protocols which are actively maintained in order to unify and simplify the protocol. At least for actively maintained protocols, reliance on the robustness principle should be toned down a little.

Given the long list of examples in the draft, the authors make a good case.

There is another side to the argument, however. The robustness principle is not “just” about keeping older versions of software working “at scale.” All implementations, no matter how good their quality, have defects (or rather, unintended features). Many of these defects involve failing to release or initialize memory, failing to bounds check inputs, and other similar oversights. A common way to find these errors is to fuzz test code—throw lots of different inputs at it to see if it spits up an error or crash.

The robustness principle runs deeper than infinite versions—it also helps implementations deal with defects in “the other end” that generate bad data. The robustness principle, then, can help keep a network running even in the face of an implementation defect.

Where does this leave us? Abandoning the robustness principle is clearly not a good thing—while the network might end up being more correct, it might also end up simply not running. Ever. The Internet is an interlocking system of protocols, hardware, and software; the robustness principle is the lubricant that makes it all work at all.

Clearly, then, there must be some sort of compromise position. Perhaps a two-pronged attack might work. First, don’t discard errors silently. Instead, build logging into software that catches all errors, regardless of how trivial they might seem. This will generate a lot of data, but we need to be clear on the difference between instrumenting something and actually paying attention to what is instrumented. Instrumenting code so that “unknown input” can be caught and logged periodically is not a bad thing.
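A minimal sketch of this first prong in Go (the flag layout and names are invented): the parser still tolerates unknown input, but it counts and logs every occurrence instead of dropping it silently, so the data exists for anyone who cares to look.

```go
package main

import (
	"fmt"
	"log"
)

const knownFlags byte = 0x01

// unknownInputCount is the kind of counter a metrics system could scrape.
var unknownInputCount int

// parseFlags remains liberal in what it accepts, but it no longer
// discards unknown bits silently: each occurrence is counted and logged.
func parseFlags(flags byte) byte {
	if unknown := flags &^ knownFlags; unknown != 0 {
		unknownInputCount++
		log.Printf("ignoring unknown flag bits %#x", unknown)
	}
	return flags & knownFlags
}

func main() {
	parseFlags(0x03) // one unknown bit: logged and counted
	parseFlags(0x01) // all bits known: nothing logged
	fmt.Println("unknown inputs seen:", unknownInputCount)
}
```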

Second, perhaps protocols need some way to end-of-life older versions. Part of the problem with the robustness principle is that it allows an infinite number of versions of a single protocol to exist in the same network. Perhaps the IETF and other standards organizations should rethink this, explicitly taking older ways of doing things out of specs on a periodic basis. A draft that says “you shouldn’t do this any longer,” or “this is no longer in accordance with the specification,” would not be a bad thing.
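The second prong can be sketched the same way (the version numbers here are invented): a speaker that carries an explicit floor of supported versions can refuse, loudly, to interoperate with implementations the specification has end-of-lifed, instead of limping along with them forever.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical policy: the current spec allows versions 3 and 4;
// everything below 3 has been formally deprecated.
const (
	minSupportedVersion = 3
	maxSupportedVersion = 4
)

var errDeprecated = errors.New("peer version deprecated by specification")

// checkPeerVersion rejects deprecated versions explicitly rather
// than silently accepting whatever the peer sends.
func checkPeerVersion(v int) error {
	switch {
	case v < minSupportedVersion:
		return errDeprecated
	case v > maxSupportedVersion:
		return errors.New("peer version newer than this implementation understands")
	default:
		return nil
	}
}

func main() {
	fmt.Println(checkPeerVersion(2)) // deprecated: refuse the session
	fmt.Println(checkPeerVersion(3)) // <nil>: acceptable
}
```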

For the more “average” network engineer, this discussion around the robustness principle should lead to some important lessons, particularly as we move ever more deeply into an automated world. Be clear about versioning of APIs and components. Deprecate older processes when they should no longer be used.

Control your technical debt, or it will control you.

The argument around learning to code, it seems, always runs something like this:

We don’t need network engineers any longer, or we won’t in five years. Everything is going to be automated. All we’ll really need is coders who can write a python script to make it all work. Forget those expert level certifications. Just go to a coding boot camp, or get a good solid degree in coding, and you’ll be set for the rest of your life!

It certainly seems plausible on the surface. The market is pretty clearly splitting into definite camps—cloud, disaggregated, and hyperconverged—and this split is certainly going to drive a lot of change in what network engineers do every day. But is this idea of abandoning network engineering skills and replacing them wholesale with coding skills really viable?

To think this question through, it’s best to start with another one. Assume everyone in the world decides to become a coder tomorrow. Every automotive engineer and mechanic, every civil engineer and architect, every chef, and every grocer moves into coding. The question that should rise just at this moment is: what is it that’s being coded? Back end coders code database systems and business logic. Front end coders code user interfaces. Graphics coders code ray tracing systems, artificial surfaces, and other such things. There is no way to do any of these things successfully if you don’t know the goal of the project. There’s no point in coding a GUI if you don’t understand user interface design. There’s no point in coding a back end system if you don’t understand accounting, or database design, or data analytics.

Given all of this, what piece of knowledge is missing from the path we are being urged to go down?

Network engineering.

If you want to code databases, you need to learn database theory. If you want to code accounting systems, you need to learn accounting. If you want to code networks, you need to learn network engineering.

But what do we mean when we say “network engineering?” Isn’t network engineering just buying some vendor gear, stringing it together, and then configuring it based on some set of arcane rules no-one really understands anyway? Isn’t network engineering much like building a castle out of plastic play blocks, just fitting them together in a way that makes sense, and ignoring or smoothing over the rough edges where things don’t quite fit right?

In short, no.

I’m not going to discourage you from learning to code—and I don’t just mean throwing around some python scripts to automate some odds and ends, or to complete a challenge on some web site. I truly believe that coding, real coding, is a good skill to have. But to believe we are going to eliminate network engineering through automation is to trade a skill set that has always been of rather limited value—purchasing, installing, and configuring vendor-built appliances—for another one that is probably of less value—automating the configuration of vendor-built devices. I fail to see how this is a good idea. If we all become coders, there will be no networks to code—because there will be no network engineers to build them.

Yes, siloed engineers are going to be in less demand in the future than they are today—but this is old news, at least as old as my time administering a Netware network at BASF, and even older. The market always wants specific skills right now, and engineers always need to build skills for the long term.

If you want to learn something now—if you want to learn something that will stand the test of time—if you want to learn something that will outlast vendors and appliances and white box and disaggregation and…

Learn network engineering.

Not how to configure devices—the skill that has stood in for real network engineering knowledge for far too long. Not how to automate the configuration of network devices—the skill that we are increasingly turning to, to replace our knowledge of the CLI.

Learn how the protocols really work, from theory to implementation, rather than how to configure them. Learn how devices switch packets, and why they work this way, rather than the available bandwidth on the latest gear. Learn how to design a network, rather than how to deploy vendor gear. Learn how to troubleshoot a network, rather than how to issue commands and look for responses.

It’s time we stopped spreading the “if you just learn to code, you’ll be in demand in five years” hype. If you have network engineering skills, then learning to code is a good thing. But I know plenty of really good coders who are not employed because they don’t have any skill other than coding (and to them, I say, learn network engineering). Learning to code is not a magic carpet that will take you to a field of dreams.

There is still hard work to do, there are still hard things to learn, there are still problems to be solved.

It’s fine—in fact crucial—to be an engineer who knows how to code. But you need to be an engineer before learning to code is all that useful.

In the last post on this topic, we found the tail of the update chain. The actual event appears to be processed here—

case BGPEventUpdateMsg:
  st.fsm.StartHoldTimer()
  bgpMsg := data.(*packet.BGPMessage)
  st.fsm.ProcessUpdateMessage(bgpMsg)

—which is found around line 734 of fsm.go. The second line of code in this snippet is interesting; it’s a little difficult to understand what it’s actually doing. There are three crucial elements to figuring out what is going on here—

:=, in Go, declares a new variable and assigns a value to it in a single step. But what, precisely, is being assigned to bgpMsg from data?

The * (asterisk) marks a pointer: *packet.BGPMessage reads as “a pointer to a packet.BGPMessage.” We’ve not talked about pointers before, so it’s worth spending just a moment with them. The illustration below will help a bit.

Each letter in the string “this is a string” is stored in a single memory location (this isn’t necessarily true, but let’s assume it is for this example). Further, each memory location has a location identifier—some form of number that says, “this is memory location x.” This memory locator is, of course, a number—hence the memory locator itself can be assigned to a variable, which can then be treated as a separate object from the string itself.

This memory locator is called a pointer.

It only makes sense that the locator is called a pointer, because it points to the string. The question that should pop up in your head right now is—“but wait, if each letter is stored in a different memory location, then which memory location does the pointer actually point to?” If you’re trying to describe the entire string, the pointer would normally point to the first character in the string. You can, of course, also describe just some part of the string by pointing to a memory location someplace in the middle of the string. For instance, you could point to just the part of the string “is a string” by finding the memory location of the second “i” in the string, and storing its memory location.

How can you find the location of a string, or some other data structure? You place an & (ampersand) in front of it. So, if you do this—

myPointer := &aString

Now I have the pointer, but how do I get back to the value from the pointer? Like this—

aStringCopy := *myPointer

So the * takes the data that is pointed at by the pointer and pulls it out for assignment to another variable. In this case, though, the asterisk is doing something slightly different. This line of code—

bgpMsg := data.(*packet.BGPMessage)

—is a Go type assertion. The data parameter is declared as an interface{}, a container that can hold a value of any type; the assertion says “the value stored in data is a pointer to a packet.BGPMessage; give me back that pointer.” In other words, this line recovers a typed pointer to the actual packet contents, which the BGP FSM buffered when the packet was received, and assigns that pointer to bgpMsg so the message can be processed as an update. We need to look elsewhere for the code that removes messages from this data structure—we will most likely find it when we start looking through ProcessUpdateMessage, which is where we will start next time.
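Putting the pieces together in a standalone sketch, with invented names but the same mechanics fsm.go relies on: & takes an address, * dereferences a pointer (or, in a type name, means “pointer to”), and a type assertion recovers the concrete pointer from an interface{} value.

```go
package main

import "fmt"

// message is a stand-in for packet.BGPMessage.
type message struct {
	body string
}

func main() {
	msg := message{body: "this is a string"}

	// & takes the address of msg; ptr has type *message.
	ptr := &msg

	// * dereferences the pointer, copying the value it points at;
	// changing the copy does not touch the original.
	copyOfMsg := *ptr
	copyOfMsg.body = "changed"
	fmt.Println(msg.body) // this is a string

	// Store the pointer in an interface{}, as the FSM's data
	// parameter does, then use a type assertion to get it back.
	var data interface{} = ptr
	recovered := data.(*message)
	fmt.Println(recovered == ptr) // true: same pointer, nothing copied
}
```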

Just in time for Halloween, the lucky thirteenth post in the BGP code dive series. In this series, we’re working through the Snaproute Go implementation of BGP just to see how a production, open source BGP implementation really works. Along the way, we’re learning something about how larger, more complex projects are structured, and also something about the Go programming language. The entire series can be found on the series page.

In the last post in this series, we left off with our newly established peer just sitting there sending and receiving keepalives. But BGP peers are not designed just to exchange random traffic; they’re actually designed to exchange reachability and topology information about the network. BGP carries routing information in updates, which are actually complicated containers for a lot of different kinds of reachability information. In BGP, a reachable destination is called an NLRI, or Network Layer Reachability Information. Starting with this code dive, we’re going to look at how the Snaproute BGP implementation processes updates, sorting out NLRIs, etc.

When you’re reading through code, whether looking for a better understanding of an implementation, a better understanding of a protocol, or even to figure out “what went wrong” on the wire or in the network, the hardest part is often just figuring out where to start. In the first post in this series (as my dog would say, “much time ago—in fact, forever, I counted!”), I put the code on a box and ran some logging. The logging gave me a string to search for in the code base, which then led to the beginning of the call chain, and the unraveling of the finite state machine.

This time, being more familiar with the code base, we’re going to start from a different place. We’re going to guess where to start. 🙂 Given—

  • BGP works entirely around a finite state machine in this implementation (which is, in fact, how it operates in most implementations)
  • Updates are really only processed while the peering BGP speaker is in the established state
  • Receiving an update is what we’d probably consider an event

We’ll take a look at the code in the FSM for processing events while the peering relationship is in the established state to see what we can find. The relevant function in the FSM begins around line 673 (note the line numbers can change for various reasons, so this is just a quick reference that might not remain constant over time)—

func (st *EstablishedState) processEvent(event BGPFSMEvent, data interface{}) {
 switch event {

The top of this function is a large switch statement, each case of which relates to a specific event. The events are—

case BGPEventManualStop:
case BGPEventAutoStop:
case BGPEventHoldTimerExp:
case BGPEventKeepAliveTimerExp:
case BGPEventTcpConnValid:
case BGPEventTcpCrAcked, BGPEventTcpConnConfirmed:
case BGPEventTcpConnFails, BGPEventNotifMsgVerErr, BGPEventNotifMsg:
case BGPEventBGPOpen:
case BGPEventOpenCollisionDump:
case BGPEventUpdateMsg:
case BGPEventUpdateMsgErr:
case BGPEventConnRetryTimerExp, BGPEventDelayOpenTimerExp, BGPEventIdleHoldTimerExp, 
  BGPEventOpenMsgErr, BGPEventBGPOpenDelayOpenTimer, BGPEventHeaderErr:

 

In fact, the event we’re interested in is included in the list of events while a peer is in the established state. It’s around line 734—

case BGPEventUpdateMsg:
  st.fsm.StartHoldTimer()
  bgpMsg := data.(*packet.BGPMessage)
  st.fsm.ProcessUpdateMessage(bgpMsg)

case BGPEventUpdateMsgErr:
  bgpMsgErr := data.(*packet.BGPMessageError)
  st.fsm.SendNotificationMessage(bgpMsgErr.TypeCode, bgpMsgErr.SubTypeCode, bgpMsgErr.Data)
  st.fsm.StopConnectRetryTimer()
  st.fsm.ClearPeerConn()
  st.fsm.StopConnToPeer()
  st.fsm.IncrConnectRetryCounter()
  st.fsm.ChangeState(NewIdleState(st.fsm))

 

The first of the two case statements here deals with receiving an actual update; the second with an update error. The process of handling an update is really only three lines of code.

st.fsm.StartHoldTimer() resets the hold timer. Updates count as keepalives in BGP, allowing active peers to skip sending keepalive messages, saving bandwidth and processing power.

bgpMsg := data.(*packet.BGPMessage) pulls the BGP message just received out of the generic data parameter and into a local variable. This is a little bit of Go magic (a type assertion) we’ll look at more closely in the next post.

st.fsm.ProcessUpdateMessage(bgpMsg) processes the actual BGP message that was just copied into a local structure. Again, chasing through this is something that will need to wait until a later post.

For now, we’ve found the tail of the update processing chain in the snaproute BGP code—more next time.