In the argument between OSPF and BGP in the data center fabric over at Justin’s blog, I am decidedly in the camp of IS-IS. Rather than lay my reasons out here, however (a topic for another blog post?), I want to focus on something else Justin said that I think is incredibly important for network engineers to understand.
I think whiteboards are the most important tool for network design currently available, which makes me sad. I wish that weren’t true; I want much better tools. I can’t even tell you the number of disasters averted by 2-3 great network engineers arguing over a whiteboard.
For those not following the current state of the ITU, a proposal has been put forward to (pretty much) reorganize the standards body around “New IP.” Don’t be confused by the name—it’s exactly what it sounds like: a proposal for an entirely new set of transport protocols to replace the current IPv4/IPv6/TCP/QUIC/routing protocol stack that nearly 100% of the networks in operation today run on. Ignoring, for the moment, the problem of replacing the entire IP infrastructure, what can we learn from this proposal?
We’re actually pretty good at finding, and “solving” (for some meaning of “solving,” of course), these kinds of immediately obvious tradeoffs. It’s obvious the street sweepers are going to lose their jobs if we replace them with a robot. What might not be so obvious is the loss of the presence of a person on the street. That’s a pair of eyes who can see when a child is being taken by someone who’s not a family member, a pair of ears that can hear the rumble of a car that doesn’t belong in the neighborhood, a pair of hands that can help someone who’s fallen, etc.
Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.
The common pattern seems to be setting aside a one-hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now—fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.
This is the first of the ironies of automation Lisanne Bainbridge discusses—and this is the irony I’d like to explore. The irony she is articulating is this: the less you work on a system, the less likely you are to be able to control that system efficiently. Once a system is automated, however, you will not work on the system on a regular basis, but you will be required to take control of the system when the automated controller fails in some way. Ironically, in situations where the automated controller fails, the amount of control required to make things right again will be greater than in normal operation.
In the case of machine operation, it turns out that the human operator is required to control the machine in just the situations where the least amount of experience is available. This is analogous to the automated warehouse in which automated systems are used to stack and sort material. When the automated systems break down, there is absolutely no way for the humans involved to figure out why material is stacked the way it is, nor how to sort it out to get the system running again.
Simon Weckert recently hacked Google Maps into guiding drivers around a street through a rather simple mechanism: he placed 99 cellphones, all connected to Google Maps, in a little wagon and walked down the street with the wagon in tow. Maps saw this cluster of cell phones as a badly congested street—99 cars could not even physically fit into the street he was walking down—and guided other drivers around the area. The idea is novel, and the result rather funny, but it also illustrates a weakness in our “modern scientific mindset” that often bleeds over into network engineering.
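To make the weakness concrete, here is a toy sketch of the kind of naive inference the hack exploits (purely hypothetical, and in no way Google’s actual algorithm): if a road segment reports many slow-moving probes, assume the segment is jammed.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    """One device reporting its position and speed on a road segment."""
    segment_id: str
    speed_kph: float

def congestion_level(probes, free_flow_kph=50.0):
    """Naive inference: many probes moving well below free-flow speed = jam."""
    if not probes:
        return "no data"
    avg_speed = sum(p.speed_kph for p in probes) / len(probes)
    if len(probes) >= 20 and avg_speed < 0.2 * free_flow_kph:
        return "heavy"
    return "light"

# 99 phones in a hand-pulled wagon: a dense cluster of very slow "vehicles."
wagon = [Probe("bridge_street", 4.5) for _ in range(99)]
print(congestion_level(wagon))  # -> "heavy", though not a single car is present
```

The estimator never asks whether the probes are actually cars; the wagon satisfies every signal it measures. The same failure mode lurks in any telemetry system that infers state from measurements users can shape.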
The basic problem is this: we assume users will use things the way we intend them to. This never works out in the real world; users will use wrenches as hammers, treat cell phones as if they were high-end cameras, and press many other things into service in ways they were never intended. To make matters worse, users often “infer” the way something works and adapt their actions to get what they want based on that inference. For instance, everyone who drives “reverse-engineers” the road in their head, thinking about what the maximum safe speed might be, and so on. Social media users do the same thing when posting or reading through their timelines, forming novel and interesting theories about how these systems work that have no bearing on reality.
One of my pet peeves about the network “engineering” world is this: we do too little engineering and too much administration. What brought this to mind this week is an article about Margaret Hamilton and the time she spent developing software for the Apollo space program, and the lessons she learned about software development there. To wit—
Engineering—back in 1969 as well as here in 2020—carries a whole set of associated values with it, and one of the most important is the necessity of proofing for disaster before human usage. You don’t “fail fast” when building a bridge: You ensure the bridge works first.
Sounds simple in theory—but it is not in practice.
Let’s take, as an example, replacing some of the capacity in a data center designed around a rather traditional two-layer hierarchy of aggregation and core.
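One number that drives this kind of replacement decision is the oversubscription ratio at the aggregation layer. A minimal sketch of the back-of-napkin math, with hypothetical port counts and speeds rather than numbers from any particular design:

```python
def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of possible southbound traffic to available northbound capacity."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Hypothetical aggregation switch: 48 x 10G server-facing ports,
# 4 x 40G uplinks toward the core.
ratio = oversubscription(48, 10, 4, 40)
print(f"aggregation oversubscription: {ratio:.1f}:1")  # -> 3.0:1
```

Swap in faster uplinks or add more of them and the ratio shifts, along with the port count the core must supply; every replacement decision ripples through both layers.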
If you haven’t found the tradeoffs, you haven’t looked hard enough. Something I say rather often—as Eyvonne would say, a “Russism.” Fair enough, and it’s easy enough to say “if you haven’t found the tradeoffs, you haven’t looked hard enough,” but what does it mean, exactly? How do you apply this to the everyday world of designing, deploying, operating, and troubleshooting networks?
Humans tend to extremes in their thoughts. In many cases, we end up considering everything a zero-sum game, where any gain on the part of someone else means an immediate and opposite loss on my part. In others, we end up thinking we are going to get a free lunch. The reality is there is no such thing as a free lunch, and while some situations are zero-sum games, not all of them are. What we need is a way to “cut the middle”: to realistically appraise each situation and decide what the tradeoffs might be.
Network engineers do not need to become full-time coders to succeed—but some coding skills are really useful. In this episode of the Hedge, David Barroso (you can find David’s GitHub repositories here), Phill Simmonds, and Russ White discuss which programming skills are useful for network engineers.
Raise your hand if you think moving to platform as a service or infrastructure as a service is all about saving money. Raise it if you think moving to “the cloud” is all about increasing business agility and flexibility.
Put your hand down. You’re wrong.
Let’s be honest. For the last twenty years we network engineers have specialized in building extremely complex systems and formulating the excuses required when things don’t go right. We’ve specialized in saying “yes” to every requirement (or even wish) because we think that by saying “yes” we will become indispensable. Rather than building platforms on which the business can operate, we’ve built artisanal, complex pets that must be handled carefully lest they turn into beasts that devour time and money. You know, like the person who tries to replicate store-bought chips by purchasing expensive fryers and potatoes, and ends up just making a mess of the kitchen?