WRITTEN
It Always Takes Too Long
It always takes longer to find a problem than it should. Moving through the troubleshooting process often feels like swimming in molasses—it’s never fast enough or far enough to get the application back up and running before that crucial deadline. The “swimming in molasses effect” doesn’t end when the problem is found out, either—repairing the problem requires juggling a thousand variables, most of which are unknown, combined with the wit and sagacity of a soothsayer to work with vendors, code releases, and unintended consequences.
It’s enough to make a network engineer want to find a mountain top and assume an all-knowing pose—even if they don’t know anything at all.
The problem of taking longer, though, applies in every area of computer networking. It takes too long for the packet to get there, it takes to long for the routing protocol to converge, it takes too long to support a new application or server. It takes so long to create and validate a network design change that the hardware, software and processes created are obsolete before they are used.
Why does it always take too long? A short story often told to me by my Grandfather—a farmer—might help.
One morning a farmer got early in the morning, determined to throw some hay down to the horses in the stable. While getting dressed, he noticed one of the buttons on his shirt was loose. “No time for that now,” he thought, “I’ll deal with it later.” Getting out to the barn, he climbed up the ladder to the loft, and picked up a pitchfork. When he drove the fork into the hay, the handle broke.
He sighed, took the broken pieces down the ladder, and headed over to his shed to replace the handle—but when he got there, he realized he didn’t have a new handle that would fit. Sighing again, he took the broken pieces to his old trusty truck and headed into town—arriving before the hardware store opened. “Well, I’m already here, might as well get some coffee,” he thought, so he headed to the diner. After a bit, he headed to the store to buy a handle—but just as he walked out past the door, the loose button caught on the handle, popping off.
It took a few minutes to search for the lost button, but he found it and headed over to the cleaners to have it sewn back on “real fast.” Well, he couldn’t wander around town in his undershirt, so he just stepped next door to the barber’s, where there were a few friendly games of checkers already in progress. He played a couple of games, then the barber came out to remind him that he needed a haircut (a thing barbers tend to do all the time for some reason), so he decided to have it done. “Might was well not waste the time in town now I’m here,” he thought.
The haircut finished, he went back to get his shirt, and realized it was just about lunch. Back to the diner again. Once he was done, he jumped in his truck and headed back to the farm. And then he realized—the horses were hungry, the hay hadn’t been pitched, and … his pitchfork was broken.
And this is why it always takes longer than it should to get anything done with a network. You take the call and listen to the customer talk about what the application is doing, which takes a half an hour. You then think about what might be wrong, perhaps kicking a few routers “just for good measure” before you start troubleshooting in earnest. You look for a piece of information you need to understand the problem, only to find the telemetry system doesn’t collect that data “yet”—so you either open a ticket (a process that takes a half an hour), or you “fix it now” (which takes several hours). Once you have that information, you form a theory, so you telnet into a network device to check on a few things… only to discover the device you’re looking at has the wrong version of code… This requires a maintenance window to fix, so you put in a request…
Once you even figure out what the problem is, you encounter a series of hurdles lined up in front of you. The code needs to be upgraded, but you have to contact the vendor to make certain the new code supports all the “stuff” you need. The configuration has to be changed, but you have to determine if the change will impact all the other applications on the network. You have to juggle a seemingly infinite number of unintended consequences in a complex maze of software and hardware and people and processes.
And you wonder, the entire time, why you just didn’t learn to code and become a backend developer, or perhaps a mountain-top guru.
So the next time you think it’s taking to long to fix the problem, or design a new addition to the network, or for the vendor to create that perfect bit of code, remember the farmer, and the button that left the horses hungry.
Leveraging Similarities
We tend to think every technology and every product is roughly unique—so we tend to stay up late at night looking at packet captures and learning how to configure each product individually, and chasing new ones as if they are the brightest new idea (or, in marketing terms, the best thing since sliced bread). Reality check: they aren’t. This applies across life, of course, but especially to technology. From a recent article—
RFC1925 rule 11 states—
Rule 11 isn’t just a funny saying—rule 11 is your friend. If want to learn new things quickly, learn rule 11 first. A basic understanding of the theory of networking will carry across all products, all marketing campaigns, and all protocols.
Whatever it is, you need more (RFC1925 rule 9)
There is never enough. Whatever you name in the world of networking, there is simply not enough. There are not enough ports. There is not enough speed. There is not enough bandwidth. Many times, the problem of “not enough” manifests itself as “too much”—there is too much buffering and there are too many packets being dropped. Not so long ago, the Internet community decided there were not enough IP addresses and decided to expand the address space from 32 bits in IPv4 to 128 bits in IPv6. The IPv6 address space is almost unimaginably huge—2 to the 128th power is about 340 trillion, trillion, trillion addresses. That is enough to provide addresses to stacks of 10 billion computers blanketing the entire Earth. Even a single subnet of this space is enough to provide addresses for a full data center where hundreds of virtual machines are being created every minute; each /64 (the default allocation size for an IPv6 address) contains 4 billion IPv4 address spaces.
But… what if the current IPv6 address space simply is not enough? Engineers working in the IETF have created two different solutions over the years for just this eventuality. In 1994 RFC1606 provided a “letter from the future” describing the eventual deployment of IPv9, which was (in this eventual future) coming to the end of its useful lifetime because the Internet was running out of numbers. In RFC1606, it is noted that IPv9’s 49 levels of hierarchy had proven popular, but not all the levels had found a use. The highest level in use seems to be level 39, which was being used to address individual subatomic particles. Part of the dwindling address space considered in RFC1606 was the default allocation of about 1 billion addresses to each household. As the number of homes built increased globally, the IPv9 address space came under increasing pressure. The allocation of groups of addresses to recyclable items was not helpful, either, regardless of the ability to multicast “all cardboard items” in a recycling bin.
An alternate proposal, written many years later, is RFC8135, which considers complex addressing in IPv6. RFC8135 begins by describing the different ways in which a set of numbers, such as the 128-bit space in the IPv6 address, can be represented, including integers, prime, and composite. Each of these are considered in some detail, but eventually rejected for various reasons. For instance, integer (or fixed point) addresses are rejected because the location of a host is not fixed, so fixed point addresses are a poor representation of the host. Prime addresses are likewise rejected because they take too long to compute, and composite addresses are rejected because they are too difficult to differentiate from prime addresses.
RFC8135 proposes a completely different way of looking at the 128-bits available in the IPv6 address space. Rather than treating IPv6’s address space as a simple integer, this specification advocates for treating it as a floating number. This allows for a much larger space, particularly as aggregation can be indicated through scientific notation. The main problem the authors note with this proposal is users may believe that when they assign a floating address to their device, the device itself thereby becomes waterproof and floating. The authors advice users count on a waterproofing app, available in most app stores, for this function, rather than counting on the floating address. The authors also note duct tape can be used to permanently attach a floating address to a fixed device, if needed.
The danger, of course, is that in the quest for “more,” network designers, network operators, and protocol designers could end up embracing the ridiculous. It all brings to mind the point Andrew Tanenbaum made in a standard work on Networking, Computer Networks. Tanenbaum calculates the bandwidth of a station wagon full of magnetic tape (specifically VHS format) backups. After considering the amount of time it would take to drive the station wagon across the continental United States, he concludes the vehicle has more bandwidth than any link available at that time. A similar calculation could be made with a mid-sized shipping box available from any overnight package carrier, filled with SSD drives (or similar). The conclusion, according to Dr. Tanenbaum, is networks are a sop to human impatience.
As there is no bound to human impatience, no matter how much you have, as RFC1925 says, you will always need more.
NATs, PATs, and Network Hygiene
While reading a research paper on address spoofing from 2019, I ran into this on NAT (really PAT) failures—
The authors state 49% of the NATs they discovered in their investigation of spoofed addresses fail in one of these two ways. From what I remember way back when the first NAT/PAT device (the PIX) was deployed in the real world (I worked in TAC at the time), there was a lot of discussion about what a firewall should do with packets sourced from addresses not indicated anywhere.
If I have an access list including 192.168.1.0/24, and I get a packet sourced from 192.168.2.24, what should the NAT do? Should it forward the packet, assuming it’s from some valid public IP space? Or should it block the packet because there’s no policy covering this source address?
This is similar to the discussion about whether BGP speakers should send routes to an external peer if there is no policy configured. The IETF (though not all vendors) eventually came to the conclusion that BGP speakers should not advertise to external peers without some form of policy configured.
My instinct is the NATs here are doing the right thing—these packets should be forwarded—but network operators should be aware of this failure mode and configure their intentions explicitly. I suspect most operators don’t realize this is the way most NAT implementations work, and hence they aren’t explicitly filtering source addresses that don’t fall within the source translation pool.
In the real world, there should also be a box just outside the NATing device that’s running unicast reverse path forwarding checks. This would resolve these sorts of spoofed packets from being forwarding into the DFZ—but uRPF is rarely implemented by edge providers, and most edge connected operators (enterprises) don’t think about the importance of uRPF to their security.
All this to say—if you’re running a NAT or PAT, make certain you understand how it works. Filters are tricky in the best of circumstances. NAT and PATs just make filters trickier.
Details and Complexity
What is the first thing almost every training course in routing protocols begin with? Building adjacencies. What is considered the “deep stuff” in routing protocols? Knowing packet formats and processes down to the bit level. What is considered the place where the rubber meets the road? How to configure the protocol.
I’m not trying to cast aspersions at widely available training, but I sense we have this all wrong—and this is a sense I’ve had ever since my first book was released in 1999. It’s always hard for me to put my finger on why I consider this way of thinking about network engineering less-than-optimal, or why we approach training this way.
This, however, is one thing I think is going on here—
We believe that by knowing ever-deeper reaches of detail about a protocol, we are not only more educated engineers, but we will be able to make better decisions in the design and troubleshooting spaces.
To some degree, we think we are managing the complexity of the protocol by “making our knowledge practical”—by knowing the bits, bytes, and configurations. This natural tendency to “dig in,” to learn more detail, turns out to be counterproductive. Continuing from the same article—
The scientific opinion of many psychologists and behavioral scientists suggests the key to time-sensitive decision making in complex and chaotic situations is simplicity, not complexity. Simple-to-remember rules of thumb, or heuristics, speed the cognitive process, enabling faster decisionmaking and action. Recognizing that heuristics have limitations and are not a substitute for basic research and analysis, they nevertheless help break complexity-induced paralysis and support the development of good plans that can achieve timely and acceptable results. The best heuristics capture useful information in an intuitive, easy-to-recall way. Their utility is in assisting decision makers in complex and chaotic situations to make better and timelier decisions.
Knowing why a protocol works the way it does—understanding what it’s doing and why—from an abstract perspective is, I believe, a more important skill for the average network engineer than knowing the bits and bytes—or the configuration.
Abstract correctly—but abstract more. Get back to the basics and know why things work the way they do. It’s easier to fill in the details if you know the how and why.
It’s Most Complicated than You Think (RFC1925, Rule 8)
It’s not unusual in the life of a network engineer to go entire weeks, perhaps even months, without “getting anything done.” This might seem odd for those who do not work in and around the odd combination of layer 1, layer 3, layer 7, and layer 9 problems network engineers must span and understand, but it’s normal for those in the field. For instance, a simple request to support a new application might require the implementation of some feature, which in turn requires upgrading several thousand devices, leading to the discovery that some number of these devices simply do not support the new software version, requiring a purchase order and change management plan to be put in place to replace those devices, which results in … The chain of dominoes, once it begins, never seems to end.
Or, as those who have dealt with these problems many times might say, it is more complicated than you think. This is such a useful phrase, in fact, it has been codified as a standard rule of networking in RFC1925 (rule 8, to be precise).
Take, for instance, the problem of sending documents through electronic mail—in the real world, there are various mechanisms available to group documents, so the recipient understands what documents go together as a set, which ones are separate—staples, paperclips, binders, folders, etc. In the virtual world, however, documents are just a big blob of bits. How does anyone know which documents go with which in this situation? The obvious solution is to create electronic versions of staples and paperclips, as described in RFC1927. This only seems simple, however; it is more complicated than you think.
For instance, how do you know someone along the document transmission path has not altered the staples and/or paper clips? To prevent staple tampering, electronic staples must be cryptographically signed in some way. In the real world, paper clips (in particular) are removed from documents and re-used to save money and resources. Likewise, there must be some process to discover unused digital document sets so the paper clips may be removed and placed in some form of storage for reuse. Some people like to use differently colored staples or paperclips; how should these be represented in the digital world? RFC1927 describes MIME labels to resolve most of these problems, but there is one final problem that brings the complexity of grouping electronic documents to an entirely new level: metadata creep. What happens when the amount of data required to describe the staple or paperclip becomes larger than the documents being grouped?
Something as simple as representing characters in a language can often be more complex than it might initially seem. RFC5242 attempts to resolve the complexity of the many available encoding schemes with a single coding scheme. Rather than assigning each symbol within a language to a single number within a number space, like ASCII and UNICODE do, however, RFC5242 suggests creating a set of codes which describe how a character looks, rather than what it stands for. This allows the authors to use four principles—if it looks alike, it is alike; if it is the same thing, it is the same thing; san-serif is preferred; combine characters rather than creating new ones where possible—to create a simplified way to describe any possible character in virtually any “Latin” language. The result requires a bit more space to store in some cases, and is more difficult to process, but it is simpler at least from some perspective.
RFC5242 reminds me of a protocol custom-developed for an application I once had to troubleshoot—the entire protocol was sent in actual ASCII text. At least it was simpler to read on the network packet capture tool. There are, of course, many other examples of things being more complex than initially thought in the networking world—which is probably a good thing, because it means those many reports of the demise of the network engineer are probably greatly exaggerated.
Illusory Correlation and Security
Fear sells. Fear of missing out, fear of being an imposter, fear of crime, fear of injury, fear of sickness … we can all think of times when people we know (or worse, a people in the throes of madness of crowds) have made really bad decisions because they were afraid of something. Bruce Schneier has documented this a number of times. For instance: “it’s smart politics to exaggerate terrorist threats” and “fear makes people deferential, docile, and distrustful, and both politicians and marketers have learned to take advantage of this.” Here is a paper comparing the risk of death in a bathtub to death because of a terrorist attack—bathtubs win.
But while fear sells, the desire to appear unafraid also sells—and it conditions people’s behavior much more than we might think. For instance, we often say of surveillance “if you have done nothing wrong, you have nothing to hide”—a bit of meaningless bravado. What does this latter attitude—“I don’t have anything to worry about”—cause in terms of security?
Several attempts at researching this phenomenon have come to the same conclusion: average users will often intentionally not use things they see someone they perceive as paranoid using. According to this body of research, people will not use password managers because using one is perceived as being paranoid in some way. Theoretically, this effect is caused by illusory correlation, where people associate an action with a kind of person (only bad/scared people would want to carry a weapon). Since we don’t want to be the kind of person we associate with that action, we avoid the action—even though it might make sense.
This is just the flip side of fear sells, of course. Just like we overestimate the possibility of a terrorist attack impacting our lives in a direct, personal way, we also underestimate the possibility of more mundane things, like drowning in a tub, because we either think can control it, or because we don’t think we’ll be targeted in that way, or because we want to signal to the world that we “aren’t one of those people.”
Even knowing this is true, however, how can we counter this? How can we convince people to learn to assess risks rationally, rather than emotionally? How can we convince people that the perception of control should not impact your assessment of personal security or safety?
Simplifying design and use of the systems we build would be one—perhaps not-so-obvious—step we can take. The more security is just “automatic,” the more users will become accustomed to deploying security in their everyday lives. Another thing we might be able to do is stop trying to scare people into using these technologies.
In the meantime, just be aware that if you’re an engineer, your use of a technology “as an example” to others can backfire, causing people to not want to use those technologies.
