WRITTEN
The Insecurity of Ambiguous Standards

Why are networks so insecure?
One reason is we don’t take network security seriously. We just don’t think of the network as a serious target of attack. Or we think of security as a problem “over there,” something that exists in the application realm, that needs to be solved by application developers. Or we think the consequences of a network security breach as “well, they can DDoS us, and then we can figure out how to move load around, so if we build with resilience (enough redundancy) we’re already taking care of our security issues.” Or we put our trust in the firewall, which sits there like some magic box solving all our problems.
The problem is–none of this is true. In any system where overall security is important, defense-in-depth is the key to building a secure system. No single part of the system bears the “primary responsibility” for “security.” The network is certainly a part of any defense-in-depth scheme that is going to work.
Which means network protocols need to be secure, at least in some sense, as well. I don’t mean “secure” in the sense of privacy—routes are not (generally) personally identifiable information (there are always exceptions, however). But rather “secure” in the sense that they cannot be easily attacked. On-the-wire encryption should prevent anyone from reading the contents of the packet or stream all the time. Network devices like routers and switches should be difficult to break in too, which means the protocols they run must be “secure” in the fuzzing sense—there should be no unexpected outputs because you’ve received an unexpected input.
I definitely do not mean path security of any sort. Making certain a packet (or update or whatever else) has followed a specified path is a chimera in packet switched networks. It’s like trying to nail your choice of multicolored gelatin desert to the wall. Packet switched networks are designed to adapt to changes in the network by rerouting traffic. Get over it.
So why are protocols and network devices so insecure? I recently ran into an interesting piece of research that provides some of the answer. To wit—
In other words, ambiguous language leads to ambiguous implementations which leads to security flaws in protocols.
The solution for this situation might be just this—specify protocols more rigorously. But simple solutions rarely admit reality within their scope. It’s easy to build more precise specifications—so why aren’t our specifications more precise?
In a word: politics.
For every RFC I’ve been involved in drafting, reviewing, or otherwise getting through the IETF, there are two reasons for each MAY or SHOULD therein. The first is someone has thought of a use-case where an implementor or operator might want to do something that would be otherwise not allowed by MUST. In these cases, everyone looks at the proposed MAY or SHOULD, thinks about how not doing it might be useful, and then thinks … “this isn’t so bad, the available functionality is a good thing, and there’s no real problem I can see with making this a MAY or SHOULD.” In other words, we can think of possible worlds where someone might want to do something, so we allow them to do it. Call this the “freedom principle.”
The second reason is that multiple vendors have multiple customers who want to do things different ways. When the two vendors clash in the realm of standards, the result is often a set of interlocking MAYs and SHOULDs that allow two implementors to build solutions that are interoperable in the main, but not along the edges, that satisfy both of their existing customer’s requirements. Call this the “big check principle.”
The problem with these situations is—the specification has an undetermined set of MAYs and SHOULDs that might interlock in unforeseen ways, resulting in unanticipated variances in implementations that ultimately show up as security holes.
Okay—now that I’ve described the problem, what can you do about it? One thing is to simplify. Stop putting everything into a small set of protocols. The more functionality you pour into a protocol or system, the harder it is to secure. Complexity is the enemy of security (and privacy!).
As for the political problems, these are human-scale, which means they are larger than any network you can ever build—but I’ll ponder this more and get back to you if I come up with any answers.
It is Always Something (RFC1925, Rule 7)
While those working in the network engineering world are quite familiar with the expression “it is always something!,” defining this (often exasperated) declaration is a little trickier. The wise folks in the IETF, however, have provided a definition in RFC1925. Rule 7, “it is always something,” is quickly followed with a corollary, rule 7a, which says: “Good, Fast, Cheap: Pick any two (you can’t have all three).”
You can either quickly build a network which works well and is therefore expensive, or take your time and build a network that is cheap and still does not work well, or… Well, you get the idea. There are many other instances of these sorts of three-way tradeoffs in the real world, such as the (in)famous CAP theorem, which states a database can be consistent, available, and partitionable (or partitioned). Eventual consistency, and problems from microloops to surprise package deliveries (when you thought you ordered one thing, but another was placed in your cart because of a database inconsistency) have resulted. Another form of this three-way tradeoff is the much less famous, but equally true, state, optimization, surface tradeoff trio in network design.
It is possible, however, to build a system which fails at all three measures—a system which is expensive, takes a long time to build, and does not perform well. The fine folks at the IETF have provided examples of such systems.
For instance, RFC1149 describes a system of transporting IPv4 packets over avian carriers, or pigeons. This is particularly useful in areas where electricity and network cabling are not commonly found. To quote the relevant part of the RFC:
The IP datagram is printed, on a small scroll of paper, in hexadecimal, with each octet separated by whitestuff and blackstuff. The scroll of paper is wrapped around one leg of the avian carrier. A band of duct tape is used to secure the datagram’s edges. The bandwidth is limited to the leg length.
The specification of duct tape is quite odd, however; perhaps the its water-resistant properties are required for this mode of transport. This mode of transport has been adapted for use with more the more modern (in relative terms only!) IPv6 transport in RFC6214. For situation where quality of service is critical, RFC2549 describes quality of service extensions to the transport mechanism.
To further prove it is possible to build a network which is slow, expensive, and does not work well, RFC4824 describes the transmission of packets through semaphore flag signaling, or SFSS. Commonly used to signal between two ships when no other form of communication is available, SFSS consists of a sender who positions (waves) flags of particular shape and color in specific positions to signal individual characters. Rather than transmitting letters or other characters, RFC4824 describes using these flags to transmit 0’s, 1’s, and the framing elements required to transmit an IPv4 datagram over long distances. The advantage of such a system is, much like the avian carrier, that it will operate where there is no electricity. The disadvantage is the cost of encoding and decoding the packet’s contents may be many orders of magnitude more difficult than using existing SFSS specifications to signal messages.
Finally, RFC7511 provides an option which allows the most use of resources on any network while providing the slowest possible network performance by allowing senders to include a scenic route header on IPv6 packets. This header notifies routers and other networking devices that this packet would like to take the scenic route to its destination, which will cause paths including avian carriers (as described in RFC6214) to be chosen over any other available path.
Slow Learning and Range
Jack of all trades, master of none.
This singular saying—a misquote of Benjamin Franklin (more on this in a moment)—is the defining statement of our time. An alternative form might be the fox knows many small things, but the hedgehog knows one big thing.
The rules for success in the modern marketplace, particularly in the technical world, are simple: start early, focus on a single thing, and practice hard.
But when I look around, I find these rules rarely define actual success. Consider my life. I started out with three different interests, starting jazz piano lessons when I was twelve, continuing music through high school, college, and for many years after. At the same time, I was learning electronics—just about everyone in my family is in electronic engineering (or computers, when those came along) in one way or another.
I worked as on airfield electronics for a few years in the US Air Force (one of the reasons I tend to be calm is I’ve faced death up close and personal multiple times, an experience that tends to center your mind), including RADAR, radio, and instrument landing systems. Besides these two, I was highly interested in art and illustration, getting to the point of majoring in art in college for a short time, and making a living doing commercial illustration for a time.
You might notice that none of this really has a lot to do with computer networking. That’s the point.
I once thought I was a bit of an anomaly in this—in fact, I’m a bit of an anomaly throughout my life, including coming rather late to deep philosophy and theology (perhaps a bit too late!).
After reading Range by David Epstein, it turns out I’m wrong. I’m not the exception, I’m the rule. My case is so common as to be almost trivial.
Epstein not only destroys the common view—start early, stay focused, and practice hard—with reasoning, he also gives so many examples of people who have succeeded because they “wandered around” for many years before settling into a single “thing”—and sometimes just never “settling” throughout their entire lives. People who experience many different things, experimenting with ideas, careers, and paths, have what Epstein calls range.
He gives several reasons for people with range succeeding. They learn how to fail fast, unlike those who are focused on succeeding at a single thing—he calls this “too much grit.” They also learn to think outside the box—they are not restricted by the “accepted norms” within any field of study. It also turns out that slower learning is much more effective, as shown by multiple experiments.
There are three warnings about becoming a person with range, however—the fox rather than the hedgehog, so-to-speak. First, it takes a long time. Slow learning is, after all slow. Second, range works best in a world full of specialist—like the world we live in right now. In a world full of generalists, specialists are likely to succeed more often than generalist. What is different stands out (both in bad and good ways, by the way). Third, people with range do better with wicked problems—problems that are not easily solved with repetition and linear thought.
Of course, computer networks are clearly wicked problems.
That original quote that bothers me so much? Franklin did not say: jack of all trades, master of non. Instead, he said: jack of all trades, master of one. What a difference a single letter makes.
Complexity Bites Back

What percentage of business-impacting application outages are caused by networks? According to a recent survey by the Uptime Institute, about 30% of the 300 operators they surveyed, 29% have experienced network related outages in the last three years—the highest percentage of causes for IT failures across the period.
A secondary question on the survey attempted to “dig a little deeper” to understand the reasons for network failure; the chart below shows the result.
We can be almost certain the third-party failures, if the providers were queried, would break down along the same lines. Is there a pattern among the reasons for failure?
Configuration change—while this could be somewhat managed through automation, these kinds of failures are more generally the result of complexity. Firmware and software failures? The more complex the pieces of software, the more likely it is to have mission-impacting errors of some kind—so again, complexity related. Corrupted policies and routing tables are also complexity related. The only item among the top preventable causes that does not seem, at first, to relate directly to complexity is network overload and/or congestion problems. Many of these cases, however, might also be complexity related.
The Uptime Institute draws this same lesson, though through a slightly different process, saying: “Networks are complex not only technically, but also operationally.”
For years—decades, even—we have talked about the increasing complexity of networks, but we have done little about it. Yes, we have automated all the things, but automation can only carry us so far in covering complexity up. Automation also adds a large dop of complexity on top of the existing network—sometimes (not always, of course!) automating a complex system without making substantial efforts at simplification is just like trying to put a fire out with a can of gas (or, in one instance I actually saw, trying to put out an electrical fire with a can of soda, with the predictable trip to the local hospital.
We are (finally) starting to be “bit hard” by complexity problems in our networks—and I suspect this is the leading edge of the problem, rather than the trailing edge.
Maybe it’s time to realize making every protocol serve every purpose in the network wasn’t a good idea—we now have protocols that are so complex that they can only be correctly configured by machines, and then only when you narrow the use case enough to make the design parameters intelligible.
Maybe it’s time to realize optimizing for every edge use case wasn’t a good idea. Sometimes it’s just better to throw resources at the problem, rather than throwing state at the control plane to squeeze out just one more ounce of optimization.
Maybe it’s time to stop building networks around “whatever the application developer can dream up.” To start working as a team with the application developers to build a complete system that puts complexity where it most makes sense, and divides complexity from complexity, rather than just assuming “the network can do that.”
Maybe it’s time to stop thinking we can automate our way out of this.
Maybe it’s time to lay our superhero capes down and just start building simpler systems.
You Can Always Add Another Layer of Indirection (RFC1925, Rule 6a)
Many within the network engineering community have heard of the OSI seven-layer model, and some may have heard of the Recursive Internet Architecture (RINA) model. The truth is, however, that while protocol designers may talk about these things and network designers study them, very few networks today are built using any of these models. What is often used instead is what might be called the Infinitely Layered Functional Indirection (ILFI) model of network engineering. In this model, nothing is solved at a particular layer of the network if it can be moved to another layer, whether successfully or not.
For instance, Ethernet is the physical and data link layer of choice over almost all types of physical medium, including optical and copper. No new type of physical transport layer (other than wireless) can succeed unless if can be described as “Ethernet” in some regard or another, much like almost no new networking software can success unless it has a Command Line Interface (CLI) similar to the one a particular vendor developed some twenty years ago. It’s not that these things are necessarily better, but they are well-known.
Ethernet, however, goes far beyond providing physical layer connectivity. Because many applications rely on using Ethernet semantics directly, many networks are built with some physical form of Ethernet (or something claiming to be like Ethernet), with IP on top of this. On top of the IP, there is some other transport protocol, such as VXLAN, UDP, GRE, or perhaps even MPLS over UDP. On top of these layers rides … Ethernet. On which IP runs. On which TCP or UDP, or potentially VXLAN runs. It turns out it is easier to add another layer of indirection to solve many of the problems caused by applications that expect Ethernet than it is to solve them with IP—or any other transport protocol. You’ve heard of turtles all the way down—today we have Ethernet all the way down.
Many other suggestions of this type have been made in network engineering and protocol design across the years, but none of them seem to have been as widely deployed as Ethernet over IP over Ethernet. For instance, RFC3252 notes it has always been difficult to understand the contents of Ethernet, IP, and other packets as they are transmitted from host to host. The eXtensible Markup Language (XML) is, on the other hand, designed to be both machine- and human-readable. A logical solution to the problem of unreadable packets, then, is to add another layer of indirection by formatting all packets, including Ethernet and IP, into XML. Once this is done, there would be no need for expensive or complex protocol analyzers, as anyone could simply capture packets off the wire and read them directly. Adding another layer, in this case, could save many hours of troubleshooting time, and generally reduce the cost of operating a network significantly.
Once the idea of adding another layer has been fully grasped, the range of problems which can be solved becomes almost limitless. Many companies struggle to find some way to provide secure remote access to their employees, contractors, and even customers. The systems designed to solve this problem are often complex, difficult to deploy, and almost impossible to troubleshoot. RFC5514, however, provides an alternate solution: simply layer an IPv6 transport stream on top of the social media networks everyone already uses. Everyone, after all, already has at least one social media account, and can already reach that social media account using at least one device. Creating an IPv6 stream across social media would provide universal cloud-based access to anyone who desires.
Such streams could even be encrypted to ensure the operators and users of the underlying social media network cannot see any private information transmitted across the IPv6 channel created in this way.
On Using the Right Word
A while back, I was sitting in a meeting where the presenter described switching from a “traditional, hierarchical data center fabric” to a spine-and-leaf (while drawing CLOS, in all capital letters, on the whiteboard). He pointed out that the spine-and-leaf design is simpler because it only has two tiers rather than three.
There is so much wrong with this I almost winced in physical pain. Traditional hierarchical designs are not fabrics. Spine-and-leaf fabrics are not CLOS, but Clos, fabrics. Clos fabrics have three stages, not two—even if we draw them “folded” so you only see two apparent levels to the fabric. In fact, all spine-and-leaf fabrics always have an odd number of stages, and they are stages, not tiers.
More recently, I heard someone talking about an operating system that was built using microservices. I thought—“that would be at neat trick.” To build something with microservices does not just mean a piece of software using modules—this would be modular application (or operating system) design. Microservices architectures break the application up into the most basic components possible and then scale each kind of component out (rather than up) by spinning new copies of each service as needed. I cannot imagine scaling an operating system out by spinning multiple copies of the same service, and then providing some sort way to spread load across the various copies. Would you have some sort of anycast IPC? An internal DNS server or load balancer?
You can have an OS that natively participates in a larger microservices-based architecture, but what would microservices within the operating system look like, precisely?
Maybe my recent studies in philosophy make me much more attuned to the way we use language in the network engineering world—or maybe I’m just getting old. Whatever it is, our determination to make every word mean everything is driving me nuts.
What is the difference between a router and a switch? There used to be a simple definition—routers rewrite the L2 header and switches don’t. But now that routers switch packets, and switches route packets, the only difference seems to be … buffer depth? Feature set? The line between router and switch is fuzzy to the point of being meaningless, leaving us with no real term to describe a real switch any longer (a device that doesn’t do routing).
What about software defined networks? We’ve been treated to software defined everything now, of course. And intent? I get the point of intent, but we’re already moving down the path of making the meaning so broad that it can even contain configuring the CLI on an old AGS+. And don’t get me started on artificial intelligence, which is often learned to describe something closer to machine learning. Of course machine learning is often used to describe things that are really nothing more than statistical inference.
Maybe it’s time for a general rebellion against the sloppy use of language in network engineering. Or maybe I’m just tilting at yet another windmill. Wake me up when we’ve gotten to the point that we can use any word interchangeably with any other word in the network engineering dictionary. I await the AI that routes packets by reading your mind (through intent) called a swouter… or something.
Rethinking BGP on the DC Fabric (part 5)
BGP is widely used as an IGP in the underlay of modern DC fabrics. This series argues this is not the best long-term solution to the problem of routing in fabrics because BGP is not ideal for this use case. This post will consider the potential harm we are doing to the larger Internet by pressing BGP into a role it was not originally designed to fulfill—an underlay protocol or an IGP.
My last post described the kinds of configuration required to make BGP work on a DC fabric—it turns out that the configuration of each BGP speaker on the fabric is close to unique. It is possible to automate configuring each speaker—but it would be better if we could get closer to autonomic operation.
To move BGP closer to autonomic operation in a DC fabric, there are several things we can do. First, we can allow a BGP speaker to peer with any other BGP speaker it receives an open message from—this is often called promiscuous mode. While each router in the fabric will still need to be configured with the right autonomous system, at least we won’t need to configure the correct peers on each router (including the remote AS).
Note, however, that using this kind of promiscuous peering does come with a set of tradeoffs (if you’re reading this blog, you know there will be tradeoffs). BGP speakers running in promiscuous mode open a large attack surface on the control plane of the network. We can close this attack surface by configuring authentication on all BGP speakers … but we are now adding complexity to reduce complexity. We could also reduce the scope of the attack surface by never permitting BGP to peer beyond a single hop, and then filtering all BGP packets at the fabric edge. Again, just a bit more complexity to manage—but remember that the road to highly fragile and complex systems is always paved with individual steps that never, on their own, seem to add “too much complexity.”
The second thing we can do to move BGP closer to autonomic operation is to advertise routes to every connected peer without any policy configured. This does, again, introduce some tradeoffs, particularly in the realm of security, but let’s leave that aside for the moment.
Assume we can create a version of BGP that has these modifications—it always accepts any peer from any other AS, and it advertises all routes without any policy configured. Put these features behind a single knob which also includes setting the MRAI to 0 or 1, tightens up the dampening parameters, and adjusts a few other things to make BGP work better in a DC fabric.
As an experiment, let’s enable this DC fabric knob on a BGP speaker at the edge of a dual-homed “enterprise customer.” What will happen?
The enterprise network will automatically peer to any speaker that sends an open message—a huge security hole on the open Internet—and it will advertise every route it learns even though there is no policy configured. This second issue—advertising routes with no policy configured—can cause the enterprise network to become a transit between two much larger provider networks, crashing out some small corner of the Internet.
This might seem like a trivial issue. After all, just don’t ever enable the DC fabric knob on an eBGP peering session upstream into the DFZ, or any other “real” internetwork. Sure, and just don’t ever hit the brakes when you mean to hit the accelerator, or the accelerator when you mean to hit the brakes. If I had a dime for every time we “just don’t ever make that mistake …” Well, I wouldn’t be blogging, I’d be relaxing in the sun someplace (okay, I’m not likely to ever stop working to sit around and “relax” all the time, but you get the picture anyway).
Maybe—just maybe—it would really be better overall to use two different protocols for IGP and EGP work. Maybe—just maybe—it’s better not to mix these two different kinds of functions in a single protocol. Not only is the single resulting protocol bound to be really complex (most BGP implementations are now over 100,000 lines of code, after all), but it will end up being really easy to make really bad mistakes.
No tool is omnicompetent. If you found a tool that was, in fact, omnicompetent, it would also be the most dangerous tool in your toolbox.
