A long time ago, I worked in a secure facility. I won’t disclose the facility; I’m certain it no longer exists, and the people who designed the system I’m about to describe are probably long retired. Soon after being transferred into this organization, someone noted I needed to be trained on how to change the cipher door locks. We gathered up a ladder, placed the ladder just outside the door to the secure facility, popped open one of the tiles on the drop ceiling, and opened a small metal box with a standard, low security key. Inside this box was a jumper board that set the combination for the secure door.
First lesson of security: there is (almost) always a back door.
I was reminded of this while reading a paper recently published about a backdoor attack on certificate authorities. There are, according to the paper, around 130 commercial Certificate Authorities (CAs). Each of these CAs issue widely trusted certificates used for everything from TLS to secure web browsing sessions to RPKI certificates used to validate route origination information. When you encounter these certificates, you assume at least two things: the private key in the public/private key pair has not been compromised, and the person who claims to own the key is really the person you are talking to. The first of these two can come under attack through data breaches. The second is the topic of the paper in question.
How do CAs validate the person asking for a certificate actually is who they claim to be? Do they work for the organization they are obtaining a certificate for? Are they the “right person” within that organization to ask for a certificate? Shy of having a personal relationship with the person who initiates the certificate request, how can the CA validate who this person is and if they are authorized to make this request?
They could do research on the person—check their social media profiles, verify their employment history, etc. They can also send them something that, in theory, only that person can receive, such as a physical letter, or an email sent to their work email address. To be more creative, the CA can ask the requestor to create a small file on their corporate web site with information supplied by the CA. In theory, these electronic forms of authentication should be solid. After all, if you have administrative access to a corporate web site, you are probably working in information technology at that company. If you have a work email address at a company, you probably work for that company.
These electronic forms of authentication, however, can turn out to be much like the small metal box which holds the jumper board that sets the combination just outside the secure door. They can be more security theater than real security.
In fact, the authors of this paper found that some 70% of the CAs could be tricked into issuing a certificate for just about any organization—by hijacking a route. Suppose the CA asks the requestor to place a small file containing some supplied information on the corporate web site. The attacker creates a web server, inserts the file, hijacks the route to the corporate web site so it points at the fake web site, waits for the authentication to finish, and then removes the hijacked route.
The solution recommended in this paper is for the CAs to use multiple overlapping factors when authenticating a certificate requestor—which is always a good security practice. Another solution recommended by the authors is to monitor your BGP tables from multiple “views” on the Internet to discover when someone has hijacked your routes, and take active measures to either remove the hijack, or at least to detect the attack.
These are all good measures—ones your organization should already be taking.
But the larger point should be this: putting a firewall in front of your network is not enough. Trusting that others will “do their job correctly,” and hence that you can trust the claims of certificates or CAs, is not enough. The Internet is a low trust environment. You need to think about the possible back doors and think about how to close them (or at least know when they have been opened).
Having personal relationships with people you do business with is a good start. Being creative in what you monitor and how is another. Firewalls are not enough. Two-factor authentication is not enough. Security is systemic and needs to be thought about holistically.
There are always back doors.
Privacy problems are an area of wide concern for individual users of the Internet—but what about network operators? In this issue of The Internet Protocol Journal, Geoff Huston has an article up about privacy in DNS, and the various attempts to make DNS private on the part of the IETF—the result can be summarized with this long, but entertaining, quote:
The Internet is largely dominated, and indeed driven, by surveillance, and pervasive monitoring is a feature of this network, not a bug. Indeed, perhaps the only debate left today is one over the respective merits and risks of surveillance undertaken by private actors and surveillance by state-sponsored actors. … We have come a very long way from this lofty moral stance on personal privacy into a somewhat tawdry and corrupted digital world, where “do no evil!” has become “don’t get caught!”
Before diving into a full-blown look at the many problems with DNS security, it is worth considering what kinds of information can leak through the DNS system. Let’s ignore the recent discovery that DNS queries can be used to exfiltrate data; instead, let’s look at more mundane data leakage from DNS queries.
For instance, say you work in a marketing department for a company that is just about to release a new product. In order to build the marketing and competitive materials your sales critters will need to stand in front of customers, you do a lot of research around competitor products. In the process, you examine, in detail, each of the competing product’s pages. Or perhaps you work in a company that is determining whether or another purchasing or merging with another company might be a good idea. Or you are working on a new externally facing application, or component in an existing application, that relies on a new connection point into your network.
All of these processes can lead to a lot of DNS queries. For someone who knows what they are looking for, the pattern of queries may be enough to examine strings queried from search engines and other information, ultimately leading to someone being able to guess a lot about that new product, what company your company is thinking about buying or merging with, what your new application is going to do, etc. DNS is a treasure trove of information at a personal and organizational level.
Operators and protocol designers have been working for years to resolve these problems, making DNS queries “more private;” Geoff Huston’s article provides a good overview of many of these attempts. DNS over HTTPS (DoH), a recent (and ongoing) attempt bears a closer look.
DNS is normally sent “in plain text” over the network; anyone who can capture the packets can read not only the query, but also the responses. The simplest way to solve this problem is to encrypt the DNS data in flight using something like TLS—hence DoT, or DNS over TLS. One problem with DoT is it is carried over a unique port number, which means it is probably blocked by default by most packet filters, and can easily be blocked by administrators who either do not know what this traffic is, or do not want it on their network. To solve this, DoH carries TLS encrypted traffic in a way that makes it look just like an HTTPS session. If you block DoH traffic, you will also block access to web servers running HTTPS. This is the logical “end” of carrying everything else over HTTPS to avoid the impact of stateful and stateless packet filters and the impact of middle boxes on Internet traffic.
The good result is, in fact, that DNS traffic can no longer be “spied on” by anyone outside servers in the DNS system itself. Whether or not this is “enough” privacy is a matter of conjecture, however. Servers within the DNS system can still collect information about what queries you are making; if the server has access to other information about you or your organization, combining this data into a profile, or using it to determine some deeper investigation is warranted by looking at other sources of data, is pretty simple. Ultimately, DoH is only really useful if you trust your DNS provider.
Do you? Perhaps more importantly—should you?
DNS providers are like any other business; they must buy hardware, connectivity, and the time of smart people who can make the system work, troubleshoot the system when it fails, and think about ways of improving the system. If the service is free…
DoH, however, has another problem Geoff outlines in his article—DNS is moved up the stack so it no longer runs over TCP and UDP directly, but rather it runs over HTTPS. This means local applications, like browsers, can run DNS queries independently of the operating system. In fact, because these queries are TLS encrypted, the operating system itself cannot even “see” the contents of these DNS queries. This might be a good thing—or might be a bad thing. If nothing else, it means the browser, or any other application, can choose to use a resolver not configured by the local operating system. A browser maker, for instance, can direct their browser to send all DNS queries made within the browser to their DNS server, exposing another source of information about users (and the organizations they work for).
Remember that time you typed an internal hostname incorrectly in your browser? Thankfully, you had a local DNS server configured, so the query did not go out to a resolver on the Internet. With DoH, the query can go out to an open resolver on the Internet regardless of how your local systems are configured. Something to ponder.
The bottom line is this—the nature of DNS makes it extremely difficult to secure. Somehow you have to have someone operate, and pay for, an open database of names which translate to addresses. Somehow you have to have a protocol that allows this database to be queried. All of these “somehows” expose information, and there is no clear way to hide that information. You can solve parts of the problem, but not the whole problem. Solving one part of the problem seems to make another part of the problem worse.
If you haven’t found the tradeoff, you haven’t looked hard enough.
In the end, though, the privacy of DNS queries at a personal and organizational level is something you need to think about.
sarcasm warning—take the following post with a large grain of salt
A thousand years from now, when someone is writing the history of computer networks, one thing they will notice—at least I think they will—is how we tend to reduce our language so as many terms as possible have precisely the same meaning. They might attribute this to marketing, or the hype cycle, or… but whatever the cause this is clearly a trend in the networking world. Some examples might be helpful, so … forthwith, the reduced terminology of the networking world.
Software Defined Networking (SDN): Used to mean a standardized set of interfaces that enabled open access to the forwarding hardware. Came to mean some form of control plane centralization. Now means automated configuration and management of network devices, centralized control planes, traffic engineering, and just about anything else that seems remotely related to these.
Fabric: Used to mean a regular, non-planar, repeating network topology with scale-out characteristics. Now means any vaguely hierarchical topology (not a ring) with a lot of links.
DevOps: Used to mean applying software development processes to the configuration, operation, and troubleshooting of server and network devices. Now means the same thing as SDN.
Clos: Used to mean a three stage fabric in which every device in a prior stage is connected to every device in the next stage, all devices have the same number of ports, all traffic is east/west, and having a scale-out characteristics. Now means the same thing as fabric, and is spelled CLOS because—aren’t all four letter words abbreviations? Now external links are commonly attached to the “core” of the Clos, because… well, it kindof looks hierarchical, after all.
Hierarchical Design: Used to mean a network design with a modular layered design, and specific functions tied to each layer of the network. Generally there were two or three layers, with clear failure domain separation through aggregation and summarization of control plane information. Now means the same thing as fabric.
Cloud: Used to mean the centralization and abstraction of resources to support agile development strategies. Now means… well… the meaning is cloudy at this time, but generally applied to just about anything. Will probably end up meaning the same thing as DevOps, SDN, and fabric.
Network Topology: Used to mean a description of the interconnection system used in building a network. Some kinds of topologies were hub-and-spoke, ring, partial mesh, Clos, Benes, butterfly, full mesh, etc. Now means the same as fabric.
Routing Protocol: Used to mean the protocol, including the semantics and algorithm or heuristic, used to calculate the set of loop-free paths through a network. Includes instances such as IS-IS, EIGRP, and OSPF. Now means BGP, as this is the only protocol used in any production network (except SDN).
Router: Used to mean a device that determines the next hop to which the packet should be forwarded using the layer 3 address, replacing the layer 2 header in the process of forwarding the packet. Now means the same thing as a switch.
Switch: Used to mean a device which determined which port through which a packet should be forwarded based on the layer 2 header, did not modify the packet, etc. Now means any device that forwards packets; has generally replaced “router.”
Security: Used to mean thinking through attack surfaces, understanding protocols and their operation, and how to build a system that is difficult to attack. Now means inserting a firewall into the network.
We used to have a rich set of terms we could use to describe different kinds of topologies, devices, and ways of building networks. We seem to want to insist on merging as many terms as possible so they all mean the same thing; we are quickly reducing ourselves to fabric, switch, SDN, and cloud to describe everything.
Which makes me wonder sometimes—what are they teaching in network engineering classes now-a-days?
The Internet, and networking protocols more broadly, were grounded in a few simple principles. For instance, there is the end-to-end principle, which argues the network should be a simple fat pipe that does not modify data in transit. Many of these principles have tradeoffs—if you haven’t found the tradeoffs, you haven’t looked hard enough—and not looking for them can result in massive failures at the network and protocol level.
Another principle networking is grounded in is the Robustness Principle, which states: “Be liberal in what you accept, and conservative in what you send.” In protocol design and implementation, this means you should accept the widest range of inputs possible without negative consequences. A recent draft, however, challenges the robustness principle—draft-iab-protocol-maintenance.
According to the authors, the basic premise of the robustness principle lies in the problem of updating older software for new features or fixes at the scale of an Internet sized network. The general idea is a protocol designer can set aside some “reserved bits,” using them in a later version of the protocol, and not worry about older implementations misinterpreting them—new meanings of old reserved bits will be silently ignored. In a world where even a very old operating system, such as Windows XP, is still widely used, and people complain endlessly about forced updates, it seems like the robustness principle is on solid ground in this regard.
The argument against this in the draft is implementing the robustness principle allows a protocol to degenerate over time. Older implementations are not removed from service because it still works, implementations are not updated in a timely manner, and the protocol tends to have an ever-increasing amount of “dead code” in the form of older expressions of data formats. Given an infinite amount of time, an infinity number of versions of any given protocol will be deployed. As a result, the protocol can and will break in an infinite number of ways.
The logic of the draft is something along the lines of: old ways of doing things should be removed from protocols which are actively maintained in order to unify and simplify the protocol. At least for actively maintained protocols, reliance on the robustness principle should be toned down a little.
Given the long list of examples in the draft, the authors make a good case.
There is another side to the argument, however. The robustness principle is not “just” about keeping older versions of software working “at scale.” All implementations, no matter how good their quality, have defects (or rather, unintended features). Many of these defects involve failing to release or initialize memory, failing to bounds check inputs, and other similar oversights. A common way to find these errors is to fuzz test code—throw lots of different inputs at it to see if it spits up an error or crash.
The robustness principle runs deeper than infinite versions—it also helps implementations deal with defects in “the other end” that generate bad data. The robustness principle, then, can help keep a network running even in the face of an implementation defect.
Where does this leave us? Abandoning the robustness principle is clearly not a good thing—while the network might end up being more correct, it might also end up simply not running. Ever. The Internet is an interlocking system of protocols, hardware, and software; the robustness principle is the lubricant that makes it all work at all.
Clearly, then, there must be some sort of compromise position that will work. Perhaps a two pronged attack might work. First, don’t discard errors silently. Instead, build logging into software that catches all errors, regardless of how trivial they might seem. This will generate a lot of data, but we need to be clear on the difference between instrumenting something and actually paying attention to what is instrumented. Instrumenting code so that “unknown input” can be caught and logged periodically is not a bad thing.
Second, perhaps protocols need some way to end of life older versions. Part of the problem with the robustness principle is it allows an infinite number of versions of a single protocol to exist in the same network. Perhaps the IETF and other standards organizations should rethink this, explicitly taking older ways of doing things out of specs on a periodic basis. A draft that says “you shouldn’t do this any longer,” or “this is no longer in accordance with the specification,” would not be a bad thing.
For the more “average” network engineer, this discussion around the robustness principle should lead to some important lessons, particularly as we move ever more deeply into an automated world. Be clear about versioning of APIs and components. Deprecate older processes when they should no longer be used.
Control your technical debt, or it will control you.
There is a rule in sports and music about practice—the 10,000 hour rule—which says that if you want to be an expert on something, you need ten thousand hours of intentional practice. The corollary to this rule is: if you want to be really good at something, specialize. In colloquial language, you cannot be both a jack of all trades and a master of one.
Translating this to the network engineering world, we might say something like: it takes 10,000 hours to really know the full range of products from vendor x and how to use them. Or perhaps: only after you have spent 10,000 hours of intentional study and practice in building data center networks will you know how to build these things. We might respond to this challenge by focusing our studies and time in one specific area, gaining one series of certifications, learning one vendor’s gear, or learning one specific kind of work (such as design or troubleshooting).
This line of thinking, however, should immediately raise two questions. First, is it true? Anecdotal evidence seems to abound for this kind of thinking; we have all heard of the child prodigy who spent their entire lives focusing on a single sport. We also all know of people who have “paper skills” instead of “real skills;” the reason we often attribute to this is they have not done enough lab work, or they have not put in hours configuring, troubleshooting, or working on the piece of gear in question. Second, is it healthy for the person or the organization the person works for?
To make matters worse, we often see this show p in the job hunting process. The manager wants someone who can “hit the ground running” on this project, using this piece of equipment, and they want them on board and working tomorrow. In response, we see job descriptions and recruiting drives for specific skill sets, down to individual hardware and software.
I recently ran across two articles that push back on this 10,000 hours10,000 rule way of learning does not work.
Over time, as I delved further into studies about learning and specialisation, I came across more and more evidence that it takes time to develop personal and professional range – and that there are benefits to doing so. I discovered research showing that highly credentialed experts can become so narrow-minded that they actually get worse with experience, even while becoming more confident (a dangerous combination). And I was stunned when cognitive psychologists I spoke with led me to an enormous and too-often ignored body of work demonstrating that learning itself is best done slowly to accumulate lasting knowledge, even when that means performing poorly on tests of immediate progress. That is, the most effective learning looks inefficient – it looks like falling behind.
Re-read that last sentence—what turns out to be the most effective learning strategy often looks just like falling behind. Another recent article pointed out that deep expertise seems to be losing its sway in many workplaces. The author spends time around a new United States Navy littoral ship, which are designed to operate with much smaller crews—one-half to one-third of a comparably sized ship staffed in the traditional way. How do these ships operate? By cross training crew members to be able to do many different tasks.
One of the interesting things this latter article points out is this ability to do many different tasks requires fluid intelligence, which is a completely different set of skills than crystallized intelligence. Fluid intelligence, it seems, becomes stronger over time, peaking much later in life. While the article does not discuss how to develop the kind of fluid intelligence that will serve you well later in life, when this kind of thinking overtakes your narrower skill sets, it makes sense that building a broader set of skills over time is a more likely path than following the 10,000 hour rule.
There is, however, one question that neither author spends a lot of time discussing: if you are not focusing on learning one thing, then how, and on what, should you focus your time spent learning on? For the top athletes in the sports article, it seems like they spent a lot of time in different kinds of physical activity. There was an area of focus, but it was not the kind of narrow focus we normally associate with being excellent at one sport. In the same way, the sailors in the second article were all focused in a broader area—anything required to run a ship. Again, there is focus, but not the kind of narrow focus you might have expect on more standard boats, where one set of sailors just focus on working the lines, while another just focus on navigating, etc. The focus is still there, then—it is just a broader focus.
Why and how does this work? My guess is it works because the skills you learn in dancing, for instance, will help you learn better footwork in boxing and other sports (an example given in the sports article linked above). The skill you learn in handling the lines will help you understand the lay and movement of the boat in ways that are helpful in navigation. These skills, in other words, are somewhat adjacent.
But these skills are more than adjacent. Many of them are also basic, or theoretical, in ways we do not value in the network engineering world. The point I often hear made is: I don’t care about how BGP really works, so long as I can write a script that configures it, and I can troubleshoot it when it breaks. Or: I actually work on vendor x model 1234 all day, what I really need to know to be effective is how to configure it… when I need to replace that piece of gear, I will learn the next one so I can keep doing my job.
My point is this: this way of building skills, this way of working, does not “work” in the long term. There will come a point in your life, and in the life of your company, when point skills will weaken and lose their importance. The research, and experience, shows the better way to learn is to take on the long game, to learn the theory, and to practice the theory in many different settings, rather than focusing too deeply on one thing.
Floating point is not something many network engineers think about. In fact, when I first started digging into routing protocol implementations in the mid-1990’s, I discovered one of the tricks you needed to remember when trying to replicate the router’s metric calculation was always round down. When EIGRP was first written, like most of the rest of Cisco’s IOS, was written for processors that did not perform floating point operations. The silicon and processing time costs were just too high.
What brings all this to mind is a recent article on the problems with floating point performance over at The Next Platform by Michael Feldman. According to the article:
While most programmers use floating point indiscriminately anytime they want to do math with real numbers, because of certain limitations in how these numbers are represented, performance and accuracy often leave something to be desired.
For those who have not spent a lot of time in the coding world, a floating point number is one that has some number of digits after the decimal. While integers are fairly easy to represent and calculate over in the binary processors use, floating point numbers are much more difficult, because floating point numbers are very difficult to represent in binary. The number of bits you have available to represent the number makes a very large difference in accuracy. For instance, if you try to store the number
101.1 in a
float, you will find the number stored is actually
101.099998 To store
101.1, you need a
double, which is twice as long as a
Okay—this is all might be fascinating, but who cares? Scientists, mathematicians, and … network engineers do, as a matter of fact. Fist, carrying around
double floats to store numbers with higher precision means a lot more network traffic. Second, when you start looking at timestamps and large amounts of telemetry data, the efficiency and accuracy of number storage becomes a rather big deal.
Okay, so the current floating point storage format, called IEEE754, is inaccurate and rather inefficient. What should be done about this? According to the article, John Gustafson, a computer scientist, has been pushing for the adoption of a replacement called posits. Quoting the article once again:
It does this by using a denser representation of real numbers. So instead of the fixed-sized exponent and fixed-sized fraction used in IEEE floating point numbers, posits encode the exponent with a variable number of bits (a combination of regime bits and the exponent bits), such that fewer of them are needed, in most cases. That leaves more bits for the fraction component, thus more precision.
Did you catch why this is more efficient? Because it uses a variable length field. In other words, posits replaces a fixed field structure (like what was originally used in OSPFv2) with a variable length field (like what is used in IS-IS). While you must eat some space in the format to carry the length, the amount of "unused space" in current formats overwhelms the space wasted, resulting in an improvement in accuracy. Further, many numbers that require a
double today can be carried in the size of a
float. Not only does using a TLV format increase accuracy, it also increases efficiency.
From the perspective of the State/Optimization/Surface (SOS) tradeoff, there should be some increase in complexity somewhere in the overall system—if you have not found the tradeoffs, you have not looked hard enough. Indeed, what we find is there is an increase in the amount of state being carried in the data channel itself; there is additional state, and additional code that knows how to deal with this new way of representing numbers.
It's always interesting to find situations in other information technology fields where discussions parallel to discussions in the networking world are taking place. Many times, you can see people encountering the same design tradeoffs we see in network engineering and protocol design.
Over at the Communications of the ACM, Micah Beck has an article up about the hourglass model. While the math is quite interesting, I want to focus on transferring the observations from the realm of protocol and software systems development to network design. Specifically, start with the concept and terminology, which is very useful. Taking a typical design, such as this—
The first key point made in the paper is this—
The thin waist of the hourglass is a narrow straw through which applications can draw upon the resources that are available in the less restricted lower layers of the stack.
A somewhat obvious point to be made here is applications can only use services available in the spanning layer, and the spanning layer can only build those services out of the capabilities of the supporting layers. If fewer applications need to be supported, or the applications deployed do not require a lot of “fancy services,” a weaker spanning layer can be deployed. Based on this, the paper observes—
The balance between more applications and more supports is achieved by first choosing the set of necessary applications N and then seeking a spanning layer sufficient for N that is as weak as possible. This scenario makes the choice of necessary applications N the most directly consequential element in the process of defining a spanning layer that meets the goals of the hourglass model.
Beck calls the weakest possible spanning layer to support a given set of applications the minimally sufficient spanning layer (MSSL). There is one thing that seems off about this definition, however—the correlation between the number of applications supported and the strength of the spanning layer. There are many cases where a network supports thousands of applications, and yet the network itself is quite simple. There are many other cases where a network supports just a few applications, and yet the network is very complex. It is not the number of applications that matter, it is the set of services the applications demand from the spanning layer.
Based on this, we can change the definition slightly: an MSSL is the weakest spanning layer that can provide the set of services required by the applications it supports. This might seem intuitive or obvious, but it is often useful to work these kinds of intuitive things out, so they can be expressed more precisely when needed.
First lesson: the primary driver in network complexity is application requirements. To make the network simpler, you must reduce the requirements applications place on the network.
There are, however, several counter-intuitive cases here. For instance, TCP is designed to emulate (or abstract) a circuit between two hosts—it creates what appears to be a flow controlled, error free channel with no drops on top of IP, which has no flow control, and drops packets. In this case, the spanning layer (IP), or the wasp waist, does not support the services the upper layer (the application) requires.
In order to make this work, TCP must add a lot of complexity that would normally be handled by one of the supporting layers—in fact, TCP might, in some cases, recreate capabilities available in one of the supporting layers, but hidden by the spanning layer. There are, as you might have guessed, tradeoffs in this neighborhood. Not only are the mechanisms TCP must use more complex that the ones some supporting layer might have used, TCP represents a leaky abstraction—the underlying connectionless service cannot be completely hidden.
Take another instance more directly related to network design. Suppose you aggregate routing information at every point where you possibly can. Or perhaps you are using BGP route reflectors to manage configuration complexity and route counts. In most cases, this will mean information is flowing through the network suboptimally. You can re-optimize the network, but not without introducing a lot of complexity. Further, you will probably always have some form of leaky abstraction to deal with when abstracting information out of the network.
Second lesson: be careful when stripping information out of the spanning layer in order to simplify the network. There will be tradeoffs, and sometimes you end up with more complexity than what you started with.
A second counter-intuitive case is that of adding complexity to the supporting layers in order to ultimately simplify the spanning layer. It seems, on the model presented in the paper, that adding more services to the spanning layer will always end up adding more complexity to the entire system. MPLS and Segment Routing (SR), however, show this is not always true. If you need traffic steering, for instance, it is easier to implement MPLS or SR in the support layer rather than trying to emulate their services at the application level.
Third lesson: sometimes adding complexity in a lower layer can simplify the entire system—although this might seem to be counter-intuitive from just examining the model.
The bottom line: complexity is driven by applications (top down), but understanding the full stack, and where interactions take place, can open up opportunities for simplifying the overall system. The key is thinking through all parts of the system carefully, using effective mental models to understand how they interact (interaction surfaces), and the considering the optimization tradeoffs you will make by shifting state to different places.