One of the great fears of server virtualization is the concern around copying information from one virtual machine, or one container, to another, through some covert channel across the single processor. This kind of channel would allow an attacker who roots, or otherwise is able to install software on, one of the two virtual machines to exfiltrate data to another virtual machine running on the same processor. There have been some successful attacks in this area in recent years, most notably Meltdown and Spectre. These defects have been patched by cloud providers, at some cost to performance, but new vulnerabilities are bound to be found over time. The paper I’m looking at this week explains a new attack of this form. In this case, the researchers use the processor’s cache to transmit data between two virtual machines running on the same physical core.
The processor cache is always very small for several reasons. First, the processor cache is connected to a special bus, which normally has limits in the amount of memory it can address. This special bus avoids reading data through the normal system bus, and this is (from a networking perspective) at least one hop, and often several hops, closer to the processor on the internal bus (which is essentially an internal network). Second, the memory used in the cache is normally faster than the main memory.
The question is: since caches are at the processor level, and multiple virtual processes therefore share the same cache, is it possible for one process to place information in the cache that another process can read? Since the cache is small and fast, it is used to store information that is accessed frequently. As processes, daemons, threads, and pthreads enter and exit, they access different parts of main memory, causing the contents of the cache to change rapidly. Because of this constant churn, many researchers have assumed you cannot build a covert channel through the cache in this way. In fact, there have been attempts in the past; each of these has failed.
The authors of this paper argue, however, these failures are not because building a covert channel through the cache is not possible, but rather because previous attempts at doing so have operated on bad assumptions, attempting to use standard error correction mechanisms.
The first problem with using standard error correction mechanisms is that entire sections of data can be lost due to a cache entry being deleted. Assume you have two processes running on a single processor; you would like to build a covert channel between these processes. You write some code that inserts information into the cache, ensuring it is written in a particular memory location. This is the “blind drop.” The second process now runs and attempts to read this information. Normally this would work, but between the first and second process running, the information in the cache has been overwritten by some third process you do not know about. Because the entire data block is gone, the second process, which is trying to gather the information from the blind drop location, cannot tell there was ever any information at the drop point at all. There is no data across which the error correction code can run, because the data has been completely overwritten.
A possible solution to this problem is to use something like a TCP window of one; the transmitter resends the information until it receives an acknowledgement of receipt. The problem with this solution, in the case of a cache, is that the sender and receiver have no way to synchronize their clocks. Hence there is no way to form any sense of a serialized channel between the two processes.
To overcome these problems, the researchers use techniques used in wireless networks to ensure reliable delivery over unreliable channels. For instance, they send each symbol (a 0 or a 1) multiple times, using different blind drops (or cache locations), such that the receiver can compare these multiple transmit instances, and decide what the sender intended. The broader the number of blind drops used, the more likely information is to be carried across the process divide through the cache, as there are very few instances where all the cache entries representing blind drops will be invalidated and replaced at once. The researchers increase the rate at which this newly opened covert channel can operate by reverse engineering some aspects of a particular model processor’s caching algorithm. This allows them to guess which lines of cache will be invalidated first, how the cache sets are arranged, etc., and hence to place the blind drops more effectively.
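The idea of repeating each symbol across several blind drops and letting the receiver compare the copies can be illustrated with a minimal sketch. This is not the researchers’ code; the function names and the noise model (each drop is independently misread with a fixed probability) are assumptions made purely for illustration.

```python
import random

def transmit(bit, copies):
    # One symbol is written into `copies` separate blind drops (cache locations).
    return [bit] * copies

def noisy_channel(drops, error_probability, rng):
    # Assumed noise model: cache churn causes each drop to be misread
    # independently with the given probability.
    return [d ^ 1 if rng.random() < error_probability else d for d in drops]

def decide(drops):
    # Majority vote across the drops; a tie is resolved as 0.
    return 1 if sum(drops) * 2 > len(drops) else 0

def send_message(bits, copies, error_probability, rng):
    # Push every bit through the noisy channel and vote on the result.
    return [decide(noisy_channel(transmit(b, copies), error_probability, rng))
            for b in bits]

if __name__ == "__main__":
    rng = random.Random(7)
    message = [1, 0, 1, 1, 0, 0, 1, 0]
    for copies in (1, 5, 15):
        received = send_message(message, copies, 0.2, rng)
        errors = sum(a != b for a, b in zip(message, received))
        print(f"{copies:2d} drops per symbol: {errors} bit errors")
```

Using an odd number of drops per symbol avoids ties in the vote; more drops trade throughput for reliability, which is the same tradeoff the paragraph above describes.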
By taking these steps, and using some strong error correction coding, the researchers created a 42K covert channel between two instances running in Amazon EC2. This might not sound like a lot, but it is faster than some of the fastest modems in use before DSL and other subscriber lines were widely available, and certainly fast enough to transfer a text-based file of passwords between two processes.
There will probably be some counter to this vulnerability in the future, but for now the main protection against this kind of attack is to prevent unknown or injected code from running on your virtual machines.
Disaggregation, in the form of splitting network hardware from network software, is often touted as a way to save money (as if network engineering were primarily about saving money, rather than adding value—but this is a different soap box). The primary connections between disaggregation and saving money are the ability to deploy white boxes, and the ability to centralize the control plane to simplify the network (think software defined networks here—again, whether or not both of these are true as advertised is a different discussion).
But drivers that focus on cost miss more than half the picture. A better way to frame disaggregation, and networks within the larger technology sphere, is through increased value. What drives value in network engineering? It’s often simplest to return to Tanenbaum’s example of the station wagon full of VHS backup tapes. To bring the example into more modern terms, it is difficult to beat the pure bandwidth of an overnight box full of USB thumb drives.
In this view, networks can primarily be seen as a sop to human impatience. They are a way to get things done more quickly. In the case of networks, quantity (speed) often becomes a form of quality (increased value).
But what does disaggregation have to do with speed? The connection is the open API.
When you disaggregate a network device into hardware and software, you necessarily create a stable, openly accessible API between the software and the hardware. Routing protocols, and other control plane elements, must be able to build a routing table that is somehow then passed on to the forwarding hardware, so packets can be forwarded through the network. A fortuitous side effect of this kind of open API is that anyone can use it to control the forwarding hardware.
Enter the new release of ScyllaDB. According to folks who test these things, and should know, ScyllaDB is much faster than Cassandra, another industry leading open source database system. How much faster? Five to ten times faster. A five- to ten-fold improvement in database performance is, potentially, a point of quantity that can easily have a different quality. How much faster could your business handle orders, or customer service calls, or many other things, if you could speed the database end of the problem up even five-fold? How many new ways of processing information, to gain insight about business operations, customers, and so on, would that open up?
How does Scylla provide these kinds of improvements over Cassandra? In the first place, the newer database system is written in a faster language, C++ rather than Java. Scylla also shards processing across processor cores more efficiently. It doesn’t rely on the page cache.
None of this has to do with network disaggregation, but there is one way the Scylla developers improved the performance of their database that does relate to network disaggregation: ScyllaDB writes directly to the network interface card using DPDK. The interesting point, from a network engineering point of view, is that this simply would not be possible without disaggregation between hardware and software opening up DPDK as an interface for a database to directly push packets to the hardware.
The side effects of disaggregation are only beginning to be felt in the network engineering world; the ultimate effects could reshape the way we think about application performance on the network, and the entire realm of network engineering.
While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like there is no single point of failure. This common answer, however, does not go far enough in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the time to converge can increase enough to offset the higher path availability.
In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.
This post will look at a paper that examines robustness more deeply, “Robustness in Complex Systems,” by Steven Gribble. While this is an older paper (it was written in 2000), it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster based Distributed Data Structure (DDS). A more convenient way for readers to think of this is as a distributed database. Several problems discovered when building and deploying this DDS are considered, including:
- A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After the record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coalesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is their initial “guess” at a good timer was ultimately an order of magnitude too small, causing some of the nodes to “fall behind” other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to “take up the slack,” causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in a distributed system.
- A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.
Gribble draws several lessons from these, and other, outages in the system.
First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.
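As a rough illustration of policing at the edge, here is a minimal token bucket sketch. It is not taken from Gribble’s paper or from any particular vendor implementation; the class name, the byte-based units, and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are all assumptions.

```python
class TokenBucket:
    """A single-rate policer: traffic beyond the configured rate is dropped."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens (bytes) added per second
        self.burst = burst      # bucket depth in bytes
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous admission check

    def admit(self, size, now):
        # Refill for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True         # packet conforms and is forwarded
        return False            # packet exceeds the contract and is dropped


if __name__ == "__main__":
    policer = TokenBucket(rate=1000, burst=1500)
    for size, now in [(1500, 0.0), (100, 0.0), (500, 0.5), (1500, 10.0)]:
        print(f"t={now:5.1f}s size={size:5d} admitted={policer.admit(size, now)}")
```

A shaper would queue the nonconforming packet rather than drop it; either way, the work offered to the rest of the system is bounded, which is the point of admission control.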
Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”
Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measurable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.
Fourth, Gribble says adaptivity must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.
This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.
Is the seven-layer OSI model really all that useful any longer? Before you answer, it’s worth listening to my latest short take over at the Network Collective.
Many network engineers find the entire world of telecom to be confusing—especially as papers are peppered with a lot of acronyms. If any part of the networking world is more obsessed with acronyms than any other, the telecom world, where the traditional phone line, subscriber access, and network engineering collide, reigns as the “king of the hill.”
Recently, while looking at some documentation for the CORD project, which stands for Central Office Rearchitected as a Data Center, I ran across an acronym I had not seen before: vOLT-HA. An acronym with a dash in the middle; impressive! But what is it, exactly? To get there, we must begin in the beginning, with a PON.
There are two kinds of optical networks in the world, Active Optical Networks (AONs), and Passive Optical Networks (PONs). The primary difference between the two is whether the optical gear used to build the network amplifies (or even electronically rebuilds, or repeats) the optical signal as it passes through. In AONs, optical signals are amplified, while in PONs, optical signals are not amplified. This means that in a PON, the optical equipment can be said to be passive, in that it does not modify the optical signal in any way. Why is this important? Because passive equipment is less complex, and does not require as much power to operate, so a PON is much less expensive to build and maintain than an AON. Hence a PON is often more economically realistic when serving a large number of customers, such as in providing service to residential or small office customers.
A PON uses optical splitters to divide out the signal among the various connected customers. Like any other shared bandwidth medium, every customer receives all the data on the downstream side, switching only traffic destined for the local network onto the copper (usually Ethernet) network beyond the optical termination point (called an OLT, or Optical Line Terminal). In a PON, the upstream signal is divided up into timeslots, so the system uses Time Division Multiplexing (TDM) to provide (a much slower) path from the end device into the provider’s network. As signals from each end device reach the splitters in the network, the path is reversed, and the splitter ends up becoming a power combiner, which means the signal can “gain power” on the way up towards the central office (CO). These kinds of systems are typically sold as Fiber to the Home, which is abbreviated FTTH (of course!).
Is your head dizzy yet? I hope not, because we are just getting started with the acronyms. 🙂
The Optical Line Terminal, or OLT, must reside in some piece of physical hardware, called an Optical Network Unit (ONU). The OLT, like a server, or an Ethernet port on a router or switch, can be virtualized, so multiple logical OLTs reside on a single physical hardware interface. Just like a VRF or VLAN, this allows a single physical interface to be used for multiple logical connections. In this case, the resulting logical interface is called a vOLT, or a virtual Optical Line Terminal.
Now we are finally getting to the answer to the original question. vOLT must somehow relate to virtualizing the OLT, but how? The answer lies in the idea of disaggregation in passive optical networks (remember, this is a PON). One of the key components of disaggregation is being able to run any software, especially open source software, on any hardware, so-called “white box” hardware in particular. To get to this point, you must have some sort of “open Application Programming Interface,” or API, to connect the software to the hardware. You might think the HA in vOLT-HA stands for “high availability,” but then you’d be wrong. 🙂 It actually stands for Hardware Abstraction.
So vOLT-HA, sometimes spelled VOLTHA, is actually a hardware abstraction layer that allows the disaggregation of vOLTs in an ONU in a PON.
Configuring a static route is just like installing an entry directly in the routing table (or the RIB).
I have been told this many times in my work as a network engineer by operations people, coders, designers, and many other folks. The problem is that it is, in some routing table implementations, too true. To understand, it is best to take a short tour through how a typical RIB interacts with a routing protocol. Assume BGP, or IS-IS, learns about a new route that needs to be installed in the RIB:
- The RIB into which the route needs to be installed is somehow determined. This might be through some sort of special tagging, or perhaps each routing process has a separate RIB into which it is installing routes, etc. In any case, the routing process must determine which RIB the route should be installed in.
- Look the next hop up in the RIB, to determine if it is reachable. A route cannot be installed if there is no next hop through which to forward the traffic towards the described destination.
- Call the RIB interface to install the route.
The last step results in one of two possible reactions. The first is that the local RIB code compares any existing route to the new route, using the administrative distance and other factors (internal IS-IS routes should be preferred over external routes, eBGP routes should be preferred over iBGP routes, etc.) to decide which route should “win.” This process can be quite complex, however, as the rules are different for each protocol, and can change between protocols. In order to prevent long switch statements that need to be maintained in parallel with the routing protocol code, many RIB implementations use a set of call back functions to determine whether the existing route, or the new route, should be preferred.
In this diagram—
- The IS-IS process receives an LSP, calculates a new route based on the information (using SPF), and installs the route into the routing table.
- The RIB calls back to the owner of the current route, BGP, handing the new route to BGP for comparison with the route currently installed in the RIB.
- BGP finds the local copy of the route (rather than the version installed in the RIB) based on the supplied information, and determines the IS-IS route should win over the current BGP route. It sends this information to the RIB.
Using a callback system of this kind allows the “losing” routing protocol to determine if the new route should replace the current route. This might seem to be slower, but the reduced complexity in the RIB code is most often worth the tradeoff in time.
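The callback arrangement described above can be sketched in a few lines. This is a toy model, not the code of any real RIB: the `Rib` and `Route` names, the `by_distance` callback, and the use of administrative distance as the only tie-breaker are assumptions, and the next hop reachability check is left out for brevity. The distance values follow common Cisco defaults (IS-IS 115, iBGP 200) only to make the example concrete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    owner: str      # name of the process that installed the route
    distance: int   # administrative distance (illustrative values)

class Rib:
    def __init__(self):
        self.table = {}      # prefix -> currently installed Route
        self.callbacks = {}  # owner -> function(current, candidate) -> bool

    def register(self, owner, should_replace):
        # Each routing process registers the function the RIB calls back.
        self.callbacks[owner] = should_replace

    def install(self, candidate):
        current = self.table.get(candidate.prefix)
        if current is None:
            self.table[candidate.prefix] = candidate
            return True
        # Ask the owner of the installed route whether it should give way.
        if self.callbacks[current.owner](current, candidate):
            self.table[candidate.prefix] = candidate
            return True
        return False

def by_distance(current, candidate):
    # A real protocol would apply its own rules here; this sketch only
    # compares administrative distance.
    return candidate.distance < current.distance

if __name__ == "__main__":
    rib = Rib()
    rib.register("bgp", by_distance)
    rib.register("isis", by_distance)
    rib.install(Route("10.0.0.0/24", "192.0.2.1", "bgp", 200))
    rib.install(Route("10.0.0.0/24", "192.0.2.2", "isis", 115))
    print(rib.table["10.0.0.0/24"])
```

The RIB itself never needs to know the comparison rules of any protocol; it only needs to know whom to ask, which is why the long switch statement disappears.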
The static route is, in some implementations, an exception to this kind of processing. For instance, in old Cisco IOS code, the static route code was part of the RIB code. When you configured a static route, the code just created a new routing table entry and a few global variables to keep track of the manual configuration. FRRouting’s implementation of the static route is like this today; you can take a look at the zebra_static.c file in the FRRouting code base to see how static routes are implemented.
However, there is current work being done to separate the static route from the RIB code; to create a completely different static process, so static routes are processed in the same way as a route learned from any other process.
Even though many implementations manage static routes as part of the RIB, however, you should still think of the static route as being like any other route, installed by any other process. Thinking about the static route as a “special case” causes many people to become confused about how routing really works. For instance—
Routing is really just another kind of configuration. After all, I can configure a route directly in the routing table with a static route.
I’ve also heard some folks say something like—
Software defined networks are just like DevOps. Configuring static routes through a script is just like using a southbound interface like I2RS or OpenFlow to install routing information.
The confusion here stems from the idea that static routes directly manipulate the RIB. This is an artifact of the way the code is structured, however, rather than a fact. The processing for static routes is contained in the RIB code, but that just means the code for determining which route out of a pair wins, etc., is all part of the RIB code itself, rather than residing in a separate process. The source of the routing information is different (a human configuring the device), but the actual processing is no different, even if that processing is mixed into the RIB code.
What about a controller that screen scrapes the CLI for link status to discover reachable destinations, calculates a set of best paths through the network based on this information, and then uses static routes to build a routing table in each device? This can be argued to be a “form” of SDN, but the key element is the centralized calculation of loop free paths, rather than the static routes.
The humble static route has caused a lot of confusion for network engineers, but clearly separating the function from the implementation helps explain not only how and why static routes work, but also the different components of the routing system, and what differentiates SDNs from distributed control planes.
On this episode of the history of networking, we talk to Tony Li about the origin and history of the Cisco Silicon Switching Engine.
In this short take, recently posted over at the Network Collective, I discuss what a side channel attack is, and why such attacks are important.
On a recent history of networking episode, Alia talked a little about Maximally Redundant Trees (MRTs), and the concept of Depth First Search (DFS) numbering, along with the idea of a low point. While low points are quickly explained in my new book in the context of MRTs, I thought it worthwhile to revisit the concept in a blog post. Take a look at the following network:
On the left side is a small network with the nodes (think of these as routers) being labeled from A through G. On the right side is the same network, only each node has been numbered by traversing the graph, starting at A. This process, in a network, would either require some device which knows about every node and edge (link) in the network, or it would require a distributed algorithm that “walks” the network from one node to another, numbering each node as it is touched, and skipping any node that has already been visited (again, for more details on this, please see the book).
Once this numbering has been done, the numbers now produce this interesting property: if you remove the parent of any node, and the node can still reach a number lower than its own number, the network is two-connected. Take E, numbered as 5, as an example. E’s parent is D, labeled as 3 on the numbered side of the illustration. If you remove D from the network, what is the lowest numbered node E can reach? Start by jumping to the lowest numbered neighbor. In this case, E only has one neighbor remaining, C, which is numbered 6. From here, what is the lowest numbered neighbor of C? It is A, with a number of 1.
E, then, can reach a node which is numbered 1 through some neighbor other than its parent. This means E has some path to the rest of the network other than through its parent, which means E is part of a topology with at least two connections to some other node in the network; it is two connected.
Using this sort of calculation, you can find alternate paths in a network. The problem with using DFS numbering for this is what was stated above: the calculation requires either a “walk through the network” protocol, or it requires some device with a complete view of the network (an LSDB, in link state terms). Neither of these is really conducive to real time calculation during a topology change. MRT solves this by using low points from DFS numbering with Dijkstra’s SPF algorithm to allow the calculation of disjoint paths in near real time in a distributed control plane.
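For readers who want to experiment, the following is a minimal sketch of DFS numbering and low point calculation in the textbook (Tarjan-style) form. It is not the MRT algorithm itself, and the topology used below is hypothetical; it is not the network from the illustration. The function names are likewise assumptions.

```python
def dfs_low_points(graph, root):
    """Return (number, low, parent) dictionaries for an undirected graph."""
    number, low, parent = {}, {}, {root: None}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        number[node] = low[node] = counter
        for neighbor in graph[node]:
            if neighbor not in number:
                # Tree edge: descend, then inherit the child's low point.
                parent[neighbor] = node
                visit(neighbor)
                low[node] = min(low[node], low[neighbor])
            elif neighbor != parent[node]:
                # Back edge to an already numbered node.
                low[node] = min(low[node], number[neighbor])

    visit(root)
    return number, low, parent

def has_path_around_parent(node, number, low, parent):
    """True if `node` stays attached to the network when its parent is removed."""
    p = parent[node]
    if p is None:
        raise ValueError("the DFS root has no parent")
    if parent[p] is None:
        # The parent is the DFS root: safe only if the root has one DFS child.
        return sum(1 for q in parent.values() if q == p) == 1
    return low[node] < number[p]

if __name__ == "__main__":
    # Hypothetical topology: a ring A-B-D-E-C-A with a stub F hanging off E.
    graph = {
        "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"],
        "D": ["B", "E"], "E": ["D", "C", "F"], "F": ["E"],
    }
    number, low, parent = dfs_low_points(graph, "A")
    for node in "BCDEF":
        print(node, number[node], low[node],
              has_path_around_parent(node, number, low, parent))
```

In this sketch, a node on the ring can reach a number lower than its parent’s without passing through the parent, while the stub node cannot, so losing its parent isolates it.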
My first short take at The Network Collective is up discussing the Broadcom SDKLT announcement. Does this really mean the end of vendors or network engineering? You can guess my answer, or you can watch the video and hear it for yourself.
Considering the DNS query chain:
- A host queries a local recursive server to find out about banana.example
- The server queries the root server, then recursively the authoritative server, looking for this domain name
- banana.example does not exist
There are two possible responses in this chain of queries, actually. .example might not exist at all. In this case, the root server will return a server not found error. On the other hand, .example might exist, but banana.example might not exist; in this case, the authoritative server is going to return an NXDOMAIN record indicating the subdomain does not exist.
Assume another host, a few moments later, also queries for banana.example. Should the recursive server request the same information all over again for this second query? It will unless it caches the failure of the first query; this is the negative cache. This negative cache reduces load on the overall system, but it can also be considered a bug.
Take, for instance, the case where you set up a new server, assign it banana.example, jump to a host and try to connect to the new server before the new DNS information has been propagated through the system. On the first query, the local recursive server will cache the nonexistence of banana.example, and you will need to wait until this negative cache entry times out before you can reach the newly configured server. If the time required to propagate the new DNS information is two seconds, you query after one second, and the negative cache is sixty seconds, the entry will not expire until sixty-one seconds have passed; the negative cache will cost you fifty-nine seconds of your time.
How long will a recursive server keep a negative cache entry? The answer depends on the kind of response it received in its initial attempt to resolve the name. If server not found is the response, then the negative cache timeout is locally configured. If an NXDOMAIN record is returned, the negative cache is set to time out based on the timeout found in the SOA.
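A negative cache of this kind can be sketched as a small table of expiry times. The class below is an illustration only, not the behavior of any particular resolver; the names `NegativeCache`, `local_ttl`, and `soa_ttl` are assumptions, and time is passed in explicitly so the example stays deterministic.

```python
class NegativeCache:
    def __init__(self, local_ttl):
        self.local_ttl = local_ttl  # used when no SOA timeout is available
        self.expires = {}           # name -> absolute expiry time (seconds)

    def record_failure(self, name, now, soa_ttl=None):
        # Prefer the timeout learned from the SOA; fall back to local config.
        ttl = soa_ttl if soa_ttl is not None else self.local_ttl
        self.expires[name] = now + ttl

    def is_negative(self, name, now):
        expiry = self.expires.get(name)
        if expiry is None:
            return False
        if now >= expiry:
            # The entry has timed out; the next query goes upstream again.
            del self.expires[name]
            return False
        return True

if __name__ == "__main__":
    cache = NegativeCache(local_ttl=30)
    cache.record_failure("banana.example", now=1, soa_ttl=60)
    for t in (2, 30, 60, 61):
        print(f"t={t:2d}s answered from negative cache:",
              cache.is_negative("banana.example", t))
```

While the entry is live, every query for the name is answered locally, even if the name has since come into existence upstream.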
So, first point about negative caching in DNS: if you are dealing with a local DNS server for internal lookups on a data center fabric or campus network, it might improve the performance of applications and the network in general to turn off negative caching for the local domains. DNS turnaround times can be a major performance bottleneck in application performance. In turning off negative caching for local resources, you are trading processing power on your DNS server against reduced turnaround times, particularly when a new server or service is brought up.
The way a negative cache is built, however, seems to allow for a measure of inefficiency. Assume three subdomains exist as part of .example. A host queries for banana.example, and the recursive server, on receiving an NXDOMAIN response that this subdomain does not exist, builds a negative cache entry for banana.example. A few moments later, some other host (or the same host) queries for cantaloupe.example. Once again, the recursive server discovers this subdomain does not exist, and builds a negative cache entry. If the point of the negative cache is to reduce the workload on the DNS system, it does not seem to be doing its job. A given host, in fact, could use a good deal of processing power by requesting one domain after another, forcing the recursive server to discover whether or not the subdomain exists.
RFC 8198 proposes a way to resolve this problem by including more information in the response to the recursive server. Specifically, given DNSSEC signed zones (to ensure no-one is poisoning the cache to force the building of a large negative cache in the recursive server), an answering DNS server can provide a list of the two domain names on either side of the missing queried domain name.
In this case, a host queries for banana.example, and the server responds with the pair of subdomains surrounding the requested subdomain, including orange.example. Now when the recursive server receives a request for cantaloupe.example, it can look into its negative cache and immediately see there is no such domain in the place where it should exist. The recursive server can now respond with a “no server found,” without sending queries to any other upstream server.
This aggressive form of negative caching can reduce the workload of upstream servers, and close an attack surface that might be used for denial of service attacks.
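The range lookup at the heart of this idea can be sketched as follows. This is a simplified illustration rather than an implementation of RFC 8198: it compares lowercase names as plain strings instead of using DNSSEC canonical ordering, it performs no signature validation, and it assumes cached ranges do not overlap. The class name is an assumption, and apple.example is a hypothetical lower bound chosen for the example.

```python
import bisect

class AggressiveNegativeCache:
    def __init__(self):
        self.ranges = []  # sorted, non-overlapping (lower, upper) name pairs

    def add_range(self, lower, upper):
        # Both bounds exist; nothing exists strictly between them.
        bisect.insort(self.ranges, (lower.lower(), upper.lower()))

    def known_missing(self, name):
        name = name.lower()
        # Find the last range whose lower bound sorts before the name.
        index = bisect.bisect_left(self.ranges, (name,)) - 1
        if index < 0:
            return False
        lower, upper = self.ranges[index]
        return lower < name < upper

if __name__ == "__main__":
    cache = AggressiveNegativeCache()
    # Learned from the response to the query for banana.example.
    cache.add_range("apple.example", "orange.example")
    for query in ("cantaloupe.example", "orange.example", "pear.example"):
        print(query, "known missing:", cache.known_missing(query))
```

One cached range answers every query that falls inside it, so a host walking through nonexistent names no longer forces a fresh upstream lookup for each one.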
Broadcom, to much fanfare, has announced a new open source API that can be used to program and manage their Tomahawk set of chips. As a general refresher, the Tomahawk chip series is the small buffer, moderate forwarding table size hardware network switching platform on which a wide array of 1RU (and some chassis) routers (often called switches, but this is just a bad habit of the networking world) used in large scale data centers are built. In fact, I cannot think of a single large scale data center operating today that does not somehow involve some version of the Tomahawk chip set.
What does this all mean? While I will probably end up running a number of posts on SDKLT over time, I want to start with just some general observations about the meaning of this move on the part of Broadcom for the overall network engineering world.
This is a strong validation of a bifurcation in the market between disaggregation and hyperconvergence in the networking world. Back when the CCDE was designed and developed, there was a strong sense among the folks working on the certification that design and operations were splitting. This trend is still ongoing, probably ultimately resulting in two related, but different disciplines. In more recent years, there is a clear trend towards the end of the appliance driven networking model. The market is moving away from the appliance model (buy a box, rack it, turn it on, configure it, and —maybe— automate it), to a model where you either buy an entire vertically integrated system, or you buy various software and hardware bits, then put everything together.
Will this happen overnight? No. Will there be some network operators who try to linger in the old model for as long as possible? Yes. Is the old model a real comfort zone for many folks who work in the networking world? A nice, warm blanket in which to wrap yourself (oh, I don’t know IS-IS, but I know all the latest gear from my favorite big vendor)? Yes.
But this world is done. Kaput. Get used to it.
There will still be a place for custom hardware and appliances, but that place will shrink until it holds steady. Appliances will not die, any more than the PC is going to die any time soon (contrary to all expectations for the last, oh, twenty years or so). The real result might be a much larger overall market, but a market where appliance driven networking will play a significant, but not dominating, role.
This is not the end of vendors. I have seen some commentary in the community bewailing or celebrating the end of the big networking gear vendor. Pft—ain’t gonna happen. What this does do, however, is push big vendors into positioning themselves more strongly as software companies. This falls into the general lines of the disaggregated/hyperconverged coloring book just above. While software first is not going to work in every corner of the networking world, it will work in enough that the ongoing disaggregation between software and hardware is going to change the shape of how vendors must do business.
This is not the end of network engineering, nor network engineers. In 1986’ish, I started working on a new fiber backbone for a USAF base. We were looking at Cabletron boxes for the distribution frames (or core touch points), and either Banyan Vines or Novell Netware for the network operating system. Either one, whichever we chose, combined with the Cabletron’s ability to accept line cards from a wide array of vendors, was going to end the networking wars. We would be able to build any network we wanted, any time, without any real configuration. The end of network engineering was nigh.
Fast forward just shy of ten’ish years, just before I started at Cisco. I was working in the advanced technology group of a large “enterprise shop.” We were going to deploy this thing called IP “like turning on a water faucet.” Then we were going to build an internal ‘web on top of it, and turn off Novell Netware. Then we would have the network to end all networks, and we wouldn’t need heavy duty network engineers any longer. An administrator could run it all.
Is all of this starting to sound familiar? You can laugh because I’m old. That’s okay, because I also have some perspective. 🙂 My experience is this—network engineering has become more complex, not less. I’m constantly tilting at the windmill of network complexity because I have seen this increasing complexity eat engineers in real time.
No, Broadcom opening their chip API is not going to end network engineering. It might end it as we know it, but that has already happened so many times in the last twenty-odd years that this event should almost pass without remark. Is this healthy for humans? I am bothered by this question (hence my taking on a PhD in philosophy). But the question does not fall within the domain of this blog, so I will spare you…
Overall, then… Some interesting changes ahead—changes you should have anticipated. Changes which are going to require all of us, including me, to be a little more serious about acquiring new skills. Some things will remain the same, as well. But the biggest change is going to be on the vendor front, where software is the “new normal” in much of the network.
At their core, Meltdown and Spectre are simple vulnerabilities to understand. Imagine a gang of thieves waiting for a stage coach carrying a month’s worth of payroll.
There are two roads the coach could take, and a fork, or a branch, where the driver decides which one to take. The driver could take either one. What is the solution? Station robbers along both sides of the branch, and wait to see which one the driver chooses. When you know, pull the resources from one branch to the other, so you can effectively rob the stage. This is much the same as a modern processor handling a branch—the user could have put anything into some field, or retrieved anything from a database, that might cause the software to run one of two sets of instructions. There is no way for the processor to know, so it runs both of them.
To run both sets of instructions, the processor will pull in the contents of specific memory locations, and begin executing code across these memory locations. Some of these memory locations might not be pieces of memory the currently running software is supposed to be able to access, but this is not checked until the branch is chosen. Hence a piece of software can force the processor to load memory it should not have access to by calling the right instructions in a speculative branch, exposing those bits of memory to be read by the software.
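The mechanism can be made concrete with a toy model. The sketch below is not an exploit and does not reflect any real processor; `ToyCPU` and its methods are hypothetical names, and the "cache" is just a Python set. It only illustrates the ordering problem: the speculative load leaves a trace in the cache before the permission check discards the result.

```python
class ToyCPU:
    """Toy model: speculative work is rolled back, but its cache footprint is not."""

    def __init__(self, memory, protected):
        self.memory = memory        # address -> byte value
        self.protected = protected  # addresses user code may not read
        self.cached_lines = set()   # side effect that survives a rollback

    def speculative_read(self, address):
        # Speculation: the value is fetched and used before the check completes.
        value = self.memory[address]
        # A dependent load touches a probe line selected by the value itself.
        self.cached_lines.add(value)
        # The permission check resolves late; the architectural result is discarded.
        if address in self.protected:
            return None
        return value

    def probe(self):
        # A real attacker would time accesses; here "cached" stands in for "fast".
        return sorted(self.cached_lines)


cpu = ToyCPU(memory={0: 7, 1: 42}, protected={1})
print(cpu.speculative_read(1))  # None: the read is refused
print(cpu.probe())              # [42]: the secret still left a trace
```

In real attacks the probe step is a timing measurement against the cache rather than a direct lookup, but the shape of the leak is the same.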
But my point here is not to consider the problem itself. What is more interesting is the thinking that leads to this kind of defect being designed in. There are, in all designs, tradeoffs. For instance, in the real (physical) world, there is the tradeoff between fast, cheap, and quality. In the database world, there is the tradeoff among consistency, availability, and partition tolerance. I have, for many years, maintained that in network design there is a tradeoff between state, optimization, and surfaces.
What Meltdown and Spectre represent is the unintended consequence of a strong drive towards enhancing performance. It’s not that the engineers who designed speculative execution, and put it into silicon, are dumb. In fact, they are brilliant engineers who have helped drive the art of computing ever faster forward in ways probably unimaginable even twenty years ago. There are known tradeoffs when using speculative execution, such as:
- Power—some code is going to be run, and the contents of some memory fetched, that will not be used. Fetching these memory locations, and running this code, is not free; there is some amount of power used, and heat generated, in speculative execution. This was actually a point of discussion early in the life of speculative execution, but the performance gains were so solid that the power and heat concerns were eventually set aside.
- Real Estate—speculative execution requires physical real estate in the processor. It makes processors larger, and uses silicon gates that could be used for something else. Overall, the most performance enhancing use of the available real estate was shown to be the most economically useful, and thus speculative execution became an important part of chip design.
- State—speculative execution drives the amount of state, and the speed at which that state is changing, much higher than it would otherwise be. Again, the performance gains were strong enough to make the added state worth the effort.
There was one more tradeoff, we now know, that was not considered during the initial days and years when speculative execution was being discussed—security.
So maybe it is time to take stock, and think about lessons learned. First, it is always the unexpected consequence that will come back to bite you in the end. Second, there is almost always an unexpected consequence. The value of experience is in being bitten by unexpected consequences enough times to learn to know what to look for in the future.
Well, in theory, anyway.
Finally, if you haven’t found the tradeoffs, you haven’t looked hard enough. Any time you think you have come up with a way to do things that will outperform any other way, you need to find all the tradeoffs. Don’t just find one tradeoff, and say, “see, I have that covered.”
A single minded focus on performance, at the cost of all else, will normally cost you more than you think, in the end. Overoptimization can sometimes cause meltdowns. And spectres.
It’s a lesson well worth learning.
Network Engineering and coding, like many other things in the information technology world, share overlapping concepts—even if we don’t often recognize the overlap because we are too busy making up new names to describe the same thing. For this week’s video, I turn my attention to the Application Programming Interface, or the API.
When deploying IPv6, one of the fundamental questions the network engineer needs to ask is: DHCPv6, or SLAAC? As the argument between these two has reached almost political dimensions, perhaps a quick look at the positive and negative attributes of each solution is in order. Originally, the idea was that IPv6 addresses would be created using stateless address autoconfiguration (SLAAC). The network part of the address would be obtained by listening for a Router Advertisement (RA), and the host part would be built using a local (presumably unique) physical (MAC) address. In this way, a host can be connected to the network, and come up and run, without any manual configuration. Of course, there is still the problem of DNS: how should a host discover which server it should contact to resolve domain names? To resolve this part, the DHCPv6 protocol would be used. So in IPv6 configuration, as initially conceived, the information obtained from the RA would be combined with DNS information from DHCPv6 to fully configure an IPv6 host when it is attached to the network.
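To make the "host part built from the MAC address" step concrete, here is a minimal sketch of the modified EUI-64 construction commonly used with SLAAC: the MAC address is split in half, `ff:fe` is inserted in the middle, and the universal/local bit is flipped. The function name `slaac_address` and the example prefix and MAC address are assumptions chosen for illustration.

```python
import ipaddress


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix learned from an RA with a modified EUI-64 host part."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("this construction expects a /64 prefix")

    octets = bytearray(int(part, 16) for part in mac.replace("-", ":").split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC to get 64 bits.
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])

    return ipaddress.IPv6Address(int(net.network_address) | int.from_bytes(iid, "big"))


print(slaac_address("2001:db8:1:2::/64", "00:1a:2b:3c:4d:5e"))
# 2001:db8:1:2:21a:2bff:fe3c:4d5e
```

Because the host part is derived directly from the hardware address, it stays the same wherever the host attaches, which is the tracking concern the privacy extensions were later designed to address.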
There are several problems with this scheme, as you might expect. The most obvious is that most network operators do not want to deploy two protocols to solve a single problem—configuring IPv6 hosts. What might not be so obvious, however, is that many network operators care a great deal about whether hosts are configured statelessly or through a protocol like DHCPv6.
Why would an operator want stateful configuration? Primarily because they want to control which devices can receive an IPv6 address, and hence communicate with other devices on the network. When using DHCPv6, just like DHCP with IPv4, the operator can set parameters around what kinds of devices, or perhaps even which specific devices, will be able to receive an IPv6 address. Further, the DHCPv6 server can be tied to the DNS server, so each host which connects to the network can also be given a DNS entry. Proper DNS entries are often a requirement for many applications. There are Dynamic DNS (DDNS) implementations that can solve this problem, but they are not often considered secure enough for a controlled network environment.
Why would an operator want stateless autoconfiguration? First, because they want any random user who can successfully connect to the network to be able to get an IPv6 address without any other configuration, and without the provider needing to run any sort of special protocol or configuration to allow this. In fact, DHCPv6, in some environments, at least, can be seen as an attack surface, or rather a hole through which attacks can potentially be driven. Second, stateful configuration also has a failover problem; if the DHCPv6 server fails, then hosts can no longer obtain an IPv6 address, and the network no longer works. This could be, to say the least, problematic for service providers. Finally, SLAAC has a set of privacy extensions outlined in RFC4941 that (theoretically) prevent a host from being tracked based on its IPv6 address over time. This is a very attractive property for edge facing service providers.
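The idea behind the privacy extensions can be sketched in a few lines: instead of deriving the host part from the MAC address, the host uses a randomized interface identifier and replaces it periodically. The sketch below is a simplification that just draws random bits; RFC4941 specifies its own generation procedure and address lifetimes, which are omitted here, and the function name `temporary_address` is hypothetical.

```python
import ipaddress
import secrets


def temporary_address(prefix: str) -> ipaddress.IPv6Address:
    """Build an address with a random host part instead of a MAC-derived one."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("expected a /64 prefix")

    iid = secrets.randbits(64)
    # Clear the universal/local bit so the identifier is not read as globally unique.
    iid &= ~(0x02 << 56)

    return ipaddress.IPv6Address(int(net.network_address) | iid)


# Each call yields a different address within the same prefix.
print(temporary_address("2001:db8:1:2::/64"))
print(temporary_address("2001:db8:1:2::/64"))
```

An operator who needs to tie an address back to a specific device loses that ability with addresses like these, which is one reason the SLAAC versus DHCPv6 argument is not purely technical.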
The original set of drafts, however, only provided for DNS information to be carried through DHCPv6, and had no failover mechanism for DHCPv6. These two things, together, made it impossible to use just one of these two options. More recent work, however, has remedied both parts of this problem, making either option able to stand on its own. RFC6106, which is a bit older (2010), provides for DNS advertisement in the RA protocol. This allows an operator who would like to run everything completely stateless to do so, including hosts learning which DNS resolver to use. On the other side, RFC8156, which was just ratified in July of 2017, allows a pair of DHCPv6 servers to act as a failover pair. While this is more complex than simple DHCPv6, it does solve the problem of a host failing to operate correctly simply because the DHCPv6 server has failed.
Which of the two is now the best choice? If you do not have any requirement to restrict the hosts that can attach to the network using IPv6, then SLAAC, combined with DNS advertisement in the RA, and possibly with DDNS (if needed), would be the right choice. However, if the environment must be more secure, then DHCPv6 is likely to be the better solution.
A word of warning, though: using DHCPv6 to ensure each host receives an IPv6 address that can be used anyplace in the network, and then stretching layer 2 to allow any host to roam “anywhere,” is really just not a good idea. I have worked on networks where this kind of thing has been taken to a global scale. It might seem cute at first, but this kind of solution will ultimately become a monster when it grows up.
Since Facebook has released their Open/R routing platform, there has been a lot of chatter around whether or not it will be a commercial success, whether or not every hyperscaler should use the protocol, whether or not this obsoletes everything in routing before this day in history, etc., etc. I will begin with a single point.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
Design is about tradeoffs. Protocol design is no different than any other design. Hence, we should expect that Open/R makes some tradeoffs. I know this might be surprising to some folks, particularly in the crowd that thinks every new routing system is going to be a silver bullet that solves every problem from the past, that the routing singularity has now occurred, etc. I’ve been in the world of routing since the early 1990s, perhaps a bit before, and there is one thing I know for certain: if you understand the basics, you understand there is no routing singularity, and there never will be, at least not until someone produces a quantum wave routing protocol.
The reality is you always face one of two choices in routing: build a protocol specifically tuned to a particular set of situations, which means application requirements, topologies, etc., or build a general purpose protocol that “solves everything,” at some cost. BGP is becoming the latter, and is suffering for it. Open/R is an instance of the former.
Which means the interesting question is: what are they solving for, and how? Once you’ve answered this question, you can then ask: would this be useful in my network?
A large number of the points, or features, highlighted in the first blog post are well known routing constructions, so we can safely ignore them. For instance: IPv6 link local only, graceful restart, draining and undraining nodes, exponential backoff, carrying random information in the protocol, and link status monitoring. These are common features of many protocols today, so we don’t need to discuss them. There are a couple of interesting features, however, worth discussing.
Dynamic Metrics. EIGRP once had dynamic metrics, and they were removed. This simple fact always makes me suspicious when I see dynamic metrics touted as a protocol feature. Looking at the heritage of Open/R, however, dynamic metrics were probably added for one specific purpose: to support wireless networks. This functionality is, in fact, provided through DLEP, and supported in OLSR, MANET extended OSPF, and a number of other MANET control planes. Support for DLEP and dynamic metrics based on radio information was discussed at the BABEL working group at the recent Singapore IETF, in fact, and the BABEL folks are working on integrating dynamic metrics for wireless. So this feature not only makes sense in the wireless world, it’s actually much more widespread than might be apparent if you are looking at the world from an “Enterprise” point of view.
But while this is useful, would you want this in your data center fabric? I’m not certain you would. I would argue dynamic metrics are actually counterproductive in a fabric. What you want, instead, is basic reachability provided by the distributed control plane (routing protocol), and some sort of controller that sits on top using an overlay sort of mechanism to do traffic engineering. You don’t want this sort of policy stuff in a routing protocol in a contained environment like a fabric.
Which leads us to our second point: The API for the controller. This is interesting, but not strictly new. Openfabric, for instance, already postulates such a thing, and the entire I2RS working group in the IETF was formed to build such an interface (though it has strayed far from this purpose, as usual with IETF working groups). The really interesting thing, though, is this: this southbound interface is built into the routing protocol itself. This design decision makes a lot of sense in a wireless network, but, again, I’m not certain it does in a fabric.
Why not? It ties the controller architecture, including the southbound interface, to the routing protocol. This reduces component flexibility, which means it is difficult to replace one piece without replacing the other. If you wanted to replace the basic functionality of Open/R without replacing the controller architecture at some point in the future, you would have to hack your way around this problem. In a monolithic system like Facebook, this might be okay, but in most other network environments, it’s not. In other words, this is a rational decision for Open/R, but I’m not certain it can, or should, be generalized.
This leads to a third observation: This is a monolithic architecture. While in most implementations, there is a separate RIB, FIB, and interface into the forwarding hardware, Open/R combines all these things into a single system. In many Linux based network operating systems, for instance, the routing processes install routes into Zebra, which then installs routes into the kernel and notifies processes about routes through the Forwarding Plane Manager (FPM). Some external process (switchd in Cumulus Linux, SWSS in SONiC) then carries this routing information into the hardware.
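The layered pipeline described above can be sketched as a toy model: routing processes hand candidate routes to a RIB, the RIB picks a winner per prefix, writes it to the kernel table, and notifies a listener that programs the hardware. All names here (`Route`, `Rib`, `hardware_agent`) are hypothetical, the distance values are illustrative, and nothing below reflects the actual Zebra or FPM interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    protocol: str
    distance: int  # lower is preferred


class Rib:
    """Toy RIB: selects one best route per prefix and pushes it downstream."""

    def __init__(self):
        self.candidates = {}  # prefix -> routes offered by routing processes
        self.kernel_fib = {}  # prefix -> best route (stand-in for the kernel table)
        self.listeners = []   # callbacks notified of changes (stand-in for the FPM)

    def install(self, route: Route) -> None:
        self.candidates.setdefault(route.prefix, []).append(route)
        best = min(self.candidates[route.prefix], key=lambda r: r.distance)
        if self.kernel_fib.get(route.prefix) != best:
            self.kernel_fib[route.prefix] = best
            for notify in self.listeners:
                notify(best)


hardware_table = {}  # stand-in for the forwarding chip


def hardware_agent(route: Route) -> None:
    # Separate process in a real system; here just a function.
    hardware_table[route.prefix] = route.next_hop


rib = Rib()
rib.listeners.append(hardware_agent)
rib.install(Route("10.0.0.0/24", "192.0.2.1", "ospf", 110))
rib.install(Route("10.0.0.0/24", "192.0.2.2", "bgp", 20))
print(hardware_table)  # {'10.0.0.0/24': '192.0.2.2'}
```

Because each stage only talks to the next through a narrow interface, any one of them can be swapped out. That replaceability is what a monolithic design gives up.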
Open/R, from the diagrams in the blog post, pushes all of this stuff, including the southbound interface from the controller, into a different set of processes. The traditional lines are blurred, which means the entire implementation acts as a single “thing.” You are not going to take the BGP implementation from snaproute or FRRouting and run it on top of Open/R without serious modification, nor are you going to run Open/R on ONL or SONiC or Cumulus Linux without serious modification (or at least a lot of duplication of effort someplace).
This is probably an intentional decision on the part of Open/R’s designers: it is designed to be an “all in one solution.” You RPM it to a device, with nothing else, and it “just works.” This makes perfect sense in the wireless environment, particularly for Facebook. Whether or not it makes perfect sense in a fabric depends: does this fit into the way you manage boxes today? Do you plan on using boxes Facebook will support, or roll your own drivers as needed for different chipsets, or hope the SAI support included in Open/R is enough? Will you ever need segment routing, or some other capability? How will those be provided for in the Open/R model, given it is an entire stack, and does not interact with any other community efforts?
Finally, there are a number of interesting points that are not discussed in the publicly available information. For instance, what does this controller look like? What does it do? How would you do traffic engineering with this system? None of the standard ways of providing virtualization, such as segment routing or MPLS, are mentioned at all. Dynamic metrics simply are not enough in a fabric. How is the flooding of information actually done? In the past, I’ve been led to believe this is based on ZeroMQ; is this still true? How optimal is ZeroMQ for flooding information? What kind of telemetry can you get out of this, and is it carried in the protocol, or in a separate system? I assume they want to carry telemetry as opaque information flooded by the protocol, but does it really make sense to do this?
Overall, Open/R is interesting. It’s a single protocol designed to operate optimally in a small range of environments. As such, it has some interesting features, and it makes some very specific design choices. Are those design choices optimal for more general cases, or even other specific problem spaces? I would argue the architecture, in particular, is going to be problematic in terms of long term maintenance and growth. This can be modified over time, of course, but then we are left with a collection of ideas that are available in many other protocols, making the idea much less interesting.
Is it interesting? Yes. Is it the routing singularity? No. As engineers, we should take it for what it is worth—a chance to see how other folks are solving the problems they are facing in day-to-day operation, and thinking about how some of those lessons might be applied in our own world. I don’t think the folks at Facebook would argue any more than this, either.