Research: Covert Cache Channels in the Public Cloud

One of the great fears of server virtualization is the concern around copying information from one virtual machine, or one container, to another, through some cover channel across the single processor. This kind of channel would allow an attacker who roots, or otherwise is able to install software, on one of the two virtual machines, to exfiltrate data to another virtual machine running on the same processor. There have been some successful attacks in this area in recent years, most notably meltdown and spectre. These defects have been patched by cloud providers, at some cost to performance, but new vulnerabilities are bound to be found over time. The paper I’m looking at this week explains a new attack of this form. In this case, the researchers use the processor’s cache to transmit data between two virtual machines running on the same physical core.

The processor cache is always very small for several reasons. First, the processor cache is connected to a special bus, which normally has limits in the amount of memory it can address. This special bus avoids reading data through the normal system bus, and this is (from a networking perspective) at least one hop, and often several hops, closer to the processor on the internal bus (which is essentially an internal network). Second, the memory used in the cache is normally faster than the main memory.

The question is: since caches are at the processor level, so multiple virtual processes share the same cache, is it possible for one process to place information in the cache that another process can read? Since the cache is small and fast, it is used to store information that is accessed frequently. As processes, daemons, threads, and pthreads enter and exit, they access difference parts of main memory, causing the contents of the cache to change rapidly. Because of this constant churn, many researchers have assumed you cannot build a covert channel through the cache in this way. In fact, there have been attempts in the past; each of these has failed.

The authors of this paper argue, however, these failures are not because building a covert channel through the cache is not possible, but rather because previous attempts at doing so have operated on bad assumptions, attempting to use standard error correction mechanisms.

The first problem with using standard error correction mechanisms is that entire sections of data can be lost due to a cache entry being deleted. Assume you have two processes running on a single processor; you would like to build a covert channel between these processes. You write some code that inserts information into the cache, ensuring it is written in a particular memory location. This is the “blind drop.” The second process now runs and attempts to read this information. Normally this would work, but between the first and second process running, the information in the cache has been overwritten by some third process you do not know about. Because the entire data block is gone, the second process, which is trying to gather the information from the blind drop location, cannot tell there was ever any information at the drop point at all. There is no data across which the error correction code can run, because the data has been completely overwritten.

A possible solution to this problem is to use something like a TCP window of one; the transmitter resends the information until it receives an acknowledgement of receipt. The problem with this solution, in the case of a cache, is that the sender and receiver have no way to synchronize their clocks. Hence there is no way to form any sense of a serialized channel between the two processes.

To overcome these problems, the researchers use techniques used in wireless networks to ensure reliable delivery over unreliable channels. For instance, they send each symbol (a 0 or a 1) multiple times, using different blind drops (or cache locations), such that the receiver can compare these multiple transmit instances, and decide what the sender intended. The broader the number of blind drops used, the more likely information is to be carried across the process divide through the cache, as there are very few instances where all the cache entries representing blind drops will be invalidated and replaced at once. The researchers increase the rate at which this newly opened covert channel can operate by reverse engineering some aspects of a particular model processor’s caching algorithm. This allows them to guess which lines of cache will be invalidated first, how the cache sets are arranged, etc., and hence to place the blind drops more effectively.

By taking these steps, and using some strong error correction coding, a 42K covert channel was created between two instances running in an Amazon EC2 instance. This might not sound like a like, but it is higher speed than some of the fastest modems in use before DSL and other subscriber lines were widely available, and certainly fast enough to transfer a text-based file of passwords between two processes.

There will probably be some counter to this vulnerability in the future, but for now the main protection against this kind of attack is to prevent unknown or injected code from running on your virtual machines.

Reaction: The Power of Open APIs

Disaggregation, in the form of splitting network hardware from network software, is often touted as a way to save money (as if network engineering were primarily about saving money, rather than adding value—but this is a different soap box). The primary connections between disaggregation and saving money are the ability to deploy white boxes, and the ability to centralize the control plane to simplify the network (think software defined networks here—again, whether or not both of these are true as advertised is a different discussion).

But drivers that focus on cost miss more than half the picture. A better way to drive the value of disaggregation, and the larger value of networks within the larger network technology sphere, is through increased value. What drives value in network engineering? It’s often simplest to return to Tannenbaum’s example of the station wagon full of VHS backup tapes. To bring the example into more modern terms, it is difficult to beat the bandwidth of an overnight box full of USB thumb drives in terms of pure bandwidth.

In this view, networks can primarily be seen as a sop to human impatience. They are a way to get things done more quickly. In the case of networks quantity—speed—often becomes a form of quality—increased value.

But what does disaggregation have to do with speed? The connection is the open API.

When you disaggregate a network device into hardware and software, you necessarily create a stable, openly accessible API between the software and the hardware. Routing protocols, and other control plane elements must be able to build a routing table that is somehow then passed on to the forwarding hardware, so packets can be forwarded through the network. A fortuitous side effect of this kind of open API is that anyone can use it to control the forwarding software.

Enter the new release of ScyllaDB. According to folks who test these things, and should know, ScyllaDB is much faster than Cassandra, another industry leading open source database system. How much faster? Five to ten times faster. A five- to ten-fold improvement in database performance is, potentially, a point of quantity that can easily have a different quality. How much faster could your business handle orders, or customer service calls, or many other things, if you could speed the database end of the problem up even five-fold? How many new ways of processing information to gain insight from that data about business operations, customers, etc.

How does Scylla provide these kinds of improvements over Cassandra? In the first place, the newer database system is written in a faster language, C++ rather than Java. Scylla also shards processing across processor cores more efficiently. It doesn’t rely on the page cache.

None of this has to do with network disaggregation—but there is one way the Scylla developers improved the performance of their database that does relate to network disaggregation: ScyllaDB writes directly to the network interface card using DPDK. The interesting point, from a network engineering point of view, is that simply would not be possible without disaggregation between hardware and software opening up DPDK as an interface for a database to directly push packets to the hardware.

The side effects of disaggregation are only beginning to be felt in the network engineering world; the ultimate effects could reshape the way we think about application performance on the network, and the entire realm of network engineering.

Whatever is vOLT-HA?

Many network engineers find the entire world of telecom to be confusing—especially as papers are peppered with a lot of acronyms. If any part of the networking world is more obsessed with acronyms than any other, the telecom world, where the traditional phone line, subscriber access, and network engineering collide, reigns as the “king of the hill.”

Recently, while looking at some documentation for the CORD project, which stands for Central Office Rearchitected as a Data Center, I ran across an acronym I had not seen before—vOLT-HA. An acronym with a dash in the middle—impressive! But what is, exactly? To get there, we must begin in the beginning, with a PON.

There are two kinds of optical networks in the world, Active Optical Networks (AONs), and Passive Optical Networks (PONs). The primary difference between the two is whether the optical gear used to build the network amplifies (or even electronically rebuilds, or repeats) the optical signal as it passes through. In AONs, optical signals are amplified, while ins PONs, optical signals are not amplified. This means that in a PON, the optical equipment can be said to be passive, in that it does not modify the optical signal in any way. Why is this important? Because passive equipment is less complex, and does not require as much power to operate, so a PON is much less expensive to build and maintain than an AON. Hence a PON is often more economically realistic when serving a large number of customers, such as in providing service to residential or small office customers.

A PON uses optical splitters to divide out the signal among the various connected customers. Like any other shared bandwidth medium, every customer receives all the data on the downstream side, switching only traffic destined for the local network onto the copper (usually Ethernet) network beyond the optical termination point (called an OLT, or Optical Line Terminal). In a PON, the upstream signal is divided up into timeslots, so the system uses Time Division Multiplexing (TDM) to provide (a much slower) path from the end device into the provider’s network. As signals from each edn device reach the splitters in the network, the path is reversed, and the splitter ends up becoming a power combiner, which means the signal can “gain power” on the way up towards the central office (CO). These kinds of systems are typically sold as Fiber to the Home, which is abbreviated FTTH (of course!).

Is your head dizzy yet? I hope not, because we are just getting started with the acronyms. 🙂

The Optical Line Terminal, or OLT, must reside in some piece of physical hardware, called an Optical Network Unit (ONU). The OLT, like a server, or an Ethernet port on a router or switch, can be virtualized, so multiple logical OLTs reside on a single physical hardware interface. Just like a VRF or VLAN, this allows a single physical interface to be used for multiple logical connections. In the case, the resulting logical interface is called a vOLT, or a virtual Optical Line Terminal.

Now we are finally getting to the answer to the original question. vOLT must somehow relate to virtualizing the OLT, but how? The answer lies in the idea of disaggregation in passive optical networks (remember, this is a PON). One of the key components of disaggregation is being able to run any software—especially open source software—on any hardware—so-called “white box” hardware in particular. To get to this point, you must have some sort of “open Application Programming Interface,” or API, to connect the software to the hardware. You might think the HA in vOLT-HA stands for “high availability, but then you’d be wrong. 🙂 It actually stands for Hardware Abstraction.

So vOLT-HA, sometimes spelled VOLTHA, is actually a hardware abstraction layer that allows the disaggregation of vOLTs in an ONU in a PON.

Got it?

Reaction: DNS Complexity Lessons

Recently, Bert Hubert wrote of a growing problem in the networking world: the complexity of DNS. We have two systems we all use in the Internet, DNS and BGP. Both of these systems appear to be able to handle anything we can throw at them and “keep on ticking.”

this article was crossposted to CircleID

But how far can we drive the complexity of these systems before they ultimately fail? Bert posted this chart to the APNIC blog to illustrate the problem—

I am old enough to remember when the entire Cisco IOS Software (classic) code base was under 150,000 lines; today, I suspect most BGP and DNS implementations are well over this size. Consider this for a moment—a single protocol implementation that is larger than an entire Network Operating System ten to fifteen years back.

What really grabbed my attention, though, was one of the reasons Bert believes we have these complexity problems—

DNS developers frequently see immense complexity not as a problem but as a welcome challenge to be overcome. We say ‘yes’ to things we should say ‘no’ to. Less gifted developer communities would have to say no automatically since they simply would not be able to implement all that new stuff. We do not have this problem. We’re also too proud to say we find something (too) hard.

How often is this the problem in network design and deployment? “Oh, you want a stretched Ethernet link between two data centers 150 miles apart, and you want an eVPN control plane on top of the stretched Ethernet to support MPLS Traffic Engineering, and you want…” All the while the equipment budget is ringing up numbers in our heads, and the realyl cool stuff we will be able to play with is building up on the list we are writing in front of us. Then you hear the ultimate challenge—”if you were a real engineer, you could figure out how to do this all with a pair of routers I can buy down at the local office supply store.”

Some problems just do not need to be solved in the current system. Some problems just need to have their own system built for them, rather than reusing the same old stuff because, well, “we can.”

The real engineer is the one who knows how to say “no.”

Policing, Shaping, and Performance

Policing traffic and shaping traffic are two completely different things, but it is hard to know, in the wild, what the impact of one or the other will have on a particular traffic flow, or on the performance of applications in general. While the paper under review here, An Internet-Wide Analysis of Traffic Policing, is largely focused on the global ‘net, specifically from a content provider’s perspective, it contains lessons for just about every network operator who needs to manage Quality of Service (QoS) in a sane and meaningful way.

Flach, Tobias, Pavlos Papageorge, Andreas Terzis, Luis Pedrosa, Yuchung Cheng, Tayeb Karim, Ethan Katz-Bassett, and Ramesh Govindan. 2016. “An Internet-Wide Analysis of Traffic Policing.” In Proceedings of the 2016 ACM SIGCOMM Conference, 468–482. SIGCOMM ’16. New York, NY, USA: ACM. https://doi.org/10.1145/2934872.2934873.

Traffic policing involves setting up a queue with a pool of tokens. For some unit of traffic—assume a packet here—received, a token is consumed. When a packet is transmitted, the token is added back to the pool. If the pool is sized correctly, short bursts in the traffic stream will be allowed through, but if the application attempts to establish a session using more bandwidth than the policer allows, the packets will be dropped. The idea sounds good in theory, but it does not seem to work out well in practice.

To understand why, it is important to examine the behavior of TCP, the stream protocol used by most applications. TCP uses a slow start mechanism that attempts to find the largest window, and hence the highest bandwidth utilization, possible between the transmitter and receiver. The window size is increased fairly rapidly until a packet is dropped. The transmitter then backs off the window size, slowly increasing again until the transmitter reaches a point where bandwidth is maximized, and only a minimal number of packets are dropped (ideally none). The chart below illustrates this process.

Policing is supposed to emulate the effect of a link that is lower bandwidth than the actual link by dropping traffic that exceeds what the policer is configured to allow. The problem is found in the initial description of what a policer does: it allows the stream to burst until the tokens run out. When the TCP stream first starts, then, a policer will allow the TCP slow start process to open the window much higher than the policer’s configured bandwidth. Once the tokens run out, the policer will drop packets until there are tokens in the pool again, which effectively lowers the effective bandwidth. From TCP’s perspective, a policer is a link with a constantly changing link bandwidth.

TCP, in effect, attempts to burst in order to find the maximum bandwidth. A policer treats this burst as a temporary condition, bringing the flow back under its bandwidth limit after some amount of “reasonable burst.” What TCP treats as an attempt to find the optimal flow rate (window size), the policer interprets as a set of bursts requiring dropped packets. The result of this rather bad interaction is very poor performance. The paper reports that up to 20% of packets can be dropped in a policed flow, causing major performance problems. Given the measurements were taken from video servers, the authors note there is a discernible impact on the quality of video across policed links. Impact levels of this kind indicate policing will probably have a bad effect on just about any sort of application that relies on TCP transport services.

Given this, should network operators configure policing? Is it counterproductive? The answer cannot be to eliminate policing, as operators need some way to manage which applications receive particular percentages of the available bandwidth. If the problem is the interaction of TCP and the policer, perhaps some other QoS mechanism can be combined with policing to provide a better balance between controlling load and allowing TCP to operate effectively.

The authors tried various mechanisms to this end, including modifying the TCP stack in the server. This is not going to be a generally available solution, so the authors sought out some other solution that could be implemented in “standard hardware.” What they found is that by placing a traffic shaper in line with a policer, the bad interaction between TCP and the policer could largely be mitigated. The shaper smooths the bursts out, so the policer does not end up taking such drastic action. If policing is being configured, the burst size should be small, rather than large, so TCP is more effective at finding the right window size in the face of the inevitable packet drops.