Research: Robustness in Complex Systems

While the network engineering world tends to use the word resilience to describe a system that will support rapid change in the real world, another word often used in computer science is robustness. What makes a system robust or resilient? If you ask a network engineer this question, the most likely answer you will get is something like “there is no single point of failure.” This common answer, however, does not go far enough in describing resilience. For instance, it is at least sometimes the case that adding more redundancy into a network can actually harm MTTR. A simple example: adding more links in parallel can cause the control plane to converge more slowly; at some point, the time to converge can increase enough to offset the higher path availability.

In other cases, automating the response to a change in the network can harm MTTR. For instance, we often nail a static route up and redistribute that, rather than redistributing live routing information between protocols. Experience shows that sometimes not reacting automatically is better than reacting automatically.

This post will look at a paper that examines robustness more deeply, “Robustness in Complex Systems,” by Steven Gribble. While this is an older paper (it was written in 2000), it remains a worthwhile read for the lessons in distributed system design. The paper is based on the deployment of a cluster-based Distributed Data Structure (DDS). A more convenient way for readers to think of this is as a distributed database. Several problems discovered when building and deploying this DDS are considered, including:

  • A problem with garbage collection, which involved timeouts. The system was designed to allocate memory as needed to form and synchronize records. After the record had been synchronized, any unneeded memory would be released, but would not be reallocated immediately by some other process. Rather, a garbage collection routine would coalesce memory into larger blocks where possible, rearranging items and placing memory back into available pools. This process depends on a timer. What the developers discovered is that their initial “guess” at a good timer was ultimately an order of magnitude too small, causing some of the nodes to “fall behind” other nodes in their synchronization. Once a node fell behind, the other nodes in the system were required to “take up the slack,” causing them to fail at some point in the future. This kind of cascading failure, triggered by a simple timer setting, is common in a distributed system.
  • A problem with a leaky abstraction from TCP into the transport. The system was designed to attempt to connect on TCP, and used fairly standard timeouts for building TCP connections. However, a firewall in the network was set to disallow inbound TCP sessions. Another process connecting on TCP was passing through this firewall, causing the TCP session connection to fail, and, in turn, causing the TCP stack on the nodes to block for 15 minutes. This interaction of different components caused nodes to fall out of availability for long periods of time.
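
The second problem, a connection attempt that blocks for many minutes, can often be contained by putting an explicit bound on the connect step. What follows is only a minimal sketch in Python, not the DDS code from the paper; the function name try_connect and the three second default are assumptions chosen for illustration.

```python
import socket


def try_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Attempt a TCP connection, giving up after timeout_s seconds.

    Hypothetical helper for illustration; the name and default are assumptions.
    """
    try:
        # create_connection applies the timeout to the connect attempt itself,
        # so a silently dropped SYN cannot stall the caller for minutes.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refusals, and unreachable hosts alike.
        return False
```

The caller decides what to do on failure (retry later, mark the peer down, try another node) rather than inheriting whatever timeout the stack happens to use.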

Gribble draws several lessons from these, and other, outages in the system.

First, he states that for a system to be truly robust, it must use some form of admission control. The load on the system, in other words, must somehow be controlled so more work cannot be given than the system can support. This has been a contentious issue in network engineering. While circuit switched networks can control the amount of work offered to the network (hence a Clos can be non-blocking in a circuit switched network), admission control in a packet switched network is almost impossible. The best you can do is some form of Quality of Service marking and dropping, such as traffic shaping or traffic policing, along the edge. This does highlight the importance of such controls, however.
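
To make the idea of admission control concrete, here is a small sketch of one common form: capping the amount of work in flight and refusing anything beyond the cap. This is an illustration under my own assumptions, not the mechanism Gribble describes; the class name AdmissionController and its methods are hypothetical.

```python
import threading


class AdmissionController:
    """Reject new work once a fixed number of requests are in flight.

    Hypothetical class for illustration only.
    """

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and count the request if there is room, else False."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # shed load instead of queueing without bound
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

A caller would check try_admit() before starting work and call release() when the work completes; a rejected request fails fast rather than adding to an overload.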

Second, he states that systems must be systematically overprovisioned. This comes back to the point about redundant links. The caution above, however, still applies; systematic overprovisioning needs to be balanced against other tools to build a robust system. Far too often, overprovisioning is treated as “the only tool in the toolbox.”

Third, he states introspection must be built into the system. The system must be designed to be monitorable from its inception. In network engineering, this kind of thinking is far too often taken to say “everything must be measurable.” This does not go far enough. Network engineers need to think about not only how to measure, but also what they expect normal to look like, and how to tell when “normal” is no longer “normal.” The system must be designed within limits. Far too often, we just build “as large as possible,” and run it to see what happens.
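
One simple way to write down what “normal” looks like is to keep a baseline of past measurements and flag anything that strays too far from it. The sketch below assumes a mean and standard deviation baseline with a threshold of k standard deviations; the function name outside_normal and the default k of 3 are my own assumptions, and real traffic often needs something more careful than this.

```python
from statistics import mean, stdev


def outside_normal(baseline: list[float], sample: float, k: float = 3.0) -> bool:
    """Return True when sample is more than k standard deviations
    away from the mean of the baseline measurements.

    Hypothetical helper; the threshold k is an assumption, not a recommendation.
    """
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    return abs(sample - mu) > k * sigma
```

For example, with a baseline of link utilization readings around 10 percent, a reading of 10.5 would pass while a reading of 15 would be flagged.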

Fourth, Gribble says adaptivity must be provided through a closed control loop. This is what we see in routing protocols, in the sense that the control plane reacts to topology changes in a specific way, or rather within a specific state machine. Learning this part of the network is a crucial, but often skimmed over, part of network engineering.
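
A closed control loop measures some output, compares it to a target, and feeds the difference back into the input. As a rough sketch, here is one iteration of an additive-increase, multiplicative-decrease loop, the same general shape TCP congestion control uses. This technique is my choice for illustration rather than something taken from the paper, and the function name adjust_rate and all of its parameters are hypothetical.

```python
def adjust_rate(
    rate: float,
    observed_latency_ms: float,
    target_latency_ms: float,
    step: float = 1.0,
    backoff: float = 0.5,
    floor: float = 1.0,
) -> float:
    """Run one iteration of an additive-increase, multiplicative-decrease loop.

    Hypothetical helper: all parameter names and defaults are assumptions.
    """
    if observed_latency_ms > target_latency_ms:
        # Feedback says the system is overloaded: back off sharply,
        # but never drop below the configured floor.
        return max(floor, rate * backoff)
    # Feedback says there is headroom: probe upward gently.
    return rate + step
```

Calling this once per measurement interval closes the loop: the offered rate keeps following what the system reports it can actually handle.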

This is an excellent paper, well worth reading for those who are interested in classic work around the area of robustness and distributed systems.

Weekend Reads 051818: Botnets and Throwhammer

The Facebook freak-out provides an outlet for fears regarding the digital environment we inhabit. A few companies control most channels of information. The gadgets that we use for convenience and entertainment also create the mechanisms for near-total surveillance, from tracking devices in our pockets to wiretaps in our homes—hi, Alexa! Someone besides Santa is watching and knows whether you have been naughty or nice. —Nathanael Blake @Public Discourse

Within just 10 days of the disclosure of two critical vulnerabilities in GPON router at least 5 botnet families have been found exploiting the flaws to build an army of million devices. Security researchers from Chinese-based cybersecurity firm Qihoo 360 Netlab have spotted 5 botnet families, including Mettle, Muhstik, Mirai, Hajime, and Satori, making use of the GPON exploit in the wild. —Swati Khandelwal @The Hacker News

Exploitation of Rowhammer attack just got easier. Dubbed ‘Throwhammer,’ the newly discovered technique could allow attackers to launch Rowhammer attack on the targeted systems just by sending specially crafted packets to the vulnerable network cards over the local area network. Known since 2012, Rowhammer is a severe issue with recent generation dynamic random access memory (DRAM) chips in which repeatedly accessing a row of memory can cause “bit flipping” in an adjacent row, allowing anyone to change the contents of computer memory. —Mohit Kumar @The Hacker News

Research: Bridging the Air Gap

Way back in the old days, the unit I worked at in the US Air Force had a room with a lot of equipment used for processing classified information. Among this equipment was a Zenith Z-250 with an odd sort of keyboard and a very low resolution screen. A fine metal mesh embedded in a semi-clear substrate was glued to the surface of the monitor. This was our TEMPEST rated computer, on which we could type up classified memos, read classified email, and the like. We normally connected it to the STU-3 through a modem (remember those?) to send and receive various kinds of classified information.

Guri, Mordechai, and Yuval Elovici. “Bridgeware: The Air-Gap Malware.” Accessed May 13, 2018.

The idea of TEMPEST begins way back in 1985, when a Dutch researcher demonstrated “reading” the screen of a computer from several feet away using some relatively cheap, and easy to assemble, equipment. The paper I’m looking at today provides a good overview of the many ways that have been discovered since this initial demonstration to transfer data from one computer to another across what should be an “air gap.” For instance, the TEMPEST rated computer described above was air gapped; the only time it was connected to any communications device was when it was connected to one of the STU-3s, and then only once the secure connection had already been established.

The paper begins by defining the general problem of communication with air-gapped systems. In this initial section is a very helpful discussion of the difference between covert channels and side channels. A side channel is an unintended side effect of processing data that reveals something about the data itself; I covered these in a short take over at the Network Collective. A covert channel, however, is a communications channel set up between two systems intentionally designed to carry information between them, but without their owner or administrator knowing about the channel. Covert channels are, by their nature, often difficult to detect and block. With this definition in hand, the authors then consider various channels that can be created between two systems to transfer information.

Acoustic channels largely focus on ultrasonic sounds encoding information being transmitted through the system speakers. This is a method apparently employed by advertisers and marketing firms to carry information about the human attached to a particular computer, in order to correlate activity between multiple systems. For instance, when a cellphone is in “hearing range” of a computer, some piece of software may send a noise which an application on the cellphone can use to determine the proximity of the two devices. This allows the tracking from one device to be correlated with the tracking from the second device, building a larger “picture” of the person’s activities. Speakerless computers are one common solution to this kind of problem, but bridging air gaps through the sounds made by a hard drive or the computer’s fan is also possible.

Electromagnetic attacks involve some form of antenna and some form of receiver. The easiest way to transmit something is, of course, to find some way to attach an antenna to a system; for instance, there is a physical attack where a small antenna is embedded into a USB connector, and used to transmit information to a locally configured receiver. In this way, keystrokes, information transferred onto and off the USB device, etc., can be transmitted off an air gapped system. Other ways have been discovered to use the monitor cable as an antenna, or simply to inject a complete cellular antenna system as a backdoor hardware channel.

Thermal and optical methods have been used as well, such as signaling through the temperature changes produced by an air conditioning system.

While many of these methods might seem fantastic, the lesson of this research is that if you are going to air gap a computer for security, make certain you air gap it in the most complete way possible. Disconnecting the Ethernet cable, removing the WiFi antenna, and placing the system in a separate room may not be enough to prevent information from leaking.
