Lessons in Location and Identity through Remote Peering
We normally encounter four different kinds of addresses in an IP network; we tend to think about each of these as:
- The MAC address identifies an interface on a physical or virtual wire
- The IP address identifies an interface on a host
- The DNS name identifies a host
- The port number identifies an application or service running on the host
There are other address-like things, of course, such as the protocol number, a router ID, an MPLS label, etc. But let’s stick to these four for the moment. Looking through this list, the first thing you should notice is we often use the IP address as if it identified a host—which is generally not a good thing. There have been some efforts in the past to split the locator from the identifier, but the IP protocol suite was designed with a separate locator and identifier already: the IP address is the location and the DNS name is the identifier.
Even if you split up the locator and the identifier, however, the word locator is still quite ambiguous because we often equate the geographical and topological locations. In fact, old police procedural shows used to include scenes where a suspect was tracked down because they were using an IP address “assigned to them” in some other city… When the topic comes up this way, we can see the obvious flaw. In other situations, conflating the IP address with the location of the device is less obvious, and causes more subtle problems.
Consider, for instance, the concept of remote peering. Suppose you want to connect to a cloud provider who has a presence in an IXP that’s just a few hundred miles away. You calculate the costs of placing a router on the IX fabric, add it to the cost of bringing up a new circuit to the IX, and … well, there’s no way you are ever going to get that kind of budget approved. Looking around, though, you find there is a company that already has a router connected to the IX fabric you want to be on, and they offer a remote peering solution, which means they offer to build an Ethernet tunnel across the public Internet to your edge router. Once the tunnel is up, you can peer your local router to the cloud provider’s network using BGP. The cloud provider thinks you have a device physically connected to the local IX fabric, so all is well, right?
In a recent paper, a group of researchers looked at the combination of remote peering and anycast addresses. If you are not familiar with anycast addresses, the concept is simple: take a service which is replicated across multiple locations and advertise every instance of the service using a single IP address. This is clever because when you send packets to the IP address representing the service, you will always reach the closest instance of the service. So long as you have not played games with stretched Ethernet, that is.
In the paper, the researchers used various mechanisms to figure out where remote peering was taking place, and another to discover services being advertised using anycast (normally DNS or CDN services). Using the intersection of these two, they determined if remote peering was impacting the performance of any of these services. I shocked, shocked, to tell you the answer is yes. I would never have expected stretched Ethernet to have a negative impact on performance. 😊
To quote the paper directly:
…we found that 38% (126/332) of RTTs in traceroutes towards anycast prexes potentially aected by remote peering are larger than the average RTT of prexes without remote peering. In these 126 traceroute probes, the average RTT towards prexes potentially aected by remote peering is 119.7 ms while the average RTT of the other prexes is 84.7 ms.
The bottom line: “An average latency increase of 35.1 ms.” This is partially because the two different meanings of the word location come into play when you are interacting with services like CDNs and DNS. These services will always try to serve your requests from a physical location close to you. When you are using Ethernet stretched over IP, however, your topological location (where you connect to the network) and your geographical location (where you are physically located on the face of the Earth) can be radically different. Think about the mental dislocation when you call someone with an area code that is normally tied to an area of the west coast of the US, and yet you know they now live around London, say…
We could probably add in a bit of complexity to solve these problems, or (even better) just include your GPS coordinates in the IP header. After all, what’s the point of privacy? … 🙂 The bottom line is this: remote peering might a good idea when everything else fails, of course, but if you haven’t found the tradeoffs, you haven’t looked hard enough. It might be that application performance across a remote peering session is low enough that paying for the connection might turn out cheaper.
In the meantime, wake me up when we decide that stretching Ethernet over IP is never a good thing.
Research: Securing Linux with a Faster and Scalable IPtables
If you haven’t found the trade-offs, you haven’t looked hard enough.
A perfect illustration is the research paper under review, Securing Linux with a Faster and Scalable Iptables. Before diving into the paper, however, some background might be good. Consider the situation where you want to filter traffic being transmitted to and by a virtual workload of some kind, as shown below.
To move a packet from the user space into the kernel, the packet itself must be copied into some form of memory that processes on “both sides of the divide” can read, then the entire state of the process (memory, stack, program execution point, etc.) must be pushed into a local memory space (stack), and control transferred to the kernel. This all takes time and power, of course.
In the current implementation of packet filtering, netfilter performs the majority of filtering within the kernel, while iptables acts as a user frontend as well as performing some filtering actions in the user space. Packets being pushed from one interface to another must make the transition between the user space and the kernel twice. Interfaces like XDP aim to make the processing of packets faster by shortening the path from the virtual workload to the PHY chipset.
What if, instead of putting the functionality of iptables in the user space you could put it in the kernel space? This would make the process of switching packets through the device faster, because you would not need to pull packets out of the kernel into a user space process to perform filtering.
But there are trade-offs. According to the authors of this paper, there are three specific challenges that need to be addressed. First, users expect iptables filtering to take place in the user process. If a packet is transmitted between virtual workloads, the user expects any filtering to take place before the packet is pushed to the kernel to be carried across the bridge, and back out into user space to the second process, Second, a second process, contrack, checks the existence of a TCP connection, which iptables then uses to determine whether a packet that is set to drop because there no existing connection. This give iptables the ability to do stateful filtering. Third, classification of packets is very expensive; classifying packets could take too much processing power or memory to be done efficiently in the kernel.
To resolve these issues, the authors of this paper propose using an in-kernel virtual machine, or eBPF. They design an architecture which splits iptables into to pipelines, and ingress and egress, as shown in the illustration taken from the paper below.
As you can see, the result is… complex. Not only are there more components, with many more interaction surfaces, there is also the complexity of creating in-kernel virtual machines—remembering that virtual machines are designed to separate out processing and memory spaces to prevent cross-application data leakage and potential single points of failure.
That these problems are solvable is not in question—the authors describe how they solved each of the challenges they laid out. The question is: are the trade-offs worth it?
The bottom line: when you move filtering from the network to the host, you are not moving the problem from a place where it is less complex. You may make the network design itself less complex, and you may move filtering closer to the application, so some specific security problems are easier to solve, but the overall complexity of the system is going way up—particularly if you want a high performance solution.
IPv6 Backscatter and Address Space Scanning
Backscatter is often used to detect various kinds of attacks, but how does it work? The paper under review today, Who Knocks at the IPv6 Door, explains backscatter usage in IPv4, and examines how effectively this technique might be used to detect scanning of IPv6 addresses, as well. The best place to begin is with an explanation of backscatter itself; the following network diagram will be helpful—
Assume A is scanning the IPv4 address space for some reason—for instance, to find some open port on a host, or as part of a DDoS attack. When A sends an unsolicited packet to C, a firewall (or some similar edge filtering device), C will attempt to discover the source of this packet. It could be there is some local policy set up allowing packets from A, or perhaps A is part of some domain none of the devices from C should be connecting to. IN order to discover more, the firewall will perform a reverse lookup. To do this, C takes advantage of the PTR DNS record, looking up the IP address to see if there is an associated domain name (this is explained in more detail in my How the Internet Really Works webinar, which I give every six months or so). This reverse lookup generates what is called a backscatter—these backscatter events can be used to find hosts scanning the IP address space. Sometimes these scans are innocent, such as a web spider searching for HTML servers; other times, they could be a prelude to some sort of attack.
Kensuke Fukuda and John Heidemann. 2018. Who Knocks at the IPv6 Door?: Detecting IPv6 Scanning. In Proceedings of the Internet Measurement Conference 2018 (IMC ’18). ACM, New York, NY, USA, 231-237. DOI: https://doi.org/10.1145/3278532.3278553
Scanning the IPv6 address space is much more difficult because there are 2128 addresses rather than 232. The paper under review here is one of the first attempts to understand backscatter in the IPv6 address space, which can lead to a better understanding of the ways in which IPv6 scanners are optimizing their search through the larger address space, and also to begin understanding how backscatter can be used in IPv6 for many of the same purposes as it is in IPv4.
The researchers begin by setting up a backscatter testbed across a subset of hosts for which IPv4 backscatter information is well-known. They developed a set of heuristics for identifying the kind of service or host performing the reverse DNS lookup, classifying them into major services, content delivery networks, mail servers, etc. They then examined the number of reverse DNS lookups requested versus the number of IP packets each received.
It turns out that about ten times as many backscatter incidents are reported for IPv4 than IPv6, which either indicates that IPv6 hosts perform reverse lookup requests about ten times less often than IPv4 hosts, or IPv6 hosts are ten times less likely to be monitored for backscatter events. Either way, this result is not promising—it appears, on the surface, that IPv6 hosts will be less likely to cause backscatter events, or IPv6 backscatter events are ten times less likely to be reported. This could indicate that widespread deployment of IPv6 will make it harder to detect various kinds of attacks on the DFZ. A second result from this research is that using backscatter, the researchers determined IPv6 scanning is increasing over time; while the IPv6 space is not currently a prime target for attacks, it might become more so over time, if the scanning rate is any indicator.
The bottom line is—IPv6 hosts need to be monitored as closely, or more closely than IPv6 hosts, for scanning events. The techniques used for scanning the IPv6 address space are not well understood at this time, either.
The Floating Point Fix
Floating point is not something many network engineers think about. In fact, when I first started digging into routing protocol implementations in the mid-1990’s, I discovered one of the tricks you needed to remember when trying to replicate the router’s metric calculation was always round down. When EIGRP was first written, like most of the rest of Cisco’s IOS, was written for processors that did not perform floating point operations. The silicon and processing time costs were just too high.
What brings all this to mind is a recent article on the problems with floating point performance over at The Next Platform by Michael Feldman. According to the article:
While most programmers use floating point indiscriminately anytime they want to do math with real numbers, because of certain limitations in how these numbers are represented, performance and accuracy often leave something to be desired.
For those who have not spent a lot of time in the coding world, a floating point number is one that has some number of digits after the decimal. While integers are fairly easy to represent and calculate over in the binary processors use, floating point numbers are much more difficult, because floating point numbers are very difficult to represent in binary. The number of bits you have available to represent the number makes a very large difference in accuracy. For instance, if you try to store the number
101.1 in a
float, you will find the number stored is actually
101.099998 To store
101.1, you need a
double, which is twice as long as a
Okay—this is all might be fascinating, but who cares? Scientists, mathematicians, and … network engineers do, as a matter of fact. Fist, carrying around
double floats to store numbers with higher precision means a lot more network traffic. Second, when you start looking at timestamps and large amounts of telemetry data, the efficiency and accuracy of number storage becomes a rather big deal.
Okay, so the current floating point storage format, called IEEE754, is inaccurate and rather inefficient. What should be done about this? According to the article, John Gustafson, a computer scientist, has been pushing for the adoption of a replacement called posits. Quoting the article once again:
It does this by using a denser representation of real numbers. So instead of the fixed-sized exponent and fixed-sized fraction used in IEEE floating point numbers, posits encode the exponent with a variable number of bits (a combination of regime bits and the exponent bits), such that fewer of them are needed, in most cases. That leaves more bits for the fraction component, thus more precision.
Did you catch why this is more efficient? Because it uses a variable length field. In other words, posits replaces a fixed field structure (like what was originally used in OSPFv2) with a variable length field (like what is used in IS-IS). While you must eat some space in the format to carry the length, the amount of "unused space" in current formats overwhelms the space wasted, resulting in an improvement in accuracy. Further, many numbers that require a
double today can be carried in the size of a
float. Not only does using a TLV format increase accuracy, it also increases efficiency.
From the perspective of the State/Optimization/Surface (SOS) tradeoff, there should be some increase in complexity somewhere in the overall system—if you have not found the tradeoffs, you have not looked hard enough. Indeed, what we find is there is an increase in the amount of state being carried in the data channel itself; there is additional state, and additional code that knows how to deal with this new way of representing numbers.
It's always interesting to find situations in other information technology fields where discussions parallel to discussions in the networking world are taking place. Many times, you can see people encountering the same design tradeoffs we see in network engineering and protocol design.
Design Intelligence from the Hourglass Model
Over at the Communications of the ACM, Micah Beck has an article up about the hourglass model. While the math is quite interesting, I want to focus on transferring the observations from the realm of protocol and software systems development to network design. Specifically, start with the concept and terminology, which is very useful. Taking a typical design, such as this—
The first key point made in the paper is this—
The thin waist of the hourglass is a narrow straw through which applications can draw upon the resources that are available in the less restricted lower layers of the stack.
A somewhat obvious point to be made here is applications can only use services available in the spanning layer, and the spanning layer can only build those services out of the capabilities of the supporting layers. If fewer applications need to be supported, or the applications deployed do not require a lot of “fancy services,” a weaker spanning layer can be deployed. Based on this, the paper observes—
The balance between more applications and more supports is achieved by first choosing the set of necessary applications N and then seeking a spanning layer sufficient for N that is as weak as possible. This scenario makes the choice of necessary applications N the most directly consequential element in the process of defining a spanning layer that meets the goals of the hourglass model.
Beck calls the weakest possible spanning layer to support a given set of applications the minimally sufficient spanning layer (MSSL). There is one thing that seems off about this definition, however—the correlation between the number of applications supported and the strength of the spanning layer. There are many cases where a network supports thousands of applications, and yet the network itself is quite simple. There are many other cases where a network supports just a few applications, and yet the network is very complex. It is not the number of applications that matter, it is the set of services the applications demand from the spanning layer.
Based on this, we can change the definition slightly: an MSSL is the weakest spanning layer that can provide the set of services required by the applications it supports. This might seem intuitive or obvious, but it is often useful to work these kinds of intuitive things out, so they can be expressed more precisely when needed.
First lesson: the primary driver in network complexity is application requirements. To make the network simpler, you must reduce the requirements applications place on the network.
There are, however, several counter-intuitive cases here. For instance, TCP is designed to emulate (or abstract) a circuit between two hosts—it creates what appears to be a flow controlled, error free channel with no drops on top of IP, which has no flow control, and drops packets. In this case, the spanning layer (IP), or the wasp waist, does not support the services the upper layer (the application) requires.
In order to make this work, TCP must add a lot of complexity that would normally be handled by one of the supporting layers—in fact, TCP might, in some cases, recreate capabilities available in one of the supporting layers, but hidden by the spanning layer. There are, as you might have guessed, tradeoffs in this neighborhood. Not only are the mechanisms TCP must use more complex that the ones some supporting layer might have used, TCP represents a leaky abstraction—the underlying connectionless service cannot be completely hidden.
Take another instance more directly related to network design. Suppose you aggregate routing information at every point where you possibly can. Or perhaps you are using BGP route reflectors to manage configuration complexity and route counts. In most cases, this will mean information is flowing through the network suboptimally. You can re-optimize the network, but not without introducing a lot of complexity. Further, you will probably always have some form of leaky abstraction to deal with when abstracting information out of the network.
Second lesson: be careful when stripping information out of the spanning layer in order to simplify the network. There will be tradeoffs, and sometimes you end up with more complexity than what you started with.
A second counter-intuitive case is that of adding complexity to the supporting layers in order to ultimately simplify the spanning layer. It seems, on the model presented in the paper, that adding more services to the spanning layer will always end up adding more complexity to the entire system. MPLS and Segment Routing (SR), however, show this is not always true. If you need traffic steering, for instance, it is easier to implement MPLS or SR in the support layer rather than trying to emulate their services at the application level.
Third lesson: sometimes adding complexity in a lower layer can simplify the entire system—although this might seem to be counter-intuitive from just examining the model.
The bottom line: complexity is driven by applications (top down), but understanding the full stack, and where interactions take place, can open up opportunities for simplifying the overall system. The key is thinking through all parts of the system carefully, using effective mental models to understand how they interact (interaction surfaces), and the considering the optimization tradeoffs you will make by shifting state to different places.
DORA, DevOps, and Lessons for Network Engineers
DevOps Research and Assessment (DORA) released their 2018 Accelerate report on the state of DevOps at the end of 2018; I’m a little behind in my reading, so I just got around to reading it, and trying to figure out how to apply their findings to the infrastructure (networking) side of the world.
DORA found organizations that outsource entire functions, such as building an entire module or service, tend to perform more poorly than organizations that outsource by integrating individual developers into existing internal teams (page 43). It is surprising companies still think outsourcing entire functions is a good idea, given the many years of experience the IT world has with the failures of this model. Outsourced components, it seems, too often become a bottleneck in the system, especially as contracts constrain your ability to react to real-world changes. Beyond this, outsourcing an entire function not only moves the work to an outside organization, but also the expertise. Once you have lost critical mass in an area, and any opportunity for employees to learn about that area, you lose control over that aspect of your system.
DORA also found a correlation between faster delivery of software and reduced Mean Time To Repair (MTTR) (page 19). On the surface, this makes sense. Shops that delivery software continuously are bound to have faster, more regularly exercised processes in place for developing, testing, and rolling out a change. Repairing a fault or failure requires change; anything that improves the speed of rolling out a change is going to drive MTTR down.
Organizations that emphasize monitoring and observability tended to perform better than others (page 55). This has major implications for network engineering, where telemetry and management are often “bolted on” as an afterthought, much like security. This is clearly not optimal, however—telemetry and network management need to be designed and operated like any other application. Data sources, stores, presentation, and analysis need to be segmented into separate services, so new services can be tried out on top of existing data, and new sources can feed into existing services. Network designers need to think about how telemetry will flow through the management system, including where and how it will originate, and what it will be used for.
These observations about faster delivery and observability should drive a new way of thinking about failure domains; while failure domains are often primarily thought of as reducing the “blast radius” when a router or link fails, they serve two much larger roles. First, failure domain boundaries are good places to gather telemetry because this is where information flows through some form of interaction surface between two modules. Information gathered at a failure domain boundary will not tend to change as often, and it will often represent the operational status of the entire module.
Second, well places failure domain boundaries can be used to stake out areas where “new things” can be put in operation with some degree of confidence. If a network has well-designed failure domain boundaries, it is much easier to deploy new software, hardware, and functionality in a controlled way. This enables a more agile view of network operations, including the ability to roll out changes incrementally through a canary process, and to use processes like chaos monkey to understand and correct unexpected failure modes.
Another interesting observation is the j-curve of adoption (page 3):
This j-curve shows the “tax” of building the underlying structures needed to move from a less automated state to a more automated one. Keith’s Law:
In a complex system, the cumulative effect of a large number of small optimizations is externally indistinguishable from a radical leap.
…operates in part because of this j-curve. Do not be discouraged if it seems to take a lot of work to make small amounts of progress in many stages of system development—the results will come later.
The bottom line: it might seem like a report about software development is too far outside the realm of network engineering to be useful—but the reality is network engineers can learn a lot about how to design, build, and operate a network from software engineers.
Why You Should Block Notifications and Close Your Browser
Every so often, while browsing the web, you run into a web page that asks if you would like to allow the site to push notifications to your browser. Apparently, according to the paper under review, about 12% of the people who receive this notification allow notifications. What, precisely, is this doing, and what are the side effects?
Allowing notifications allows the server to kick off one of two different kinds of processes on the local computer, a service worker. There are, in fact, two kinds of worker apps that can run “behind” a web site in HTML5; the web worker and the service worker. The web worker is designed to calculate or locally render some object that will appear on the site, such as unencrypting a downloaded audio file for local rendition. This moves the processing load (including the power and cooling use!) from the server to the client, saving money for the hosting provider, and (potentially) rendering the object in question more quickly.
A service worker, on the other hand, is designed to support notifications. For instance, say you keep a news web site open all day in your browser. You do not necessarily want to reload the page ever few minutes; instead, you would rather the site send you a notification through the browser when some new story has been posted. Since the service worker is designed to cause an action in the browser on receiving a notification from the server, it has direct access to the network side of the host, and it can run even when the tab showing the web site is not visible.
In fact, because service workers are sometimes used to coordinate the information on multiple tabs, a service worker can both communicate between tabs within the same browser and stay running in the browser’s context even though the tab that started the service worker is closed. To make certain other tabs do not block while the server worker is running, they are run in a separate thread; they can consume resources from a different core in your processor, so you are not aware (from a performance perspective) they are running. To sweeten the pot, a service worker can be restarted after your browser has restarted by a special push notification from the server.
If a service worker sounds like a perfect setup for running code that can mine bitcoins or launch DDoS attacks from your web browser, then you might have a future in computer security. This is, in fact, what MarioNet, a proof-of-concept system described in this paper, does—it uses a service worker to consume resources off as many hosts as it can install itself on to do just about anything, including launching a DDoS attack.
Given the above, it should be simple enough to understand how the attack works. When the user lands on a web page, ask for permission to push notifications. A lot of web sites that do not seem to need such permission ask now, particularly ecommerce sites, so the question does not seem out of place almost anywhere any longer. Install a service worker, using the worker’s direct connection to the host’s network to communicate to a controller. The controller can then install code to be run into the service worker and direct the execution of that code. If the user closes their browser, randomly push notifications back to the browser, in case the user opens it again, thus recreating the service worker.
Since the service worker runs in a separate thread, the user will not notice any impact on web browsing performance from the use of their resources—in fact, MarioNet’s designers use fine-grained tracking of resources to ensure they do not consume enough to be noticed. Since the service worker runs between the browser and the host operating system, no defenses built into the browser can detect the network traffic to raise a flag. Since the service worker is running in the context of the browser, most anti-virus software packages will give the traffic and processing a pass.
First, making something powerful from a compute perspective will always open holes like this. There will never be any sort of system that both allows the transfer of computation from one system to another that will not have some hole which can be exploited.
Second, abstraction hides complexity, even the complexity of an attack or security breach, nicely. Abstraction is like anything else in engineering: if you haven’t found the tradeoffs, you haven’t looked hard enough.
Third, close your browser when you are done. The browser is, in many ways, an open door to the outside world through which all sorts of people can make it into your computer. I have often wanted to create a VM or container in which I can run a browser from a server on the ‘net. When I’m done browsing, I can shut the entire thing down and restore the state to “clean.” No cookies, no java stuff, no nothing. A nice fresh install each time I browse the web. I’ve never gotten around to building this, but I should really put it on my list of things to do.
Fourth, don’t accept inbound connection requests without really understanding what you are doing. A notification push is, after all, just another inbound connection request. It’s like putting a hole in your firewall for that one FTP server that you can’t control. Only it’s probably worse.