In the realm of network design—especially in the realm of security—we often react so strongly against a perceived threat, or so quickly to solve a perceived problem, that we fail to look for the tradeoffs. If you haven’t found the tradeoffs, you haven’t looked hard enough—or, as Dr. Little says, you have to ask what is gained and what is lost, rather than just what is gained. This failure to look at both sides often results in untold amounts of technical debt and complexity being dumped into network designs (and application implementations), causing outages and failures long after these decisions are made.
A 2018 paper on DDoS attacks, A First Joint Look at DoS Attacks and BGP Blackholing in the Wild provides a good example of causing more damage to an attack than the attack itself. Most networks are configured to allow the operator to quickly configure a remote triggered black hole (RTBH) using BGP. Most often, a community is attached to a BGP route that points the next-hop to a local discard route on each eBGP speaker. If used on the route advertising the destination of the attack—the service under attack—the result is the DDoS attack traffic no longer has a destination to flow to. If used on the route advertising the source of the DDoS attack traffic, the result is the DDoS traffic will no pass any reverse-path forwarding policies at the edge of the AS, and hence be dropped. Since most DDoS attacks are reflected, blocking the source traffic still prevents access to some service, generally DNS or something similar.
In either case, then, stopping the DDoS through an RTBH causes damage to services rather than just the attacker. Because of this, remote triggered black holes should really only be used in the most extreme cases, where no other DDoS mitigation strategy will work.
The authors of the Joint Look use publicly avaiable information to determine the answers to several questions. First, what scale of DDoS attacks are RTBHs used against? Second, how long after an attack begins is the RTBH triggered? Third, for how long is the RTBH left in place after the attack has been mitigated?
The answer to the first question should be—the RTBH is only used against the largest-scale attacks. The answer to the second question should be—the RTBH should be put in place very quickly after the attack is detected. The answer to the third question should be—the RTBH should be taken down as soon as the attack has stopped. The researchers found that RTBHs were most often used to mitigate the smallest of DDoS attacks, and almost never to mitigate larger ones. The authors also found that RTBHs were often left in place for hours after a DDoS attack had been mitigated. Both of these imply that current use of RTBH to mitigate DDoS attacks is counterproductive.
How many more design patterns do we follow that are simply counterproductive in the same way? This is not a matter of “following the data,” but rather one of really thinking through what it is you are trying to accomplish, and then how to accomplish that goal with the simplest set of tools available. Think through what it would mean to remove what you have put in, whether you really need to add another layer or protocol, how to minimize configuration, etc.
If you want your network to be less complex, examine the tradeoffs realistically.
The RPKI, for those who do not know, ties the origin AS to a prefix using a certificate (the Route Origin Authorization, or ROA) signed by a third party. The third party, in this case, is validating that the AS in the ROA is authorized to advertise the destination prefix in the ROA—if ROA’s were self-signed, the security would be no better than simply advertising the prefix in BGP. Who should be able to sign these ROAs? The assigning authority makes the most sense—the Regional Internet Registries (RIRs), since they (should) know which company owns which set of AS numbers and prefixes.
The general idea makes sense—you should not accept routes from “just anyone,” as they might be advertising the route for any number of reasons. An operator could advertise routes to source spam or phishing emails, or some government agency might advertise a route to redirect traffic, or block access to some web site. But … if you haven’t found the tradeoffs, you haven’t looked hard enough. Security, in particular, is replete with tradeoffs.
Every time you deploy some new security mechanism, you create some new attack surface—sometimes more than one. Deploy a stateful packet filter to protect a server, and the device itself becomes a target of attack, including buffer overflows, phishing attacks to gain access to the device as a launch-point into the private network, and the holes you have to punch in the filters to allow services to work. What about the RPKI?
When the RKI was first proposed, one of my various concerns was the creation of new attack services. One specific attack surface is the control a single organization—the issuing RIR—has over the very existence of the operator. Suppose you start a new content provider. To get the new service up and running, you sign a contract with an RIR for some address space, sign a contract with some upstream provider (or providers), set up your servers and service, and start advertising routes. For whatever reason, your service goes viral, netting millions of users in a short span of time.
Now assume the RIR receives a complaint against your service for whatever reason—the reason for the complaint is not important. This places the RIR in the position of a prosecutor, defense attorney, and judge—the RIR must somehow figure out whether or not the charges are true, figure out whether or not taking action on the charges is warranted, and then take the action they’ve settled on.
In the case of a government agency (or a large criminal organization) making the complaint, there is probably going to be little the RIR can do other than simply revoke your certificate, pulling your service off-line.
Overnight your business is gone. You can drag the case through the court system, of course, but this can take years. In the meantime, you are losing users, other services are imitating what you built, and you have no money to pay the legal fees.
A true story—without the names. I once knew a man who worked for a satellite provider, let’s call them SATA. Now, SATA’s leadership decided they had no expertise in accounts receivables, and they were spending too much time on trying to collect overdue bills, so they outsourced the process. SATB, a competing service, decided to buy the firm SATA outsourced their accounts receivables to. You can imagine what happens next… The accounting firm worked as hard as it could to reduce the revenue SATA was receiving.
Of course, SATA sued the accounting firm, but before the case could make it to court, SATA ran out of money, laid off all their people, and shut their service down. SATA essentially went out of business. They won some money later, in court, but … whatever money they won was just given to the investors of various kinds to make up for losses. The business itself was gone, permanently.
Herein lies the danger of giving a single entity like an RIR, even if they are friendly, honest, etc., control over a critical resource.
A recent paper presented at the ANRW at APNIC caught my attention as a potential way to solve this problem. The idea is simple—just allow (or even require) multiple signatures on a ROA. To be more accurate, each authorizing party issues a “partial certificate;” if “enough” pieces of the certificate are found and valid, the route will be validated.
The question is—how many signatures (or parts of the signature, or partial attestations) should be enough? The authors of the paper suggest there should be a “Threshold Signature Module” that makes this decision. The attestations of the various signers are combined in the threshold module to produce a single signature that is then used to validate the route. This way the validation process on the router remains the same, which means the only real change in the overall RPKI system is the addition of the threshold module.
If one RIR—even the one that allocated the addresses you are using—revokes their attestation on your ROA, the remaining attestations should be enough to convince anyone receiving your route that it is still valid. Since there are five regions, you have at least five different choices to countersign your ROA. Each RIR is under the control of a different national government; hence organizations like governments (or criminals!) would need to work across multiple RIRs and through other government organizations to have a ROA completely revoked.
An alternate solutions here, one that follows the PGP model, might be to simply have the threshold signature model consider the number and source of ROAs using the existing model. Local policy could determine how to weight attestations from different RIRs, etc.
This multiple or “shared” attestation (or signature) idea seems like a neat way to work around one of (possibly the major) attack surfaces introduced by the RPKI system. If you are interested in Internet core routing security, you should take a read through the post linked above, and then watch the video.
Should the network be dumb or smart? Network vendors have recently focused on making the network as smart as possible because there is a definite feeling that dumb networks are quickly becoming a commodity—and it’s hard to see where and how steep profit margins can be maintained in a commodifying market. Software vendors, on the other hand, have been encroaching on the network space by “building in” overlay network capabilities, especially in virtualization products. VMWare and Docker come immediately to mind; both are either able to, or working towards, running on a plain IP fabric, reducing the number of services provided by the network to a minimum level (of course, I’d have a lot more confidence in these overlay systems if they were a lot smarter about routing … but I’ll leave that alone for the moment).
How can this question be answered? One way is to think through what sorts of things need to be done in processing packets, and then think through where it makes most sense to do those things. Another way is to measure the accuracy or speed at which some of these “packet processing things” can be done so you can decide in a more empirical way. The paper I’m looking at today, by Anirudh et al., takes both of these paths in order to create a baseline “rule of thumb” about where to place packet processing functionality in a network.
Sivaraman, Anirudh, Thomas Mason, Aurojit Panda, Ravi Netravali, and Sai Anirudh Kondaveeti. “Network Architecture in the Age of Programmability.” ACM SIGCOMM Computer Communication Review 50, no. 1 (March 23, 2020): 38–44. https://doi.org/10.1145/3390251.3390257.
The authors consider six different “things” networks need to be able to do: measurement, resource management, deep packet inspection, network security, network virtualization, and application acceleration. The first of these they measure by setting introducing errors into a network and measuring the dropped packet rate using various edge and in-network measurement tools. What they found was in-network measurement has a lower error rate, particularly as time scales become shorter. For instance, Pingmesh, a packet loss measurement tool that runs on hosts, is useful for measuring packet loss in the minutes—but in-network telemetry can often measure packet loss in the seconds or milliseconds. They observe that in-network telemetry of all kinds (not just packet loss) appears to be more accurate when application performance is more important—so they argue telemetry largely belongs in the network.
Resource management, such as determining which path to take, or how quickly to transmit packets (setting the window size for TCP or QUIC, for instance), is traditionally performed entirely on hosts. The authors, however, note that effective resource management requires accurate telemetry information about flows, link utilization, etc.—and these things are best performed in-network rather than on hosts. For resource management, then, they prefer a hybrid edge/in-network approach.
The argue deep packet inspection and network virtualization are both best done at the edge, in hosts, because these are processor intensive tasks—often requiring more processing power and time than network devices have available. Finally, they argue network security should be located on the host, because the host has the fine-grained service information required to perform accurate filtering, etc.
Based on their arguments, the authors propose four rules of thumb. First, tasks that leverage data only available at the edge should run at the edge. Second, tasks that leverage data naturally found in the network should be run in the network. Third, tasks that require large amounts of processing power or memory should be run on the edge. Fourth, tasks that run at very short timescales should be run in the network.
I have, of course, some quibbles with their arguments … For instance, the argument that security should run on the edge, in hosts, assumes a somewhat binary view of security—all filters and security mechanisms should be “one place,” and nowhere else. A security posture that just moves “the firewall” from the edge of the network to the edge of the host, however, is going to (eventually) face the same vulnerabilities and issues, just spread out over a larger attack surface (every host instead of the entry to the network). Security shouldn’t work this way—the network and the host should work together to provide defense in depth.
The rules of thumb, however, seem to be pretty solid starting points for thinking about the problem. An alternate way of phrasing their result is through the principle of subsidiarity—decisions should be made as close as possible to the information required to make them. While this is really a concept that comes out of ethics and organizational management, it succinctly describes a good rule of thumb for network architecture.
Latency is a big deal for many modern applications, particularly in the realm of machine learning applied to problems like determining if someone standing at your door is a delivery person or a … robber out to grab all your smart toasters and big screen television. The problem is networks, particularly in the last mile don’t deal with latency very well. In fact, most of the network speeds and feeds available in anything outside urban areas kindof stinks. The example given by Bagchi et al. is this—
A fixed video sensor may generate 6Mbps of video 24/7, thus producing nearly 2TB of data per month—an amount unsustainable according to business practices for consumer connections, for example, Comcast’s data cap is at 1TB/month and Verizon Wireless throttles traffic over 26GB/month. For example, with DOCSIS 3.0, a widely deployed cable Internet technology, most U.S.-based cable systems deployed today support a maximum of 81Mbps aggregated over 500 home—just 0.16Mbps per home.
Bagchi, Saurabh, Muhammad-Bilal Siddiqui, Paul Wood, and Heng Zhang. “Dependability in Edge Computing.” Communications of the ACM 63, no. 1 (December 2019): 58–66. https://doi.org/10.1145/3362068.
The authors claim a lot of the problem here is just that edge networks have not been built out, but there is a reason these edge networks aren’t built out large enough to support pulling this kind of data load into a centrally located data center: the network isn’t free.
This is something so obvious to network engineers that it almost slips under our line of thinking unnoticed—except, of course, for the constant drive to make the network cost less money. For application developers, however, the network is just a virtual circuit data rides over… All the complexity of pulling fiber out to buildings or curbs, all the work of physically connecting things to the fiber, all the work of figuring out how to make routing scale, it’s all just abstracted away in a single QUIC or TCP session.
If you can’t bring the data to the compute, which is typically contained in some large-scale data center, then you have to bring the computing power to the data. The complexity of bringing the computing power to the data is applications, especially modern micro-services based applications optimized for large-scale, low latency data center fabrics, just aren’t written to be broken into components and spread all over the world.
Let’s consider the case of the smart toaster—the case used in the paper in hand. Imagine a toaster with little cameras to sense the toastiness of the bread, electronically controlled heating elements, an electronically controlled toast lifter, and some sort of really nice “bread storage and moving” system that can pull bread out of a reservoir, load them into the toaster, and make it all work. Imagine being able to get up in the morning to a fresh cup of coffee and a nice bagel fresh and hot just as you hit the kitchen…
But now let’s look at the complexity required to do such a thing. We must have local processing power and storage, along with some communication protocol that periodically uploads and downloads data to improve the toasting process. You have to have some sort of handling system that can learn about new kinds of bread and adapt to them automatically—this is going to require data, as well. You have to have a bread reservoir that will keep the bread fresh for a few days so you don’t have refill it constantly.
Will you save maybe five minutes every morning? Maybe.
Will you spend a lot of time getting this whole thing up and running? Definitely.
What will the MTBF be, precisely? What about the MTTR?
All to save five minutes in the morning? Of course the authors chose a trivial—perhaps even silly—example to use, just to illustrate the kinds of problems IoT devices combined with edge computing are going to encounter. But still … in five years you’re going to see advertisements for this smart toaster out there. There are toasters that already have a few of these features, and refrigerators that go far beyond this.
Sometimes we have to remember the cost of the network is telling us something—just because we can do a thing doesn’t mean we should. If the cost of the network forces us to consider the tradeoffs, that’s a good thing.
And remember that if your toaster makes your bread at the same time every morning, you have to adjust to the machine’s schedule, rather than the machine adjusting to yours…
I’s fnny, bt yu cn prbbly rd ths evn thgh evry wrd s mssng t lst ne lttr. This is because every effective language—or rather every communication system—carried enough information to reconstruct the original meaning even when bits are dropped. Over-the-wire protocols, like TCP, are no different—the protocol must carry enough information about the conversation (flow data) and the data being carried (metadata) to understand when something is wrong and error out or ask for a retransmission. These things, however, are a form of data exhaust; much like you can infer the tone, direction, and sometimes even the content of conversation just by watching the expressions, actions, and occasional word spoken by one of the participants, you can sometimes infer a lot about a conversation between two applications by looking at the amount and timing of data crossing the wire.
The paper under review today, Off-Path TCP Exploit, uses cleverly designed streams of packets and observations about the timing of packets in a TCP stream to construct an off-path TCP injection attack on wireless networks. Understanding the attack requires understanding the interaction between the collision avoidance used in wireless systems and TCP’s reaction to packets with a sequence number outside the current window.
Beginning with the TCP end of things—if a TCP packet is received with a window falling outside the current window, TCP implementations will send a duplicate of the last ACK it sent back to the transmitter. From the Wireless network side of things, only one talker can use the channel at a time. If a device begins transmitting a packet, and then hears another packet inbound, it should stop transmitting and wait some random amount of time before trying to transmit again. These two things can be combined to guess at the current window size.
Assume an attacker sends a packet to a victim which must be answered, such as a probe. Before the victim can answer, the attacker than sends a TCP segment which includes a sequence number the attacker thinks might be within the victim’s receive window, sourcing the packet from the IP address of some existing TCP session. Unless the IP address of some existing session is used in this step, the victim will not answer the TCP segment. Because the attacker is using a spoofed source address, it will not receive the ACK from this segment, so it must find some other way to infer if an ACK was sent by the victim.
How can the attacker infer this? After sending this TCP sequence, the attacker sends another probe of some kind to the victim which must be answered. If the TCP segment’s sequence number is outside the current window, the victim will attempt to send a copy of its previous ACK. If the attacker times things correctly, the victim will attempt to send this duplicate ACK while the attacker is transmitting the second probe packet; the two packets will collide, causing the victim to back off, slowing the receipt of the probe down a bit from the attacker’s perspective.
If the answer to the second probe is slower than the answer to the first probe, the attacker can infer the sequence number of the spoofed TCP segment is outside the current window. If the two probes are answered in close to the same time, the attacker can infer the sequence number of the spoofed TCP segment is within the current window.
Combining this information with several other well-known aspects of widely deployed TCP stacks, the researchers found they could reliably inject information into a TCP stream from an attacker. While these injections would still need to be shaped in some way to impact the operation of the application sending data over the TCP stream, the ability to inject TCP segments in this way is “halfway there” for the attacker.
There probably never will be a truly secure communication channel invented that does not involve encryption—the data required to support flow control and manage errors will always provide enough information to an attacker to find some clever way to break into the channel.
QUIC is a relatively new data transport protocol developed by Google, and currently in line to become the default transport for the upcoming HTTP standard. Because of this, it behooves every network engineer to understand a little about this protocol, how it operates, and what impact it will have on the network. We did record a History of Networking episode on QUIC, if you want some background.
In a recent Communications of the ACM article, a group of researchers (Kakhi et al.) used a modified implementation of QUIC to measure its performance under different network conditions, directly comparing it to TCPs performance under the same conditions. Since the current implementations of QUIC use the same congestion control as TCP—Cubic—the only differences in performance should be code tuning in estimating the round-trip timer (RTT) for congestion control, QUIC’s ability to form a session in a single RTT, and QUIC’s ability to carry multiple streams in a single connection. The researchers asked two questions in this paper: how does QUIC interact with TCP flows on the same network, and does UIC perform better than TCP in all situations, or only some?
To answer the first question, the authors tried running QUIC and TCP over the same network in different configurations, including single QUIC and TCP sessions, a single QUIC session with multiple TCP sessions, etc. In each case, they discovered that QUIC consumed about 50% of the bandwidth; if there were multiple TCP sessions, they would be starved for bandwidth when running in parallel with the QUIC session. For network folk, this means an application implemented using QUIC could well cause performance issues for other applications on the network—something to be aware of. This might mean it is best, if possible, to push QUIC-based applications into a separate virtual or physical topology with strict bandwidth controls if it causes other applications to perform poorly.
Does QUIC’s ability to consume more bandwidth mean applications developed on top of it will perform better? According to the research in this paper, the answer is how many balloons fit in a bag? In other words, it all depends. QUIC does perform better when its multi-stream capability comes into play and the network is stable—for instance, when transferring variably sized objects (files) across a network with stable jitter and delay. In situations with high jitter or delay, however, TCP consistently outperforms QUIC.
TCP outperforming QUIC is a bit of a surprise in any situation; how is this possible? The researchers used information from their additional instrumentation to discover QUIC does not tolerate out-of-order packet delivery very well because of its fast packet retransmission implementation. Presumably, it should be possible to modify these parameters somewhat to make QUIC perform better.
This would still leave the second problem the researchers found with QUIC’s performance—a large difference between its performance on desktop and mobile platforms. The difference between these two comes down to where QUIC is implemented. Desktop devices (and/or servers) often have smart NICs which implement TCP in the ASIC to speed packet processing up. QUIC, because it runs in user space, only runs on the main processor (it seems hard to see how a user space application could run on a NIC—it would probably require a specialized card of some type, but I’ll have to think about this more). The result is that QUIC’s performance depends heavily on the speed of the processor. Since mobile devices have much slower processors, QUIC performs much more slowly on mobile devices.
QUIC is an interesting new transport protocol—one everyone involved in designing or operating networks is eventually going to encounter. This paper gives good insight into the “soul” of this new protocol.
When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.
The paper begins by defining what complexity in the physical infrastructure of a DC fabric looks like. They focus on packaging, or the layout of the switches in the fabric, the bundles of cabling required to wire the topology, and the number and locations of patch panels required. The packaging and patch panels impact the length and complexity of the cable runs (whether optical or copper), which represents a base complexity for the entire topology.
The second thing they consider is the lifecycle of the physical fabric infrastructure. What steps are required to upgrade the fabric from a smaller configuration to a larger one? Or from a lower speed (higher oversubscription) to a higher speed (lower oversubscription)? The result is the ability to put a number on the overall complexity of each topology.
The first class of topologies they consider are spine-and-leaf, such as the Clos, Benes, and butterfly fabrics. They call all kinds of spine-and-leaf fabrics Clos fabrics. Spine-and-leaf fabrics, they note generally have very low cabling complexity because their symmetry encourages consistent bundling and hardware placement. They call the second kind of topology expander fabrics; the most common fabric in this class is the dragonfly. These topologies are more difficult to wire but simpler to scale out because they can be expanded largely by modifying just the edge of the fabric. Their analysis shows these classes of fabric rate equally on their complexity scale.
A side note they don’t consider in the paper—their complexity computation implies that if you are building a fabric with a somewhat fixed range of sizes, and you can preplan the location of spines leaving enough room for the maximum sized fabric on the first day, spine-and-leaf fabrics are less complex than the fancier topologies you might hear about from time to time. Since most data center fabrics do, in fact, fall into these kinds of constraints (given a good day one designer!), this seems to validate the widespread use of butterfly and Clos fabrics for most applications. This feels like a significant result for most common data center fabric designs.
Finally, they describe an interesting topology they call FatClique, which is an interesting blend of spine-and-lead and edge expander topologies; I’ve screen grabbed the image from the paper below.
Overall, it’s well worth spending the time to read the entire paper if you have an in-depth interest in fabric design.The way this topology is described feels very much like a Benes to me, or a butterfly where the fabric routers are replaced by fabrics (making a seven-stage fabric). It’s hard to tell how useful this topology would be in real deployments—but that researchers are looking into alternatives other than the venerable spine-and-leaf is interesting in its own right.