Technical Debt (or Is Future Proofing Even a Good Idea?)

What, really, is “technical debt?” It’s tempting to say “anything legacy,” but then why do we need a new phrase to describe “legacy stuff?” Even the prejudice against legacy stuff isn’t all that rational when you think about it. Something that’s old might also just be well-tested, or well-worn but still serviceable. Let’s try another tack.

Technical debt, in the software world, can be defined as working on a piece of software for long periods of time by only adding features, and never refactoring or reorganizing the code to meet current conditions. The general idea is that as new features are added on top of the old, two things happen. First, the old stuff becomes a sort of opaque box that no one understands. Second, the stuff being added to the old increasingly relies on public behavior that might be subject to unintended consequences or leaky abstractions.

To resolve this problem in the software world, software is “refactored.” In refactoring, every use of a public API is examined, including what information is being drawn out, or what the expected inputs and outputs are. The old code is then “discarded,” in a sense, and a new underlying function written that meets the requirements discovered in the existing codebase. The refactoring process allows the calling functions and called functions, the clients and the servers, to “move together.” As the overall understanding of the system changes, the system itself can change with that understanding. This brings the implementation closer to the understanding of the current engineering team.

So technical debt is really the mismatch between the current engineering team’s understanding of the requirements and the implementation left behind by a prior engineering team that understood those requirements differently (even if the prior team is the same people at an earlier time).

How can we apply this to networking? In the networking world, we actively work against refactoring by future proofing.

Future proofing: building a network that will never need to be replaced. Or, perhaps, never needing to go to your manager and say “I need to buy new hardware,” because the hardware you bought last week does not have the functionality you need any longer. This sounds great in theory, but the problem with theory is it often does not hold up when the glove hits the nose (everyone has a plan until they are punched in the face). The problem with future proofing is you can’t see the punch that’s coming until the future actually throws it.

If something is future proofed, it either cannot be refactored, or there will be massive resistance to the refactoring process.

Maybe it’s time we tried to change our view of things so we can refactor networks. What would refactoring a network look like? Maybe examining all the configurations within a particular module, figuring out what they do, and then trying to figure out what application or requirement led to that particular bit of configuration. Or, one step higher, looking at every protocol or “feature” (whatever a feature might be), figuring out what purpose it might serve, and then intentionally setting about finding some other way, perhaps a simpler way, to provide that same service.
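The first of those steps, tying each bit of configuration back to a requirement, can be sketched in a few lines of Python. This is only an illustration: the configuration lines, the keyword list, and the names audit_config and ConfigAudit are all hypothetical, and a real audit would need far more than keyword matching.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigAudit:
    """Result of mapping configuration lines to known requirements."""
    justified: dict = field(default_factory=dict)  # line -> requirement
    orphans: list = field(default_factory=list)    # lines nobody can explain


def audit_config(config_lines, requirements):
    """Classify each line as justified by a requirement or as an orphan.

    `requirements` maps a keyword (chosen by the operator, hypothetical
    here) to a short description of the requirement behind it.
    """
    audit = ConfigAudit()
    for raw in config_lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        reason = next(
            (why for keyword, why in requirements.items() if keyword in line),
            None,
        )
        if reason is None:
            audit.orphans.append(line)
        else:
            audit.justified[line] = reason
    return audit


# Hypothetical module configuration and requirement list.
config = [
    "router ospf 1",
    "ip helper-address 192.0.2.10",
    "ip access-list extended LEGACY-APP",
    "! old change ticket",
]
requirements = {
    "ospf": "interior routing for the campus module",
    "helper-address": "DHCP relay for user segments",
}

audit = audit_config(config, requirements)
print(audit.orphans)
```

The orphans list is the interesting output: each entry is a piece of configuration no one can currently explain, and therefore a candidate for the question “what requirement led to this?”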

One key point in this process is to begin by refusing to look at the network as a set of appliances, and starting to see it as a system made up of hardware and software. Mentally disaggregating the software and hardware can allow you to see what can be changed and what cannot, and inject some flexibility into the network refactoring process.

When you refactor a bit of code, what you typically end up with is a simpler piece of software that more closely matches the engineering team’s understanding of current requirements and conditions. Aren’t simplicity and coherence goals for operational networks, too?

If so, when was the last time you refactored your network?

The Dangers of Flying Pigs (RFC1925, rule 3)

There are many times in networking history, and in the day-to-day operation of a network, when an engineer has been asked to do what seems to be impossible. Maybe installing a circuit faster than a speeding bullet or flying over tall buildings to make it to a remote site faster than any known form of conveyance short of a transporter beam (which, contrary to what you might see in the movies, has not yet been invented).

One particular impossible assignment in the early days of network engineering was the common request to replicate the creation of the works of Shakespeare making use of the infinite number of monkeys (obviously) connected to the Internet. The creation of appropriate groups of monkeys, the herding of these groups, and the management of their output were once considered a nearly impossible task, similar to finding a token dropped on the floor or lost in the ether.

This problem proved so intractable that the IETF finally created an entire suite of management tools for managing the infinite monkeys used for these experiments, which is described in RFC2795. This RFC describes the Infinite Monkey Protocol Suite (IMPS), which runs on top of the Internet Protocol, the Infinite Threshold Accounting Gadget (I-TAG), and the KEEPER specification, which provides a series of messages to manage the infinite monkeys. The problem raised a number of questions about the construction of the experiment, such as whether the compilation of works should take place on a letter-by-letter or word-by-word basis. Ultimately, the problem was apparently solved through the creation of infinite monkey simulators, such as this one.

For those situations, such as assembling and managing an infinite suite of monkeys gathered for test, when a network engineer is asked to perform something which is apparently impossible, the first thing that is required is a lot of hot, caffeinated beverage. And there is no better way to make such beverages than through a hypertext-controlled hot beverage device. This device is so important, in fact, that the IETF described the interface and protocols for it fairly early, in RFC2324. While having a hypertext control interface to such devices is important, sometimes the making of caffeinated beverages should be automated; an interface which can be used for automation is described in RFC2325. If the engineer prefers some form of caffeine other than coffee, the procedures in RFC7168 should be followed.

Another common problem posed to network engineers is to make pigs fly. While it has often been reported that pigs cannot, in fact, fly, those who report this are apparently not well acquainted with engineers who have been given large amounts of a hot, caffeinated beverage. In fact, that which is probable, and yet impossible, is often more likely to occur than that which is possible, and yet improbable, once a network engineer has been given enough of this kind of beverage.

There is a danger, however, with attempting to perform the impossible, no matter how good the intentions or plan. As RFC1925 states in rule 3: “With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going to land, and it could be dangerous sitting under them as they fly overhead.” Network engineers plus hot caffeinated beverages may just achieve the impossible.

Or you might end up with a pig on your head. It’s hard to tell, so be careful what you ask for.

Strong Reactions and Complexity

In the realm of network design—especially in the realm of security—we often react so strongly against a perceived threat, or so quickly to solve a perceived problem, that we fail to look for the tradeoffs. If you haven’t found the tradeoffs, you haven’t looked hard enough—or, as Dr. Little says, you have to ask what is gained and what is lost, rather than just what is gained. This failure to look at both sides often results in untold amounts of technical debt and complexity being dumped into network designs (and application implementations), causing outages and failures long after these decisions are made.

A 2018 paper on DDoS attacks, A First Joint Look at DoS Attacks and BGP Blackholing in the Wild, provides a good example of the response to an attack causing more damage than the attack itself. Most networks are configured to allow the operator to quickly trigger a remote triggered black hole (RTBH) using BGP. Most often, a community is attached to a BGP route, and that community causes each eBGP speaker to point the route’s next hop to a local discard route. If used on the route advertising the destination of the attack (the service under attack), the result is the DDoS attack traffic no longer has a destination to flow to. If used on the route advertising the source of the DDoS attack traffic, the result is the DDoS traffic will not pass any reverse-path forwarding checks at the edge of the AS, and hence will be dropped. Since most DDoS attacks are reflected, blocking the source traffic still prevents access to some service, generally DNS or something similar.
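The destination-based mechanism can be modeled in a few lines of Python. This is a toy model of the policy logic, not router configuration: the Route class, the discard next hop 192.0.2.1, and the function names are assumptions made for illustration. The community value 65535:666 is the well-known BLACKHOLE community from RFC 7999; many operators use a value of their own instead.

```python
from dataclasses import dataclass

# Well-known BLACKHOLE community (RFC 7999); operators often define their own.
BLACKHOLE = "65535:666"
# Hypothetical next hop each edge router statically routes to a discard interface.
DISCARD_NEXT_HOP = "192.0.2.1"


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    communities: frozenset = frozenset()


def apply_rtbh_policy(route):
    """Inbound policy: rewrite the next hop of any route tagged for blackholing."""
    if BLACKHOLE in route.communities:
        return Route(route.prefix, DISCARD_NEXT_HOP, route.communities)
    return route


def forward(route):
    """Describe what happens to traffic that matches this route."""
    if route.next_hop == DISCARD_NEXT_HOP:
        return "drop"
    return f"forward via {route.next_hop}"


# The operator tags the host route for the service under attack.
victim = Route("203.0.113.7/32", "198.51.100.1", frozenset({BLACKHOLE}))
other = Route("203.0.113.0/24", "198.51.100.1")

print(forward(apply_rtbh_policy(victim)))  # drop
print(forward(apply_rtbh_policy(other)))   # forward via 198.51.100.1
```

Note that forward() never looks at who sent the packet. Once the route is tagged, legitimate traffic to 203.0.113.7 is discarded along with the attack traffic.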

In either case, then, stopping the DDoS through an RTBH causes damage to services rather than just the attacker. Because of this, remote triggered black holes should really only be used in the most extreme cases, where no other DDoS mitigation strategy will work.

The authors of the Joint Look use publicly available information to determine the answers to several questions. First, what scale of DDoS attacks are RTBHs used against? Second, how long after an attack begins is the RTBH triggered? Third, for how long is the RTBH left in place after the attack has been mitigated?

The answer to the first question should be—the RTBH is only used against the largest-scale attacks. The answer to the second question should be—the RTBH should be put in place very quickly after the attack is detected. The answer to the third question should be—the RTBH should be taken down as soon as the attack has stopped. The researchers found that RTBHs were most often used to mitigate the smallest of DDoS attacks, and almost never to mitigate larger ones. The authors also found that RTBHs were often left in place for hours after a DDoS attack had been mitigated. Both of these imply that current use of RTBH to mitigate DDoS attacks is counterproductive.
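An operator could ask the same three questions of their own incident records. The sketch below assumes a hypothetical Incident record with attack and RTBH timestamps and a peak attack rate; it is not the methodology used in the paper, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one attack and the RTBH used against it."""
    attack_start: datetime
    attack_end: datetime
    rtbh_start: datetime
    rtbh_end: datetime
    peak_gbps: float


def reaction_time(incident):
    """How long after the attack began was the RTBH triggered?"""
    return incident.rtbh_start - incident.attack_start


def linger_time(incident):
    """How long did the RTBH stay in place after the attack ended?"""
    return max(incident.rtbh_end - incident.attack_end, timedelta(0))


# Invented sample: a small attack, a slow trigger, and a slow withdrawal.
incident = Incident(
    attack_start=datetime(2018, 1, 1, 12, 0),
    attack_end=datetime(2018, 1, 1, 12, 30),
    rtbh_start=datetime(2018, 1, 1, 12, 10),
    rtbh_end=datetime(2018, 1, 1, 15, 30),
    peak_gbps=0.5,
)

print(reaction_time(incident))  # 0:10:00
print(linger_time(incident))    # 3:00:00
```

In this invented case the service stayed unreachable for three hours after the attack ended, which is the pattern the researchers describe.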

How many more design patterns do we follow that are simply counterproductive in the same way? This is not a matter of “following the data,” but rather one of really thinking through what it is you are trying to accomplish, and then how to accomplish that goal with the simplest set of tools available. Think through what it would mean to remove what you have put in, whether you really need to add another layer or protocol, how to minimize configuration, etc.

If you want your network to be less complex, examine the tradeoffs realistically.

The Hedge 58: Michael Kehoe and eBPF

Most packet processing in Linux “wants” to be in the kernel. The problem is that adding code to the kernel is a painstaking process because a single line of bad code can cause havoc for millions of Linux hosts. How, then, can new functionality be pushed into the kernel, particularly for packet processing, with reduced risk? Enter eBPF, which allows functions to be inserted into the kernel through a sort of “lightweight container.”

Michael Kehoe joins Tom Ammon and Russ White to discuss eBPF technology and its importance.


Random Thoughts on IoT

Let’s play the analogy game. The Internet of Things (IoT) is probably going to end up being like … a box of chocolates, because you never do know what you are going to get? A big bowl of spaghetti with a serious lack of meatballs? Whatever it is, the IoT should have network folks worried about security. There is, of course, the problem of IoT devices being attached to random places on the network, exfiltrating personal data back to a cloud server you don’t know anything about. Some of these devices might be rogue, of course, such as a Raspberry Pi attached to some random place in the network. Others might be more conventional, such as those new exercise machines the company just brought into the gym that are sending personal information in the clear to an outside service.

While there is research into how to tell the difference between IoT and “larger” devices, the reality is spoofing and blurred lines will likely make such classification difficult. What do you do with a virtual machine that looks like a Raspberry Pi running on a corporate laptop for completely legitimate reasons? Or what about the Raspberry Pi-like device that can run a fully operational Windows stack, including “background noise” applications that make it look like a normal compute platform? These problems are, unfortunately, not easy to solve.

To make matters worse, there are no standards by which to judge the security of an IoT device. Even if the device manufacturer–think about the new gym equipment here–has the best intentions towards security, there is almost no way to determine if a particular device is designed and built with good security. The result is that IoT devices are often infected and used as part of a botnet for DDoS, or other, attacks.

What are our options here from a network perspective? The most common answer to this is segmentation–and segmentation is, in fact, a good start on solving the problem of IoT. But we are going to need a lot more than segmentation to avert certain disaster in our networks. Once these devices are segmented off, what do we do with the traffic? Do we just allow it all (“hey, that’s an IoT device, so let it send whatever it wants to… after all, it’s been segmented off the main network anyway”)? Do we try to manage and control what information is being exfiltrated from our networks? Is machine learning going to step in to solve these problems? Can it, really?
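One alternative to “just allow it all” is a default-deny egress policy for the IoT segment. The Python sketch below models that decision. The segment prefix, the allowed destination, and the function name egress_decision are hypothetical, and a real deployment would enforce this in a firewall or ACL rather than in a script.

```python
import ipaddress

# Hypothetical IoT segment and the only destination its devices may reach.
IOT_SEGMENT = ipaddress.ip_network("10.20.0.0/24")
ALLOWED_DESTINATIONS = [
    (ipaddress.ip_network("198.51.100.0/28"), 443),  # vendor service over HTTPS
]


def egress_decision(src, dst, dst_port):
    """Decide what to do with a flow, treating the IoT segment as default-deny."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if src_ip not in IOT_SEGMENT:
        return "not-iot"  # some other policy handles this flow
    for network, port in ALLOWED_DESTINATIONS:
        if dst_ip in network and dst_port == port:
            return "permit"
    return "deny-and-log"  # unknown destination: block it and record it


print(egress_decision("10.20.0.15", "198.51.100.5", 443))  # permit
print(egress_decision("10.20.0.15", "203.0.113.9", 80))    # deny-and-log
```

This only answers where traffic may go, not what information is in it. The harder questions about exfiltration remain open.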

To put it another way–the attack surface we’re facing here is huge, and the smallest mistake can have very bad ramifications in individual lives. Take, for instance, the problem of data and IoT devices in abusive relationships. Relationships are dynamic; how is your company going to know when an employee is in an abusive relationship, and thus when certain kinds of access should be shut off? There is so much information here it seems almost impossible to manage it all.

It looks, to me, like the future is going to be a bit rough and tumble as we learn to navigate this new realm. Vendors will have lots of good ideas (look at Mist’s capabilities in tracking down the location of rogue devices, for instance), but in the end it’s going to be the operational front line that is going to have to figure out how to manage and deploy networks where there is a broad blend of ultimately untrustable IoT devices and more traditional devices.

Now would be the time to start learning about security, privacy, and IoT if you haven’t started already.

The Hedge 57: Brian Trammell and PANRG

Brian Trammell joins Alvaro Retana and Russ White to discuss the Path Aware Research Group in the IRTF. According to the charter page, PANRG “aims to support research in bringing path awareness to transport and application layer protocols, and to bring research in this space to the attention of the Internet engineering and protocol design community.”


Technologies that Didn’t: Network Operating Systems

For those with a long memory (no, even longer than that), there were once things called Network Operating Systems (NOS’s). These were not the kinds of NOS’s we have today, like Cisco IOS Software, or Arista EOS, or even SONiC. Rather, these were designed for servers. The most common example was Novell’s Netware. These operating systems were the “bread and butter” of the networking world for many years. I was a Certified Netware Engineer (CNE) version 4.0, and then 4.11, before I moved into the routing and switching world. I also deployed Banyan’s Vines, IBM’s OS/2, and a much simpler system called LANtastic, among others.

What were these pieces of software? They were largely built around providing a complete environment for the network user. These systems began with file sharing and included a small driver that would need to be installed on each host accessing the file share. This small driver was actually a network stack for a proprietary set of protocols. For Vines, this was VIP; for Netware, it was IPX. Over time, these systems began to include email, and then, as a natural outgrowth of file sharing and email, directory services. For some time, there was a serious race on to push ever more features into these network operating systems. For instance, a Vines server could not only act as an email server, a file server, and a directory server, it could also act as a router, connecting two Ethernet segments and pushing traffic between them.

What happened? Why and how did these kinds of systems disappear, almost overnight it seems? After all, they provided a lot of very interesting services. You could use one of these systems as a corporate directory, adding each person’s contact information directly into the system itself. Once the person was there, you could assign them rights to file shares, individual files, and even services running on one of the servers. For instance, you could build an application on a framework within Vines that would run across multiple Vines servers (the distribution of the data and the application were all handled in the Vines operating system itself, so long as you built it to their framework) and then simply give people access to it. Lotus Notes, which is still in use today from what I understand, was an overlay service of the same style. You didn’t need to worry about access control, the difference between authentication and authorization, etc.; these were all built into the system.

Why don’t we see these in widespread use today? The “official” reason, if there is such a thing, is that the standardization of the IP protocol stack, and its widespread deployment, caused all of these operating systems to be replaced by a federation of other protocols and applications. For instance, FTP opened up the ability to upload and download files across an IP network, and SMTP standardized the various email clients so email gateways were no longer needed.

A more unofficial answer might be this: these systems tried to do too much. Rather than being a series of smaller systems, each of which solved a particular problem, each of these systems tried to solve every problem, from access control to routing. These systems became bloated and difficult to operate over time. The resource tree, which was grounded in X.500, in Netware 4.11 was a thing of beauty, if you like staring at Mandelbrot patterns. If there were some access control problem, it could take hours to work through the various layers of permissions, and how each was being inherited from the level above.
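To see why tracing inherited permissions gets tedious, consider a deliberately simplified model in Python. It is not Netware’s actual rights algorithm: the Node class, the inherit_mask field, and the rule that rights are masked at each level and then added to are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One container or object in a hypothetical resource tree."""
    name: str
    parent: Optional["Node"] = None
    grants: dict = field(default_factory=dict)  # user -> rights granted here
    inherit_mask: Optional[set] = None          # rights allowed to flow down; None = all


def effective_rights(node, user):
    """Walk from the root down to `node`, masking and adding rights per level."""
    path = []
    current = node
    while current is not None:
        path.append(current)
        current = current.parent

    rights = set()
    for level in reversed(path):  # root first
        if level.inherit_mask is not None:
            rights &= level.inherit_mask      # drop rights blocked at this level
        rights |= level.grants.get(user, set())  # add rights granted at this level
    return rights


# Invented three-level tree.
root = Node("org", grants={"alice": {"read", "write"}})
dept = Node("engineering", parent=root, inherit_mask={"read"})
share = Node("designs", parent=dept, grants={"alice": {"delete"}})

print(sorted(effective_rights(share, "alice")))  # ['delete', 'read']
```

Even with three levels, explaining why alice can delete but not write means inspecting every node on the path. A production tree with many levels, groups, and exceptions multiplies that work.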

Further, these large scale, monolithic servers eventually could not keep up with the smaller tools that were being iteratively improved in the IP and OSI protocol suites. As networks grew, and routing became more important, these operating systems struggled to keep up with complex wide area networks.

Network operating systems are another story of complex, multifaceted, monolithic solutions to a lot of different problems. As with all such solutions, smaller, simpler systems simply overrun the capabilities of the monolithic systems through quick iteration of a smaller problem space. While these systems started out simple, they quickly took on too much, and ended up being difficult to deploy and maintain.