What, really, is “technical debt?” It’s tempting to say “anything legacy,” but then why do we need a new phrase to describe “legacy stuff?” Even the prejudice against legacy stuff isn’t all that rational when you think about it. Something that’s old might also just be well-tested, or well-worn but still serviceable. Let’s try another tack.
Technical debt, in the software world, can be defined as working on a piece of software for long periods of time by only adding features, and never refactoring or reorganizing the code to meet current conditions. The general idea is that as new features are added on top of the old, two things happen. First, the old stuff becomes a sort of opaque box that no-one understands. Second, the stuff being added to the old increasingly relies on public behavior that might be subject to unintended consequences or leaky abstractions.
To resolve this problem in the software world, software is “refactored.” In refactoring, every use of a public API is examined, including what information is being drawn out, or what the expected inputs and outputs are. The old code is then “discarded,” in a sense, and a new underlying function written that meets the requirements discovered in the existing codebase. The refactoring process allows the calling functions and called functions, the clients and the servers, to “move together.” As the overall understanding of the system changes, the system itself can change with that understanding. This brings the implementation closer to the understanding of the current engineering team.
So technical debt is really the mismatch between the understanding of the current engineering team imposed onto the implementation of a prior engineering team with a different understanding of the requirements (even if the older engineering team is the same people at a different time).
How can we apply this to networking? In the networking world, we actively work against refactoring by future proofing.
Future proofing: building a network that will never need to be replaced. Or, perhaps, never needing to go to your manager and say “I need to buy new hardware,” because the hardware you bought last week does not have the functionality you need any longer. This sounds great in theory, but the problem with theory is it often does not hold up when the glove hits the nose (every has a plan until they are punched in the face). The problem with future proofing is you can’t see the punch that’s coming until the future actually throws it.
If something is future proofed, it either cannot be refactored, or there will be massive resistence to the refactoring process.
Maybe its time we tried to change our view of things so we can refactor networks. What would refactoring a network look like? Maybe examining all the configurations within a particular module, figuring what they do, and then trying to figure out what application or requirement led to that particular bit of configuration. Or, one step higher, looking at every protocol or “feature” (whatever a feature might be), figuring out what purpose it might serve, and then intentionally setting about finding some other way, perhaps a simpler way, to provide that same service.
One key point in this process is to begin by refusing to look at the network as a set of appliances, and starting to see it as a system made up of hardware and software. Mentally disaggregating the software and hardware can allow you to see what can be changed and what cannot, and inject some flexibility into the network refactoring process.
When you refactor a bit of code, what you typically end up with a simpler piece of software that more closely matches the engineering team’s understanding of current requirements and conditions. Aren’t simplicity and coherence goals for operational networks, too?
If so, ehen was the last time you refactored your network?
In the realm of network design—especially in the realm of security—we often react so strongly against a perceived threat, or so quickly to solve a perceived problem, that we fail to look for the tradeoffs. If you haven’t found the tradeoffs, you haven’t looked hard enough—or, as Dr. Little says, you have to ask what is gained and what is lost, rather than just what is gained. This failure to look at both sides often results in untold amounts of technical debt and complexity being dumped into network designs (and application implementations), causing outages and failures long after these decisions are made.
A 2018 paper on DDoS attacks, A First Joint Look at DoS Attacks and BGP Blackholing in the Wild provides a good example of causing more damage to an attack than the attack itself. Most networks are configured to allow the operator to quickly configure a remote triggered black hole (RTBH) using BGP. Most often, a community is attached to a BGP route that points the next-hop to a local discard route on each eBGP speaker. If used on the route advertising the destination of the attack—the service under attack—the result is the DDoS attack traffic no longer has a destination to flow to. If used on the route advertising the source of the DDoS attack traffic, the result is the DDoS traffic will no pass any reverse-path forwarding policies at the edge of the AS, and hence be dropped. Since most DDoS attacks are reflected, blocking the source traffic still prevents access to some service, generally DNS or something similar.
In either case, then, stopping the DDoS through an RTBH causes damage to services rather than just the attacker. Because of this, remote triggered black holes should really only be used in the most extreme cases, where no other DDoS mitigation strategy will work.
The authors of the Joint Look use publicly avaiable information to determine the answers to several questions. First, what scale of DDoS attacks are RTBHs used against? Second, how long after an attack begins is the RTBH triggered? Third, for how long is the RTBH left in place after the attack has been mitigated?
The answer to the first question should be—the RTBH is only used against the largest-scale attacks. The answer to the second question should be—the RTBH should be put in place very quickly after the attack is detected. The answer to the third question should be—the RTBH should be taken down as soon as the attack has stopped. The researchers found that RTBHs were most often used to mitigate the smallest of DDoS attacks, and almost never to mitigate larger ones. The authors also found that RTBHs were often left in place for hours after a DDoS attack had been mitigated. Both of these imply that current use of RTBH to mitigate DDoS attacks is counterproductive.
How many more design patterns do we follow that are simply counterproductive in the same way? This is not a matter of “following the data,” but rather one of really thinking through what it is you are trying to accomplish, and then how to accomplish that goal with the simplest set of tools available. Think through what it would mean to remove what you have put in, whether you really need to add another layer or protocol, how to minimize configuration, etc.
If you want your network to be less complex, examine the tradeoffs realistically.
While software design is not the same as network design, there is enough overlap for network designers to learn from software designers. A recent paper published by Butler Lampson, updating a paper he wrote in 1983, is a perfect illustration of this principle. The paper is caleld Hints and Principles for Computer System Design. I’m not going to write a full review here–you should really go read the paper for yourself–but rather just point out some useful bits of the paper.
The first really useful point of this paper is Lampson breaks down the entire field of software design into three basic questions: What, How, and When (or who)? Each of these corresponds to the goals, techniques, and processes used to design and develop software. These same questions and answers apply to network design–if you are missing one of these three areas, then you are probably missing some important set of questions you have not answered yet. Each of these is also represented by an acronym: what? is STEADY, how? is AID, and when? is ART. Let’s look at a couple of these in a little more detail to see how Lampson’s system works.
STEADY stands for simple, timely, efficient, adaptable, dependable, and yummy. Simple is just what it sounds like –reduce complexity. I’m not entirely on board with Lampson’s description of simplicity, which seems to focus on abstraction–abstraction is one useful tool, but anyone who reads my work regularly knows I’m rather more careful about abstraction than most because it involves often-unexamined tradeoffs. Timely primarily relates to “is there a market for this,” in software design; for networks it might be better put as “does the business need this now or later?” Efficient is one of those tradeoffs involved in abstraction–what I might call one of the various ways of optimizing a system. Adaptable means just what it sounds like–are you creating technial debt that must be resolved later? Dependable could be translated to resilience in network design, but it would also relate to many aspects of security, and even the jitter and delay elements in application support.
Yummy is one many network engineers will not be familiar with, but is worth considering. If I’m reading Lampson right here, another way to say this might be “easy to consume.” Why do you want your customers to be able to consume the network easily? Because you do not want them running off and using the cloud (for instance) because they find committing and understanding resources in your network so difficult. We have, for far to long, assumed that “easy to consume” in the network design world means “just plug it into the wall.” It’s not that simple.
The second one, AID, stands for approximate, incremental, and divide & conquer. These are, again, easily adaptable to network design. You don’t need to make the design perfect the first time. In fact, as a young artist one thing that was drilled into my head was that the perfect was the enemy of the good–it’s better to get it approximately right, right now, than perfectly right ten years down the road (when no-one cares any longer). Incremental speaks to modularization, scale-out, and lifecycle management, for instance.
While not every principle here can be applied, a lot of them can. Having them listed out in an easy-to-remember format like this is a great design aid–learn these, and use them.
A recent paper on network control and management (which includes Jennifer Rexford on the author list—anything with Jennifer on the author list is worth reading) proposes a clean slate 4d approach to solving much of the complexity we encounter in modern networks. While the paper is interesting, it’s very unlikely we will ever see a clean slate design like the one described, not least because there will always be differences between what the proper splits are—what should go where.
There is one section of the paper that eloquently speaks to current architecture, however. The authors describe a situation where routing and packet filters are used together to prevent one set of hosts from reaching another set of hosts. Changes in the network, however, cause the packet filters to be bypassed, opening up communications between these two sets of hosts.
This is exactly the problem we so often face in network engineering today—overlapping systems used to solve a single problem do not pay attention to the same signals or information to do their jobs. So here’s a thought about an obvious way to reduce the complexity of your network—try to use one tool to do one job. Before the days of automation, this was much harder to do. There was no way to distribute QoS configurations, for instance, or access lists, much less what might be considered an “easy way.” Because of this, it made some kind of sense to use routing protocols as a sort of distributed database and policy engine to move filters and the like around.
Today, however, we have automation. Because of this, it makes more sense to use automation to manage as much data plane policy as you can, leaving the routing protocol to do its job—provide reachability across an ever-changing network. There are still things, like traffic steering and prefix distribution rules, which should stay inside routing. But when you put routing filters in place to solve a data plane problem, it might be worth thinking about whether that is the right thing to do any longer.
Automation, in this case, can change everything.
This week is very busy for me, so rather than writing a single long, post, I’m throwing together some things that have been sitting in my pile to write about for a long while.
From Dalton Sweeny:
A physicist loses half the value of their physics knowledge in just four years whereas an English professor would take over 25 years to lose half the value of the knowledge they had at the beginning of their career. . . Software engineers with a traditional computer science background learn things that never expire with age: data structures, algorithms, compilers, distributed systems, etc. But most of us don’t work with these concepts directly. Abstractions and frameworks are built on top of these well studied ideas so we don’t have to get into the nitty-gritty details on the job (at least most of the time).
This is precisely the way network engineering is. There is value in the kinds of knowledge that expire, such as individual product lines, etc.—but the closer you are to the configuration, the more ephemeral the knowledge is. This is one of the entire points of rule 11 is your friend. Learn the foundational things that make learning the ephemeral things easier. There are only four problems (really) in moving data from one place to another. There are only around four solutions for each of those problems. Each of those solutions is bounded into a small set (again, about four for each) sub-solutions, or ways of implementing the solution, etc.
I’m going to spend some time talking about this in the Livelesson I’m currently recording, so watch this space for an announcement sometime early next year about publication.
From Ivan P:
What I’m pointing out in this rant is the reverse reasoning along the lines “vendor X is doing something, which confirms there’s a real market need for it”. I’ve been in IT too long, and seen how the startup/VC sausage is made, to believe that fairy tale… and even when it’s true, it doesn’t necessarily imply that you need whatever vendor X is selling.
There are two ways to look at this. Either vendors should lead the market in building solutions, or they should follow whatever the customer wants. From my perspective, one of the problems we have right now is everything is a massive mish-mash of these two things. The operator’s design team thinks of a neat way to do X, and then promises the account team a big check if its implemented. It doesn’t matter that X could be solved some other way that might be simpler, etc.—all that matters is the check. In this case, the vendor stops challenging the customer to build things better, and starts just acting like a commodity provider, rather than an innovative partner.
The interaction between the customer and the vendor needs to be more push-pull than it currently is—right now, it seems like either the operator simply dictates terms to the vendor, or the vendor pretty much dictates architecture to the operator. We need to find a middle ground. The vendor does need to have a solid solution architecture, but the architecture needs to be flexible, as well, and the blocks used to build that architecture need to be usable in ways not anticipated by the vendor’s design folks.
On the other hand, we need to stop chasing features. This isn’t just true of vendors, this is true of operators as well. You get feature lists because that’s what you ask for. Often, operators ask for feature lists because that’s the easiest thing to measure, or because they already have a completely screwed up design they are trying to brownfield around. The worst is—“we have this brownfield that we just can’t get rid of, so we want to build yet another overlay on top, which will make everything simpler.” After about the twentieth overlay a system crash becomes a matter of when rather than if.
The word on the street is that everyone—especially network engineers—must learn to code. A conversation with a friend and an article passing through my RSS reader brought this to mind once again—so once more into the breach. Part of the problem here is that we seem to have a knack for asking the wrong question. When we look at network engineer skill sets, we often think about the ability to configure a protocol or set of features, and then the ability to quickly troubleshoot those protocols or features using a set of commands or techniques.
This is, in some sense, what various certifications have taught us—we have reached the expert level when we can configure a network quickly, or when we can prove we understand a product line. There is, by the way, a point of truth in this. If you claim your expertise is with a particular vendor’s gear, then it is true that you must be able to configure and troubleshoot on that vendor’s gear to be an expert. There is also a problem of how to test for networking skills without actually implementing something, and how to implement things without actually configuring them. This is a problem we are discussing in the new “certification” I’ve been working on, as well.
This is also, in some sense, what the hiring processes we use have taught us. Computers like to classify things in clear and definite ways. The only clear and definite way to classify networking skills is by asking questions like “what protocols do you understand how to configure and troubleshoot?” It is, it seems, nearly impossible to test design or communication skills in a way that can be easily placed on a resume.
Coding, I think, is one of those skills that is easy to appear to measure accurately, and it’s also something the entire world insists is the “coming thing.” No coding skills, no job. So it’s easy to ask the easy question—what languages do you know, how many lines of code have you written, etc. But again, this is the wrong question (or these are the wrong questions).
What is the right question? In terms of coding skills, more along the lines of something like, “do you know how to build and use tools to solve business problems?” I phrase it this way because the one thing I have noticed about every really good coder I have known is they all spend as much time building tools as they do building shipping products. They build tools to test their code, or to modify the code they’ve already written en masse, etc. In fact, the excellent coders I know treat functions like tools—if they have to drive a nail twice, they stop and create a hammer rather than repeating the exercise with some other tool.
So why is coding such an important skill to gain and maintain for the network engineer? This paragraph seems to sum it up nicely for me—
“Coding is not the fundamental skill,” writes startup founder and ex-Microsoft program manager Chris Granger. What matters, he argues, is being able to model problems and use computers to solve them. ”We don’t want a generation of people forced to care about Unicode and UI toolkits. We want a generation of writers, biologists, and accountants that can leverage computers.”
It’s not the coding that matters, it’s “being able to model problems and use computers to solve them.” This is the essence of tool building or engineering—seeing the problem, understanding the problem, and then thinking through (sometimes by trial and error) how to build a tool that will solve the problem in a consistent, easy to manage way. I fear that network engineers are taking their attitude of configuring things and automating it to make the configuration and troubleshooting faster. We seem to end up asking “how do I solve the problem of making the configuration of this network faster,” rather than asking “what business problem am I trying to solve?”
To make effective use of the coding skills we’re telling everyone to learn, we need to go back to basics and understand the problems we’re trying to solve—and the set of possible solutions we can use to solve those problems. Seen this way, the routing protocol becomes “just another tool,” just like a function call, that can be used to solve a specific set of problems—instead of a set of configuration lines that we invoke like a magic incantation to make things happen.
Coding skills are important—but they require the right mindset if we’re going to really gain the sorts of efficiencies we think are possible.
Have you ever looked at your wide area network and wondered … what would the traffic flows look like if this link or that router failed? Traffic modeling of this kind is widely available in commercial tools, which means it’s been hard to play with these kinds of tools, learn how they work, and understand how they can be effective. There is, however, an open source alternative—pyNTM.