It’s not unusual in the life of a network engineer to go entire weeks, perhaps even months, without “getting anything done.” This might seem odd for those who do not work in and around the odd combination of layer 1, layer 3, layer 7, and layer 9 problems network engineers must span and understand, but it’s normal for those in the field. For instance, a simple request to support a new application might require the implementation of some feature, which in turn requires upgrading several thousand devices, leading to the discovery that some number of these devices simply do not support the new software version, requiring a purchase order and change management plan to be put in place to replace those devices, which results in … The chain of dominoes, once it begins, never seems to end.
Or, as those who have dealt with these problems many times might say, it is more complicated than you think. This is such a useful phrase, in fact, it has been codified as a standard rule of networking in RFC1925 (rule 8, to be precise).
Take, for instance, the problem of sending documents through electronic mail—in the real world, there are various mechanisms available to group documents, so the recipient understands what documents go together as a set, which ones are separate—staples, paperclips, binders, folders, etc. In the virtual world, however, documents are just a big blob of bits. How does anyone know which documents go with which in this situation? The obvious solution is to create electronic versions of staples and paperclips, as described in RFC1927. This only seems simple, however; it is more complicated than you think.
For instance, how do you know someone along the document transmission path has not altered the staples and/or paper clips? To prevent staple tampering, electronic staples must be cryptographically signed in some way. In the real world, paper clips (in particular) are removed from documents and re-used to save money and resources. Likewise, there must be some process to discover unused digital document sets so the paper clips may be removed and placed in some form of storage for reuse. Some people like to use differently colored staples or paperclips; how should these be represented in the digital world? RFC1927 describes MIME labels to resolve most of these problems, but there is one final problem that brings the complexity of grouping electronic documents to an entirely new level: metadata creep. What happens when the amount of data required to describe the staple or paperclip becomes larger than the documents being grouped?
Something as simple as representing characters in a language can often be more complex than it might initially seem. RFC5242 attempts to resolve the complexity of the many available encoding schemes with a single coding scheme. Rather than assigning each symbol within a language to a single number within a number space, as ASCII and Unicode do, RFC5242 suggests creating a set of codes which describe how a character looks, rather than what it stands for. This allows the authors to use four principles (if it looks alike, it is alike; if it is the same thing, it is the same thing; sans-serif is preferred; combine characters rather than creating new ones where possible) to create a simplified way to describe any possible character in virtually any “Latin” language. The result requires a bit more space to store in some cases, and is more difficult to process, but it is simpler, at least from some perspective.
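To make the contrast concrete, here is a minimal Python sketch of the "one number per symbol" model and of combining characters. It uses Unicode's own normalization forms from the standard library as a stand-in; it is not an implementation of the RFC5242 code set, and the choice of "é" as the sample character is just an illustration.

```python
import unicodedata

# ASCII and Unicode both assign each character a single number (a code point).
print(ord("A"))  # 65, the same in ASCII and Unicode

# "é" exists as one precomposed code point...
precomposed = "\u00e9"

# ...but it can also be written as a base letter plus a combining accent,
# which is the "combine characters rather than creating new ones" idea.
decomposed = unicodedata.normalize("NFD", precomposed)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
print(code_points)  # ['U+0065', 'U+0301']

# The two spellings look identical but compare as different strings until
# they are normalized to a common form.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Even this small case shows the tradeoff described above: the combined form takes more space and needs an extra processing step before two strings can be compared.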
RFC5242 reminds me of a protocol custom-developed for an application I once had to troubleshoot—the entire protocol was sent in actual ASCII text. At least it was simpler to read on the network packet capture tool. There are, of course, many other examples of things being more complex than initially thought in the networking world—which is probably a good thing, because it means those many reports of the demise of the network engineer are probably greatly exaggerated.
Fear sells. Fear of missing out, fear of being an imposter, fear of crime, fear of injury, fear of sickness … we can all think of times when people we know (or worse, people in the throes of the madness of crowds) have made really bad decisions because they were afraid of something. Bruce Schneier has documented this a number of times. For instance: “it’s smart politics to exaggerate terrorist threats” and “fear makes people deferential, docile, and distrustful, and both politicians and marketers have learned to take advantage of this.” Here is a paper comparing the risk of death in a bathtub to death because of a terrorist attack: bathtubs win.
But while fear sells, the desire to appear unafraid also sells—and it conditions people’s behavior much more than we might think. For instance, we often say of surveillance “if you have done nothing wrong, you have nothing to hide”—a bit of meaningless bravado. What does this latter attitude—“I don’t have anything to worry about”—cause in terms of security?
Several attempts at researching this phenomenon have come to the same conclusion: average users will often intentionally not use things they see someone they perceive as paranoid using. According to this body of research, people will not use password managers because using one is perceived as being paranoid in some way. Theoretically, this effect is caused by illusory correlation, where people associate an action with a kind of person (only bad/scared people would want to carry a weapon). Since we don’t want to be the kind of person we associate with that action, we avoid the action—even though it might make sense.
This is just the flip side of fear sells, of course. Just like we overestimate the possibility of a terrorist attack impacting our lives in a direct, personal way, we also underestimate the possibility of more mundane things, like drowning in a tub, because we either think we can control it, or because we don’t think we’ll be targeted in that way, or because we want to signal to the world that we “aren’t one of those people.”
Even knowing this is true, however, how can we counter this? How can we convince people to learn to assess risks rationally, rather than emotionally? How can we convince people that the perception of control should not impact your assessment of personal security or safety?
Simplifying design and use of the systems we build would be one—perhaps not-so-obvious—step we can take. The more security is just “automatic,” the more users will become accustomed to deploying security in their everyday lives. Another thing we might be able to do is stop trying to scare people into using these technologies.
In the meantime, just be aware that if you’re an engineer, your use of a technology “as an example” to others can backfire, causing people to not want to use those technologies.
I cannot count the number of times I’ve heard someone ask these two questions—
- What are other people doing?
- What is the best common practice?
While these questions have always bothered me, I could never really put my finger on why. I ran across a journal article recently that helped me understand a bit better. The root of the problem is this—what does best common mean, and how can following the best common produce a set of actions you can be confident will solve your problem?
Bellman and van Oorschot say best common practice can mean “this is widely implemented.” The thinking seems to run something like this: the crowd’s collective wisdom will probably be better than my thinking… more sets of eyes will make for wiser or better decisions. Anyone who has studied the madness of crowds will immediately recognize the folly of this kind of statement. Just because a lot of people agree it’s a good idea to jump off a cliff does not mean it is, in fact, a good idea to jump off a cliff.
Perhaps it means something closer to “this is no worse than our competitors.” If that’s the meaning, though, it’s a pretty cynical result. It’s saying, “I don’t mind condemning myself to mediocrity so long as I see everyone else doing the same thing.” It doesn’t sound like much of a way to grow a business.
The authors do provide their definition—
For a given desired outcome, a “best practice” is a means intended to achieve that outcome, and that is considered to be at least as “good” as the best of other broadly considered means to achieve that same outcome.
The thinking seems to run something like this—it’s likely that everyone else has tried many different ways of doing this; that they have all settled on doing this, this way, means all those other methods are probably not as good as this one for some reason.
Does this work? There’s no way to tell without further investigation. How many of the other folks doing “this” spent serious time trying alternatives, and how many just decided the cheapest way was the best no matter how poor the result might be? In fact, how can we know what the results of doing things “this way” have been in all those other networks? Where would we find this kind of information?
In the end, I can’t ever make much sense out of the question, “what is everyone else doing?” Discovering what everyone else is doing might help me eliminate possibilities (that didn’t work for them, so I certainly don’t want to try it), or it might help me understand the positive and negative attributes of a given solution. Still, I don’t understand why “common” should imply “best.”
The best solution for this situation is simply going to be the best solution. Feel free to draw on many sources, but don’t let other people determine what you should be doing.
Last week I began discussing why AS Path Prepend doesn’t always affect traffic the way we think it will. Two other observations from the research paper I’m working off of are:
- Adding two prepends will move more traffic than adding a single prepend
- It’s not possible to move traffic incrementally by prepending; when it works, prepending will end up moving most of the traffic from one inbound path to another
A slightly more complex network will help explain these two observations.
Assume AS65000 would like to control the inbound path for 100::/64. I’ve added a link between AS65001 and 65002 here, but we will still find prepending a single AS to the path won’t make much difference in the path used to reach 100::/64. Why?
Because most providers will have a local policy configured—using local preference—that causes them to choose any available customer connection over other paths. AS65001, on receiving the route to 100::/64 from AS65000, will set the local preference so it will prefer this route over any other route, including the one learned from AS65002. So while the cause is a little different in this case than the situation covered in the first post, the result is the same.
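The effect of a customer-preferring policy can be sketched in a few lines of Python. This is a deliberately simplified model of best-path selection that looks only at local preference and then AS Path length; the local preference values (200 for customer routes, 100 for peer routes), the `Route` class, and the assumption that AS65000 prepends towards AS65001 are all illustrative choices, not taken from any particular provider's configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    learned_from: str
    local_pref: int
    as_path: tuple

def best_route(routes):
    # Higher local preference wins first; a shorter AS Path only breaks ties.
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

CUSTOMER_PREF = 200  # hypothetical value for customer-learned routes
PEER_PREF = 100      # hypothetical value for peer-learned routes

def routes_at_65001(prepends, customer_pref=CUSTOMER_PREF):
    # The two routes towards 100::/64 as AS65001 would see them.
    return [
        Route("customer AS65000", customer_pref, (65000,) * (1 + prepends)),
        Route("peer AS65002", PEER_PREF, (65002, 65000)),
    ]

# With the customer-preferring policy, no amount of prepending matters.
for prepends in range(6):
    print(prepends, best_route(routes_at_65001(prepends)).learned_from)

# Only if local preference were equal would the AS Path length decide.
print(best_route(routes_at_65001(2, customer_pref=PEER_PREF)).learned_from)
```

In this model the customer route wins for every prepend count in the loop, and the peer route wins only in the last line, where the policy has been removed.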
We can, of course, prepend twice onto the AS Path rather than once. What impact would that have here? It still won’t impact the traffic originating in 65005 because AS65001 is the only path available towards 100::/64 from their perspective. Prepending cannot change anything if there’s only one path.
However, if most of the traffic destined to 100::/64 is coming from AS65006, 7, and 8 rather than from AS65005, prepending two times will allow AS65000 to shift the traffic from the path through AS65002 to the path through AS65001. This example might seem a little contrived. Still, it’s pretty similar to networks that have one connection to some local provider (a cable company or something similar) and one connection to a more prominent national or international provider. Any time you are connected to two different providers who have different ranges of connectivity, prepending two autonomous systems on the AS Path will probably be able to shift traffic from one inbound link to another.
What about prepending more than two hops to the AS Path? Each additional prepend is going to shift smaller amounts of traffic. It makes sense that increasing the number of prepends doesn’t shift much more because the further away you get from the edge of the Internet, the more fully connected the autonomous systems are, and the more likely you are to run into some other policy that will override the AS Path in determining the best path. The average length of the AS Path in the Internet is around four; prepending more than this normally won’t have much of an effect on traffic flow.
The second question above can also be answered by looking at this network. Why can’t you shift traffic incrementally by prepending onto the AS Path? Because the connectivity close to the edge is probably not meshy enough. You can’t shift over just the traffic from one AS or another; you can only shift traffic from the entire set of autonomous systems behind your upstream from one inbound link to another. You can adjust traffic on a per-prefix basis, however, which can be useful for balancing between two inbound links.
What can you do to control inbound traffic with more certainty? Take a look at this older post for thoughts on using communities and de-aggregation to steer traffic.
Just about everyone prepends AS numbers to shift inbound traffic from one provider to another, but does this really work? First, a short review on prepending, and then a look at some recent research in this area.
What is prepending meant to do?
Looking at this network diagram, the idea is for AS65000 (each router is in its own AS) to steer traffic through AS65001, rather than AS65002, for 100::/64. The most common way to try to accomplish this is for AS65000 to prepend its own AS number onto the AS Path multiple times. Increasing the length of the AS Path will, in theory, cause a route to be less preferred.
In this case, suppose AS65000 prepends its own AS number on the AS Path once before advertising the route towards AS65001, and not towards AS65002. Assuming there is no link between AS65001 and AS65002, what would we expect to happen? What we would expect is AS65001 will receive one route towards 100::/64 with an AS Path of 2 and use this route. AS65002 will, likewise, receive one route towards 100::/64 with an AS Path of 1 and use this route.
AS65003, however, will receive two routes towards 100::/64, one with an AS Path of 3 through AS65001, and one with an AS Path of 2 through AS65002. All other things being equal (local preference, etc.), AS65003 will prefer the route with the shorter AS Path through AS65002, and select that path to reach 100::/64. AS65004 will only receive one path towards 100::/64, the one through AS65002, because AS65003 will only advertise its best path to AS65004.
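The walk-through above can be reproduced with a small Python sketch. It assumes the topology the text describes (AS65000 attached to AS65001 and AS65002, both attached to AS65003, and AS65004 behind AS65003) and models only the AS Path length step of best-path selection, with everything else equal. The helper names are made up for this example.

```python
ORIGIN = 65000

def advertise(prepends):
    # AS Path as a neighbor receives it: the origin AS plus any prepends.
    return (ORIGIN,) * (1 + prepends)

def propagate(asn, path):
    # An AS adds its own number when passing the route along.
    return (asn,) + path

def shortest(paths):
    # All other things being equal, prefer the shortest AS Path.
    return min(paths, key=len)

at_65001 = advertise(prepends=1)  # one prepend towards AS65001
at_65002 = advertise(prepends=0)  # no prepend towards AS65002

candidates_at_65003 = [
    propagate(65001, at_65001),
    propagate(65002, at_65002),
]
best_at_65003 = shortest(candidates_at_65003)

# AS65003 advertises only its best path onwards to AS65004.
at_65004 = propagate(65003, best_at_65003)

print(len(at_65001), len(at_65002))           # 2 1
print([len(p) for p in candidates_at_65003])  # [3, 2]
print(best_at_65003)                          # (65002, 65000)
print(at_65004)                               # (65003, 65002, 65000)
```

AS65001 and AS65002 each hold a single route, so the prepend changes nothing for them; the choice only shows up at AS65003.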
The obvious question: how much good does this really do? The only impact on the best path is two hops away, at AS65003 and beyond. The route chosen by AS65001 and AS65002 will not be affected by the prepending.
A recent paper found—
You might expect AS Path prepending to have a much more consistent effect on inbound traffic. Why doesn’t it?
What might not be obvious (the danger of simplified diagrams): if autonomous systems directly attached to AS65001 originate most of the traffic destined to 100::/64, no amount of prepending is going to make any difference in the inbound traffic flow. Assume AS65001 has a connection to some cloud service, AS65002 does not have a connection to the same cloud service, and 100::/64 is a local server that communicates with this cloud service on a regular basis. Since AS65001 is the only AS transiting traffic from the cloud service to the server located on the 100::/64 subnet, and AS65001 only has one route to 100::/64, you are not going to be able to shift traffic off that single path no matter how many times you prepend.
The first rule of prepending is location matters. You have to know where the traffic you want to shift is originating, and whether or not it can be shifted.
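One rough way to put a number on "whether or not it can be shifted" is to separate traffic sources that can reach the prefix through only one upstream from those that have a choice. The Python sketch below does this under loud assumptions: the traffic volumes are invented purely for illustration, and the source names and the `shiftable_share` helper are hypothetical.

```python
# Hypothetical inbound traffic towards 100::/64. The volumes are made-up
# units, and "via" lists the upstreams each source could use to get there.
sources = {
    "cloud service behind AS65001": {"volume": 80, "via": {"AS65001"}},
    "AS65003 and beyond": {"volume": 20, "via": {"AS65001", "AS65002"}},
}

def shiftable_share(sources):
    # Only sources with more than one available path can be moved at all.
    total = sum(s["volume"] for s in sources.values())
    movable = sum(s["volume"] for s in sources.values() if len(s["via"]) > 1)
    return movable / total

print(f"{shiftable_share(sources):.0%} of inbound traffic could move")
```

With these invented numbers, at most a fifth of the inbound traffic is even a candidate for shifting, no matter how many times the route is prepended.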
In my next post on this topic, I’ll continue exploring AS path prepending more in light of the results of the research paper above.
Recent research into the text of RFCs versus the security of the protocols described came to this conclusion—
This should come as no surprise to network engineers—after all, complexity is the enemy of security. Beyond the novel ways the authors use to understand the shape of the world of RFCs (you should really read the paper; it’s really interesting), this desire to increase security by decreasing the ambiguity of specifications is fascinating. We often think that writing better specifications requires having better requirements, but down this path only lies despair.
Better requirements are the one thing a network engineer can never really hope for.
It’s not just that networks are often used as a sort of “complexity sink,” the place where every hard problem goes to be solved. It’s also the uncertainty of the environment in which the network must operate. What new application will be stuffed on top of the network this week? Will anyone tell the network folks about this new application, or just open a ticket when it doesn’t work right? What about all the changes developers are making to applications right now, and their impact on the network? There are link failures, software failures, hardware failures, and the mean time between mistakes. There is the pace of innovation (which I tend to think is a bit overblown—rule11, after all—we are often talking about new products rather than new ideas).
What the network is supposed to do—just provide IP transport between two devices—turns out to be hard. It’s hard because “just transporting packets” isn’t ever enough. These packets must be delivered consistently (jitter and drops) across an ever-changing landscape.
To this end—
[C]omplexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.
Uncertainty is the key word here. What can we do about all of this?
We can reduce uncertainty. There are three ways to reduce uncertainty. First, you can obfuscate it; this is harmful. Second, you can reduce the scope of the job at hand, throwing some of the uncertainty (and therefore complexity) over the cubicle wall. This can be useful in some situations, but remember that the less work you’re doing, the less value you add. Beware of self-commodifying.
Finally, you can manage the uncertainty. This generally means using modularization intelligently to partition off problems into smaller sets. It’s easier to solve a set of well-scoped problems with little uncertainty than to solve one big problem with unknowable uncertainty.
This might all sound great in theory, but how do we do this in real life? Where does the rubber hit the road? This is what Ethan and I tried to show in Problems and Solutions—how to understand the problems that need to be solved, and then how to solve each of those problems within a larger system. This is also what many parts of The Art of Network Architecture are about, and then again what Jeff and I wrote about in Navigating Network Complexity.
I know it often seems like it’s not worth learning the theory; it’s so much easier to focus on the day-to-day, the configuration of this device, or the shiny thing that vendor just created. It’s easier to assume that if I can just hide all the complexity behind intent or automation, I can get my weekends back.
The truth is that we’re paid to solve hard problems, and solving hard problems involves complexity. We can either try to cover that up, or we can learn to manage it.
One of the big movements in the networking world is disaggregation—splitting the control plane and other applications that make the network “go” from the hardware and the network operating system. This is, in fact, one of the movements I’ve been arguing in favor of for many years—and I’m not about to change my perspective on the topic. There are many different arguments in favor of breaking the software from the hardware. The arguments for splitting hardware from software and componentizing software are so strong that much of the 5G transition also involves the open RAN, which is a disaggregated stack for edge radio networks.
If you’ve been following my work for any amount of time, you know what comes next: If you haven’t found the tradeoffs, you haven’t looked hard enough.
This article on hardening Linux (you should go read it, I’ll wait ’til you get back) exposes some of the complexities and tradeoffs involved in disaggregation in the area of security. Some further thoughts on hardening Linux here, as well. Two points.
First, disaggregation has serious advantages, but disaggregation is also hard work. With a commercial implementation you wouldn’t necessarily think about these kinds of supply chain issues. This is an example of the state/optimization/surfaces tradeoff. You can optimize your network more fully using disaggregation techniques, but there are going to be more interaction surfaces, and there’s going to be more state to deal with (for instance, the security state on individual devices).
There are several items on this list that also illustrate the state/optimization/surfaces tradeoff. For instance, eBPF is on the list of things to disable … but eBPF is probably going to be crucial to many future network-facing implementations. Anything that’s useful is going to inherently create attack surfaces you need to deal with. Get over it.
Second, just because you don’t think about these issues with a commercial implementation does not mean you don’t need to think about these things—it just means these kinds of things are opaque to you. Rather than trying to do the “right thing” yourself, you are outsourcing this work to a vendor. This is often a rational decision, and even might often be the right decision, but it’s a decision. We often “bury” these kinds of decisions in our thinking, not realizing we are making tradeoffs.