The Resilience Problem
This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience, but at what cost in terms of efficiency? It’s also obvious highly meshed data center fabrics have plenty of resilience, and yet they still sometimes fail. Why?
These cases can be described as efficiency extremes. The single link between two distant points is extremely efficient at minimizing cost and complexity; there is only one link to pay for, only one pair of devices to configure, etc. The highly meshed data center fabric, on the other hand, is extremely efficient at rapidly carrying large amounts of data between vast numbers of interconnected devices (east/west traffic flows). Have these optimizations towards one goal resulted in tradeoffs in resilience?
Consider the case of the single long-haul link between two routers. In terms of the state/optimization/surfaces (SOS) triad, this single pair of routers and single link minimize the amount of control plane state and the breadth of surfaces (there is only one point at which the control plane and the physical network intersect, for instance). The tradeoff, however, is a single link failure causes all traffic through the network to stop flowing; the network completely fails to do the work it’s designed to do. To create resiliency, or rather add a second dimension of optimization to the network, a second link and a second pair of routers need to be added. Adding these, however, will increase the amount of state and the number of interaction surfaces in the network. Another way to put this is the overall system becomes more complex to solve a harder set of problems: inexpensive traffic flow versus minimal cost traffic flow with resilience.
The second case is a little harder to understand—we assume all those parallel links necessarily make the network more resilient. If this is the case, then why do data center fabrics ever fail? In fact, DC fabrics are plagued by one of the hardest kinds of failure to understand and repair—grey failures. Going back to the SOS triad, the massive number of parallel links and devices in a DC fabric, designed to optimize the network for carrying massive amounts of traffic, also add lots of state and interaction surfaces to the network. Increasing the amount of state and interaction surfaces should, in theory, reduce some other form of optimization—in this case resilience through overwhelmed control planes and grey failures.
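A rough way to see how the two extremes differ is to count what has to be tracked. The sketch below uses link count as a stand-in for state and interaction surfaces, which is an assumption made for illustration, not a measure taken from any particular control plane. The fabric sizes are hypothetical.

```python
def leaf_spine_links(leaves: int, spines: int) -> int:
    """Links in a two-tier fabric where every leaf connects to every spine."""
    return leaves * spines


def leaf_to_leaf_paths(spines: int) -> int:
    """Equal-cost two-hop paths between two leaves: one through each spine."""
    return spines


# Hypothetical designs: name -> (links, parallel paths between two edge devices)
designs = {
    "single long-haul link": (1, 1),
    "32-leaf, 4-spine fabric": (leaf_spine_links(32, 4), leaf_to_leaf_paths(4)),
    "64-leaf, 8-spine fabric": (leaf_spine_links(64, 8), leaf_to_leaf_paths(8)),
}

for name, (links, paths) in designs.items():
    print(f"{name}: {links} links to track, {paths} parallel path(s)")
```

In this simple model the number of parallel paths grows with the spine count alone, while the number of links grows with leaves times spines, so the things that can partially fail pile up faster than the redundancy they buy.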
In the case of a DC fabric, simplification can increase resilience. Since you cannot reduce the number of links and devices, you must think through how and where to abstract information to reduce state. Reducing state, in turn, is bound to reduce the efficiency of traffic flows through the network, so you immediately run into a domino effect of optimization tradeoffs. Highly tuned optimization for traffic carrying causes a lack of optimization in resilience; optimizing for resilience reduces the optimization of traffic flow through the network. These kinds of chain reactions are common in the network engineering world. How can you optimize against grey failures? Perhaps simplifying design by using a single kind of optic, rather than having multiple kinds, or finding other ways to cope with the complexity in physical design.
Returning to the original quote: we often build a lot of resilience into network designs, so we do not face the same sorts of problems software designers and implementors do. Quite often the hyper-focus on resilience in network design is a result of a lack of resilience in software design; software designers have thrown the complexity of resilient design over the cubicle wall into the network operator’s lap. This clearly does not seem to be the most efficient way to handle things, as networks are vastly more complex because of the absolute resilience they are expected to provide; looking at the software and network as a system might produce a more resilient, and yet simpler, system.
The key, in the meantime, is for network engineers to learn how to ply the tradeoffs, understanding precisely what their goals are—or what they are optimizing for—and how those optimizations trade off against one another.
Reflections on Intent
No, not that kind. 🙂
BGP security is a vexed topic—people have been working in this area for over twenty years with some effect, but we continuously find new problems to address. Today I am looking at a paper called BGP Communities: Can of Worms, which analyses some of the security problems caused by current BGP community usage in the ‘net. The point I want to think about here, though, is not the problem discussed in the paper, but rather some of the larger problems facing security in routing.

Assume there is some traffic flow passing between 101::47/64 and 100::46/64 in this network. AS65003 has helpfully set up community string-based policies that allow a peer to advertise a route with a specified AS Path prepend. In this case, when AS65003 receives a route with 3:65004x, it prepends the route advertised towards AS65004 with x additional AS Path entries; when it receives a route with 3:65005x, it prepends the route advertised towards AS65005 with x additional AS Path entries.
Assuming community strings set by AS65002 are carried with the 100::46/64 route through the rest of the network, AS65002 can:
- Advertise 100::46/64 towards AS65003 with 3:650045, causing the route received at AS65006 from AS65004 to have a longer AS Path than the route received through AS65005, causing the traffic to flow through AS65005
- Advertise 100::46/64 towards AS65003 with 3:650055, causing the route received at AS65006 from AS65005 to have a longer AS Path than the route received through AS65004, causing the traffic to flow through AS65004
A lot of abuse is possible because of this situation. For instance, AS65002 might know the link between AS65006 and AS65004 is very expensive to use, so directing large amounts of traffic across that link will cause financial harm to AS65004 or AS65006. A malicious actor at AS65002 could also determine it can overwhelm this link, causing a sort of denial of service against anyone connected to AS65004 or AS65006.
The potential problem, then, is real.
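The effect of the two community strings can be sketched in a few lines of Python. This is a toy model, not a BGP implementation. It assumes the topology implied by the text (AS65002 to AS65003, then through either AS65004 or AS65005 to AS65006), compares routes on AS Path length only, and breaks ties on the lower AS number purely for illustration. The function names are hypothetical.

```python
from __future__ import annotations


def parse_prepend_community(community: str) -> tuple[int, int]:
    """Split a string such as '3:650045' into (target AS 65004, prepend count 5)."""
    prefix, value = community.split(":")
    if prefix != "3" or len(value) < 6:
        raise ValueError(f"not a prepend community: {community}")
    return int(value[:5]), int(value[5:])


def as_path_at_65006(via: int, communities: list[str]) -> list[int]:
    """AS Path AS65006 would see for the route learned through `via`."""
    prepends = 0
    for community in communities:
        target, count = parse_prepend_community(community)
        if target == via:
            prepends += count
    # AS65003 adds itself once, plus any requested prepends, towards `via`.
    return [via] + [65003] * (1 + prepends) + [65002]


def preferred_neighbor(communities: list[str]) -> int:
    """Neighbor AS65006 prefers, judged on AS Path length alone."""
    paths = {via: as_path_at_65006(via, communities) for via in (65004, 65005)}
    return min(paths, key=lambda via: (len(paths[via]), via))


for marking in (["3:650045"], ["3:650055"]):
    print(marking, "-> traffic flows through AS", preferred_neighbor(marking))
```

The point of the sketch is that nothing in `as_path_at_65006` knows who attached the community or why; AS65003 simply acts on whatever string arrives with the route.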
The problem is, however, how do we solve this? The most obvious way is to block communities from being transmitted beyond one hop past the point in the network where they are set. There are, however, two problems with this solution. First, how can anyone tell which AS set a community on a route? There is no originator code in the community string, and there’s no particular way to protect this kind of information from being forged or modified short of carrying a cryptographic hash in the update—which is probably not going to be acceptable from a performance perspective.
But the technical problem here is just the “tip of the iceberg.” Even if we could determine who modified the route to include the community, there is no particular way for anyone receiving the community to determine the originator’s intent. AS65002 may well install some system which measures, in near-real time, the delay across multiple paths to determine which performs the best. Such a system could be programmed with the correct community strings to impact traffic, and then left to run some sort of machine learning process to figure out how to mark routes to improve performance. If the operator at AS65002 does not realize the cost of the AS65004->AS65006 link is prohibitive, any sort of financial burden imposed by this system could be an unintended, rather than intended, consequence.
This, it turns out, is often the problem with security. It might be that a person is bypassing building security to save a life, or it could be they are doing so to steal corporate secrets. There is simply no way to know without meeting the person in question, listening to their reasoning, and allowing a human to decide which course of action is appropriate.
In the case of BGP, we’re dealing with “spooky action at a distance;” the source of the problem is several steps removed from the result of the problem, there’s no clear way to connect the two, and there’s no clear way to resolve the problem other than “picking up the phone” even if one of these operators can figure out what is going on.
The problem of intent is what RFC3514’s evil bit is poking a bit of fun at—if we only knew the attacker’s intent, we could often figure out what to actually do. Not knowing intent, however, puts a major crimp in many of the best-laid security plans.
Learning from Failure at Scale

One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.
To produce the study, the authors took data from Facebook’s ticket logging system over seven years, from 2011 through 2018. They used language-based systems to classify each event based on severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted and tried to understand the results. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).
Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should be lower than in a manually operated network. That 13% of failures are still accounted for by misconfiguration shows even the best automation program cannot eliminate failures from misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures is not given, this seems to provide both an idea of how much improvement automation can create, as well as a sort of “cap” on how much improvement operators can expect by automating.
If misconfiguration causes 13% of all failures, and software defects cause 12%, then 25% of all failures are caused by human error. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator created code and operator configuration, the percentage of failure seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive it down more than a couple of percentage points.
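The arithmetic behind the 25% figure is simple enough to check directly. The sketch below uses only the four percentages quoted from table 2; treating misconfiguration plus software defects as "human error" follows the grouping used here, and the remainder is whatever root causes are not listed in this post.

```python
# Root-cause shares (percent of incidents) as quoted above from table 2.
root_cause_share = {
    "maintenance": 17,
    "misconfiguration": 13,
    "hardware": 13,
    "software defects": 12,
}

# Grouping used in this post: configuration mistakes plus bugs.
human_error = root_cause_share["misconfiguration"] + root_cause_share["software defects"]

listed = sum(root_cause_share.values())
not_listed_here = 100 - listed

print(f"human error: {human_error}%")
print(f"four listed causes: {listed}%, remaining causes: {not_listed_here}%")
```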
Another interesting finding here is larger networks increase the time humans take to resolve incidents. As the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find a problem and resolve it. This is, on the other hand, likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.
Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.
Understanding Internet Peering
The world of provider interconnection is a little … “mysterious” … even to those who work at transit providers. The decision of who to peer with, whether such peering should be paid, settlement-free, open, and where to peer is often cordoned off into a separate team (or set of teams) that don’t seem to leak a lot of information. A recent paper on current interconnection practices published in ACM SIGCOMM sheds some useful light into this corner of the Internet, and hence is useful for those just trying to understand how the Internet really works.
To write the paper, the authors sent requests to fill out a survey through a wide variety of places, including NOG mailing lists and blogs. They ended up receiving responses from all seven regions (based on the RIRs, who control and maintain Internet numbering resources like AS numbers and IP addresses), 70% from ISPs, 14% from content providers, and 7% from “Enterprise” and infrastructure operators. Each of these kinds of operators will have different interconnection needs. I would expect ISPs to engage in more settlement-free peering (with roughly equal traffic levels), content providers to engage in more open peering (settlement-free connections with unequal traffic levels), IXs to do mostly local peering (not between regions), and “enterprises” to engage mostly in paid peering. The survey also classified respondents by their regional footprint (how many regions they operate in) and size (how many customers they support).
The survey focused on three facets of interconnection: time required to form a connection, the reasons given for interconnecting, and parameters included in the peering agreement. These largely describe the status quo in peering, that is, interconnections as they are practiced today. As might be expected, connections at IXs are the quickest to form. Since IXs are normally set up to enable peering, it makes sense that the preset processes and communications channels enabled by an IX would make the peering process a lot faster. According to the survey results, the most common timeframe to complete peering is days, with about a quarter taking weeks.
Apparently, the vast majority (99%!) of peering arrangements are by “handshake,” which means there is no legal contract behind them. This is one reason Network Operator Groups (NOGs) are so important (a topic of discussion in the Hedge 31, dropping next week); the peering workshops are vital in building and keeping the relationships behind most peering arrangements.
On-demand connectivity is a new trend in inter-AS peering. For instance, Interxion recently worked with LINX and several other IXs to develop a standard set of APIs allowing operators to peer with one another in a standard way, often reducing the technical side of the peering process to minutes rather than hours (or even days). Companies are moving into this space, helping operators understand who they should peer with, and building pre-negotiated peering contracts with many operators. While current operators seem to be aware of these options, they do not seem to be using these kinds of services yet.
While this paper is interesting, it does leave many corners of the inter-AS peering world un-exposed. For instance, I would like to know how correct my assumptions are about the kinds of peering used by each of the different classes of providers, and whether there are regional differences in the kinds of peering. While it’s interesting to survey the reasons providers pursue peering, it would be interesting to understand the process of making a peering determination more fully. What kinds of tools are available, and how are they used? These would be useful bits of information for an operator who only connects to the Internet, rather than being part of the Internet infrastructure (perhaps a “non-infrastructure operator,” rather than “enterprise”) in understanding how their choice of upstream provider can impact the performance of their applications and network.
Note: this is another useful, but slightly older, paper on the topic of peering.
Working from Home: Myth and Reality
The last few weeks have seen a massive shift towards working from home because of the various “stay at home” orders being put in place around the world—a trend I consider healthy in the larger scheme of things. Of course, there has also been an avalanche of “tips for working from home” articles. I figured I’d add my own to the pile.
A bit of background: I first started working from home regularly around twenty years ago, while on the global Escalation Team at Cisco. I started by working from home one day a week, to focus on projects, slowly increasing over time, ultimately working from the office one day a week when I transitioned to the Deployment and Architecture Team. Since that team was scattered all over the world, with a few folks in Raleigh, two in Reading (England), one in Brussels, one in Scotts Valley, and one or two in the San Jose (California) area, we had to learn to work together remotely if we had any hope of being effective. Further, we (as a family) have home-schooled “all the way through”; my oldest daughter is nearing the end of high school, currently overlapping with some college work. I have always had the entire family at home most of the time.
So, forthwith, some thoughts and experiences, including some you might not have ever heard, and some myth busting along the way.
Probably the most common piece of advice I hear is you should set a schedule and stick to it. On the other hand, the most common thing people like about working from home is the flexibility of the schedule. Somehow no-one ever seems to put these two together and say “hey, one of these things just doesn’t belong here!” (remember that old song, courtesy of public television). When you are working from home, setting a fixed schedule similar to being in an office is precisely the wrong thing to do. This is a myth.
Instead, what you should do is to try to intentionally carve out several hours a day of “quiet time” to get things done. This might be different times in the day for your house, and it’s not likely to be consistent on a daily, weekly, or monthly basis. The most productive times of the day are going to shift; get used to it. Most of my family are late sleepers or tend to get up in the morning and do “quiet things,” so the mornings tend to be much more productive for me than the afternoons. This isn’t always true; sometimes the evenings are quieter. There are few days, however, when there isn’t some quiet time during each day.
The key to scheduling is to plan project work for these quiet times, whenever they happen to be. Given the schedule of my house, I get up in the morning and immediately dive into project work. I leave email and reading RSS feeds for the afternoon, because I know I won’t be able to concentrate as well during that time. On the other hand, I try not to get frustrated if my routine is broken up; I just put off the project work until the time when it is quiet. Roll with the punches, rather than trying to punch against the rolls, when it comes to time. The only fixed point should be to think about how many hours of quiet time you need per week and figure out how to get it.
Don’t worry about having 8 hours solid, or whatever. Take breaks to eat lunch with family or friends. Take breaks to have a cup of hot liquid. Run out to the store in the afternoon. Exercise in the middle of the day. Take advantage of the flexible schedule to break work up into multiple pieces, rather than being glued to an office desk for 8 hours, regardless of your productivity level through that time. This is not the office—don’t act like it is.
The second most common piece of advice I hear is find a space and stick to it. This is not a myth; it is absolutely true. I have a table in the basement in one place, and a dedicated room as an office in another. When we go on vacation as a family, I either make certain to have a large enough room to have space to work, or I scope out someplace in the hotel where I can “set up shop.” Space is just that important.
Sometimes you hear get the right equipment. This, also, I can attest to. Tools, however, are a bit of an obsession for me; I will spend hours perfecting a tool, even if it does not, ultimately, save me as much time as I originally thought. I am a stickler for the right monitor, chair, desk space, footrests, and keyboards. I play with lighting endlessly; I have finally settled on a setup with a dim monitor, dim room lighting, bias lighting behind the monitor, and a backlit keyboard. I’ve concluded that reducing the sheer amount of light being poured into my eyes, across the board, helps increase the amount of time I can keep working. Some people prefer two monitor setups; I prefer a single large monitor (31in right now). I don’t use a mouse, but a Wacom tablet. I’ve recently added an air purifier to my office space, and I think this makes a difference. I don’t have a trash can close by; it’s useful to force myself to get up every now and again. Your physical environment is a matter of personal preference, but pay attention to it. This will probably have as much impact on your productivity as anything else you do.
Overall, don’t let other people tell you what tools are the best. I’ve been told for more years than I can remember that I should stop using Microsoft Word for long-form writing. You should switch to markdown, you should switch to Scrivener, you should use FocusWriter… Well, whatever. I’ve written … around 13 books, and I have two more in progress. I’ve written thousands of blog posts and articles. I’ve written thousands of pages of research papers, and I’m currently working on my dissertation. All in Microsoft Word. I’ve tried other tools, but ultimately I’ve found Word, with my own customized interface settings, to be just as fast or faster than anything else I’ve worked with.
One piece of advice you almost never hear is about changing how the team works. If one person is remote, act like everyone is. There is nothing more disruptive, or less useful, than having one person on the phone while everyone else is sitting in a conference room. If one person has to dial in, everyone should dial in. Setting up times to “just hang out” as a “meeting” is also sometimes helpful. If you have a fully remote team, dedicate “in real life” time to doing things you cannot do in meetings, and dedicate meeting time to things you cannot do through email, a chat app, or some other means.
Finally, some people are neatniks while others are sloppy. I tend towards the extreme neat end of things, arranging even my virtual desktop “just so” (I actually do not have any icons on my desktop at all, and I’m very picky about what I put in my start menu and/or command bar).
The key thing is, I think, to pay attention to what helps and what doesn’t, and not be afraid to experiment to find a better way.
An Interesting take on Mapping an Attack Surface
Security often lives in one of two states. It’s either something “I” take care of, because my organization is so small there isn’t anyone else taking care of it. Or it’s something those folks sitting over there in the corner take care of because the organization is, in fact, large enough to have a separate security team. In both cases, however, security is something that is done to networks, or something thought about kind-of off on its own in relation to networks.
I’ve been trying to think of ways to challenge this way of thinking for many years—a long time ago, in a universe far away, I created and gave a presentation on network security at Cisco Live (raise your hand if you’re old enough to have seen this presentation!).
Reading through my paper pile this week, I ran into a viewpoint in the Communications of the ACM that revived my older thinking about network security and gave me a new way to think about the problem. The author’s expression of the problem of supply chain security can be used more broadly. The illustration below is replicated from the one in the original article; I will use this as a starting point.

This is a nice way to visualize your attack surface. The columns represent applications or systems and the rows represent vulnerabilities. The colors represent the risk, as explained across the bottom of the chart. One simple way to use this would be just to list all the things in the network along the top as columns, and all the things that can go wrong as rows and use it in the same way. This would just be a cut down, or more specific, version of the same concept.
Another way to use this sort of map—and this is just a nub of an idea, so you’ll need to think about how to apply it to your situation a little more deeply—is to create two groups of columns; one column for each application that relies on network services, and one for network infrastructure devices and services you rely on. Rows would be broken up into three classes, from the top to bottom—protection, services, and systems. In the protection group you would have things the network does to protect data and applications, like segmentation, preventing data exfiltration, etc. In the services group, you would mostly have various forms of denial of service and configuration. In the systems group, you would have individual hardware devices, protocols, software packages used to make the network “go,” etc. Maybe something like the illustration below.

If you place the most important applications towards the left, and the protection towards the top, the more severe vulnerabilities will be in the upper left corner of the chart, with less severe areas falling to the right and (potentially) towards the bottom. You would fill this chart out starting in the upper left, figuring out what each kind of “protection” the network as a service can offer to each application. These should, in turn, roll down to the services the network offers and their corresponding configurations. These should, in turn, roll across to the devices and software used to create these services, and then roll back down to the vulnerabilities of those services and devices. For instance, if sales management relies on application access control, and application access control relies on proper filtering, and filtering is configured on BGP and some sort of overlay virtual link to a cloud service… You start to get the idea of where different kinds of services rely on underlying capabilities, and then how those are related to suppliers, hardware, etc.
You can color the squares in different ways—the way the original article does, perhaps, or your reliance on an outside vendor to solve this problem, etc. Once the basic chart is in place you can use multiple color schemes to get different views of the attack surface by using the chart as a sort of heat map.
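To make the chart a little more concrete, it can be sketched as a small data structure: columns across the top, rows grouped into protection, services, and systems, and a risk level in each filled square. Everything here is a hypothetical illustration; the class name, the three risk levels, and the "edge router software" row are made up for the example, while the other row and column names are borrowed from the description above.

```python
RISK_SYMBOL = {None: ".", "low": "L", "medium": "M", "high": "H"}


class AttackSurfaceMap:
    """Grid of (row, column) squares, each optionally holding a risk level."""

    def __init__(self, columns, rows_by_group):
        self.columns = list(columns)
        self.rows_by_group = {group: list(rows) for group, rows in rows_by_group.items()}
        self.cells = {}

    def mark(self, row, column, risk):
        """Color one square; reject unknown rows, columns, or risk levels."""
        if risk not in ("low", "medium", "high"):
            raise ValueError(f"unknown risk level: {risk}")
        if column not in self.columns:
            raise KeyError(column)
        if not any(row in rows for rows in self.rows_by_group.values()):
            raise KeyError(row)
        self.cells[(row, column)] = risk

    def squares_at(self, risk):
        """All squares colored with the given risk level, sorted for stable output."""
        return sorted(cell for cell, level in self.cells.items() if level == risk)

    def render(self):
        """Plain-text heat map: one line per row, one symbol per column."""
        lines = []
        for group, rows in self.rows_by_group.items():
            lines.append(f"[{group}]")
            for row in rows:
                marks = " ".join(RISK_SYMBOL[self.cells.get((row, col))] for col in self.columns)
                lines.append(f"  {row:<30} {marks}")
        return "\n".join(lines)


surface = AttackSurfaceMap(
    columns=["sales management", "BGP"],
    rows_by_group={
        "protection": ["application access control"],
        "services": ["filtering"],
        "systems": ["edge router software"],
    },
)
surface.mark("application access control", "sales management", "high")
surface.mark("filtering", "BGP", "medium")
print(surface.render())
```

Swapping the risk levels for some other coloring, such as reliance on an outside vendor, only requires a second set of marks over the same rows and columns, which is what makes the chart usable as a heat map with several views.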
Again, this is something of a nub of an idea, but it is a potentially interesting way to get a single view of the entire network ecosystem from a security standpoint, know where things are weak (and hence need work), and understand where cascading failures might happen.
Enterprise and Service Provider—Once more into the Windmill
There is no enterprise, there is no service provider; there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument: that hyperscalers’ problems are not your problems, and that the technologies and solutions providers use are fundamentally different than what enterprises require.
Let me try to recap some of the arguments I’ve heard used against my assertion.
The theory that enterprise and service provider networks require completely different technologies and implementations is often grounded in scale. Service provider networks are so large that they simply must use different solutions—solutions that you cannot apply to any network running at a smaller scale.
The problem with this line of thinking is it throws the baby out with the bathwater. Google is using automation to run their network? Well, then… you shouldn’t use automation because Google’s problems are not your problems. Microsoft is deploying 100g Ethernet over fiber? Then clearly enterprise networks should be using Token Ring or ARCnet because… Microsoft’s problems are not your problems.
The usual answer is—“I’m not saying we shouldn’t take good ideas when we see them, but we shouldn’t design networks the way someone else does just because.” I don’t see how this clarifies the solution, though—when is it a good idea or a bad one? What is our criterion to decide what to adopt and what not to adopt? Simply saying “X’s problems aren’t your problems” doesn’t really give me any actionable information—or at least I’m not getting it if it’s buried in there someplace.
Instead, maybe, just maybe, we are looking at this all wrong. Maybe there is some other way to classify networks that will help us see the problem set better.
I don’t think networks are undifferentiated; I think the enterprise/service provider/hyperscaler divide is not helpful to understand how different networks are … different, and how to correctly identify an environment and build to it. Reading a classic paper in software design this week, Programs, Life Cycles, and Laws of Software Evolution, brought all this to mind. In writing this paper, Meir Lehman was facing many of the same classification problems, just in software development rather than in building networks.
Rather than saying “enterprise software is different than service provider software” (an assertion absolutely no-one makes) or even “commercial software is different than private software, and developers working in these two areas cannot use the same tools and techniques,” Lehman posits there are three kinds of software systems. He calls these S-Programs, in which the problem and solution can be fully specified; P-Programs, in which the problem can be fully specified, but the program can only be partially specified because of complexity and scale; and E-Programs, where the program itself becomes part of the world it models. Lehman thinks most software will move towards S-Program status as time moves on, something that hasn’t happened (the reasons are out of scope for this already-too-long-blog-post).
But the classification is useful. For S-Programs, the inputs and outputs can be fully specified, full-on testing can take place before the software is deployed, and lifecycle management is largely about making the software more fully conform to its original conditions. Maybe there are S-Networks, too? Single-purpose networks which are aimed at fulfilling one well-defined thing, and only that thing. Lehman talks about learning how to break larger problems into smaller ones so the S-Problems can be dealt with separately. Is this anything different than separating out the basic problem of providing IP connectivity in a DC fabric underlay, or even providing basic IP connectivity in a transit or campus network, treating it as a separate module with fairly well-defined design goals and measurements?
Lehman talks about P-Programs, where the problem is largely definable, but the solutions end up being more heuristic. Isn’t this similar to a traffic engineering overlay, where we largely know what the goals are, but we don’t necessarily know what specific solution is going to be needed at any moment, and the complete set of solutions is just too large to initially calculate? What about E-Programs, where the software becomes a part of the world it models? Isn’t this like the intent-based stuff we’ve been talking about in networking for going on 30 years now?
Looking at it another way, isn’t it possible that some networks are largely just S-Networks? And others are largely E-Networks? And that these classifications have nothing to do with whether the network is being built by what we call an “enterprise” or a “service provider?” Isn’t it possible that S-Networks should probably all use the same basic sort of structure and largely be classified as a “commodity,” while E-Networks will all be snowflakes, and largely classified as having high business importance?
Just like I don’t think the OSI model is particularly helpful in teaching and understanding networks any longer, I don’t find the enterprise/service provider/hyperscaler model very useful in building and operating networks. The enterprise/service provider divide tends to artificially limit idea transfer where ideas would otherwise carry over, and artificially “hype up” some networks while degrading others, largely based on perceptions of scale.
Scale != complexity. It’s not about service providers and enterprises. It doesn’t matter if Google’s problems are not your problems; borrowing from the hyperscalers is not a “bad thing.” It’s just a “thing.” Think clearly about the problem set, understand the problem set, and borrow liberally. There is no such thing as a “service provider technology,” nor is there any such thing as an “enterprise technology.” There are problems, and there are solutions. To be an engineer is to connect the two.
