Learning from Failure at Scale

One of the difficulties for the average network operator trying to understand their failure rates and reasons is that they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.

To produce the study, the authors took data from Facebook’s ticket logging system covering 2011 through 2018. They used language-based systems to classify each event by severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted the results and tried to understand them. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).
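To make the classification step concrete, here is a minimal sketch, assuming a simple keyword-matching classifier; the ticket text and keywords are invented, and the paper’s actual language-based pipeline is more sophisticated than this.

```python
# Hypothetical sketch of classifying incident tickets by root cause and
# tallying the results; keywords and ticket text are invented, and the
# paper's actual language-based classification is more sophisticated.
from collections import Counter

ROOT_CAUSE_KEYWORDS = {
    "maintenance": ["maintenance window", "planned work", "upgrade"],
    "misconfiguration": ["config push", "typo", "wrong prefix", "acl"],
    "hardware": ["line card", "optic", "power supply", "fan"],
    "software": ["bug", "crash", "memory leak", "defect"],
}

def classify(ticket_text: str) -> str:
    """Return the first root-cause category whose keywords appear in the ticket."""
    text = ticket_text.lower()
    for cause, keywords in ROOT_CAUSE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return cause
    return "undetermined"

# Invented ticket summaries standing in for a real incident log.
tickets = [
    "BGP session dropped after ACL config push",
    "Line card failure on spine switch",
    "Forwarding agent crash caused by memory leak",
]

counts = Counter(classify(ticket) for ticket in tickets)
total = sum(counts.values())
for cause, count in counts.most_common():
    print(f"{cause}: {count}/{total} ({count / total:.0%})")
```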

Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should be lower than in a manually operated network. That 13% of failures are still accounted for by misconfiguration shows even the best automation program cannot eliminate failures caused by misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures is not given, this seems to provide both an idea of how much improvement automation can create and a sort of “cap” on how much improvement operators can expect from automating.

If misconfiguration causes 13% of all failures and software defects cause 12%, then about 25% of all failures are ultimately caused by human error: someone either configured the network incorrectly or wrote the defective code. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator-created code and operator configuration, the percentage of failures seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive the number down more than a couple of percentage points.

Another interesting finding here is that larger networks increase the time humans take to resolve incidents: as the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find and resolve a problem. On the other hand, this is likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.

Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.

Understanding Internet Peering

The world of provider interconnection is a little … “mysterious” … even to those who work at transit providers. The decision of who to peer with, whether such peering should be paid, settlement-free, or open, and where to peer is often cordoned off into a separate team (or set of teams) that doesn’t seem to leak a lot of information. A recent paper on current interconnection practices published in ACM SIGCOMM sheds some useful light into this corner of the Internet, and hence is useful for those just trying to understand how the Internet really works.

To write the paper, the authors sent requests to fill out a survey through a wide variety of places, including NOG mailing lists and blogs. They ended up receiving responses from all seven regions (based on the RIRs, which control and maintain Internet numbering resources like AS numbers and IP addresses); 70% of the responses came from ISPs, 14% from content providers, and 7% from “Enterprise” and infrastructure operators. Each of these kinds of operators will have different interconnection needs—I would expect ISPs to engage in more settlement-free peering (with roughly equal traffic levels), content providers to engage in more open peering (settlement-free connections with unequal traffic levels), IXs to do mostly local peering (not between regions), and “enterprises” to engage mostly in paid peering. The survey also classified respondents by their regional footprint (how many regions they operate in) and size (how many customers they support).

The survey focused on three facets of interconnection: the time required to form a connection, the reasons given for interconnecting, and the parameters included in the peering agreement. These largely describe the status quo in peering—interconnections as they are practiced today. As might be expected, connections at IXs are the quickest to form. Since IXs are normally set up to enable peering, it makes sense that the preset processes and communications channels enabled by an IX would make the peering process a lot faster. According to the survey results, the most common timeframe to complete peering is days, with about a quarter taking weeks.

Apparently, the vast majority (99%!) of peering arrangements are by “handshake,” which means there is no legal contract behind them. This is one reason Network Operator Groups (NOGs) are so important (a topic of discussion in the Hedge 31, dropping next week); the peering workshops are vital in building and keeping the relationships behind most peering arrangements.

On-demand connectivity is a new trend in inter-AS peering. For instance, Interxion recently worked with LINX and several other IXs to develop a standard set of APIs allowing operators to peer with one another in a standard way, often reducing the technical side of the peering process to minutes rather than hours (or even days). Companies are also moving into this space to help operators understand who they should peer with and to build pre-negotiated peering contracts with many operators. While current operators seem to be aware of these options, they do not seem to be using these kinds of services yet.
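As a purely hypothetical illustration of what reducing the technical side of peering to an API call might look like, here is a minimal sketch; the request fields, ASNs, and IX identifier are invented and do not reflect the actual Interxion/LINX APIs, which are not described here.

```python
# Purely hypothetical illustration of an on-demand peering request; the
# fields, ASNs, and IX identifier are invented, not the actual
# Interxion/LINX API.
import json
from dataclasses import dataclass, asdict

@dataclass
class PeeringRequest:
    requesting_asn: int   # the AS asking to set up the session
    target_asn: int       # the AS (or route server) being asked
    ix_location: str      # hypothetical IX metro/port identifier
    ip_version: int       # 4 or 6
    policy: str           # e.g., "settlement-free" or "paid"

request = PeeringRequest(
    requesting_asn=64500,   # example ASNs for illustration only
    target_asn=64511,
    ix_location="LON1",
    ip_version=4,
    policy="settlement-free",
)

# In an on-demand model, a body like this would be submitted to the IX's
# provisioning API, which would then build the route-server or bilateral
# sessions without the usual back-and-forth of emails and meetings.
print(json.dumps(asdict(request), indent=2))
```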

While this paper is interesting, it does leave many corners of the inter-AS peering world unexplored. For instance, I would like to know how correct my assumptions about the kinds of peering used by each of the different classes of providers are, and whether there are regional differences in the kinds of peering. While it’s interesting to survey the reasons providers pursue peering, it would be useful to understand the process of making a peering determination more fully. What kinds of tools are available, and how are they used? These would be useful bits of information for an operator who only connects to the Internet, rather than being part of the Internet infrastructure (perhaps a “non-infrastructure operator,” rather than an “enterprise”), in understanding how their choice of upstream provider can impact the performance of their applications and network.

Note: this is another useful, but slightly older, paper on the topic of peering.

The Hedge 29: Remote Work and Security

The massive numbers of people staying home to work because of the ongoing pandemic are placing a lot of strain on network infrastructure. One area many operators are not considering, however, is security—how does having a lot of remote workers impact DDoS? Is split tunneling really the right way to manage remote connectivity? Roland Dobbins joins Eyvonne Sharp and Russ White to discuss security in times of mass remote work on this episode of the Hedge.

Download: https://media.blubrry.com/hedge/content.blubrry.com/hedge/hedge-029.mp3

Working from Home: Myth and Reality

The last few weeks have seen a massive shift towards working from home because of the various “stay at home” orders being put in place around the world—a trend I consider healthy in the larger scheme of things. Of course, there has also been an avalanche of “tips for working from home” articles. I figured I’d add my own to the pile.

A bit of background—I first started working from home regularly around twenty years ago, while on the global Escalation Team at Cisco. I started by working from home one day a week to focus on projects, slowly increasing over time, until I was ultimately working from the office only one day a week when I transitioned to the Deployment and Architecture Team. Since that team was scattered all over the world—we had a few folks in Raleigh, two in Reading (England), one in Brussels, one in Scott’s Valley, and one or two in the San Jose (California) area—we had to learn to work together remotely if we had any hope of being effective. Further, we (as a family) have home-schooled “all the way through”—my oldest daughter is nearing the end of high school, currently overlapping with some college work. I have always had the entire family at home most of the time.

So, forthwith, some thoughts and experiences, including some you might not have ever heard, and some myth busting along the way.

Probably the most common piece of advice I hear is that you should set a schedule and stick to it. On the other hand, the most common thing people like about working from home is the flexibility of the schedule. Somehow no one ever seems to put these two together and say, “hey, one of these things just doesn’t belong here!” (remember that old song, courtesy of public television). When you are working from home, setting a fixed schedule similar to being in an office is precisely the wrong thing to do. This is a myth.

Instead, what you should do is intentionally carve out several hours a day of “quiet time” to get things done. These hours might fall at different times of the day for your house, and they’re not likely to be consistent on a daily, weekly, or monthly basis. The most productive times of the day are going to shift—get used to it. Most of my family are late sleepers or tend to get up in the morning and do “quiet things,” so the mornings tend to be much more productive for me than the afternoons. This isn’t always true; sometimes the evenings are quieter. There are few days, however, without some quiet time.

The key to scheduling is to plan project work for these quiet times, whenever they happen to be. Given the schedule of my house, I get up in the morning and immediately dive into project work. I leave email and reading RSS feeds for the afternoon, because I know I won’t be able to concentrate as well during those times. On the other hand, I try not to get frustrated if my routine is broken up—I just put off the project work until whenever things are quiet. Roll with the punches, rather than trying to punch against the rolls, when it comes to time. The only fixed point should be thinking about how many hours of quiet time you need per week and figuring out how to get them.

Don’t worry about having 8 hours solid, or whatever. Take breaks to eat lunch with family or friends. Take breaks to have a cup of hot liquid. Run out to the store in the afternoon. Exercise in the middle of the day. Take advantage of the flexible schedule to break work up into multiple pieces, rather than being glued to an office desk for 8 hours, regardless of your productivity level through that time. This is not the office—don’t act like it is.

The second most common piece of advice I hear is to find a space and stick to it. This is not a myth—it is absolutely true. I have a table in the basement in one place, and a dedicated room as an office in another. When we go on vacation as a family, I either make certain to have a room large enough to give me space to work, or I scope out someplace in the hotel where I can “set up shop.” Space is just that important.

Sometimes you hear get the right equipment. This, also, I can attest to. Tools, however, are a bit of an obsession for me; I will spend hours perfecting a tool, even if it does not, ultimately, save me as much time as I originally thought. I am a stickler for the right monitor, chair, desk space, footrests, and keyboards. I play with lighting endlessly; I have finally settled on a setup with a dim monitor, dim room lighting, bias lighting behind the monitor, and a backlit keyboard. I’ve concluded that reducing the sheer amount of light being poured into my eyes, across the board, helps increase the amount of time I can keep working. Some people prefer two-monitor setups—I prefer a single large monitor (31in right now). I don’t use a mouse, but a Wacom tablet. I’ve recently added an air purifier to my office space—I think this makes a difference. I don’t have a trash can close by—it’s useful to force myself to get up every now and again. Your physical environment is a matter of personal preference, but pay attention to it. This will probably have as much impact on your productivity as anything else you do.

Overall, don’t let other people tell you what tools are the best. I’ve been told for more years than I can remember that I should stop using Microsoft Word for long-form writing. You should switch to markdown, you should switch to Scrivener, you should use FocusWriter… Well, whatever. I’ve written … around 13 books, and I have two more in progress. I’ve written thousands of blog posts and articles. I’ve written thousands of pages of research papers, and I’m currently working on my dissertation. All in Microsoft Word. I’ve tried other tools, but ultimately I’ve found Word, with my own customized interface settings, to be just as fast as or faster than anything else I’ve worked with.

One thing you almost never hear is advice aimed at the team: if one person is remote, act like everyone is. There is nothing more disruptive, or less useful, than having one person on the phone while everyone else is sitting in a conference room. If one person has to dial in, everyone should dial in. Setting up times to “just hang out” as a “meeting” is also sometimes helpful. If you have a fully remote team, dedicate “in real life” time to doing things you cannot do in meetings, and dedicate meeting time to things you cannot do through email, a chat app, or some other means.

Finally, some people are neatniks while others are sloppy. I tend towards the extreme neat end of things, arranging even my virtual desktop “just so” (I actually do not have any icons on my desktop at all, and I’m very picky about what I put in my start menu and/or command bar).

The key thing is, I think, to pay attention to what helps and what doesn’t, and not be afraid to experiment to find a better way.

An Interesting Take on Mapping an Attack Surface

Security often lives in one of two states. It’s either something “I” take care of, because my organization is so small there isn’t anyone else taking care of it. Or it’s something those folks sitting over there in the corner take care of because the organization is, in fact, large enough to have a separate security team. In both cases, however, security is something that is done to networks, or something thought about kind of off on its own in relation to networks.

I’ve been trying to think of ways to challenge this way of thinking for many years—a long time ago, in a universe far away, I created and gave a presentation on network security at Cisco Live (raise your hand if you’re old enough to have seen this presentation!).

Reading through my paper pile this week, I ran into a viewpoint in the Communications of the ACM that revived my older thinking about network security and gave me a new way to think about the problem. The author’s expression of the problem of supply chain security can be used more broadly. The illustration below is replicated from the one in the original article; I will use this as a starting point.

This is a nice way to visualize your attack surface. The columns represent applications or systems, and the rows represent vulnerabilities. The colors represent the risk, as explained across the bottom of the chart. One simple way to use this would be to list all the things in the network as columns along the top, and all the things that can go wrong as rows, and use it in the same way. This would just be a cut-down, or more specific, version of the same concept.

Another way to use this sort of map—and this is just the nub of an idea, so you’ll need to think about how to apply it to your situation a little more deeply—is to create two groups of columns: one column for each application that relies on network services, and one for each network infrastructure device and service you rely on. Rows would be broken up into three classes, from top to bottom—protection, services, and systems. In the protection group, you would have things the network does to protect data and applications, like segmentation, preventing data exfiltration, etc. In the services group, you would mostly have various forms of denial of service and configuration. In the systems group, you would have individual hardware devices, protocols, software packages used to make the network “go,” etc. Maybe something like the illustration below.

If you place the most important applications towards the left, and the protection rows towards the top, the most severe vulnerabilities will be in the upper left corner of the chart, with less severe areas falling to the right and (potentially) towards the bottom. You would fill this chart out starting in the upper left, figuring out what kinds of “protection” the network, as a service, can offer to each application. These should, in turn, roll down to the services the network offers and their corresponding configurations. These should, in turn, roll across to the devices and software used to create these services, and then roll back down to the vulnerabilities of those services and devices. For instance, sales management might rely on application access control, application access control might rely on proper filtering, and filtering might be configured on BGP and some sort of overlay virtual link to a cloud service. You start to get an idea of where different kinds of services rely on underlying capabilities, and then how those are related to suppliers, hardware, etc.

You can color the squares in different ways—the way the original article does, perhaps, or by your reliance on an outside vendor to solve the problem, etc. Once the basic chart is in place, you can use multiple color schemes to get different views of the attack surface, treating the chart as a sort of heat map.
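As a very rough sketch of the idea, the chart can be modeled as a simple grid keyed by (row, column); the applications, network elements, row classes, and risk values below are hypothetical placeholders, and the “color” is just a text label you could swap out for vendor reliance or anything else you want to visualize.

```python
# Minimal sketch of the attack-surface chart as a data structure; columns,
# rows, and risk values are hypothetical placeholders.

# Columns: applications first (most important on the left), then the
# network infrastructure those applications rely on.
columns = ["sales-mgmt", "order-entry", "bgp-edge", "overlay-to-cloud"]

# Rows, grouped top to bottom: protection, then services, then systems.
rows = [
    ("protection", "segmentation"),
    ("protection", "data exfiltration"),
    ("services", "denial of service"),
    ("services", "filter configuration"),
    ("systems", "edge router software"),
    ("systems", "cloud overlay device"),
]

# Each cell holds an assessed risk level; anything not filled in defaults
# to "none". Swap the meaning of the values (vendor reliance, outsourcing,
# etc.) to get a different view of the same surface.
cells = {
    (("protection", "segmentation"), "sales-mgmt"): "high",
    (("services", "filter configuration"), "bgp-edge"): "medium",
    (("systems", "edge router software"), "bgp-edge"): "high",
}

def risk(row, column):
    """Look up the risk recorded for a (row, column) cell, defaulting to none."""
    return cells.get((row, column), "none")

# Print the grid as a crude text heat map, one row per line.
for row in rows:
    group, name = row
    values = "  ".join(f"{column}={risk(row, column)}" for column in columns)
    print(f"{group}/{name}: {values}")
```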

Again, this is something of a nub of an idea, but it is a potentially interesting way to get a single view of the entire network ecosystem from a security standpoint, know where things are weak (and hence need work), and understand where cascading failures might happen.