Reflections on Intent

No, not that kind. šŸ™‚

BGP security is a vexed topic—people have been working in this area for over twenty years with some effect, but we continuously find new problems to address. Today I am looking at a paper called BGP Communities: Can of Worms, which analyses some of the security problems caused by current BGP community usage in the ā€˜net. The point I want to think about here, though, is not the problem discussed in the paper, but rather some of the larger problems facing security in routing.

Assume there is some traffic flow passing between 101::47/64 and 100::46/64 in this network. AS65003 has helpfully set up community string-based policies that allow a peer to request a specified AS Path prepend on a route it advertises. In this case, if AS65003 receives a route carrying 3:65004x, it prepends x additional AS Path entries to the route advertised towards AS65004; if it receives 3:65005x, it prepends x additional AS Path entries to the route advertised towards AS65005.

Assuming community strings set by AS65002 are carried with the 100::46/64 route through the rest of the network, AS65002 can:

  • Advertise 100::46/64 towards AS65003 with 3:650045, causing the route received at AS65006 from AS65004 to have a longer AS Path than the route received through AS65005, so the traffic flows through AS65005
  • Advertise 100::46/64 towards AS65003 with 3:650055, causing the route received at AS65006 from AS65005 to have a longer AS Path than the route received through AS65004, so the traffic flows through AS65004
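
The two bullets above come down to nothing more than AS Path length comparison. Here is a minimal sketch (plain Python, not a real BGP implementation) of what AS65006 sees; the ASNs and community semantics come from the example above, while the dict-based routes and length-only selection are simplifying assumptions:

```python
# Minimal sketch (not real BGP): how a remotely triggered prepend flips
# best-path selection at AS65006. Route structure and length-only
# comparison are simplifying assumptions for illustration.

def best_path(routes):
    """One step of BGP path selection: prefer the shortest AS path."""
    return min(routes, key=lambda r: len(r["as_path"]))

# Baseline: AS65006 learns 100::46/64 over two equal-length paths.
via_65004 = {"next_hop": 65004, "as_path": [65004, 65003, 65002]}
via_65005 = {"next_hop": 65005, "as_path": [65005, 65003, 65002]}

# AS65002 attaches 3:650045, so AS65003 prepends itself five extra times
# on the advertisement towards AS65004 only.
via_65004_prepended = {
    "next_hop": 65004,
    "as_path": [65004] + [65003] * 6 + [65002],  # 5 extra copies of 65003
}

# The path through AS65004 is now longer, so traffic shifts to AS65005.
print(best_path([via_65004_prepended, via_65005])["next_hop"])
```

Swapping the community to 3:650055 lengthens the path through AS65005 instead, shifting the traffic the other way.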

A lot of abuse is possible in this situation. For instance, AS65002 might know the link between AS65006 and AS65004 is very expensive, so directing large amounts of traffic across that link will cause financial harm to AS65004 or AS65006. A malicious actor at AS65002 could also decide to overwhelm this link, causing a sort of denial of service against anyone connected to AS65004 or AS65006.

The potential problem, then, is real.

The problem is, however, how do we solve this? The most obvious way is to block communities from being transmitted beyond one hop past the point in the network where they are set. There are, however, two problems with this solution. First, how can anyone tell which AS set a community on a route? There is no originator field in the community itself, and there’s no way to protect this kind of information from being forged or modified short of carrying a cryptographic signature in the update—which is probably not going to be acceptable from a performance perspective.
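
To see the first problem concretely: a standard BGP community (RFC 1997) is just a bare 32-bit value, conventionally written as two 16-bit halves. By convention the high half names the network whose policy is being invoked—nothing in the encoding records which AS actually attached it. A minimal sketch:

```python
# A standard BGP community is a bare 32-bit value, written high16:low16.
# Nothing in the encoding identifies the AS that attached it.

def encode_community(high, low):
    """Pack two 16-bit halves into one 32-bit community value."""
    return (high << 16) | low

def decode_community(value):
    """Unpack a 32-bit community into its two 16-bit halves."""
    return value >> 16, value & 0xFFFF

# "3:65004" from the example above: the "3" invokes AS65003's policy,
# but it says nothing about who set the community on the route.
c = encode_community(3, 65004)
print(decode_community(c))
```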

But the technical problem here is just the ā€œtip of the iceberg.ā€ The second problem is intent: even if we could determine who modified the route to include the community, there is no way for anyone receiving the community to determine the originator’s intent. AS65002 may well install some system which measures, in near-real time, the delay across multiple paths to determine which performs best. Such a system could be programmed with the correct community strings to influence traffic, then left to run some sort of machine learning process to figure out how to mark routes to improve performance. If the operator at AS65002 does not realize the cost of the AS65004->AS65006 link is prohibitive, any financial burden imposed by this system would be an unintended, rather than intended, consequence.

This, it turns out, is often the problem with security. It might be that a person bypassing building security is doing so to save a life, or it could be they are doing so to steal corporate secrets. There is simply no way to know without meeting the person in question, listening to their reasoning, and allowing a human to decide which course of action is appropriate.

In the case of BGP, we’re dealing with ā€œspooky action at a distance;ā€ the source of the problem is several steps removed from the result of the problem, there’s no clear way to connect the two, and there’s no clear way to resolve the problem other than ā€œpicking up the phoneā€ even if one of these operators can figure out what is going on.

The problem of intent is what RFC3514’s evil bit is poking a bit of fun at—if we only knew the attacker’s intent, we could often figure out what to actually do. Not knowing intent, however, puts a major crimp in many of the best-laid security plans.

The Hedge 31: Network Operator Groups

Many engineers have heard about the wide variety of Network Operator Group (NOG) meetings, from smaller regional organizations through larger multinational ones. What is the value of attending a NOG? How can you convince your business leadership of this value? In this episode of the Hedge, Vincent Celindro and Edward McNair join Russ White to consider these questions.

download

Learning from Failure at Scale

One of the difficulties for the average network operator trying to understand their failure rates and reasons is that they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.

To produce the study, the authors took data from Facebook’s ticket logging system over 6 years, from 2011 through 2018. They used language-based systems to classify each event based on severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted and tried to understand the results. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).

Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should be lower than in a manually operated network. That misconfiguration still accounts for 13% of failures shows even the best automation program cannot eliminate failures from misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures is not given, this seems to provide both an idea of how much improvement automation can create and a sort of ā€œcapā€ on how much improvement operators can expect from automating.

If misconfiguration causes 13% of all failures, and software defects cause 12%, then 25% of all failures are caused by human error. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator-created code and operator configuration, the percentage of failures seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive it down more than a couple of percentage points.
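
The arithmetic here is simple enough to sketch; the shares are the top root causes from the paper’s table 2, and counting software defects together with misconfiguration as ā€œhuman errorā€ follows the reasoning above:

```python
# Root-cause shares of classified incidents, from the paper's table 2.
root_causes = {
    "maintenance": 17,
    "misconfiguration": 13,
    "hardware": 13,
    "software_defect": 12,
}

# Someone wrote the config, and someone wrote the bug: counting both
# misconfiguration and software defects as human error gives a quarter.
human_error = root_causes["misconfiguration"] + root_causes["software_defect"]
print(human_error)  # 25
```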

Another interesting finding here is larger networks increase the time humans take to resolve incidents. As the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find a problem and resolve it. This is, on the other hand, likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.

Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.

Understanding Internet Peering

The world of provider interconnection is a little … ā€œmysteriousā€ … even to those who work at transit providers. The decision of who to peer with, whether such peering should be paid, settlement-free, or open, and where to peer is often cordoned off into a separate team (or set of teams) that doesn’t seem to leak a lot of information. A recent paper on current interconnection practices published in ACM SIGCOMM sheds some useful light into this corner of the Internet, and hence is useful for those just trying to understand how the Internet really works.

To write the paper, the authors sent requests to fill out a survey through a wide variety of places, including NOG mailing lists and blogs. They ended up receiving responses from all seven regions (based on the RIRs, who control and maintain Internet numbering resources like AS numbers and IP addresses), 70% from ISPs, 14% from content providers, and 7% from ā€œEnterpriseā€ and infrastructure operators. Each of these kinds of operators will have different interconnection needs—I would expect ISPs to engage in more settlement-free peering (with roughly equal traffic levels), content providers to engage in more open (settlement-free connections with unequal traffic levels), IXs to do mostly local peering (not between regions), and ā€œenterprisesā€ to engage mostly in paid peering. The survey also classified respondents by their regional footprint (how many regions they operate in) and size (how many customers they support).

The survey focused on three facets of interconnection: the time required to form a connection, the reasons given for interconnecting, and the parameters included in the peering agreement. These largely describe the status quo in peering—interconnections as they are practiced today. As might be expected, connections at IXs are the quickest to form. Since IXs are normally set up to enable peering, it makes sense that the preset processes and communications channels an IX provides would make the peering process a lot faster. According to the survey results, the most common timeframe to complete peering is days, with about a quarter taking weeks.

Apparently, the vast majority (99%!) of peering arrangements are by ā€œhandshake,ā€ which means there is no legal contract behind them. This is one reason Network Operator Groups (NOGs) are so important (a topic of discussion in the Hedge 31, dropping next week); the peering workshops are vital in building and keeping the relationships behind most peering arrangements.

On-demand connectivity is a new trend in inter-AS peering. For instance, Interxion recently worked with LINX and several other IXs to develop a standard set of APIs allowing operators to peer with one another in a standard way, often reducing the technical side of the peering process to minutes rather than hours (or even days). Companies are moving into this space, helping operators understand who they should peer with and building pre-negotiated peering contracts with many operators. While current operators seem to be aware of these options, they do not seem to be using these kinds of services yet.

While this paper is interesting, it leaves many corners of the inter-AS peering world unexposed. For instance, I would like to know how correct my assumptions are about the kinds of peering used by each of the different classes of providers, and whether there are regional differences in the kinds of peering. While it’s interesting to survey the reasons providers pursue peering, it would also be useful to understand the process of making a peering determination more fully. What kinds of tools are available, and how are they used? These would be useful bits of information for an operator who only connects to the Internet, rather than being part of the Internet infrastructure (perhaps a ā€œnon-infrastructure operator,ā€ rather than an ā€œenterpriseā€), in understanding how their choice of upstream provider can impact the performance of their applications and network.

Note: this is another useful, but slightly older, paper on the topic of peering.

The Hedge 29: Remote Work and Security

The massive numbers of people staying home to work because of the ongoing pandemic are placing a lot of strain on network infrastructure. One area many operators are not considering, however, is security—how does having a lot of remote workers impact DDoS? Is split tunneling really the right way to manage remote connectivity? Roland Dobbins joins Eyvonne Sharp and Russ White to discuss security in times of mass remote work on this episode of the Hedge.

<a href="https://media.blubrry.com/hedge/content.blubrry.com/hedge/hedge-029.mp3">download</a>

Working from Home: Myth and Reality

The last few weeks have seen a massive shift towards working from home because of the various ā€œstay at homeā€ orders being put in place around the world—a trend I consider healthy in the larger scheme of things. Of course, there has also been an avalanche of ā€œtips for working from homeā€ articles. I figured I’d add my own to the pile.

A bit of background—I first started working from home regularly around twenty years ago, while on the global Escalation Team at Cisco. I started by working from home one day a week to focus on projects, slowly increasing over time, ultimately working from the office one day a week when I transitioned to the Deployment and Architecture Team. Since that team was scattered all over the world (a few folks in Raleigh, two in Reading in England, one in Brussels, one in Scotts Valley, and one or two in the San Jose, California, area), we had to learn to work together remotely if we had any hope of being effective. Further, we (as a family) have home-schooled ā€œall the way throughā€ā€”my oldest daughter is nearing the end of high school, currently overlapping with some college work—so I have always had the entire family at home most of the time.

So, forthwith, some thoughts and experiences, including some you might not have ever heard, and some myth busting along the way.

Probably the most common piece of advice I hear is you should set a schedule and stick to it. On the other hand, the most common thing people like about working from home is the flexibility of the schedule. Somehow no one ever seems to put these two together and say, ā€œhey, one of these things just doesn’t belong here!ā€ (remember that old song, courtesy of public television). When you are working from home, setting a fixed schedule similar to being in an office is precisely the wrong thing to do. This is a myth.

Instead, what you should do is intentionally carve out several hours a day of ā€œquiet timeā€ to get things done. These hours might fall at different times of the day for your house, and they’re not likely to be consistent on a daily, weekly, or monthly basis. The most productive times of the day are going to shift—get used to it. Most of my family are late sleepers or tend to get up in the morning and do ā€œquiet things,ā€ so the mornings tend to be much more productive for me than the afternoons. This isn’t always true; sometimes the evenings are quieter. There are few days, however, when there isn’t some quiet time.

The key to scheduling is to plan project work for these quiet times, whenever they happen to be. Given the schedule of my house, I get up in the morning and immediately dive into project work. I leave email and reading RSS feeds for the afternoon, because I know I won’t be able to concentrate as well during those times. On the other hand, I try not to get frustrated if my routine is broken up—just put off the project work until the time when it is quiet. Roll with the punches, rather than trying to punch against the rolls, when it comes to time. The only fixed point should be to think about how many hours of quiet time you need per week and figure out how to get it.

Don’t worry about having 8 hours solid, or whatever. Take breaks to eat lunch with family or friends. Take breaks to have a cup of hot liquid. Run out to the store in the afternoon. Exercise in the middle of the day. Take advantage of the flexible schedule to break work up into multiple pieces, rather than being glued to an office desk for 8 hours, regardless of your productivity level through that time. This is not the office—don’t act like it is.

The second most common piece of advice I hear is find a space and stick to it. This is not a myth—it is absolutely true. I have a table in the basement in one place, and a dedicated room as an office in another. When we go on vacation as a family, I either make certain to have a room large enough to work in, or I scope out someplace in the hotel where I can ā€œset up shop.ā€ Space is just that important.

Sometimes you hear get the right equipment. This, also, I can attest to. Tools, however, are a bit of an obsession for me; I will spend hours perfecting a tool, even if it does not, ultimately, save me as much time as I originally thought. I am a stickler for the right monitor, chair, desk space, footrests, and keyboards. I play with lighting endlessly; I have finally settled on a setup with a dim monitor, dim room lighting, bias lighting behind the monitor, and a backlit keyboard. I’ve concluded that reducing the sheer amount of light being poured into my eyes, across the board, helps increase the amount of time I can keep working. Some people prefer two-monitor setups—I prefer a single large monitor (31in right now). I don’t use a mouse, but a Wacom tablet. I’ve recently added an air purifier to my office space—I think this makes a difference. I don’t have a trash can close by—it’s useful to force myself to get up every now and again. Your physical environment is a matter of personal preference, but pay attention to it. This will probably have as much impact on your productivity as anything else you do.

Overall, don’t let other people tell you what tools are the best. I’ve been told for more years than I can remember that I should stop using Microsoft Word for long-form writing. You should switch to markdown, you should switch to Scrivener, you should use FocusWriter… Well, whatever. I’ve written … around 13 books, and I have two more in progress. I’ve written thousands of blog posts and articles. I’ve written thousands of pages of research papers, and I’m currently working on my dissertation. All in Microsoft Word. I’ve tried other tools, but ultimately I’ve found Word, with my own customized interface settings, to be just as fast or faster than anything else I’ve worked with.

One thing you almost never hear is advice for the whole team: if one person is remote, act like everyone is. There is nothing more disruptive, or less useful, than having one person on the phone while everyone else is sitting in a conference room. If one person has to dial in, everyone should dial in. Setting up times to ā€œjust hang outā€ as a ā€œmeetingā€ is also sometimes helpful. If you have a fully remote team, dedicate ā€œin real lifeā€ time to doing things you cannot do in meetings, and dedicate meeting time to things you cannot do through email, a chat app, or some other means.

Finally, some people are neatniks while others are sloppy. I tend towards the extreme neat end of things, arranging even my virtual desktop ā€œjust soā€ (I actually do not have any icons on my desktop at all, and I’m very picky about what I put in my start menu and/or command bar).

The key thing is, I think, to pay attention to what helps and what doesn’t, and not be afraid to experiment to find a better way.