At NANOG on the Road (NotR) in September of 2018, I participated in a panel on BGP security, specifically the deployment of Route Origin Authorizations (ROAs), with some hints and overtones of path validation by carrying signatures in BGP updates (BGPsec). This is an area I have been working in for… 20 years? … at this point, so I have seen the argument develop across these years many times, and in many ways. What always strikes me about this discussion, whenever and wherever it is aired, is the clash between business realities and the desire for “someone to do something about routing security in the DFZ, already!” What also strikes me about these conversations is the number of times very fundamental concepts end up being explained to folks who are “new to the problem.”
- BGP security is a business problem first, and a technology problem second
- Signed information is only useful insofar as it is maintained
- The cost of deployment must be lower than the return on that cost
- Local policy will always override global policy—as it should
- The fear of losing business is a stronger motivator than gaining new business
Part of the problem here is solutions considered “definitive and final” have been offered, the operator community has rejected them for many years, and yet these same solutions are put on the table year after year—like the perennial fruit cake made by someone’s great great aunt in the mists of Christmastime history that has been regifted so many times no-one really remembers where it came from, nor what sorts of fruit it actually contains.
The business reality, in terms of BGP security, is simple. To deploy some sort of check on the global routing table, a point must be reached where it costs more to not deploy it than to deploy it. This simple business reality is something network designers and architects beat their heads against every day. The solution can be the neatest solution in the world. It might even shop for the ingredients, mix the cookie dough in perfect proportions, bake the cookies, and then transport them to the proper location for the perfect amount of enjoyment from just the right people (insert Goldilocks here, perhaps). But none of this will ever matter if there is no financial upside, or if the financial risks are greater than the financial gains.
So, some hopefully helpful business realities.
Signed information is no more useful than unsigned information if it is not kept up to date. It is great to get everyone out in full force to build a cryptographically secured database of who owns what prefixes and AS numbers. It is wonderful to find a way to distribute that information throughout the ‘net so people can use it as another tool to determine whether or not to accept a route, or what weight to place on that route. People might even spend a bit of time building this database, just because they believe it is good for the community.
The problem is not day one. It is day two, and then day two thousand. What motivation is there to keep the information in this database up-to-date? Unless there is some—and here I mean financial motivation—the database will lose its effectiveness over time. At some point, when the error rate reaches some number (around 30% seems to be about right), people will simply stop trusting it. When the error rate gets high enough, the tools will stop being used, and the ‘net will revert to its old self.
The cost of deployment and operation must at least be close to the gains from deployment. At this point, there is no financial gain any one company can see from deploying anything in the realm of BGP security, so the cost of deployment and maintenance must be close to zero. While there are folks (including me) trying to reduce the cost as close to zero as possible, we are not there yet, and I do not know if we will ever be there.
Local policy will always override global policy. The literature of BGP security is replete with statements like: “if the route meets these criteria, the BGP speaker MUST drop it.” Good luck. The Internet is a confederation of independent companies, each of which runs its network in a way it believes will make the most money at the lowest cost possible. One of the ways this happens is that people tune their local policies to charge their adjacent autonomous systems as much money as possible, while reducing their OPEX and CAPEX as much as possible. There will always be money to be made in the grey space around local tuning of policies for optimal traffic flow. Hence local policy will always win over what any database anyplace might say.
To give a specific example: assume you run a network, and you have peered with another operator in multiple places for many years. You know the other operator’s routes well, as they have not changed for many years. One day, you receive a route from this operator in which everything looks correct, but the route is not contained in this outside database of “correct routes.”
Noting this route is a route you have received from this very same operator for many years, are you going to drop it because it’s not right in some database, or are you going to use it given your past standing and relationship?
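One way to picture this decision is to treat the outside database as just one input to local policy. The sketch below is purely illustrative: `Route`, `local_decision`, and the `trusted_history` set are hypothetical names I made up for the example, not any vendor's policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    origin_as: int
    peer_as: int


def local_decision(route, in_database, trusted_history):
    """Return "accept" or "drop" for a received route.

    in_database: True if the outside database lists this prefix/origin pair.
    trusted_history: hypothetical set of (peer_as, prefix, origin_as) tuples
    the local operator has accepted from this peer for years.
    """
    if in_database:
        return "accept"
    # Local knowledge of a long-standing peer overrides the outside database.
    if (route.peer_as, route.prefix, route.origin_as) in trusted_history:
        return "accept"
    return "drop"
```

In this sketch the database can only add confidence; it never removes a route the operator already has a business reason to trust.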
The fear of losing business is always the strongest motivator. This leads to the next issue. It does not matter how wonderful your network is if you have a high customer churn rate. The most certain way to have a high churn rate is to place your customers’ experience in the hands of someone else. Such as a communally managed database, perhaps. This is another reason why local policy will always win over remote policy: the local provider is handed checks by customers, not the community at large. They have more incentive to keep their customers happy than the community.
You cannot secure things you do not tell anyone about. This final one is probably not as obvious as the others, but it is just as important as any other item on this list. There are many backdoor arrangements and sealed contracts in the provider world. People transit traffic without telling anyone else that traffic is being transited. Some people are customers of others only in the event of a massive failure someplace else in their network, but do not want anyone to know about this.
All of these arrangements are perfectly legitimate and legal in their respective jurisdictions. But you cannot secure something that no-one knows about. The more information that is hidden in a system, the harder it is to validate that the information that does exist is correct.
The bottom line is this: BGP security, like most networking problems, is not a technology problem. BGP security is, at its heart, a business problem. The lesson here is not just for security, but for network engineering in general. Business is the bottom line, not technology.
The Resource Public Key Infrastructure (RPKI) system is designed to prevent hijacking of routes at their origin AS. If you don’t know how this system works (and it is likely you don’t, because there are only a few deployments in the world), you can review the way the system works by reading through this post here on rule11.tech.
The paper under review today examines how widely Route Origin Validation (ROV) based on the RPKI system has been deployed. The authors began by determining which Autonomous Systems (AS’) are definitely not deploying route origin validation. They did this by comparing the routes in the global RPKI database, which is synchronized among all the AS’ deploying the RPKI, to the routes in the global Default Free Zone (DFZ), as seen from 44 different route servers located throughout the world. In comparing these two, they found a set of routes which the RPKI system indicated should be originated from one AS, but were actually being originated from another AS in the default free zone.
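The comparison between the RPKI and the routing table can be sketched roughly as follows. This is a minimal illustration of the usual valid / invalid / not-found origin validation model, not the tooling the authors used; the `ROA` tuple and `validate_origin` function are names I am assuming for the example.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str       # e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the ROA authorizes
    origin_as: int


def validate_origin(route_prefix, origin_as, roas):
    """Classify a route as "valid", "invalid", or "not-found"."""
    net = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_as == origin_as and net.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA but matched none: the origin is a mismatch.
    return "invalid" if covered else "not-found"
```

The routes of interest in the paper are the ones this kind of check would call invalid: a ROA covers the prefix, but the origin AS seen in the DFZ does not match.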
Using this information, the researchers then looked for AS’ through which these routes with a mismatched RPKI and global table origin were advertised. If an AS accepted, and then readvertised, routes with mismatched RPKI and global table origins, they marked this AS as one that does not enforce route origin authentication.
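A rough sketch of this marking step, assuming AS paths are available as lists with the origin AS last (as a route collector would see them). The function name and the `is_invalid` callback are hypothetical; the paper's actual method is more careful than this.

```python
def non_enforcing_ases(observed_paths, is_invalid):
    """Return the set of ASes seen carrying routes with a mismatched origin.

    observed_paths: iterable of (prefix, as_path); as_path lists AS numbers
                    with the origin AS last.
    is_invalid:     callable (prefix, origin_as) -> bool, True when the RPKI
                    says a different AS should originate the prefix.
    """
    marked = set()
    for prefix, as_path in observed_paths:
        if not as_path:
            continue
        origin = as_path[-1]
        if is_invalid(prefix, origin):
            # Every AS ahead of the origin accepted and readvertised the route.
            marked.update(asn for asn in as_path[:-1] if asn != origin)
    return marked
```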
A second, similar check was used to find the mirror set of AS’, those that do perform a route origin validation check. In this case, the authors traced the same type of route—those for which the origin AS the route is advertised with does not match the originating AS in the RPKI–and discovered some AS’ will not readvertise such a route. These AS’ apparently do perform a check for the correct route origin information.
The result is that only one of the 20 Internet Service Providers (ISPs) with the largest number of customers performs route origin validation on the routes they receive. Out of the largest 100 ISPs (again based on customer AS count), 22 appear to perform a route origin validation check. These are very low numbers.
To double check these numbers, the researchers surveyed a group of ISPs, and found that very few of them claim to check the routes they receive against the RPKI database. Why is this? When asked, these providers gave two reasons.
First, these providers are concerned about the problems involved with their connectivity being impacted in the case of an RPKI system failure. For instance, it would be easy enough for a company to become involved in a contract dispute with their naming authority, or with some other organization (two organizations claiming the same AS number, for instance). These kinds of cases could result in many years of litigation, causing a company to effectively lose their connectivity to the global ‘net during the process. This might seem like a minor fear for some, and there might be possible mitigations, but the ‘net is much more statically defined than many people realize, and many operators operate on a razor thin margin. The disruptions caused by such an event could simply put a company out of business.
Second, there is a general perception that the RPKI database is not exactly a “clean” representation of the real world. Since the database is essentially self-reported, there is little incentive to make changes to the database once something in the real world has changed (such as the transfer of address space between organizations). It only takes a small amount of old, stale, or incorrect information to reduce the usefulness of this kind of public database. The authors address this concern by examining the contents of the RPKI, and find that it does, in fact, contain a good bit of incorrect information. They develop a tool to help administrators find this information, but ultimately people must use these kinds of tools.
The point of the paper is that the RPKI system, which is seen as crucial to the security of the global Internet, is not being widely used, and deployment does not appear to be increasing over time. One possible takeaway is the community needs to band together and deploy this technology. Another might be that the RPKI is not a viable solution to the problem at hand for various technical and social reasons—it might be time to start looking for another alternative for solving this problem.
I was recently invited to a webinar for the RIPE NCC about the future of BGP security. The entire series is well worth watching; I was in the final session, which was a panel discussion on where we are now, and where we might go to make BGP security better.
Yet another protocol episode over at the Network Collective. This time, Nick, Jordan, Eyvonne and I talk about BGP security.
From time to time, someone publishes a new blog post lauding the wonderfulness of BGPsec, such as this one over at the Internet Society. In return, I sometimes feel like I am a broken record discussing the problems with the basic idea of BGPsec: while it can solve some problems, it creates a lot of new ones. Overall, BGPsec, as defined by the IETF Secure Inter-Domain Routing (SIDR) working group, is a “bad idea,” a classic study in the power of unintended consequences, and the fond hope that more processing power can solve everything. To begin, a quick review of the operation of BGPsec might be in order. Essentially, each AS in the AS Path signs the “BGP update” as it passes through the internetwork, as shown below.
In this diagram, assume AS65000 is originating some route at A, and advertising it to AS65001 and AS65002 at B and C. At B, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65001. At C, the route is advertised with a cryptographic signature “covering” the first two hops in the AS Path, AS65000 and AS65002. When F advertises this route to H, at the AS65001 to AS65003 border, it again signs the AS Path, including the AS F is advertising the route to, so the signed path includes AS65000, AS65001, and AS65003.
To validate the route, H can use AS65000’s public key to verify the signature over the first two hops in the AS Path. This shows that AS65000 not only did advertise the route to AS65001, but also that it intended to advertise this route to AS65001. In this way, according to the folks working on BGPsec, the intention of AS65000 is laid bare, and the “path of the update” is cryptographically verified through the network.
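The chained signing and validation described above can be sketched as follows. This is a toy model only: it uses an HMAC with a per-AS secret as a stand-in for a real public/private key signature (the Python standard library has no asymmetric signing), and the `KEYS` table and function names are my own assumptions, not the BGPsec wire format.

```python
import hashlib
import hmac

# Stand-in keys: real BGPsec would use each AS's private key to sign and its
# published public key to verify. HMAC is used here only to keep this runnable.
KEYS = {65000: b"demo-key-65000", 65001: b"demo-key-65001"}


def sign_hop(prefix, signer_as, target_as, prev_sig):
    """Sign the prefix, the signer, the AS it is advertising to, and the prior signature."""
    msg = f"{prefix}|{signer_as}|{target_as}|".encode() + prev_sig
    return hmac.new(KEYS[signer_as], msg, hashlib.sha256).digest()


def build_path(prefix, as_hops):
    """as_hops lists the path origin first, receiver last, e.g. [65000, 65001, 65003]."""
    sigs, prev = [], b""
    for signer, target in zip(as_hops, as_hops[1:]):
        prev = sign_hop(prefix, signer, target, prev)
        sigs.append((signer, target, prev))
    return sigs


def verify_path(prefix, sigs):
    """Check each signature in order and that each hop hands off to the next signer."""
    prev_sig, prev_target = b"", None
    for signer, target, sig in sigs:
        if prev_target is not None and signer != prev_target:
            return False
        if not hmac.compare_digest(sign_hop(prefix, signer, target, prev_sig), sig):
            return False
        prev_sig, prev_target = sig, target
    return True
```

Calling `build_path("192.0.2.0/24", [65000, 65001, 65003])` yields two signatures, one from AS65000 toward AS65001 and one from AS65001 toward AS65003, which is the chain H would walk in the example.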
Except, of course, there is no such thing as an “update” in BGP that is carried from A to H. Instead, at each router along the way, the information stored in the update is broken up and stored in different memory structures, and then rebuilt to be transmitted to specific peers as needed. BGPsec, then, begins with a misunderstanding of how BGP actually works; it attempts to validate the path of an update through an internetwork—and this turns out to be the one piece of information that doesn’t matter all that much in security terms.
But set this problem aside for a moment, and consider how this actually works. First, B, before the signatures, could have sent a single update to multiple peers. After the signatures, each peer must receive its own update. One of the primary ways BGP uses to increase performance is in gathering updates up and sending one update whenever possible using either a peer group or an update group. Worse yet, every reachable destination—NLRI—now must be carried in its own update. So this means no packing, and no peer groups. The signatures themselves must be added to the update packets, as well, which means they must be stored, carried across the wire, etc.
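To get a feel for the difference, here is a back-of-the-envelope count of update messages. It is an idealized sketch: it assumes every prefix sharing a set of attributes fits in one packed update (ignoring message size limits), and the route and peer counts are made-up numbers for illustration.

```python
def updates_with_packing(routes, update_groups):
    """One update per distinct attribute set, built once per update group."""
    distinct_attr_sets = {attrs for _prefix, attrs in routes}
    return len(distinct_attr_sets) * update_groups


def updates_one_per_prefix_per_peer(routes, peers):
    """One update per prefix, built separately for every peer."""
    return len(routes) * peers


# Hypothetical table: 200 prefixes sharing just two attribute sets.
routes = [(f"10.0.{i}.0/24", "path-A" if i % 2 else "path-B") for i in range(200)]

packed = updates_with_packing(routes, update_groups=2)
signed = updates_one_per_prefix_per_peer(routes, peers=40)
```

With these made-up inputs the packed case builds 4 updates, while the per-prefix, per-peer case builds 8,000.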
The general assumption in the BGPsec community is the resulting performance problems can be resolved by just upping the processor and bandwidth. That BGPsec has been around for 20 years, and the performance problem still hasn’t been solved is not something anyone seems to consider. 🙂 In practice, this also means replacing every eBGP speaker in the internetwork—perhaps hundreds of thousands of them in the ‘net—to support this functionality. “At what cost,” and “for what tradeoffs,” are questions that are almost never asked.
But let’s lay aside this problem for a moment, and just assume every eBGP speaking router in the entire ‘net could be replaced tomorrow, at no cost to anyone. Okay, all the BGP AS Path problems are now solved, right? Not so fast…
Assume, for a moment, that AS65000 and AS65001 break their peering relationship for some reason. At the moment the B to D peering relationship is shut down, D still has a copy of the signed updates it has been using. How long can AS65001 continue advertising connectivity to this route? The signatures are in band, carried in the BGP update as constructed at B, and transmitted to D. So long as AS65001 has a copy of a single update, it can appear to remain connected to AS65000, even though the connection has been shut down. The answer, then, is that AS65000 must somehow invalidate the updates it previously sent to AS65001. There are three ways to do this.
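The replay problem can be shown with a toy model. As before, an HMAC stands in for a real signature, and the key and variable names are hypothetical. The point is only that verification looks at the signed bytes, not at whether the peering session still exists.

```python
import hashlib
import hmac

ORIGIN_KEY = b"demo-key-65000"  # stand-in for AS65000's signing key


def sign(prefix, signer_as, target_as):
    msg = f"{prefix}|{signer_as}|{target_as}".encode()
    return hmac.new(ORIGIN_KEY, msg, hashlib.sha256).digest()


def verify(prefix, signer_as, target_as, sig):
    return hmac.compare_digest(sign(prefix, signer_as, target_as), sig)


# While the session is up, AS65000 signs the route toward AS65001,
# and AS65001 keeps a copy of the signed update.
stored = ("192.0.2.0/24", 65000, 65001, sign("192.0.2.0/24", 65000, 65001))

session_up = False  # the peering is torn down

# Nothing in the signed data reflects the teardown, so the old copy still verifies.
replay_still_valid = verify(*stored)
```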
First, AS65000 could roll its public and private key pair. This might work, so long as peering and depeering events are relatively rare, and the risk from such depeering situations is small. But are they? Further, until the new public and private key pairs are distributed, and until new routes can be sent through the internetwork using these new keys, the old keys must remain in place to prevent a routing disruption. How long is this? Even if it is 24 hours, probably a reasonable number, AS65001 has the means to grab traffic that is destined to AS65000 and do what it likes with that traffic. Are you comfortable with this?
Second, the community could build a certificate revocation list. This is a mess, so there’s no point in going there.
Third, you could put a timer in the BGP update, like a Link State Update. Once the timer runs out, the advertisement must be replaced. Given there are 800k routes in the default free zone and a timer of 24 hours (which would still make me uncomfortable in security terms), roughly 800k/24, or about 33,000, additional updates per hour would be added to the load of every router in the Internet. This is on top of the reduced performance noted above.
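The arithmetic behind the timer load is simple enough to check directly. This sketch assumes refreshes are spread evenly over the timer lifetime and that each route is refreshed exactly once per lifetime; real traffic would be burstier.

```python
def refresh_load(routes, lifetime_hours):
    """Return (updates per hour, updates per second) needed to refresh every route."""
    per_hour = routes / lifetime_hours
    per_second = per_hour / 3600
    return per_hour, per_second


per_hour, per_second = refresh_load(800_000, 24)
```

For 800k routes and a 24 hour timer this works out to roughly 33,000 extra updates per hour, or a little over 9 per second, at every router carrying the full table.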
Again, it is useful to set this problem aside, and assume it can be solved with the wave of a magic wand someplace. Perhaps someone comes up with a way to add a timer without causing any additional load, or a new form of revocation list is created that has none of the problems of any sort known about today. Given these, all the BGP AS Path problems in the Internet are solved, right?
Consider, for a moment, the position of AS65001 and AS65002. These are transit providers, companies that rely on their customers’ trust, and their ability to outcompete in the area of peering, to make money. First, signing updates means that you are declaring, to the entire world, in a legally provable way, who your customers are. This, from what I understand of the provider business model, is not only a no-no, but a huge legal issue. But this is actually, still, the simpler problem to solve.
Second, you cannot deploy this kind of system with a single, centrally stored private key. Assume, for a moment, that you do solve the problem this way. What happens if a single eBGP speaker is compromised? What if you need to replace a single eBGP speaker? You must roll your AS level private key. And replace all your advertisements in the entire Internet. This, from a security standpoint, is a bad idea.
Okay, the reasonable alternative is to create a private key per eBGP speaker. This private key would have its own public key, which would, in turn, be signed by the AS level private key. There are two problems with this scheme, however. The first is: when H validates the signature on some update it has received, it must now find not only the AS level public keys for AS65000 and AS65001, it must also find the public keys for B and F. This is going to be messy. The second is: by examining the public keys I receive in a collection of “every update on the Internet,” I can now map the actual peering points between every pair of autonomous systems in the world. All the secret sauce in peering relationships? Exposed. Which router (or set of routers) to attack to impact the business of a specific company? Exposed.
The bottom line is this: even setting aside BGPsec’s flawed view of the way BGP works, even setting aside BGPsec’s flawed view of what needs to be secured, even granting BGPsec implementations the benefit of doing the impossible (adding state and processing without impacting performance), even given some magical form of replay attack prevention that costs nothing, BGPsec still exposes information no-one really wants exposed. The tradeoffs are ultimately unacceptable.
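How little work the mapping takes can be seen in a short sketch. It assumes each signed update can be reduced to a list of (signing AS, router key identifier, target AS) entries; that tuple layout and the function name are assumptions I made for the example, not the actual encoding.

```python
from collections import defaultdict


def map_peering_points(signed_updates):
    """Collect, per AS pair, the router keys seen signing toward the neighbor.

    signed_updates: iterable of lists of (signer_as, router_key_id, target_as).
    Returns {(signer_as, target_as): {router_key_id, ...}}.
    """
    edges = defaultdict(set)
    for segments in signed_updates:
        for signer_as, router_key_id, target_as in segments:
            edges[(signer_as, target_as)].add(router_key_id)
    return dict(edges)
```

Feed this enough updates from enough vantage points, and the result is a per-router picture of who peers with whom, and where.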
Which all comes back to this: If you haven’t found the tradeoffs, you haven’t looked hard enough.