CONTENT TYPE
Reducing RPKI Single Point of Takedown Risk
The RPKI, for those who do not know, ties the origin AS to a prefix using a certificate (the Route Origin Authorization, or ROA) signed by a third party. The third party, in this case, is validating that the AS in the ROA is authorized to advertise the destination prefix in the ROA—if ROA’s were self-signed, the security would be no better than simply advertising the prefix in BGP. Who should be able to sign these ROAs? The assigning authority makes the most sense—the Regional Internet Registries (RIRs), since they (should) know which company owns which set of AS numbers and prefixes.
The general idea makes sense—you should not accept routes from “just anyone,” as they might be advertising the route for any number of reasons. An operator could advertise routes to source spam or phishing emails, or some government agency might advertise a route to redirect traffic, or block access to some web site. But … if you haven’t found the tradeoffs, you haven’t looked hard enough. Security, in particular, is replete with tradeoffs.
Every time you deploy some new security mechanism, you create some new attack surface—sometimes more than one. Deploy a stateful packet filter to protect a server, and the device itself becomes a target of attack, including buffer overflows, phishing attacks to gain access to the device as a launch-point into the private network, and the holes you have to punch in the filters to allow services to work. What about the RPKI?
When the RKI was first proposed, one of my various concerns was the creation of new attack services. One specific attack surface is the control a single organization—the issuing RIR—has over the very existence of the operator. Suppose you start a new content provider. To get the new service up and running, you sign a contract with an RIR for some address space, sign a contract with some upstream provider (or providers), set up your servers and service, and start advertising routes. For whatever reason, your service goes viral, netting millions of users in a short span of time.
Now assume the RIR receives a complaint against your service for whatever reason—the reason for the complaint is not important. This places the RIR in the position of a prosecutor, defense attorney, and judge—the RIR must somehow figure out whether or not the charges are true, figure out whether or not taking action on the charges is warranted, and then take the action they’ve settled on.
In the case of a government agency (or a large criminal organization) making the complaint, there is probably going to be little the RIR can do other than simply revoke your certificate, pulling your service off-line.
Overnight your business is gone. You can drag the case through the court system, of course, but this can take years. In the meantime, you are losing users, other services are imitating what you built, and you have no money to pay the legal fees.
A true story—without the names. I once knew a man who worked for a satellite provider, let’s call them SATA. Now, SATA’s leadership decided they had no expertise in accounts receivables, and they were spending too much time on trying to collect overdue bills, so they outsourced the process. SATB, a competing service, decided to buy the firm SATA outsourced their accounts receivables to. You can imagine what happens next… The accounting firm worked as hard as it could to reduce the revenue SATA was receiving.
Of course, SATA sued the accounting firm, but before the case could make it to court, SATA ran out of money, laid off all their people, and shut their service down. SATA essentially went out of business. They won some money later, in court, but … whatever money they won was just given to the investors of various kinds to make up for losses. The business itself was gone, permanently.
Herein lies the danger of giving a single entity like an RIR, even if they are friendly, honest, etc., control over a critical resource.
A recent paper presented at the ANRW at APNIC caught my attention as a potential way to solve this problem. The idea is simple—just allow (or even require) multiple signatures on a ROA. To be more accurate, each authorizing party issues a “partial certificate;” if “enough” pieces of the certificate are found and valid, the route will be validated.
The question is—how many signatures (or parts of the signature, or partial attestations) should be enough? The authors of the paper suggest there should be a “Threshold Signature Module” that makes this decision. The attestations of the various signers are combined in the threshold module to produce a single signature that is then used to validate the route. This way the validation process on the router remains the same, which means the only real change in the overall RPKI system is the addition of the threshold module.
If one RIR—even the one that allocated the addresses you are using—revokes their attestation on your ROA, the remaining attestations should be enough to convince anyone receiving your route that it is still valid. Since there are five regions, you have at least five different choices to countersign your ROA. Each RIR is under the control of a different national government; hence organizations like governments (or criminals!) would need to work across multiple RIRs and through other government organizations to have a ROA completely revoked.
An alternate solutions here, one that follows the PGP model, might be to simply have the threshold signature model consider the number and source of ROAs using the existing model. Local policy could determine how to weight attestations from different RIRs, etc.
This multiple or “shared” attestation (or signature) idea seems like a neat way to work around one of (possibly the major) attack surfaces introduced by the RPKI system. If you are interested in Internet core routing security, you should take a read through the post linked above, and then watch the video.
The Hedge 52: Tobi Metz and the Technologist Question
Tobi Metz asked What is a Technologists? in a recent blog post. Tobi joins Tom and Russ on this episode of the Hedge to expand on his answer, and get our thoughts on the question.
Everyone Must Learn to Code
The word on the street is that everyone—especially network engineers—must learn to code. A conversation with a friend and an article passing through my RSS reader brought this to mind once again—so once more into the breach. Part of the problem here is that we seem to have a knack for asking the wrong question. When we look at network engineer skill sets, we often think about the ability to configure a protocol or set of features, and then the ability to quickly troubleshoot those protocols or features using a set of commands or techniques.
This is, in some sense, what various certifications have taught us—we have reached the expert level when we can configure a network quickly, or when we can prove we understand a product line. There is, by the way, a point of truth in this. If you claim your expertise is with a particular vendor’s gear, then it is true that you must be able to configure and troubleshoot on that vendor’s gear to be an expert. There is also a problem of how to test for networking skills without actually implementing something, and how to implement things without actually configuring them. This is a problem we are discussing in the new “certification” I’ve been working on, as well.
This is also, in some sense, what the hiring processes we use have taught us. Computers like to classify things in clear and definite ways. The only clear and definite way to classify networking skills is by asking questions like “what protocols do you understand how to configure and troubleshoot?” It is, it seems, nearly impossible to test design or communication skills in a way that can be easily placed on a resume.
Coding, I think, is one of those skills that is easy to appear to measure accurately, and it’s also something the entire world insists is the “coming thing.” No coding skills, no job. So it’s easy to ask the easy question—what languages do you know, how many lines of code have you written, etc. But again, this is the wrong question (or these are the wrong questions).
What is the right question? In terms of coding skills, more along the lines of something like, “do you know how to build and use tools to solve business problems?” I phrase it this way because the one thing I have noticed about every really good coder I have known is they all spend as much time building tools as they do building shipping products. They build tools to test their code, or to modify the code they’ve already written en masse, etc. In fact, the excellent coders I know treat functions like tools—if they have to drive a nail twice, they stop and create a hammer rather than repeating the exercise with some other tool.
So why is coding such an important skill to gain and maintain for the network engineer? This paragraph seems to sum it up nicely for me—
It’s not the coding that matters, it’s “being able to model problems and use computers to solve them.” This is the essence of tool building or engineering—seeing the problem, understanding the problem, and then thinking through (sometimes by trial and error) how to build a tool that will solve the problem in a consistent, easy to manage way. I fear that network engineers are taking their attitude of configuring things and automating it to make the configuration and troubleshooting faster. We seem to end up asking “how do I solve the problem of making the configuration of this network faster,” rather than asking “what business problem am I trying to solve?”
To make effective use of the coding skills we’re telling everyone to learn, we need to go back to basics and understand the problems we’re trying to solve—and the set of possible solutions we can use to solve those problems. Seen this way, the routing protocol becomes “just another tool,” just like a function call, that can be used to solve a specific set of problems—instead of a set of configuration lines that we invoke like a magic incantation to make things happen.
Coding skills are important—but they require the right mindset if we’re going to really gain the sorts of efficiencies we think are possible.
The Hedge 51: Tim Fiola and pyNTM
Have you ever looked at your wide area network and wondered … what would the traffic flows look like if this link or that router failed? Traffic modeling of this kind is widely available in commercial tools, which means it’s been hard to play with these kinds of tools, learn how they work, and understand how they can be effective. There is, however, an open source alternative—pyNTM.
It Has to Work (RFC1925, Rule 1)
From time immemorial, humor has served to capture truth. This is no different in the world of computer networks. A notable example of using humor to capture truth is the April 1 RFC series published by the IETF. RFC1925, The Twelve Networking Truths, will serve as our guide.
According to RFC1925, the first fundamental truth of networking is: it has to work. While this might seem to be overly simplistic, it has proven—over the years—to be much more difficult to implement in real life than it looks like in a slide deck. Those with extensive experience with failures, however, can often make a better guess at what is possible to make work than those without such experience. The good news, however, is the experience of failure can be shared, especially through self-deprecating humor.
Consider RFC748, which is the first April First RFC published by the IETF, the TELNET RANDOMLY-LOSE Option. This RFC describes a set of additional signals in the TELNET protocol (for those too young to remember, TELNET is what people used to communicate with hosts before SSH and web browsers!) that instruct the server not to provide random losses through such things as “system crashes, lost data, incorrectly functioning programs, etc., as part of their services.” The RFC notes that many systems apparently have undocumented features that provide such losses, frustrating users and system administrators. The option proposed would instruct the server to disable features which cause these random losses.
Lesson learned? Although one of the general rules of application design is the network is not reliable, the counter rule suggested by RFC748 is the application is not reliable, either. This a key point in the race to Mean Time to Innocence (MTTI). RFC1882, published a few years after RFC748, is a veritable guidebook for finding problems in a network, including transceiver failures, databases with broken b-trees, unterminated contacts, and a plethora of other places to look. Published just before Christmas, RFC1882 is an ideal guide for those who want to spend time with their families during the most festive times of the year.
Another common problem in large-scale networks is services that want to choose to operate from the safety and security of an anonymous connection. RFC6593 describes the Doman Pseudonym System, specifically designed to support services that do not wish to be discovered. The specification describes two parties to the protocol, the first being the seeker, or “it,” and the second being the service which is attempting to hide from it. The process used is for the seeker to send a transmission declaring the beginning of the search sequence called the “ready or not,” followed by a countdown during which “it” is not allowed to peek at a list of available services. During this countdown, the service may change its name or location, although it will be penalized if discovered doing so. This Domain Pseudonym System is the perfect counterpart to the Domain Name System normally used to discover services on large-scale networks, as shown by the many networks that already deploy such a hide-and-seek method to managing services.
What if all the above guidance for network operators fails, and you are stuck troubleshooting a problem? RFC2321 has an answer to this problem: RITA — The Reliable Internetwork Troubleshooting Agent. The typical RITA is described as 51.25cm in length, and yellow/orange in color. The first test the operator can perform with the RITA is placing it on the documentation for the suspect system, or on top of the suspect system itself. If the RITA eventually flies away, there is a greater than 90% chance there is a defect in the system tested. The odds of the defects in the tested system being the root cause of the problem the operator is currently troubleshooting is not guaranteed, however. The RITA has such a high success rate because it is believed that 100% of systems in operation do, in fact, contain defects. The 10% failure rate primarily occurs in cases where the RITA itself dies during the test, or decides to go to sleep rather than flying to some other location.
Each of these methods can help the network operator fulfill the first rule of networking: it has to work.
Understandability
According to Maor Rudick, in a recent post over at Cloud Native, programming is 10% writing code and 90% understanding why it doesn’t work. This expresses the art of deploying network protocols, security, or anything that needs thought about where and how. I’m not just talking about the configuration, either—why was this filter deployed here rather than there? Why was this BGP community used rather than that one? Why was this aggregation range used rather than some other? Even in a fully automated world, the saying holds true.
So how can you improve the understandability of your network design? Maor defines understandability as “the dev who creates the software is to effortlessly … comprehend what is happening in it.” Continuing—“the more understandable a system is, the easier it becomes for the developers who created it to change it in a way that is safe and predictable.” What are the elements of understandability?
Documentation must be complete, clear, concise, and organized. The two primary failings I encounter in documentation are completeness and organization. Why something is done, when it was last changed, and why it was changed are often missing. The person making the change just assumes “I’ll remember this, or someone will figure it out.” You won’t, and they won’t. Concise is the “other side” of complete … Recording unsubstantial changes just adds information that won’t ever be needed. You have to balance between enough and too much, of course.
Organization is another entire problem in documentation—most people have a favorite way to organize things. When you get a team of people all organizing things based on their favorite way, you end up with a mess. Going back in time … I remember that just about everyone who was assigned to the METNAV shop began their time by re-organizing the tools. Each time the re-organization made things so much easier to find, and improved the MTTR for the airfield equipment we supported … After a while, you’d think someone would ask, “Does re-organizing all the tools every year really help? Or are you just making stuff up for new folks to do?”
Moving beyond documentation, what else can we do to make our networks more understandable?
First, we can focus on actually making networks simpler. I don’t mean just glossing things over with a pretty GUI, or automating thousands of lines of configuration using Python. I mean taking steps by using protocols that are simpler to run, require less configuration, and produce more information you can use for troubleshooting—choose something like IS-IS for your DC fabric underlay rather than BGP, unless you really have several hundred thousand of underlay destinations (hint, if you’ve properly separated “customer” routes in the overlay from “infrastructure” routes in the underlay, you shouldn’t have this kind of routing tangle in the underlay anyway).
What about having multiple protocols that do the same job? Do you really need three or four routing protocols, four or five tunneling protocols, and five or six … well, you get the idea. Reducing the sheer number of protocols running in your network can make a huge difference in the tooling troubleshooting time. What about having four or five kinds of boxes in your network that fulfill the same role? Okay—so maybe you have three DC fabrics, and you run each one using a different vendor. But is there is any reason to have three DC fabrics, each of which has a broad mix of equipment from five different vendors? I doubt it.
Second, you can think about what you would measure in the case of failure, how you would measure it, and put the basic piece in place in the design phase to do those measurements. Don’t wait until you need the data to figure out how to get at it, and what the performance results of trying to get it are going to be.
Third, you can think about where you put policy in your network. There is no “right” answer to this question, other than … be consistent. The first option is to put all your policy in one place—say, on the devices that connect the core to the aggregation, or the devices in the distribution layer. The second option is to always put the policy as close to the source or destination of the traffic impacted by the policy. In a DC fabric, you should always put policy and external connectivity in the T0 or ToR, never in the spine (it’s not a core, it’s a spine).
Maybe you have other ideas on how to improve understandability in networks … If you do, get in touch and let’s talk about it. I’m always looking for practical ways to make networks more understandable.
The Hedge 49: Karen O’Donoghue and Network Time Security
Time is critical for many of the systems that make the Internet and other operational networks “go,” but we often just assume the time is there and it’s right. In this episode of the Hedge, Karen O’Donoghue joins Alvaro and Russ to talk about some of the many attacks and failures that can be caused by an incorrect time, and current and ongoing work in securing network time in the IETF.
