The Hedge 34: Andrew Alston and the IETF

Complaining about how slow the IETF is, or how single vendors dominate the standards process, has been almost a pastime in the world of network engineering since the very beginning. It is one thing to complain; it is another to understand the structure of the problem and make practical suggestions about how to fix it. Join us at the Hedge as Andrew Alston, Tom Ammon, and Russ White reveal some of the issues, and brainstorm how to fix them.

download

Quitting Certifications: When?

At what point in your career do you stop working towards new certifications?

Daniel Dib’s recent post on his blog is, I think, an excellent starting point, but I wanted to add a few additional thoughts to the answer he gives there.

Daniel’s first question is: how do you learn? Certifications often represent a body of knowledge that people with a lot of experience believe is important, so they often provide a good guided path to approaching a new body of knowledge holistically. In the professional learning world this would be called a ready-made mental map. There is a counterargument here—certifications are often created by vendors as a marketing tool, rather than as something purely designed for the betterment of the community, or the dissemination of knowledge. This doesn’t mean, however, that certifications are “evil.” It just means you need to evaluate each certification on its own merits.

As an aside, I’ve been trying to start a non-vendor-specific certification for the last two years but have been struggling to find a group of people with the energy and excitement required to make it happen. To some degree, the reason certifications are vendor-based is because we, as a community, don’t do a good job at building them.

The second series of questions relates to your position—would a certification give you a bonus, help you get a new position, or give credibility to the company you work for? These are all valid questions requiring self-reflection about what you hope to achieve materially by working through the certification.

The final set of questions Daniel poses relates to whether a certification would give you what might be called authority in the network engineering world. Certifications, seen in this way, are a form of transitive trust. There are two components here—the certification blueprint tells you about the body of knowledge, and the certifying authority tells you about the credibility of the process. Given these two things, someone with a certification is saying “someone you trust has said I have this knowledge, so you should trust I have this knowledge as well.” The certification acts as a conduit between you and the certified person, transferring some amount of your trust in the certifying organization to the person you are talking to.

There are other ways to build this kind of trust, of course. For instance, if you blog, run a podcast, are a frequent guest contributor, have a lot of code in public repositories, or have written some books. In these cases, the trust is no longer transitive but direct—you can see a person has a body of knowledge because they have (at least to some degree) exposed that knowledge publicly.

All of these reasons are fine and good—but I think there is another point to consider in this discussion: what are you saying to the community? Once you stop “doing” certifications, you can be saying one of two things. The first is that certifications are useless. If all the reasons for getting a certification given above are valid, then telling someone “certifications are useless” is not a good thing. Those who don’t care about certifications should instead take the position of first, do no harm.

A stronger position would be to carefully evaluate existing certifications and help guide folks desiring a certification down a good path rather than a bad one. Which certifications are primarily vendor marketing programs? Which are technically sound? If you are at the point where you are no longer going to pursue certifications, these are questions you should be able to answer.

An even stronger position would be—if you’re at the point where you do not think you need to be certified, where you have a “body of knowledge” that allows people to directly trust your work, then perhaps you should also be at a point where you are helping guide the development of certifications in some way.

Certifications—whether to get them or not, and when to stop caring about them—are rather more nuanced than many in the networking world make out. There are valid reasons for, and valid reasons against—and in general, I think we need to do a better job of developing and policing certifications in order to build a stronger community.

The Hedge 33: Balazs Varga and DETNET

Balazs Varga joins Alvaro Retana and Russ White on this episode of the Hedge to discuss the work going on in the IETF around deterministic networking. This work is important for applications requiring networks that provide low latency and low loss. You can read more about DETNET in these drafts:

https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls-over-udp-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls/

download

The Resilience Problem

…we have educated generations of computer scientists on the paradigm that analysis of algorithm only means analyzing their computational efficiency. As Wikipedia states: “In computer science, the analysis of algorithms is the process of finding the computational complexity of algorithms—the amount of time, storage, or other resources needed to execute them.” In other words, efficiency is the sole concern in the design of algorithms. … What about resilience? —Moshe Y. Vardi

This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience—but at what cost in terms of efficiency? It’s also obvious highly meshed data center fabrics have plenty of resilience—and yet they still sometimes fail. Why?

These cases can be described as efficiency extremes. The single link between two distant points is extremely efficient at minimizing cost and complexity; there is only one link to pay for, only one pair of devices to configure, etc. The highly meshed data center fabric, on the other hand, is extremely efficient at rapidly carrying large amounts of data between vast numbers of interconnected devices (east/west traffic flows). Have these optimizations towards one goal resulted in tradeoffs in resilience?

Consider the case of the single long-haul link between two routers. In terms of the state/optimization/surfaces (SOS) triad, this single pair of routers and single link minimize the amount of control plane state and the breadth of surfaces (there is only one point at which the control plane and the physical network intersect, for instance). The tradeoff, however, is that a single link failure causes all traffic through the network to stop flowing—the network completely fails to do the work it’s designed to do. To create resiliency, or rather add a second dimension of optimization to the network, a second link and a second pair of routers need to be added. Adding these, however, will increase the amount of state and the number of interaction surfaces in the network. Another way to put this is the overall system becomes more complex to solve a harder set of problems—minimal-cost traffic flow versus traffic flow with resilience.
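The tradeoff described above can be made concrete with a bit of availability arithmetic. This is a toy illustration (the numbers are assumptions, not from any real network): adding a second independent long-haul link dramatically improves availability, while the amount of state the control plane must carry also grows.

```python
# Toy model of the resilience/state tradeoff: a second parallel link
# raises path availability, but the control plane now has more links
# (and adjacencies) to track--more state, more interaction surfaces.

def path_availability(link_availability: float, parallel_links: int) -> float:
    """Probability that at least one of N independent parallel links is up."""
    return 1 - (1 - link_availability) ** parallel_links

single = path_availability(0.999, 1)  # one long-haul link
dual = path_availability(0.999, 2)    # add a second link for resilience

print(f"single link: {single:.6f}")   # 0.999000
print(f"dual links:  {dual:.6f}")     # 0.999999
```

Going from one "three nines" link to two roughly turns three nines into six—but only if the links fail independently, which is exactly the kind of assumption a shared conduit or a shared control plane can quietly break.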

The second case is a little harder to understand—we assume all those parallel links necessarily make the network more resilient. If this is the case, then why do data center fabrics ever fail? In fact, DC fabrics are plagued by one of the hardest kinds of failure to understand and repair—grey failures. Going back to the SOS triad, the massive number of parallel links and devices in a DC fabric, designed to optimize the network for carrying massive amounts of traffic, also adds lots of state and interaction surfaces to the network. Increasing the amount of state and interaction surfaces should, in theory, reduce some other form of optimization—in this case resilience, through overwhelmed control planes and grey failures.

In the case of a DC fabric, simplification can increase resilience. Since you cannot reduce the number of links and devices, you must think through how and where to abstract information to reduce state. Reducing state, in turn, is bound to reduce the efficiency of traffic flows through the network, so you immediately run into a domino effect of optimization tradeoffs. Highly tuned optimization for traffic carrying causes a lack of optimization in resilience; optimizing for resilience reduces the optimization of traffic flow through the network. These kinds of chain reactions are common in the network engineering world. How can you optimize against grey failures? Perhaps by simplifying the design—using a single kind of optic rather than multiple kinds, or finding other ways to cope with complexity in the physical design.

Returning to the original quote—we often build a lot of resilience into network designs, so we do not face the same sorts of problems software designers and implementors do. Quite often the hyper-focus on resilience in network design is a result of a lack of resilience in software design—software designers have thrown the complexity of resilient design over the cubicle wall into the network operator’s lap. This clearly does not seem to be the most efficient way to handle things, as networks are vastly more complex because of the absolute resilience they are expected to provide; looking at the software and network as a single system might produce a more resilient, and yet simpler, system.

The key, in the meantime, is for network engineers to learn how to ply the tradeoffs, understanding precisely what their goals are—or what they are optimizing for—and how those optimizations trade off against one another.

The Hedge 32: Overcommunication

Michael Natkin, over at Glowforge, writes: “That’s a funny thing about our minds. In the absence of information, they fill in the gaps and make up all sorts of plausible things, without the owners of said minds even realizing it is happening.” The answer, he says, is to overcommunicate. Michael joins Eyvonne Sharpe, Tom Ammon, and Russ White on this episode of the Hedge to discuss what it means to overcommunicate.

download

Reflections on Intent

No, not that kind. 🙂

BGP security is a vexed topic—people have been working in this area for over twenty years with some effect, but we continuously find new problems to address. Today I am looking at a paper called BGP Communities: Can of Worms, which analyses some of the security problems caused by current BGP community usage in the ‘net. The point I want to think about here, though, is not the problem discussed in the paper, but rather some of the larger problems facing security in routing.

Assume there is some traffic flow passing between 101::47/64 and 100::46/64 in this network. AS65003 has helpfully set up community-string-based policies that allow a peer to request a specified AS Path prepend. In this case, if AS65003 receives a route carrying the community 3:65004x, it prepends the route advertised towards AS65004 with x additional AS Path entries; if it receives 3:65005x, it prepends the route advertised towards AS65005 with x additional AS Path entries.

Assuming community strings set by AS65002 are carried with the 100::46/64 route through the rest of the network, AS65002 can:

  • Advertise 100::46/64 towards AS65003 with 3:650045, causing the route received at AS65006 from AS65004 to have a longer AS Path than the route received through AS65005, causing the traffic to flow through AS65005
  • Advertise 100::46/64 towards AS65003 with 3:650055, causing the route received at AS65006 from AS65005 to have a longer AS Path than the route received through AS65004, causing the traffic to flow through AS65004
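The mechanics of the first bullet can be sketched in a few lines of Python. This is a hypothetical illustration, not real router behavior: the community encoding (the “3:” shorthand for AS65003, the trailing digit as a prepend count) follows the example above, and the function names are my own.

```python
# Hypothetical sketch of AS65003's community-triggered prepend policy.
# A community of the form 3:<neighbor_as><count> asks AS65003 to prepend
# itself <count> extra times on the route advertised toward <neighbor_as>.

def apply_prepend_policy(communities, base_path, neighbor_as):
    """Return the AS path AS65003 advertises toward neighbor_as."""
    path = list(base_path)
    for comm in communities:
        asn, _, value = comm.partition(":")
        if asn == "3" and value.startswith(str(neighbor_as)):
            count = int(value[len(str(neighbor_as)):])
            path = [65003] * count + path  # extra AS path entries
    return path

# AS65002 attaches 3:650045, asking for 5 prepends toward AS65004 only.
via_65004 = apply_prepend_policy(["3:650045"], [65003, 65002], 65004)
via_65005 = apply_prepend_policy(["3:650045"], [65003, 65002], 65005)

# AS65006 prefers the shorter AS path, so traffic shifts through AS65005.
best = "AS65005" if len(via_65005) < len(via_65004) else "AS65004"
print(best)  # AS65005
```

The point of the sketch is that AS65002, two hops away, steers AS65006’s best-path decision without AS65006 ever seeing who asked for the prepend.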

A lot of abuse is possible because of this situation. For instance, AS65002 might know the cost of the link between AS65006 and AS65004 is very expensive, so directing large amounts of traffic across that link will cause financial harm to AS65004 or AS65006. A malicious actor at AS65002 could also determine it can overwhelm this link, causing a sort of denial of service against anyone connected to AS65004 or AS65006.

The potential problem, then, is real.

The question, however, is how do we solve this? The most obvious way is to block communities from being transmitted beyond one hop past the point in the network where they are set. There are, however, two problems with this solution. First, how can anyone tell which AS set a community on a route? There is no originator code in the community string, and there’s no particular way to protect this kind of information from being forged or modified short of carrying a cryptographic hash in the update—which is probably not going to be acceptable from a performance perspective.
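To make the “one hop” idea concrete, here is a minimal sketch of the kind of egress filter an AS might apply—keeping only communities whose administrator field matches the local AS, so foreign communities stop propagating at the first transit AS. This is an assumed policy for illustration, not a standard mechanism, and it immediately runs into the problem above: with no originator field, the filter can only trust the administrator field as written, which anyone could have forged.

```python
# Minimal sketch (assumed policy, not a standard mechanism): on egress,
# keep only communities whose administrator field is the local AS, so
# communities set elsewhere travel at most one hop past where they are set.

LOCAL_AS = 65003

def strip_foreign_communities(communities, local_as=LOCAL_AS):
    """Keep only communities of the form <local_as>:<value>."""
    kept = []
    for comm in communities:
        admin, _, _ = comm.partition(":")
        if admin == str(local_as):  # trusting a field anyone could forge
            kept.append(comm)
    return kept

print(strip_foreign_communities(["65003:100", "65002:200", "3:650045"]))
# ['65003:100']
```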

But the technical problem here is just the “tip of the iceberg.” Even if we could determine who modified the route to include the community, there is no particular way for anyone receiving the community to determine the originator’s intent. AS65002 may well install some system which measures, in near-real time, the delay across multiple paths to determine which performs the best. Such a system could be programmed with the correct community strings to impact traffic, and then left to run some sort of machine learning process to figure out how to mark routes to improve performance. If the operator at AS65002 does not realize the cost of the AS65004->AS65006 link is prohibitive, any sort of financial burden imposed by this system could be an unintended, rather than intended, consequence.

This, it turns out, is often the problem with security. It might be that a person is bypassing building security to save a life, or it could be that they are doing so to steal corporate secrets. There is simply no way to know without meeting the person in question, listening to their reasoning, and allowing a human to decide which course of action is appropriate.

In the case of BGP, we’re dealing with “spooky action at a distance;” the source of the problem is several steps removed from the result of the problem, there’s no clear way to connect the two, and there’s no clear way to resolve the problem other than “picking up the phone” even if one of these operators can figure out what is going on.

The problem of intent is what RFC3514’s evil bit is poking a bit of fun at—if we only knew the attacker’s intent, we could often figure out what to actually do. Not knowing intent, however, puts a major crimp in many of the best-laid security plans.

The Hedge 31: Network Operator Groups

Many engineers have heard about the wide variety of Network Operator Group (NOG) meetings, from smaller regional organizations through larger multinational ones. What is the value of attending a NOG? How can you convince your business leadership of this value? In this episode of the Hedge Vincent Celindro and Edward McNair join Russ White to consider these questions.

download