The Hedge 33: Balazs Varga and DETNET
Balazs Varga joins Alvaro Retana and Russ White on this episode of the Hedge to discuss the work going on in the IETF around deterministic networking. This work is important for applications requiring networks providing low latency and low loss. You can read more about DETNET in these drafts:
https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls-over-udp-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls/
The Resilience Problem
This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience, but at what cost in terms of efficiency? It's also obvious highly meshed data center fabrics have plenty of resilience, and yet they still sometimes fail. Why?
These cases can be described as efficiency extremes. The single link between two distant points is extremely efficient at minimizing cost and complexity; there is only one link to pay for, only one pair of devices to configure, etc. The highly meshed data center fabric, on the other hand, is extremely efficient at rapidly carrying large amounts of data between vast numbers of interconnected devices (east/west traffic flows). Have these optimizations towards one goal resulted in tradeoffs in resilience?
Consider the case of the single long-haul link between two routers. In terms of the state/optimization/surfaces (SOS) triad, this single pair of routers and single link minimize the amount of control plane state and the breadth of surfaces (there is only one point at which the control plane and the physical network intersect, for instance). The tradeoff, however, is a single link failure causes all traffic through the network to stop flowing; the network completely fails to do the work it's designed to do. To create resiliency, or rather add a second dimension of optimization to the network, a second link and a second pair of routers need to be added. Adding these, however, will increase the amount of state and the number of interaction surfaces in the network. Another way to put this is the overall system becomes more complex to solve a harder set of problems: inexpensive traffic flow versus minimal cost traffic flow with resilience.
The second case is a little harder to understand—we assume all those parallel links necessarily make the network more resilient. If this is the case, then why do data center fabrics ever fail? In fact, DC fabrics are plagued by one of the hardest kinds of failure to understand and repair—grey failures. Going back to the SOS triad, the massive number of parallel links and devices in a DC fabric, designed to optimize the network for carrying massive amounts of traffic, also add lots of state and interaction surfaces to the network. Increasing the amount of state and interaction surfaces should, in theory, reduce some other form of optimization—in this case resilience through overwhelmed control planes and grey failures.
In the case of a DC fabric, simplification can increase resilience. Since you cannot reduce the number of links and devices, you must think through how and where to abstract information to reduce state. Reducing state, in turn, is bound to reduce the efficiency of traffic flows through the network, so you immediately run into a domino effect of optimization tradeoffs. Highly tuned optimization for traffic carrying causes a lack of optimization in resilience; optimizing for resilience reduces the optimization of traffic flow through the network. These kinds of chain reactions are common in the network engineering world. How can you optimize against grey failures? Perhaps simplifying design by using a single kind of optic, rather than having multiple kinds, or finding other ways to cope with the complexity in physical design.
Returning to the original quote: we often build a lot of resilience into network designs, so we do not face the same sorts of problems software designers and implementors do. Quite often the hyper-focus on resilience in network design is a result of a lack of resilience in software design; software designers have thrown the complexity of resilient design over the cubicle wall into the network operator’s lap. This clearly does not seem to be the most efficient way to handle things, as networks are vastly more complex because of the absolute resilience they are expected to provide; looking at the software and network as a system might produce a more resilient, and yet simpler, system.
The key, in the meantime, is for network engineers to learn how to ply the tradeoffs, understanding precisely what their goals are—or what they are optimizing for—and how those optimizations trade off against one another.
The Hedge 32: Overcommunication
Michael Natkin, over at Glowforge, writes: “That’s a funny thing about our minds. In the absence of information, they fill in the gaps and make up all sorts of plausible things, without the owners of said minds even realizing it is happening.” The answer, he says, is to overcommunicate. Michael joins Eyvonne Sharp, Tom Ammon, and Russ White on this episode of the Hedge to discuss what it means to overcommunicate.
Reflections on Intent
No, not that kind. 🙂
BGP security is a vexed topic—people have been working in this area for over twenty years with some effect, but we continuously find new problems to address. Today I am looking at a paper called BGP Communities: Can of Worms, which analyses some of the security problems caused by current BGP community usage in the ‘net. The point I want to think about here, though, is not the problem discussed in the paper, but rather some of the larger problems facing security in routing.

Assume there is some traffic flow passing between 101::47/64 and 100::46/64 in this network. AS65003 has helpfully set up community string-based policies that allow a peer to advertise a route with a specified AS Path prepend. In this case, when AS65003 receives a route carrying 3:65004x, it prepends the route advertised towards AS65004 with x additional AS Path entries; when it receives a route carrying 3:65005x, it prepends the route advertised towards AS65005 with x additional AS Path entries.
Assuming community strings set by AS65002 are carried with the 100::46/64 route through the rest of the network, AS65002 can:
- Advertise 100::46/64 towards AS65003 with 3:650045, causing the route received at AS65006 from AS65004 to have a longer AS Path than the route received through AS65005, causing the traffic to flow through AS65005
- Advertise 100::46/64 towards AS65003 with 3:650055, causing the route received at AS65006 from AS65005 to have a longer AS Path than the route received through AS65004, causing the traffic to flow through AS65004
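
The two cases above can be walked through with a small simulation. This is only a sketch under stated assumptions: it assumes the topology implied by the text (AS65002 peers with AS65003, which reaches AS65006 through either AS65004 or AS65005), treats the community as an opaque string in the 3:<peer AS>x shape the text uses, and assumes AS65006 chooses purely on AS Path length. The names Route, prepend_count, advertise_from_65003, and preferred_transit are hypothetical and exist only for illustration; no real BGP implementation is being modeled.

```python
from dataclasses import dataclass, field

LOCAL_AS = 65003  # the AS offering community-driven prepending in the text


@dataclass
class Route:
    prefix: str
    as_path: list                      # ordered AS numbers, nearest AS first
    communities: set = field(default_factory=set)


def prepend_count(communities, peer_as):
    """Return x from a community shaped like 3:<peer_as>x, or 0 if absent."""
    marker = f"3:{peer_as}"
    for community in communities:
        suffix = community[len(marker):]
        if community.startswith(marker) and suffix.isdigit():
            return int(suffix)
    return 0


def advertise_from_65003(route, peer_as):
    """AS65003 adds itself once, plus x extra times if the peer is targeted."""
    extra = prepend_count(route.communities, peer_as)
    path = [LOCAL_AS] * (1 + extra) + route.as_path
    return Route(route.prefix, path, set(route.communities))


def preferred_transit(community):
    """Return the transit AS that AS65006 prefers, by shortest AS Path only."""
    from_65002 = Route("100::46/64", [65002], {community})
    candidates = {}
    for transit in (65004, 65005):
        at_transit = advertise_from_65003(from_65002, transit)
        # Assumption: the transit AS adds itself once before advertising on.
        candidates[transit] = [transit] + at_transit.as_path
    return min(candidates, key=lambda t: len(candidates[t]))


print(preferred_transit("3:650045"))  # prepends towards AS65004
print(preferred_transit("3:650055"))  # prepends towards AS65005
```

Under these assumptions the first call prints 65005 and the second prints 65004, matching the two bullets: the path through the targeted AS grows by five entries, so AS65006 prefers the other one.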
A lot of abuse is possible because of this situation. For instance, AS65002 might know the cost of the link between AS65006 and AS65004 is very expensive, so directing large amounts of traffic across that link will cause financial harm to AS65004 or AS65006. A malicious actor at AS65002 could also determine it can overwhelm this link, causing a sort of denial of service against anyone connected to AS65004 or AS65006.
The potential problem, then, is real.
The problem is, however, how do we solve this? The most obvious way is to block communities from being transmitted beyond one hop past the point in the network where they are set. There are, however, two problems with this solution. First, how can anyone tell which AS set a community on a route? There is no originator code in the community string, and there’s no particular way to protect this kind of information from being forged or modified short of carrying a cryptographic hash in the update—which is probably not going to be acceptable from a performance perspective.
But the technical problem here is just the “tip of the iceberg.” Even if we could determine who modified the route to include the community, there is no particular way for anyone receiving the community to determine the originator’s intent. AS65002 may well install some system which measures, in near-real time, the delay across multiple paths to determine which performs the best. Such a system could be programmed with the correct community strings to impact traffic, and then left to run some sort of machine learning process to figure out how to mark routes to improve performance. If the operator at AS65002 does not realize the cost of the AS65004->AS65006 link is prohibitive, any sort of financial burden imposed by this system could be an unintended, rather than intended, consequence.
This, it turns out, is often the problem with security. It might be that a person is bypassing building security to save a life, or it could be they are doing so to steal corporate secrets. There is simply no way to know without meeting the person in question, listening to their reasoning, and allowing a human to decide which course of action is appropriate.
In the case of BGP, we’re dealing with “spooky action at a distance;” the source of the problem is several steps removed from the result of the problem, there’s no clear way to connect the two, and there’s no clear way to resolve the problem other than “picking up the phone” even if one of these operators can figure out what is going on.
The problem of intent is what RFC3514’s evil bit is poking a bit of fun at—if we only knew the attacker’s intent, we could often figure out what to actually do. Not knowing intent, however, puts a major crimp in many of the best-laid security plans.
The Hedge 31: Network Operator Groups
Many engineers have heard about the wide variety of Network Operator Group (NOG) meetings, from smaller regional organizations through larger multinational ones. What is the value of attending a NOG? How can you convince your business leadership of this value? In this episode of the Hedge Vincent Celindro and Edward McNair join Russ White to consider these questions.
Learning from Failure at Scale

One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.
To produce the study, the authors took data from Facebook’s ticket logging system over seven years, from 2011 through 2018. They used language-based systems to classify each event based on severity, kind of remediation, and root cause. Once the events were classified, the researchers plotted and tried to understand the results. For instance, table 2 lists the most common root causes of data center fabric incidents: 17% were maintenance, 13% misconfiguration, 13% hardware, and 12% software defects (bugs).
Given Facebook’s network is completely automated, with a full code review/canary process for validating changes before they are put into production, misconfiguration failures should be lower than in a manually operated network. That 13% of failures are still accounted for by misconfiguration shows even the best automation program cannot eliminate failures from misconfiguration. This number is also interesting because it implies networks without this degree of automation must have much higher failure rates due to misconfiguration. While the raw number of failures is not given, this seems to provide both an idea of how much improvement automation can create, as well as a sort of “cap” on how much improvement operators can expect by automating.
If misconfiguration causes 13% of all failures, and software defects cause 12%, then 25% of all failures are caused by human error. I don’t know of any other studies of this kind, but 25% sounds about right based on years of experience. Whether this 25% is spread across failures in vendor code and operator configuration, or across operator created code and operator configuration, the percentage of failure seems to remain about the same. It is not likely you can eliminate failures caused by human error, nor are you likely to drive it down more than a couple of percentage points.
Another interesting finding here is larger networks increase the time humans take to resolve incidents. As the size of the network scales up, the MTTR scales up with it. This is intuitive—larger networks tend to have more complex configurations, leading to more time spent trying to chase down and understand a problem. One thing the paper does not discuss, but might be interesting, is how modularization impacts these numbers. Intuitively, containing failures within a module (whether horizontally along topological lines or vertically through virtualization) should decrease the scope in which a network engineer needs to search to find a problem and resolve it. This is, on the other hand, likely to be offset somewhat by the increased complexity and reduction in visibility caused by segmentation—so it’s hard to determine what the overall effect of deeper segmentation in a network might be.
Overall, this is an interesting paper to parse through and understand—there are lots of great insights here for network operators at any scale.
The Hedge 30: Ethan Banks and Network Fundamentals
In this episode of the Hedge, Ethan Banks, Ethan’s old-timey routers, Tom Ammon, Tom’s printer, Eyvonne Sharp, and Russ White sit around the virtual hedge to talk about networking fundamentals. What are they, why are they important, how do you learn them, and how can you be intentional about your career?
