The Hedge 35: Peter Jones and Single Pair Ethernet

When you think of new Ethernet standards, you probably think of faster speeds and optical cabling. There is, however, an entire world of buildings out there with older copper cabling, particularly in the industrial realm, that could see dramatic improvements in productivity if their control and monitoring systems could be moved to IP. What is needed in these cases is an Ethernet standard that runs over a single copper pair, yet offers enough speed to support industrial use cases. Peter Jones joins Jeremy Filliben and Russ White to discuss single pair Ethernet.
Understanding DC Fabric Complexity
When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.
The paper begins by defining what complexity in the physical infrastructure of a DC fabric looks like. They focus on three things: packaging (the layout of the switches in the fabric), the bundles of cabling required to wire the topology, and the number and locations of the patch panels required. The packaging and patch panels impact the length and complexity of the cable runs (whether optical or copper), which represents a base complexity for the entire topology.
The second thing they consider is the lifecycle of the physical fabric infrastructure. What steps are required to upgrade the fabric from a smaller configuration to a larger one? Or from a lower speed (higher oversubscription) to a higher speed (lower oversubscription)? The result is the ability to put a number on the overall complexity of each topology.
The first class of topologies they consider is spine-and-leaf, such as the Clos, Benes, and butterfly fabrics; the paper groups all of these under the name Clos fabrics. Spine-and-leaf fabrics, they note, generally have very low cabling complexity because their symmetry encourages consistent bundling and hardware placement. They call the second kind of topology expander fabrics; the most common fabric in this class is the dragonfly. These topologies are more difficult to wire but simpler to scale out because they can be expanded largely by modifying just the edge of the fabric. Their analysis shows these two classes of fabric rate equally on their complexity scale.
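To make this kind of counting a bit more concrete, here is a minimal sketch (in Python) of how the physical inventory of a simple two-tier spine-and-leaf fabric grows with scale. The parameters and the one-bundle-per-leaf-to-spine-pair model are assumptions of mine for illustration; the paper's actual metrics are more detailed.

    # A rough, illustrative model of physical-plant inventory for a
    # two-tier spine-and-leaf fabric; not the paper's exact metrics.
    def two_tier_clos(leaves: int, spines: int, uplinks_per_leaf: int) -> dict:
        links = leaves * uplinks_per_leaf    # every leaf uplink terminates on a spine
        bundles = leaves * spines            # assume one cable bundle per leaf/spine pair
        return {
            "switches": leaves + spines,
            "leaf_to_spine_links": links,
            "cable_bundles": bundles,
            "links_per_bundle": uplinks_per_leaf / spines,
        }

    # Doubling both tiers roughly quadruples the number of bundles to plan,
    # label, and patch, which is the kind of growth the paper's complexity
    # score is trying to capture.
    print(two_tier_clos(leaves=16, spines=4, uplinks_per_leaf=4))
    print(two_tier_clos(leaves=32, spines=8, uplinks_per_leaf=8))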
A side note they don’t consider in the paper—their complexity computation implies that if you are building a fabric within a somewhat fixed range of sizes, and you can preplan the location of the spines, leaving enough room for the maximum-sized fabric on day one, spine-and-leaf fabrics are less complex than the fancier topologies you might hear about from time to time. Since most data center fabrics do, in fact, fall within these kinds of constraints (given a good day-one designer!), this seems to validate the widespread use of butterfly and Clos fabrics for most applications, which feels like a significant result for most common data center fabric designs.
Finally, they describe a topology they call FatClique, an interesting blend of spine-and-leaf and edge expander topologies; I’ve screen-grabbed the image from the paper below.

The way this topology is described feels very much like a Benes to me, or a butterfly where the fabric routers are replaced by fabrics (making a seven-stage fabric). It’s hard to tell how useful this topology would be in real deployments—but that researchers are looking into alternatives other than the venerable spine-and-leaf is interesting in its own right. Overall, it’s well worth spending the time to read the entire paper if you have an in-depth interest in fabric design.
The Hedge 34: Andrew Alston and the IETF
Complaining about how slow the IETF is, or how single vendors dominate the standards process, has been almost a pastime in the world of network engineering going back to the very beginning. It is one thing to complain; it is another to understand the structure of the problem and make practical suggestions about how to fix it. Join us at the Hedge as Andrew Alston, Tom Ammon, and Russ White reveal some of the issues, and brainstorm how to fix them.
Quitting Certifications: When?
At what point in your career do you stop working towards new certifications?
Daniel Dib’s recent post on his blog is, I think, an excellent starting point, but I wanted to add a few additional thoughts to the answer he gives there.
Daniel’s first question is: how do you learn? Certifications often represent a body of knowledge that people with a lot of experience believe is important, so they often provide a good, guided path for holistically approaching that body of knowledge. In the professional learning world this would be called a ready-made mental map. There is a counterargument here—certifications are often created by vendors as a marketing tool, rather than as something purely designed for the betterment of the community, or the dissemination of knowledge. This doesn’t mean, however, that certifications are “evil.” It just means you need to evaluate each certification on its own merits.
As an aside, I’ve been trying to start a non-vendor-specific certification for the last two years but have been struggling to find a group of people with the energy and excitement required to make it happen. To some degree, the reason certifications are vendor-based is because we, as a community, don’t do a good job at building them.
The second series of questions relates to your position—would a certification give you a bonus, help you get a new position, or give credibility to the company you work for? These are all valid questions requiring self-reflection around what you hope to achieve materially by working through the certification.
The final set of questions Daniel poses relates to whether a certification would give you what might be called authority in the network engineering world. Certifications, seen in this way, are a form of transitive trust. There are two components here—the certification blueprint tells you about the body of knowledge, and the certifying authority tells you about the credibility of the process. Given these two things, someone with a certification is saying “someone you trust has said I have this knowledge, so you should trust I have this knowledge as well.” The certification acts as a transit between you and the certified person, transferring some amount of your trust in the certifying organization to the person you are talking to.
There are other ways to build this kind of trust, of course. For instance, if you blog, or run a podcast, or are a frequent guest contributor, or have a lot of code on git, or have written some books, etc. In these cases, the trust is no longer transitive but direct—you can see a person has a body of knowledge because they have (at least to some degree) exposed that knowledge publicly.
All of these reasons are fine and good—but I think there is another point to think about in this discussion: what are you saying to the community? Once you stop “doing” certifications, you can be saying one of two things. The first is that certifications are useless. If all the reasons for getting a certification above are true, then telling someone “certifications are useless” is not a good thing. Those who don’t care about certifications should instead take the position of “first, do no harm.”
A stronger position would be to carefully evaluate existing certifications and help guide folks desiring a certification down a good path rather than a bad one. Which certifications are primarily vendor marketing programs? Which are technically sound? If you are at the point where you are no longer going to pursue certifications, these are questions you should be able to answer.
An even stronger position would be—if you’re at the point where you do not think you need to be certified, where you have a “body of knowledge” that allows people to directly trust your work, then perhaps you should also be at a point where you are helping guide the development of certifications in some way.
Certifications—whether to get them or not, and when to stop caring about them—are rather more nuanced than many in the networking world make out. There are valid reasons for, and valid reasons against—and in general, I think we need to do a better job of developing and policing certifications in order to build a stronger community.
The Hedge 33: Balazs Varga and DETNET
Balazs Varga joins Alvaro Retana and Russ White on this episode of the Hedge to discuss the work going on in the IETF around deterministic networking. This work is important for applications requiring networks that provide low latency and low loss. You can read more about DETNET in these drafts:
https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls-over-udp-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-ip/
https://datatracker.ietf.org/doc/draft-ietf-detnet-mpls/
The Resilience Problem
This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously, adding a second long-haul link would improve resilience—but at what cost in terms of efficiency? It’s also obvious that highly meshed data center fabrics have plenty of resilience—and yet they still sometimes fail. Why?
These cases can be described as efficiency extremes. The single link between two distant points is extremely efficient at minimizing cost and complexity; there is only one link to pay for, only one pair of devices to configure, etc. The highly meshed data center fabric, on the other hand, is extremely efficient at rapidly carrying large amounts of data between vast numbers of interconnected devices (east/west traffic flows). Have these optimizations towards one goal resulted in tradeoffs in resilience?
Consider the case of the single long-haul link between two routers. In terms of the state/optimization/surfaces (SOS) triad, this single pair of routers and single link minimize the amount of control plane state and the breadth of interaction surfaces (there is only one point at which the control plane and the physical network intersect, for instance). The tradeoff, however, is that a single link failure causes all traffic through the network to stop flowing—the network completely fails to do the work it is designed to do. To create resiliency, or rather to add a second dimension of optimization to the network, a second link and a second pair of routers need to be added. Adding these, however, will increase the amount of state and the number of interaction surfaces in the network. Another way to put this is that the overall system becomes more complex in order to solve a harder set of problems—inexpensive traffic flow versus minimal-cost traffic flow with resilience.
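As a rough illustration of how that second link and second pair of routers adds state, here is a small Python sketch. The way state is counted here (one adjacency per link, one route entry per prefix per usable path) is a simplification I am assuming for the example, not a measure drawn from any particular protocol.

    # Illustrative only: control-plane state for two sites connected by
    # some number of parallel long-haul links, each with its own router pair.
    def long_haul_state(parallel_links: int, prefixes: int) -> dict:
        adjacencies = parallel_links                     # one routing adjacency per link
        entries_per_router = prefixes * parallel_links   # one entry per prefix per usable path
        return {
            "adjacencies": adjacencies,
            "route_entries_per_router": entries_per_router,
        }

    print(long_haul_state(parallel_links=1, prefixes=500))  # the single-link design
    print(long_haul_state(parallel_links=2, prefixes=500))  # resilient, but twice the state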
The second case is a little harder to understand—we assume all those parallel links necessarily make the network more resilient. If this is the case, then why do data center fabrics ever fail? In fact, DC fabrics are plagued by one of the hardest kinds of failure to understand and repair—grey failures. Going back to the SOS triad, the massive number of parallel links and devices in a DC fabric, designed to optimize the network for carrying massive amounts of traffic, also adds lots of state and interaction surfaces to the network. Increasing the amount of state and the number of interaction surfaces should, in theory, reduce some other form of optimization—in this case resilience, through overwhelmed control planes and grey failures.
In the case of a DC fabric, simplification can increase resilience. Since you cannot reduce the number of links and devices, you must think through how and where to abstract information to reduce state. Reducing state, in turn, is bound to reduce the efficiency of traffic flows through the network, so you immediately run into a domino effect of optimization tradeoffs. Highly tuned optimization for traffic carrying causes a lack of optimization in resilience; optimizing for resilience reduces the optimization of traffic flow through the network. These kinds of chain reactions are common in the network engineering world. How can you optimize against grey failures? Perhaps by simplifying the design, such as using a single kind of optic rather than several, or by finding other ways to cope with the complexity of the physical design.
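A small sketch may help show why abstraction trades state against optimal traffic flow. The summarization model below (each leaf sees one summary route per spine instead of every remote rack prefix) is my own simplified assumption, not a recipe from any particular design.

    # Illustrative only: route entries a single leaf carries in a two-tier
    # fabric, with and without summarization (abstraction) at the spine tier.
    def leaf_route_entries(leaves: int, spines: int, prefixes_per_leaf: int,
                           summarize_at_spine: bool) -> int:
        if summarize_at_spine:
            # the leaf sees only one summary per spine; detail about which
            # remote links have failed is hidden, so paths may be less optimal
            return spines
        return (leaves - 1) * prefixes_per_leaf

    print(leaf_route_entries(64, 8, 4, summarize_at_spine=False))  # 252 entries
    print(leaf_route_entries(64, 8, 4, summarize_at_spine=True))   # 8 entries

Less state means less for the control plane to carry and churn, but also less information with which to route around a partial failure, which is exactly the domino effect of tradeoffs described above.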
Returning to the original quote—we often build a lot of resilience into network designs, so we do not face the same sorts of problems software designers and implementors do. Quite often the hyper-focus on resilience in network design is the result of a lack of resilience in software design—software designers have thrown the complexity of resilient design over the cubicle wall into the network operator’s lap. This does not seem to be the most efficient way to handle things, as networks are vastly more complex because of the absolute resilience they are expected to provide; looking at the software and network as a single system might produce a more resilient, and yet simpler, system.
The key, in the meantime, is for network engineers to learn how to ply the tradeoffs, understanding precisely what their goals are—or what they are optimizing for—and how those optimizations trade off against one another.
The Hedge 32: Overcommunication
Michael Natkin, over at Glowforge, writes: “That’s a funny thing about our minds. In the absence of information, they fill in the gaps and make up all sorts of plausible things, without the owners of said minds even realizing it is happening.” The answer, he says, is to overcommunicate. Michael joins Eyvonne Sharpe, Tom Ammon, and Russ White on this episode of the Hedge to discuss what it means to overcommunicate.
