Measuring the Core

This last week I was a guest on the TechSequences podcast with Leslie and Alexa discussing the centralization of the routed infrastructure in the ‘net. When that episode posts, I’ll cross post it here (but, of course, you should really just subscribe to their podcast, as they always have interesting guests—I’ll have Leslie and Alexa on the Hedge at some point, as well). The topic is related to this post on CircleID about the death of transit, which was a reaction to Geoff Huston’s article on the death of transit some time before.

All that to say… while reading through some research papers this week, I ran into a recent (2018) paper where Carisimo et al. try out different ways of measuring which autonomous systems belong to the “core” of the ‘net. They went about this by taking a set of AS’ “everyone” acknowledges to be “part of the core,” and then trying to find some measurement that successfully describes something all of them have in common.

The result is the k-metric, which measures the connectivity of an AS’ peers. If an AS has peers who are just as connected as they are, then k-metric is high. Otherwise, the k-metric is low. It does make sense this measure would be able to pick out “core” AS’, because it picks out the set of most highly interconnected AS’ in the ‘net.
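To make the idea of "peers who are just as connected as you are" concrete, here is a minimal sketch using k-core decomposition, a standard graph measure with this flavor. Whether it matches the paper's exact definition of the k-metric is an assumption on my part, and the AS names and adjacencies below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical AS-level adjacencies (names are made up for illustration).
EDGES = [
    ("T1", "T2"), ("T1", "T3"), ("T1", "C1"),
    ("T2", "T3"), ("T2", "C1"), ("T3", "C1"),
    ("E1", "T1"), ("E1", "T2"),
    ("S1", "E1"),
]

def core_numbers(edges):
    """Return {node: core number} for an undirected graph given as edge pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    degree = {node: len(neighbors) for node, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel off the node with the fewest remaining neighbors.
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbor in adj[node]:
            if neighbor in remaining:
                degree[neighbor] -= 1
    return core

core = core_numbers(EDGES)
top = max(core.values())
# Nodes in the innermost core: each has at least `top` neighbors
# that are themselves in that core.
print(sorted(node for node, k in core.items() if k == top))
```

A node only stays in a high core if its neighbors stay there too, which is why a measure like this picks out a densely interconnected set rather than just the nodes with the most links.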

Once they determined the k-metric is a good way to determine which AS’ are in the core of the ‘net, they calculated the membership of the core over time. Their graph is below.

The way the chart is laid out is a little difficult to see, but the green is transit providers and the blue is content providers. Sure enough, the percentage of content providers in the core of the ‘net, in terms of sheer connectivity, has increased over time. These same content providers now account for some 80% (or more?) of the traffic on the ‘net. All this means is the centralization of content is visible in objective measurements, so it’s a real thing. Content providers are currently “only” 20% of the core, but given their traffic levels this is a much bigger deal than it seems. There are many parts of the world where the population or access density is not high enough for large content providers to justify building out so they touch the last mile. If communities build out last mile optical networks, however, it’s likely these large content providers will consume ever-larger percentages of the “core” AS’.

Is QUIC really Quicker?

QUIC is a relatively new data transport protocol developed by Google, and currently in line to become the default transport for the upcoming HTTP standard. Because of this, it behooves every network engineer to understand a little about this protocol, how it operates, and what impact it will have on the network. We did record a History of Networking episode on QUIC, if you want some background.

In a recent Communications of the ACM article, a group of researchers (Kakhki et al.) used a modified implementation of QUIC to measure its performance under different network conditions, directly comparing it to TCP’s performance under the same conditions. Since the current implementations of QUIC use the same congestion control as TCP (Cubic), the only differences in performance should be code tuning in estimating the round-trip time (RTT) for congestion control, QUIC’s ability to form a session in a single RTT, and QUIC’s ability to carry multiple streams in a single connection. The researchers asked two questions in this paper: how does QUIC interact with TCP flows on the same network, and does QUIC perform better than TCP in all situations, or only some?

To answer the first question, the authors tried running QUIC and TCP over the same network in different configurations, including single QUIC and TCP sessions, a single QUIC session with multiple TCP sessions, etc. In each case, they discovered that QUIC consumed about 50% of the bandwidth; if there were multiple TCP sessions, they would be starved for bandwidth when running in parallel with the QUIC session. For network folk, this means an application implemented using QUIC could well cause performance issues for other applications on the network—something to be aware of. This might mean it is best, if possible, to push QUIC-based applications into a separate virtual or physical topology with strict bandwidth controls if it causes other applications to perform poorly.
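A little arithmetic shows why "about 50%" matters more as the number of TCP sessions grows. This is only a sketch under simple assumptions: the 100 Mbps link and the even split among TCP flows are hypothetical, and the 50% figure is the one quoted above rather than something the code measures.

```python
def per_flow_share(link_mbps, quic_fraction, tcp_flows):
    """Return (QUIC rate, rate per TCP flow) if QUIC takes a fixed fraction."""
    quic = link_mbps * quic_fraction
    tcp_each = (link_mbps - quic) / tcp_flows
    return quic, tcp_each

def fair_share(link_mbps, total_flows):
    """Rate each flow would get if the link were split evenly."""
    return link_mbps / total_flows

LINK_MBPS = 100  # hypothetical bottleneck link
for tcp_flows in (1, 2, 4):
    quic, tcp_each = per_flow_share(LINK_MBPS, 0.5, tcp_flows)
    even = fair_share(LINK_MBPS, tcp_flows + 1)
    print(f"{tcp_flows} TCP flow(s): QUIC {quic} Mbps, "
          f"each TCP {tcp_each} Mbps, even split {even:.1f} Mbps")
```

With one competing TCP flow the split is even, but with four the QUIC session holds two and a half times its even share while each TCP flow gets well under its own.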

Does QUIC’s ability to consume more bandwidth mean applications developed on top of it will perform better? According to the research in this paper, the answer is how many balloons fit in a bag? In other words, it all depends. QUIC does perform better when its multi-stream capability comes into play and the network is stable—for instance, when transferring variably sized objects (files) across a network with stable jitter and delay. In situations with high jitter or delay, however, TCP consistently outperforms QUIC.

TCP outperforming QUIC is a bit of a surprise in any situation; how is this possible? The researchers used information from their additional instrumentation to discover QUIC does not tolerate out-of-order packet delivery very well because of its fast packet retransmission implementation. Presumably, it should be possible to modify these parameters somewhat to make QUIC perform better.

This would still leave the second problem the researchers found with QUIC’s performance—a large difference between its performance on desktop and mobile platforms. The difference between these two comes down to where QUIC is implemented. Desktop devices (and/or servers) often have smart NICs which implement TCP in the ASIC to speed packet processing up. QUIC, because it runs in user space, only runs on the main processor (it seems hard to see how a user space application could run on a NIC—it would probably require a specialized card of some type, but I’ll have to think about this more). The result is that QUIC’s performance depends heavily on the speed of the processor. Since mobile devices have much slower processors, QUIC performs much more slowly on mobile devices.

QUIC is an interesting new transport protocol—one everyone involved in designing or operating networks is eventually going to encounter. This paper gives good insight into the “soul” of this new protocol.

To Route or Not?

When you are building a data center fabric, should you run a control plane all the way to the host? This is a question I encounter more often as operators deploy eVPN-based spine-and-leaf fabrics in their data centers (for those who are actually deploying scale-out spine-and-leaf; I see a lot of people deploying hybrid sorts of networks designed as “mini-hierarchical” designs and just calling them spine-and-leaf fabrics, but this is probably a topic for another day). Three reasons are generally given for running the control plane all the way to the hosts attached to the fabric: faster down detection, load sharing, and traffic engineering. Let’s consider each of these in turn.

Faster Down Detection. There’s no simple way for ToR switches to determine when the connection to a host has failed, whether the host is single or dual-homed. Somehow the set of routes reachable through the host must be related to the interface state, or some underlying fast hello state (such as BFD), so that if a link fails the ToR knows to pull the correct set of routes from the routing table. It’s simpler to just let the host itself advertise the correct reachability information; when the link fails, the routing session will fail, and the correct routes will automatically be withdrawn.

Load Sharing. While this only applies to hosts with two connections into the fabric (dual-homed hosts), this is still an important use case. If a dual-homed host only has two default routes to work from, the host is blind to network conditions, and can only load share equally across the available paths. Equal load sharing, however, may not be ideal in all situations. If the host is running routing, it is possible to inject more intelligence into the load sharing between the upstream links.
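As a rough illustration of the difference between equal and informed load sharing, the sketch below hashes flows onto uplinks in proportion to a weight. The uplink names, the weights, and the idea that routing would supply those weights are all assumptions for the example, not a description of any particular host stack.

```python
import hashlib

def pick_uplink(flow, weights):
    """Map a flow tuple to an uplink in proportion to integer weights."""
    total = sum(weights.values())
    digest = hashlib.sha256(repr(flow).encode()).digest()
    point = int.from_bytes(digest[:8], "big") % total
    for uplink, weight in sorted(weights.items()):
        if point < weight:
            return uplink
        point -= weight

# Two default routes and nothing else: the host can only split evenly.
EQUAL = {"tor-a": 1, "tor-b": 1}
# Hypothetical weights a routing process might derive, for example when
# the path through tor-b is congested or partially failed.
INFORMED = {"tor-a": 3, "tor-b": 1}

flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(4000)]
for name, weights in (("equal", EQUAL), ("informed", INFORMED)):
    via_a = sum(pick_uplink(f, weights) == "tor-a" for f in flows)
    print(name, "share via tor-a:", round(via_a / len(flows), 2))
```

Hashing on the flow keeps every packet of a flow on one uplink, so changing the weights shifts new flows without reordering packets inside existing ones.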

Traffic Engineering. Or traffic shaping, steering, etc. In some cases, traffic engineering requires injecting a label or outer header onto the packet as it enters the fabric. In others, more specific routes might be sent along one path and not another to draw specific kinds of traffic through a more optimal route in the fabric. This kind of traffic engineering is only possible if the control plane is running on the host.

All these reasons are well and good, but they all assume something that should be of great interest to the network designer: which control plane are we talking about?

Most DC fabric designs I see today assume there is a single control plane running on the fabric—generally this single control plane is BGP, and it’s being used both to provide basic IP connectivity through the fabric (the infrastructure underlay control plane) and to provide tunneled overlay reachability (the infrastructure overlay control plane—generally eVPN).

This entangling of the infrastructure underlay and overlay has always seemed, to me, to be less than ideal. When I worked on large-scale transit provider networks in my more youthful days, we intentionally designed networks that separated customer routes from infrastructure routes. This created two separate failure and security domains in the network, as well as dividing the telemetry data in ways that allowed faster troubleshooting of common problems.

The same principles should apply in a DC fabric—after all, the workloads are essentially customers of the fabric, while the basic underlay connectivity counts as infrastructure. The simplest way to adopt this sort of division of labor is the same way large-scale transit providers did (and do)—use two different routing protocols for the underlay and overlay. For instance, IS-IS or RIFT for the underlay and eVPN using BGP for the overlay.

If you move to two layers of control plane, the question above becomes a bit more nuanced—should the overlay control plane run on the hosts? Should the underlay control plane run on the hosts?

For faster down detection—for those hosts that need faster down detection, BFD tied to IGP neighbor state can remove the correct nexthop from the local routing table at a ToR, causing the correct reachable destinations to be withdrawn. Alternatively, the host can run an instance of the overlay control plane, which allows it to advertise and withdraw “customer routes” directly. In neither case is the underlay control plane required to run on the host.
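The first option can be sketched as a ToR table that keys routes by next hop, so a single "next hop down" event (from BFD or IGP neighbor state) removes every destination behind it. The class and method names here are hypothetical, and a real ToR does this in its RIB and FIB rather than in Python.

```python
class TorRoutes:
    """Toy route table: destinations grouped by the next hop they use."""

    def __init__(self):
        self._by_next_hop = {}

    def learn(self, prefix, next_hop):
        self._by_next_hop.setdefault(next_hop, set()).add(prefix)

    def reachable(self):
        return set().union(*self._by_next_hop.values())

    def next_hop_down(self, next_hop):
        """Called when the liveness check for a next hop fails.

        Returns the set of prefixes that must now be withdrawn upstream.
        """
        return self._by_next_hop.pop(next_hop, set())

tor = TorRoutes()
tor.learn("10.1.1.0/24", "host-1")
tor.learn("10.1.2.0/24", "host-1")
tor.learn("10.1.3.0/24", "host-2")

withdrawn = tor.next_hop_down("host-1")
print("withdraw:", sorted(withdrawn))
print("still reachable:", sorted(tor.reachable()))
```

Nothing in this flow requires the host to speak the underlay protocol; the ToR only needs a liveness signal and a mapping from next hop to routes.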

For load sharing and traffic engineering: if something like SRm6, or even other more traditional forms of traffic engineering, is in use, the information needed will be carried in the overlay rather than the underlay, so the underlay routing protocol does not need to run on the host.

On the other side of the coin, not running the underlay protocol on the host can help the overall network security posture. Assume a public facing host connected to the fabric is somehow pwned… If the host is running the underlay protocol, it’s pretty simple to DoS the entire fabric to take it down, or to inject incorrect routing information. If the overlay is configured correctly, however, only the virtual topology which the host has access to can be impacted by an attack, and if microsegmentation is deployed, that damage can be minimized as well.

From a complexity perspective, running the underlay control plane on the host dramatically increases the amount of state the host must maintain; there is no effective filter you can run to reduce state on the host without destroying some of the advantages gained by running the underlay control plane there. On the other hand, the ToR can be configured to filter routing information the host receives, controlling the amount of state the host needs to manage.
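To illustrate the ToR-side filtering mentioned above, here is a minimal sketch that passes a host only a default route plus the prefixes inside an allowed block. The function name, the prefixes, and the keep-the-default policy are assumptions for the example; a real deployment would express this as routing policy on the ToR.

```python
import ipaddress

def filter_for_host(fabric_routes, allowed, keep_default=True):
    """Return the subset of fabric routes a ToR would pass to a host."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    kept = []
    for route in fabric_routes:
        net = ipaddress.ip_network(route)
        if keep_default and net.prefixlen == 0:
            kept.append(route)
        elif any(net.version == a.version and net.subnet_of(a)
                 for a in allowed_nets):
            kept.append(route)
    return kept

# Hypothetical routes present in the fabric.
FABRIC_ROUTES = [
    "0.0.0.0/0", "10.1.0.0/24", "10.1.1.0/24",
    "10.2.0.0/24", "192.0.2.0/24",
]

host_view = filter_for_host(FABRIC_ROUTES, allowed=["10.1.0.0/16"])
print(len(FABRIC_ROUTES), "routes in the fabric,", len(host_view), "sent to the host")
print(host_view)
```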

Control plane on the host or not? This is one of those questions where properly modularized and layered network design can make a big difference in what the right answer should be.

Ruminating on SOS

Many years ago I attended a presentation by Dave Meyers on network complexity, which set off an entire line of thinking about how we build networks that are just too complex. While it might be interesting to dive into our motivations for building networks that are just too complex, I started thinking about how to classify and understand the complexity I was seeing in all the networks I touched. Of course, my primary interest is in how to build networks that are less complex, rather than just understanding complexity…

This led me to do a lot of reading, write some drafts, and then write a book. During this process, I ended up coining what I call the complexity triad: State, Optimization, and Surface. If you read the book on complexity, you can see my views on what the triad consisted of changed through the writing; I started out with volume (of state), speed (of state), and optimization. Somehow, though, interaction surfaces need to play a role in the complexity puzzle.

First, you create interaction surface when you modularize anything—and you modularize to control state (the scope to set apart failure domains, the speed and volume to enable scaling). Second, adding interaction surfaces adds complexity by creating places where information must be exchanged—which requires protocols and other things. Finally, reducing state through abstraction at an interaction surface is the primary cause of many forms of suboptimal behavior in a control plane, and causes unintended consequences. Since interaction surfaces are so closely tied to state and optimization, then, I added surfaces to the triad, and merged the two kinds of state into one, just state.

I have been thinking through the triad again in the last several weeks for various reasons, and I’m not certain it’s quite right still because I’m not convinced surfaces are really a tradeoff against state and optimization. It seems more accurate to say that state and optimization trade off through interaction surfaces. This does not make it any less of a triad, but it might mean I need to find a little different way to draw it. One way to illustrate it is as a system of moving parts, such as the illustration below.

If you think of the interaction surface between modules 1 and 2—two topological parts, or a virtual topology on top of a physical—then the abstraction is the amount of information allowed to pass between the two modules. For instance, in aggregation the length of the aggregated prefixes, or the aggregated prefix metrics, etc.

When you “turn the crank,” so-to-speak, you adjust the volume, speed (velocity), breadth, or depth of information being passed between the modules: either more or less information, faster or slower, in more places or fewer, or the reaction of the module receiving the state. Every time you turn the crank, however, there is not one reaction but many. Notice optimization 1 will turn in the opposite direction from optimization 2 in the diagram, so turning the crank for 1 to be more optimal will always result in 2 becoming less optimal. There are tens or hundreds of such interactions in any system, and it is impossible for any person to know or understand all of them.

For instance, if you aggregate hundreds of /64’s to tens of /60’s, you reduce the state and optimize by reducing the scope of the failure domain. On the other hand, because you have less specific routing information, traffic is (most likely) going to flow along less-than-optimal paths. If you “turn the crank” by aggregating those same hundreds of /64’s to a single ::/0 default route, you will have more “airtight” failure domains or modules, but less optimal traffic flow. Hence …
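The state side of this tradeoff is easy to see with Python's standard ipaddress module. This is just a sketch: the 2001:db8::/56 block is documentation address space chosen for the example, and the route counts stand in for "state" in a very rough way.

```python
import ipaddress

# 256 hypothetical /64s carved from a documentation prefix.
base = ipaddress.ip_network("2001:db8::/56")
specifics = list(base.subnets(new_prefix=64))

# First turn of the crank: advertise the covering /60s instead.
aggregates = sorted({net.supernet(new_prefix=60) for net in specifics})

# Last turn of the crank: advertise only a default route.
default = ipaddress.ip_network("::/0")

print("specific routes:", len(specifics))
print("/60 aggregates: ", len(aggregates))
print("default covers all:", all(net.subnet_of(default) for net in specifics))
```

Each step hides more information from the neighboring module: the receiver carries fewer routes and sees fewer changes, but it also knows less about where each /64 actually lives, which is where the less-than-optimal paths come from.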

If you haven’t found the tradeoffs, you haven’t looked hard enough.

What understanding the SOS triad allows you, combined with a fundamental knowledge of how these things work, is to know where to look for the tradeoffs. Maybe it would be better to illustrate the SOS triad with surfaces at the bottom all the time, acting as a sort of fulcrum or balance point between state and optimization… Or maybe a completely different illustration would be better. This is something for me to think about more and figure out.

Complexity interacts with these interaction surfaces as well, of course: the more complex a system becomes, the more complex the interaction surfaces within the system become, or the more of them you have. A key point in design of any kind is balancing the number of interaction surfaces with their complexity, depth, and breadth; in other words, where should you modularize, what should each module contain, what sort of state is passed between the modules, where does state pass between the modules, etc. Somehow, mentally, you have to factor in the unintended consequences of hiding information (the first corollary to Keith’s Law, in effect), and the law of leaky abstractions (all nontrivial abstractions leak).

This is a far different way of looking at networks and their design than what you learned in any random certification, and it’s probably not even something you will find in a college textbook. It is quite difficult to apply when you’re down in the configuration of individual devices. But it’s also the key to understanding networks as a system and beginning the process of thinking about where and how to modularize to create the simplest system to solve a given hard problem.

Going back to the beginning, then—one of the reasons we build such complex networks is we do not really think about how the modules fit together. Instead, we use rules-of-thumb and folk wisdom while we mumble about failure domains and “this won’t scale” under our breath. We are so focused on the individual gears becoming commodities that we fail to see the system and all its moving parts—or we somehow think “this is all so easy,” that we build very inefficient systems with brute-force resolutions, often resulting in mass failures that are hard to understand and resolve.

Sorry, there’s no clear point or lesson here… This is just what happens when I’ve been buried in dissertation work all day and suddenly realize I have not written a blog post for this week… But it should give you something to think about.

Learning from the Post-Mortem

Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.

The common pattern seems to be setting aside a one hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now—fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.

In a few months’ time, the same team will be in the same room, draw the same drawings, and say the same things all over again. At least that is the way it seems to me. If there is an effective post-mortem process in use by a company someplace, I do not think I have seen it.

From the article—

Are we missing anything in this prevalent rinse-and-repeat cycle of how the industry generally addresses incidents that could be helpful? Put another way: As we experience incidents, work through them, and deal with their aftermath, if we set aside incident-specific, and therefore fundamentally static, remediation items, both in technology and process, are we learning anything else that would be useful in addressing and responding to incidents? Can we describe that knowledge? And if so, how would we then make use of it to leverage past pain and improve future chances at success?

I tend to think, from the few times I have seen network post-mortems performed, that the reason they do not work well is because we slip into the same appliance/configuration frame of mind so quickly. We want to understand what configuration was entered incorrectly, or what defect should be reported back to the vendor, rather than thinking about organizational and process changes. The smaller the detail, the safer the conclusions, after all—aim small, miss big, is what we say in the shooting world.

We focus so much on mean time to innocence, and how to create a practically perfect process that will never fail, that we fail to do the one thing we should be doing: learning.

Okay, so enough whining—what can be done about this situation? A few practical suggestions come to mind. These are not, of course, well-thought-out solutions, but rather, perhaps, “part of the solution.”

Rather than trying to figure out the root cause, spend that precious hour of post-mortem time mapping out three distinct workflows. The first should be the process that set up the failure. What drove the installation of this piece of hardware or software? What drove the deployment of this protocol? How did we get to the place where this failure had that effect? Once this is mapped out, see if there is anything in that process, or even in the political drivers and commitments made during that process, that could or should be modified to really change the way technology is deployed in your network.

The second process you should map out is the steps taken to detect the problem. Dwell time is a huge problem in modern networks—the time between a failure occurring and being detected. You should constantly focus on bringing dwell time down while paying close attention to the collateral damage of false positives. Mapping out how this failure was detected, and where it should have been caught sooner, can help improve telemetry systems, ultimately decreasing MTTR.
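If you record three timestamps per incident, dwell time and repair time fall out directly. The sketch below assumes a simple record format and made-up incidents; it also assumes MTTR is measured from failure to resolution, which is one common convention but not the only one.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when it broke, when we noticed, when it was fixed.
INCIDENTS = [
    {"failed": datetime(2024, 1, 10, 2, 0),
     "detected": datetime(2024, 1, 10, 2, 40),
     "resolved": datetime(2024, 1, 10, 4, 0)},
    {"failed": datetime(2024, 2, 3, 14, 0),
     "detected": datetime(2024, 2, 3, 14, 10),
     "resolved": datetime(2024, 2, 3, 14, 40)},
]

def timings(incident):
    """Return (dwell time, time to repair after detection)."""
    dwell = incident["detected"] - incident["failed"]
    repair = incident["resolved"] - incident["detected"]
    return dwell, repair

def mean(deltas):
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)

mean_dwell = mean(timings(i)[0] for i in INCIDENTS)
mttr = mean(sum(timings(i), timedelta()) for i in INCIDENTS)
print("mean dwell time:", mean_dwell)
print("MTTR (failure to resolution):", mttr)
```

Splitting the total this way shows how much of the outage was spent not knowing there was a problem, which is the part better telemetry can shrink.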

The third, and final, workflow you map out should be the troubleshooting process itself. People rarely map out their troubleshooting process for later reference, but this little trick I learned from way back in tube-type electronics days used to save me hours of time in the field. As you troubleshoot, make a flow chart. Record what you checked, why you checked it, how you checked it, and what you learned from the check. This flowchart, or workflow, is precious material in the post-mortem process. What can you instrument, or make easier to find, to reduce troubleshooting time in the next go-round? How can you traverse the network and find the root cause faster next time? These are crucial questions you can only answer with the use of a troubleshooting workflow.
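One lightweight way to keep that record while you work is a structure with the four fields named above: what, why, how, and what you learned. This is only a sketch; the class names and the sample entries are hypothetical, and a paper notebook does the same job.

```python
from dataclasses import dataclass, field

@dataclass
class Check:
    what: str     # what was checked
    why: str      # why it was checked
    how: str      # how it was checked
    learned: str  # what the check showed

@dataclass
class TroubleshootingLog:
    problem: str
    checks: list = field(default_factory=list)

    def record(self, what, why, how, learned):
        self.checks.append(Check(what, why, how, learned))

    def as_flow(self):
        """Render the checks in order, as a rough text flow chart."""
        lines = [f"Problem: {self.problem}"]
        for step, c in enumerate(self.checks, 1):
            lines.append(
                f"{step}. checked {c.what} ({c.how}) because {c.why} -> {c.learned}"
            )
        return "\n".join(lines)

log = TroubleshootingLog("hosts in rack 12 unreachable")
log.record("ToR uplinks", "all hosts in one rack failed together",
           "interface counters", "both uplinks up, no errors")
log.record("routes for the rack prefix", "links looked healthy",
           "routing table on a spine", "rack prefix missing")
print(log.as_flow())
```

In the post-mortem, each recorded step is a candidate question: could telemetry have answered it without anyone logging in to a device?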

I don’t know if you already do post-mortems or not, or how valuable you think they are, but I would suggest they can be, and are, quite useful, so long as you get out of the narrows and focus on systems and workflows. Aim small, miss big; but aim big and you’ll either hit the target or, at worst, miss small.

Understanding DC Fabric Complexity

When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.

Some researchers at USC and VMware have thought about complexity in the physical infrastructure, however, and they wrote a rather interesting paper about it.

The paper begins by defining what complexity in the physical infrastructure of a DC fabric looks like. They focus on packaging, or the layout of the switches in the fabric, the bundles of cabling required to wire the topology, and the number and locations of patch panels required. The packaging and patch panels impact the length and complexity of the cable runs (whether optical or copper), which represents a base complexity for the entire topology.

The second thing they consider is the lifecycle of the physical fabric infrastructure. What steps are required to upgrade the fabric from a smaller configuration to a larger one? Or from a lower speed (higher oversubscription) to a higher speed (lower oversubscription)? The result is the ability to put a number on the overall complexity of each topology.

The first class of topologies they consider is spine-and-leaf, such as the Clos, Benes, and butterfly fabrics. They call all kinds of spine-and-leaf fabrics Clos fabrics. Spine-and-leaf fabrics, they note, generally have very low cabling complexity because their symmetry encourages consistent bundling and hardware placement. They call the second kind of topology expander fabrics; the most common fabric in this class is the dragonfly. These topologies are more difficult to wire but simpler to scale out because they can be expanded largely by modifying just the edge of the fabric. Their analysis shows these classes of fabric rate equally on their complexity scale.

A side note they don’t consider in the paper—their complexity computation implies that if you are building a fabric with a somewhat fixed range of sizes, and you can preplan the location of spines leaving enough room for the maximum sized fabric on the first day, spine-and-leaf fabrics are less complex than the fancier topologies you might hear about from time to time. Since most data center fabrics do, in fact, fall into these kinds of constraints (given a good day one designer!), this seems to validate the widespread use of butterfly and Clos fabrics for most applications. This feels like a significant result for most common data center fabric designs.

Finally, they describe an interesting topology they call FatClique, which is an interesting blend of spine-and-leaf and edge expander topologies; I’ve screen grabbed the image from the paper below.

The way this topology is described feels very much like a Benes to me, or a butterfly where the fabric routers are replaced by fabrics (making a seven-stage fabric). It’s hard to tell how useful this topology would be in real deployments, but that researchers are looking into alternatives other than the venerable spine-and-leaf is interesting in its own right. Overall, it’s well worth spending the time to read the entire paper if you have an in-depth interest in fabric design.

Quitting Certifications: When?

At what point in your career do you stop working towards new certifications?

Daniel Dib’s recent post on his blog is, I think, an excellent starting point, but I wanted to add a few additional thoughts to the answer he gives there.

Daniel’s first question is how do you learn? Certifications often represent a body of knowledge people who have a lot of experience believe is important, so they often represent a good guided path to holistically approaching a new body of knowledge. In the professional learning world this would be called a ready-made mental map. There is a counterargument here—certifications are often created by vendors as a marketing tool, rather than as something purely designed for the betterment of the community, or the dissemination of knowledge. This doesn’t mean, however, that certifications are “evil.” It just means you need to evaluate each certification on its own merits.

As an aside, I’ve been trying to start a non-vendor-specific certification for the last two years but have been struggling to find a group of people with the energy and excitement required to make it happen. To some degree, the reason certifications are vendor-based is because we, as a community, don’t do a good job at building them.

The second series of questions relate to your position—would a certification give you a bonus, help you get a new position, or give credibility to the company you work for? These are all valid questions requiring self-reflection around what you hope to achieve materially by working through the certification.

The final set of questions Daniel poses relate to whether a certification would give you what might be called authority in the network engineering world. Certifications, seen in this way, are a form of transitive trust. There are two components here—the certification blueprint tells you about the body of knowledge, and the certifying authority tells you about the credibility of the process. Given these two things, someone with a certification is saying “someone you trust has said I have this knowledge, so you should trust I have this knowledge as well.” The certification acts as a transit between you and the certified person, transferring some amount of your trust in the certifying organization to the person you are talking to.

There are other ways to build this kind of trust, of course. For instance, if you blog, or run a podcast, or are a frequent guest contributor, or have a lot of code on git, or have written some books, etc. In these cases, the trust is no longer transitive but direct—you can see a person has a body of knowledge because they have (at least to some degree) exposed that knowledge publicly.

All of these reasons are fine and good—but I think there is another point to think about in this discussion: what are you saying to the community? Once you stop “doing” certifications, you can be saying one of two things. The first is certifications are useless. If all the reasons for getting a certification above are true, then telling someone “certifications are useless” is not a good thing. Those who don’t care about certifications should rather take the position first, do no harm.

A stronger position would be to carefully evaluate existing certifications and help guide folks desiring a certification down a good path rather than a bad one. Which certifications are primarily vendor marketing programs? Which are technically sound? If you are at the point where you are no longer going to pursue certifications, these are questions you should be able to answer.

An even stronger position would be—if you’re at the point where you do not think you need to be certified, where you have a “body of knowledge” that allows people to directly trust your work, then perhaps you should also be at a point where you are helping guide the development of certifications in some way.

Certifications (whether to get them or not, and when to stop caring about them) are rather more nuanced than many in the networking world make out. There are valid reasons for, and valid reasons against, and in general, I think we need to do better at developing and policing certifications in order to build a stronger community.