The 4D Network

3 August 2020 | Comments Off on The 4D Network

I think we can all agree networks have become too complex—and this complexity is a result of the network often becoming the “final dumping ground” of every problem that seems like it might impact more than one system, or everything no-one else can figure out how to solve. It’s rather humorous, in fact, to see a lot of server and application folks sitting around saying “this networking stuff is so complex—let’s design something better and simpler in our bespoke overlay…” and then falling into the same complexity traps as they start facing the real problems of policy and scale.

This complexity cannot be “automated away.” It can be smeared over with intent, but we’re going to find—soon enough—that smearing intent on top of complexity just makes for a dirty kitchen and a sub-standard meal.

Smart Network or Dumb?

27 July 2020 | Comments Off on Smart Network or Dumb?

Should the network be dumb or smart? Network vendors have recently focused on making the network as smart as possible because there is a definite feeling that dumb networks are quickly becoming a commodity—and it’s hard to see where and how steep profit margins can be maintained in a commodifying market. Software vendors, on the other hand, have been encroaching on the network space by “building in” overlay network capabilities, especially in virtualization products. VMWare and Docker come immediately to mind; both are either able to, or working towards, running on a plain IP fabric, reducing the number of services provided by the network to a minimum level (of course, I’d have a lot more confidence in these overlay systems if they were a lot smarter about routing … but I’ll leave that alone for the moment).

How can this question be answered? One way is to think through what sorts of things need to be done in processing packets, and then think through where it makes most sense to do those things. Another way is to measure the accuracy or speed at which some of these “packet processing things” can be done so you can decide in a more empirical way. The paper I’m looking at today, by Anirudh et al., takes both of these paths in order to create a baseline “rule of thumb” about where to place packet processing functionality in a network.

To Route or Not?

1 June 2020 | Comments Off on To Route or Not?

When you are building a data center fabric, should you run a control plane all the way to the host? This is question I encounter more often as operators deploy eVPN-based spine-and-leaf fabrics in their data centers (for those who are actually deploying scale-out spine-and-leaf—I see a lot of people deploying hybrid sorts of networks designed as “mini-hierarchical” designs and just calling them spine-and-leaf fabrics, but this is probably a topic for another day). Three reasons are generally given for deploying the control plane all on the hosts attached to the fabric: faster down detection, load sharing, and traffic engineering. Let’s consider each of these in turn.

Ruminating on SOS

25 May 2020 | Comments Off on Ruminating on SOS

Many years ago I attended a presentation by Dave Meyers on network complexity—which set off an entire line of thinking about how we build networks that are just too complex. While it might be interesting to dive into our motivations for building networks that are just too complex, I starting thinking about how to classify and understand the complexity I was seeing in all the networks I touched. Of course, my primary interest is in how to build networks that are less complex, rather than just understanding complexity…

This led me to do a lot of reading, write some drafts, and then write a book. During this process, I ended coining what I call the complexity triad—State, Optimization, and Surface. If you read the book on complexity, you can see my views on what the triad consisted of changed through in the writing—I started out with volume (of state), speed (of state), and optimization. Somehow, though, interaction surfaces need to play a role in the complexity puzzle.

Understanding DC Fabric Complexity

11 May 2020 |

When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.

The Resilience Problem

27 April 2020 | Comments Off on The Resilience Problem

…we have educated generations of computer scientists on the paradigm that analysis of algorithm only means analyzing their computational efficiency. As Wikipedia states: “In computer science, the analysis of algorithms is the process of finding the computational complexity of algorithms—the amount of time, storage, or other resources needed to execute them.” In other words, efficiency is the sole concern in the design of algorithms. … What about resilience? —Moshe Y. Vardi

This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience—but at what cost in terms of efficiency? Its also obvious highly meshed data center fabrics have plenty of resilience—and yet they still sometimes fail. Why?

Learning from Failure at Scale

13 April 2020 |

One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.

Enterprise and Service Provider—Once more into the Windmill

16 March 2020 | Comments Off on Enterprise and Service Provider—Once more into the Windmill

There is no enterprise, there is no service provider—there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument—that hyperscaler’s problems are not your problems, the technologies and solutions providers user are fundamentally different than what enterprises require.

Let me try to recap some of the arguments I’ve heard used against my assertion.

The Hedge 24: Single Source of Truth

26 February 2020 | Comments Off on The Hedge 24: Single Source of Truth

Tim Schreyack recently presented at NANOG on the topic of building a single source of truth for network automation. Tim joins Tom and Russ in a wide-ranging discussion about single sources of truth, changing the way we see the network, and the changing skills of network engineers.

Letting go of Clean Design

10 February 2020 |

What is the best way to build a large-scale network—in two words? Ask ten networking folks (engineers, designers, or whatever else), and you’re likely to get the same answer from at least nine: clean abstractions. They might not say the word abstraction, of course; instead, they might say words like build things in modules, using summarization and aggregation to divide the modules up. Or they might say make certain to reduce the failure domain to the smallest you possible can everywhere you can. Or they might say use hierarchical design. These answers are, however, variants of the single word: abstraction.