Learning from the Post-Mortem

18 May 2020 | Comments Off on Learning from the Post-Mortem

Post-mortem reviews seem to be quite common in the software engineering and application development sides of the IT world—but I do not recall a lot of post-mortems in network engineering across my 30 years. This puzzling observation sprang to mind while I was reading a post over at the ACM this last week about how to effectively learn from the post-mortem exercise.

The common pattern seems to be setting aside a one hour meeting, inviting a lot of people, trying to shift blame while not actually saying you are shifting blame (because we are all supposed to live in a blame-free environment now—fix the problem, not the blame!), and then … a list is created on a whiteboard, pictures are taken, and everyone walks away with a rock-solid plan to never do that again.

Understanding DC Fabric Complexity

11 May 2020 |

When I think of complexity, I mostly consider transport protocols and control planes—probably because I have largely worked in these areas from the very beginning of my career in network engineering. Complexity, however, is present in every layer of the networking stack, all the way down to the physical. I recently ran across an interesting paper on complexity in another part of the network I had not really thought about before: the physical plant of a data center fabric.

Quitting Certifications: When?

4 May 2020 |

At what point in your career do you stop working towards new certifications?

Daniel Dibb’s recent post on his blog is, I think, an excellent starting point, but I wanted to add a few additional thoughts to the answer he gives there.

Daniel’s first question is how do you learn? Certifications often represent a body of knowledge people who have a lot of experience believe is important, so they often represent a good guided path to holistically approaching a new body of knowledge. In the professional learning world this would be called a ready-made mental map. There is a counterargument here—certifications are often created by vendors as a marketing tool, rather than as something purely designed for the betterment of the community, or the dissemination of knowledge. This doesn’t mean, however, that certifications are “evil.” It just means you need to evaluate each certification on its own merits.

The Resilience Problem

27 April 2020 | Comments Off on The Resilience Problem

…we have educated generations of computer scientists on the paradigm that analysis of algorithm only means analyzing their computational efficiency. As Wikipedia states: “In computer science, the analysis of algorithms is the process of finding the computational complexity of algorithms—the amount of time, storage, or other resources needed to execute them.” In other words, efficiency is the sole concern in the design of algorithms. … What about resilience? —Moshe Y. Vardi

This quote set me to thinking about how efficiency and resilience might interact, or trade off against one another, in networks. The most obvious extreme cases are two routers connected via a single long-haul link and the highly parallel data center fabrics we build today. Obviously adding a second long-haul link would improve resilience—but at what cost in terms of efficiency? Its also obvious highly meshed data center fabrics have plenty of resilience—and yet they still sometimes fail. Why?

Reflections on Intent

20 April 2020 |

No, not that kind. 🙂

BGP security is a vexed topic—people have been working in this area for over twenty years with some effect, but we continuously find new problems to address. Today I am looking at a paper called BGP Communities: Can of Worms, which analyses some of the security problems caused by current BGP community usage in the ‘net. The point I want to think about here, though, is not the problem discussed in the paper, but rather some of the larger problems facing security in routing.

Learning from Failure at Scale

13 April 2020 |

One of the difficulties for the average network operator trying to understand their failure rates and reasons is they just don’t have enough devices, or enough incidents, to make informed observations. If you have a couple of dozen switches, it is often hard to understand how often software defects take a device down versus human error (Mean Time Between Mistakes, or MTBM). As networks become larger, however, more information becomes available, and more interesting observations can be made. A recent paper written in conjunction with Facebook uses information from Facebook’s data center fabrics to make some observations about the rate and severity of different kinds of failures—needless to say, the results are fairly interesting.

Understanding Internet Peering

6 April 2020 | Comments Off on Understanding Internet Peering

The world of provider interconnection is a little … “mysterious” … even to those who work at transit providers. The decision of who to peer with, whether such peering should be paid, settlement-free, open, and where to peer is often cordoned off into a separate team (or set of teams) that don’t seem to leak a lot of information. A recent paper on current interconnection practices published in ACM SIGCOMM sheds some useful light into this corner of the Internet, and hence is useful for those just trying to understand how the Internet really works.

Working from Home: Myth and Reality

30 March 2020 |

The last few weeks have seen a massive shift towards working from home because of the various “stay at home” orders being put in place around the world—a trend I consider healthy in the larger scheme of things. Of course, there has also been an avalanche of “tips for working from home” articles. I figured I’d add my own to the pile.

An Interesting take on Mapping an Attack Surface

23 March 2020 | Comments Off on An Interesting take on Mapping an Attack Surface

Security often lives in one of two states. It’s either something “I” take care of, because my organization is so small there isn’t anyone else taking care of it. Or it’s something those folks sitting over there in the corner take care of because the organization is, in fact, large enough to have a separate security team. In both cases, however, security is something that is done to networks, or something thought about kind-of off on its own in relation to networks.

I’ve been trying to think of ways to challenge this way of thinking for many years—a long time ago, in a universe far away, I created and gave a presentation on network security at Cisco Live (raise your hand if you’re old enough to have seen this presentation!).

Enterprise and Service Provider—Once more into the Windmill

16 March 2020 | Comments Off on Enterprise and Service Provider—Once more into the Windmill

There is no enterprise, there is no service provider—there are problems, and there are solutions. I’m certain everyone reading this blog, or listening to my podcasts, or listening to a presentation I’ve given, or following along in some live training or book I’ve created, has heard me say this. I’m also certain almost everyone has heard the objections to my argument—that hyperscaler’s problems are not your problems, the technologies and solutions providers user are fundamentally different than what enterprises require.

Let me try to recap some of the arguments I’ve heard used against my assertion.