DESIGN – Page 10 – rule 11 reader

Design Intelligence from the Hourglass Model

Over at the Communications of the ACM, Micah Beck has an article up about the hourglass model. While the math is quite interesting, I want to focus on transferring the observations from the realm of protocol and software systems development to network design. Specifically, start with the concept and terminology, which is very useful. Taking a typical design, such as this—

The first key point made in the paper is this—

The thin waist of the hourglass is a narrow straw through which applications can draw upon the resources that are available in the less restricted lower layers of the stack.

A somewhat obvious point to be made here is applications can only use services available in the spanning layer, and the spanning layer can only build those services out of the capabilities of the supporting layers. If fewer applications need to be supported, or the applications deployed do not require a lot of “fancy services,” a weaker spanning layer can be deployed. Based on this, the paper observes—

The balance between more applications and more supports is achieved by first choosing the set of necessary applications N and then seeking a spanning layer sufficient for N that is as weak as possible. This scenario makes the choice of necessary applications N the most directly consequential element in the process of defining a spanning layer that meets the goals of the hourglass model.

Beck calls the weakest possible spanning layer to support a given set of applications the minimally sufficient spanning layer (MSSL). There is one thing that seems off about this definition, however—the correlation between the number of applications supported and the strength of the spanning layer. There are many cases where a network supports thousands of applications, and yet the network itself is quite simple. There are many other cases where a network supports just a few applications, and yet the network is very complex. It is not the number of applications that matter, it is the set of services the applications demand from the spanning layer.

Based on this, we can change the definition slightly: an MSSL is the weakest spanning layer that can provide the set of services required by the applications it supports. This might seem intuitive or obvious, but it is often useful to work these kinds of intuitive things out, so they can be expressed more precisely when needed.

First lesson: the primary driver in network complexity is application requirements. To make the network simpler, you must reduce the requirements applications place on the network.

There are, however, several counter-intuitive cases here. For instance, TCP is designed to emulate (or abstract) a circuit between two hosts—it creates what appears to be a flow controlled, error free channel with no drops on top of IP, which has no flow control, and drops packets. In this case, the spanning layer (IP), or the wasp waist, does not support the services the upper layer (the application) requires.

In order to make this work, TCP must add a lot of complexity that would normally be handled by one of the supporting layers—in fact, TCP might, in some cases, recreate capabilities available in one of the supporting layers, but hidden by the spanning layer. There are, as you might have guessed, tradeoffs in this neighborhood. Not only are the mechanisms TCP must use more complex that the ones some supporting layer might have used, TCP represents a leaky abstraction—the underlying connectionless service cannot be completely hidden.

Take another instance more directly related to network design. Suppose you aggregate routing information at every point where you possibly can. Or perhaps you are using BGP route reflectors to manage configuration complexity and route counts. In most cases, this will mean information is flowing through the network suboptimally. You can re-optimize the network, but not without introducing a lot of complexity. Further, you will probably always have some form of leaky abstraction to deal with when abstracting information out of the network.

Second lesson: be careful when stripping information out of the spanning layer in order to simplify the network. There will be tradeoffs, and sometimes you end up with more complexity than what you started with.

A second counter-intuitive case is that of adding complexity to the supporting layers in order to ultimately simplify the spanning layer. It seems, on the model presented in the paper, that adding more services to the spanning layer will always end up adding more complexity to the entire system. MPLS and Segment Routing (SR), however, show this is not always true. If you need traffic steering, for instance, it is easier to implement MPLS or SR in the support layer rather than trying to emulate their services at the application level.

Third lesson: sometimes adding complexity in a lower layer can simplify the entire system—although this might seem to be counter-intuitive from just examining the model.

The bottom line: complexity is driven by applications (top down), but understanding the full stack, and where interactions take place, can open up opportunities for simplifying the overall system. The key is thinking through all parts of the system carefully, using effective mental models to understand how they interact (interaction surfaces), and the considering the optimization tradeoffs you will make by shifting state to different places.

Posted in DESIGN, RESEARCH, WRITTEN

DORA, DevOps, and Lessons for Network Engineers

DevOps Research and Assessment (DORA) released their 2018 Accelerate report on the state of DevOps at the end of 2018; I’m a little behind in my reading, so I just got around to reading it, and trying to figure out how to apply their findings to the infrastructure (networking) side of the world.

DORA found organizations that outsource entire functions, such as building an entire module or service, tend to perform more poorly than organizations that outsource by integrating individual developers into existing internal teams (page 43). It is surprising companies still think outsourcing entire functions is a good idea, given the many years of experience the IT world has with the failures of this model. Outsourced components, it seems, too often become a bottleneck in the system, especially as contracts constrain your ability to react to real-world changes. Beyond this, outsourcing an entire function not only moves the work to an outside organization, but also the expertise. Once you have lost critical mass in an area, and any opportunity for employees to learn about that area, you lose control over that aspect of your system.

DORA also found a correlation between faster delivery of software and reduced Mean Time To Repair (MTTR) (page 19). On the surface, this makes sense. Shops that delivery software continuously are bound to have faster, more regularly exercised processes in place for developing, testing, and rolling out a change. Repairing a fault or failure requires change; anything that improves the speed of rolling out a change is going to drive MTTR down.

Organizations that emphasize monitoring and observability tended to perform better than others (page 55). This has major implications for network engineering, where telemetry and management are often “bolted on” as an afterthought, much like security. This is clearly not optimal, however—telemetry and network management need to be designed and operated like any other application. Data sources, stores, presentation, and analysis need to be segmented into separate services, so new services can be tried out on top of existing data, and new sources can feed into existing services. Network designers need to think about how telemetry will flow through the management system, including where and how it will originate, and what it will be used for.

These observations about faster delivery and observability should drive a new way of thinking about failure domains; while failure domains are often primarily thought of as reducing the “blast radius” when a router or link fails, they serve two much larger roles. First, failure domain boundaries are good places to gather telemetry because this is where information flows through some form of interaction surface between two modules. Information gathered at a failure domain boundary will not tend to change as often, and it will often represent the operational status of the entire module.

Second, well places failure domain boundaries can be used to stake out areas where “new things” can be put in operation with some degree of confidence. If a network has well-designed failure domain boundaries, it is much easier to deploy new software, hardware, and functionality in a controlled way. This enables a more agile view of network operations, including the ability to roll out changes incrementally through a canary process, and to use processes like chaos monkey to understand and correct unexpected failure modes.

Another interesting observation is the j-curve of adoption (page 3):

This j-curve shows the “tax” of building the underlying structures needed to move from a less automated state to a more automated one. Keith’s Law:

In a complex system, the cumulative effect of a large number of small optimizations is externally indistinguishable from a radical leap.

…operates in part because of this j-curve. Do not be discouraged if it seems to take a lot of work to make small amounts of progress in many stages of system development—the results will come later.

The bottom line: it might seem like a report about software development is too far outside the realm of network engineering to be useful—but the reality is network engineers can learn a lot about how to design, build, and operate a network from software engineers.

Posted in DESIGN, RESEARCH, WRITTEN

Cloudy with a chance of lost context

One thing we often forget in our drive to “more” is this—context matters. Ivan hit on this with a post about the impact of controller failures in software-defined networks just a few days ago:

A controller (or management/provisioning) system is obviously the central point of failure in any network, but we have to go beyond that and ask a simple question: “What happens when the controller cluster fails and/or when nodes lose connectivity to the controller?”

What is the context in terms of a network controller? To broaden the question—what is the context for anything? The context, in philosophical terms, is the telos, the final end, or the purpose of the system you are building. Is the point of the network to provide neat gee-whiz capabilities, or is it to move data in a way that creates value for the business?

Ivan’s article reminded me of another article I read recently about the larger world of cloud, IoT, and context problems by Bob Frankston., who says:

Having abundant server capacity available is wonderful, but it makes no sense to have simple local functions fail. For example, when turning on a porch light, it makes no sense to have the light switch fail if someone doesn’t renew a subscription or if a distant server fails.

We are currently building networks with the presupposition that cloud based services are more secure, more agile, and more reliable than we can build locally. Somehow we think “the cloud” will never—can never—fail.

But of course, cloud services can fail. We know this because they have. Office 365 has failed a number of times, including this one. Azure failed at least once in 2014 due to operator error. Dyn was taken down by a DDoS attack not so long ago. Google cloud failed not too long ago, as well. Back in 2016, Salesforce suffered a major outage.

The point is not that we should not use cloud services. The point is that systems should be designed so any form of failure in a centralized component should not cause local services, at least at some minimal level, to fail. And yet, how many networks today rely on cloud-based services to operate at all, much less correctly? Would your company’s email still be delivered if your cloud based anti-spam software service failed? Would your company’s building locks work if the cloud based security service you use goes down?

If they would not, do you have a contingency plan in place for when these services do fail? Can you open the dors manually, do you have a way to turn off the spam filtering, do you have a way to manually configure the network to “get by” until service can be restored? Do you know which data flows must be supported in the case of an outage of this sort, or what the impact on the entire system will be? Do you have a plan to restore the network and its services to their optimal state once the external services optimal operation relies on are available after an outage?

A larger point can be made here: when pulling services into the network itself, we often rely on unexamined presuppositions, and we often leave the system with unexamined and little-understood risks.

Context matters—what is it you are trying to do? Do you really need to do it this way? What happens if the centralized or cloudy system you rely on fails? It’s not that we should not use or deploy cloud-based systems—but we should try to expose and understand the failure modes and risks we are building into our networks.

Posted in DESIGN, WRITTEN

It’s not a CLOS, it’s a Clos

Way back in the day, when telephone lines were first being installed, running the physical infrastructure was quite expensive. The first attempt to maximize the infrastructure was the party line. In modern terms, the party line is just an Ethernet segment for the telephone. Anyone can pick up and talk to anyone else who happens to be listening. In order to schedule things, a user could contact an operator, who could then “ring” the appropriate phone to signal another user to “pick up.” CSMA/CA, in essence, with a human scheduler.

This proved to be somewhat unacceptable to everyone other than various intelligence agencies, so the operator’s position was “upgraded.” A line was run to each structure (house or business) and terminated at a switchboard. Each line terminated into a jack, and patch cables were supplied to the operator, who could then connect two telephone lines by inserting a jumper cable between the appropriate jacks.

An important concept: this kind of operator driven system is nonblocking. If Joe calls Susan, then Joe and Susan cannot also talk to someone other than one another for the duration of their call. If Joe’s line is tied up, when someone tries to call him, they will receive a busy signal. The network is not blocking in this case, the edge is—because the node the caller is trying to reach is already using 100% of its available bandwidth for an existing call. This is called admission control; all non-blocking networks must either be provisioned so the sum total of the transmitters cannot consume the whole of the cross-sectional bandwidth available in the network, or they must have effective admission control.

Blocking networks did exist in the form of trunk connections, or connections between these switch panels. Trunk connections not only consumed ports on the switchboard, they were expensive to build, and required a lot of power to run. Hence, making a “long distance call” costs money because it consumed a blocking resource. It is only when we get to packet switched digital networks that the cost of a “long distance call” drops to the rough equivalent of a “normal” call, and we see “long distance” charges fade into memory (many of my younger readers have never been charged for “long distance calls,” in fact, and may not even know what I’m talking about).

The human operators were eventually replaced by relays.

The old “crank to ring” phones were replaced with “dial phones,” which break the circuit between the phone and the relay to signal a digit being dialed. The process begins when you lift the handset, which causes voltage to run across the line. Spinning the dial breaks the circuit once for each digit on the dial. Each time the circuit breaks, the armature on this stack of circular relays is moved. The first digit dialed causes the relay to move up (or down, depending on the relay) the stack. When you pause for a second before dialing the second number, the motors reset so the arm will now move around the circle, starting with the zero position. When it reaches this spot, the arm connects your physical line to another relay, which then repeats the process, ultimately reaching the line of the person you want to call and creating a direct electrical connection between the caller and the receiver. We used to have huge banks of these relays, accompanied by 66 style punch-down blocks, and sometimes (in newer systems) spin-down connections for permanent circuits. We mostly “troubleshot” them with WD-40 and electrical contact solution from spray cans (yes, I have personal experience).

These huge relay banks became unwieldy to support, of course, so they were eventually replaced with fabrics—crossbar fabrics.

In a crossbar fabric, each device is “attached twice,” once on the send side, and once on the receive side. When you place a call, your “send” is connected to the receivers “receive” by making an electrical connection where your “send” line intersects with the receivers “receive” line. All of this hardware, however, has the property of scaling up rather than out. In order to create a larger telephone switch, you simply have to buy a larger fabric of some kind. You cannot “add to” a fabric once it is built, which means when a region outgrows its fabric, you either must install a new on and connect the two fabrics with trunk lines, you must rip the entire fabric out and replace it.

This is the problem Edson Erwin, and later Charles Clos, were trying to solve—how do you build a large-scale switching fabric out of simple components, and yet scale to virtually any size. While Erwin invented the concept in 1938, the concept was formalized by Clos in 1952. Charles Clos, by the way, was French, so the proper pronunciation is “klo,” with a long “o.”

The solution was to use four port switches (or relays, in the case of circuits), each of which are connected into a familiar looking fabric.

This was designed to be a unidirectional fabric, with each edge node (leaf) having a send and receive side, much like a crossbar. It is fairly obvious that the fabric is nonblocking on admission control; if a1 connects to b3, then a1’s entire bandwidth is consumed in that connection, so a1 cannot accept any other connections. If d1 also tries to connect to b3, it will receive the traditional “busy signal.” The network does not block, it just refuses to accept the new connection at the edge.

What if you wanted to make this fabric bidirectional? To allow traffic to flow from a1 to b3 while also allowing traffic to flow from a3 to, say, d1? You would “fold” the fabric. This does not involve changing the physical layout at all; folded Clos fabrics are not drawn any differently than non-folded Clos fabrics. They look identical on paper (stop drawing them as if putting a1 and a3 on the same side of the fabric makes them into a “two-stage folded fabric”—this is not what “folded” means). The primary change here is in the “control plane,” or the scheduler, which must be modified to allow traffic to flow in either direction. Packet switching is almost always bidirectional, so all Clos fabrics used in packet switched networks are “folded” in this way.

Another interesting point—you cannot really perform “acceptance flow control” on a packet-switched network, so there is no way to make the fabric “nonblocking.” In a circuit-based network, the number of edge ports can be tied to the amount of cross-sectional bandwidth available in the fabric. You can over-subscribe the fabric by provisioning more ports on the edge than the cross-sectional bandwidth of the fabric itself can support, of course, which makes the fabric non-contending, rather than non-blocking. In a packet-switched network, it just is not possible to perform reliable admission control; because of this, packet switched Clos fabrics, then, are always non-contending rather than non-blocking.

So the next time you put CLOS in a document, or on a white board, don’t. It’s a Clos fabric, not a CLOS fabric. It’s not non-blocking, and you do not “fold” it by drawing it with two stages. Clos fabrics always have an odd number of stages. There is a mathematical reason for this, but this post is already long enough.

Another problem with our common usage of these terms is we call every kind of spine-and-leaf (or leaf-and-spine) fabric a Clos, and we have started calling all kinds of networks, including overlay networks, “fabrics.” This post is already long (as above), so I will leave these for the future.

If you liked this short article, and would like to understand more about fabrics and fabric design, please join me for me upcoming Data Center Design live webinar on Safari Books. I will cover this history there, but I will also cover a lot of other things involved in designing data center fabrics.

Posted in DESIGN, TECH, WRITTEN

When should you use IPv6 PA space?

I was reading RFC8475 this week, which describes some IPv6 multihoming ‘net connection solutions. This set me to thinking about when you should uses IPv6 PA space. To begin, it’s useful to review the concept of IPv6 PI and PA space.

PI, or provider independent, space, is assigned by a regional routing registry to network operators who can show they need an address space that is not tied to a service provider. These requirements generally involve having a specific number of hosts, showing growth in the number of IPv6 addresses used over time, and other factors which depend on the regional registry providing the address space. PA, or provider assigned, IPv6 addresses can be assigned by a provider from their PI pool to an operator to which they are providing connectivity service.

There are two main differences between these two kinds of addresses. PI space is portable, which means the operator can take the address space when them when they change providers. PI space is also fixed; it is (generally) safe to use PI space as you might private or other IP address spaces; you can assign them to individual subnets, hosts, etc., and count on them remaining the same over time. If everyone obtained PI space, however, the IPv6 routing table in the default free zone (DFZ) could explode. PI space cannot be aggregated by the operator’s upstream provider because it is portable in just this way.

PA space, on the other hand, can be aggregated by your upstream provider because it is assigned by the provider. On the other hand, the provider can change the address block assigned to its customer at any time. The general idea is the renumbering capabilities built into IPv6 make it possible to “not care” about the addresses assigned to individual hosts on your network.

How does this work out in real life? Consider the following network—

Assume AS65000 assigns 2001:db8:1:1::/64 to the operator, and AS65001 assigns 2001:db8:2:2::/64. IPv6 provides mechanisms for A, B, and D to obtain addresses from within these two ranges, so each device has two IP addresses. Now assume A wants to send a packet to some site connected to the public Internet. If it sources the packet from its address in the 1:1::/64 range, A should send the packet to E; if it sources the packet from its address in the 2:2::/64 range, it should send the packet to F. This behavior is described in RFC6724, rule 5.5:

Prefer addresses in a prefix advertised by the next-hop. If SA or SA’s prefix is assigned by the selected next-hop that will be used to send to D and SB or SB’s prefix is assigned by a different next-hop, then prefer SA. Similarly, if SB or SB’s prefix is assigned by the next-hop that will be used to send to D and SA or SA’s prefix is assigned by a different next-hop, then prefer SB.

If the host uses this solution, it needs to remember which sources it has used in the past for which destinations—at least for the length of a single session, at any rate. RFC6724, however, notes supporting rule 5.5 is a SHOULD, rather than a MUST, which means some hosts may not have this capability. What should the network operator do in this case?

RFC8475 suggests using a set of policy based routing or filter based forwarding policies at the routers in the network to compensate. If E, for instance, receives a packet with a source in the range 2:2::/64 range, it should route the packet to F for forwarding. This will generally mean forwarding the packet out the interface on which E received the packet. Likewise, F should have a local policy that forwards packets it receives with a source address in the 1:1::/64 range to E. RFC8475 provides several examples of policies which would work for a number of different situations (for active/standby, active/active, one border router or two).

There are, however, two failure modes here. For the first one, assume AS65000 decides to assign the operator another IPv6 address range. IPv6’s renumbering capabilities will take care of getting the correct addresses onto A, B, and D—but the policies at E and F must be manually updated for the new address space to work correctly. It should be possible to automate the management of these filters in some way, of course, but the complexity injected into the network is larger than you might initially think.

The second failure relates to a deeper problem. What if B is not allowed, by policy, to talk to D? If the addressing in the network were consistent, the operator could set up a filter at C to prevent traffic from flowing between these two devices. When the network is renumbered, any such filters must be reconfigured, of course. A second instance of this same kind of failure: what if D is the internal DNS server for the network? While the DNS server’s address can be pushed out through the IPv6 renumbering capabilities (through NA, specifically), some manual or automated configuration must be adjusted to get the new address into the IPv6 advertisements so it can be distributed.

The short answer to the question above—when should you use PA space for a network?—comes down to this: when you have a very small, simple network behind a set of routers connecting to the ‘net, where the hosts attached to the network support RFC6724 rule 5.5, and intra-site communication is very simple (or there is no intra-site communications at all). Essentially, renumbering is not the only problem to solve when renumbering a network.

Posted in DESIGN, WRITTEN

The EIGRP SIA Incident: Positive Feedback Failure in the Wild

Reading a paper to build a research post from (yes, I’ll write about the paper in question in a later post!) jogged my memory about an old case that perfectly illustrated the concept of a positive feedback loop leading to a failure. We describe positive feedback loops in Computer Networking Problems and Solutions, and in Navigating Network Complexity, but clear cut examples are hard to find in the wild. Feedback loops almost always contribute to, rather than independently cause, failures.

Many years ago, in a network far away, I was called into a case because EIGRP was failing to converge. The immediate cause was neighbor flaps, in turn caused by Stuck-In-Active (SIA) events. To resolve the situation, someone in the past had set the SIA timers really high, as in around 30 minutes or so. This is a really bad idea. The SIA timer, in EIGRP, is essentially the amount of time you are willing to allow your network to go unconverged in some specific corner cases before the protocol “does something about it.” An SIA event always represents a situation where “someone didn’t answer my query, which means I cannot stay within the state machine, so I don’t know what to do—I’ll just restart the state machine.” Now before you go beating up on EIGRP for this sort of behavior, remember that every protocol has a state machine, and every protocol has some condition under which it will restart the state machine. IT just so happens that EIGRP’s conditions for this restart were too restrictive for many years, causing a lot more headaches than they needed to.

So the situation, as it stood at the moment of escalation, was that the SIA timer had been set unreasonably high in order to “solve” the SIA problem. And yet, SIAs were still occurring, and the network was still working itself into a state where it would not converge. The first step in figuring this problem out was, as always, to reduce the number of parallel links in the network to bring it to a stable state, while figuring out what was going on. Reducing complexity is almost always a good, if counterintuitive, step in troubleshooting large scale system failure. You think you need the redundancy to handle the system failure, but in many cases, the redundancy is contributing to the system failure in some way. Running the network in a hobbled, lower readiness state can often provide some relief while figuring out what is happening.

In this case, however, reducing the number of parallel links only lengthened the amount of time between complete failures—a somewhat odd result, particularly in the case of EIGRP SIAs. Further investigation revealed that a number of core routers, Cisco 7500’s with SSE’s, were not responding to queries. This was a particularly interesting insight. We could see the queries going into the 7500, but there was no response. Why?

Perhaps the packets were being dropped on the input queue of the receiving box? There were drops, but not nearly enough to explain what we were seeing. Perhaps the EIGRP reply packets were being dropped on the output queue? No—in fact, the reply packets just weren’t being generated. So what was going on?

After collecting several show tech outputs, and looking over them rather carefully, there was one odd thing: there was a lot of free memory on these boxes, but the largest block of available memory was really small. In old IOS, memory was allocated per process on an “as needed basis.” In fact, processes could be written to allocate just enough memory to build a single packet. Of course, if two processes allocate memory for individual packets in an alternating fashion, the memory will be broken up into single packet sized blocks. This is, as it turns out, almost impossible to recover from. Hence, memory fragmentation was a real thing that caused major network outages.

Here what we were seeing was EIGRP allocating single packet memory blocks, along with several other processes on the box. The thing is, EIGRP was actually allocating some of the largest blocks on the system. So a query would come in, get dumped to the EIGRP process, and the building of a response would be placed on the work queue. When the worker ran, it could not find a large enough block in which to build a reply packet, so it would patiently put the work back on its own queue for future processing. In the meantime, the SIA timer is ticking in the neighboring router, eventually timing out and resetting the adjacency.

Resetting the adjacency, of course, causes the entire table to be withdrawn, which, in turn, causes… more queries to be sent, resulting in the need for more replies… Causing the work queue in the EIGRP process to attempt to allocate more packet sized memory blocks, and failing, causing…

You can see how this quickly developed into a positive feedback loop—

EIGRP receives a set of queries to which it must respond
EIGRP allocates memory for each packet to build the responses
Some other processes allocate memory blocks interleaved with EIGRP’s packet sized memory blocks
EIGRP receives more queries, and finds it cannot allocate a block to build a reply packet
EIGRP SIA timer times out, causing a flood of new queries…

Rinse and repeat until the network fails to converge.

There are two basic problems with positive feedback loops. The first is they are almost impossible to anticipate. The interaction surfaces between two systems just have to be both deep enough to cause unintended side effects (the law of leaky abstractions almost guarantees this will be the case at least some times), and opaque enough to prevent you from seeing the interaction (this is what abstraction is supposed to do). There are many ways to solve positive feedback loops. In this case, cleaning up the way packet memory was allocated in all the processes in IOS, and, eventually, giving the active process in EIGRP an additional, softer, state before it declared a condition of “I’m outside the state machine here, I need to reset,” resolved most of the incidents of SIA’s in the real world.

But rest assured—there are still positive feedback loops lurking in some corner of every network.

Posted in DESIGN, TECH, WRITTEN

If you haven’t found the tradeoff…

This week, I ran into an interesting article over at Free Code Camp about design tradeoffs. I’ll wait for a moment if you want to go read the entire article to get the context of the piece… But this is the quote I’m most interested in:

[time-span]

Just like how every action has an equal and opposite reaction, each “positive” design decision necessarily creates a “negative” compromise. Insofar as designs necessarily create compromises, those compromises are very much intentional. (And in the same vein, unintentional compromises are a sign of bad design.)

In other words, design is about making tradeoffs. If you think you’ve found a design with no tradeoffs, well… Guess what? You’ve not looked hard enough. This is something I say often enough, of course, so what’s the point? The point is this: We still don’t really think about this in network design. This shows up in many different places; it’s worth taking a look at just a few.

Hardware is probably the place where network engineers are most conscious of design tradeoffs. Even so, we still tend to think sticking a chassis in a rack is a “future and requirements proof solution” to all our network design woes. With a chassis, of course, we can always expand network capacity with minimal fuss and muss, and since the blades can be replaced, the life cycle of the chassis should be much, much, longer than any sort of fixed configuration unit. As for port count, it seems like it should always be easier to replace line cards than to replace or add a box to get more ports, or higher speeds.

Cross posted at CircleID

But are either of these really true? While it might “just make sense” that a chassis box will last longer than a fixed configuration box, is there real data to back this up? Is it really a lower risk operation to replace the line cards in a chassis (including the brains!) with a new set, rather than building (scaling) out? And what about complexity—is it better to eat the complexity in the chassis, or the complexity in the network? Is it better to push the complexity into the network device, or into the network design? There are actually plenty of tradeoffs consider here, as it turns out—it just sometimes takes a little out of the box thinking to find them.

What about software? Network engineers tend to not think about tradeoffs here. After all, software is just that “stuff” you get when you buy hardware. It’s something you cannot touch, which means you are better off buying software with every feature you think you might ever need. There’s no harm in this right? The vendor is doing all the testing, and all the work of making certain every feature they include works correctly, right out of the box, so just throw the kitchen sink in there, too.

Or maybe not. My lesson here came through an experience in Cisco TAC. My pager went off one morning at 2AM because an image designed to test a feature in EIGRP had failed in production. The crash traced back to some old X.25 code. The customer didn’t even have X.25 enabled anyplace in their network. The truth is that when software systems get large enough, and complex enough, the laws of leaky abstractions, large numbers, and unintended consequences take over. Software defined is not a panacea for every design problem in the world.

What about routing protocols? The standards communities seem focused on creating and maintaining a small handulf of routing protocols, each of which is capable of doing everything. After all, who wants to deploy a routing protocol only to discover, a few years later, that it cannot handle some task that you really need done? Again, maybe not. BGP itself is becoming a complex ecosystem with a lot of interlocking parts and pieces. What started as a complex idea has become more complex over time, and we now have engineers who (seriously) only know one routing protocol—because there is enough to know in this one protocol to spend a lifetime learning it.

In all these situations we have tried to build a universal where a particular would do just fine. There is another side to this pendulum, of course—the custom built network of snowflakes. But the way to prevent snowflakes is not to build giant systems that seem to have every feature anyone can ever imagine.

The way to prevent landing on either extreme—in a world where every device is capable of anything, but cannot be understood by even the smartest of engineers, and a world where every device is uniquely designed to fit its environment, and no other device will do—is consider the tradeoffs.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

A corollary rule, returning to the article that started this rant, is this: unintentional compromises are a sign of bad design.

Posted in DESIGN, WRITTEN