For any field of study, there are some mental habits that will make you an expert over time. Whether you are an infrastructure architect, a network designer, or a network reliability engineer, what are the habits of mind those involved in the building and operation of networks follow that mark out expertise?
Experts involve the user
Experts don’t just listen to the user, they involve the user. This means taking the time to teach the developer or application owner how their applications interact with the network, showing them how their applications either simplify or complicate the network, and the impact of these decisions on the overall network.
Experts think about data
Rather than applications. What does the data look like? How does the business use the data? Where does the data need to be, when does it need to be there, how often does it need to go, and what is the cost of moving it? What might be in the data that can be harmful? How can I protect the data while at rest and in flight?
Experts think in modules, surfaces, and protocols
Devices and configurations can, and should, change over time. The way a problem is broken up into modules and the interaction surfaces (interfaces) between those modules can be permanent. Choosing the wrong protocol means choosing a different protocol to solve every problem, leading to accretion of complexity, ossification, and ultimately brittleness. Break the problem up right the first time, and choose the protocols carefully, and let the devices and configurations follow.
Choosing devices first is like selecting the hammer you’re going to use to build a house, and then selecting the design and materials used in the house based on what you can use the hammer for.
Experts think about tradeoffs
State, optimization, and surface form an ironclad tradeoff. If you increase state, you increase complexity while also increasing optimization. If you increase surfaces through abstraction, you are both increasing and decreasing state, which has an impact both on complexity and optimization. All nontrivial abstractions leak. Every time you move data you are facing the speed of serialization, queueing, and light, and hence you are dealing with the choice between consistency, availability, and partition tolerance.
If you haven’t found the tradeoffs, you haven’t looked hard enough.
Experts focus on the essence
Every problem has an essential core—something you are trying to solve, and a reason for solving it. Experts know how to divide between the essential and the nonessential. Experts think about what they are not designing, and what they are not trying to accomplish, as well as what they are. This doesn’t mean the rest isn’t there, it just means it’s not quite in focus all the time.
Experts are mentally stimulated to simulate
Labs are great—but moving beyond the lab and thinking about how the system works as a whole is better. Experts mentally simulate how the data moves, how the network converges, how attackers might try to break in, and other things besides.
Experts look around
Interior designers go to famous spaces to see how others have designed before them. Building designers walk through cities and famous buildings to see how others have designed before them. The more you know about how others have designed, the more you know about the history of networks, the more of an expert you will be.
Experts reshape the problem space
Experts are unafraid to think about the problem in a different way, to say “no,” and to try solutions that have not been tried before. Best common practice is a place to start, not a final arbiter of all that is good and true. Experts do not fall to the “is/ought” fallacy.
Experts treat problems as opportunities
Whether the problem is a mistake or a failure, or even a little bit of both, every problem is an opportunity to learn how the system works, and how networks work in general.
Backscatter is often used to detect various kinds of attacks, but how does it work? The paper under review today, Who Knocks at the IPv6 Door, explains backscatter usage in IPv4, and examines how effectively this technique might be used to detect scanning of IPv6 addresses, as well. The best place to begin is with an explanation of backscatter itself; the following network diagram will be helpful—
Assume A is scanning the IPv4 address space for some reason, for instance to find some open port on a host, or as part of a DDoS attack. When A sends an unsolicited packet to C, a firewall (or some similar edge filtering device), C will attempt to discover the source of this packet. It could be there is some local policy set up allowing packets from A, or perhaps A is part of some domain none of the devices from C should be connecting to. In order to discover more, the firewall will perform a reverse lookup. To do this, C takes advantage of the PTR DNS record, looking up the IP address to see if there is an associated domain name (this is explained in more detail in my How the Internet Really Works webinar, which I give every six months or so). This reverse lookup generates what is called backscatter; these backscatter events can be used to find hosts scanning the IP address space. Sometimes these scans are innocent, such as a web spider searching for HTML servers; other times, they could be a prelude to some sort of attack.
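To make the PTR step a little more concrete, here is a minimal Python sketch of the name a device like C would query when it performs the reverse lookup. It only builds the query name using the standard library's `ipaddress` module; it does not send any DNS traffic. The function name `ptr_query_name` and the two addresses (taken from the ranges reserved for documentation) are my own assumptions for illustration, not something from the paper.

```python
import ipaddress


def ptr_query_name(address: str) -> str:
    """Return the DNS name queried for the PTR record of an IPv4 or IPv6 address."""
    # reverse_pointer reverses the octets (IPv4) or nibbles (IPv6) and
    # appends the in-addr.arpa or ip6.arpa suffix.
    return ipaddress.ip_address(address).reverse_pointer


# Hypothetical scanner addresses from the documentation ranges.
print(ptr_query_name("192.0.2.10"))
print(ptr_query_name("2001:db8::1"))
```

The query for this name is the event that shows up as backscatter at whoever serves the reverse zone for the scanner's address.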
Kensuke Fukuda and John Heidemann. 2018. Who Knocks at the IPv6 Door?: Detecting IPv6 Scanning. In Proceedings of the Internet Measurement Conference 2018 (IMC ’18). ACM, New York, NY, USA, 231-237. DOI: https://doi.org/10.1145/3278532.3278553
Scanning the IPv6 address space is much more difficult because there are 2^128 addresses rather than 2^32. The paper under review here is one of the first attempts to understand backscatter in the IPv6 address space, which can lead to a better understanding of the ways in which IPv6 scanners are optimizing their search through the larger address space, and also to begin understanding how backscatter can be used in IPv6 for many of the same purposes as it is in IPv4.
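A rough back-of-the-envelope calculation shows why the larger address space changes the problem. The Python sketch below compares how long a brute-force sweep of each address space would take; the probe rate of one million packets per second is purely an assumption for illustration, not a figure from the paper.

```python
IPV4_ADDRESSES = 2 ** 32
IPV6_ADDRESSES = 2 ** 128
PROBES_PER_SECOND = 1_000_000  # assumed scan rate, not taken from the paper
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def sweep_seconds(address_count: int, rate: int = PROBES_PER_SECOND) -> float:
    """Seconds needed to probe every address once at the given rate."""
    return address_count / rate


ipv4_hours = sweep_seconds(IPV4_ADDRESSES) / 3600
ipv6_years = sweep_seconds(IPV6_ADDRESSES) / SECONDS_PER_YEAR

print(f"IPv4 sweep: about {ipv4_hours:.1f} hours")
print(f"IPv6 sweep: about {ipv6_years:.1e} years")
print(f"IPv6 is 2^{(IPV6_ADDRESSES // IPV4_ADDRESSES).bit_length() - 1} times larger")
```

Whatever rate you assume, an exhaustive sweep of IPv6 is out of reach, which is why scanners have to optimize their search rather than walk the whole space.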
The researchers begin by setting up a backscatter testbed across a subset of hosts for which IPv4 backscatter information is well-known. They developed a set of heuristics for identifying the kind of service or host performing the reverse DNS lookup, classifying them into major services, content delivery networks, mail servers, etc. They then examined the number of reverse DNS lookups requested versus the number of IP packets each received.
It turns out that about ten times as many backscatter incidents are reported for IPv4 as for IPv6, which either indicates that IPv6 hosts perform reverse lookup requests about ten times less often than IPv4 hosts, or IPv6 hosts are ten times less likely to be monitored for backscatter events. Either way, this result is not promising: it appears, on the surface, that IPv6 hosts will be less likely to cause backscatter events, or IPv6 backscatter events are ten times less likely to be reported. This could indicate that widespread deployment of IPv6 will make it harder to detect various kinds of attacks on the DFZ. A second result from this research is that using backscatter, the researchers determined IPv6 scanning is increasing over time; while the IPv6 space is not currently a prime target for attacks, it might become more so over time, if the scanning rate is any indicator.
The bottom line is that IPv6 hosts need to be monitored as closely as, or more closely than, IPv4 hosts for scanning events. The techniques used for scanning the IPv6 address space are not well understood at this time, either.
According to the recent SONAR report, 52% of respondents reported they are using Software Defined Networking (SDN) tools to automate their networks, while 57% reported they are using network management tools. The report notes “52% may be slightly exaggerated, depending on how one defines SDN…” Which leads naturally to the question: what is the difference between SDN and DevOps, and how does AI figure into both or either of these? SDN, DevOps, and AI describe separate and overlapping movements in the design, deployment, and management of networks. While they are easy to confuse, they have three different origins and meanings.
Software Defined Networking grew out of research efforts to build and deploy experimental control planes, either distributed or centralized. SDN, however, quickly became associated with replacing some or all the functions of a distributed control plane with a centralized controller, particularly in order to centralize policy related to the control plane such as traffic engineering. SDN solutions always work through a programmatic interface designed to primarily supply forwarding information to network devices.
Development Operations, or DevOps, is a movement away from human-centered interfaces towards machine-centered interfaces for the deployment, operation, and troubleshooting of networks. DevOps is centered on the deployment, configuration, and management of the entire device, rather than providing the information required to forward traffic. DevOps can either use a programmatic interface, such as YANG, or “screen scraping,” to configure and manage network devices.
Finally, Artificial Intelligence, or AI, in the context of computer networks, is focused on the use of data gathered from the network to improve operations, from decreasing the time required to troubleshoot a problem to making the network adapt more quickly to shifting application and business requirements. AI, applied to networks, is narrow in scope, so it is Artificial Narrow Intelligence, or ANI. Real implementations of AI in the networking field are often applications of Machine Learning, or ML; while these two terms are often used interchangeably, they are not quite the same thing.
The following illustration will be useful in understanding the relationship between these three concepts.
In the figure, the SDN and DevOps controllers interact with two different aspects of the network devices forwarding traffic; both SDN and DevOps can be deployed in the same network to solve different problems. For instance, DevOps might be used to configure network devices to reach the SDN controller so they can receive the information they need to forward packets. Or the DevOps system might be used to configure a distributed control plane, such as IS-IS, on all the network devices, and also to configure a centralized controller which can override the local decisions of the distributed routing protocol for traffic engineering.
There are some situations where the difference between SDN and DevOps solutions is not obvious. The most common example is DevOps could be used to configure routing information on each network device, performing the same function as an SDN controller. In this case, what is the difference?
First, an SDN solution is intended specifically to replace the distributed control plane, rather than to configure the entire device. Second, the configurations pushed to a device through DevOps are normally persistent; if a device reboots, the configuration pushed through DevOps will be loaded and enabled, impacting the operation of the device. In contrast, any information pushed to a device through an SDN controller would normally be ephemeral; when the device is rebooted, information pushed by the SDN controller will be lost.
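The persistent versus ephemeral distinction can be sketched in a few lines of Python. This is a toy model under my own assumptions, not any vendor's or controller's API: the `Device` class, its method names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy network device with persistent config and ephemeral forwarding state."""
    startup_config: dict = field(default_factory=dict)    # survives a reboot
    forwarding_state: dict = field(default_factory=dict)  # lost on a reboot

    def push_config(self, key: str, value: str) -> None:
        """DevOps-style change: stored in the persistent configuration."""
        self.startup_config[key] = value

    def program_route(self, prefix: str, next_hop: str) -> None:
        """SDN-style change: installed only in the running forwarding state."""
        self.forwarding_state[prefix] = next_hop

    def reboot(self) -> None:
        """Reload: the persistent config is kept, the ephemeral state is cleared."""
        self.forwarding_state = {}


router = Device()
router.push_config("controller_address", "192.0.2.1")   # hypothetical value
router.program_route("198.51.100.0/24", "192.0.2.254")  # hypothetical value
router.reboot()

print(router.startup_config)    # still holds controller_address
print(router.forwarding_state)  # empty until the controller reprograms it
```

After the reboot the device still knows how to reach the controller, because that was pushed as configuration, but it has to wait for the controller to supply forwarding information again.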
Finally, AI and self-healing are shown on the right side of this diagram as a way to turn telemetry into actionable input for either the DevOps or the SDN system. The ability of ML systems to find and recognize patterns in streams of data means they are well suited to finding new patterns of network behavior and alerting an operator, or to matching current conditions to the past, anticipating future failures or finding an otherwise unnoticed problem.
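As a very rough illustration of turning telemetry into something actionable, the Python sketch below flags samples that deviate sharply from recent history. To be clear, this is a simple rolling z-score check, a statistical stand-in and not machine learning; the function name, window size, threshold, and latency samples are all assumptions made up for the example.

```python
import statistics


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples far outside the preceding window of samples."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mean = statistics.mean(history)
        spread = statistics.pstdev(history)
        if spread == 0:
            # Perfectly flat history: any change at all counts as a deviation.
            if samples[i] != mean:
                anomalies.append(i)
            continue
        if abs(samples[i] - mean) / spread > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency telemetry in milliseconds, with one obvious spike.
latency_ms = [10, 11, 10, 12, 11, 10, 11, 48, 11, 10]
print(find_anomalies(latency_ms))
```

A real system would feed the flagged indexes to an operator alert or to the DevOps or SDN controller, and would use a far richer model than a single threshold.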
While SDN, DevOps, and AI overlap, then, they serve different purposes in the realm of network engineering and operations. There are many areas of overlap, but they are also different enough to argue the three terms should be cleanly separated, with each adding a different kind of value to the overall system.
Over at the ECI blog, Jonathan Homa has a nice article about the importance of network planning–
In the classic movie, The Graduate (1967), the protagonist is advised on career choices, “In one word – plastics.” If you were asked by a young person today, graduating with an engineering or similar degree about a career choice in telecommunications, would you think of responding, “network planning”? Well, probably not.
Jonathan describes why this is so–traffic is constantly increasing, and the choice of tools we have to support the traffic loads of today and tomorrow can be classified in two ways: slim and none (as I remember a weather forecaster saying when I “wore a younger man’s shoes”). The problem, however, is not just tools. The network is increasingly seen as a commodity, “pure bandwidth that should be replaceable like memory,” made up of entirely interchangeable parts and pieces, primarily driven by the cost to move a bit across a given distance.
This situation is driving several different reactions in the network engineering world, none of which are really healthy. There is a sense of resignation among people who work on networks. If commodities are driven by price, then the entire life of a network operator or engineer is driven by speed, and speed alone. All that matters is how you can build ever larger networks with ever fewer people–so long as you get the bandwidth you need, nothing else matters.
This is compounded by a simple reality–the networking world has driven itself into the corner of focusing on the appliance–the entire network is appliances running customized software, with little thought about the entire system. Regardless of whether this is because of the way we educate engineers through our college programs and our certifications, this is the reality on the ground level of network engineering. When your skill set is primarily built around configuring and managing appliances, and the world is increasingly making those appliances into commodities, you find yourself in a rather depressing place.
Further, there is a belief that there is no more real innovation to be had–the end of the road is nigh, and things are going to look pretty much like they look right now for the rest of … well, forever.
I want you, as a network engineer, operator, or whatever you call yourself, to look these beliefs in the eye and call them what they are: nonsense on stilts.
The real situation is this: the current “networking industry,” such as it is, has backed itself into a corner. The emphasis on planning Jonathan brings out is valid, but it is just the tip of the proverbial iceberg. There is a hint in this direction in Jonathan’s article in the list of suggestions (or requirements). Thinking across layers, thinking about failure, continuous optimization… these are all… system level thinking. To put this another way, a railway boxcar might be a commodity, but the railroad system is not. The individual over-the-road truck might be a commodity, and the individual road might not be all that remarkable, but the road system is definitely not a commodity.
The sooner we start thinking outside the appliance as network engineers or operators (or whatever you call yourself), the sooner we will start adding value to the business. This means thinking about algorithms, protocols, and systems–all that “theory stuff” we typically decry as being less than useful–rather than how to configure x on device y. This means thinking about security across the network, rather than about how you configure a firewall. This means thinking about the tradeoffs with implementing security, including what systemic risk looks like, and when the risks are acceptable when trying to accomplish a specific goal, rather than thinking about how to route traffic through a firewall.
If demand is growing, why is the networking world such a depressing place right now? Why do I see lots of people saying things like “there will be no network engineers in enterprises in five years?” Rather than blaming the world, maybe we should start looking at how we are trying to solve the problems in front of us.
Once the shipping department drops the box off with that new switch, router, or “firewall,” what happens next? You rack it, cable it up, turn it on, and start configuring, right? There are access controls to configure (SSH, keys, disabling standard accounts, disabling telnet), interface addresses to configure, routing adjacencies to configure, local policies to configure, and… After configuring all of this, you can adjust routing in the network to route around the new device, and then either canary the device “in production” (if you run your network the way it should be run), or find some prearranged maintenance time to bring the new device online and test things out. After all of this, you can leave the new device up and running in the network, and move on to the next task.
Until it breaks.
Then you consult the documentation to remind yourself why it was configured this way, consult the documentation to understand how the application everyone is complaining about not working should work, etc. There are the many hours spent sitting on the console gathering information by running various commands and reading the output of various logs. Eventually, once you find the problem, you can either replace the right parts, or reconfigure the right bits, and get everything running again.
In the “modern” world (such as it is), we think it’s a huge leap forward to stop configuring devices manually. If we can just automate the configuration of all that “stuff” we have to do at the beginning, after the box is opened and before the device is placed into service, we think we have this whole networking thing pretty well figured out.
Even if you had everything in your network automated, you still haven’t figured this networking thing out.
We need to move beyond automation. Where do we need to move to? It’s not one place, but two. The first is we need to move beyond automation to autonomous operation. As an example, there is a shiny new system that is currently being widely deployed to automate the deployment and management of containers. Part of this system is the automation of connectivity, including routing, between containers. The routing system being deployed as part of this system is essentially statically configured policy-based routing combined with network address translation.
Let me point something out that is not going to be very popular: this is a step backwards in terms of making the system autonomous. Automating static routing information is not a better solution than building a real, dynamic, proactive, autonomic, routing system. It’s not simpler—trust me, I say this as someone who has operated large networks which used automated static routes to do everything.
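The difference shows up as soon as something fails. The Python sketch below is a toy illustration under my own assumptions: a four-node topology with made-up link costs, and a plain Dijkstra shortest-path computation standing in for what a dynamic routing protocol does. It is not the routing system of any particular container platform. The "static" path is computed once and frozen, the way an automation system would push it; the "dynamic" path is recomputed after a link fails.

```python
import heapq


def shortest_path(links, source, target):
    """Dijkstra over undirected links given as {(a, b): cost}; returns a node list or None."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


def path_is_usable(path, links):
    """True if every hop in the path still has a link in the topology."""
    return all((a, b) in links or (b, a) in links for a, b in zip(path, path[1:]))


# Hypothetical topology: two paths from A to D, with the path through B cheaper.
links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}

static_path = shortest_path(links, "A", "D")   # pushed once, never revisited
del links[("B", "D")]                          # the B-D link fails
dynamic_path = shortest_path(links, "A", "D")  # recomputed after the failure

print(static_path, path_is_usable(static_path, links))
print(dynamic_path, path_is_usable(dynamic_path, links))
```

The frozen path keeps pointing at a link that no longer exists until something outside the network notices and pushes a new configuration, while the recomputed path simply moves to the surviving links.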
The “opsification of everything” is neat, but it shouldn’t be our end goal.
Now part of this, I know, is the fault of vendors. Vendors who push EGPs onto data center fabrics because, after all, “the configuration complexity doesn’t matter so long as you can automate it.” The configuration complexity does matter, because configuration complexity reflects an underlying protocol complexity, and sets up long and difficult troubleshooting sessions that are completely unnecessary.
The second place we need to move in the networking world? The focus on automation is just another form of focusing on configuration. We abstract the configuration, and we touch a lot more devices at once, but we are still thinking about configuration. The more we think about configuration, the less we think about how the system should work, how it really works, what the gaps are, and how to bridge those gaps. So long as we are focused on the configuration, automated or not, we are not focused on how the network can bring value to the business. The longer we are focused on configuration, the less value we are bringing to the business, and the more likely we are to end up being replaced by … an automated system … no matter how poorly that automated system actually works.
And no, the cloud isn’t going to solve this. Containers aren’t going to solve this. The “automated configuration pattern” is already being repeated in the cloud. As more complex workloads are moved into the cloud, the problems there are only going to get harder. What starts out as a “simple” system using policy-based routing analogs and network address translation configured through an automation server will eventually look complex against the hardest problems we had to solve using T1’s, frame relay circuits, inverse multiplexers, wire down patch panels, and mechanical switch crossbar frames. It’s fun to pretend we don’t need dynamic routing to solve the problems that face the network—at least until you hit hard problems, and have to relearn the lessons of the last 20+ years.
Yes, I know vendors are partly to blame for this. I know that, for a vendor, it’s easier to get people to buy into your CLI, or your entire ecosystem, rather than getting them to think about how to solve the problems your business is handing them.
On the other hand, none of this is going to change from the top down. This is only going to change when the average network engineer starts asking vendors for truly simpler solutions that don’t require reams of configuration information. It will change when network engineers get their heads out of the configuration and features, and into the business problems.
It’s time for a short lecture on complexity.
Networks are complex. This should not be surprising, as building a system that can solve hard problems, while also adapting quickly to changes in the real world, requires complexity; the harder the problem and the more adaptable the system needs to be, the more complex the resulting design will tend to be. Networks are bound to be complex, because we expect them to be able to support any application we throw at them, adapt to fast-changing business conditions, and adapt to real-world failures of various kinds.
There are several reactions I’ve seen to this reality over the years, each of which has its own trade-offs.
The first is to cover the complexity up with abstractions. Here we take a massively complex underlying system and “contain” it within certain bounds so the complexity is no longer apparent. I can’t really make the system simpler, so I’ll just make the system simpler to use. We see this all the time in the networking world, including things like intent-driven networking, replacing the command line with a GUI, and replacing the command line with an automation system. The strong point of these kinds of solutions is they do, in fact, make the system easier to interact with, or (somewhat) encapsulate that “huge glob of legacy” into a module so you can interface with it in some way that is not… legacy.
One negative side of these kinds of solutions, however, is that they really don’t address the complexity, they just hide it. Many times hiding complexity has a palliative effect, rather than a real world one, and the final state is worse than the starting state. Imagine someone who has back pain, so they take pain-killers, and then go back to the gym to lift even heavier weights than they have before. Covering the pain up gives them the room to do more damage to their bodies. Complexity, like pain, is sometimes a signal that something is wrong.
Another negative side effect of this kind of solution is described by the law of leaky abstractions: all nontrivial abstractions leak. I cannot count the number of times engineers have underestimated the amount of information that leaks through an abstraction layer and the negative impacts such leaks will have on the overall system.
The second solution I see people use on a regular basis is to agglutinate multiple solutions into a single solution. The line of thinking here is that reducing the number of moving parts necessarily makes the overall system simpler. This is actually just another form of abstraction, and it normally does not work. For instance, it’s common in data center designs to have a single control plane for both the overlay and underlay (which is different than just not having an overlay!). This will work for some time, but at some level of scale it usually creates more complexity, particularly in trying to find and fix problems, than it solves in reducing configuration effort.
As an example, consider if you could create some form of wheel for a car that contained its own little engine, braking system, and had the ability to “warp” or modify its shape to produce steering effects. The car designer would just provide a single fixed (not moving) attachment point, and let the wheel do all the work. Sounds great for the car designer, right? But the wheel would then be such a complex system that it would be near impossible to troubleshoot or understand. Further, since you have four wheels on the car, you must somehow allow them to communicate, as well as having communication to the driver to know what to do from moment to moment, etc. The simplification achieved by munging all these things into a single component will ultimately be overcome by complexity built around the “do-it-all” system to make the whole system run.
Or imagine a network with a single transport protocol that does everything—host-to-host, connection-oriented, connectionless, encrypted, etc. You don’t have to think about it long to intuitively know this isn’t a good idea.
An example for the reader: Geoff Huston joins the Hedge this week to talk about DNS over HTTPS. Is this an example of munging systems together that shouldn’t be munged together? Or is this a clever solution to a hard problem? Listen to the two episodes and think it through before answering, because I’m not certain there is a clear answer to this question.
Finally, what a lot of people do is toss the complexity over the cubicle wall. Trust me, this doesn’t work in the long run–the person on the other side of the wall has a shovel, too, and they are going to be pushing complexity at you as fast as they can.
There are no easy solutions to solving complexity. The only real way to deal with these problems is by looking at the network as part of a larger system including applications, the business environment, and many other factors. Then figure out what needs to be done, how to divide the work up (where the best abstraction points are), and build replaceable components that can solve each of these problems while leaking the least amount of information, and are internally as simple as possible.
Every other path leads to building more complex, brittle systems.
We all use the OSI model to describe the way networks work. I have, in fact, included it in just about every presentation, and every book I have written, someplace in the fundamentals of networking. But if you have ever looked at the OSI model and had to scratch your head trying to figure out how it really fits with the networks we operate today, or what the OSI model is telling you in terms of troubleshooting, design, or operation, you are not alone. Lots of people have scratched their heads about the OSI model, trying to understand how it fits with modern networking. There is a reason this is so difficult to figure out.
The OSI Model does not accurately describe networks.
What set me off in this particular direction this week is an article over at Errata Security:
The OSI Model was created by international standards organization for an alternative internet that was too complicated to ever work, and which never worked, and which never came to pass. Sure, when they created the OSI Model, the Internet layered model already existed, so they made sure to include today’s Internet as part of their model. But the focus and intent of the OSI’s efforts was on dumb networking concepts that worked differently from the Internet.
This is partly true, and yet a bit … over the top. 🙂 OTOH, the point is well taken: the OSI model is not an ideal model for understanding networks. Maybe a bit of analysis would be helpful in understanding why.
First, while the OSI model was developed with packet switching networks in mind, the general idea was to come as close as possible to emulating the circuit-switched networks widely deployed at the time. A lot of thought had gone into making those circuit-switched networks work, and applications had been built around the way they worked. Applications and circuit-switched networks formed a sort of symbiotic relationship, just as applications form with packet-switched networks today; it was unimaginable, at the time, that “everything would change.”
So while the designers of the OSI model understood the basic value of the packet-switched network, they also understood the value of the circuit-switched network, and tried to find a way to solve both sets of problems in the same network. Experience has shown it is possible to build a somewhat close-to-circuit switched network on top of packet switched networks, but not quite in the way, nor as close to perfect emulation, as those original designers thought. So the OSI model is a bit complex and perhaps overspecified, making it less-than-useful today.
Second, the OSI model largely ignored the role of middleboxes, focusing instead on the stacks implemented and deployed in hosts. This, again, makes sense, as there was no such thing as a device specialized in the switching of packets at the time. Hosts took packets in and processed them. Some packets were sent along to other hosts, other packets were consumed locally. Think PDP-11 with some rough code, rather than even an early Cisco CGS.
Third, the OSI model focuses on what each layer does from the perspective of an application, rather than focusing on what is being done to the data in order to transmit it. The OSI model is built “top down,” rather than “bottom up,” in other words. While this might be really useful if you are an application developer, it is not so useful if you are a network engineer.
So—what should we say about the OSI model?
It was much more useful at some point in the past, when networking was really just “something a host did,” rather than its own sort of sub-field, with specialized protocols, techniques, and designs. It was a very good attempt at sorting out what a network needed to do to move traffic, from the perspective of an application.
What it is not, however, is really all that useful for network engineers working within an engineering specialty to understand how to design protocols, and how to design networks on which those protocols will run. What should we replace it with? I would begin by pointing you to the RINA model, which I think is a better place to start. I’ve written a bit about the RINA model, and used the RINA model as one of the foundational pieces of Computer Networking Problems and Solutions.
Since writing that, however, I have been thinking further about this problem. Over the next six months or so, I plan to build a course around this question. For the moment, I don’t want to spoil the fun, or put any half-baked thoughts out there in the wild.