2019 – Page 3 – rule 11 reader

Dealing with Lock-In

A post on Martin Fowler’s blog this week started me thinking about lock-in—building a system that only allows you to purchase components from a single vendor so long as the system is running. The point of Martin’s piece is that lock-in exists in all systems, even open source, and hence you need to look at lock-in as a set of tradeoffs, rather than always being a negative outcome. Given that lock-in is a tradeoff, and that lock-in can happen regardless of the systems you decide to deploy, I want to go back to one of the foundational points Martin makes in his post and think about avoiding lock-in a little differently than just choosing between open source and vendor-based solutions.

If cannot avoid lock-in either by choosing a vendor-based solution or by choosing open source, then you have two choices. The first is to just give up and live with the results of lock-in. In fact, I have worked with a lot of companies who have done just this—they have accepted that lock-in is just a part of building networks, that lock-in results in a good transfer of risk to the vendor from the operator, or that lock-in results in a system that is easier to deploy and manage.

Giving in to lock-in, though, does not seem like a good idea on the surface, because architecture is about creating opportunity. If you cannot avoid lock-in and yet lock-in is antithetical to good architecture, what are your other options?

First, you can deepen your understanding of where and how lock-in matters, and when it doesn’t. Architecture often fails in one of three ways. The first way is designing something that is so closely fit-to-purpose today that is inflexible against future requirements. Think of a pair of pants that fit perfectly today; if you gain a few pounds, you might have to start over with your body-covering pant solution after that upcoming holiday. The second way is designing something that is so flexible it is wasteful. Think of a pair of pants that are sized to fit you no matter what your body size might be. Perhaps you might try to create a pair of pants you can wear when you are two, but also when you are seventy-two? You could design such a thing, I’m certain, but the amount of material you’d need to use in creating it would probably involve enough waste that you might as well just go ahead and sew multiple sets of pants across your lifetime. The third way is in building something that is flexible, but only through ossification. Something which can only be changed by adding or multiplying, and never through subtraction or division. Think of a pair of pants that can be changed to any color by adding a thin layer of new material each time.

All three of these are overengineered in one sense or another. They are trying to solve problems that probably don’t exist, that may never exist, or that can only be solved by adding another layer on top. All three of these are also forms of lock-in which should be avoided.

In order to buy the right kind of pants, you have to know how long you expect the pants to fit, and you have to know how your body is going to change over that time period. This is where the architect needs to connect with the business and try to understand what kinds of changes are likely, and what those changes might mean for the network. This is also the place where the architect needs to feed some technical reality back into the business; most of the time business leaders don’t think about technology challenges because no-one has ever explained to them that networks and information technology systems are not infinitely pliable balls of wax.

Second, you can try to contain lock-in, and intentionally refuse to be locked-in above a modular level. For instance, say you build a data center fabric using a marketecture sold by a vendor. If that marketecture can be contained within a single data center fabric, and you have more than one fabric, then you can avoid some degree of lock-in by building your other fabrics with different solutions. If the marketecture doesn’t allow for this, then maybe you should just say “no.” Modular design is not just good for controlling the leakage of state between modules, it can also be good for controlling the leakage of lock-in.

Finally, you can consider proper modularization vertically in your network, from layer to layer in your network stack. Enforce IPv4 or IPv6 as a single “wasp waist,” in the network transport; don’t allow Ethernet to wander all over the place, across modules and layers, as if it were the universal transport. Enforce separate and underlay and overlay control planes. Enforce a single form of storing information (such as all YANG formatted data, or all JSON formatted data, etc.) in your network management systems.

This is what open standards are good for. It will be painful to build to these standards, but these standards represent a point of abstraction where you can split lock-in systems apart, preventing them from entangling with one another in a way that prevents ever replacing one component with another.

Lock-in is not something you can avoid. On the other hand, you can learn to accept it when you know it will not impact your future options, and contain it when you know it will.

Posted in DESIGN, WRITTEN

(Effective) Habits of the Network Expert

For any field of study, there are some mental habits that will make you an expert over time. Whether you are an infrastructure architect, a network designer, or a network reliability engineer, what are the habits of mind those involved in the building and operation of networks follow that mark out expertise?

Experts involve the user

Experts don’t just listen to the user, they involve the user. This means taking the time to teach the developer or application owner how their applications interact with the network, showing them how their applications either simplify or complicate the network, and the impact of these decisions on the overall network.

Experts think about data

Rather than applications. What does the data look like? How does the business use the data? Where does the data need to be, when does it need to be there, how often does it need to go, and what is the cost of moving it? What might be in the data that can be harmful? How can I protect the data while at rest and in flight?

Experts think in modules, surfaces, and protocols

Devices and configurations can, and should, change over time. The way a problem is broken up into modules and the interaction surfaces (interfaces) between those modules can be permanent. Choosing the wrong protocol means choosing a different protocol to solve every problem, leading to accretion of complexity, ossification, and ultimately brittleness. Break the problem up right the first time, and choose the protocols carefully, and let the devices and configurations follow.

Choosing devices first is like selecting the hammer you’re going to use to build a house, and then selecting the design and materials used in the house based on what you can use the hammer for.

Experts think about tradeoffs

State, optimization, and surface is an ironclad tradeoff. If you increase state, you increase complexity while also increasing optimization. If you increase surfaces through abstraction, you are both increasing and decreasing state, which has an impact both on complexity and optimization. All nontrivial abstractions leak. Every time you move data you are facing the speed of serialization, queueing, and light, and hence you are dealing with the choice between consistency, availablity, and partitioning.

If you haven’t found the tradeoffs, you haven’t looked hard enough.

Experts focus on the essence

Every problem has an essential core—something you are trying to solve, and a reason for solving it. Experts know how to divide between the essential and the nonessential. Experts think about what they are not designing, and what they are not trying to accomplish, as well as what they are. This doesn’t mean the rest isn’t there, it just means it’s not quite in focus all the time.

Experts are mentally stimulated to simulate

Labs are great—but moving beyond the lab and thinking about how the system works as a whole is better. Experts mentally simulate how the data moves, how the network converges, how attackers might try to break in, and other things besides.

Experts look around

Interior designers go to famous spaces to see how others have designed before them. Building designers walk through cities and famous buildings to see how others have designed before them. The more you know about how others have designed, the more you know about the history of networks, the more of an expert you will be.

Experts reshape the problem space

Experts are unafraid to think about the problem in a different way, to say “no,” and to try solutions that have not been tried before. Best common practice is a place to start, not a final arbiter of all that is good and true. Experts do not fall to the “is/ought” fallacy.

Experts treat problems as opportunities

Whether the problem is a mistake or a failure, or even a little bit of both, every problem is an opportunity to learn how the system works, and how networks work in general.

Posted in CAREER, CULTURE, SKILLS, WRITTEN

IPv6 Backscatter and Address Space Scanning

Backscatter is often used to detect various kinds of attacks, but how does it work? The paper under review today, Who Knocks at the IPv6 Door, explains backscatter usage in IPv4, and examines how effectively this technique might be used to detect scanning of IPv6 addresses, as well. The best place to begin is with an explanation of backscatter itself; the following network diagram will be helpful—

Assume A is scanning the IPv4 address space for some reason—for instance, to find some open port on a host, or as part of a DDoS attack. When A sends an unsolicited packet to C, a firewall (or some similar edge filtering device), C will attempt to discover the source of this packet. It could be there is some local policy set up allowing packets from A, or perhaps A is part of some domain none of the devices from C should be connecting to. IN order to discover more, the firewall will perform a reverse lookup. To do this, C takes advantage of the PTR DNS record, looking up the IP address to see if there is an associated domain name (this is explained in more detail in my How the Internet Really Works webinar, which I give every six months or so). This reverse lookup generates what is called a backscatter—these backscatter events can be used to find hosts scanning the IP address space. Sometimes these scans are innocent, such as a web spider searching for HTML servers; other times, they could be a prelude to some sort of attack.

Kensuke Fukuda and John Heidemann. 2018. Who Knocks at the IPv6 Door?: Detecting IPv6 Scanning. In Proceedings of the Internet Measurement Conference 2018 (IMC ’18). ACM, New York, NY, USA, 231-237. DOI: https://doi.org/10.1145/3278532.3278553

Scanning the IPv6 address space is much more difficult because there are 2¹²⁸ addresses rather than 2³². The paper under review here is one of the first attempts to understand backscatter in the IPv6 address space, which can lead to a better understanding of the ways in which IPv6 scanners are optimizing their search through the larger address space, and also to begin understanding how backscatter can be used in IPv6 for many of the same purposes as it is in IPv4.

The researchers begin by setting up a backscatter testbed across a subset of hosts for which IPv4 backscatter information is well-known. They developed a set of heuristics for identifying the kind of service or host performing the reverse DNS lookup, classifying them into major services, content delivery networks, mail servers, etc. They then examined the number of reverse DNS lookups requested versus the number of IP packets each received.

It turns out that about ten times as many backscatter incidents are reported for IPv4 than IPv6, which either indicates that IPv6 hosts perform reverse lookup requests about ten times less often than IPv4 hosts, or IPv6 hosts are ten times less likely to be monitored for backscatter events. Either way, this result is not promising—it appears, on the surface, that IPv6 hosts will be less likely to cause backscatter events, or IPv6 backscatter events are ten times less likely to be reported. This could indicate that widespread deployment of IPv6 will make it harder to detect various kinds of attacks on the DFZ. A second result from this research is that using backscatter, the researchers determined IPv6 scanning is increasing over time; while the IPv6 space is not currently a prime target for attacks, it might become more so over time, if the scanning rate is any indicator.

The bottom line is—IPv6 hosts need to be monitored as closely, or more closely than IPv6 hosts, for scanning events. The techniques used for scanning the IPv6 address space are not well understood at this time, either.

Posted in RESEARCH, SECURITY, WRITTEN

The Hedge 9: Nash King and Ethics in IT

Nash King (@gammacapricorni) joins Russ White and Tom Ammon in a wide ranging discussion of ethics in IT, including being comfortable with standing up and saying “no” when asked to do something you consider unethical and the virtue ethic. This is meant to be the first of a series of episodes on this topic.

download

Posted in AUDIO, CAREER, CULTURE, HEDGE

SDN, AI, and DevOps

According to the recent SONAR report, 52% of respondents reported they are using Software Defined Networking (SDN) tools to automate their networks, while 57% reported they are using network management tools. The report notes “52% may be slightly exaggerated, depending on how one defines SDN…” Which leads naturally to the question—what the difference between SDN and DevOps is, and how does AI figure into both or either of these. SDN, DevOps, and AI describe separate and overlapping movements in the design, deployment, and management of networks. While they are easy to confuse, they have three different origins and meanings.

Software Defined Networking grew out of research efforts to build and deploy experimental control planes, either distributed or centralized. SDN, however, quickly became associated with replacing some or all the functions of a distributed control plane with a centralized controller, particularly in order to centralize policy related to the control plane such as traffic engineering. SDN solutions always work through a programmatic interface designed to primarily supply forwarding information to network devices.

Development Operations, or DevOps, is a movement away from human-centered interfaces towards machine-centered interfaces for the deployment, operation, and troubleshooting of networks. DevOps is centered on the deployment, configuration, and management of the entire device, rather than providing the information required to forward traffic. DevOps can either use a programmatic interface, such as YANG, or “screen scraping,” to configure and manage network devices.

Finally, Artificial Intelligence, or AI, in the context of computer networks, is focused on the use of data gathered from the network to improve operations, from decreasing the time required to troubleshoot a problem to making the network adapt more quickly to shifting application and business requirements. AI, applied to networks, is narrow in scope, so it is Artificial Narrow Intelligence, or ANI. Real implementations of AI in the networking field are often applications of Machine Learning, or ML; while these two terms are often used interchangeably, they are not quite the same thing.

The following illustration will be useful in understanding the relationship between these three concepts.

In the figure, the SDN and DevOps controllers interact with two different aspects of the network devices forwarding traffic; both SDN and DevOps can be deployed in the same network to solve different problems. For instance, DevOps might be used to configure network devices to reach the SDN controller so they can receive the information they need to forward packets. Or the DevOps system might be used to configure a distributed control plane, such as IS-IS, on all the network devices, and also to configure a centralized controller which can override the local decisions of the distribute routing protocol for traffic engineering.

There are some situations where the difference between SDN and DevOps solutions is not obvious. The most common example is DevOps could be used to configure routing information on each network device, performing the same function as an SDN controller. In this case, what is the difference?

First, an SDN solution is intended specifically to replace the distributed control plane, rather than to configure the entire device. Second, the configurations pushed to a device through DevOps is normally persistent; if a device reboots, the configuration pushed through DevOps will be loaded and enabled, impacting the operation of the device. In contrast, any information pushed to a device through an SDN controller would normally be ephemeral; when the device is rebooted, information pushed by the SDN controller will be lost.

Finally, AI and self-healing are shown on the right side of this diagram as a way to turn telemetry into actionable input for either the DevOps or the SDN system. The ability of ML networks to find and recognize patterns in streams of data means it is perfectly suited to find new patterns of network behavior and alert an operator, or to match current conditions to the past, anticipating future failures or finding an otherwise unnoticed problem.

While SDN, DevOps, and AI overlap, then, they serve different purposes in the realm of network engineering and operations. There are many areas of overlap, but they are also different enough to argue the three terms should be cleanly separated, with each adding a different kind of value to the overall system.

Posted in TECH, WRITTEN

The Hedge 8: Open Source and the Future of Routing Software

Almost every company relies on open source software in some way, which leads to the natural question—how will the heart of the network, routing and switching, be impacted by open source software? In this episode of the Hedge, Sue Hares, Donald Sharp, and Russ White discuss the current and future world of open source routing software. Donald is one of the main drivers of the FR Routing open source routing stack; Russ White is a maintainer on the project and is still deeply involved in commercial routing software, and Sue Hares was deeply involved in the origins of the GateD open source routing stack.

download

You can find previous episodes of the Hedge here, and you can subscribe to the Hedge on iTunes and other feed services.

Posted in AUDIO, HEDGE

Service Provider Tech Doesn’t Apply?

Service provider problems are not your problems. You should not be trying to solve your problems the same way service providers do.

This seems intuitively true—after all, just about everything about a train or a large over-the-road truck (or lorry) is different from a passenger car. If the train is the service provider network and the car is the “enterprise” network, it seems to be obvious the two have very little in common.

Or is it?

What this gets right is that if an operator sells access to their network, or a single application, their network is likely to be built differently than the more general-purpose designs used in organizations that must support a wide range of applications and purposes. These differences are likely to show up in the choice of hardware, how the network is operated, and the kinds of services offered (or not).

What this gets right is operators who sell access to their networks, or support a single application, always seem to build at a scale far beyond what more general-purpose networks ever reach. Microsoft and Facebook number their servers in the millions, and single purchase orders include thousands of routers. eBay and LinkedIn number their servers in the hundreds of thousands, and their routers and switches in the tens of thousands. How can a small enterprise network of a few hundred servers be anything like these larger networks?

What this gets wrong is assuming none of the technologies, tools, or attitudes from these larger-scale networks is every applicable to the smaller networks many engineers encounter on a day-to-day basis.

All those networks with BGP deployed in their data center fabrics are using technology designed primarily for interconnecting intermediate systems on the default-free zone—in other words, for connecting the networks of transit service providers. All those networks with OSPF deployed are using a link state protocol originally designed to provide edge-to-edge reachability in transit service provider networks. All those networks with IS-IS deployed are using a link state protocol originally designed to provide connectivity to large-scale telephony-style networks.

What about transport technologies? The only transport technologies originally designed specifically for “enterprise use” have long since been replaced by optical technologies designed for large-scale provider or “hyperscale” use. Token Ring and ARCnet are long gone, as is the original shared medium Ethernet, replaced by switched Ethernet largely over optical transport. Even current general WiFi is primarily designed for public operator use cases—look at 5G and WiFi 6 and note how public operator requirements have influenced these technologies.

The truth is there is no “pure” enterprise technology; following the dictum that you should not use “service-provider technologies” in your network would leave you with … no network at all.

There is a second realm where this line of argument falls flat, and its more important than the question of which technologies to use: the techniques and attitudes learned in the operation of truly large-scale networks hold valuable lessons for all network engineers. Should you use a spine and leaf topology in your data center, rather than a more traditional hierarchical design? The answer has nothing to do with scale, and everything to do with flexibility in design and operational agility. Should you automate your network, even if its only ten routers? The answer has nothing to do with what Amazon is doing, and everything to do with how much time you want to spend on configuring and troubleshooting versus responding to real business needs.

Think of it this way: the driver who drives the large over-the-road truck is still going to learn lessons and instincts about driving that will make them a better driver in a minivan.

Come join me at NXTWORK in November to continue the conversation in my master class on building and operating data center fabrics, as I explore how you can apply lessons from the hyperscale world to your network.

Posted in DESIGN, WRITTEN