Weekend Reads 091418

Security

You install a new app on your phone, and it asks for access to your email accounts. Should you, or shouldn’t you? TL;DR? You shouldn’t. When an app asks for access to your email, the company behind it is probably reading your mail, performing analytics across it, and selling that information. Something to think about: how do they train their analytics models? By giving humans the job of reading it.

When you shut your computer down, the contents of memory are not wiped. This means an attacker can sometimes grab your data while the computer is booting, before any password is entered. Since 2008, computers have included a subsystem that wipes system memory before any operating system is launched—but researchers have found a way around this memory wipe.

You know how your annoying friend goes on about the dangers of IoT while you’re bragging about your latest install of that great new electronic door lock that works off your phone? You know the one I’m talking about. Maybe that annoying friend has some things right, and we should really be paying more attention to the problems inherent in large scale IoT deployments. For instance, what would happen if you could get the electrical grid in hot water using… hot water heaters?

Copyright

One of the seemingly intractable problems facing content creators today is copyright—this is largely an untold story, and it is also often “little folks” against “big folks.” As copyright infringement detection is automated, it is likely to become a big mess. One way to think about it: a thousand monkeys typing at a thousand typewriters are not going to produce the works of any great artist. On the other hand, a thousand humans writing pieces on the same new product announcement are bound to say the same things in the same way at some point. When everyone hits “publish” at the same time, and the bots of the big folks start calling for takedowns on content written by the little folks, the disparity in legal resources that can be brought to bear is the controlling factor. This problem is made worse by the mandatory implementation of said bots through government action.

Other Stories

While most people think of monopolies in terms of physical goods, it seems possible for monopolies to form around information and services, as well. In fact, it would seem that control of information is at the heart of every monopoly. As anti-trust forces grow against the big content providers in the U.S., the courts will need to sort out when controlling access to information, by itself, becomes a monopoly. Who are the big targets, and what would a case look like against them?

Finally, Google wants to kill the URL. Is this a good idea, or a bad one? My initial reaction is—this is a bad idea. Users certainly find URLs confusing, but this is in part our own fault. Why are URLs confusing? Primarily because we have allowed systems to tack so much state information onto them. Perhaps the alternative is not to bury the complexity and force users to trust the machine, but to make the interface simple again, so users can actually tell what is going on. Of course, one of the oldest marketing tricks in the book is to make something so complicated that users cannot understand it, then offer to sell them a solution for the complexity you have created.

Research: Tail Attacks on Web Applications

When you think of a Distributed Denial of Service (DDoS) attack, you probably think of an attack that overflows the bandwidth available on a single link, or one that exhausts the number of half-open TCP sessions a device can hold at once, preventing the device from accepting new sessions. In either case, a DoS or DDoS attack involves a lot of traffic being pushed at a single device, or across a single link.

TL;DR

  • Denial of service attacks do not always require high volumes of traffic
  • An intelligent attacker can exploit the long tail of service queues deep in a web application to bring the service down
  • These kinds of attacks would be very difficult to detect

 

But if you look at an entire system, there are a lot of places where resources are scarce, and hence are places where resources could be consumed in a way that prevents services from operating correctly. Such attacks would not need to be distributed, because they could take much less traffic than is traditionally required to deny a service. These kinds of attacks are called tail attacks, because they attack the long tail of resource pools, where these pools are much thinner, and hence much easier to attack.

There are two probable reasons these kinds of attacks are not often seen in the wild. First, they require an in-depth knowledge of the system under attack. Most of these long tail attacks will take advantage of the interaction surface between two subsystems within the larger system. Each of these interaction surfaces can also be attack surfaces if an attacker can figure out how to access and take advantage of them. Second, these kinds of attacks are difficult to detect, because they do not require large amounts of traffic, or other unusual traffic flows, to launch.

The paper under review today, Tail Attacks on Web Applications, discusses a model for understanding and creating tail attacks in a multi-tier web application—the kind commonly used for any large-scale frontend service, such as ecommerce and social media.

Huasong Shan, Qingyang Wang, and Calton Pu. 2017. Tail Attacks on Web Applications. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS ’17). ACM, New York, NY, USA, 1725-1739. DOI: https://doi.org/10.1145/3133956.3133968

The figure below illustrates a basic service of this kind for those who are not familiar with it.

The typical application at scale will have at least three stages. The first stage will terminate the user’s session and render content; this is normally some form of modified web server. The second stage will gather information from various backend services (generally microservices), and pass the information required to build the page or portal to the rendering engine. The microservices, in turn, build individual parts of the page, and rely on various storage and other services to supply the information needed.

If you can find some way to clog up the queue at one of the storage nodes, you can cause every other service along the information path to wait on the prior service to fulfill its part of the job at hand. This can cause a cascading effect through the system: a single node struggling with full queues can make an entire set of dependent nodes effectively unavailable, which cascades to a larger set of nodes in the next layer up. For instance, in the network illustrated, if an attacker can somehow cause the queues at storage service 1 to fill up, even for a moment, this can cascade into a backlog of work at services 1 and 2, cascading into a backlog at the front-end service, ultimately slowing—or even shutting—the entire service down. The queues at storage service 1 may be the same size as every other queue in the system (although they are likely smaller, as they face internal, rather than external, services), but storage service 1 may be servicing many hundreds, perhaps thousands, of copies of services 1 and 2.

The queues at storage service 1—and all the other storage services in the system—represent a hidden bottleneck in the overall system. If an attacker can, for a few moments at a time, cause these internal, intra-application queues to fill up, the overall service can be made to slow down to the point of being almost unusable.

How plausible is this kind of attack? The researchers modeled a three-stage system (most production systems have more than three stages) and examined the total queue path through the system. By examining the queue depths at each stage, they devised a way to fill the queues at the first stage in the system by sending millibursts of valid session requests to the rendering engine, the user-facing piece of the application. Even if these millibursts are spread out across the edge of the application, so long as they are all the same kind of requests, and timed correctly, they can bring the entire system down. In the paper, the researchers go further and show that once you understand the architecture of one such system, it is possible to try different millibursts on a running system, causing the same DoS effect.
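
To make the cascade concrete, here is a toy, discrete-time sketch (in Python) of a single bounded storage queue being hit by periodic millibursts of otherwise legitimate requests. This is not the authors’ queueing model; every number in it (queue depth, service rate, burst size, burst spacing) is invented purely for illustration.

# Toy sketch only -- not the model from the paper. All parameters are invented.
from collections import deque

STORAGE_QUEUE_DEPTH = 20   # the hidden internal bottleneck
SERVICE_RATE = 2           # requests the storage tier completes per tick
BACKGROUND_LOAD = 1        # steady, legitimate requests per tick
BURST_SIZE = 40            # one "milliburst" of equally legitimate requests
BURST_PERIOD = 50          # ticks between bursts

queue = deque()
backed_up = 0              # requests that had to wait or retry upstream

for tick in range(200):
    arrivals = BACKGROUND_LOAD + (BURST_SIZE if tick % BURST_PERIOD == 0 else 0)
    for _ in range(arrivals):
        if len(queue) < STORAGE_QUEUE_DEPTH:
            queue.append(tick)
        else:
            backed_up += 1          # upstream services now sit and wait
    for _ in range(min(SERVICE_RATE, len(queue))):
        queue.popleft()
    if tick % 10 == 0:
        print(f"tick {tick:3d}  queue depth {len(queue):2d}  backed up so far {backed_up}")

Note that the average offered load in this sketch is well under the storage tier’s capacity; the queue only saturates for a short window after each burst, which is exactly why coarse utilization or volume metrics will not flag anything unusual.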

This kind of attack, because it is built out of legitimate traffic, and can be spread across the entire public-facing edge of an application, would be nearly impossible to detect or counter at the network edge. One possible counter to this kind of attack would be increasing capacity in the deeper stages of the application. This countermeasure could be expensive, as the data must be stored on a larger number of servers. Further, data synchronized across multiple systems will be subject to CAP limitations, which will ultimately limit the speed at which the application can run anyway. Operators could also consider fine-grained monitoring, which increases the amount of telemetry that must be recovered from the network and processed—another form of monetary tradeoff.

 

Think Like an Engineer, not a Cheerleader

When you see a chart like this—

—you probably think: if I were staking my career on a technology, I would want to jump from the older technology to the new one just at the point where that adoption curve starts to really drive upward.

Over at ACM Queue, Peter J. Denning has an article up on just this topic. He argues that if you understand the cost curve and tipping point of any technology, you can predict—with some level of accuracy—the point at which the adoption s-curve is going to begin its exponential growth phase.
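
The shape of that argument can be sketched with a bit of arithmetic: project the new technology’s cost decline, find the year it crosses the incumbent’s cost, and expect the adoption s-curve to take off around that crossover. The Python sketch below only illustrates the shape of the reasoning; the cost figures, decline rate, and curve steepness are invented, not taken from Denning’s article.

# Illustrative only: all numbers are invented, not taken from the article.
import math

OLD_COST = 100.0        # incumbent's roughly flat unit cost
NEW_COST_START = 400.0  # new technology starts out more expensive
DECLINE = 0.25          # fractional cost drop per year for the new technology

def new_cost(year):
    return NEW_COST_START * math.exp(-DECLINE * year)

# the year the cost curves cross -- the predicted "tipping point"
tipping_year = math.log(NEW_COST_START / OLD_COST) / DECLINE

def adoption(year, steepness=1.0):
    # a logistic s-curve centered on the tipping point
    return 1.0 / (1.0 + math.exp(-steepness * (year - tipping_year)))

for year in range(0, 16, 3):
    print(f"year {year:2d}: new tech cost {new_cost(year):6.1f}  adoption {adoption(year):5.1%}")

The rest of this post is about why the real world rarely cooperates with this tidy picture.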

Going back many years, I recognize this s-curve. It was used for FDDI, ATM, Banyan Vines, Novell Netware, and just about every new technology that has ever entered the market.

TL;DR

  • There are technology jump points where an entire market will move from one technology to another
  • From a career perspective, it is sometimes wise to move to a new technology in the early stages of such a jump
  • However, there are risks involved, such as hidden costs that prevent the jump from occurring
  • Hence, you need to be cautious and thoughtful when considering jumping to a new technology

 

The problem with this curve, especially when applied to every new technology ever invented, is it often makes it seem inevitable some new technology is going to replace an older, existing technology. This, however, makes a few assumptions that are not always warranted.

First, there is an underlying assumption that a current exponential reduction in technology costs will continue until the new technology is cheaper than the old. There are several problems in this neighborhood. Sometimes, for instance, the obvious or apparent costs are much lower, but the overall cost of adoption is not. To give one example, many people still heat their homes with some form of oil-based product. Since electricity is so much less expensive—or at least it seems to be at first glance—why is this so? I’m not an economist, but I can take some wild guesses at the answer.

For instance, electricity must be generated from heat. Someplace, then, heat must be converted to electricity, the electricity transported to the home, and then the electricity must be converted back to heat. A crucial question: is the cost of the double conversion and transportation more than the cost of simply transporting the original fuel to the home? If so, by how much? Many of these costs can be hidden—if every person in the world converted to electric heat, what would be the cost of upgrading and maintaining an electric grid that could support this massive increase in power usage?
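
A rough back-of-the-envelope sketch makes the double-conversion question concrete. The efficiencies below are ballpark, illustrative guesses rather than measured data (and they assume simple resistive heating at the home; a heat pump would change the arithmetic, which is exactly the kind of hidden detail that matters).

# Rough, illustrative numbers only -- not measured data.
FUEL_ENERGY = 100.0          # units of chemical energy in the delivered fuel

# Option 1: burn the fuel at home.
FURNACE_EFFICIENCY = 0.85
heat_at_home = FUEL_ENERGY * FURNACE_EFFICIENCY

# Option 2: burn it at a power plant, transport electricity, convert back to heat.
PLANT_EFFICIENCY = 0.40      # heat to electricity at a thermal plant
GRID_EFFICIENCY = 0.95       # transmission and distribution losses
HEATER_EFFICIENCY = 1.00     # resistive heating at the home

heat_via_grid = FUEL_ENERGY * PLANT_EFFICIENCY * GRID_EFFICIENCY * HEATER_EFFICIENCY

print(f"heat delivered, fuel burned at home  : {heat_at_home:.0f}")
print(f"heat delivered, via the electric grid: {heat_via_grid:.0f}")

Under these made-up (but not crazy) numbers, the double conversion delivers less than half the heat per unit of fuel, which is one plausible reason the apparently cheaper technology has not simply displaced the older one.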

Hidden costs, and our inability to see the entire system at once, often make it more difficult than it might seem to predict the actual “landing spot” on the cost curve of a technology. Nor is it always possible to assume that once a technology has reached a “landing spot,” it will stay there. Major advances in some new technology may actually cross over into the older technology, so that both cost curves are driven down at the same time.

Second, there is the problem of “good enough.” Why are there no supersonic jets flying regularly across the Atlantic Ocean? Because people who fly, as much as they might complain (like me!), have ultimately decided with their wallets that the current technology is “good enough” to solve the problem at hand, and that increasing the speed of flight just isn’t worth the risks and the costs.

Third, as Mike Bushong recently pointed out in a member’s Q&A at The Network Collective, many times a company (startup) will fail because it is too early in the cycle, rather than too late. I will posit that technologies can go the same way; a lot of people can invest in a technology really early and find it just does not work. The idea, no matter how good, will then go on the back burner for many years—perhaps forever—until someone else tries it again.

The Bottom Line

The bottom line is this: just because the curves seem to be converging does not mean a technology is going to follow the s-curve up and to the right. If you are thinking in terms of career growth, you have to ask hard questions, think about the underlying principles, and think about what the failure scenarios might look like for this particular technology.

Another point to remember is the tried and true rule 11. What problem does this solve, and how does it solve it? How is this solution like solutions attempted in the past? If those solutions failed, what will cause the result to be different this time? Think also in terms of complexity—is the added complexity driving real value?

I am not saying you should not bet on a new technology for your future. Rather—think like an engineer, rather than a cheerleader.

Weekend Reads 090718

Did the passage of GDPR impact the amount of spam on the ‘net, or not? It depends on who you ask.

The folks at the Recorded Future blog examined the volume of spam and the number of registrations for domains used in phishing activity, and determined the volume of spam was not impacted by the implementation of Europe’s new privacy laws.

There were many concerns that after the European Union’s General Data Protection Regulation (GDPR) went into effect on May 25, 2018, there would be an uptick in spam. While it has only been three months since the GDPR went into effect, based on our research, not only has there not been an increase in spam, but the volume of spam and new registrations in spam-heavy generic top-level domains (gTLDs) has been on the decline.

John Levine at CircleID, however, argues the measures used in the Recorded Future piece are not useful measures of spam volume in relation to the controls imposed by GDPR:

To understand the effect of GDPR, the relevant questions are: Is GDPR enabling damage, because it makes detection, blocking, and mitigation harder?

Note that the CircleID article only addresses the domain registration question, and does not address the question of spam volume directly.

I would normally download a paper like this and post a synopsis of it as a research post later on, but the synopsis provided by Monday Note is good enough just to read directly.

Testing across 7 browsers and 46 browser extensions, the authors find that for virtually every browser and extension combination there is a way to bypass the intended security policies.

Acoustic side channels are being discovered all the time; this new one uses the “whine” from electronic components in a monitor to determine what someone is looking at by listening through their microphone. While this might not seem like a big deal at first, consider this: anyone on a web conference can use this technology to determine what is on your screen.

Daniel Genkin of the University of Michigan, Mihir Pattani of the University of Pennsylvania, Roei Schuster of Cornell Tech and Tel Aviv University, and Eran Tromer of Tel Aviv University and Columbia University investigated a potential new avenue of remote surveillance that they have dubbed “Synesthesia”: a side-channel attack that can reveal the contents of a remote screen, providing access to potentially sensitive information based solely on “content-dependent acoustic leakage from LCD screens.”

Research: DNSSEC in the Wild

The DNS system is, unfortunately, rife with holes like Swiss cheese; man-in-the-middle attacks can easily negate the operation of TLS and web site security. To resolve these problems, the IETF and the DNS community standardized a set of extensions, DNSSEC, to cryptographically sign all DNS records. These signatures rely on public/private key pairs that are transitively signed (forming a signature chain) from individual subdomains up through the Top Level Domain (TLD). Now that these standards are in place, how heavily is DNSSEC being used in the wild? How much safer are we from man-in-the-middle attacks against TLS and other transport encryption mechanisms?

TL;DR

  • DNSSEC is enabled on most top level domains
  • However, DNSSEC is not widely used or deployed beyond these TLDs

 

Crossposted at CircleID

Three researchers published an article in the Winter issue of ;login: describing their research into answering this question (membership and login required to read the original article). The result? While more than 90% of the TLDs in DNS are DNSSEC enabled, DNSSEC is still not widely deployed or used. To make matters worse, where it is deployed, it isn’t well deployed. The article mentions two specific problems that appear to plague DNSSEC implementations.

First, on the server side, a number of domains deploy either weak or expired keys. An easily compromised key is often worse than having no key at all; there is no way to tell the difference between a key that has been compromised and one that has not. A weak key that has been compromised does not just impact the domain in question, either. If the weakly protected domain has subdomains, or its key is used to validate other domains in any way, the entire chain of trust through the weak key is compromised. Beyond this, there is a threshold over which a system cannot pass without the entire system, itself, losing the trust of its users. If 30% of the keys returned in DNS are compromised, for instance, most users would probably stop trusting any DNSSEC signed information. While expired keys are more obvious than weak keys, relying on expired keys still works against user trust in the system.

Second, DNSSEC is complex. The net result of a complex protocol combined with low deployment and demand on the server side is poor client implementations. Many implementations, according to the research in this paper, simply ignore failures in the validation process. Some of the key findings of the paper are—

  • One-third of the DNSSEC enabled domains produce responses that cannot be validated
  • While TLD operators widely support DNSSEC, registrars who run authoritative servers rarely support DNSSEC; thus the chain of trust often fails at the first hop in the resolution process beyond the TLD
  • Only 12% of the resolvers that request DNSSEC records in the query process validate them

To discover the deployment of DNSSEC, the researchers built an authoritative DNS server and a web server to host a few files. They configured subdomains on the authoritative server; some subdomains were configured correctly, while others were configured incorrectly (a certificate was missing, expired, malformed, etc.). By examining DNS requests for the subdomains they configured, they could determine which DNS resolvers were using the included DNSSEC information, and which were not.
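
Their measurement rig is not something most of us will reproduce, but there is a simple related check anyone can run: ask whether the resolver you normally use validates DNSSEC at all. The Python sketch below assumes the dnspython package is installed and queries a domain known to be signed; a validating resolver will set the AD (Authenticated Data) flag in its response.

# A quick check of your own resolver, not the researchers' measurement setup.
# Assumes the dnspython package is installed (pip install dnspython).
import dns.flags
import dns.resolver

def resolver_validates(domain="ietf.org"):
    """Return True if the configured resolver sets the AD flag, meaning it
    validated the DNSSEC chain for this (signed) domain."""
    resolver = dns.resolver.Resolver()         # uses the system's resolvers
    resolver.use_edns(0, dns.flags.DO, 1232)   # request DNSSEC records
    answer = resolver.resolve(domain, "A")
    return bool(answer.response.flags & dns.flags.AD)

print("resolver validates DNSSEC:", resolver_validates())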

Based on their results, the authors of this paper make some specific recommendations, including enabling DNSSEC on all resolvers, such as the recursive servers your company probably operates for internal and external use. Owners of domain names should also ask their registrars to support DNSSEC on their authoritative servers.

Ultimately, it is up to the community of operators and users to make DNSSEC a reality in the ‘net.

Is BGP Good Enough?

In a recent podcast, Ivan and Dinesh ask why there is a lot of interest in running link state protocols on data center fabrics. They begin with this point: if you have less than a few hundred switches, it really doesn’t matter what routing protocol you run on your data center fabric. Beyond this, there do not seem to be any problems to be solved that BGP cannot solve, so… why bother with a link state protocol? After all, BGP is much simpler than any link state protocol, and we should always solve all our problems with the simplest protocol possible.

TL;DR

  • BGP is both simple and complex, depending on your perspective
  • BGP is sometimes too much, and sometimes too little for data center fabrics
  • We are in danger of treating every problem as a nail, because we have decided BGP is the ultimate hammer

 
Will these contentions stand up to a rigorous challenge?

I will begin with the last contention first—BGP is simpler than any link state protocol. Consider the core protocol semantics of BGP and a link state protocol. In a link state protocol, every network device must have a synchronized copy of the Link State Database (LSDB). This is more challenging than BGP’s requirement, which is very distance-vector like; in BGP you only care if any pair of speakers have enough information to form loop-free paths through the network. Topology information is (largely) stripped out, metrics are simple, and shared information is minimized. It certainly seems, on this score, like BGP is simpler.
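
To make the difference in state concrete, here is a toy sketch, not an implementation of either protocol: a link state node must hold the full, synchronized LSDB below and run SPF over it, while a BGP-style speaker only needs the best path each neighbor has chosen to advertise for each destination. The tiny two-leaf, two-spine topology and its metrics are invented for illustration.

# Toy sketch only -- not IS-IS, OSPF, or BGP. Topology and metrics invented.
import heapq

# The LSDB: every adjacency and metric, which every link state node must
# hold an identical copy of.
lsdb = {
    "leaf1":  {"spine1": 10, "spine2": 10},
    "leaf2":  {"spine1": 10, "spine2": 10},
    "spine1": {"leaf1": 10, "leaf2": 10},
    "spine2": {"leaf1": 10, "leaf2": 10},
}

def spf(topology, root):
    """Dijkstra shortest path first over the full topology."""
    dist = {root: 0}
    pq = [(0, root)]
    while pq:
        cost, node = heapq.heappop(pq)
        if cost > dist.get(node, float("inf")):
            continue
        for neighbor, metric in topology[node].items():
            candidate = cost + metric
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(pq, (candidate, neighbor))
    return dist

print(spf(lsdb, "leaf1"))   # every node computes its own tree from the same LSDB

# A BGP-style view from leaf1 of the same fabric: per destination, just the
# paths its neighbors advertised, with the topology detail stripped away.
bgp_rib = {"leaf2": [("via spine1", ("spine1", "leaf2")),
                     ("via spine2", ("spine2", "leaf2"))]}
print(bgp_rib)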

Before declaring a winner, however, this simplification needs to be considered in light of the State/Optimization/Surface triad.

When you remove state, you are always also reducing optimization in some way. What do you lose when comparing BGP to a link state protocol? You lose your view of the entire topology—there is no LSDB. Perhaps you do not think an LSDB in a data center fabric is all that important; the topology is somewhat fixed, and you probably are not going to need traffic engineering if the network is wired with enough bandwidth to solve all problems. Building a network with tons of bandwidth, however, is not always economically feasible. The more likely reality is there is a balance between various forms of quality of service, including traffic engineering, and throwing bandwidth at the problem. Where that balance is will probably vary, but to always assume you can throw bandwidth at the problem is naive.

There is another cost to this simplification, as well. Complexity is inserted into a network to solve hard problems. The most common hard problem that complexity is used to solve is guarding against environmental instability. Again, a data center fabric should be stable; the topology should never change, reachability should never change, etc. We all know this is simply not true, however, or we would be running static routes in all of our data center fabrics. So why aren’t we?

Because data center fabrics, like any other network, do change. And when they do change, you want them to converge somewhat quickly. Is this not what all those ECMP parallel paths are for? In some situations, yes. In others, those ECMP paths actually harm BGP convergence speed. A specific instance: move an IP address from one ToR on your fabric to another, or from one virtual machine to another. In this situation, those ECMP paths are not working for you, they are working against you—this is, in fact, one of the worst BGP convergence scenarios you can face. IS-IS, specifically, will converge much faster than BGP in the case of detaching a leaf node from the graph and reattaching it someplace else.

Complexity can be seen from another perspective, as well. When considering BGP in the data center, we are considering one small slice of the capabilities of the protocol.

In the center of the illustration above there is a small grey circle representing the core features of BGP. The sections of the ten-sided figure around it represent the feature sets that have been added to BGP over the years to support the many places it is used. When we look at BGP for one specific use case, we see the one “slice”: the core functionality and what we are building on top of it. The reality of BGP, from a code base and complexity perspective, is the total sum of all the different features added across the years to support every conceivable use case.

Essentially, BGP has become not only a nail, but every kind of nail, including framing nails, brads, finish nails, roofing nails, and all the other kinds. It is worse than this, though. BGP has also become the universal glue, the universal screw, the universal hook-and-loop fastener, the universal building block, etc.

BGP is not just the hammer with which we turn every problem into a nail, it is a universal hammer/driver/glue gun that is also the universal nail/screw/glue.

When you run BGP on your data center fabric, you are not just running the part you want to run. You are running all of it. The L3VPN part. The eVPN part. The intra-AS parts. The inter-AS parts. All of it. The apparent complexity may seem low, because you are only looking at one small slice of the protocol. But the real complexity, under the covers, where attack and interaction surfaces live, is much higher. In fact, by any reasonable measure, BGP might have the simplest set of core functions, but it is the most complicated routing protocol in existence.

In other words, complexity is sometimes a matter of perspective. In this perspective, IS-IS is much simpler. Note—don’t confuse our understanding of a thing with its complexity. Many people consider link state protocols more complex simply because they don’t understand them as well as BGP.

Let me give you an example of the problems you run into when you think about the complexity of BGP—problems you do not hear about, but exist in the real world. BGP uses TCP for transport. So do many applications. When multiple TCP streams interact, complex problems can result, such as the global synchronization of TCP streams. Of course we can solve this with some cool QoS, including WRED. But why do you want your application and control plane traffic interacting in this way in the first place? Maybe it is simpler just to separate the two?
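
For anyone who has not run into WRED, here is a minimal sketch (thresholds and weights invented) of the RED logic it is built on: drop probability ramps up with the average queue depth, so competing TCP flows back off at different times instead of synchronizing; WRED simply keeps a separate set of thresholds per traffic class.

# A minimal sketch of the RED logic underlying WRED; numbers are illustrative.
import random

MIN_TH, MAX_TH, MAX_P = 20, 60, 0.10   # thresholds and drop-probability ceiling
WEIGHT = 0.2                           # EWMA weight for the average queue depth
avg_depth = 0.0

def should_drop(current_depth):
    global avg_depth
    avg_depth = (1 - WEIGHT) * avg_depth + WEIGHT * current_depth
    if avg_depth < MIN_TH:
        return False                   # queue is healthy, never drop
    if avg_depth >= MAX_TH:
        return True                    # queue is saturated, always drop
    # linear ramp between the two thresholds
    return random.random() < MAX_P * (avg_depth - MIN_TH) / (MAX_TH - MIN_TH)

for depth in (5, 25, 45, 70, 70, 70):
    print(f"instantaneous depth {depth:2d} -> drop {should_drop(depth)}")

Which is the point: keeping application and control plane traffic from fighting requires yet another mechanism, with its own knobs to tune.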

Is BGP really simpler? From one perspective, it is simpler. From another, however, it is more complex.

Is BGP “good enough?” For some applications, it is. For others, however, it might not be.

You should decide what to run on your network based on application and business drivers, rather than “because it is good enough.” Which leads me back to where I often end up: If you haven’t found the trade-offs, you haven’t looked hard enough.

Weekend Reads 083118

Portland, Maine—The Electronic Frontier Foundation (EFF) and the ACLU are urging the state’s highest courts in Massachusetts and Maine to rule that law enforcement agents need a warrant to access real-time location information from cell phones, a clear application of a landmark U.S. Supreme Court ruling from June. @EFF

Survey data from Qualia suggests that as DevOps becomes mainstream, both organizational resources and budget allocation tied to measurable business outcomes will be attached to this method of rapid application development. DevOps enables a faster iterative process that drives innovation while doing more with less and increasing efficiency. —Thomas MacIsaac @Data Center Journal

Cobalt Group (aka TEMP.Metastrike), active since at least late 2016, have been suspected in attacks across dozens of countries. The group primarily targets financial organizations, often with the use of ATM malware. Researchers also believe they are responsible for a series of attacks on the SWIFT banking system which costs millions in damages to the impacted entities. @Netscout

At the end of our last post we had just emerged from the Fireswamp only to be captured by the six-fingered man. No, no, that’s not right. We had just caught up to the present, in which the agency responsible for regulating the communications industry is trying to get rid of the rules it just put in place and is also trying to undermine its own authority to regulate at all. —Stan Adams @CDT

CLKscrew: Another side channel you didn’t know about

Network engineers focus on protocols and software, but somehow all of this work must connect to the hardware on which packets are switched, and data is processed. A big part of the physical side of what networks “do” is power—how it is used, and how it is managed. The availability of power is one of the points driving centralization; power is not universally available at a single price. If cloud is cheaper, it’s probably not because of the infrastructure, but rather because of the power and real estate costs.

A second factor in processing is the amount of heat produced in processing. Data center designers expend a lot of energy in dealing with heat problems. Heat production is directly related to power usage; each increase in power consumption for processing shows up as heat somewhere—heat which must be removed from the equipment and the environment.

It is important, therefore, to optimize power usage. To do this, many processors today have power management interfaces allowing software to control the speed at which a processor runs. For instance, Kevin Myers (who blogs here) posted a recent experiment with pings running while a laptop is plugged in and on battery—

Reply from 2607:f498:4109::867:5309: time=150ms
Reply from 2607:f498:4109::867:5309: time=113ms
Reply from 2607:f498:4109::867:5309: time=538ms
Reply from 2607:f498:4109::867:5309: time=167ms
Reply from 2607:f498:4109::867:5309: time=488ms
Reply from 2607:f498:4109::867:5309: time=231ms
Reply from 2607:f498:4109::867:5309: time=104ms
Reply from 2607:f498:4109::867:5309: time=59ms
Reply from 2607:f498:4109::867:5309: time=64ms
Reply from 2607:f498:4109::867:5309: time=57ms
Reply from 2607:f498:4109::867:5309: time=58ms
Reply from 2607:f498:4109::867:5309: time=64ms
Reply from 2607:f498:4109::867:5309: time=56ms
Reply from 2607:f498:4109::867:5309: time=62ms

There is a clear difference between the “plugged in” RTT and the “on battery” RTT. A common form of power management is Dynamic Voltage and Frequency Scaling (DVFS), which allows software to change the frequency at which a chip runs based on the kinds of processing being done, and on power availability. The authors of this paper examined the interface between the software drivers that support energy management on some classes of processors and discovered a series of vulnerabilities that allow an attacker to take control of the processing speed and voltage of the chip.
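
The paper targets vendor-specific DVFS driver interfaces on ARM systems; as a more familiar illustration of how directly software can see (and, with the right privileges and governor, set) a core’s operating frequency, here is a small Python sketch that reads the Linux cpufreq sysfs files. The paths assume a typical Linux system, and the sketch only reads them.

# Not the interface attacked in the paper -- just the standard Linux cpufreq
# sysfs files, read-only, to show how exposed frequency scaling is to software.
from pathlib import Path

CPU0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read(name):
    try:
        return (CPU0 / name).read_text().strip()
    except OSError:
        return "unavailable"

print("governor      :", read("scaling_governor"))   # e.g. ondemand, performance
print("current freq  :", read("scaling_cur_freq"), "kHz")
print("allowed range :", read("scaling_min_freq"), "to", read("scaling_max_freq"), "kHz")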

DVFS relies on two regulators in the processor: a voltage regulator that controls the amount of power supplied to the chip, and a Phase-Locked Loop (PLL) regulator that determines the clock frequency of the chip. Software can reduce the amount of voltage supplied to the chip through this interface, as well as manage the speed at which the chip is running. Perhaps the simplest way to think about this is to conceive of a chip as a very complex set of interconnected sets of buckets. The pipeline of the chipset represents a “bucket brigade,” where each stage of the chip is filled with water from some previous stage. Because a little water is always spilled between stages, each stage must be “topped off” a little during processing.

The faster you move water between the stages of buckets, the more the water “slops,” requiring a bit more “topping off” each time. The faster you move water between the stages, the more water is required at the input stage, as well, so the processing simply consumes more water. The voltage level can be seen as the level to which each bucket is filled, how many buckets in each stage are used, or how many stages of buckets are available (or all three). In each case, reducing the voltage impacts the amount of work the chip can accomplish in a given time period.

What happens if you force the chip to run at a higher speed than normal, or at a higher voltage than normal? You can cause the chip to overheat, for instance, or force processing errors by not giving the chip time to refill against the “interstage slop.”

This is precisely what the authors of this paper demonstrate. They take over the power management software of an ARM chipset through some simple attack vectors (it turns out power management software is not very hardened, and hence is pretty easy to take over). They then force errors in the chipset, and show how the chipset could be destroyed through this back door.

Part of the problem with the complex systems we build today is there are just too many attack surfaces for any one person to know about, or to account for. Complexity often drives insecurity—a lesson we are still in the process of learning.

Reaction: Centralization Wins

Warning: in this post, I am going to cross a little into philosophy, governance, and other odd subjects. Here there be dragons. Let me begin by setting the stage:

Decentralized systems will continue to lose to centralized systems until there’s a driver requiring decentralization to deliver a clearly superior consumer experience. Unfortunately, that may not happen for quite some time. —Todd Hoff @High Scalability

And the very helpful diagram which accompanies the quote—

The point Todd Hoff, the author, makes is that five years ago he believed the decentralized model would win, in terms of the way the Internet is structured. Today, however, he doesn’t believe this; centralization is winning. Two points are worth considering before jumping into a more general discussion.

First, the decentralized model is almost always the most efficient in almost every respect. It is the model with the highest signal-to-noise ratio, and the model with the highest gain. The simplest way to explain this is to note that the primary cost in a network is the cost of connectivity, and the primary gain is the amount of support connections provide. The distributed model offers the best balance of these two.

Second, what we are generally talking about here is data and the people connections, rather than the physical infrastructure. While physical hardware is important, it will always lag behind the data and people connections by some amount of time.

These things said, what is the point?

If decentralized is the most efficient model, then why is centralization winning? There are two drivers that can cause centralization to win. The first is scarcity of resources. For instance, you want a centralized authority (within some geographic area) to build roads. Who wants two roads to their house? And would that necessitate two garages, one for each road system? Physical space is limited in a way that, in relation to road systems, centralization (within geographic areas!) makes sense.

The second is regulatory capture. Treating a resource that is not physically constrained as a scarce resource forces centralization, generally to the benefit of the resulting private/public partnership. The overall system is less efficient, but through regulatory capture, or rent seeking, the power accrues to a single entity, which then uses its power to enforce the centralized model.

In the natural order of things, there is disruption of some sort. During the disruption phase, things are changing, and the cost of maintaining a connection in the network is low, so the network tends towards a distributed model. Over time, a distributed model will emerge naturally. Finally, as one node, or a small set of nodes, gains power, the network will tend towards centralization. If regulatory bodies can be found and captured, then the centralized model will be enforced. The more capture, the more strongly the centralized model will become entrenched.

Over time, if innovation is allowed (there is often an attempt to suppress innovation, but this is a two-edged sword), some new “thing,” whether a social movement or a technology—generally a blend of both—will build a new distributed network of some sort, and thus disrupt the centralized network.

What does all of this have to do with network engineering? We are currently moving towards strong regulatory capture with highly centralized networks. The owners of those centralized resources are battling it out, trying to strengthen their regulatory capture in every way possible—the war over net neutrality is but one component of this ongoing battle (which is why it often seems there are no “good folks” and “bad folks” in this argument). At some point, that battle will be decisively won, and one kind of information broker is going to win over the other.

In the process, the “old guard,” the “vendors,” are being required to change their focus as they try to survive. Disaggregation and “software defined” are two elements of this shift in power.

The question is: will we reach full centralization? Or will some new idea—a new/old technology that organizes information in a different way—disrupt the coalescing centralized powers?

The answer to this question impacts what skills you should be learning now, and how you approach the rest of your career (or life!). A lot of your career rests not just on understanding the command lines and hardware, but reaching beyond these into understanding the technologies, and even beyond the technologies to understand the way organizations work. What you should be learning now, and what you are paying attention to, should not reflect the last war, but the next one. Should you study for a certification? Which one? Should you focus on a vendor, or a software package, or a specific technology? For how long, and how deeply?

And yes, I fully understand you cannot sell your longer term technical ability to prospective employers; the entire hiring process today is tailored to hedgehogs rather than foxes. This, however, is a topic for some other time.