Archive for 2020
The Hedge 54: Bob Friday and AI in Networks
AI in networks is a hotly contested subject—so we asked Bob Friday, CTO of Mist Systems, to explain the value and future of AI in networks. Bob joins Tom Ammon and Russ White for this episode.
Random Thoughts

This week is very busy for me, so rather than writing a single long post, I’m throwing together some things that have been sitting in my pile to write about for a long while.
From Dalton Sweeny:
This is precisely the way network engineering is. There is value in the kinds of knowledge that expire, such as individual product lines—but the closer you are to the configuration, the more ephemeral the knowledge is. This is the entire point of rule 11 is your friend: learn the foundational things that make learning the ephemeral things easier. There are (really) only four problems in moving data from one place to another, and only around four solutions for each of those problems. Each of those solutions is, in turn, bounded to a small set of sub-solutions (again, about four for each), or ways of implementing the solution.
I’m going to spend some time talking about this in the Livelesson I’m currently recording, so watch this space for an announcement sometime early next year about publication.
From Ivan P:
There are two ways to look at this. Either vendors should lead the market in building solutions, or they should follow whatever the customer wants. From my perspective, one of the problems we have right now is that everything is a massive mish-mash of these two things. The operator’s design team thinks of a neat way to do X, and then promises the account team a big check if it’s implemented. It doesn’t matter that X could be solved some other way that might be simpler—all that matters is the check. In this case, the vendor stops challenging the customer to build things better, and starts acting like a commodity provider rather than an innovative partner.
The interaction between the customer and the vendor needs to be more push-pull than it currently is—right now, it seems like either the operator simply dictates terms to the vendor, or the vendor pretty much dictates architecture to the operator. We need to find a middle ground. The vendor does need to have a solid solution architecture, but the architecture needs to be flexible, as well, and the blocks used to build that architecture need to be usable in ways not anticipated by the vendor’s design folks.
On the other hand, we need to stop chasing features. This isn’t just true of vendors; it’s true of operators as well. You get feature lists because that’s what you ask for. Often, operators ask for feature lists because that’s the easiest thing to measure, or because they already have a completely screwed up design they are trying to brownfield around. The worst is—“we have this brownfield that we just can’t get rid of, so we want to build yet another overlay on top, which will make everything simpler.” After about the twentieth overlay, a system crash becomes a matter of when rather than if.
The Hedge 53: Deprecating Interdomain ASM
Interdomain Any-source Multicast has proven to be an unscalable solution, and is actually blocking the deployment of other solutions. To move interdomain multicast forward, Lenny Giuliano, Tim Chown, and Toerless Eckert wrote RFC 8815, BCP 229, recommending providers “deprecate the use of Any-Source Multicast (ASM) for interdomain multicast, leaving Source-Specific Multicast (SSM) as the recommended interdomain mode of multicast.”
Technologies that Didn’t: CLNS
Note: RFC1925, rule 11, reminds us that: “Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works.” Understanding the past not only helps us to understand the future, it also helps us to take a more balanced and realistic view of the technologies being created and promoted for current and future use.
The Open Systems Interconnection (OSI) model is the most often taught model of data transmission—although it is not all that useful in terms of describing how modern networks work. What many engineers who have come into network engineering more recently do not know is that there was an entire protocol suite that went with the OSI model. Each of the layers within the OSI model, in fact, had multiple protocols specified to fill the functions of that layer. For instance, X.25, while older than the OSI model, was adopted into the OSI suite to provide point-to-point connectivity over some specific kinds of physical circuits. Moving up the stack a little, there were several protocols that provided much the same service as the widely used Internet Protocol (IP).
The Connection Oriented Network Service, or CONS, ran on top of the Connection Oriented Network Protocol, or CONP (which is X.25). Using CONP and CONS, a pair of hosts can set up a virtual circuit. Perhaps the closest analogy available in the world of networks today would be an MPLS Label Switched Path (LSP). Another protocol, the Connectionless Network Service, or CLNS, ran on top of the Connectionless Network Protocol, or CLNP. A series of Transport Protocols ran on top of CLNS (these might also be described as modes of CLNS in some sense). Together, CLNS and these transport protocols provided a set of services similar to IP, the Transmission Control Protocol (TCP), and the User Datagram Protocol (UDP)—or perhaps even closer to something like QUIC.
The routing protocol that held the network together by discovering the network topology, carrying reachability, and calculating loop-free paths was the venerable Intermediate System to Intermediate System (IS-IS) protocol, which is still in wide use today. The OSI protocol suite had some interesting characteristics.
For instance, addresses were assigned to hosts rather than to interfaces. The host address was calculated automatically from a local Media Access Control (MAC) address, combined with other bits of information that were either learned, assumed, or locally configured. A single host could still have multiple addresses, using each one to communicate with the different networks it might be connected to, or even to contact hosts in different domains. The Intermediate System (IS) played a vital role in address calculation and routing; routing essentially ran all the way down to the host level, which allowed smart calculation of next hops. While a default router could be configured in some implementations, it was not required, because the hosts participated in routing.
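To make this concrete, here is a minimal sketch (in Python) of the kind of address construction a CLNS host performed, assuming the common convention of a locally administered prefix (AFI 49) and a system ID taken from the MAC address; the area prefix itself is made up for illustration:

```python
# A minimal sketch of MAC-derived CLNS (NSAP) host addressing.
# The area prefix is hypothetical; real prefixes came from an
# assigning authority or local configuration. AFI 49 marks a
# locally administered ("private") address.

def nsap_from_mac(area_prefix: str, mac: str, nsel: int = 0) -> str:
    """Build an NSAP-style address: area prefix + system ID + selector.

    The 6-byte system ID is taken directly from an interface MAC,
    which is how many CLNS hosts auto-configured their addresses.
    """
    system_id = mac.replace(":", "").lower()
    if len(system_id) != 12:
        raise ValueError("expected a 48-bit MAC address")
    grouped = ".".join(system_id[i:i + 4] for i in range(0, 12, 4))
    return f"{area_prefix}.{grouped}.{nsel:02x}"

# One address for the whole host, not one per interface:
print(nsap_from_mac("49.0001", "00:1b:44:11:3a:b7"))
# -> 49.0001.001b.4411.3ab7.00
```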
A tidbit of interesting history—the desire to run CLNS and TCP/IP side-by-side on a single network drove one of the earliest uses of address families in protocols, and was one of the main drivers of the Type-Length-Value (TLV) encoding used in the IS-IS routing protocol. ISO-IGRP was one of the first multiprotocol routing protocols to use a concept similar to address families. Some of the initial work in BGP address families was also done to support CLNS routing, as the OSI stack did not specify an interdomain routing protocol.
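The TLV idea itself is worth seeing in miniature. Here is a sketch, in Python, of the encode-and-skip pattern that makes TLV-encoded protocols like IS-IS easy to extend; the type numbers are illustrative, not real IS-IS code points:

```python
import struct

def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """One octet of type, one octet of length, then the value."""
    if len(value) > 255:
        raise ValueError("a single TLV value is limited to 255 octets")
    return struct.pack("!BB", tlv_type, len(value)) + value

def decode_tlvs(data: bytes):
    """Yield (type, value) pairs; unknown types are simply skipped over."""
    offset = 0
    while offset + 2 <= len(data):
        tlv_type, length = struct.unpack_from("!BB", data, offset)
        offset += 2
        yield tlv_type, data[offset:offset + length]
        offset += length

# A receiver that does not understand type 99 can still parse past it,
# which is why adding a new address family does not break old speakers.
stream = encode_tlv(1, b"area-49.0001") + encode_tlv(99, b"new-address-family")
for tlv_type, value in decode_tlvs(stream):
    print(tlv_type, value)
```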
Why is this interesting protocol not in wide use today? There are many reasons—probably every engineer who ever worked on both can give you a different set of reasons, in fact. From my perspective, however, there are a few basic reasons why TCP/IP “won” over the CLNS suite of protocols.
First, the addressing used with CLNS was too complex. There were spots in the address for the assigning authority, the company, subdivisions within the company, flooding domains, and other elements. While this was all very neat, it assumed one of two things: either that network topologies would be built roughly parallel to the organizational chart (starting from governmental entities), or that the topology of the network and the addressing did not need to be related to one another. This flies in the face of Yaakov’s rule: either the topology must follow the addressing, or the addressing must follow the topology, if you expect the network to scale.
In the later years of CLNS, this entire mess was replaced by people just using the “private” organizational information to build internal networks, and assuming these networks would never be interconnected (much like using private IPv4 address space). Sometimes this worked, of course. Sometimes it did not.
This addressing scheme left users with the impression, true or false, that CLNS implementations had to be carefully planned. Numbers must be procured, the organizational structure must be consulted, and so on. IP, on the other hand, was something you could throw out there and play with. You could get it all working, and plan later, or change your plans. Well, in theory, at least—things never seem to work out in real life the way they are supposed to.
Second, the protocol stack itself was too complex. Rather than solving a series of small problems and fitting the solutions into a sort of ad hoc layered set of protocols, the designers of CLNS carefully thought through every possible problem and considered every possible solution. At least they thought they did. All this thinking, however, left the impression, again, that to deploy this protocol stack you had to think carefully about what you were doing. Further, it was difficult to change things in the protocol stack when problems were found, or when new use cases were discovered that had not been thought of.
It is better, most of the time, to build small, compact units that fit together, than it is to build a detailed grand architecture. The network engineering world tends to oscillate between the two extremes of no planning or all planned; rarely do we get the temperature of the soup (or porridge) just right.
While CLNS is not around today, it is hard to call the protocol stack a failure. CLNS was widely deployed in large-scale electrical networks; some of these networks may well be in use today. Further, its effects are everywhere. For instance, IS-IS is still a widely deployed protocol. In fact, because of its multiprotocol design from the beginning, it is arguably one of the easiest link state protocols to work with. Another example is the multiprotocol work that has carried over into other protocols, such as BGP. The ideas of a protocol running between the host and the router, and of address autoconfiguration, have come back in very similar forms in IPv6 as well.
CLNS is another one of those designs that shape the thinking of network engineers today, even if most network engineers don’t know it existed.
Reducing RPKI Single Point of Takedown Risk
The RPKI, for those who do not know, ties the origin AS to a prefix using a certificate (the Route Origin Authorization, or ROA) signed by a third party. The third party, in this case, is validating that the AS in the ROA is authorized to advertise the destination prefix in the ROA—if ROAs were self-signed, the security would be no better than simply advertising the prefix in BGP. Who should be able to sign these ROAs? The assigning authority makes the most sense—the Regional Internet Registries (RIRs), since they (should) know which company owns which set of AS numbers and prefixes.
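For those who have not seen it, the check a router ultimately performs against validated ROA data is straightforward; here is a minimal sketch in Python, using documentation prefixes and AS numbers made up for illustration:

```python
from ipaddress import ip_network

# Validated ROA payloads: (prefix, maximum length, authorized origin AS).
# All values here are illustrative documentation prefixes and ASNs.
validated_roas = [
    (ip_network("192.0.2.0/24"), 24, 64496),
    (ip_network("198.51.100.0/22"), 24, 64497),
]

def rov_state(prefix: str, origin_as: int) -> str:
    """Return 'valid', 'invalid', or 'not-found', the usual RPKI states."""
    route = ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_as in validated_roas:
        if route.subnet_of(roa_prefix):
            covered = True
            if route.prefixlen <= max_length and origin_as == roa_as:
                return "valid"
    return "invalid" if covered else "not-found"

print(rov_state("192.0.2.0/24", 64496))    # valid
print(rov_state("192.0.2.0/24", 64511))    # invalid: wrong origin AS
print(rov_state("203.0.113.0/24", 64496))  # not-found: no covering ROA
```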
The general idea makes sense—you should not accept routes from “just anyone,” as they might be advertising the route for any number of reasons. An operator could advertise routes to source spam or phishing emails, or some government agency might advertise a route to redirect traffic, or block access to some web site. But … if you haven’t found the tradeoffs, you haven’t looked hard enough. Security, in particular, is replete with tradeoffs.
Every time you deploy some new security mechanism, you create some new attack surface—sometimes more than one. Deploy a stateful packet filter to protect a server, and the device itself becomes a target of attack, including buffer overflows, phishing attacks to gain access to the device as a launch-point into the private network, and the holes you have to punch in the filters to allow services to work. What about the RPKI?
When the RPKI was first proposed, one of my various concerns was the creation of new attack surfaces. One specific attack surface is the control a single organization—the issuing RIR—has over the very existence of the operator. Suppose you start a new content provider. To get the new service up and running, you sign a contract with an RIR for some address space, sign a contract with some upstream provider (or providers), set up your servers and service, and start advertising routes. For whatever reason, your service goes viral, netting millions of users in a short span of time.
Now assume the RIR receives a complaint against your service for whatever reason—the reason for the complaint is not important. This places the RIR in the position of a prosecutor, defense attorney, and judge—the RIR must somehow figure out whether or not the charges are true, figure out whether or not taking action on the charges is warranted, and then take the action they’ve settled on.
In the case of a government agency (or a large criminal organization) making the complaint, there is probably going to be little the RIR can do other than simply revoke your certificate, pulling your service off-line.
Overnight your business is gone. You can drag the case through the court system, of course, but this can take years. In the meantime, you are losing users, other services are imitating what you built, and you have no money to pay the legal fees.
A true story—without the names. I once knew a man who worked for a satellite provider; let’s call them SATA. SATA’s leadership decided they had no expertise in accounts receivable, and they were spending too much time trying to collect overdue bills, so they outsourced the process. SATB, a competing service, then decided to buy the firm SATA had outsourced their accounts receivable to. You can imagine what happened next… The accounting firm worked as hard as it could to reduce the revenue SATA was receiving.
Of course, SATA sued the accounting firm, but before the case could make it to court, SATA ran out of money, laid off all their people, and shut their service down. SATA essentially went out of business. They won some money later, in court, but … whatever money they won was just given to the investors of various kinds to make up for losses. The business itself was gone, permanently.
Herein lies the danger of giving a single entity like an RIR, even if they are friendly, honest, etc., control over a critical resource.
A recent paper presented at the ANRW at APNIC caught my attention as a potential way to solve this problem. The idea is simple—just allow (or even require) multiple signatures on a ROA. To be more accurate, each authorizing party issues a “partial certificate”; if “enough” pieces of the certificate are found and valid, the route will be validated.
The question is—how many signatures (or parts of the signature, or partial attestations) should be enough? The authors of the paper suggest there should be a “Threshold Signature Module” that makes this decision. The attestations of the various signers are combined in the threshold module to produce a single signature that is then used to validate the route. This way the validation process on the router remains the same, which means the only real change in the overall RPKI system is the addition of the threshold module.
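Setting the cryptography aside, the decision the threshold module makes can be sketched as a simple k-of-n check. This Python sketch models only the counting logic (a real threshold scheme combines the partial signatures cryptographically into a single signature), and the attestation data is made up:

```python
RIRS = {"AFRINIC", "APNIC", "ARIN", "LACNIC", "RIPE"}

def roa_attested(partial_attestations: set, threshold: int = 3) -> bool:
    """Accept the ROA if at least `threshold` distinct RIRs have signed."""
    return len(partial_attestations & RIRS) >= threshold

# Even if the allocating RIR revokes its attestation, the ROA survives:
print(roa_attested({"RIPE", "APNIC", "LACNIC"}))  # True
print(roa_attested({"ARIN"}))                     # False
```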
If one RIR—even the one that allocated the addresses you are using—revokes their attestation on your ROA, the remaining attestations should be enough to convince anyone receiving your route that it is still valid. Since there are five regions, you have at least five different choices to countersign your ROA. Each RIR is under the control of a different national government; hence organizations like governments (or criminals!) would need to work across multiple RIRs and through other government organizations to have a ROA completely revoked.
An alternative solution here, one that follows the PGP model, might be to have the threshold signature module simply consider the number and source of attestations using the existing ROA model. Local policy could determine how to weight attestations from different RIRs, etc.
This multiple or “shared” attestation (or signature) idea seems like a neat way to work around one of (possibly the major) attack surfaces introduced by the RPKI system. If you are interested in Internet core routing security, you should take a read through the post linked above, and then watch the video.
The Hedge 52: Tobi Metz and the Technologist Question
Tobi Metz asked What is a Technologist? in a recent blog post. Tobi joins Tom and Russ on this episode of the Hedge to expand on his answer, and to get our thoughts on the question.
Everyone Must Learn to Code
The word on the street is that everyone—especially network engineers—must learn to code. A conversation with a friend and an article passing through my RSS reader brought this to mind once again—so once more into the breach. Part of the problem here is that we seem to have a knack for asking the wrong question. When we look at network engineer skill sets, we often think about the ability to configure a protocol or set of features, and then the ability to quickly troubleshoot those protocols or features using a set of commands or techniques.
This is, in some sense, what various certifications have taught us—we have reached the expert level when we can configure a network quickly, or when we can prove we understand a product line. There is, by the way, a point of truth in this. If you claim your expertise is with a particular vendor’s gear, then it is true that you must be able to configure and troubleshoot on that vendor’s gear to be an expert. There is also a problem of how to test for networking skills without actually implementing something, and how to implement things without actually configuring them. This is a problem we are discussing in the new “certification” I’ve been working on, as well.
This is also, in some sense, what the hiring processes we use have taught us. Computers like to classify things in clear and definite ways. The only clear and definite way to classify networking skills is by asking questions like “what protocols do you understand how to configure and troubleshoot?” It is, it seems, nearly impossible to test design or communication skills in a way that can be easily placed on a resume.
Coding, I think, is one of those skills that appears easy to measure accurately, and it’s also something the entire world insists is the “coming thing.” No coding skills, no job. So it’s easy to ask the easy questions—what languages do you know, how many lines of code have you written, and so on. But again, these are the wrong questions.
What is the right question? In terms of coding skills, more along the lines of something like, “do you know how to build and use tools to solve business problems?” I phrase it this way because the one thing I have noticed about every really good coder I have known is they all spend as much time building tools as they do building shipping products. They build tools to test their code, or to modify the code they’ve already written en masse, etc. In fact, the excellent coders I know treat functions like tools—if they have to drive a nail twice, they stop and create a hammer rather than repeating the exercise with some other tool.
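As a hypothetical example of the hammer in miniature: the first time you find yourself cleaning up interface names the same way in two different scripts, the good coder stops and builds the tool:

```python
# A made-up example of tool building: expand vendor short interface
# names once, in one place, instead of repeating the logic everywhere.
ABBREVIATIONS = {"Gi": "GigabitEthernet", "Te": "TenGigE", "Fa": "FastEthernet"}

def normalize_ifname(name: str) -> str:
    """Expand a short interface name into its long form."""
    for short, long_form in ABBREVIATIONS.items():
        if name.startswith(short) and not name.startswith(long_form):
            return long_form + name[len(short):]
    return name

# Once the hammer exists, every later nail is a single call:
print(normalize_ifname("Gi0/0/1"))  # GigabitEthernet0/0/1
print(normalize_ifname("Te1/1"))    # TenGigE1/1
```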
So why is coding such an important skill to gain and maintain for the network engineer? This paragraph seems to sum it up nicely for me—
It’s not the coding that matters, it’s “being able to model problems and use computers to solve them.” This is the essence of tool building, or of engineering—seeing the problem, understanding the problem, and then thinking through (sometimes by trial and error) how to build a tool that will solve the problem in a consistent, easy-to-manage way. I fear network engineers are carrying their configuration-centered attitude into automation, using code merely to make configuration and troubleshooting faster. We seem to end up asking “how do I make configuring this network faster?” rather than asking “what business problem am I trying to solve?”
To make effective use of the coding skills we’re telling everyone to learn, we need to go back to basics and understand the problems we’re trying to solve—and the set of possible solutions we can use to solve those problems. Seen this way, the routing protocol becomes “just another tool,” just like a function call, that can be used to solve a specific set of problems—instead of a set of configuration lines that we invoke like a magic incantation to make things happen.
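One small illustration of that shift in mindset, using a made-up topology: model the business question (can these two hosts reach one another?) as a graph problem and let the computer answer it, rather than thinking in lines of configuration:

```python
from collections import deque

# A toy topology model; the device names are hypothetical.
topology = {
    "host-a": ["tor-1"],
    "tor-1": ["host-a", "spine-1"],
    "spine-1": ["tor-1", "tor-2"],
    "tor-2": ["spine-1", "host-b"],
    "host-b": ["tor-2"],
}

def reachable(src: str, dst: str) -> bool:
    """Breadth-first search: is there any path at all from src to dst?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in topology.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

print(reachable("host-a", "host-b"))  # True
```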
Coding skills are important—but they require the right mindset if we’re going to really gain the sorts of efficiencies we think are possible.
