Service Provider Tech Doesn’t Apply?

Service provider problems are not your problems. You should not be trying to solve your problems the same way service providers do.

This seems intuitively true—after all, just about everything about a train or a large over-the-road truck (or lorry) is different from a passenger car. If the train is the service provider network and the car is the “enterprise” network, it seems to be obvious the two have very little in common.

Or is it?

What this gets right is that if an operator sells access to their network, or a single application, their network is likely to be built differently than the more general-purpose designs used in organizations that must support a wide range of applications and purposes. These differences are likely to show up in the choice of hardware, how the network is operated, and the kinds of services offered (or not).

What this gets right is operators who sell access to their networks, or support a single application, always seem to build at a scale far beyond what more general-purpose networks ever reach. Microsoft and Facebook number their servers in the millions, and single purchase orders include thousands of routers. eBay and LinkedIn number their servers in the hundreds of thousands, and their routers and switches in the tens of thousands. How can a small enterprise network of a few hundred servers be anything like these larger networks?

What this gets wrong is assuming none of the technologies, tools, or attitudes from these larger-scale networks is every applicable to the smaller networks many engineers encounter on a day-to-day basis.

All those networks with BGP deployed in their data center fabrics are using technology designed primarily for interconnecting intermediate systems on the default-free zone—in other words, for connecting the networks of transit service providers. All those networks with OSPF deployed are using a link state protocol originally designed to provide edge-to-edge reachability in transit service provider networks. All those networks with IS-IS deployed are using a link state protocol originally designed to provide connectivity to large-scale telephony-style networks.

What about transport technologies? The only transport technologies originally designed specifically for “enterprise use” have long since been replaced by optical technologies designed for large-scale provider or “hyperscale” use. Token Ring and ARCnet are long gone, as is the original shared medium Ethernet, replaced by switched Ethernet largely over optical transport. Even current general WiFi is primarily designed for public operator use cases—look at 5G and WiFi 6 and note how public operator requirements have influenced these technologies.

The truth is there is no “pure” enterprise technology; following the dictum that you should not use “service-provider technologies” in your network would leave you with … no network at all.

There is a second realm where this line of argument falls flat, and its more important than the question of which technologies to use: the techniques and attitudes learned in the operation of truly large-scale networks hold valuable lessons for all network engineers. Should you use a spine and leaf topology in your data center, rather than a more traditional hierarchical design? The answer has nothing to do with scale, and everything to do with flexibility in design and operational agility. Should you automate your network, even if its only ten routers? The answer has nothing to do with what Amazon is doing, and everything to do with how much time you want to spend on configuring and troubleshooting versus responding to real business needs.

Think of it this way: the driver who drives the large over-the-road truck is still going to learn lessons and instincts about driving that will make them a better driver in a minivan.

Come join me at NXTWORK in November to continue the conversation in my master class on building and operating data center fabrics, as I explore how you can apply lessons from the hyperscale world to your network.

Is it planning… or just plain engineering?

Over at the ECI blog, Jonathan Homa has a nice article about the importance of network planning–

In the classic movie, The Graduate (1967), the protagonist is advised on career choices, “In one word – plastics.” If you were asked by a young person today, graduating with an engineering or similar degree about a career choice in telecommunications, would you think of responding, “network planning”? Well, probably not.

Jonathan describes why this is so–traffic is constantly increasing, and the choice of tools we have to support the traffic loads of today and tomorrow can be classified in two ways: slim and none (as I remember a weather forecaster saying when I “wore a younger man’s shoes”). The problem, however, is not just tools. The network is increasingly seen as a commodity, “pure bandwidth that should be replaceable like memory,” made up of entirely interchangeable parts and pieces, primarily driven by the cost to move a bit across a given distance.

This situation is driving several different reactions in the network engineering world, none of which are really healthy. There is a sense of resignation among people who work on networks. If commodities are driven by price, then the entire life of a network operator or engineer is driven by speed, and speed alone. All that matters is how you can build ever larger networks with ever fewer people–so long as you get the bandwidth you need, nothing else matters.

This is compounded by a simple reality–network world has driven itself into the corner of focusing on the appliance–the entire network is appliances running customized software, with little thought about the entire system. Regardless of whether this is because of the way we educate engineers through our college programs and our certifications, this is the reality on the ground level of network engineering. When your skill set is primarily built around configuring and managing appliances, and the world is increasingly making those appliances into commodities, you find yourself in a rather depressing place.

Further, there is a belief that there is no more real innovation to be had–the end of the road is nigh, and things are going to look pretty much like they look right now for the rest of … well, forever.

I want you, as a network engineer, operator, or whatever you call yourself, to look these beliefs in the eye and call them what they are: nonsense on stilts.

The real situation is this: the current “networking industry,” such as it is, has backed itself into a corner. The emphasis on planning Jonathan brings out is valid, but it is just the tip of the proverbial iceberg. There is a hint in this direction in Jonathan’s article in the list of suggestions (or requirements). Thinking across layers, thinking about failure, continuous optimization… these are all… system level thinking, To put this another way, a railway boxcar might be a commodity, but the railroad system is not. The individual over-the-road truck might be a commodity, and the individual road might not be all that remarkable, but the road system is definitely not a commodity.

The sooner we start thinking outside the appliance as network engineers or operators (or whatever you call yourself), the sooner we will start adding value to the business. This means thinking about algorithms, protocols, and systems–all that “theory stuff” we typically decry as being less than usefl–rather than how to configure x on device y. This means thinking about security across the network, rather than as how you configure a firewall. This means thinking about the tradeoffs with implementing security, including what systemic risk looks like, and when the risks are acceptable when trying to accomplish as specific goal, rather than thinking about how to route traffic through a firewall.

If demand is growing, why is the networking world such a depressing place right now? Why do I see lots of people saying things like “there will be no network engineers in enterprises in five years?” Rather than blaming the world, maybe we should start looking at how we are trying to solve the problems in front of us.

Autonomic, Automated, and Reality

Once the shipping department drops the box off with that new switch, router, or “firewall,” what happens next? You rack it, cable it up, turn it on, and start configuring, right? There are access to controls to configure—SSH, keys, disabling standard accounts, disabling telnet—interface addresses to configure, routing adjacencies to configure, local policies to configure, and… After configuring all of this, you can adjust routing in the network to route around the new device, and then either canary the device “in production” (if you run your network the way it should be run), or find some prearranged maintenance time to bring the new device online and test things out. After all of this, you can leave the new device up and running in the network, and move on to the next task.

Until it breaks.

Then you consult the documentation to remind yourself why it was configured this way, consult the documentation to understand how the application everyone is complaining about not working should work, etc. There are the many hours spent sitting on the console gathering information by running various commands and the output of various logs. Eventually, once you find the problem, you can either replace the right parts, or reconfigure the right bits, and get everything running again.

In the “modern” world (such as it is), we think it’s a huge leap forward to stop configuring devices manually. If we can just automate the configuration of all that “stuff” we have to do at the beginning, after the box is opened and before the device is placed into service, we think we have this whole networking thing pretty well figured out.

Even if you had everything in your network automated, you still haven’t figured this networking thing out.

We need to move beyond automation. Where do we need to move to? It’s not one place, but two. The first is we need to move beyond automation to autonomous operation. As an example, there is a shiny new system that is currently being widely deployed to automate the deployment and management of containers. Part of this system is the automation of connectivity, including routing, between containers. The routing system being deployed as part of this system is essentially statically configured policy-based routing combined with network address translation.

Let me point something out that is not going to be very popular: this is a step backwards in terms of making the system autonomous. Automating static routing information is not a better solution than building a real, dynamic, proactive, autonomic, routing system. It’s not simpler—trust me, I say this as someone who has operated large networks which used automated static routes to do everything.

The “opsification of everything” is neat, but it shouldn’t be our end goal.

Now part of this, I know, is the fault of vendors. Vendors who push EGPs onto data center fabrics because, after all, “the configuration complexity doesn’t matter so long as you can automate it.” The configuration complexity does matter, because configuration complexity belies an underlying protocol complexity, and sets up long and difficult troubleshooting sessions that are completely unnecessary.

The second place we need to move in the networking world? The focus on automation is just another form of focusing on configuration. We abstract the configuration, and we touch a lot more devices at once, but we are still thinking about configuration. The more we think about configuration, the less we think about how the system should work, how it really works, what the gaps are, and how to bridge those gaps. So long as we are focused on the configuration, automated or not, we are not focused on how the network can bring value to the business. The longer we are focused on configuration, the less value we are bringing to the business, and the more likely we are to end up being replaced by … an automated system … no matter how poorly that automated system actually works.

And no, the cloud isn’t going to solve this. Containers aren’t going to solve this. The “automated configuration pattern” is already being repeated in the cloud. As more complex workloads are moved into the cloud, the problems there are only going to get harder. What starts out as a “simple” system using policy-based routing analogs and network address translation configured through an automation server will eventually look complex against the hardest problems we had to solve using T1’s, frame relay circuits, inverse multiplexers, wire down patch panels, and mechanical switch crossbar frames. It’s fun to pretend we don’t need dynamic routing to solve the problems that face the network—at least until you hit hard problems, and have to relearn the lessons of the last 20+ years.

Yes, I know vendors are partly to blame for this. I know that, for a vendor, it’s easier to get people to buy into your CLI, or your entire ecosystem, rather than getting them to think about how to solve the problems your business is handing them.

On the other hand, none of this is going to change from the top down. This is only going to change when the average network engineer starts asking vendors for truly simpler solutions that don’t require reams configuration information. It will change when network engineers get their heads out of the configuration and features, and into the business problems.

Stop Using the OSI Model

We all use the OSI model to describe the way networks work. I have, in fact, included it in just about every presentation, and every book I have written, someplace in the fundamentals of networking. But if you have every looked at the OSI model and had to scratch your head trying to figure out how it really fits with the networks we operate today, or what the OSI model is telling you in terms of troubleshooting, design, or operation—you are not alone. Lots of people have scratched their heads about the OSI model, trying to understand how it fits with modern networking. There is a reason this is so difficult to figure out.

The OSI Model does not accurately describe networks.

What set me off in this particular direction this week is an article over at Errata Security:

The OSI Model was created by international standards organization for an alternative internet that was too complicated to ever work, and which never worked, and which never came to pass. Sure, when they created the OSI Model, the Internet layered model already existed, so they made sure to include today’s Internet as part of their model. But the focus and intent of the OSI’s efforts was on dumb networking concepts that worked differently from the Internet.

This is partly true, and yet a bit … over the top. 🙂 OTOH, the point is well taken: the OSI model is not an ideal model for understanding networks. Maybe a bit of analysis would be helpful in understanding why.

First, while the OSI model was developed with packet switching networks in mind, the general idea was to come as close as possible to emulating the circuit-switched networks widely deployed at the time. A lot of thought had gone into making those circuit-switched networks work, and applications had been built around the way they worked. Applications and circuit-switched networks formed a sort of symbiotic relationship, just as applications form with packet-switched networks today; it was unimaginable, at the time, that “everything would change.”

So while the designers of the OSI model understood the basic value of the packet-switched network, they also understood the value of the circuit-switched network, and tried to find a way to solve both sets of problems in the same network. Experience has shown it is possible to build a somewhat close-to-circuit switched network on top of packet switched networks, but not quite in the way, nor as close to perfect emulation, as those original designers thought. So the OSI model is a bit complex and perhaps overspecified, making it less-than-useful today.

Second, the OSI model largely ignored the role of middleboxes, focusing instead on the stacks implemented and deployed in hosts. This, again, makes sense, as there was no such thing as a device specialized in the switching of packets at the time. Hosts took packets in and processed them. Some packets were sent along to other hosts, other packets were consumed locally. Think PDP-11 with some rough code, rather than even an early Cisco CGS.

Third, the OSI model focuses on what each layer does from the perspective of an application, rather than focusing on what is being done to the data in order to transmit it. The OSI model is built “top down,” rather than “bottom up,” in other words. While this might be really useful if you are an application developer, it is not so useful if you are a network engineer.

So—what should we say about the OSI model?

It was much more useful at some point in the past, when networking was really just “something a host did,” rather than its own sort of sub-field, with specialized protocols, techniques, and designs. It was a very good attempt at sorting out what a network needed to do to move traffic, from the perspective of an application.

What it is not, however, is really all that useful for network engineers working within an engineering specialty to understand how to design protocols, and how to design networks on which those protocols will run. What should we replace it with? I would begin by pointing you to the RINA model, which I think is a better place to start. I’ve written a bit about the RINA model, and used the RINA model as one of the foundational pieces of Computer Networking Problems and Solutions.

Since writing that, however, I have been thinking further about this problem. Over the next six months or so, I plan to build a course around this question. For the moment, I don’t want to spoil the fun, or put any half-backed thoughts out there in the wild.

DNS Query Minimization and Data Leaks

When a recursive resolver receives a query from a host, it will first consult any local cache to discover if it has the information required to resolve the query. If it does not, it will begin with the rightmost section of the domain name, the Top Level Domain (TLD), moving left through each section of the Fully Qualified Domain Name (FQDN), in order to find an IP address to return to the host, as shown in the diagram below.

This is pretty simple at its most basic level, of course—virtually every network engineer in the world understands this process (and if you don’t, you should enroll in my How the Internet Really Works webinar the next time it is offered!). The question almost no-one ever asks, however, is: what, precisely, is the recursive server sending to the root, TLD, and authoritative servers?

Begin with the perspective of a coder who is developing the code for that recursive server. You receive a query from a host, you have the code check the local cache, and you find there is no matching information available locally. This means you need to send a query out to some other server to determine the correct IP address to return to the host. You could keep a copy of the query from the host in your local cache and build a new query to send to the root server.

Remember, however, that local server resources may be scarce; recursive servers must be optimized to process very high query rates very quickly. Much of the user’s perception of network performance is actually tied to DNS performance. A second option is you could save local memory and processing power by sending the entire query, as you have received it, on to the root server. This way, you do not need to build a new query packet to send to the root server.

Consider this process, however, in the case of a query for a local, internal resource you would rather not let the world know exists. The recursive server, by sending the entire query to the root server, is also sending information about the internal DNS structure and potential internal server names to the external root server. As the FQDN is resolved (or not), this same information is sent to the TLD and authoritative servers, as well.

There is something else contained here, however, that is not so obvious—the IP address of the requestor is contained in that original query, as well. Not only is your internal namespace leaking, your internal IP addresses are leaking, as well.

This is not only a massive security hole for your organization, it also exposes information from individual users on the global ‘net.

There are several things that can be done to resolve this problem. Organizationally, running a private DNS server, hard coding resolving servers for internal domains, and using internal domains that are not part of the existing TLD infrastructure, can go a long way towards preventing information leaking of this kind through DNS. Operating a DNS server internally might not be ideal, of course, although DNS services are integrated into a lot of other directory services used in operational networks. If you are using a local DNS server, it is important to remember to configure DHCP and/or IPv6 ND to send the correct, internal, DNS server address, rather than an external address. It is also important to either block or redirect DNS queries sent to public servers by hosts using hard-coded DNS server configurations.

A second line of defense is through DNS query minimization. Described in RFC7816, query minimization argues recursive servers should use QNAME queries to only ask about the one relevant part of the FQDN. For instance, if the recursive server receives a query for www.banana.example, the server should request information about .example from the root server, banana.example from the TLD, and send the full requested domain name only to the authoritative server. This way, the full search is not exposed to the intermediate servers, protecting user information.

Some recursive server implementations already support QNAME queries. If you are running a server for internal use, you should ensure the server you are using supports DNS query minimization. If you are directing your personal computer or device to publicly reachable recursive servers, you should investigate whether these servers support DNS query minimization.

Even with DNS query minimization, your recursive server still knows a lot about what you ask for—the topic of discussion on a forthcoming episode of the Hedge, where our guest will be Geoff Huston.

There is Always a Back Door

A long time ago, I worked in a secure facility. I won’t disclose the facility; I’m certain it no longer exists, and the people who designed the system I’m about to describe are probably long retired. Soon after being transferred into this organization, someone noted I needed to be trained on how to change the cipher door locks. We gathered up a ladder, placed the ladder just outside the door to the secure facility, popped open one of the tiles on the drop ceiling, and opened a small metal box with a standard, low security key. Inside this box was a jumper board that set the combination for the secure door.
First lesson of security: there is (almost) always a back door.

I was reminded of this while reading a paper recently published about a backdoor attack on certificate authorities. There are, according to the paper, around 130 commercial Certificate Authorities (CAs). Each of these CAs issue widely trusted certificates used for everything from TLS to secure web browsing sessions to RPKI certificates used to validate route origination information. When you encounter these certificates, you assume at least two things: the private key in the public/private key pair has not been compromised, and the person who claims to own the key is really the person you are talking to. The first of these two can come under attack through data breaches. The second is the topic of the paper in question.

How do CAs validate the person asking for a certificate actually is who they claim to be? Do they work for the organization they are obtaining a certificate for? Are they the “right person” within that organization to ask for a certificate? Shy of having a personal relationship with the person who initiates the certificate request, how can the CA validate who this person is and if they are authorized to make this request?

They could do research on the person—check their social media profiles, verify their employment history, etc. They can also send them something that, in theory, only that person can receive, such as a physical letter, or an email sent to their work email address. To be more creative, the CA can ask the requestor to create a small file on their corporate web site with information supplied by the CA. In theory, these electronic forms of authentication should be solid. After all, if you have administrative access to a corporate web site, you are probably working in information technology at that company. If you have a work email address at a company, you probably work for that company.

These electronic forms of authentication, however, can turn out to be much like the small metal box which holds the jumper board that sets the combination just outside the secure door. They can be more security theater than real security.
In fact, the authors of this paper found that some 70% of the CAs could be tricked into issuing a certificate for just about any organization—by hijacking a route. Suppose the CA asks the requestor to place a small file containing some supplied information on the corporate web site. The attacker creates a web server, inserts the file, hijacks the route to the corporate web site so it points at the fake web site, waits for the authentication to finish, and then removes the hijacked route.

The solution recommended in this paper is for the CAs to use multiple overlapping factors when authenticating a certificate requestor—which is always a good security practice. Another solution recommended by the authors is to monitor your BGP tables from multiple “views” on the Internet to discover when someone has hijacked your routes, and take active measures to either remove the hijack, or at least to detect the attack.

These are all good measures—ones your organization should already be taking.

But the larger point should be this: putting a firewall in front of your network is not enough. Trusting that others will “do their job correctly,” and hence that you can trust the claims of certificates or CAs, is not enough. The Internet is a low trust environment. You need to think about the possible back doors and think about how to close them (or at least know when they have been opened).

Having personal relationships with people you do business with is a good start. Being creative in what you monitor and how is another. Firewalls are not enough. Two-factor authentication is not enough. Security is systemic and needs to be thought about holistically.
There are always back doors.

What’s in your DNS query?

Privacy problems are an area of wide concern for individual users of the Internet—but what about network operators? In this issue of The Internet Protocol Journal, Geoff Huston has an article up about privacy in DNS, and the various attempts to make DNS private on the part of the IETF—the result can be summarized with this long, but entertaining, quote:

 The Internet is largely dominated, and indeed driven, by surveillance, and pervasive monitoring is a feature of this network, not a bug. Indeed, perhaps the only debate left today is one over the respective merits and risks of surveillance undertaken by private actors and surveillance by state-sponsored actors. … We have come a very long way from this lofty moral stance on personal privacy into a somewhat tawdry and corrupted digital world, where “do no evil!” has become “don’t get caught!”

Before diving into a full-blown look at the many problems with DNS security, it is worth considering what kinds of information can leak through the DNS system. Let’s ignore the recent discovery that DNS queries can be used to exfiltrate data; instead, let’s look at more mundane data leakage from DNS queries.

For instance, say you work in a marketing department for a company that is just about to release a new product. In order to build the marketing and competitive materials your sales critters will need to stand in front of customers, you do a lot of research around competitor products. In the process, you examine, in detail, each of the competing product’s pages. Or perhaps you work in a company that is determining whether or another purchasing or merging with another company might be a good idea. Or you are working on a new externally facing application, or component in an existing application, that relies on a new connection point into your network.

All of these processes can lead to a lot of DNS queries. For someone who knows what they are looking for, the pattern of queries may be enough to examine strings queried from search engines and other information, ultimately leading to someone being able to guess a lot about that new product, what company your company is thinking about buying or merging with, what your new application is going to do, etc. DNS is a treasure trove of information at a personal and organizational level.

Operators and protocol designers have been working for years to resolve these problems, making DNS queries “more private;” Geoff Huston’s article provides a good overview of many of these attempts. DNS over HTTPS (DoH), a recent (and ongoing) attempt bears a closer look.

DNS is normally sent “in plain text” over the network; anyone who can capture the packets can read not only the query, but also the responses. The simplest way to solve this problem is to encrypt the DNS data in flight using something like TLS—hence DoT, or DNS over TLS. One problem with DoT is it is carried over a unique port number, which means it is probably blocked by default by most packet filters, and can easily be blocked by administrators who either do not know what this traffic is, or do not want it on their network. To solve this, DoH carries TLS encrypted traffic in a way that makes it look just like an HTTPS session. If you block DoH traffic, you will also block access to web servers running HTTPS. This is the logical “end” of carrying everything else over HTTPS to avoid the impact of stateful and stateless packet filters and the impact of middle boxes on Internet traffic.

The good result is, in fact, that DNS traffic can no longer be “spied on” by anyone outside servers in the DNS system itself. Whether or not this is “enough” privacy is a matter of conjecture, however. Servers within the DNS system can still collect information about what queries you are making; if the server has access to other information about you or your organization, combining this data into a profile, or using it to determine some deeper investigation is warranted by looking at other sources of data, is pretty simple. Ultimately, DoH is only really useful if you trust your DNS provider.

Do you? Perhaps more importantly—should you?

DNS providers are like any other business; they must buy hardware, connectivity, and the time of smart people who can make the system work, troubleshoot the system when it fails, and think about ways of improving the system. If the service is free…

DoH, however, has another problem Geoff outlines in his article—DNS is moved up the stack so it no longer runs over TCP and UDP directly, but rather it runs over HTTPS. This means local applications, like browsers, can run DNS queries independently of the operating system. In fact, because these queries are TLS encrypted, the operating system itself cannot even “see” the contents of these DNS queries. This might be a good thing—or might be a bad thing. If nothing else, it means the browser, or any other application, can choose to use a resolver not configured by the local operating system. A browser maker, for instance, can direct their browser to send all DNS queries made within the browser to their DNS server, exposing another source of information about users (and the organizations they work for).

Remember that time you typed an internal hostname incorrectly in your browser? Thankfully, you had a local DNS server configured, so the query did not go out to a resolver on the Internet. With DoH, the query can go out to an open resolver on the Internet regardless of how your local systems are configured. Something to ponder.

The bottom line is this—the nature of DNS makes it extremely difficult to secure. Somehow you have to have someone operate, and pay for, an open database of names which translate to addresses. Somehow you have to have a protocol that allows this database to be queried. All of these “somehows” expose information, and there is no clear way to hide that information. You can solve parts of the problem, but not the whole problem. Solving one part of the problem seems to make another part of the problem worse.

If you haven’t found the tradeoff, you haven’t looked hard enough.

In the end, though, the privacy of DNS queries at a personal and organizational level is something you need to think about.