I recently joined Ethan Banks for a Packet Pushers episode about the tradeoffs of hiding information in the control plane. This was a terrific show; you can listen to it by clicking on the link below.
Today on the Priority Queue, we’re gonna hide some information. Oh, like route summarization? Sure, like route summarization. That’s an example of information hiding. But there’s much more to the story than that. Our guest is Russ White. Russ is a serial networking book author, network architect, RFC writer, patent holder, technical instructor, and much of the motive force behind the early iterations of the CCDE program.
“News” means “novelty”, not “truth”. In much of the computer networking world, news is what sells products, rather than business need. In turn, novelty is what drives the news. The “straight line” connection, then, is from novelty to news to product, and product manufacturers know this. This is not just a vendor driven problem, however; this is also driven by recruitment, and padding resumes, and many other facets of the networking nerd culture.
On the other hand, novelty is never a good starting place for network design. Rather, network design needs to start with the problems that need to be solved, proceed by considering how those problems can be solved with technologies, then build requirements based on the problems and technologies, and finally consider which products can be used to implement all of this at the lowest long term cost. This is not to say novelty is not useful, or is not justified, but rather that novelty is not the point.
How can you overcome the drive to novelty through the news cycle? Go back to basics. Every “novel” thing you are looking at in the latest news story is something that has been invented and implemented before in a different package, and with a different name. Apply rule 11 liberally to all marketing claims, look for the problem to be solved, push back on the requirements, think systemically, manage your own expectations, and go back to basics.
To a user, “the network” is whatever isn’t on their desk or in their device. This is a point folks who work on the network for a living often forget. Talking to a non-networking person about networking technology is often like talking to someone who commutes on the train about how the train works; it might be interesting, but they often just do not care. There are several implications here: the first is that if your business relies on the network (and most do, whether or not they realize it), as the network engineer, you need to go beyond just making the train work, to helping others understand that why and how the network (the train) runs is important to reaching the overall business goals. There is an entire movement within the networking world that would say: “networks are a commodity, just like the train is, just move the packets and shut up.” I do not tend to agree with this; for a city, a train is not a commodity, it is a vital resource that grows business and interacts with people’s lives. The network is like the train to a city; it might be a commodity for the person riding it, but it is not for the overall business.
There’s no substitute for knowing what you’re doing. But what does it mean to “know what you are doing?” In a large complex system, you can know what is on “your layer,” or “your piece of the system,” plus one or two levels above and below. The rest is rumor and pop psychology.
In a world where there is just too much information, how can you “know what you are doing?” First, you can use rule 11 to your advantage, and realize that everything that is, has been before. If you know the underlying technology, then the implementation is much easier to learn (if you need to learn it at all!). If you know the pattern, then you can see the details much more easily. Second, you can insist on radical simplicity, which will make the process of knowing the entire system much easier. Third, you can intentionally think systematically, and functionally, rather than orienting yourself to products.
BGP is one of the foundational protocols that make the Internet “go;” as such, it is a complex intertwined system of different kinds of functionality bundled into a single set of TLVs, attributes, and other functionality. Because it is so widely used, however, BGP tends to gain new capabilities on a regular basis, making the Interdomain Routing (IDR) working group in the Internet Engineering Task Force (IETF) one of the consistently busiest, and hence one of the hardest to keep up with. In this post, I’m going to spend a little time talking about one area in which a lot of work has been taking place, the building and maintenance of peering relationships between BGP speakers.
The first draft to consider is Mitigating the Negative Impact of Maintenance through BGP Session Culling, which is a draft in an operations working group, rather than the IDR working group, and does not make any changes to the operation of BGP. Rather, this draft considers how BGP sessions should be torn down so traffic is properly drained, and the peering shutdown has the minimal effect possible. The normal way of shutting down a link for maintenance would be for administrators to shut down BGP on the link, wait for traffic to subside, and then take the link down for maintenance. However, many operators simply do not have the time or capability to undertake scheduled shutdowns of BGP speakers. To resolve this problem, graceful shutdown capability was added to BGP in RFC 8326. Not all implementations support graceful shutdown, however, so this draft suggests an alternate way to shut down BGP sessions, allowing traffic to drain, before a link is shut down: use link local filtering to block BGP traffic on the link, which will cause any existing BGP sessions to fail. Once these sessions have failed, traffic will drain off the link, allowing it to be safely shut down for maintenance. The draft discusses various timing issues in using this technique to reduce the impact of link removal due to maintenance (or other reasons).
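To make the filtering idea concrete, here is a minimal sketch of the match a culling filter might apply. It assumes the filter only drops TCP port 179 traffic when both endpoints sit on the peering link, so sessions that merely transit the link are left alone. The function name, the protocol string, and the documentation prefix used as the peering LAN are all hypothetical; a real deployment would express this as an ACL on the device rather than in Python.

```python
import ipaddress

BGP_PORT = 179  # well-known TCP port used by BGP


def is_culled(src, dst, protocol, src_port, dst_port, peering_lan):
    """Return True if the culling filter should drop this packet.

    Hypothetical helper: a packet is dropped only when it is BGP (TCP with
    port 179 on either side) and both addresses are on the peering link.
    """
    lan = ipaddress.ip_network(peering_lan)
    on_link = (
        ipaddress.ip_address(src) in lan and ipaddress.ip_address(dst) in lan
    )
    is_bgp = protocol == "tcp" and BGP_PORT in (src_port, dst_port)
    return on_link and is_bgp
```

With the filter in place, the sessions across the link fail once their hold timers expire, which is why the timing matters so much in this technique.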
Graceful shutdown, itself, is also in line to receive some new capabilities through Extended BGP Administrative Shutdown Communication. This draft is rather short, as it simply allows an operator to send a short freeform message (presumably in text format) along with the standard BGP graceful shutdown notification. This message can be printed on the console, or saved to syslog, to provide an operator with more information about why a particular BGP session has been shut down, whether it is coming back up again, how long the shutdown is expected to last, etc.
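As a rough illustration of how such a message might look on the wire, the sketch below assumes the text rides in the data field of a Cease NOTIFICATION with the Administrative Shutdown subcode, encoded as a one-octet length followed by UTF-8 text. The 255-octet limit, the function names, and the example message are assumptions made for this sketch rather than details taken from the draft as described above.

```python
import struct

NOTIFICATION = 3          # BGP message type for NOTIFICATION
CEASE = 6                 # NOTIFICATION error code "Cease"
ADMIN_SHUTDOWN = 2        # Cease subcode "Administrative Shutdown"
HEADER_LEN = 19           # 16-octet marker + 2-octet length + 1-octet type
MAX_MESSAGE_OCTETS = 255  # assumed limit; one length octet cannot hold more


def build_shutdown_notification(text):
    """Build a NOTIFICATION carrying a freeform shutdown message."""
    payload = text.encode("utf-8")
    if len(payload) > MAX_MESSAGE_OCTETS:
        raise ValueError("shutdown message too long")
    body = struct.pack("!BBB", CEASE, ADMIN_SHUTDOWN, len(payload)) + payload
    header = b"\xff" * 16 + struct.pack("!HB", HEADER_LEN + len(body), NOTIFICATION)
    return header + body


def read_shutdown_message(message):
    """Return the freeform text, or None if this is not an admin shutdown."""
    code, subcode, length = struct.unpack("!BBB", message[HEADER_LEN:HEADER_LEN + 3])
    if code != CEASE or subcode != ADMIN_SHUTDOWN:
        return None
    start = HEADER_LEN + 3
    return message[start:start + length].decode("utf-8", errors="replace")
```

The receiving side would typically hand the decoded string to the console or syslog, for example `read_shutdown_message(build_shutdown_notification("maintenance, back in 30 minutes"))`.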
Graceful Restart (GR) is a long-available feature in many BGP implementations that aims to prevent the disruption of traffic flow; the original purpose was to handle a route processor restart in a router where the line cards could continue forwarding traffic based on local forwarding tables (the FIB), including cases where one route processor fails, causing the router to switch to a backup route processor in the same chassis. Over time, GR began to be applied to NOTIFICATION messages in BGP. For instance, if a BGP speaker receives a malformed message, it is required (by the BGP RFCs) to send a NOTIFICATION, which will cause the BGP session to be torn down and restarted. GR has been adapted to these situations, so traffic flow is either not impacted, or minimally impacted through the NOTIFICATION/session restart process. This same processing takes place for a hold timer timeout in BGP.
The problem is that only one of the two speakers in a restarting pair will normally retain its local forwarding information. The sending speaker will normally flush its local routing tables, and with them its local forwarding tables, on sending a BGP NOTIFICATION. Notification Message support for BGP Graceful Restart changes this processing, allowing both speakers to enter the “receiving speaker” mode, so both speakers would retain their local forwarding information. A signal is provided to allow the sending speaker to indicate the sessions should be hard reset, rather than gracefully reset, if needed.
Finally, BGP allows speakers to send a route with a next hop other than themselves; this is called a third party next hop, and is illustrated in the figure below.
In this network, router C’s best path to 2001:db8:3e8:100::/64 might be through A, but the operator may prefer this traffic pass through B. While it is possible to change the preferences so C chooses the path through B, there are some situations where it is better for A to advertise B as the next hop towards the destination (for instance, a route server would not normally advertise itself as the next hop towards a destination). The problem with this situation is that B might not have the same capabilities as a BGP speaker as A. If B, for instance, cannot forward for IPv6, the situation shown in the illustration would clearly not work.
According to Roman philosophers, simplicity is the hallmark of truth. And yet, networks have become ever more complex over time. Why is this? Because complexity sells. In this short take, I talk about why complexity sells, and some of the mental habits you can use to overcome our natural tendency to prefer the complex.
Link speeds in data center fabrics continue to climb, with 10g, 25g, 40g, and 100g widely available, and 400g promised in just a few short years. What isn’t so obvious is how these higher speeds are being reached. A 100g link, for instance, is really four 25g links bundled as a single link at the physical layer. If the optics are increasing in speed, and the processors are increasing in their ability to switch traffic, why are these higher speed links being built in this way? According to the paper under investigation today, the reason is the speed of the chips that serialize traffic onto and deserialize traffic off the optical medium. The development of the Complementary metal–oxide–semiconductor, or CMOS, chips required to build ever faster optical interfaces seems to have stalled out at around 25g, which means faster speeds must be achieved by bundling multiple lower speed links.
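The bundling arithmetic is simple enough to write down. The sketch below assumes every lane is capped at the 25g figure given above, which real products may not follow; the function and constant names are made up for illustration.

```python
import math

SERDES_LANE_GBPS = 25  # per-lane ceiling described in the text


def lanes_needed(link_gbps, lane_gbps=SERDES_LANE_GBPS):
    """Number of bundled lanes needed to reach a given link speed."""
    return math.ceil(link_gbps / lane_gbps)
```

Under this assumption, `lanes_needed(100)` gives the four lanes mentioned above, and a 400g link would need sixteen.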
Mellette, William M., Alex C. Snoeren, and George Porter. “P-FatTree: A Multi-Channel Datacenter Network Topology.” In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 78–84. HotNets ’16. New York, NY, USA: ACM, 2016. https://doi.org/10.1145/3005745.3005746.
The authors then point out that many data center operators have moved towards some form of chassis device in order to reduce the costs of cabling and optics. Chassis devices most often use some form of spine and leaf internally to switch traffic between the input and output ports across a short run copper fabric, resulting in a switching path within the chassis router that looks something like the following figure.
The internal spine and leaf fabric connecting the switching ASICs is one of the main reasons data center operators move away from chassis devices; the number of hops through the network becomes unstable with the addition of these internal spine and leaf fabrics, the backpressure and quality of service are essentially unmanageable across this fabric on most devices, and there is little in the way of traffic analysis that can be done on this internal fabric. The authors do not address these problems, however.
Rather, they address the added set of switching ASICs in the spine layer of the internal spine and leaf network. As it turns out, the switching ASICs themselves are a major consumer of power, and heat generator, in switches. They argue that removing this internal spine layer would greatly reduce the amount of power required in a fabric, as well as the amount of heat generated. To do this, they propose unbundling the links attached to each SerDes CMOS chip, exposing them as individual links to the control plane. This would allow the switching path to be shortened to something like the figure below.
Exposing the unbundled links to the external control plane allows each stage of the internal fabric to be treated as another hop in the network, and hence for “normal” ECMP to choose the path through the chassis fabric.
The authors suggest the four unbundled links attached to a single switching ASIC can each be treated as a member of a different “switching plane,” which, in effect, creates four virtual topologies across the fabric, each of which carries one quarter of the total fabric bandwidth. Each virtual topology could run its own control plane, producing four somewhat redundant networks, and the ability to steer traffic onto any given plane at the edge of the network for traffic engineering, policy separation, or any other purpose. The result is a fabric that is more flexible in use, while retaining a fixed hop count through the fabric, and reducing the ASIC count in the fabric by around one third.
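One way to picture steering traffic onto a plane at the edge is a per-flow hash, sketched below. This is not the mechanism from the paper, only an illustration under the assumption that a flow is identified by its five-tuple and should stay on one plane so packets are not reordered; the function names and the choice of SHA-256 are arbitrary.

```python
import hashlib

PLANES = 4  # one plane per unbundled link, as described above


def pick_plane(src, dst, src_port, dst_port, protocol, planes=PLANES):
    """Map a flow's five-tuple onto a plane index in the range 0..planes-1."""
    key = f"{src}|{dst}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % planes


def plane_bandwidth(fabric_gbps, planes=PLANES):
    """Each plane carries an equal share of the total fabric bandwidth."""
    return fabric_gbps / planes
```

A policy-driven design would replace the hash with an explicit mapping, for example pinning one class of traffic to a single plane, at the cost of less even load across the four.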
This is an interesting concept, but it would require an entire fabric to be built this way from the ground up; there is little chance of a brownfield deployment of this kind of design. One tradeoff in this kind of design would be the additional control plane state, including assigning four addresses to each host (although this might be mitigated by the clever use of anycast), and the maintenance of four control planes, etc. Another design tradeoff would be the shared risk link groups involved in splitting a single optical fiber and ASIC into four circuits; these aren’t exactly “virtual circuits,” but they share many of the same characteristics.