Architecture and Process

Driving through some rural areas east of where I live, I noticed a lot of collections of buildings strung together being used as homes. The process seems to start when someone takes a travel trailer, places it on blocks (a foundation of sorts) and builds a spacious deck just outside the door. Over time, the deck is covered, then screened, then walled, becoming a room.

Once the deck becomes a room, a new deck is built, and the process begins anew. At some point, the occupants decide they need a place to store some sort of equipment, so they build a shed. Later, the shed is connected to the deck, the whole thing becomes an extension of the living space, and a new shed is built.

These … interesting … places to live are homes to the people who live in them. They are often, I assume, even happy homes.

But they are not houses in the proper sense of the word. There is no unifying theme, no thought of how traffic should flow and how people should live. They are a lot like the paths crisscrossing a campus—built where the grass died.

Our networks are like these homes—they are not houses so much as historical records of every new idea and vendor marketing drive. There is no architecture, there are many architectures strung together with a set of tightly wound and closely followed processes.

We need to support some new application or service? Throw a new overlay on top. There was a massive failure last night? Let’s spend hours closely examining our process and find some way to prevent the failure by adding a few new steps.

We never ask if our goals are realistic because we don’t have any goal beyond: “Let’s solve this problem right now.” We never ask if there is some future goal might be better served by using this solution or that—the future will take care of itself.

Why do we fail to attend to architecture?

Architecture is hard, and we often fail to correctly anticipate the future. This perceives architecture as a detailed plan—but there’s no reason it should be. An architecture can be a rough, and slow-changing, outline of how the network is laid out, a set of services the network supports, and a set of technologies the network will use to support those services. An architecture recognizes and defines limits as well as capabilities.

Processes are comforting. When things fail, we can always take comfort in saying: “I followed the process!”

We live in a culture of now. All problems take two hours, two days, two weeks, or too long. There is no history, there is no future, there is only an ever-present now. If I cannot have it now, it is not worth having at all.

These problems are hard to solve because they are cultural rather than technical—and the network engineering world has a strong bias towards “don’t tell me how it works, tell me how to configure it.” We present this as a problem-solving mentality, even though it causes more problems than it solves.

We need to rebalance the way we think about architectures and processes—perhaps we would get better results by combining lightweight architectures with lightweight processes, instead of relying on heavy processes with no architecture to build maintainable networks and sustainable lives.

Simple or Complex?

A few weeks ago, Daniel posted a piece about using different underlay and overlay protocols in a data center fabric. He says:

There is nothing wrong with running BGP in the overlay but I oppose to the argument of it being simpler.

One of the major problems we often face in network engineering—and engineering more broadly—is confusing that which is simple with that which has lower complexity. Simpler things are not always less complex. Let me give you a few examples, all of which are going to be controversial.

When OSPF was first created, it was designed to be a simpler and more efficient form of IS-IS. Instead of using TLVs to encode data, OSPF used fixed-length fields. To process the contents of a TLV, you need to build a case/switch construction where each possible type a separate bit of code. You must count off the correct length for the type of data, or (worse) read a length field and count out where you are in the stream.

Fixed-length fields are just much easier to process. You build a structure matching the layout of the fixed-length fields in memory, then point this structure at the packet contents in-memory. From there, you can just use the structure’s contents to directly access the data.

Over time, however, as new requirements have been pushed into IGPs, OSPF has become much more complex while IS-IS has remained relatively constant (in terms of complexity). IS-IS went through a bit of a mess when transitioning from narrow to wide metrics, but otherwise the IS-IS we use today is the same protocol we used when I first started working on networks (back in the early 1990s).

OSPF’s simplicity, in the end, did not translate into a less complex protocol.

Another example is the way we transport data in BGP. A lot of people do not know that BGP’s original design allowed for carrying information other than straight reachability in the protocol. BGP speakers can negotiate multiple sessions, with each session carrying a different kind of information. Rather than using this mechanism, however, BGP has consistently been extended using address families—because it is simpler to create a new address family than it is to define a new kind of data parallel with address families.

This has resulted in AFs that are all over the place, magic numbers, and all sorts of complexity. The AF solution is simpler, but ultimately more complex.

Returning to Daniel’s example, running a single protocol for underlay and overlay is simpler, while running two different protocols is less simple. However, I’ve observed—many times—that running different protocols for underlay and overlay is less complex.

Why? Daniel mentions a couple of reasons, such as each protocol has a separate purpose, and we’re pushing features into BGP to make it serve the role of an IGP (which is, in the end, going to cause some major outages—if it hasn’t already).

Consider this: is it easier to troubleshoot infrastructure reachability separately from vrf reachability? The answer is obvious—yes! What about security? Is it easier to secure a fabric when the underlay never touches any attached workload? Again—yes!

We get this tradeoff wrong all the time. A lot of times this is because we are afraid of what we do not know. Ten years ago I struggled to convince large operators to run BGP in their networks. Today no-one runs anyone other than BGP—and they all say “but we don’t have anyone who knows OSPF or IS-IS.” I’ve no idea what happened to old-fashioned network engineering. Do people really only have one “protocol slot” in their brains? Can people really only ever learn one protocol?

Or maybe we’ve become so fixated on learning features that we no longer no protocols?

I don’t know the answer to these questions, but I will say this—over the years I’ve learned that simpler is not always less complex.

Hedge 144: IPv6 Lessons Learned

We don’t often do a post-mortem on the development and deployment of new protocols … but here at the Hedge we’re going to brave these deep waters to discuss some of the lessons we can learn from the development and deployment of IPv6, especially as they apply to design and deployment cycles in the “average network” (if there is such at thing). Join us as James Harr, Tom Ammon, and Russ White consider the lessons we can learn from IPv6’s checkered history.

download

Route Servers and Loops

From the question pile: Route servers (as opposed to route reflectors) don’t change anything about a BGP route when re-advertising it to a peer, whether iBGP or eBGP. Why don’t route servers cause routing loops (or other problems) in a BGP network?

Route servers are often used by Internet Exchange Points (IXPs) to distribute routes between connected BGP speakers. BGP route servers

  • Don’t change anything about a received BGP route when advertising the route to its peers (other BGP speakers)
  • Don’t install routes received through BGP into the local routing table

Shouldn’t using route servers in a network—pontentially, at least—cause routing loops or other BGP routing issues? Maybe a practical example will help.

Assume b, e, and s are all route servers in their respective networks. Starting at the far left, a receives some route, 101::/64, and sends it on to b,, which then sends the unmodified route to c. When c receives traffic destined to 101::/64, what will happen? Regardless of whether these routers are running iBGP or eBGP, b will not change the next hop, so when c receives the route, a is still the next hop. If there’s no underlying routing protocol, c won’t know how to reach A, so it will ignore the route and drop the traffic. Even if there is an underlying routing protocol, c’s route to 101::/64’s route passes through b, and b isn’t installing any routing information learned from BGP into its local routing table (because it’s a route server). b is going to drop traffic destined to 101::/64.

We can solve this simple problem by adding a new link between the two clients of the route server, as shown in the center diagram. Here, d sends 101::/64 to e, which then sends the unchanged route to g. Since g has a direct connection to d, we can assume g will send traffic destined to 101::/64 directly to d, where it will be forwarded to the destination. Why wouldn’t d and g peer directly instead of counting on e to carry routes between them? In most cases this kind of indirect peering is done to increase network scale. If there are thousand routes like d and g, it will be simpler for them all to peer to e than to build a full mesh of connections.

Why not use a route reflector rather than a route server in this situation? Route reflectors can only be used to carry routes between iBGP peers. If d, e, and g are all in different autonomous systems, route reflectors cannot be used to solve this problem.

But this brings us back to the original question—route reflectors use the cluster list to prevent loops within an AS (the cluster list is similar in form and function to the AS path carried between autonomous systems, but it uses router ID’s rather than AS numbers to describe the path)?

If you have multiple route servers connected to one another you can, in fact, form routing loops.

In this network, a is sending 101::/64 to b, which is then sending the route, unmodified, to e. Because of some local policy, e is choosing the path through a, which means e forwards traffic destined to 101::/64 to c. At the same time, e is advertising 101::/64 to b, which is then sending the route (unmodified) to a, and a is choosing the path through c. In this case, a permanent (persistent) routing loop is formed through the control plane, primarily because no single BGP speaker has a complete view of the topology. The two route servers, by hiding the real path to 101::/64, makes is possible to form a routing loop.

The deploy route servers without forming these kinds of loops—

  • BGP speakers learning routes from route servers should be directly connected—there should not be destinations reachable via some “hidden” intermediate hop
  • Route servers should send all the routes they learn from clients; they should not use bestpath to choose which routes to send to clients

These restrictions prevent routing loops from forming when deploying route servers—but they also restrict the use of route servers to situations like carrying routes between BGP speakers connected to a single fabric.

Cisco filed a patent some time back describing a method to prevent routing loops when using BGP route servers; it makes interesting reading for folks who want to dive a little deeper.

RFC9199: Lessons in Large-scale Service Deployment

While RFC9199 (are we really in the 9000’s?) is targeted at large-scale DNS deployments–specifically root zone operators–so it might seem the average operator won’t find a lot of value here.

This is, however, far from the truth. Every lesson we’ve learned in deploying large-scale DNS root servers applies to any other large-scale user-facing service. Internally deployed DNS recursive servers are an obvious instance, but the lessons here might well apply to a scheduling, banking, or any other multi-user application accessed from a lot of places by a lot of different users. There are some unique points in DNS, such as the relatively slower pace of database synchronization across nodes, but the network-side lessons can still be useful for a lot of applications.

What are those lessons?

First, using anycast dramatically improves performance for these kinds of services. For those who aren’t familiar with the concept, anycase turns an IP address into a service identifier. Any host with a copy (or instance) or a given service advertises the same address, causing the routing table to choose the (topologically) closest instance of the service. If you’re using anycast, traffic destined to your service will automatically be forwarded to the closest server running the application, providing a kind of load sharing among multiple instances through routing. If there are instances in New York, California, France, and Taipei, traffic from users in North Carolina will be routed to New York and traffic from users in Singapore will be routed to Taipei.

You can think of an anycast address something like a cell tower; users within a certain desintance will be “captured” by a particular instance. The more copies of the service you deploy, the smaller the geographic region the service will support. Hence you can control the number of users using a particular copy of the service by controlling the number and location of service copies.

To understand where and how to deploy service instances, create anycast catchment maps. Again, just like a wifi signal coverage map, or a cellphone tower coverage map, it’s important to understand which users will be directed to which instances. Using a catchment map will help you decide where new instances need to be deployed, which instances need the fastest links and hardware, etc. The RIPE ATLAS pobes and looking glass servers are good ways to start building such a map. If the application supports a large number of users, you might be able to convince the application developer to include some sort of geographic information in requests to help build these maps.

Third, when deploying service instance, pay as much attention to routing and connectivity as you do the number of instances deployed. As the authors note, sometimes eight instances will provide the same level of service as several thousand instances. The connectivity available into each instance of the service–bandwidth, delay, availability, etc.–still has a huge impact on service speed.

Fourth, reduce the speed at which the database needs to be synchronized where possible. Not every piece of information needs to be synchronized at the same rate. The less data being synchronized, the more consistent the view from multiple users is going to be.

RFC9199 is well worth reading, even for the average network engineer.

Hedge 134: Ten Things

One of the many reasons engineers should work for a vendor, consulting company, or someone other than a single network operator at some point in their career is to develop a larger view of network operations. What are common ways of doing things? What are uncommon ways? In what ways is every network broken? Over time, if you see enough networks, you start seeing common themes and ideas. Just like history, networks might not always be the same, but the problems we all encounter often rhyme. Ken Celenza joins Tom Ammon, Eyvonne Sharp, and Russ White to discuss these common traits—ten things I know about your network.

download

Revisiting BGP Convergence

My video on BGP convergence elicited a lot of . . . feedback, mainly concerning the difference between convergence in a data center fabric and convergence in the DFZ. Let’s begin here—BGP hunt and the impact of the MRAI are very real in the DFZ. Withdrawing a route can take several minutes.

What about the much more controlled environment of a data center fabric?

Several folks pointed out that the MRAI is often set to 0 in DC fabrics (and many implementations by default). Further, almost all implementations will use an MRAI of 0 for the first received update, holding the second and subsequent advertisements by the MRAI. Several folks also pointed out that all the paths through a DC fabric are the same length, so the second part of the equation is also very small.

These are good points—how do they impact BGP convergence? Let’s use the network below, a small slice of a five-stage butterfly fabric, to think it through. Assume every router is in a different AS, so all the peering sessions are eBGP.

Start with A losing its connection to 101::/64—

  • T1: A withdraws its route from B and C
  • T2: B withdraws its route from D and E, C withdraws its route from F and G
  • T3: D and E withdraw their routes from H, F and G withdraw their routes from K
  • T4: H and K withdraw their routes from L

Note that L cannot receive one withdraw to remove the route from its local table; it must receive withdraws from both H and K. There’s no way at L to tell whether a withdraw from H means 101::/64 is no longer reachable at all or it is no longer reachable through H. For path-vector protocols, like distance-vector, the neighbor through each path must be considered independently.

What does an MRAI of 0 do? Each of the routers in the network will process the withdraw as soon as they receive it and send a withdraw to their peers as soon as they’re done processing it. The process still takes the same number of steps but each step is much faster.

What is the impact of all the paths’ equal length? So long as every router processes the withdraw at around the same speed, there is no hunt. If H and K send their withdraws simultaneously, L should receive them simultaneously and remove the route to 101::/64 from its table rather than switching from one path to the other. Even if they send their withdraws at different times, L removes entries from its ECMP table until it receives the last withdrawal.

If MRAI slows down convergence, why set it to anything other than 0? Because it’s improbable that every router in the network will process each withdraw simultaneously.

Before 101::/64 is withdrawn, H will be using the paths through D and E for ECMP, but it is only going to be advertising one of these two routes to L—say the path through E. When B sends withdraws to D and E, assume E processes the withdraw just a little faster than D. When H receives D’s withdraw, it will send an implicit withdraw to L, updating the AS path to include D rather than E. A few moments later, D sends a withdraw. H processes this withdraw and sends a withdraw to L.

L has received one implicit withdraw and one withdraw from H because of processing time differentials. In a larger fabric, with a much larger fan-out, the likelihood of differences in timing is much higher and spread across a broader range of possibilities. You can (generally) expect H to send about half as many implicit withdraws as it has paths towards the destination before sending an actual withdraw. If there are eight paths between B and H, H would likely send 3 or 4 implicit withdraws before sending a withdraw.

What if the MRAI were set to 1 second at H? H would receive E’s withdrawal and set the MRAI timer. Assuming D’s withdraw arrives within that 1-second MRAI, H will receive D’s withdraw, squash the implicit withdraw, and send a single withdraw to L instead. Setting the MRAI to something other than 0 reduces the number of updates and reduces processing.

Setting the MRAI to 1 second, and forcing it to trigger across all updates, might improve convergence time—or not. Without experimenting with setting the MRAI to different values at different places in a real network, it is hard to know. Replacing the routers, link speeds, changing processor load, and increasing memory can all have an impact on the “best” settings for optimal convergence.

the bottom line

There will be no hunt in BGP convergence in a network with multiple equal-length/equal-cost paths. This is what we should expect. Because the maximum path length minus the best (current) path length will always be 0, the network will converge as quickly as each router can process and advertise withdraws, bounded by the MRAI.

Setting the MRAI to 0 improves convergence speed at the cost of additional updates, especially in wide fan-out data center fabrics. It’s hard to know whether setting the MRAI to 0 or 1 will give you better convergence speeds; you have to try it to see.

I still think we should be moving away from BGP as our underlay protocol in all but the largest data center fabrics. IGPs (like IS-IS and RIFT) will converge more quickly, are easier to configure and manage, and using different protocols for the underlay and overlay breaks up failure and security domains in useful ways. I know I’m tilting at a windmill on this point, but still …