Rethinking BGP on the DC Fabric (part 3)

The first post on this topic considered some basic definitions and the reasons why I am writing this series of posts. The second considered the convergence speed of BGP on a dense topology such as a DC fabric, and what mechanisms we normally use to improve BGP’s convergence speed. This post considers some of the objections to worrying about slow convergence: that convergence speed is not important, and that ECMP with high fanouts will take care of any convergence speed issues. The network below will be used for this discussion.

Two servers are connected to this five-stage butterfly: S1 and S2. Assume, for a moment, that some service is running on both S1 and S2. This service is configured in active-active mode, with all data synchronized between the servers. If some fabric device, such as C7, fails, traffic destined to either S1 or S2 across that device will be very quickly (within tens of milliseconds) rerouted through some other device, probably C6, to reach the same destination. This will happen no matter what routing protocol is being used in the underlay control plane, so why does BGP’s convergence speed matter? Further, if these services are running in the overlay, or they are designed to discover failed servers and adjust accordingly, it would seem like the speed at which the underlay converges just does not matter.
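To make the ECMP argument concrete, here is a minimal Python sketch of local repair on a single switch. It assumes a hypothetical two-member ECMP group (C6 and C7) toward S1, made-up flow tuples, and a SHA-256 hash standing in for whatever hash real forwarding hardware uses; none of these details come from the network diagram.

```python
import hashlib

def pick_next_hop(flow, next_hops):
    """Hash a flow tuple onto one member of the ECMP group."""
    if not next_hops:
        raise ValueError("no next hops left for this destination")
    # Stand-in hash; real switches use a hardware hash, not SHA-256.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical ECMP group and flows (assumptions for illustration only).
ecmp_group = ["C6", "C7"]
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(8)]

# Before the failure, flows are spread across both members.
before = {flow: pick_next_hop(flow, ecmp_group) for flow in flows}

# C7 fails: the switch removes it from the group without waiting for
# the routing protocol, and every flow rehashes onto what is left.
surviving = [hop for hop in ecmp_group if hop != "C7"]
after = {flow: pick_next_hop(flow, surviving) for flow in flows}
```

The point of the sketch is that the repair depends only on local failure detection, which is why this case looks the same under any underlay routing protocol.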

Consider, however, the case where the services running on S1 and S2 are both reachable through an eVPN overlay with tunnel tail-ends landing on the ToR switch through which each server connects to the fabric. Applications accessing these services, for this example, either access the service via a layer 2 MAC address or through a single (anycast) IP address representing the service, rather than any particular instance. To make all of this work, there would be one tunnel tail-end landing on A8, and another landing on E8.

Now what happens if A8 fails? For the duration of the underlay control plane convergence, the tunnel tail-end at A8 will appear to be reachable to the overlay. Thus the overlay tunnel will remain up and keep carrying traffic to a black hole on one of the routers adjacent to A8. In the case of a service reachable via anycast, the application can react in one of two ways: it can fail any operations taking place during the underlay’s convergence, or it can wait. Remember that one second is an eternity in the world of customer-facing services, and that BGP can easily take up to one second to converge in this situation.
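A rough way to see the cost of this window is to model it directly. The sketch below is only an illustration: the traffic rate of 10,000 packets per second is an assumed number, and the two function names are hypothetical. The one-second and tens-of-milliseconds figures are the ones used in this post.

```python
def overlay_view(t, failure_time, underlay_convergence):
    """What the overlay believes about the A8 tail-end at time t (seconds)."""
    if t < failure_time:
        return "up"
    if t < failure_time + underlay_convergence:
        # The underlay has not yet withdrawn the route to A8.
        return "up (stale): traffic is black-holed"
    return "down: overlay can move traffic to E8"

def black_holed_packets(packets_per_second, convergence_seconds):
    """Packets sent into the tunnel while A8 still looks reachable."""
    return round(packets_per_second * convergence_seconds)

# Assumed load of 10,000 packets per second toward the A8 tail-end.
lost_slow = black_holed_packets(10_000, 1.0)    # one-second convergence
lost_fast = black_holed_packets(10_000, 0.05)   # 50 ms convergence
```

Under these assumptions the slow case drops 10,000 packets and the fast case drops 500, so the loss scales linearly with the underlay's convergence time.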

A rule of thumb for network design: it’s the worst case, not the best case, that controls network performance.

The convergence speed of the underlay leaks through to the state of the overlay. The questions that should pop into your mind about right now are: can you be certain this kind of situation does not happen in your current network, can you be certain it will never happen, and can you be certain it will not have an impact on application performance? I don’t see how the answer to those questions can be yes. The bottom line: convergence speed should not be left out of the equation when building a DC fabric. There may be times when you control the applications, and hence can push the complexity of dealing with slow convergence to the application developers, but this seems like a vanishingly small number of cases. Further, is pushing the problem of slow convergence onto the application developer really optimal?

My take on the argument that convergence speed doesn’t matter, then, is that it doesn’t hold up under deeper scrutiny.

As I noted when I started this series, I’m not arguing that we should rip BGP out of every DC fabric … instead, what I’m trying to do is to stir up a conversation and to get my readers to think more deeply about their design choices, and how those design choices work out in the real world.


  1. JeffT on 15 February 2021 at 5:03 pm

    Hey Russ,

    there’s more to that, the above is not necessarily always correct 🙂

    If S1 is dual-homed (if it isn’t – there’s no convergence) to A7 and A8, convergence speed in many cases is subject to overlay control plane convergence (EVPN in this case). If ESI-LAG (ESI != 0) is in use (L2), on failure of S1=>A8 the withdrawal of RT-1 will cause sending routers to remove A8 from the ES group and update forwarding not to send L2 traffic to S1 over A8. Modern EVPN implementations usually give RT-1/ES highest priority, so they don’t get queued behind other BGP updates. So convergence speed could depend on how fast A7 detects that A8 went away, rather than on pure BGP underlay convergence. The L3 case depends on how L3 multi-homing has been implemented. Let’s take basic IP recursion: S1 L3 will be known over A8 and A7, every sender will recursively load-share ECMP bundle->VTEP->underlay next-hop, and convergence (removal of A8 from the ECMP bundle) would be triggered by:
    1. underlay convergence (VTEP removal)
    2. overlay convergence A8=>S1 removal