Getting to the point of dual homing
I wonder how many times I’ve seen this sort of diagram across the many years I’ve been doing network design?
It’s usually held up as an example of how clever the engineer running the network is about resilience. “You see,” the diagram asserts, “I’m smart enough to purchase connectivity from two providers, rather than one.”
Can I point something out? Admittedly it might not be all that obvious from the diagram, but… Reality is just about as likely to squish your network’s connectivity like a bug on a windshield as it is any other network’s. Particularly if both of these connections are in the same regional area. The tricky part is knowing, of course, what a “regional area” might happen to mean for any particular provider.
The problem with this design is very basic, and tied to the concept of shared risk link groups. But let me start someplace a little simpler than that, with the basic, and important, point that putting fiber in the ground, and maintaining fiber that’s in the ground, is expensive. Unless you live in Greenland, fiber can be physically buried pretty easily (fiber in Greenland is generally buried with dynamite by a blasting crew, or run through conduit that’s bolted to the surface of the ubiquitous rock). But it’s not the burying that costs a lot of money; it’s the politics.
To bury a cable, you must get a right of way. Getting a right of way could well be very expensive in any given city. I remember encountering one particular situation where the land under consideration was owned, in theory, by a railroad. Well, it was close enough to an old station that it must have been. But it took several years of looking through old piles of paper to find the correct paper trail and figure out who, precisely, actually owned the land in a legally provable way. This is not a task for the faint of heart.
What has this to do with the image above? A lot, actually. It’s so expensive to install last-mile fiber that providers often share it. To explain, let’s look at a small picture, just below, that might be helpful.
This is the way many providers actually build their last mile. There is (normally a pair of) fiber ring(s), with a set of ROADMs at key locations in the region (ROADM actually means “randomly dropping all de traffic that matters,” but don’t tell anyone, it’s a secret). When a customer is connected to the network, they are assigned a lightwave on the fiber that carries their traffic, from the customer edge device, over a virtual layer 2 circuit (generally point-to-point, but not always), to a central office or exchange point. Here the different lightwaves are split up and handed to different providers through good old-fashioned routing. One provider normally owns the fiber, and other providers lease wavelengths, or bandwidth, etc., to reach customers in the region.
Looking at this second image, you might be able to see what the problem is with the first. It’s possible (actually probable; in fact, I’ve seen it happen in real life) that a single backhoe fade within the same region will take out both providers’ circuits at the same time.
The problem here isn’t really the lack of diversity. Rather, it’s that the lack of diversity is hidden through the magical abstraction of virtualization. Two logical circuits that share the same fate because they both run on the same physical media, by the way, form what is called a Shared Risk Link Group (SRLG). Providers aren’t likely to tell you when you’re at risk from this sort of problem for several reasons.
First, telling you who leased fiber from whom is bad business. Second, they may not actually know enough about their competitors to point this problem out. Third, it’s really in their business interest to try to convince you not to do this, but rather to buy all your upstream from them.
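The shared-risk idea can be made concrete with a small sketch. It assumes you somehow have a map from each logical circuit to the physical segments it rides on, which is exactly the information that is usually hidden from you. The circuit and segment names here are invented for illustration, and shared_risk_groups is a hypothetical helper, not part of any provider’s tooling.

```python
from collections import defaultdict

# Hypothetical inventory: each logical circuit mapped to the physical
# segments (conduits, fiber spans) it rides on. All names are made up.
circuits = {
    "provider_a": ["ce-to-ring", "ring-east", "co-handoff-a"],
    "provider_b": ["ce-to-ring", "ring-east", "co-handoff-b"],
}

def shared_risk_groups(circuits):
    """Return {segment: sorted circuit names} for segments used by 2+ circuits."""
    users = defaultdict(set)
    for name, segments in circuits.items():
        for segment in segments:
            users[segment].add(name)
    # A segment carrying more than one circuit puts those circuits in one SRLG.
    return {seg: sorted(names) for seg, names in users.items() if len(names) > 1}

groups = shared_risk_groups(circuits)
for segment, members in groups.items():
    print(f"{segment}: shared by {', '.join(members)}")
```

In this made-up inventory the two “diverse” providers share both the drop to the ring and the ring segment itself, so one cut on either takes out both circuits.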
So—what can you do about this?
If you’re going to connect to two providers, try to do so in two different regions. This is often difficult, as you don’t really know where the regions are, and connecting two sites that provide backup for one another across multiple regional rings can be a challenge for geographical reasons.
One alternative here is to connect to a local exchange point (an IXP), and from their fabric to the various providers. While the IXP will likely lease their circuits from others, they will have a much better idea of where the cables physically run, and how to provide diverse circuits (but only if you know what you’re asking for).
Another alternative is to simply stick with a single provider, and insist on physical diversity in any resilient links. This plays into the provider’s hand of trying to get you to buy from a single source, but it gets around the problem of trying to figure out what cable is where, and who uses what (information you’re not generally going to be able to find anyway), and puts it on the shoulders of the provider—who does know, at least for their network.
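Whichever option you choose, the thing to ask for is the same: no single physical segment should appear in every path. As a rough sketch of that check, assuming a provider or IXP will actually hand you the physical path of each circuit (the path data and the fatal_segments helper below are hypothetical):

```python
def fatal_segments(paths):
    """Return the set of segments whose loss would take down every circuit.

    `paths` maps a circuit name to the list of physical segments it uses.
    An empty result means no single cut can isolate the site.
    """
    if not paths:
        return set()
    segment_sets = [set(segments) for segments in paths.values()]
    return set.intersection(*segment_sets)

# Invented example data: two circuits sharing one building entrance,
# and two circuits with physically separate entrances and rings.
same_entrance = {
    "primary": ["entrance-north", "ring-1", "co-1"],
    "backup": ["entrance-north", "ring-2", "co-2"],
}
diverse = {
    "primary": ["entrance-north", "ring-1", "co-1"],
    "backup": ["entrance-south", "ring-2", "co-2"],
}

print(fatal_segments(same_entrance))
print(fatal_segments(diverse))
```

The first case reports the shared entrance as a single point of failure even though the rings and central offices differ; the second reports nothing. The check is only as good as the path data behind it, which is the real difficulty.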
The next time you think you’ve solved the resilience problem by quickly and easily dual homing, remember shared risk, and remember to look for the deeper problem that’s been hidden away through an abstraction—an abstraction that far too often is leaky.
Well, if you chose two or more providers using different technologies (DSL, fiber and/or cable for example, wifi and/or 3g also), you are probably lessening your downtime risk considerably.
You are probably correct, particularly if you’re using two different technologies across two different providers. It’s always possible the two different technologies still share fate in the middle someplace. For instance, perhaps that 5G wireless is actually on the same metro ring as the metro Ethernet… So it’s a better chance, but not “perfect.”
Thanks for the comment!
Concur if you’re talking about a sizeable business trying to connect a campus to the Internet through two independently failing paths.
When I see this picture, however, my first thought is of my home internet, which is used by my kids for gaming (not to mention me for work) and is therefore seen as mission critical by at least part of the family. We have had DSL service from a very small regional phone company for about 15 years. Over that period we have experienced:
* some OAM tool got turned on which pinged our router (which was stealthed) every few minutes, and blocked all traffic in both directions on our link for a timeout period. (After lots of Wireshark sleuthing, the answer to this one was telling the router it was allowed to respond to pings.)
* DHCP leases down in the couple of minutes range, with the response time of the DHCP server in some poor router’s supervisor up in the 15 second range, and of course all of the traffic blocked during those 15 seconds
* DNS outages of hours (DNS is at an IP which does not belong to the local telco, so presumably contracted out). I assume that some denial-of-service (or how many DNS references per x hours had been paid for) counter tripped, and they didn’t just forget to pay the DNS bill.
* A failed DSL modem on Friday to which the response was “our office will be open on Monday and you can come by and pick up a replacement then. Oh, and don’t forget the failed unit, which you don’t own.”
* And most recently, an outage of several hours because some curbside box had a hardware failure (this I can deal with) followed by 4 months of dropping maybe 10 seconds worth of packets every 5 minutes. This causes Lync calls to drop, particularly when presenting, not just games. Best I can tell the unit was left misconfigured and trying to auto fail back to another path every 5 minutes. For 4 months. With tech support people who understood phone wires but not IP routing condescendingly telling me nothing was wrong. That earned a scathing letter to the CEO of the parent company.
In all of these circumstances, the vendor considered our DSL fully up and operational…
So yes, if I were rich (or even not staggering from my kids’ college costs) you bet I’d be dual homed between these guys and the-cable-company-which-must-not-be-named. Being able to keep working after some idiot trips over the power cord…
I didn’t even think about personal ‘net connections — the DNS server and large buffers are the most common problems I see on my home connection. Thanks for stopping by and commenting!