Research ‘net: Decoding Firepath
While it is true that huge scale is a different mindset, and not just “more of the same only bigger,” there are also a lot of lessons to learn by looking at how truly large-scale networks are built. In this vein, Google released a paper explaining the evolution of their network. While the hardware bits are interesting, I’ve been working in control planes for a very long time, so the control plane piece was what I found intriguing. The paper is pretty bare on details, but there are a few paragraphs describing Firepath, their control plane. Here’s my take on decoding Firepath. Let’s start with what we know from the paper itself.
“First, all switches are configured with the baseline or intended topology. The switches learn actual configuration and link state through pair-wise neighbor discovery. Next, routing proceeds with each switch exchanging its local view of connectivity with a centralized Firepath master, which redistributes global link state to all switches. Switches locally calculate forwarding tables based on this current view of network topology. To maintain robustness, we implement a Firepath master election protocol.”
Let’s decode this a bit to see what they’re probably actually doing here. First, they’re using a link state protocol. They make this much clearer later in the paper, where they discuss building a link state database (LSD) that’s distributed to all the routers on the fabric.
One point here: I really don’t like our recent conversion to calling anything that handles packets “switches.” It used to be that routers switched packets in hardware and rewrote the packet header, while switches handled packets in hardware and didn’t rewrite the packet header. We seem to have taken the path of, “if it does things in hardware, it’s a switch; if it does things in software, it’s a router.” This doesn’t seem like a useful distinction to me, as just about everything of any scale does things in hardware, leaving us with no way to tell the difference between a device that’s rewriting the header on packets and one that’s not. Ugh. I prefer to use the term router for devices that rewrite headers, no matter how they do it—software, hardware, or by passing the packets through a magic bean.
So each router is prepopulated with a view of the topology (because the topology is relatively constant). Each router then discovers its neighbors (they’ve added quite a bit to neighbor discovery here that’s interesting), and then reports the state of its links to a centralized controller on the fabric that is dynamically elected. To put this in more understandable terms, this is what it sounds like to me:
They’re using something like IS-IS (my guess would be IS-IS, because it’s much easier to modify than OSPF), but they’ve extended the Designated Intermediate System concept to the fabric level, rather than the broadcast domain level.
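To make the division of labor concrete, here’s a minimal sketch of the flow as I’m reading it, assuming my interpretation above is right. Everything in it is invented for illustration—the class names, the data structures, the link costs—and it is in no way Google’s implementation; it just shows local neighbor discovery, a report of purely local link state to an elected fabric-level master, redistribution of the merged database, and a local SPF run on each switch.

```python
# Toy sketch of a Firepath-style flow as I read it: local discovery,
# report to an elected master, master redistributes the global link state
# database (LSD), and each switch runs SPF locally. All names and
# structures here are invented for illustration.
import heapq


class Switch:
    def __init__(self, name, intended_neighbors):
        self.name = name
        # Baseline/intended topology, pushed at configuration time.
        self.intended_neighbors = intended_neighbors  # neighbor -> link cost
        self.live_links = {}  # neighbor -> cost, learned via discovery
        self.lsd = {}         # global LSD, installed by the master

    def discover_neighbors(self, up_links):
        # Pair-wise neighbor discovery: which intended links are actually up.
        self.live_links = {
            nbr: cost
            for nbr, cost in self.intended_neighbors.items()
            if nbr in up_links
        }

    def local_report(self):
        # Only local state goes to the master; no fabric-wide flooding.
        return {self.name: dict(self.live_links)}

    def install_lsd(self, lsd):
        self.lsd = lsd

    def compute_reachability(self):
        # Plain Dijkstra over the redistributed LSD; a real implementation
        # would also track next hops to build the forwarding table.
        dist = {self.name: 0}
        heap = [(0, self.name)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue
            for nbr, cost in self.lsd.get(node, {}).items():
                if d + cost < dist.get(nbr, float("inf")):
                    dist[nbr] = d + cost
                    heapq.heappush(heap, (d + cost, nbr))
        return dist


class FirepathMaster:
    """Elected, fabric-level collector: roughly a DIS for the whole fabric."""

    def __init__(self):
        self.lsd = {}

    def ingest(self, report):
        # Merge one switch's local view into the global database.
        self.lsd.update(report)

    def redistribute(self, switches):
        for sw in switches:
            sw.install_lsd(dict(self.lsd))
```

The thing to notice is what crosses the fabric: one small report per switch going up, and one copy of the merged database coming back down. The SPF run never leaves the box.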
This is actually a neat idea, in that it drops flooding on the fabric to almost nothing, and flooding is the one point where link state protocols really don’t scale all that well. In fact, there were similar efforts with link state protocols in the mobile ad hoc space, both in modifying the way the DIS/DR acted and in electing a remote DIS/DR. I’ve created a local copy of a paper on this line of thinking from an old issue of IPJ (Cisco has apparently dumped the old archive, as the links don’t work, which is something I need to email Ole about).
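Just to put a rough number on how much flooding this removes, here’s a quick back-of-envelope comparison. The fabric size and port count are made up, and the flooding count is a worst-case approximation (roughly one LSP copy per link), not anything measured from the paper.

```python
# Back-of-envelope: copies of a single link state update crossing the fabric.
# The numbers are invented for illustration, not taken from the paper.
switches = 1000              # hypothetical fabric size
ports_per_switch = 32        # hypothetical average port count

links = switches * ports_per_switch // 2  # rough undirected link count

# Classic flooding: in the worst case an LSP is re-sent on roughly every link.
flooding_copies = links

# Master-based scheme: one report up from the originator, plus one copy of the
# updated database redistributed to each switch.
master_copies = 1 + switches

print(flooding_copies, master_copies)  # 16000 vs. 1001
```

Even with generous assumptions, the flooding side grows with the number of links, while the master-based side grows with the number of switches; on a dense fabric, that gap is exactly the scaling problem a fabric-level DIS is attacking.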
What can we learn from this design? While I’m certain Google has moved on from this design, we can still learn a couple of things of interest.
First, local discovery and calculation is still faster and more efficient than centralized. We live in a world where centralized control has become the cat’s meow. While I think centralized control has its uses, it’s important for the networking industry not to let the pendulum swing all the way to the other side. We’re going to need to learn to centralize what makes sense, and distribute what makes sense, rather than counting on a single solution to solve all our problems. I still argue that centralized policy with distributed topology discovery and calculation of base reachability is going to be the best blend, one that leads to the best scaling and performance characteristics. Each network might need a slightly different blend of centralized and decentralized, but we’re not going to build anything less complex, or simpler to operate, by simply tossing the problems over the cubicle walls to our coder friends. Another way to put this is: the CAP theorem is still in effect, and the speed of light (combined with serialization delays and the like) is still an important factor. Energy is still an important factor as well, though we often act like power is “just there.”
Overall, this paper, from late last year, is a fascinating read. While we might not have Google’s scale, we can still learn lessons from their work along the way.