The Floating Point Fix

Floating point is not something many network engineers think about. In fact, when I first started digging into routing protocol implementations in the mid-1990s, I discovered one of the tricks you needed to remember when trying to replicate the router's metric calculation was to always round down. EIGRP, like most of the rest of Cisco's IOS, was originally written for processors that did not perform floating point operations. The silicon and processing time costs were just too high.

What brings all this to mind is a recent article on the problems with floating point performance over at The Next Platform by Michael Feldman. According to the article:

While most programmers use floating point indiscriminately anytime they want to do math with real numbers, because of certain limitations in how these numbers are represented, performance and accuracy often leave something to be desired.

For those who have not spent a lot of time in the coding world, a floating point number is one that has some number of digits after the decimal point. While integers are fairly easy to represent and calculate over in the binary form processors use, floating point numbers are much more difficult, because most of them cannot be represented exactly in binary. The number of bits you have available to represent the number makes a very large difference in accuracy. For instance, if you try to store the number 101.1 in a float, you will find the number actually stored is 101.099998. To store 101.1 more accurately, you need a double, which is twice as long as a float.
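A quick way to see this is to store the same value in both formats. This is just a minimal C sketch, assuming a platform with typical IEEE 754 float and double types:

```c
#include <stdio.h>

int main(void) {
    float  f = 101.1f;  /* 32-bit IEEE 754 single precision */
    double d = 101.1;   /* 64-bit IEEE 754 double precision */

    /* On a typical platform this prints 101.099998 for the float
       and 101.100000 for the double. Neither is exact; the double
       is just close enough that the error hides below six digits. */
    printf("float:  %f\n", f);
    printf("double: %f\n", d);
    return 0;
}
```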

Okay—this all might be fascinating, but who cares? Scientists, mathematicians, and … network engineers do, as a matter of fact. First, carrying around doubles to store numbers with higher precision means a lot more network traffic. Second, when you start looking at timestamps and large amounts of telemetry data, the efficiency and accuracy of number storage becomes a rather big deal.
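To put a rough number on the first point, here is a back-of-the-envelope C sketch (the one-million-sample export is a hypothetical, not a figure from the article): moving each value from a float to a double doubles the payload before any headers are added.

```c
#include <stdio.h>
#include <stddef.h>

int main(void) {
    /* Hypothetical telemetry export: one million samples per interval. */
    size_t samples = 1000000;

    printf("as floats:  %zu bytes\n", samples * sizeof(float));  /* 4,000,000 */
    printf("as doubles: %zu bytes\n", samples * sizeof(double)); /* 8,000,000 */
    return 0;
}
```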

Okay, so the current floating point storage format, called IEEE 754, is inaccurate and rather inefficient. What should be done about this? According to the article, John Gustafson, a computer scientist, has been pushing for the adoption of a replacement called posits. Quoting the article once again:

It does this by using a denser representation of real numbers. So instead of the fixed-sized exponent and fixed-sized fraction used in IEEE floating point numbers, posits encode the exponent with a variable number of bits (a combination of regime bits and the exponent bits), such that fewer of them are needed, in most cases. That leaves more bits for the fraction component, thus more precision.

Did you catch why this is more efficient? Because it uses a variable length field. In other words, posits replace a fixed field structure (like what was originally used in OSPFv2) with a variable length field (like what is used in IS-IS). While you must eat some space in the format to mark where the variable-length field ends, the amount of "unused space" in the current fixed format overwhelms the space spent this way, resulting in an improvement in accuracy. Further, many numbers that require a double today can be carried in the size of a float. Not only does using a TLV-style format increase accuracy, it also increases efficiency.
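To make the variable-length idea concrete, here is a toy posit decoder. It is only a sketch of the decoding rules described in the article, using an 8-bit posit with no exponent bits (es = 0) for simplicity; it is not code from the article or from Gustafson's reference implementation. The regime is a run of identical bits ended by one opposite bit, so however many bits the regime consumes, everything left over becomes fraction:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Toy decoder for an 8-bit posit with es = 0 (no exponent bits):
 * a sign bit, then a variable-length regime, then the fraction. */
double decode_posit8(uint8_t p) {
    if (p == 0x00) return 0.0; /* zero */
    if (p == 0x80) return NAN; /* Not-a-Real */

    int sign = p >> 7;
    if (sign) p = (uint8_t)(0 - p); /* negatives decode via two's complement */

    /* The regime is a run of identical bits ended by the opposite bit. */
    int first = (p >> 6) & 1;
    int run = 0, i = 6;
    while (i >= 0 && ((p >> i) & 1) == first) { run++; i--; }
    if (i >= 0) i--;                /* skip the terminating bit */
    int k = first ? run - 1 : -run; /* regime value: the power of two */

    /* With es = 0, every remaining bit is fraction. A long regime
     * steals bits from the fraction; a short regime leaves more. */
    int frac_bits = i + 1;
    int frac = p & ((1 << frac_bits) - 1);
    double fraction = 1.0 + (double)frac / (double)(1 << frac_bits);

    double v = ldexp(fraction, k); /* (1 + fraction) * 2^k */
    return sign ? -v : v;
}

int main(void) {
    /* 0x68 = 0b01101000: regime "110" -> k = 1, fraction 0.5, value 3.0 */
    printf("%g\n", decode_posit8(0x68));
    return 0;
}
```

The terminating regime bit plays the same role the length field plays in a TLV: it tells the parser where one variable-length field ends and the next begins, at the cost of a bit that carries no numeric value itself.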

From the perspective of the State/Optimization/Surface (SOS) tradeoff, there should be some increase in complexity somewhere in the overall system—if you have not found the tradeoffs, you have not looked hard enough. Indeed, what we find is an increase in the amount of state being carried in the data channel itself, along with additional code that knows how to deal with this new way of representing numbers.

It's always interesting to find situations in other information technology fields where discussions parallel to discussions in the networking world are taking place. Many times, you can see people encountering the same design tradeoffs we see in network engineering and protocol design.