Data Gravity and the Network

One “sideways” place to look for value in the network is somewhere that initially seems far away from infrastructure: data gravity. Data gravity is not something you often think about directly when building or operating a network, but it is something you think about indirectly. Speeds and feeds, quality of service, and convergence time, for instance, are all side effects, in one way or another, of data gravity.

As with all things in technology (and life), data gravity is not one thing but two: one good and one bad. And there are tradeoffs, because if you haven’t found the tradeoffs, you haven’t looked hard enough. All of this is, in turn, related to the CAP theorem.

Data gravity is, first, a relationship between applications and data location. Suppose you have a set of applications that contain customer information, such as forms of payment, a record of past purchases, physical addresses, and email addresses. Now assume your business runs on three applications. The first analyzes past order information and physical location to try to predict potential interests in current products. The second fills in information on an order form when a customer logs into your company’s ecommerce site. The third is used by marketing to build and send email campaigns.
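
As a concrete illustration, the record all three of these hypothetical applications contend over might look something like the sketch below; the field names and structure are invented for the example, not taken from any particular system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CustomerRecord:
    """One customer record shared (hypothetically) by all three applications."""
    customer_id: str
    email: str
    shipping_address: str
    payment_methods: List[str] = field(default_factory=list)
    past_purchases: List[str] = field(default_factory=list)

# Each application touches the same record in a different way:
#   analytics  -> reads past_purchases + shipping_address to predict interest
#   storefront -> reads/writes shipping_address + payment_methods at checkout
#   marketing  -> reads email + past_purchases to build campaigns
```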

Think about where you would store this information physically, especially if you want the information used by all three applications to be as close to accurate as possible in near-real time. The CAP theorem dictates that if you partition the data, say by keeping one copy of the customer address information in an on-premises fabric for analysis and another in a public cloud service for the ecommerce front end, you are going to have to live with either locking one copy of the record or the other while an update is in flight, or with one of the two copies being out of date while it is being used (the database can either have a real-time locking system or be eventually consistent). If you really want the data used by all three applications to be accurate in near-real time, the only solution is to run all three applications on the same physical infrastructure, so there can be a single, unified copy of the data.
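
A minimal sketch of that choice, with invented names and two in-memory copies standing in for the on-premises fabric and the public cloud: the writer can either lock the record across both copies (readers always see current data, but they wait), or update the local copy and let the remote copy lag (readers never wait, but they may see stale data).

```python
import threading
import time

class ReplicatedRecord:
    """Toy two-copy record illustrating the tradeoff described above."""

    def __init__(self, value):
        self.copies = {"on_prem": value, "cloud": value}
        self.lock = threading.Lock()

    def write_consistent(self, value):
        # Option 1: hold the lock across both copies; no reader ever sees a
        # stale record, but readers block while the update is in flight.
        with self.lock:
            self.copies["on_prem"] = value
            time.sleep(0.05)  # stand-in for cross-site replication delay
            self.copies["cloud"] = value

    def write_eventual(self, value):
        # Option 2: update the local copy now and replicate later; readers
        # never block, but a read against "cloud" may return stale data
        # until replication catches up.
        self.copies["on_prem"] = value
        threading.Timer(0.05, lambda: self.copies.update({"cloud": value})).start()

    def read(self, site):
        with self.lock:
            return self.copies[site]

record = ReplicatedRecord({"address": "old"})
record.write_eventual({"address": "new"})
print(record.read("cloud"))  # may still print the old address for ~50 ms
```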

This causes the first instance of data gravity. If you want the ordering front end to work well, you need to unify the data stores, and move them physically close to the application. If you want the other applications that interact with this same data to work well, you need to move them to the same physical infrastructure as the data they rely on. Data gravity, in this case, acts like a magnet; the more data that is stored in a location, the more applications will want to be in that same location. The more applications that run on a particular infrastructure, the more data will be stored there, as well. Data follows applications, and applications follow data. Data gravity, however, can often work against the best interests of the business. The public cloud is not the most efficient place to house every piece of data, nor is it the best place to do all processing.

Building and operating a network that moves data efficiently minimizes the impact of the CAP theorem, and with it the impact of this kind of data gravity.
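
A rough, back-of-the-envelope way to see why the network matters here (the numbers below are illustrative, not measured): if the storefront issues several sequential queries against a remote copy of the data, the latency of every page render scales directly with the round-trip time to that data.

```python
def page_latency_ms(queries_per_page: int, rtt_ms: float, server_ms: float = 1.0) -> float:
    """Sequential queries: each query pays one full round trip plus server time."""
    return queries_per_page * (rtt_ms + server_ms)

# Illustrative numbers only: five sequential queries per page render.
print(page_latency_ms(5, rtt_ms=0.5))   # data in the same fabric:     7.5 ms
print(page_latency_ms(5, rtt_ms=40.0))  # data a cloud region away:  205.0 ms
```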

The second instance of data gravity is sometimes called Kai-Fu Lee’s Virtuous Cycle, or the KL-VC. Imagine the same set of applications, only now look at the analysis part of the system more closely. Clearly, if you knew more about your customers than their location, you could perform better analysis, know more about their preferences, and be able to target your advertising more carefully. This might (or might not—there is some disagreement over just how fine-grained advertising can be before it moves from useful to creepy and productive to counterproductive) lead directly to increased revenues flowing from the ecommerce site.

But obtaining, storing, and processing this data means moving this data as well—a point not often considered when thinking about how rich a shopping experience “we might be able to offer.” Again, the network provides the key to moving the data so it can be obtained, stored, and processed.
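
The cost of “just moving it” is easy to underestimate. Here is a quick, purely illustrative calculation of how long a bulk transfer takes at different link speeds (the data size, link speeds, and efficiency factor are assumptions for the example):

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move data_tb terabytes over a link_gbps link at the given efficiency."""
    bits = data_tb * 8 * 10**12                  # terabytes -> bits (decimal units)
    usable_bps = link_gbps * 10**9 * efficiency
    return bits / usable_bps / 3600

# Illustrative numbers only.
print(round(transfer_hours(10, 1), 1))    # 10 TB over 1 Gbps  -> ~27.8 hours
print(round(transfer_hours(10, 10), 1))   # 10 TB over 10 Gbps -> ~2.8 hours
```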

At first glance, this might all still seem like commodity stuff—it is still just bits on the wire that need to be moved. Deeper inspection, however, reveals that this simply is not true. It matters where data lives and how applications need to use it, what tradeoffs CAP imposes on that data, and how the way the company does business would change if that data could be moved more quickly and effectively. Understanding which data flow has real business impact, and when, is a gateway into understanding how to add business value.

If you could say to your business leaders: let’s talk about where data is today, where it needs to be tomorrow, and how we can build a network where we can consciously balance network complexity, network cost, and application performance, would that change the game at all? What if you could say: let’s talk about building a network that provides minimal jitter and fixed delay for defined workloads, and yet ties data quickly to public clouds so application developers can choose what is best for any particular use case?

This topic is a part of my talk at NXTWORK 2019—if you’ve not yet registered to attend, right now is a good time to do so.
