On differential privacy

Over the past several weeks, there’s been a lot of talk about something called “differential privacy.” What does this mean, how does it work, and is it really going to be effective? The basic concept is this: the reason people can identify you, personally, from data collected off your phone, searches, web browser configuration, computer configuration, etc., is that you do things just differently enough from other people to create a pattern through cyberspace (or rather, through your data exhaust). Someone looking hard enough can figure out who “you” are from patterns you don’t even think about: you always install the same sorts of software and plugins, you always take the same path to work, you always make the same typing mistakes, and so on.

The idea behind differential privacy, considered by Bruce Schneier here, and also here and here, is that you can inject noise into the data collection process that doesn’t degrade the quality of the data for its intended use, while preventing any particular individual from being identified. If this nut can be cracked, it would be a major boon for online privacy, and it is a nut that deserves some serious cracking.
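To make the noise-injection idea concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query (the dataset, names, and epsilon value are illustrative, not drawn from any particular deployment). A counting query has sensitivity 1, so Laplace noise scaled to 1/epsilon hides any single person’s presence while keeping the aggregate roughly accurate:

```python
import math
import random

def laplace_sample(scale):
    """Draw from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer "how many records match?" with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true count by at most 1, so Laplace noise with
    scale 1/epsilon masks any individual's contribution.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Example: count users over 30 without exposing whether any one user is present.
ages = [25, 37, 41, 52, 29, 33]
noisy = private_count(ages, lambda a: a > 30, epsilon=1.0)
```

The smaller epsilon is, the larger the noise and the stronger the guarantee: the aggregate stays useful while any single record’s influence is drowned out, which is exactly the “noise that doesn’t impact the intended use” being promised.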

But I doubt it can actually be cracked, for three reasons.

First, in context, differential privacy is a form of data abstraction. And anyone who’s paying attention knows that from summary aggregates to protocol layers, abstractions leak. This isn’t a bad thing or a good one, it’s just the way it is—the only way to truly prevent personally identifiable information from leaking through an information gathering process is to detach the data from the people entirely by making the data random.

Which brings up the second problem: the point of gathering all this data is to be able to predict what you, as a person, are going to do. In fact, the point of big data isn’t just to predict, but to shape and influence. As folks from Google have a habit of asserting, the point is to reach absolute knowledge by making the sample under study the entire population.

The promise of differential privacy is that you can take the information and shape it in such a way as to predict what all females of a certain heritage, of a certain age, and in a certain life position will do in reaction to a specific stimulus, so advertisers can extract more value from these people (and, in the background, the part that no one wants to talk about, so that other folks can control their attitudes and behaviors so they do “the right thing” more often). If you follow this train of thought, it’s obvious that the more specific you get, the more predictive power and control you have. There’s not much point in “the flu project” if my doctor can’t predict whether I, personally, will catch the flu this year. The closer you can get to that individual prediction, the more power data analytics has.

Why look at everyone when you can focus on a certain gender? Why focus on everyone of a certain gender when you can focus on everyone of a certain gender who has a particular heritage? There doesn’t appear to be any definable point where you can stand athwart the data collection process and say, “beyond this point, no new value is added.” At least no obvious place. The better the collection, the more effective targeting is going to be. As a commenter on Bruce Schneier’s post above says—

The more information you intend to “ask” of your database, the more noise has to be injected in order to minimize the privacy leakage. This means that in DP there is generally a fundamental tradeoff between accuracy and privacy, which can be a big problem when training complex ML models.

We’re running into the state versus optimization pair in the complexity triangle here; there’s no obvious way out of the dilemma.
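The commenter’s tradeoff can be sketched numerically. Under sequential composition, k queries answered from the same dataset must share one privacy budget, so each query gets epsilon/k and its noise scale grows linearly with the number of questions asked. This is a toy illustration: the even budget split below is the simplest possible scheme, not how any production system necessarily allocates it.

```python
import math
import random

def laplace_sample(scale):
    """Draw from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def answer_queries(true_answers, total_epsilon):
    """Answer k sensitivity-1 queries under one total privacy budget.

    Sequential composition: running k Laplace mechanisms at epsilon/k
    each costs total_epsilon overall, so the per-query noise scale is
    k/total_epsilon. Ask more questions, and every answer gets fuzzier.
    """
    k = len(true_answers)
    per_query_scale = k / total_epsilon  # grows linearly with k
    return [a + laplace_sample(per_query_scale) for a in true_answers]

# One query at epsilon=1.0 gets noise scale 1; twenty queries each get scale 20.
few = answer_queries([100], total_epsilon=1.0)
many = answer_queries([100] * 20, total_epsilon=1.0)
```

Holding the budget fixed, every additional question degrades the accuracy of all of them, which is exactly why training a complex ML model, which effectively asks millions of questions of the data, runs into this wall.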

Which brings me to the third point: someone still has to hold the original data in order to detune it when answering specific questions. Whoever holds the data ultimately controls the accuracy of the questions other people can ask of it, while reserving full accuracy for themselves, and hence a business advantage over their rivals. To some degree, and it might just be my cynicism showing, this sort of thing seems aimed as much at competitors as at actually “solving” privacy.

I have great hopes that we can eventually find a way to stand athwart data collection and yell “stop” at the right moment. But I don’t know whether we’ve really figured out what that moment is, nor whether we understand human nature well enough to keep people from sticking their hands in the cookie jar, exploiting the power of leaky abstractions rather than respecting the limits.

It’s an interesting idea, I just don’t know how far it will really go.