Saturday, March 13, 2010

The End of Anonymity, The Beginning of Privacy

Contrary to what the movie Fight Club may have you believe, you actually are a unique snowflake. For evidence of this, just take a look at the Netflix Prize dataset, which contains about 500 thousand records (one per user) with about 200 movie ratings per record. Statistically speaking, 90% of the records do not have a single other record that is more than 30% similar to them. In other words, the vast majority of Netflix users have rated a set of movies that nobody else comes close to matching.
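To make "30% similar" concrete, here is a minimal sketch on made-up data. It assumes similarity means Jaccard overlap between the sets of movies two users rated; that choice, the function names and the toy records are all my own assumptions, and the measure behind the 90% figure may well be different (for example, it may also take the ratings themselves into account).

```python
def jaccard(a, b):
    """Share of movies two users have in common, out of all movies either rated."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fraction_isolated(records, threshold=0.3):
    """Fraction of records with no other record more than `threshold` similar."""
    users = list(records)
    isolated = 0
    for u in users:
        others = (v for v in users if v != u)
        if all(jaccard(records[u], records[v]) <= threshold for v in others):
            isolated += 1
    return isolated / len(users) if users else 0.0


# Hypothetical toy data: user id -> set of rated movies.
toy = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"A", "B", "C", "E"},  # shares 3 of 5 distinct movies with u1
    "u3": {"F", "G", "H"},
    "u4": {"I", "J", "A"},
}

print(fraction_isolated(toy))
```

The pairwise loop is quadratic in the number of users, so this only works for toy inputs; doing the same on half a million records would need something smarter.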

Netflix, being very concerned about the privacy of its users, removed or anonymized so-called personally identifying information (name, email, age, address, etc.) before releasing the dataset. However, the very notion of personally identifying information is flawed. In fact, all information can be personally identifying. That is, any piece of information could be used to identify an individual.

Taking a look at the Netflix dataset again, we find that on average, knowing two movies a person has rated is enough to narrow the candidate records down to eight, and four movies is all it takes to uniquely identify one record. Another way to put it: if you know just four movies that your friend has rated and you know that he is in the Netflix dataset, then it's very likely that you can find his record and learn the other movies that he has rated. This could be embarrassing (gay porn) or dangerous (politically charged movies) for him. In fact, people have carried out attacks like this by linking the "anonymized" Netflix data with publicly available IMDb rating data to learn the identities of several "anonymized" Netflix users.
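The narrowing-down described above can be sketched in a few lines. This is only an illustration on invented data: the records, the `candidates` helper and the exact-match rule are my assumptions, and a real linkage attack would have to tolerate ratings that are slightly off or missing.

```python
def candidates(records, known_ratings):
    """Return ids of records that agree with every (movie, rating) the attacker knows."""
    return [
        record_id
        for record_id, ratings in records.items()
        if all(ratings.get(movie) == rating for movie, rating in known_ratings.items())
    ]


# Hypothetical "anonymized" dataset: record id -> {movie: rating}.
toy = {
    "r1": {"Fight Club": 5, "Heat": 4, "Alien": 3, "Brazil": 5},
    "r2": {"Fight Club": 5, "Heat": 4, "Alien": 2},
    "r3": {"Fight Club": 5, "Heat": 2, "Brazil": 5},
    "r4": {"Fight Club": 3, "Alien": 3},
}

# The attacker learns one rating at a time and watches the candidate set shrink.
known = {}
for movie, rating in [("Fight Club", 5), ("Heat", 4), ("Alien", 3)]:
    known[movie] = rating
    print(len(known), "known ratings ->", candidates(toy, known))
```

Once only one candidate is left, everything else in that record (here, the Brazil rating in r1) is revealed even though no name or email was ever in the data.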

Previous definitions of privacy that were based on personally identifying information (quasi-identifiers) were flawed for this reason. k-Anonymity (syntactically transforming the dataset so that every combination of quasi-identifier values appears in at least k records) does not guarantee privacy, because an attacker can use attributes that were never treated as quasi-identifiers, such as movie ratings. Privacy is not a property of the data, but a property of the computation carried out on the data. A better definition of privacy is differential privacy. Differential privacy basically means that including or not including any particular record has no significant effect on the distribution of the computation's result. In other words, your privacy has nearly the same chance of being violated whether you participate in the computation or not.
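One standard way to get a differentially private computation is the Laplace mechanism: answer a counting query with noise drawn from a Laplace distribution whose scale is 1/epsilon, since adding or removing one record changes a count by at most 1. The sketch below is a minimal version of that idea, not a production implementation; the function names, the toy records and the choice of epsilon are my own assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=random):
    """Noisy count of matching records; a count has sensitivity 1, so scale = 1/epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records: the set of movies each user rated.
with_you = [{"Fight Club"}, {"Heat"}, {"Fight Club", "Alien"}, {"Brazil"}]
without_you = with_you[:-1]  # the same dataset minus one participant


def rated_fight_club(record):
    return "Fight Club" in record


print(private_count(with_you, rated_fight_club, epsilon=0.5))
print(private_count(without_you, rated_fight_club, epsilon=0.5))
```

A smaller epsilon means more noise and a stronger guarantee. Whether or not your record is in the dataset, the probability of seeing any particular answer changes by a factor of at most e^epsilon, which is the sense in which the result does not depend significantly on you.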

[Reference Video]
