Anonymization’s Murky Waters
Data experts aim to balance privacy risk, research potential
Anonymization at its simplest is the idea that removing personally identifiable information, such as Social Security numbers and addresses, from a data set leaves you with clean, risk-free information to store, examine and distribute without the fear of privacy breach or burden of compliance. Sleep tight; it’s anonymized.
But the details turn out to be remarkably difficult, says Chris Clifton, associate professor of computer science at Purdue University. Clifton leads a project researching anonymization methods and their limitations, which the National Science Foundation recently bolstered with a $1.5 million grant.
At stake is private and public researchers’ ability to better learn from the streams of data that flow endlessly through our society. “There’s a lot of value in that data. A lot of good things can be done with it,” Clifton says. A DVD distributor last year awarded a $1 million prize to a team that improved its movie-recommendation formula by 10 percent using a scrubbed data set. If similar strides could be made with healthcare, education and other data, successful anonymization could unlock information for everyone’s gain. But this can only happen if data holders can share information without risking privacy.
Much of the complexity for data stewards and policymakers stems from a need to understand what really constitutes personally identifiable data. Clearly, a customer’s credit-card number is worthy of strict protection. But what about the fact that some unnamed ISP subscriber searched the Web for “the best season to visit Italy”? Or a list of favorite movies, scrubbed of personal details? In two unrelated 2006 cases, anonymized data sets released for noble research purposes fueled public awareness of the creepily complete pictures seemingly meaningless data can paint. New York Times reporters tracked down an ISP subscriber based only on a list of search queries, and University of Texas researchers wrote an algorithm to personally identify a DVD distribution service user who also used imdb.com, based only on their movie ratings.
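To make the linkage risk concrete, here is a minimal sketch of how a scrubbed ratings record might be matched against public profiles. It is not the University of Texas researchers' algorithm; the function names, user IDs, film titles and ratings are all hypothetical, and the scoring rule (counting identically rated titles) is a deliberately simple assumption.

```python
def overlap_score(anon_ratings, public_ratings):
    """Count the titles that carry the same rating in both records."""
    return sum(
        1
        for title, stars in anon_ratings.items()
        if public_ratings.get(title) == stars
    )


def best_match(anon_ratings, public_profiles):
    """Return (name, score) for the public profile with the highest overlap."""
    scored = [
        (overlap_score(anon_ratings, ratings), name)
        for name, ratings in public_profiles.items()
    ]
    score, name = max(scored)
    return name, score


# Hypothetical "anonymized" record: no name, only an opaque ID and ratings.
scrubbed = {"user_4821": {"Film A": 5, "Film B": 2, "Film C": 4}}

# Hypothetical public profiles, e.g. ratings posted under a real name.
public = {
    "alice": {"Film A": 5, "Film B": 2, "Film C": 4, "Film D": 3},
    "bob": {"Film A": 3, "Film C": 4},
}

match = best_match(scrubbed["user_4821"], public)
print(match)
```

Even this naive count singles out one profile in the toy data, which is the point: a handful of seemingly harmless ratings can act as a fingerprint once a second, named data set is available for comparison.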