[SMR3] A gentle introduction to differential privacy
Disclaimer: This blog is part of the Harvard Kennedy School course requirement with Professor David Eaves. All views expressed in this blog post are solely for the purposes of an assignment and should only be interpreted in this context.
Introduction
There is growing awareness about the amount of data that is concentrated in the hands of a few large corporations and how that data is being aggregated in uncomfortable ways.[1] These corporations have frequently failed to uphold the public’s expectations when guarding that data and determining how it can be used. At the same time, there has been a growing number of data breaches that diminish a company’s reputation in the eyes of its users. The IBM 2020 Data Breach Report states that lost business due to diminished reputation was the highest cost of a data breach, with a yearly average of $1.52 million.[2] In addition, as countries develop more automated techniques to check for GDPR and CCPA compliance, the average fines for breaches of data privacy regulation are expected to increase. Both factors have driven an increased focus on adopting privacy-preserving techniques such as differential privacy to handle and analyze user data, be it at Google, Apple, the U.S. Census Bureau, or Uber.[3]
Motivation
There is growing backlash against the vast amounts of data private companies and governments are collecting on people. The immense amount of data collected also increases the chances of “linkage attacks” such as the one demonstrated by HKS Professor Latanya Sweeney. In 1997, Professor Sweeney used a publicly available voter registration list from Cambridge, MA, and an anonymized dataset from the Group Insurance Commission (GIC) to successfully re-identify Governor William Weld’s health records.[4] The medical record data from GIC had been anonymized correctly. “However, there was only one record in the dataset that matched Weld’s gender, ZIP code, and date of birth — obtained from voter registration records. Traditional anonymization techniques such as removing columns containing personally identifiable information or data masking can be susceptible to reidentification.”[5] Attacks like this have, in turn, prompted stricter regulatory scrutiny[6] and the passage of new privacy laws.[7] Yet our recent past also shows that the benefits of publicly available data often outweigh the costs, since data powers innovation. For example, “Public access to sensitive health records sped up the development of the messenger-RNA coronavirus vaccines produced by Moderna and Pfizer. Better economic data could vastly improve policy responses to the next crisis.”[8]
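To see concretely how a linkage attack of the kind Sweeney demonstrated works, the sketch below joins a synthetic “public voter roll” with a synthetic “anonymized” medical table on shared quasi-identifiers. The names, values, and column names are hypothetical, but the mechanics mirror her approach.

```python
# A linkage attack on two tiny synthetic tables. All names, values, and
# column names here are hypothetical; real attacks join much larger public
# and "anonymized" datasets in exactly the same way.
import pandas as pd

# Public voter roll: names are present, as in real registration lists.
voters = pd.DataFrame([
    {"name": "A. Smith", "zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "B. Jones", "zip": "02139", "birth_date": "1980-01-15", "sex": "F"},
])

# "Anonymized" medical records: direct identifiers removed, but the
# quasi-identifiers (ZIP code, birth date, sex) were left in for research use.
medical = pd.DataFrame([
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "condition X"},
    {"zip": "02139", "birth_date": "1990-03-02", "sex": "F", "diagnosis": "condition Y"},
])

# Joining on the shared quasi-identifiers re-attaches a name to a record
# whenever that combination is unique in both tables.
reidentified = voters.merge(medical, on=["zip", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])
```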
Another example of the transformative power of open data is the Human Genome Project, which aimed to map the entire sequence of human DNA between 1990 and 2005. A study by Stanford economist Heidi Williams showed that the Human Genome Project spurred much more innovation than a contemporaneous gene-sequencing project by a private company. On the flip side, a lack of data on firm financial health led the Department of the Treasury to overspend on the Coronavirus Aid, Relief, and Economic Security (CARES) Act. Economists estimated that the government spent $150,000 to $377,000 per job saved, a very high price tag for a program that lasted a few months.[9] Data’s increasing value as an economic resource requires a new way of thinking. Techniques like differential privacy can help regulators balance privacy concerns while enabling the social and economic benefits of greater data access.
Overview of technology
Differential privacy, first developed in 2006, is a mathematical technique that adds a controlled amount of randomness to a data set, or to the results computed from it, so that little can be learned about any specific individual.[10] The amount of randomly generated noise is controlled by a tunable parameter called the “privacy loss budget,” often denoted by epsilon (ε). Epsilon controls the tradeoff between privacy and data utility: a high value of ε means more accurate but less private data. Because the randomness is controlled, the resulting data is still accurate enough to generate aggregate insights while maintaining user privacy. Differential privacy can be applied either locally or globally, depending on the trust placed in the aggregator. In the local model, noise is added to each individual’s data before it is centralized in a database. In the global model, noise is added after the data has been collected centrally. There are no off-the-shelf applications for meeting the requirements and standards of differential privacy; the application built and the value of epsilon chosen depend on the type of data set and its uses.
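To make the mechanics concrete, here is a minimal sketch of the classic Laplace mechanism in the global model: a trusted aggregator computes a true count and releases it with Laplace noise whose scale is inversely proportional to ε. The data set and epsilon values below are illustrative only, not drawn from any real deployment.

```python
# Minimal sketch of the Laplace mechanism (global model of differential privacy).
# The data and epsilon values are illustrative, not tied to any real deployment.
import numpy as np

rng = np.random.default_rng()

def dp_count(records, predicate, epsilon):
    """Release a differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many people in a synthetic dataset are over 65?
ages = [23, 67, 45, 71, 34, 88, 52, 66]
for eps in (0.1, 1.0, 10.0):
    noisy = dp_count(ages, lambda age: age > 65, eps)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.2f} (true count = 4)")
```

Running this with several epsilon values shows the tradeoff directly: a small ε buries the true count in noise, while a large ε returns something close to the exact answer.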
Differential privacy use cases
U.S. Census Bureau
The U.S. Census Bureau has a dual mandate of ensuring an accurate population count while protecting respondents’ privacy. The Bureau has used numerous “Disclosure Avoidance Techniques” since the 1960s.[11] From 2000 to 2020, “data swapping” between census blocks was the primary technique used. However, research confirmed that modern computational power had rendered the methods used in the 2010 and earlier censuses ineffective against reidentification attacks. For the first time in the 2020 Census, the Bureau adopted differential privacy as its disclosure avoidance technique.
The Bureau designed its differential privacy algorithm to spend different amounts of the privacy loss budget on different parts of the redistricting data product.[12] The total budget for the redistricting product was ε = 19.61, which includes ε = 17.14 for the persons file and ε = 2.47 for the housing-unit data.[13] At one extreme, a value of ε = 0 offers full privacy but no accuracy; at the other, a value of infinity offers full accuracy but no protection against data reconstruction threats. Finding the right epsilon can be tricky. Preliminary tests of the census data have left many analysts worried about their limited ability to use 2020 census statistics, particularly data about small geographic areas and minority groups within communities that many governments rely on for planning.
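The accounting behind that budget can be sketched in a few lines. The snippet below is illustrative only (the Bureau’s production disclosure avoidance system is far more sophisticated): it shows how privacy loss adds up across releases under basic sequential composition, and how the noise added to a single count grows as the budget shrinks.

```python
# Illustrative privacy-loss accounting only; the Bureau's production
# disclosure avoidance system is far more sophisticated than this sketch.
import math

persons_eps = 17.14   # budget spent on the persons file (figures cited above)
housing_eps = 2.47    # budget spent on the housing-unit data

# Under basic sequential composition, privacy loss adds up across releases
# computed from the same underlying data.
total_eps = persons_eps + housing_eps
print(f"Total privacy-loss budget: {total_eps:.2f}")  # 19.61

# For a single counting query answered with the Laplace mechanism (not the
# Bureau's actual mechanism), the standard deviation of the added noise is
# sqrt(2)/epsilon: a tighter budget means noisier, less accurate counts.
for eps in (0.1, 1.0, 2.47, 17.14):
    print(f"epsilon = {eps:>5}: noise std dev ≈ {math.sqrt(2) / eps:.2f}")
```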
Google and Apple
Google was one of the first companies to create a practical application of differential privacy, applying learnings from Project RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) to the Chrome browser in 2014.[14] The company has since expanded the use of differential privacy to other products such as Maps, Play Console, Assistant, and, more recently, the COVID-19 Community Mobility Reports.[15] In addition, in 2019 the company open-sourced its differential privacy library, and it is now open-sourcing a differentially private SQL query language extension[16] that is used internally at Google in thousands of queries every day to obtain business insights and observe product trends.
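RAPPOR builds on randomized response, a survey technique that predates differential privacy. The sketch below shows the basic two-coin version of randomized response, not Google’s actual implementation (which additionally hashes values into Bloom filters and applies two stages of randomization); it illustrates how each individual report is deniable while the aggregate remains estimable.

```python
# Basic randomized response, the idea RAPPOR builds on. A minimal sketch,
# not Google's actual implementation.
import random

def randomized_response(truth: bool) -> bool:
    """Each user flips a coin before answering a yes/no question.

    Heads: report the truth. Tails: flip again and report that second coin.
    The reported bit satisfies local differential privacy with epsilon = ln(3),
    because P[report=1 | truth=1] / P[report=1 | truth=0] = 0.75 / 0.25 = 3.
    """
    if random.random() < 0.5:        # first coin: heads -> answer honestly
        return truth
    return random.random() < 0.5     # tails -> answer with a second coin flip

def estimate_true_rate(reports):
    """De-bias the aggregate: E[reported rate] = 0.25 + 0.5 * true rate."""
    reported_rate = sum(reports) / len(reports)
    return 2 * (reported_rate - 0.25)

# Example: 100,000 simulated users, 30% of whom truly have the property.
truths = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
print(f"Estimated rate: {estimate_true_rate(reports):.3f} (true rate ≈ 0.300)")
```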
Apple is another company that uses differential privacy to improve the intelligence of features such as QuickType suggestions, emoji suggestions, and health data usage analysis. The company says it applies local differential privacy on the user’s device so that Apple’s servers never receive raw user data. “The Apple analysis system then ingests the differentially private contributions, dropping IP addresses and other metadata. The final stage is aggregation, where the privatized records are processed to compute the relevant statistics, and the aggregate statistics are then shared with relevant Apple teams.”[17] However, Apple has drawn criticism for its approach, since the company keeps the code and epsilon values private and can change them with little oversight.[18]
Risks and considerations
While differential privacy solves many issues around protecting user privacy, it comes with tradeoffs. First, analysis at the individual level is almost impossible, which makes the method incompatible with use cases that require that level of granularity, such as fraud detection in banking. This is why, despite the excitement in academia and industry about differential privacy, enthusiasm at government agencies and national statistical organizations (NSOs) has so far been limited.
Second, the technique does not work well for small data sets. For example, analysis of the initial release of census data found significant discrepancies in data for minority populations, with some small localities having their populations doubled or halved. This could invite litigation from undercounted jurisdictions; Alabama was the first state to sue.[19] Lastly, there is no consensus on the optimal value of epsilon or on who should control the noise-adding algorithm and its parameters (as in the case of Apple). As a result, guidelines are needed to define appropriate levels for different use cases.
Conclusion
While differential privacy solves many issues around user privacy, government agencies should consider whether it is the right solution for them, because their requirements differ from the tech industry’s. The amount of data collected tends to be much more limited (most agencies rely on surveys with n ≤ 10,000), “the users of the data are interested in making inferences for particular target populations and data must be available for many years.”[20] For instance, the U.S. Census Bureau has maintained data for over 70 years, meaning the privacy budget it sets needs to protect data for all those years. Differential privacy provides a way to manage the tradeoff between privacy and utility and offers a powerful alternative to the limitations of traditional anonymization approaches, but policymakers should work with researchers to determine the optimal level of tradeoff for their unique use cases.
[1] “Revealed: 50 Million Facebook Profiles Harvested for Cambridge Analytica in Major Data Breach,” The Guardian, accessed November 27, 2021, https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election.
[2] Cem Dilmegani, “Differential Privacy: What It Is, How It Works, Benefits & Use Cases,” AI Multiple, June 1, 2021, https://research.aimultiple.com/differential-privacy/.
[3] Uber Privacy & Security, “Uber Releases Open Source Project for Differential Privacy,” Medium, accessed November 29, 2021, https://medium.com/uber-security-privacy/differential-privacy-open-source-7892c82c42b6.
[4] Latanya Sweeney, “k-Anonymity: A Model for Protecting Privacy,” International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, no. 05 (October 2002): 557–70, https://doi.org/10.1142/S0218488502001648.
[5] Dilmegani, “Differential Privacy.”
[6] Cecilia Kang, David McCabe, and Jack Nicas, “Biden Is Expected to Keep Scrutiny of Tech Front and Center,” The New York Times, November 10, 2020, sec. Technology, https://www.nytimes.com/2020/11/10/technology/biden-tech-antitrust-privacy.html.
[7] Jill Cowan and Natasha Singer, “How California’s New Privacy Law Affects You,” The New York Times, January 3, 2020, sec. U.S., https://www.nytimes.com/2020/01/03/us/ccpa-california-privacy-law.html.
[8] David Deming, “Balancing Privacy With Data Sharing for the Public Good,” The New York Times, February 19, 2021, sec. Business, https://www.nytimes.com/2021/02/19/business/privacy-open-data-public.html.
[9] Greg Rosalsky, “The Dark Side Of The Recovery Revealed In Big Data,” NPR, October 27, 2020, sec. Newsletter, https://www.npr.org/sections/money/2020/10/27/927842540/the-dark-side-of-the-recovery-revealed-in-big-data.
[10] Cem Dilmegani, “Differential Privacy: What It Is, How It Works, Benefits & Use Cases,” AI Multiple, June 1, 2021, https://research.aimultiple.com/differential-privacy/.
[11] Laura McKenna, “Disclosure Avoidance Techniques Used for the 1960 Through 2010 Decennial Censuses of Population and Housing Public Use Microdata Samples,” Research and Methodology Directorate, April 2019, 15.
[12] US Census Bureau, “Comparing Differential Privacy With Older Disclosure Avoidance Methods,” Census.gov, August 12, 2021, https://www.census.gov/library/fact-sheets/2021/comparing-differential-privacy-with-older-disclosure-avoidance-methods.html.
[13] Christopher T. Kenny et al., “The Use of Differential Privacy for Census Data and Its Impact on Redistricting: The Case of the 2020 U.S. Census,” Science Advances 7, no. 41 (October 8, 2021): eabk3283, https://doi.org/10.1126/sciadv.abk3283.
[14] “Learning Statistics with Privacy, Aided by the Flip of a Coin,” Google AI Blog (blog), accessed November 21, 2021, http://ai.googleblog.com/2014/10/learning-statistics-with-privacy-aided.html.
[15] “Reports to Help Combat COVID-19 — The Keyword,” accessed November 21, 2021, https://blog.google/technology/health/covid-19-community-mobility-reports/.
[16] “How We’re Helping Developers with Differential Privacy,” Google Developers Blog (blog), accessed November 21, 2021, https://developers.googleblog.com/2021/01/how-were-helping-developers-with-differential-privacy.html.
[17] Apple, “Differential Privacy Overview,” accessed October 31, 2021, https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf.
[18] Andy Greenberg, “How One of Apple’s Key Privacy Safeguards Falls Short,” Wired, accessed October 31, 2021, https://www.wired.com/story/apple-differential-privacy-shortcomings/.
[19] “New System to Protect Census Data May Compromise Accuracy, Some Experts Say,” Washington Post, accessed November 21, 2021, https://www.washingtonpost.com/local/social-issues/2020-census-differential-privacy-ipums/2021/06/01/6c94b46e-c30d-11eb-93f5-ee9558eecf4b_story.html.
[20] Joerg Drechsler, “Differential Privacy for Government Agencies — Are We There Yet?,” arXiv:2102.08847 [cs, stat], February 17, 2021, http://arxiv.org/abs/2102.08847.