Sensitive US census data is vulnerable to theft and exposure

Computer scientists have designed a “reconstruction attack” that shows US Census data could be stolen or leaked using a laptop and machine learning code

A team of computer scientists says US citizens could have their identities stolen and exploited through a reverse-engineering exercise that attackers can carry out with machine learning algorithms on an ordinary laptop.

The "reconstruction attack" is part of a study led by the Aaron Roth of the University of Pennsylvania School of Engineering and Applied Science, who is the Henry Salvatori Professor of Computer and Cognitive Science in Computer and Information Science (CIS); and Michael Kearns, the National Center Professor of Management and Technology in CIS. The study was published in the Proceedings of the National Academy of Sciences (PNAS). 

The researchers used machine learning and a standard laptop to demonstrate how protected information about individual respondents can be reverse-engineered from US Census Bureau statistics, potentially compromising the privacy of the US population.

The study establishes a benchmark for unacceptable susceptibility to exposure and highlights the risk of identity theft or discrimination resulting from such an attack. The researchers also demonstrate how an attacker can determine the probability that a reconstructed record corresponds to the data of a real person.

“Over the last two decades, it has become clear that practices in widespread use for data privacy — anonymising or masking records, coarsening granular responses or aggregating individual data into large-scale statistics — do not work,” says Kearns. “In response, computer scientists have created techniques to provably guarantee privacy.”

“The private sector,” says Roth, “has been applying these techniques for years. But the Census’ long-running statistical programs and policies have additional complications attached.”

Data is critical for political, economic, and social purposes

The US Census Bureau is required by the Constitution to conduct a full population survey every decade. The data it collects is critical for a range of political, economic, and social purposes: apportioning House seats, drawing district boundaries, allocating federal funding for state and local programs, disaster relief, welfare programs, infrastructure development, and demographic research.

While Census information is publicly available, strict laws are in place to protect individual privacy. The published statistics therefore aggregate respondents' survey answers, giving a mathematically precise picture of the overall population without directly revealing any individual's personal information.

However, attackers can use these aggregated statistics to reverse-engineer sets of individual records consistent with the published figures, a process known as "reconstruction." In response to these risks, the Census Bureau conducted an internal reconstruction attack between the 2010 and 2020 surveys to evaluate the need for changes in reporting. The findings led to the implementation of "differential privacy," a provable protection technique that preserves the integrity of the larger data set while concealing individual data.
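
To see why aggregation alone is not protective, here is a minimal sketch in Python: for a hypothetical census block with made-up published statistics (a mean age and a count of female respondents), a brute-force search enumerates every set of individual records consistent with those aggregates. The block size, statistics, and attribute ranges are illustrative assumptions, not the study's data or method.

```python
# Toy reconstruction sketch: find all record sets consistent with
# hypothetical published aggregates for a tiny census block.
from itertools import combinations_with_replacement

BLOCK_SIZE = 3                  # hypothetical number of residents
PUBLISHED_MEAN_AGE = 30         # hypothetical published mean age (rounded)
PUBLISHED_FEMALE_COUNT = 2      # hypothetical published count of female respondents

# Candidate individual records: (age, sex)
candidates = [(age, sex) for age in range(18, 90) for sex in ("F", "M")]

def consistent(records):
    """Check whether a candidate set of records matches the published aggregates."""
    mean_age = sum(age for age, _ in records) / len(records)
    females = sum(1 for _, sex in records if sex == "F")
    return round(mean_age) == PUBLISHED_MEAN_AGE and females == PUBLISHED_FEMALE_COUNT

solutions = [r for r in combinations_with_replacement(candidates, BLOCK_SIZE)
             if consistent(r)]
print(f"{len(solutions)} record sets are consistent with the published statistics")
```

The more statistics an attacker can cross-reference, the fewer consistent record sets remain, which is what makes reconstruction from many overlapping Census tables a realistic threat.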

Differential privacy, invented by Cynthia Dwork, a computer science professor at Harvard University and a collaborator on the study, introduces strategic amounts of false data, known as "noise," to conceal individual data. While the noise's impact on statistical correctness is negligible at large scales, it can cause complications in demographic statistics describing small populations.
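
As a rough illustration of how that noise is typically added, the sketch below uses the Laplace mechanism, a standard construction for differentially private counts; the epsilon value and counts are illustrative assumptions, not the Census Bureau's actual parameters.

```python
# Minimal sketch of the Laplace mechanism for differentially private counts.
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5                 # uniform on [-0.5, 0.5)
    u = max(min(u, 0.499999), -0.499999)      # avoid log(0) at the edge
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy (a count has sensitivity 1)."""
    return true_count + laplace_noise(1.0 / epsilon)

# The same absolute noise that is negligible for a large count can
# overwhelm a small one: the trade-off described above.
print(private_count(1_000_000, epsilon=0.1))  # large population: noise of ~10 is immaterial
print(private_count(7, epsilon=0.1))          # small census block: noise can dominate
```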

Experts suggest that the trade-off between accuracy and privacy is complex. While some social scientists argue that publishing aggregate statistics poses no inherent risk, Roth and Kearns' work has proven that the likelihood of reconstructing individual records is higher than previously thought. 

“What’s novel about our approach is that we show that it’s possible to identify which reconstructed records are most likely to match the answers of a real person,” says Kearns. “Others have already demonstrated it’s possible to generate real records, but we are the first to establish a hierarchy that would allow attackers to, for example, prioritise candidates for identity theft by the likelihood their records are correct.”
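
The article does not describe the researchers' ranking method, but one simple heuristic conveys the general idea: score each candidate record by how often it appears across the reconstructions consistent with the published statistics (reusing the `solutions` list from the earlier sketch), then prioritise the highest-scoring records.

```python
# Illustrative heuristic only, not the study's algorithm: rank candidate
# records by how frequently they appear in consistent reconstructions.
from collections import Counter

appearances = Counter(record for record_set in solutions for record in record_set)
total = len(solutions)

ranking = sorted(appearances.items(), key=lambda kv: kv[1], reverse=True)
for record, count in ranking[:5]:
    print(f"{record}: appears in {count / total:.0%} of consistent reconstructions")
```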

On the matter of complications posed by adding error to statistics that play such a significant role in the lives of the US population, the researchers say they are being realistic. “The Census is still working out how much noise will be useful and fair to balance the trade-off between accuracy and privacy,” says Roth. “And, in the long run, it may be that public policymakers decide that the risks posed by non-noisy statistics are worth the transparency.” 
