Sensitive US census data is vulnerable to theft and exposure

Computer scientists have designed a “reconstruction attack” that shows US Census data could be stolen or leaked using a laptop and machine learning code

A team of computer scientists says US citizens could have their identities stolen and exploited through a reverse-engineering exercise that attackers can carry out using machine learning algorithms on a standard laptop.

The "reconstruction attack" is part of a study led by Aaron Roth, the Henry Salvatori Professor of Computer and Cognitive Science in Computer and Information Science (CIS) at the University of Pennsylvania School of Engineering and Applied Science, and Michael Kearns, the National Center Professor of Management and Technology in CIS. The study was published in the Proceedings of the National Academy of Sciences (PNAS).

The researchers used machine learning and a standard laptop to demonstrate how protected information about individual respondents can be reverse-engineered from US Census Bureau statistics, potentially compromising the privacy of the US population.

This study establishes a benchmark for unacceptable susceptibility to exposure and highlights the likelihood of identity theft or discrimination resulting from this attack. The researchers also demonstrate how an attacker can determine the probability that a reconstructed record corresponds to the data of a real person.

“Over the last two decades, it has become clear that practices in widespread use for data privacy — anonymising or masking records, coarsening granular responses or aggregating individual data into large-scale statistics — do not work,” says Kearns. “In response, computer scientists have created techniques to provably guarantee privacy.”

“The private sector,” says Roth, “has been applying these techniques for years. But the Census’ long-running statistical programs and policies have additional complications attached.”

Data is critical for political, economic, and social purposes

The US Census Bureau is required by the Constitution to conduct a full population survey every decade, and the data collected is critical for various political, economic, and social purposes, including apportioning House seats, drawing district boundaries, allocating federal funding for state and local programs, disaster relief, welfare programs, infrastructure development, and demographic research.

While Census information is publicly available, strict laws protect individual privacy. Published statistics aggregate respondents' survey answers, giving a mathematically precise picture of the population as a whole without directly revealing any individual's personal information.

However, attackers can use these aggregated statistics to reverse-engineer sets of records consistent with confirmed statistics, a process known as "reconstruction." In response to these risks, the Census conducted an internal reconstruction attack between the 2010 and 2020 surveys to evaluate the need for changes in reporting. The findings led to the implementation of "differential privacy," a provable protection technique that preserves the integrity of the larger data set while concealing individual data.
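To make the idea of reconstruction concrete, here is a minimal sketch in Python. It is not the method from the PNAS study; it simply brute-forces every set of individual records consistent with a few invented aggregate statistics for a hypothetical three-person census block, which is the core logic such an attack scales up with machine learning and constraint solvers.

```python
from itertools import combinations_with_replacement, product

# Hypothetical published aggregates for a tiny census block
# (all numbers invented for illustration):
N = 3            # residents
MEAN_AGE = 30    # published mean age
N_ADULT = 2      # residents aged 18 or over
N_FEMALE = 1     # female residents

# Every candidate (age, sex) record an individual could have.
records = list(product(range(90), "FM"))

# Brute-force search: keep every multiset of N records whose
# aggregates exactly match the published statistics.
consistent = [
    recs for recs in combinations_with_replacement(records, N)
    if sum(a for a, _ in recs) == MEAN_AGE * N
    and sum(a >= 18 for a, _ in recs) == N_ADULT
    and sum(s == "F" for _, s in recs) == N_FEMALE
]
```

Any record that appears in every consistent multiset is fully determined by the published statistics alone, which is exactly the exposure the researchers quantify.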

Differential privacy, invented by Cynthia Dwork, a computer science professor at Harvard University and a collaborator on the study, introduces strategic amounts of false data, known as "noise," to conceal individual data. While the noise's impact on statistical correctness is negligible at large scales, it can cause complications in demographic statistics describing small populations.
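The textbook way to add such noise is the Laplace mechanism: a counting query changes by at most 1 when one person's data changes, so adding Laplace noise with scale 1/ε guarantees ε-differential privacy. The sketch below illustrates that mechanism only; it is not the Census Bureau's actual production algorithm.

```python
import random

def dp_count(true_count, epsilon, rng=None):
    """Release a count protected by the Laplace mechanism.

    A counting query has sensitivity 1 (one person's data moves it
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
    """
    rng = rng or random.Random()
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

Averaged over many releases the noise cancels out, which is why the distortion is negligible for large-scale statistics, but any single small count (say, one demographic group on one block) can be noticeably perturbed.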

Experts suggest that the trade-off between accuracy and privacy is complex. While some social scientists argue that publishing aggregate statistics poses no inherent risk, Roth and Kearns' work has proven that the likelihood of reconstructing individual records is higher than previously thought. 

“What’s novel about our approach is that we show that it’s possible to identify which reconstructed records are most likely to match the answers of a real person,” says Kearns. “Others have already demonstrated it’s possible to generate real records, but we are the first to establish a hierarchy that would allow attackers to, for example, prioritise candidates for identity theft by the likelihood their records are correct.”

On the matter of complications posed by adding error to statistics that play such a significant role in the lives of the US population, the researchers say they are being realistic. “The Census is still working out how much noise will be useful and fair to balance the trade-off between accuracy and privacy,” says Roth. “And, in the long run, it may be that public policymakers decide that the risks posed by non-noisy statistics are worth the transparency.” 

