Sensitive US census data is vulnerable to theft and exposure

Computer scientists have designed a “reconstruction attack” that shows US Census data could be stolen or leaked using a laptop and machine learning code

A team of computer scientists has shown that US citizens could have their identities stolen and exploited through a reverse-engineering exercise that attackers can carry out using machine learning algorithms on an ordinary laptop.

The "reconstruction attack" is part of a study led by Aaron Roth of the University of Pennsylvania School of Engineering and Applied Science, who is the Henry Salvatori Professor of Computer and Cognitive Science in Computer and Information Science (CIS); and Michael Kearns, the National Center Professor of Management and Technology in CIS. The study was published in the Proceedings of the National Academy of Sciences (PNAS).

The researchers used machine learning and a standard laptop to demonstrate how protected information about individual respondents can be reverse-engineered from US Census Bureau statistics, potentially compromising the privacy of the US population.

This study establishes a benchmark for unacceptable susceptibility to exposure and highlights the likelihood of identity theft or discrimination resulting from this attack. The researchers also demonstrate how an attacker can determine the probability that a reconstructed record corresponds to the data of a real person.

“Over the last two decades, it has become clear that practices in widespread use for data privacy — anonymising or masking records, coarsening granular responses or aggregating individual data into large-scale statistics — do not work,” says Kearns. “In response, computer scientists have created techniques to provably guarantee privacy.”

“The private sector,” says Roth, “has been applying these techniques for years. But the Census’ long-running statistical programs and policies have additional complications attached.”

Data is critical for political, economic, and social purposes

The US Census Bureau is required by the Constitution to conduct a full population survey every decade, and the data collected is critical for various political, economic, and social purposes, including apportioning House seats, drawing district boundaries, allocating federal funding for state and local programs, disaster relief, welfare programs, infrastructure development, and demographic research.

While Census information is publicly available, strict laws are in place to protect individual privacy. To that end, publicly available statistics aggregate respondents' survey answers, preserving a mathematically precise overall picture of the population without directly revealing any individual's personal information.

However, attackers can use these aggregated statistics to reverse-engineer sets of records consistent with confirmed statistics, a process known as "reconstruction." In response to these risks, the Census conducted an internal reconstruction attack between the 2010 and 2020 surveys to evaluate the need for changes in reporting. The findings led to the implementation of "differential privacy," a provable protection technique that preserves the integrity of the larger data set while concealing individual data.
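To make "reconstruction" concrete, here is a deliberately tiny, hypothetical sketch (not the study's actual method or real Census data): given a few published aggregates for a three-person block, an attacker can brute-force every combination of individual ages and keep only those consistent with all the statistics.

```python
# Toy reconstruction attack: enumerate candidate record sets and keep
# those consistent with hypothetical published aggregates for one block.
from itertools import combinations_with_replacement

AGES = range(100)

# Hypothetical published aggregates for a 3-person block.
population = 3
mean_age = 30       # published mean age
num_over_50 = 1     # published count of residents over age 50

consistent = [
    combo
    for combo in combinations_with_replacement(AGES, population)
    if sum(combo) / population == mean_age      # matches the mean
    and sum(a > 50 for a in combo) == num_over_50  # matches the count
]

print(len(consistent), "candidate record sets remain")
```

Each additional published statistic shrinks the candidate set further; with enough aggregates, only a handful of record sets (sometimes exactly one) remain consistent.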

Differential privacy, invented by Cynthia Dwork, a computer science professor at Harvard University and a collaborator on the study, introduces strategic amounts of false data, known as "noise," to conceal individual data. While the noise's impact on statistical correctness is negligible at large scales, it can cause complications in demographic statistics describing small populations.
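A minimal sketch of the classic mechanism behind differential privacy, the Laplace mechanism, illustrates the trade-off the article describes (the epsilon values and counts here are illustrative assumptions, not the Census Bureau's actual parameters):

```python
# Laplace mechanism sketch: add noise scaled to the query's sensitivity.
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) by inverse transform."""
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy. A counting
    query has sensitivity 1 (adding or removing one person changes it
    by at most 1), so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# For a small block, the noise is large relative to the count,
# masking individuals; for a large aggregate, the same amount of
# noise is negligible at scale.
print(private_count(true_count=12, epsilon=0.1))
print(private_count(true_count=1_200_000, epsilon=0.1))
```

Smaller epsilon means stronger privacy but noisier statistics, which is exactly why small-population demographics suffer most.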

Experts suggest that the trade-off between accuracy and privacy is complex. While some social scientists argue that publishing aggregate statistics poses no inherent risk, Roth and Kearns' work has proven that the likelihood of reconstructing individual records is higher than previously thought. 

“What’s novel about our approach is that we show that it’s possible to identify which reconstructed records are most likely to match the answers of a real person,” says Kearns. “Others have already demonstrated it’s possible to generate real records, but we are the first to establish a hierarchy that would allow attackers to, for example, prioritise candidates for identity theft by the likelihood their records are correct.”
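One natural way to build such a ranking, sketched here as a toy illustration only (the paper's actual technique is not shown), is to score each reconstructed record by how often it appears across all record sets consistent with the published aggregates; under a uniform prior, that frequency estimates the chance the record belongs to a real person.

```python
# Toy ranking of reconstructed records by frequency across all
# consistent solutions (illustrative heuristic, not the paper's method).
from collections import Counter
from itertools import combinations_with_replacement

AGES = range(100)

# Reuse the hypothetical 3-person block: mean age 30, one person over 50.
consistent = [
    combo
    for combo in combinations_with_replacement(AGES, 3)
    if sum(combo) == 90 and sum(a > 50 for a in combo) == 1
]

# How often does each age value appear across consistent solutions?
freq = Counter(age for combo in consistent for age in combo)
total = len(consistent)

for age, count in freq.most_common(5):
    print(f"age {age}: appears in {count / total:.0%} of consistent sets")
```

Records that recur across nearly every consistent solution are the highest-confidence targets, which is what makes such a hierarchy useful to an attacker prioritising identity theft candidates.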

On the matter of complications posed by adding error to statistics that play such a significant role in the lives of the US population, the researchers say they are being realistic. “The Census is still working out how much noise will be useful and fair to balance the trade-off between accuracy and privacy,” says Roth. “And, in the long run, it may be that public policymakers decide that the risks posed by non-noisy statistics are worth the transparency.” 

