How census data puts trans kids at risk
Every decade, the US Census Bureau counts the people in the United States, trying to strike a balance between collecting accurate information and protecting the privacy of the people described in that data. But current technology can reveal a person’s transgender identity by linking seemingly anonymized information, such as neighborhood and age, to find that their sex has been reported differently in successive censuses. The failure to keep sex and other data anonymous could spell disaster for trans people and families living in states that seek to criminalize them.
In places like Texas, where families seeking medical care for trans children can be accused of child abuse, the state would need to know which teens are trans to pursue its investigations. We were concerned that census data could be used to facilitate this kind of investigation and sanction. Could a weakness in the way publicly available datasets are anonymized be exploited to find trans children – and to punish them and their families? A similar concern underpinned the public outcry in 2018 over the census asking people to reveal their citizenship: the fear that the data would be used to find undocumented people living in the United States and punish them.
Using our expertise in data science and data ethics, we took simulated data designed to mimic the datasets the Census Bureau releases publicly and tried to re-identify trans teens, or at least narrow down where they might live. Unfortunately, we succeeded. Using the data anonymization approach the Census Bureau employed in 2010, we were able to identify 605 trans children. Fortunately, the Census Bureau has adopted a new differential privacy approach that improves overall privacy, though it is still a work in progress. When we examined the most recent published data, we found that the bureau’s new approach reduced the identification rate by 70% – a substantial improvement, but one that still leaves room for more.
Even as researchers who use census data to answer questions about life in the United States as part of our work, we firmly believe that privacy is important. The bureau is currently undertaking a public comment period on the design of the 2030 census. Submissions could shape how the census is undertaken and how the bureau will proceed with data anonymization. Here’s why it matters.
The federal government collects census data to make decisions about things like the size and shape of congressional districts, or how to disburse funds. Yet government agencies aren’t the only ones using the data. Researchers in various fields, such as economics and public health, use the published information to study the state of the nation and make policy recommendations.
But the risks of de-anonymizing data are real, and not just for trans children. In a world where the collection of private data and access to powerful computer systems are increasingly ubiquitous, it may be possible to unravel the privacy protections that the Census Bureau embeds in the data. Perhaps most famously, computer scientist Latanya Sweeney showed that 87% of Americans could be re-identified from just their ZIP code, date of birth and sex.
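Sweeney’s insight can be illustrated with a toy example: count how many records in an “anonymized” dataset share each combination of ZIP code, date of birth and sex. Any record whose combination is unique can be linked back to a named person by anyone holding a second dataset with those same fields, such as voter rolls. The field names and sample records below are hypothetical, chosen only to make the idea concrete:

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, but the
# quasi-identifiers (ZIP, birth date, sex) left in place.
records = [
    {"zip": "02138", "dob": "1990-07-04", "sex": "F"},
    {"zip": "02138", "dob": "1990-07-04", "sex": "M"},
    {"zip": "02139", "dob": "1985-01-15", "sex": "F"},
    {"zip": "02139", "dob": "1985-01-15", "sex": "F"},
]

# How many records share each quasi-identifier combination?
combos = Counter((r["zip"], r["dob"], r["sex"]) for r in records)

# A record is re-identifiable when its combination is unique:
# an outside dataset with names and the same fields pins it to one person.
unique = [r for r in records if combos[(r["zip"], r["dob"], r["sex"])] == 1]
print(len(unique))  # → 2: half of this tiny dataset is uniquely identifiable
```

In real census-scale data the combinations are far more numerous, which is exactly why so large a share of the population ends up unique.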
In August 2021, the Census Bureau responded. The bureau used the differential privacy approach favored by cryptographers to protect its redistricting data. Mathematicians and computer scientists have been drawn to the mathematical elegance of this approach, which involves intentionally introducing a controlled amount of error into key census figures and then cleaning up the results so they remain internally consistent. For example, if the census counted precisely 16,147 people identified as Native American in a specific county, it might report a close but different number, such as 16,171. That sounds simple, but counties are made up of census tracts, which are themselves made up of census blocks. This means that, to get a number close to the original count, the census must also change the number of Native Americans in each census block and tract; the art of the Census Bureau’s approach is making all of these close-but-different numbers add up to another close-but-different number.
You would think that protecting people’s privacy is a given. But some researchers, primarily those whose work depends on the existing approach to data privacy, have a different view. These changes, they say, will make it harder for researchers to do their jobs in practice, while the privacy risks the Census Bureau protects against are largely theoretical.
Remember: we have shown that the risk is not theoretical. Here’s a bit of how we did it.
We put together a complete list of people under 18 in each census block, along with their age, sex, race and ethnicity in 2010. Then we matched this list against the analogous 2020 list to find people who were now 10 years older and had a different reported sex. This method, known as a reconstruction-abetted linkage attack, requires only publicly available datasets. After we vetted our findings and formally presented them to the Census Bureau, they were striking and disturbing enough that researchers at Boston University and Harvard University contacted us for more details about our work.
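The linkage step described above can be sketched as follows. This is a toy illustration of the matching logic only, with hypothetical reconstructed records – not the authors’ actual pipeline, which first has to reconstruct person-level records from published tables:

```python
# Hypothetical reconstructed microdata: one dict per person per census.
census_2010 = [
    {"block": "B1", "age": 7,  "race": "white", "sex": "M"},
    {"block": "B1", "age": 12, "race": "black", "sex": "F"},
]
census_2020 = [
    {"block": "B1", "age": 17, "race": "white", "sex": "F"},  # sex differs
    {"block": "B1", "age": 22, "race": "black", "sex": "F"},
]

def link_candidates(then, now):
    """Pair records that look like the same person a decade apart
    (same block and race, aged exactly 10 years) but whose reported
    sex differs -- the signal this kind of attack searches for."""
    hits = []
    for a in then:
        for b in now:
            if (a["block"] == b["block"] and a["race"] == b["race"]
                    and b["age"] == a["age"] + 10 and a["sex"] != b["sex"]):
                hits.append((a, b))
    return hits

matches = link_candidates(census_2010, census_2020)
print(len(matches))  # → 1 candidate pair in this toy data
```

Real data brings false positives (moves, recording errors), so matches are candidates to be narrowed down rather than certain identifications – but as the 605 figure shows, the narrowing can go disturbingly far.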
We’ve simulated what a bad actor might do – so how do we make sure attacks like this don’t happen? The Census Bureau takes this aspect of privacy seriously, and researchers who use its data should not stand in the way.
The census is carried out at great expense and effort, and we all benefit from the data it produces. But that data can also do harm, and the Census Bureau’s work to protect privacy has gone a long way toward mitigating that risk. We must encourage it to continue.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.