Unless you work with survey data all the time, you probably don’t know this, but big changes are underway at the US Census.
The Census Bureau has its namesake mandate: to conduct a census that tallies up the number of people in each state. Over time, its work has grown in scope to collect all sorts of statistics about people and businesses, not just their number.
The Census actually publishes much more detailed data, at ever smaller levels of geography, about all sorts of things. It even publishes “microdata”: individual records from the Census that correspond to individuals or households.
Originally, businesses were concerned that data published about business activity in a small enough geography could essentially reveal the performance of their business (imagine you’re Vandelay Industries and you’re the only big business in your county -- the county revenue figures are essentially your revenue figures, and everyone knows it). This led the Census to add noise to, or suppress the publication of, certain data extracts to avoid “indirect disclosure”.
Since the 1940s, the same concern has applied to people as well, and since the advent of computers, the Census has had to work ever harder to ensure that publishing aggregated statistics or microdata doesn’t inadvertently reveal facts about individual people.
Until the 2020 census, this meant doing one of two things: (1) simply not publishing a table of data if doing so would risk revealing something private, or (2) “swapping” records.
Let’s use a hypothetical for (2). Imagine (and you’re really going to have to imagine here) that there was just one 34-year-old white male in census tract NY/0551.00 (there are probably hundreds of us in these few square blocks). If they disclosed a table about what percentage of white 34-year-olds had a college degree, income within a certain band, etc., they’d essentially be disclosing this information about yours truly. All you’d have to know is where I lived, and you’d know everything else about me.
A more realistic example would be to imagine that if instead of non-denominational, I were a Hasidic Jew. Why? Because there are probably very few (are there any?) in my census tract, but just a few blocks south of where I live is one of the largest Hasidic neighborhoods in the US. What the Census might decide to do in this case is to “swap” the records -- they would take my (again, you’re imagining that I’m a Hasidic Jew here) record, and swap it with someone else who would “blend in” to my original Census tract. That way there would be no way to identify a particular person in the published tables.
But “swapping” is not perfect. With more data widely available to purchasers (e.g. voter files, customer files, etc.), more powerful computers, and more advanced algorithms, it has become more and more possible to identify individuals.
A not-very-illuminating deck (the presentation is more like a “whiff” than a gist) is available here. The upshot is that internal researchers at the Census were able to successfully “reconstruct” individuals’ data from a combination of aggregated data and commercially available data.
Enter differential privacy! Starting with this most recent Census, the Census Bureau will be using this technique to mathematically secure the data that it releases. What is it? At a high level, it is adding a very specific amount of noise to the data (changing answers and reported statistics) such that you can’t recover any individual’s data from what’s published. The general idea is that the more noise you add to the data, the more “private” it is; but more noise is more noise, making the published data less accurate. There is a tradeoff here, and the math makes the tradeoff explicit and transparent.
Okay but… what is it? How does it work? If you’re a particularly loyal reader, you may remember my post about collecting data on sensitive subjects, which relies on much the same idea. With that method, we can compute the average response from the population of respondents, but not any individual’s responses.
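That survey technique is usually called randomized response, and it can be sketched in a few lines. This is an illustration of the general idea only, not the Census’s method, and the function names are my own: each respondent flips a coin; on heads they answer truthfully, on tails they flip again and report whatever the second coin says. Any single answer is deniable, but the true rate falls out of the aggregate.

```python
import random

def randomized_answer(truth: bool) -> bool:
    """Flip a coin: heads, answer truthfully; tails, flip a second
    coin and report that instead of the real answer."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimated_rate(answers: list[bool]) -> float:
    """The observed yes-rate is 0.5 * true_rate + 0.25 in
    expectation, so invert that to recover the true rate."""
    observed = sum(answers) / len(answers)
    return 2.0 * (observed - 0.25)
```

No respondent’s reported answer proves anything about them (it may well be the second coin talking), yet the population rate is recoverable to within sampling error.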
But to work with tabular data where we do actually collect the responses (say, where you live and what your income is, versus a number related to your responses), we need a different method. The method is explained well here.
At a high level, let’s say you have this data:
If you knew that your friend Bob was a male, and knew that your other friend Billy, also a male, had an income less than $100k, then you can deduce that Bob has an income greater than $100k.
To preserve the privacy of the individuals in the dataset, what you can do is add a third column:
The data that you actually report is this:
Now it’s impossible to tell exactly who has which income, because anyone trying to deduce Bob’s income knows that some of the income answers have been flipped. Importantly, this introduces some error: the relationship between gender and income no longer accurately reflects the original data, even though the aggregate income figures are still correct.
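A sketch of that flipping scheme in code -- illustrative only, with made-up names and parameters, not the Bureau’s implementation. Flip each reported income bracket independently with some probability p below one half, publish only the flipped column, and debias any aggregate afterwards:

```python
import random

def publish_column(incomes_over_100k: list[bool], p: float) -> list[bool]:
    """Flip each answer independently with probability p before
    publishing, so no single published value can be trusted."""
    return [(not v) if random.random() < p else v for v in incomes_over_100k]

def debiased_count(published: list[bool], p: float) -> float:
    """In expectation, published_rate = true_rate * (1 - p)
    + (1 - true_rate) * p; solve that for the true count."""
    rate = sum(published) / len(published)
    return len(published) * (rate - p) / (1.0 - 2.0 * p)
```

With p = 0.25, roughly a quarter of the published incomes are wrong, so Bob’s published row proves nothing about Bob -- but the aggregate count can still be recovered to within sampling error.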
I’m not exactly sure if this is the method that the Census is using, but they’re using something like this.
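For what it’s worth, the textbook way to make the privacy/accuracy tradeoff explicit is the Laplace mechanism. Here is a minimal sketch under my own assumptions (the Bureau’s production system is far more elaborate than this): for a counting query, where one person can change the answer by at most 1, adding Laplace noise with scale 1/ε makes the released count ε-differentially private. Smaller ε means more noise and more privacy; larger ε means less noise and more accuracy.

```python
import math
import random

def private_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    A counting query has sensitivity 1 (one person changes the
    answer by at most 1), so scale = 1/epsilon suffices for
    epsilon-differential privacy.
    """
    u = random.random() - 0.5  # Uniform(-0.5, 0.5)
    # Inverse-CDF sampling from Laplace(0, 1/epsilon)
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

Any single release is noisy, but the noise has mean zero, so many releases (or large counts) remain accurate on average -- that is the tradeoff the math makes transparent.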
Are people happy? No! Some people in the field are very unhappy. They don’t like the additional error -- and the error matters! The linked-to document shows that many school districts had inaccurate counts of children.
What do I think? I think it’s cool, frankly. I love math and I think it’s cool that the US Government is using state of the art techniques to protect everyone’s data.
There was a great podcast from Data Skeptic last year on this with the computer scientist leading the changes. In addition to giving more detail, there's an interesting discussion on moving from private methods with undisclosed details to public methods (documenting their differential privacy algo).
https://podcasts.apple.com/us/podcast/differential-privacy-at-the-us-census/id890348705?i=1000497481372
Also interesting is this paper: https://arxiv.org/abs/1809.02201. It covers implementation issues the Census Bureau (and its consumers) faced and is a relatively light read.