Tuesday, August 9, 2016

Gross overconfidence with public data

The Australian Bureau of Statistics is showing all the signs of being grossly overconfident about every aspect of the 2016 Census, bordering on incompetence.

You've heard all about the data retention in broad terms, but what exactly does it mean? And why could it be bad? After all, the data is "anonymized" so that personally identifiable information is removed before being shared, right? The original non-anonymized versions are encrypted and safe in the hands of ABS administration, so there's nothing to worry about.

Well, it's not that simple.

Let's talk about anonymization vs aggregation, how de-anonymization works, and why the "statistical linkage key" is appallingly flawed.

Anonymization vs Aggregation

When dealing with data for groups of people, the data starts out as a collection of information associated with the person to whom it refers: a Census form with a name, address, age, postcode, annual income, sexuality, etc. recorded on it. Many people don't want such raw data shared freely with the world, and are less likely to answer truthfully if they don't trust that the data will remain private. So the value of the data is degraded.

For that reason, if you're not going to keep data completely private you want to assure people of their privacy and data security using tools like encryption, hashing, anonymization, and aggregation, so they're more likely to answer truthfully.


Anonymization in general refers to transforming data such that the person described by the data cannot be identified by the recipient of the information. Simplistically you'd just delete the name, address, date of birth, postcode, and other personally identifying features.

Anonymized data remains individual; there's still one data record per person, the personally identifying information is just removed.


Aggregation goes a step further. It combines the data of multiple people, using categories to group, sum, and average (mean, median, etc.) the data. Instead of "Person 1 has HIV, works in construction and is female; Person 2 does not have HIV, works with children and is male", you have "50% of the sample of two people have HIV, half work with children, half work in construction, and 50% are male".
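As a toy illustration of the difference (entirely hypothetical records, in Python):

```python
# Toy illustration of anonymization vs aggregation on hypothetical records.
records = [
    {"name": "Person 1", "sex": "F", "occupation": "construction", "hiv": True},
    {"name": "Person 2", "sex": "M", "occupation": "childcare",    "hiv": False},
]

# Anonymization: still one record per person, identifiers stripped.
anonymized = [{k: v for k, v in r.items() if k != "name"} for r in records]

# Aggregation: only group-level summaries survive.
aggregated = {
    "sample_size": len(records),
    "pct_hiv":  100 * sum(r["hiv"] for r in records) // len(records),
    "pct_male": 100 * sum(r["sex"] == "M" for r in records) // len(records),
}
print(aggregated)  # {'sample_size': 2, 'pct_hiv': 50, 'pct_male': 50}
```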

Aggregated data is much harder to link with other data sets. If you have some separate data that says that John Doe is male and has HIV, you can't reliably tell whether they're represented in the data, and you can't link them to the other data to learn more about them (like whether they work with children).

This also makes the data less useful for research, harder to work with, and quite frustrating for many purposes. It's great to be able to link separate data sets together for population health research and all sorts of other really good causes. So having to aggregate data is really annoying.

De-anonymization and de-aggregation

If you know a few reasonably uncommon things about a person, or a greater number of more common things, it's very easy to "de-anonymize" data. It's still individual records, just with the easily personally identifying stuff removed. So if you know that someone was born in Hawera, New Zealand, in the 1960s, immigrated to Australia in the 1980s where they lived in NSW until the '90s then moved to Perth, you've got a really good chance of matching that person in any data set containing place of birth and migration information. You don't need an address, date of birth, etc.

The more complete a data set is, and the more information is in any way available elsewhere, the easier it is to de-anonymize. De-anonymization can be done en masse, by matching two or more databases together on common points and then fishing in the results to see if you find anything interesting. It can also be done in a targeted manner, where you fish for a specific person in an anonymized dataset by researching enough other details about them to link them to their personally identifying information. Both are highly effective for different purposes.
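The matching itself is nothing fancy; a minimal sketch on made-up data (all names and fields hypothetical) is just a join on the quasi-identifying columns:

```python
# Minimal sketch of de-anonymization by linkage on hypothetical data:
# join an "anonymized" dataset to an auxiliary source on quasi-identifiers.
anonymized = [
    {"birthplace": "Hawera", "birth_decade": 1960, "arrived_decade": 1980, "hiv": True},
    {"birthplace": "Sydney", "birth_decade": 1970, "arrived_decade": None, "hiv": False},
]
auxiliary = [  # e.g. details scraped from a public biography
    {"name": "John Doe", "birthplace": "Hawera", "birth_decade": 1960, "arrived_decade": 1980},
]

QUASI_IDS = ("birthplace", "birth_decade", "arrived_decade")

def reidentify(known):
    """Return the anonymized records matching a known person's quasi-identifiers."""
    return [r for r in anonymized
            if all(r[k] == known[k] for k in QUASI_IDS)]

matches = reidentify(auxiliary[0])
if len(matches) == 1:  # a unique match re-identifies the record...
    print(auxiliary[0]["name"], "->", matches[0]["hiv"])  # ...and its sensitive fields
```

No name, address, or date of birth needed: a handful of uncommon attributes narrows the candidates to one.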

One of the most famous recent examples is when Netflix ran a competition for improved recommendation algorithms, giving contestants an anonymized version of their customer database to use for testing. One team took the research in a different direction and showed that they could de-anonymize the data to learn what individual people liked to watch. The same principle can be easily applied to web browsing histories, library records, etc given an anonymized dataset to start with.

To find out just how little information it can take to identify someone, read up on the practice of "Doxxing" anonymous Internet forum accounts, where people research what they write and from little mentions of places, experiences, etc, determine who they are then publish their name, address, etc.

De-aggregation is harder because data has actually been removed. But if you have enough different axes in which the data is projected and aggregated, and especially if you can request new aggregation runs (like you can with the Census), data can be "de-aggregated" back into reasonable approximations of individual records. As access to large amounts of computer power becomes easy and cheap with services like AWS EC2 and the technology and statistical theory around it improves, de-aggregation is getting rapidly easier.

Like de-anonymization, you can use de-aggregation techniques to fish for specific pieces of information ("is celebrity $x gay") or find sets of people ("women with children under 4 who live alone in a high crime area").

De-aggregation is made dramatically easier by the ability to run new aggregate queries to refine the dataset. The ability to link new pieces of data into the data set before aggregation, as the ABS will now allow, makes it immensely easy, almost embarrassingly simple.
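The simplest version of this is the classic "differencing" attack. A hedged sketch, assuming only an interface that answers aggregate counts (the names and data are hypothetical; in practice you'd carve the target out of the subgroup via quasi-identifiers, not by name):

```python
# Sketch of a differencing attack against an aggregate-only interface.
# All names and data here are hypothetical.
population = [
    {"name": "Alice", "suburb": "X", "hiv": False},
    {"name": "Bob",   "suburb": "X", "hiv": True},
    {"name": "Carol", "suburb": "X", "hiv": False},
]

def count(pred):
    """Stand-in for an aggregate-only query: returns a count, never records."""
    return sum(1 for p in population if pred(p))

# Two innocent-looking aggregate queries: one over the whole group, one
# over a subgroup defined so as to exclude the target...
with_target    = count(lambda p: p["suburb"] == "X" and p["hiv"])
without_target = count(lambda p: p["suburb"] == "X" and p["hiv"] and p["name"] != "Bob")

# ...whose difference reveals an individual's attribute.
print("Bob has HIV:", with_target - without_target == 1)
```

Each query on its own looks harmless; it's the ability to request many overlapping slices that leaks individual records.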

The Census

To date, the Census has mostly published aggregated data. It publishes a range of useful standard reports, and you can order more specific custom reports where the ABS runs queries on the data. As far as I can tell they haven't published individual data records before, even fully anonymized ones with personally identifying information entirely deleted. There are services that let you query the aggregated data.

If they did publish fully redacted records you might still be able to link them back to a person, but you'd have to know a lot more about them, since you can't use the common stuff: name, address, date of birth, etc. So it's harder to do mass data matching, like linking stolen commercial databases to the Census data to learn things about people with privacy, health, and safety implications.

What's changed?

The ABS is now allowing individual records to be matched with external data sources using a statistical linkage key. If you've studied any IT you'll be thinking about hashes at this point. Nope. No such luck.


"A key that enables two or more records belonging to the same individual to be brought together. It is represented by a code consisting of the 2nd, 3rd and 5th characters of a person's family name, the 2nd and 3rd letters of the persons' given name, the day, month and year when the person was born and the sex of the person, concatenated in that order."
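To make the point concrete, here's that construction as a few lines of Python. This is a naive sketch of the quoted definition only (hypothetical name and date of birth; the short-name edge cases the definition glosses over would crash it):

```python
from datetime import date

def linkage_key(family_name, given_name, dob, sex):
    """Build the key exactly as the quoted definition describes.

    Naive reading for illustration: it assumes both names are long
    enough to index, so the short-name cases raise an IndexError.
    """
    fam = family_name.upper()
    giv = given_name.upper()
    fam_part = fam[1] + fam[2] + fam[4]   # 2nd, 3rd and 5th characters
    giv_part = giv[1] + giv[2]            # 2nd and 3rd letters
    return fam_part + giv_part + dob.strftime("%d%m%Y") + sex

print(linkage_key("Citizen", "Jane", date(1965, 3, 14), "F"))  # ITZAN14031965F
```

Anyone who knows a person's name, birthday, and sex (i.e. their Facebook friends list) can compute their key.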

This isn't anonymization. It's just stupid. Quite apart from the issues with short names, people without a first or last name, what exactly "family" or "given" name even means, people with multiple names, etc., it's horrifically insecure... and very prone to clashes as well, but with no disambiguation key. It's the worst of all worlds and I cannot possibly imagine what they were thinking.

Have they even heard of cryptographic hashes? Or... ANYTHING? This makes the anonymized Netflix dataset look like random numbers in comparison.
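For contrast, a hedged sketch of what "at least use a hash" could look like: a keyed (HMAC) digest over the same fields, with the key held only inside the ABS. The `SECRET` here is a hypothetical placeholder; a real deployment would keep it in an HSM, not in source code.

```python
import hashlib
import hmac

# Hypothetical secret; the point is that only the ABS holds it.
SECRET = b"held-only-inside-the-ABS"

def hashed_linkage_key(family_name, given_name, dob_iso, sex):
    """Keyed digest of the same identifying fields.

    Without SECRET, an outsider cannot compute anyone's key, and the
    key itself reveals nothing about the fields that went into it.
    """
    material = "|".join([family_name.upper(), given_name.upper(), dob_iso, sex])
    return hmac.new(SECRET, material.encode(), hashlib.sha256).hexdigest()

print(hashed_linkage_key("Citizen", "Jane", "1965-03-14", "F"))
```

The key matters: an unkeyed hash of such low-entropy inputs could be brute-forced from phone-book data, but with the secret held internally an outsider can't compute anyone's key at all. Clashes would still need handling, but at least the key leaks nothing on its own.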

It's really, really simple to match this key to all sorts of other data sets to de-anonymize the data. Really, really easy. Which is the idea, I guess, so we have to assume they don't plan to release the individual records alongside their keys, but continue the current practice of running queries on them in-house, and just letting you supply your own data to be linked and aggregated.

But even then, de-aggregating it will be made dramatically simpler.


The ABS assured everyone that the website wouldn't crash; it had been thoroughly load-tested.

“We have load tested it at 150 per cent of the number of people we think are going to be on it on Tuesday for eight hours straight and it didn’t look like flinching,” he said.

Well, they didn't do that very well. It's been down for hours. My guess: they probably made the classic mistake of "load testing" it with a fully automated load that doesn't stop for user "think time", doesn't simulate network dropouts and retries, slow connections, etc, and probably wasn't overly bandwidth-restricted. So it worked just fine, but when it collided with real world networks it fell down. This is a very common mistake for "green" IT people attempting big public web projects for the first time, but it's one that people doing population stats studies should be able to avoid. Distributed databases, CDNs, auto-scaling, network simulation... it's not that hard.
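The "think time" point can be sketched in a few lines. This is a toy asyncio load generator with entirely hypothetical numbers, modelling user pauses, dropouts, and retries rather than hammering the server flat out:

```python
import asyncio
import random

async def fake_request():
    await asyncio.sleep(0.001)           # stand-in for an HTTP round trip
    return random.random() > 0.1         # ~10% simulated network dropouts

async def simulated_user(pages=3):
    served = 0
    for _ in range(pages):
        while not await fake_request():  # retry on dropout, as real browsers do
            await asyncio.sleep(0.002)   # back off before retrying
        served += 1
        await asyncio.sleep(random.uniform(0.005, 0.02))  # user "think time"
    return served

async def main(users=50):
    totals = await asyncio.gather(*(simulated_user() for _ in range(users)))
    return sum(totals)

print("pages served:", asyncio.run(main()))
```

A load profile like this holds connections open across pauses and retries, which is exactly the behaviour that a flat-out synthetic benchmark never exercises.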

These are the same people who're blithely assuring us that the data's security is complete. It's stored encrypted and safe at the ABS and properly anonymized when outside the ABS. I don't believe either; their anonymization is pathetic, and encryption is only useful so long as you store the keys securely and don't keep the data online. Betcha they're not handling that well either.

So they're changing things dramatically by making individual records available for data matching, even if they only do the matching in-house using data you supply then return aggregated results. Even if we presume the ABS database will remain secure and the identifying information kept secret, it's trivial to link what they are making available to all sorts of common databases that are easy to get your hands on. Like, say, the Ashley Madison hack database. Or one of the endless stolen databases out there. Like LinkedIn's user database.

The thing is ... de-aggregation is relatively practical even when you can't add in your own arbitrarily selected data to link and then do multiple runs. It should be pretty damn simple to use the data and a few crafted data sets for linking to go fishing for specific questions about specific people, even if you only technically get aggregated results.

This is a monumentally stupid idea, and as far as I can tell the upper management are either colossally ignorant or wilfully deceitful to not only push it forward but so casually dismiss all concerns about it.

Also... the ABS will supposedly keep the personally identifying data encrypted and only store the matching key. Putting aside the appalling stupidity of the matching key (at least use a hash, people!), we're supposed to believe that their network is secure and their staff are all trustworthy. Hey, try something for me. Try to hack into the US military. You think I'm joking, right? Well, the truth is you don't even have to, because you can buy access to hacked US Military computers easily on the shadier bits of the Internet like DDoS botnet sales forums. Do you think the ABS will do better?

Even if they do, if their staff have internal access there's the potential for intentional abuse. People are flawed. People like money, say, like the money offered by dodgy journalists, private investigators, etc. for information about persons of interest. People get curious about their neighbours/friends/business partners. People figure it won't hurt anyone. Well, sometimes it does, and even if it doesn't it's grossly unethical and it shouldn't be possible. I haven't found much on who within the ABS will have access, at what level, and under what controls, but I bet it won't be better than the already problematic Centrelink database, Australian Police databases, etc.

I suspect there are ABS IT staff and statisticians screaming at management about this whole mess, and they likely have been for some time. But of course public sector gag laws, social media codes of conduct, etc mean they can't speak out even in the broadest terms without losing their jobs and future employment opportunities, and possibly facing prosecution.

This is a real shame

This doesn't just matter for privacy.

It matters for your security. It increases the risk of identity theft, which is a major pain and a serious financial risk. It could put your physical safety at risk if you're a member of a vulnerable minority, have a violent ex looking for you, have kids at risk of abduction by extended family or an ex, are a witness to a violent crime, etc., etc.

It also has the potential to greatly reduce the quality of the data, because if we don't trust how it's going to be used, we're less likely to give complete and honest answers. That's sad for Australia. It's an opportunity we'll never get back, it'll hurt future Census runs too, and we can't ever fix the harm to this Census's data quality. Nor will we ever know how much harm there is.

People who would've previously replied with accurate data and allowed its release after 70-100 years for historical research may instead give incomplete or false data and/or refuse its release. Again, what a loss.

The biggest problem has been communication. The ABS has been incredibly arrogant and condescending, and is obviously not taking this seriously. It has squandered a lot of public trust, and that's going to hurt in the long term.

"Your information cannot be identified outside of the ABS and is safe inside of the ABS"

Right. Sure.

How it could've been done better

Well, I'm not an expert in the area; there are many others much better qualified to speak on this. Unfortunately, they either don't work at the ABS, or, more likely, the ABS's management aren't listening. I'll speak my piece; take it with a grain of salt or four.

The idea is, fundamentally, not that bad. There's always some risk of de-aggregation, but there are mitigation options available. Limiting the number of variables, number of different analysis runs, etc.

Data linking should've been really restricted in terms of who can do it, how, and when. With more controls over the process. Hell, maybe it is, but the ABS hasn't been bothering to talk about that, they're too busy handwaving about how we're all being a bunch of tinfoil hat drama queens.

Oh well, since the website doesn't work I guess it's a moot point anyway.

This makes me sad. We really do need better ways of collecting and analysing population data, including health records etc. There's so much we can learn about difficult illnesses and syndromes, social problems, ... the potential benefits are endless. Ham-handed idiocy like this that undermines public trust works against, not for, such goals.
