There is no doubt that the arrival of the GDPR has raised the profile of privacy-related issues in the public consciousness, although the most popular illustrations of this (cookie popups and telephone/e-mail marketing consents) are both in reality manifestations of the much older ePrivacy legislation (the PECR in the UK). The impact on those working with large data sets, and at the cutting edge of new technology, is less widely recognised but far more significant.
Development of tools such as artificial intelligence or facial recognition frequently depends on having very large data sets to work with, in order to train and validate the tools. Where the finished product is intended to work with personal data, as is often the case, that implies the need for a significant quantity of personal data to be collected and used at the training stage. This comes with two significant risks. Firstly, collecting and storing any large amount of personal data carries an inherent risk of that data being lost, misused or otherwise subject to a breach which affects the individuals that the data is about. This is a particular risk where the data set has been built without the awareness of the individuals whose data is included (facial recognition tools built using images scraped from the internet, for example), because the data might be compromised when the individual did not even know that it was vulnerable.
The second risk comes from flaws that are inherent in that real world data. This has most commonly been encountered to date in the area of inherent bias - where the real world data is skewed by preconception or misconception, with prejudicial outcomes for the individuals who are the eventual subjects of the tool. So, for example, data about past hiring practices that is used to train an AI about what to look for when triaging CVs might reflect the bias of previous hirers on grounds of gender or ethnicity, and "bake in" that bias in the final tool.
Synthetic data is a so-called Privacy Enhancing Technology or "PET" that is primarily (at the moment at least) directed to the first of these risks. As the article quoted below (from the well-respected AI think tank the Alan Turing Institute) explains, building a synthetic data set is intended to produce the quantities of realistic data that are needed to train an AI tool, but in ways that are not referable to any real-world individual. One of the main immediate benefits of this is that such a repository can be maintained and added to indefinitely, without concerns about compliance with data protection legislation, and without the risk of an impact on data subjects in the event that the data set is compromised.
But it would be a mistake to see this development, significant though it is, as a complete solution to the challenges involved in developing AI that is based on personal data. Firstly, the development of synthetic data often depends on an initial pool of personal data that needs to be processed, albeit for a more limited period of time. So it does not eliminate the requirement for compliance around the collection and processing of that initial pool.
Secondly, considerable care will be required if the synthetic data is not also to be subject to the same concerns around inherent or unidentified bias. Indeed, if attempts are made to generate larger sets of synthetic data from a smaller sample of real world inputs, the risks of such bias may be magnified rather than eliminated. These are issues that will have to be tackled before the widespread use of synthetic data can be welcomed unreservedly.
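The point about bias carrying over can be illustrated with a deliberately simple sketch. It assumes a toy generator that only learns, for each group, how often that group appears and how often it was hired; real synthetic data tools are far more sophisticated, and the records, group labels and function names here are all hypothetical.

```python
import random

def fit_marginals(records):
    """Learn, per group, the number of records and the observed hire rate."""
    counts = {}
    for group, hired in records:
        total, positives = counts.get(group, (0, 0))
        counts[group] = (total + 1, positives + int(hired))
    return {g: (total, positives / total) for g, (total, positives) in counts.items()}

def generate_synthetic(model, n, rng):
    """Sample n artificial records that mimic the learned statistics."""
    groups = list(model)
    weights = [model[g][0] for g in groups]
    synthetic = []
    for _ in range(n):
        group = rng.choices(groups, weights=weights)[0]
        hired = rng.random() < model[group][1]
        synthetic.append((group, hired))
    return synthetic

def hire_rate(records, group):
    """Share of records in the given group with a positive outcome."""
    outcomes = [hired for g, hired in records if g == group]
    return sum(outcomes) / len(outcomes)

# Hypothetical historic sample: group A hired 6 times out of 10, group B once out of 5.
real = [("A", True)] * 6 + [("A", False)] * 4 + [("B", True)] * 1 + [("B", False)] * 4

rng = random.Random(0)  # fixed seed so the sketch is repeatable
model = fit_marginals(real)
synthetic = generate_synthetic(model, 10_000, rng)

print("real:     ", hire_rate(real, "A"), hire_rate(real, "B"))
print("synthetic:", hire_rate(synthetic, "A"), hire_rate(synthetic, "B"))
```

None of the 10,000 synthetic records corresponds to a real person, yet the gap between the two groups survives intact. The rate for group B was also estimated from only five real records, so any quirk in that small sample is now reproduced thousands of times with an appearance of statistical weight.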
Synthetic data is better for privacy preservation than ‘anonymised’ datasets because removing identifiers is not always enough to safeguard confidentiality. Data scientists and adversaries use increasingly sophisticated methods to demonstrate the potential to link (often publicly available) anonymised datasets, demonstrating how the identity of data subjects can be revealed in anonymised datasets, even by those with little prior knowledge. Synthetic data is thus said to hold a great deal of promise to enable insights where data is scarce, incomplete or where the privacy of data subjects needs to be preserved. It may also be ‘layered’ with other PETs. When used in Trusted Research Environments, for example, synthetic data may help researchers to refine their queries and build provisional models, therefore enabling experimentation while keeping safe any sensitive data (such as patient data in healthcare settings).
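The linkage risk described in the quoted passage can be sketched in a few lines. The example below is a simplified illustration only: the records, field names and the choice of postcode, birth year and sex as quasi-identifiers are all assumptions made for the purpose of the demonstration.

```python
# Hypothetical "anonymised" release: names removed, other attributes kept.
anonymised = [
    {"postcode": "AB1 2CD", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical publicly available register that still carries names.
public = [
    {"name": "Alice Example", "postcode": "AB1 2CD", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "postcode": "AB1 2CD", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "postcode": "ZZ9 9ZZ", "birth_year": 1990, "sex": "F"},
]

# Attributes assumed to appear in both data sets.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def link(anonymised_rows, public_rows):
    """Match anonymised rows to names via shared quasi-identifiers."""
    index = {}
    for person in public_rows:
        key = tuple(person[field] for field in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in anonymised_rows:
        key = tuple(row[field] for field in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        # Simplification: treat a single candidate in the register as a match.
        if len(names) == 1:
            matches[names[0]] = row["diagnosis"]
    return matches

reidentified = link(anonymised, public)
for name, diagnosis in reidentified.items():
    print(name, "->", diagnosis)
```

Two of the three "anonymised" rows are tied back to a named individual without any identifier having been published. A properly generated synthetic record has no real person behind it to be found in this way, which is the advantage the passage describes.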