A new study says clear guidelines need to be established for the generation and processing of synthetic data to ensure transparency, accountability and fairness.
Synthetic data, generated by machine learning algorithms from original real-world data, is gaining attention because it can provide a privacy-preserving alternative to traditional data sources. This is especially useful when the real data is too sensitive to share, missing, or of poor quality.
Synthetic data differs from real-world data because it is generated by algorithmic models called synthetic data generators, such as generative adversarial networks or Bayesian networks.
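To illustrate the idea in miniature: a minimal "synthetic data generator" can be sketched by fitting a simple statistical model to real records and sampling new ones from it. The dataset and model below are illustrative assumptions, not from the study; real generators such as generative adversarial networks or Bayesian networks learn far richer structure than a single Gaussian.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "real" dataset: 1,000 records with two numeric attributes
# (say, age and income). In practice this would be sensitive data.
real = rng.multivariate_normal(
    mean=[40.0, 55_000.0],
    cov=[[100.0, 20_000.0], [20_000.0, 2.5e8]],
    size=1_000,
)

# Minimal generator: estimate the mean and covariance of the real data,
# then sample entirely new records from the fitted distribution.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1_000)

# The synthetic records mimic aggregate statistics of the real data
# without copying any individual row.
print(np.allclose(synthetic.mean(axis=0), mu, rtol=0.1))
```

Even this toy example shows why regulation is tricky: the synthetic rows are artificial, yet they are derived from, and statistically resemble, the original personal data.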
The study warns that existing data protection laws that apply only to personal data are not sufficiently equipped to regulate the processing of synthetic data of all kinds.
Laws such as GDPR only apply to the processing of personal data. The GDPR definition of personal data includes “any information relating to an identified or identifiable natural person.” However, not all synthetic datasets are completely artificial. Some may contain personal information or pose a risk of re-identification. Fully synthetic datasets are generally exempt from GDPR rules, unless there is a possibility of re-identification.
It remains unclear what level of re-identification risk would suffice to trigger the application of data protection laws to fully synthetic data. This creates legal uncertainty and practical difficulties for those processing such datasets.
The study, by Professor Ana Beduschi of the University of Exeter, was published in the journal Big Data & Society.
It says there need to be clear procedures for holding accountable those responsible for generating and processing synthetic data. We need assurances that synthetic data will not be generated and used in ways that harm individuals and society, such as perpetuating existing biases or creating new ones.
Professor Beduschi said: “We need to establish clear guidelines for all types of synthetic data, prioritizing transparency, accountability and fairness. This is especially important as advanced language models such as E3 and GPT-4, which can be trained on or generate synthetic data, can facilitate the spread of misleading information and have a negative impact on society. As such, adhering to these principles can help reduce potential harm and foster responsible innovation.
“Therefore, synthetic data should be clearly labeled as such and information about its generation should be provided to the user.”
More information:
Ana Beduschi, Synthetic Data Protection: Towards a Paradigm Change in Data Regulation?, Big Data & Society (2024). DOI: 10.1177/20539517241231277