As businesses become more data-centric, there is an emerging trend to seek smarter ways to use the massive volumes of data generated. Utilising and sharing that data, however, comes with associated risks, such as loss of control over confidential information or sensitive personal data, which limits how such vast datasets can be used. But maybe not anymore!
The production and use of synthetic data is poised to become one of the most essential methods of data sharing. Synthetic data differs from “real-world” data in that it is created artificially using generative models and algorithms to imitate real-world data. Of particular importance is the enormous potential of synthetic data to preserve data subjects’ privacy by mimicking the qualities and patterns of real data without using or disclosing any actual personal data.
The ICO released its study paper ‘Exploring Synthetic Data Validation – Privacy, Utility, and Fidelity’, which reflects its collaborative work with the Financial Conduct Authority (FCA) and the Alan Turing Institute. The paper offers useful insights on the utility of synthetic data, the validation of its privacy features and approaches to advancing synthetic data across various industries. We look at what synthetic data is in more detail and how it can be used in different industries (such as financial services and healthcare), and conclude with the main takeaways from the collaborative research paper.
Why synthetic data:
Sharing personal data carries the potential for significant risk of harm to individuals, which is why it is heavily regulated in the UK and globally. To strike the much-needed balance between harnessing the societal and economic advantages of data sharing and embracing a data protection by design and default approach to data use, the ICO is increasingly promoting the use of privacy-enhancing technologies (PETs), and synthetic data is one of these recommended technologies. Synthetic data is information that is artificially generated using a purpose-built mathematical model or algorithm (also referred to as a synthetic data generator), with the aim of solving a task or, in this case, mitigating the risk of disclosing sensitive information.
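To make the idea of a synthetic data generator more concrete, here is a minimal sketch in Python. It is deliberately naive: it learns only the mean and standard deviation of each numeric column and samples new rows from independent normal distributions. The function names (`fit_generator`, `sample_synthetic`) and the sample values are hypothetical and made up for illustration; real generators model far richer structure, and this sketch makes no privacy guarantee.

```python
import random
import statistics

def fit_generator(rows):
    """Learn a (mean, standard deviation) pair for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n artificial rows, sampling each column independently."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Made-up "real" records for illustration only: [age, annual income].
real_rows = [
    [34, 52000.0],
    [45, 61000.0],
    [29, 48000.0],
    [52, 75000.0],
    [41, 58000.0],
]

params = fit_generator(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)
```

Because each column is sampled independently, this sketch discards relationships between columns (for example, between age and income), which is one reason practical generators are considerably more sophisticated.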
The widely cited Hype Cycle for Artificial Intelligence report, published by Gartner Inc. in 2022, predicted that synthetic data will lead to a 70% reduction in privacy sanctions by 2025. For training and testing models, synthetic data also offers the potential for large, affordable datasets that can be generated quickly, and it is anticipated that by 2024, 60% of the data needed to train and develop artificial intelligence models will be synthetic. You can find more about the practical application of synthetic data in our previous webinar: Synthetic Data - The Future of Data Sharing.
Identifying use cases:
Some of the use cases identified include:
Healthcare: Patient information is often extremely sensitive, posing a challenge for its use in clinical trials and research. Synthetic data can produce high-quality data sets with a lower risk than that faced when using real data. It can be used to address queries (such as trends in medical tests) without requiring sight of sensitive information (such as individuals’ actual medical results), facilitating research outcomes and the development of new treatments while protecting sensitive personal medical data.
Insurance: Insurance companies can adopt synthetic data generation to navigate the regulations and restrictions applying to the use of personal data. For example, companies have used synthetic data to improve their underwriting accuracy with the help of statistical insights in order to stay competitive and adapt to market developments.
Financial services: Synthetic data can be used to advance research into the identification of money laundering and fraudulent transactions. Synthetic data sets containing representative examples of customer information, such as artificial customer interactions including account opening behaviour, payments, and withdrawals, can be used to develop new tools to detect and combat illegal activities without the risk of exposing sensitive customer information. It is also important to point out the ICO’s open commitment to facilitating the development of responsible and legal use of synthetic data in financial services. Plans are underway, with external experts from industry and academia, to develop usable case studies that will demonstrate how synthetic data can be used to share data responsibly and legally between financial institutions, while complying with data protection law and mitigating the risks to individuals. This is one to watch.
Key takeaways from the joint research paper:
Synthetic data generation and use is assessed from three interdependent perspectives:
- Privacy – whether there is a risk that personal or sensitive data can be re-identified from the synthetic dataset;
- Fidelity – how similar the synthetic dataset is to the real data used as input; and
- Utility – how useful the synthetic data is for the task it was created for.
The challenge identified by the research lies in balancing these three perspectives so that a synthetic dataset can be generated to serve a specified use case without compromising the truthfulness or privacy of the data.
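As a rough illustration of how the three perspectives might each be turned into a number, consider the sketch below. These are not the metrics used in the research paper; the function names, the choice of measures and the sample rows are all assumptions made for illustration. Fidelity is approximated by the gap between column means, utility by whether the same business query gives a similar answer on both datasets, and privacy by how close any synthetic row comes to a real one.

```python
import math
import statistics

def fidelity_gap(real, synthetic):
    """Largest relative gap between real and synthetic column means.
    Assumes every real column has a non-zero mean."""
    gaps = []
    for real_col, syn_col in zip(zip(*real), zip(*synthetic)):
        real_mean = statistics.mean(real_col)
        gaps.append(abs(statistics.mean(syn_col) - real_mean) / abs(real_mean))
    return max(gaps)

def utility_gap(real, synthetic, query):
    """Difference in the answer to the same query on both datasets."""
    return abs(query(real) - query(synthetic))

def min_distance_to_real(real, synthetic):
    """Smallest Euclidean distance from any synthetic row to any real row.
    A value of 0 means a real record was reproduced exactly."""
    return min(math.dist(s, r) for s in synthetic for r in real)

# Made-up rows for illustration only: [customer age, transaction amount].
real = [[34, 120.0], [45, 80.0], [29, 200.0], [52, 60.0]]
synthetic = [[36, 110.0], [44, 95.0], [31, 190.0], [50, 70.0]]

def share_over_100(rows):
    """Example query: share of transactions above 100."""
    return sum(1 for row in rows if row[1] > 100) / len(rows)

fidelity = fidelity_gap(real, synthetic)
utility = utility_gap(real, synthetic, share_over_100)
closest = min_distance_to_real(real, synthetic)
```

The tension between the perspectives shows up directly here: pushing the synthetic rows closer to the real ones shrinks the fidelity and utility gaps, but it also shrinks the distance to the nearest real record, which is the privacy concern. Note that the distance measure mixes columns with different units, a simplification a real assessment would need to address.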
The notable privacy recommendations from the research are:
- It is best to evaluate the specific application and the level of access that the synthetic data will require when deciding how to use it (for example, will the data be shared externally or used in-house?). This matters because companies may need this evaluation to determine whether the use case involves processing personal data, confidential information (which includes non-personal data), or both. These assessments must be taken into account to reduce risk. For instance, a company that violates the rules governing the use of personal data and confidential information may face a higher risk of regulatory fines, data litigation and even reputational harm.
- Organisations should include only the features required to satisfy their specific use case, and nothing else, to minimise the risk of re-identification from a synthetic dataset and to comply with the UK GDPR’s rules on data minimisation and purpose limitation.
- To mitigate the anticipated trade-off between privacy and fidelity, businesses could classify prospective use cases based on the level of fidelity and the characteristics required, and develop multiple synthetic datasets, each with its own set of privacy requirements. In compliance with the UK GDPR, businesses that follow this approach must ensure that, where they use personal data to meet additional use cases, the new use is compatible with the original purpose, they have the data subject’s consent, or there is a clear obligation or function set down in law.
- Whether or not privacy guarantees are built into a synthetic data generator, it is recommended practice to always ensure that privacy risk is factored into post-generation testing of the synthetic data.
- To encourage adoption, regulatory organisations’ compliance thresholds will need to change to a risk-based approach that acknowledges that generating and distributing synthetic data carries inherent risk. In this sense, organisations will only have to demonstrate that they have mitigated the risk of re-identification to a point where it is sufficiently remote, and that the information is “effectively anonymised”.
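Two of the recommendations above, limiting features to those the use case needs and testing for privacy risk after generation, can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (`keep_required_features`, `count_exact_copies`), the field names and the records are hypothetical, and counting exact copies is just one very basic post-generation check, not a test of effective anonymisation.

```python
def keep_required_features(records, required):
    """Drop every field that the use case does not need."""
    return [{key: record[key] for key in required} for record in records]

def count_exact_copies(real, synthetic):
    """Count synthetic records that reproduce a real record exactly."""
    real_keys = {tuple(sorted(record.items())) for record in real}
    return sum(tuple(sorted(record.items())) in real_keys for record in synthetic)

# Made-up records for illustration only.
real_records = [
    {"customer_id": "C001", "postcode": "AB1 2CD", "age_band": "30-39", "monthly_spend": 420},
    {"customer_id": "C002", "postcode": "EF3 4GH", "age_band": "40-49", "monthly_spend": 310},
]

# Assumed use case: spending analysis by age band, so identifiers are not needed.
required = ["age_band", "monthly_spend"]
minimised_real = keep_required_features(real_records, required)

# Hypothetical generator output, checked after generation.
synthetic_records = [
    {"age_band": "30-39", "monthly_spend": 415},
    {"age_band": "40-49", "monthly_spend": 310},  # identical to a real record
    {"age_band": "50-59", "monthly_spend": 280},
]

copies = count_exact_copies(minimised_real, synthetic_records)
```

A non-zero count flags records that warrant review before the dataset is shared; a zero count on its own does not show that the risk of re-identification is sufficiently remote.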
It is evident from the use cases that much good can flow from harnessing the power of large data sets. We will continue to keep an eye on these developments and, in time, on how they sit alongside the new wave of EU data regulation beyond the GDPR.