27/5/2021

Synthetic data, the last frontier of data protection and data sharing

Blog authors
Data Valley Consulting
Team

Those who work with data often face legal and technical barriers to sharing data, sensitive or otherwise, with partners or potential customers outside their company. These barriers often result in lost opportunities for growth, and they can hinder innovation in the product or service offered.

However, companies and researchers who want this data to develop algorithms or machine learning systems do not necessarily need the real data itself. What they need are realistic datasets that simulate the real information and maintain its statistical properties. This alternative data is known as synthetic data.

Strictly speaking, synthetic data is data created through a particular anonymization technique based on generative machine learning models. Starting from a real dataset, an artificial intelligence system is trained to identify the correlations and statistical metrics of the original dataset. It then generates a new dataset from scratch (ex novo) that keeps the same statistical distribution as the original while not sharing any record with the real dataset.
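The train-then-generate loop described above can be sketched in a few lines. This is only a minimal illustration: it swaps the generative machine learning model for a plain bivariate Gaussian, and the two columns (age and income) and all the numbers are hypothetical. Real synthesis systems use far richer models, but the shape of the process is the same: learn the statistics, then sample new rows.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Learn means, standard deviations and the correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample(model, n, rng):
    """Draw n new rows from the fitted bivariate Gaussian."""
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)

# Hypothetical "real" dataset: (age, income), with income correlated to age.
real = []
for _ in range(1000):
    age = rng.uniform(20, 70)
    real.append((age, 1000 + 50 * age + rng.gauss(0, 300)))

real_model = fit_gaussian(real)                  # training step
synthetic = sample(real_model, 5000, rng)        # generation step
synthetic_model = fit_gaussian(synthetic)        # re-measure for comparison

print("correlation, real vs synthetic:",
      round(real_model["rho"], 3), round(synthetic_model["rho"], 3))
print("rows shared with the real dataset:", len(set(synthetic) & set(real)))
```

The synthetic rows reproduce the mean and correlation of the original columns, yet none of them is a copy of a real row. A Gaussian only captures linear correlations, which is one reason production systems rely on more expressive generative models.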

Synthesis thus makes it impossible to trace the real data back from the generated data, without losing the original statistical information. This contrasts with conventional anonymization techniques, where the data is stripped of all “personal” elements and part of the information contained in the dataset is lost as a result.

Any model that generates synthetic data can only replicate the specific properties of the original real data. Even so, synthetic data has several advantages over real data:

  • overcoming restrictions on data use: as mentioned, real data may be subject to usage restrictions under privacy or other regulations, while synthetic data can replicate all the important statistical properties without exposing real data, thus eliminating the problem;
  • creating data to simulate conditions that have not yet occurred: where real data does not exist, synthetic data is the only solution (for example, in the training of self-driving vehicles, to study their behavior in the event of an accident);
  • lower cost: generating synthetic data is, on average, cheaper than acquiring real data;
  • easier sharing of information between companies in the same sector: because the information is purely statistical, it can be shared without the risk of exposing each company's customers to competitors.

Synthetic data therefore represents an interesting technological solution for companies able to integrate artificial intelligence systems or advanced data analytics tools into their services or workflows. In particular, data synthesis is a valid solution in sectors where the availability and use of real data is not always possible, such as:

  • healthcare, where the processing of genetic, biometric or health-related data enjoys a special protection regime;
  • insurance, banking and financial services, sectors in which data science and data analytics are revolutionizing the services offered;
  • retail, increasingly characterized by customer profiling, which must be carried out in accordance with privacy provisions;
  • public administration, which holds an enormous amount of data of undoubted statistical value that cannot always be made available to the public.

However, the advantage of using synthetic data is not confined to sectors where real datasets are scarce. It extends to any other sector as a competitive lever.

In fact, the competitive advantage deriving from the use of data in business processes grows mainly as a function of the amount of data available. Consequently, a synthesis tool can be used to generate new data that, combined with the datasets the company already holds, can provide more precise insights and a competitive advantage in its market sector.

An element that should not be underestimated is, as mentioned, the cost of each piece of information. To be used effectively, every piece of data must be acquired, pre-processed and stored, which represents a significant cost for the company.

Anonymization techniques need to balance a high standard of data protection on the one hand against minimal loss of information on the other, since any information lost reduces the value of the dataset.

In this regard, those who design synthetic data generation systems must still reconcile privacy and utility. However, since there is no direct link between real and synthetic data, it will never be possible to re-identify an original record by analyzing the attributes of the generated data, because they are artificial. Instead, it will be necessary to evaluate how “resistant” the generative system is to malicious attacks aimed at deducing personal attributes from the synthetic dataset as a whole.
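One simple way to start probing that “resistance” is to measure how close each synthetic row lies to its nearest real row: a generator that has memorized its training records will emit near-copies. The sketch below is an illustrative heuristic, not a method described in this article and not a substitute for a proper attack evaluation. The function names, the threshold and the toy data are all assumptions, and the features are assumed to be on comparable scales.

```python
import math
import random

def closest_distance(row, reference):
    """Euclidean distance from one row to its nearest row in `reference`."""
    return min(math.dist(row, ref) for ref in reference)

def share_of_near_copies(synthetic, real, threshold):
    """Fraction of synthetic rows lying within `threshold` of some real row."""
    hits = sum(1 for row in synthetic if closest_distance(row, real) < threshold)
    return hits / len(synthetic)

rng = random.Random(1)

# Hypothetical real dataset with two numeric features.
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# A well-behaved generator: rows drawn independently from the same distribution.
independent = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# A "leaky" generator: real rows re-emitted with negligible noise.
leaky = [(x + rng.gauss(0, 1e-4), y + rng.gauss(0, 1e-4)) for x, y in real]

threshold = 1e-3  # assumed cut-off for "suspiciously close"
independent_share = share_of_near_copies(independent, real, threshold)
leaky_share = share_of_near_copies(leaky, real, threshold)

print("near-copies, independent generator:", independent_share)
print("near-copies, leaky generator:", leaky_share)
```

A high share of near-copies suggests that the model, rather than any single generated record, is where the privacy risk sits.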

For these reasons, it seems reasonable to say that a synthesis system should be evaluated on the integrity of its generative model, rather than on the principle of data minimization referred to in Articles 5 and 6 of the GDPR.

Among the first companies to introduce this technology in the Italian and European ecosystem is Aindo, a startup born at SISSA (International School for Advanced Studies) in Trieste, which has created a machine learning system to generate synthetic data.

For the first installment of the column “Data & Co.: Opportunities with data”, Data Valley interviewed Daniele Panfilo, founder and CTO of Aindo, to learn from the protagonists of this revolution how this exciting solution works and what its future prospects are.
