
A Review of Synthetic Tabular Data Tools and Models

Anonymization methods that are revolutionizing how we share data

Timothy Pillow
Picture by Mika Baumeister @mbaumi. https://unsplash.com/photos/Wpnoqo2plFA

We live in a data-driven era where big data, data mining and artificial intelligence (and other buzzwords) are revolutionizing the ways we obtain value from data. The challenge is that both private companies and public entities have no easy way to share this data internally or externally. The main hurdles are compliance laws, fears of data misuse, patient/client privacy, and an inability to transfer data securely. If not for these limitations, it's feasible that data scientists, dev-ops teams, research groups and other data professionals could deliver significantly more efficient solutions to problems.

Before the prevalence of machine learning, primitive anonymization methods often achieved anonymity only at the cost of data utility. Statistical properties were often partially or completely destroyed, and the anonymization itself was generally weak at best and prone to reverse engineering that could expose PII (personally identifiable information).

Substitution

Real values within the data are substituted with different "authentic"-looking values, e.g. replacing all real names in a column with randomly selected names from an external names list. In some cases substitution involves replacing PII with randomly coded strings that only the original data curator can match back to the original record, e.g. replacing 'John Smith' with 'R7JxvOAjtT'.
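To make this concrete, here is a minimal pandas sketch of both flavors of substitution. The dataset, column names and fake-name list are hypothetical and purely illustrative, not taken from any particular tool.

```python
import secrets

import numpy as np
import pandas as pd

# Hypothetical data with a PII column.
df = pd.DataFrame({"name": ["John Smith", "Ann Lee", "Bob Ray"],
                   "age": [34, 51, 29]})

# Substitution 1: replace real names with randomly selected fake names
# drawn from an external list (here a tiny hard-coded one).
fake_names = ["Alex Morgan", "Sam Carter", "Jo Ellis", "Pat Quinn"]
rng = np.random.default_rng(0)
df["name_substituted"] = rng.choice(fake_names, size=len(df))

# Substitution 2: replace PII with random coded strings, keeping a private
# lookup table so only the data curator can map codes back to real names.
lookup = {name: secrets.token_urlsafe(8) for name in df["name"].unique()}
df["name_coded"] = df["name"].map(lookup)

print(df[["name_substituted", "name_coded", "age"]])
```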

Shuffling

Shuffling randomizes the order of data within the same column. Unlike substitution, which replaces values with similar values from an external source, shuffling can be thought of as a form of internal substitution: values are only swapped with other values in the same column.
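A minimal sketch of column shuffling with pandas/NumPy (the employee/salary columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical salary column. Shuffling permutes values within the column,
# so the set of values (and its marginal distribution) is preserved, but
# rows no longer line up with their true owners.
df = pd.DataFrame({"employee_id": [1, 2, 3, 4],
                   "salary": [52000, 61000, 47000, 88000]})

rng = np.random.default_rng(42)
df["salary_shuffled"] = rng.permutation(df["salary"].to_numpy())

print(df)
```

Note that while the column's own distribution survives, any correlation between salary and the other columns is destroyed, which is part of the utility cost mentioned earlier.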

Nulling

Simply replacing confidential data with 'null' values such as 'NA', 'Null' or 'Missing'.

Column removal

Removing PII columns from the dataset entirely.

Masking

Obscuring part of a data value with stand-in symbols like 'X', e.g. credit card number XXXX — XXXX — XXXX — 9823. The key difference between masking and nulling is that masking preserves the general format of the original data: I can still see that the credit card number was made up of four chunks of four digits even if I don't know their values.
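The three simpler techniques above (nulling, column removal and masking) each take only a line or two of pandas. A hypothetical sketch, with made-up column names and values:

```python
import pandas as pd

# Hypothetical record containing PII columns.
df = pd.DataFrame({"name": ["John Smith"],
                   "ssn": ["123-45-6789"],
                   "credit_card": ["4321 8765 2109 9823"],
                   "purchase_total": [129.99]})

# Nulling: overwrite confidential values with a null marker.
df["ssn"] = "NA"

# Column removal: drop PII columns outright.
df = df.drop(columns=["name"])

# Masking: replace every digit except the last four with 'X', preserving
# the original format (four chunks of four characters).
df["credit_card"] = df["credit_card"].str.replace(r"\d(?=.*\d{4})", "X", regex=True)

print(df)
```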

Recommended Reading: https://pdfs.semanticscholar.org/f541/758a9179998a1b21d28d1feb90428dafad90.pdf

In 2006 Netflix launched "The Netflix Prize", an online competition to design an algorithm that predicts how a customer would rate a movie [2]. Netflix supplied a dataset of 100M ratings produced by 480K users for 17K movies. Netflix anonymized the dataset by substituting usernames with coded strings and perturbing some ratings with fake ratings. In 2008 two researchers at the University of Texas at Austin published "Robust De-anonymization of Large Sparse Datasets", detailing a new class of statistical de-anonymization attacks against high-dimensional micro-data. By combining the Netflix data with IMDb data, they were able to re-identify users. The Netflix Prize is now a canonical example of a data de-anonymization failure.

Recommended reading: https://dataprivacylab.org/projects/identifiability/paper1.pdf

Recommended reading: https://arxiv.org/pdf/1911.12704.pdf

Differential privacy is a definition, not an algorithm. An algorithm for a computational task 'T' that injects noise into the dataset (or into query answers) is differentially private if its output changes only slightly, within a bound controlled by the privacy parameter epsilon, when any single individual's record is added or removed. For a given task 'T' and a given epsilon there are typically numerous algorithms that satisfy the definition. It's also possible to build a formally private system (one that satisfies the conditions of your own privacy definition) that is not differentially private and has only one way of satisfying your system's noise criteria. The choice between differential privacy and a formalized non-differential privacy system depends on the use case. Furthermore, you can also choose non-formalized, "ad-hoc" noise injection, but this runs the risk of producing poorly anonymized data that can be easily reverse engineered by an attacker, e.g. the Netflix Prize.
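As an illustration of one common algorithm that satisfies the definition, here is a minimal sketch of the Laplace mechanism applied to a counting query. The dataset, the query and the function name are hypothetical; this is not taken from the article or from any particular library.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=None):
    """Differentially private count of records satisfying `predicate`.

    The Laplace mechanism adds noise drawn from Laplace(0, sensitivity/epsilon).
    A counting query has sensitivity 1: adding or removing one individual's
    record changes the true count by at most 1.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(predicate(row) for row in data)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical toy dataset: ages of individuals.
ages = [23, 37, 45, 52, 61, 29, 48]

# Smaller epsilon -> stronger privacy -> noisier (less accurate) answers.
for eps in (0.1, 1.0, 10.0):
    print(eps, laplace_count(ages, lambda age: age >= 40, eps))
```

Running this shows that for small epsilon the noisy counts can swing far from the true count of 4, which is exactly the accuracy problem discussed next.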

Differential privacy can be complicated and time-consuming to set up. Just as there is a tradeoff between the level of anonymization and data utility, there is a tradeoff between effort and the complexity of a privacy system. The biggest problem for differential privacy is that when epsilon is small (little data leakage is allowed) it becomes increasingly hard to find accurate differentially private algorithms for 'T', i.e. ones that maintain the statistical properties of the data well. In such circumstances someone might opt for a formal, non-differentially private system. Failing the ability to find a single algorithm that satisfies a non-differentially private system, you might default to an ad-hoc, non-formalized system as a last resort.

Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population [5]

DP-SYN

As machine learning models became more sophisticated, ideas about data anonymization shifted. Instead of applying sanitization algorithms to datasets, research groups toyed with the idea of teaching models to learn the structure of the data and subsequently generating ("synthesizing") new data based on what the models learnt.
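As a deliberately simple illustration of this "learn, then generate" idea, the sketch below fits nothing more than a mean vector and covariance matrix to two hypothetical numeric columns and then samples brand-new rows from that fitted model. Real tools (the copulas, GANs and other models listed below) learn far richer structure, but the fit/sample pattern is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical real dataset with two numeric columns.
real = pd.DataFrame({"age": [23, 37, 45, 52, 61, 29, 48],
                     "income": [31000, 52000, 61000, 72000, 69000, 40000, 58000]})

# "Learn" the data: here just the mean vector and covariance matrix.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# "Synthesize" new rows by sampling from the fitted model. None of these
# rows existed in the original data ("of a different cloth"), yet they
# roughly preserve its statistical structure.
rng = np.random.default_rng(7)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=5),
                         columns=real.columns)
print(synthetic)
```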

Recommended reading: https://www-cdn.law.stanford.edu/wp-content/uploads/2019/01/Bellovin_20190129-1.pdf

The term "synthesized data" gets thrown around loosely within differential privacy. Common phrases such as "differentially private synthesized data" or "synthetic data produced by differential privacy" are ambiguous: they can mean either 1) that synthetic data has been produced using an "of a different cloth" approach and then sanitized further using differential privacy, or 2) that differential privacy has been used to create an anonymized dataset that is distinct from the original and is therefore assumed to be "synthetic". Unfortunately, the latter meaning is the most common and, I think, the most ambiguous. I would personally argue that if the data is "of the same cloth" it hasn't been synthesized; it's been distorted/sanitized. Regardless, it's something to keep in mind when you see differentially private data: just because data is synthetic doesn't mean it's differentially private, and just because it's differentially private doesn't mean it was originally synthesized using a learnt "of a different cloth" approach.

CART (Classification and Regression Trees) — Discrete Variables

CLBN

SDV-Copulas (Synthetic Data Vault)

Recommended reading: https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342

MedGAN

So far we've seen that machine learning is very effective at learning from data and producing synthetic data that:

· 51% of senior corporate respondents said that a lack of data sharing between departments was a key issue in their data strategy. [7]

· Consumer data shows that…

86% want to exercise greater control over the data companies hold about them

· The data masking market is expected to grow at a 14.8% compound annual growth rate (CAGR) and to be worth $767M by 2022. [9]

· The global privacy management software market is expected to grow at a 33.1% CAGR. [10]

Feel free to message me with additional models and I’ll add them to the list.