Exploring the Value of GANs for Synthetic Tabular Data Generation in Healthcare with a Focus on Data Quality, Augmentation, and Privacy
Abstract
Artificial Intelligence has demonstrated immense potential in healthcare-related applications, paving the way for advancements in diagnosis, treatment, and patient care. However, data protection laws and regulations restrict access to patient data and thereby slow the development of such applications. Consequently, synthetic data, which can serve as an anonymized and representative alternative to real data, has emerged as an increasingly popular research area. While various methods exist for generating synthetic data, Generative Adversarial Networks (GANs) have demonstrated exceptional performance in this regard. This thesis focuses on utilizing GANs to generate synthetic tabular data, given that a significant portion of today's data is organized in tabular format. The primary objective is to evaluate the capabilities of GANs in generating synthetic tabular data specifically for healthcare applications. Three diverse healthcare datasets of varying sizes and complexities were selected, and two GAN models, CTGAN and CopulaGAN, were employed to generate corresponding synthetic datasets. The value of the generated data was assessed in terms of resemblance to real data, applicability to machine learning classification tasks, and preservation of individual privacy. Metrics commonly used to evaluate synthetic tabular data were applied to gauge the performance of the generated datasets. Resemblance metrics were based on comparing distributions and correlations between real and synthetic data. A novel framework, "SynthEval," was developed to provide an extensive evaluation of classifier performance on both real and synthetic data. The framework was also used to investigate whether classifier performance can be improved by augmenting real data with synthetic data. Privacy was assessed by measuring nearest neighbor distances between real and synthetic data and by checking for exact matches.
The findings indicate that GAN models have the potential to generate data that performs comparably to real data, provided that the training data used for the GAN model is sufficient in quantity and adequate in quality. The results also indicate that GAN models can generate cleaner data with less noise than real data. However, the study also reveals that very high synthetic data performance may come at the cost of compromised privacy.
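The two privacy checks named above, nearest neighbor distances and exact matches, can be illustrated with a minimal sketch. This is not the implementation used in the thesis: the function names are hypothetical, the toy rows are invented for illustration, and Euclidean distance over numeric, already-scaled features is an assumption, since the abstract does not state which distance measure was used.

```python
import math

def nearest_neighbor_distances(real_rows, synthetic_rows):
    """For each synthetic row, return the distance to its closest real row.

    Assumption: rows are numeric and already scaled, and Euclidean
    distance is an acceptable choice of distance measure.
    """
    return [
        min(math.dist(synthetic, real) for real in real_rows)
        for synthetic in synthetic_rows
    ]

def count_exact_matches(real_rows, synthetic_rows):
    """Count synthetic rows that are identical to some real row."""
    real_set = {tuple(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in real_set)

# Toy data, invented for illustration only.
real = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0)]

nn = nearest_neighbor_distances(real, synthetic)
matches = count_exact_matches(real, synthetic)
print(nn, matches)
```

Under these assumptions, a distance of zero corresponds to an exact copy of a real record, and very small distances flag synthetic rows that lie close enough to a real individual to warrant a closer look.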