Exploring the Value of GANs for Synthetic Tabular Data Generation in Healthcare with a Focus on Data Quality, Augmentation, and Privacy

Pedersen, Maria Elinor

dc.contributor.advisor	Haugerud, Hårek
dc.contributor.advisor	Gorosito, Martin
dc.contributor.advisor	Yazidi, Anis
dc.contributor.author	Pedersen, Maria Elinor
dc.date.accessioned	2023-11-07T12:48:38Z
dc.date.available	2023-11-07T12:48:38Z
dc.date.issued	2023
dc.identifier.uri	https://hdl.handle.net/11250/3101091
dc.description.abstract	Artificial Intelligence has demonstrated immense potential in healthcare-related applications, paving the way for advancements in diagnosis, treatment, and patient care. However, data protection laws and regulations present challenges that slow this development. Consequently, synthetic data has emerged as an increasingly popular research area, since it can serve as an anonymized and representative alternative to real data. While various methods exist for generating synthetic data, Generative Adversarial Networks (GANs) have demonstrated exceptional performance in this regard. This thesis focuses on using GANs to generate synthetic tabular data, given that a significant portion of today's data is organized in tabular format. The primary objective is to evaluate the capabilities of GANs in generating synthetic tabular data specifically for healthcare applications. Three diverse healthcare datasets of varying sizes and complexities were selected, and two GAN models, CTGAN and CopulaGAN, were employed to generate corresponding synthetic datasets. The value of the generated data was assessed in terms of resemblance to real data, applicability to machine learning classification tasks, and preservation of individual privacy. Metrics commonly used to evaluate synthetic tabular data were applied to gauge the performance of the generated datasets. Resemblance metrics were based on comparing distributions and correlations between real and synthetic data. A novel framework, "SynthEval," was developed to offer an extensive evaluation of classifier performance on both real and synthetic data. The framework also investigated whether classifier performance could be improved by augmenting real data with synthetic data. The privacy assessment involved measuring nearest neighbor distances between real and synthetic data and checking for exact matches.
The findings indicate that GAN models have the potential to generate data that performs comparably to real data, provided that the training data used for the GAN model is sufficient in both quantity and quality. The results also indicate that GAN models can generate cleaner data with less noise than real data. However, the study also reveals that when synthetic data performance is too high, privacy may be compromised.	en_US
dc.language.iso	eng	en_US
dc.publisher	Oslomet - storbyuniversitetet	en_US
dc.title	Exploring the Value of GANs for Synthetic Tabular Data Generation in Healthcare with a Focus on Data Quality, Augmentation, and Privacy	en_US
dc.type	Master thesis	en_US
dc.description.version	publishedVersion	en_US

Associated file(s)

Filename: Pedersen_acit2023.pdf
Size: 6.089Mb
Format: PDF

This item appears in the following collection(s)

TKD - Master i Anvendt data- og informasjonsteknologi (ACIT) [243]
