Enhancing Data Security in Healthcare with Synthetic Data Generation: An Autoencoder and Variational Autoencoder Approach
Abstract
The advent of machine learning and artificial intelligence (AI) in healthcare has transformed data analysis and patient care. However, using real patient data presents substantial privacy and security challenges. This thesis addresses these challenges by exploring the application of autoencoders (AEs) and variational autoencoders (VAEs) to synthetic healthcare data generation, offering an alternative to their more typical role as components within Generative Adversarial Network (GAN)-based pipelines.
While techniques such as CTGAN are known for generating realistic synthetic data, implementations vary in how well they protect the original data during generation. This study leverages the encoding capabilities of AEs and VAEs to propose a method that strengthens data security, producing synthetic data that preserves privacy while retaining utility for AI applications in healthcare, such as disease diagnosis and predictive modeling.
The methodology was tested on three healthcare datasets of differing size and characteristics to verify that the proposed approach protects the privacy of the original data while generating high-quality synthetic data. Privacy risks were assessed with the Anonymeter tool, enabling a direct comparison against the datasets used in prior research and substantiating the improvements gained by integrating AEs and VAEs.
This work contributes to the field of healthcare AI by providing a secure data generation framework that balances data utility with privacy. It lays the groundwork for future research on privacy-compliant AI systems in healthcare, highlighting the potential for widespread use of synthetic data under stringent privacy standards.