Correlation Visualization Under Missing Values: A Comparison Between Imputation and Direct Parameter Estimation Methods
Pham, Nhat-Hao; Vo, Khanh-Linh; Vu, Mai Anh; Nguyen, Thu; Riegler, Michael Alexander; Halvorsen, Pål; Nguyen, Binh T.
Original version
https://doi.org/10.1007/978-3-031-53302-0_8
Abstract
Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can seriously affect this important data visualization tool. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two missing-data patterns: randomly missing data and monotone missing data. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot under missing data. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data for plotting the correlation matrix may lead to a significantly misleading inference of the relationships between the features. In addition, the most accurate technique for computing a correlation matrix (in terms of RMSE) does not always give the correlation plot that most resembles the one based on complete data (the ground truth). Based on their performance in the experiments, we recommend using ImputePCA [1] for small datasets and DPER [2] for moderate and large datasets when plotting the correlation matrix.
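The kind of comparison described in the abstract can be illustrated with a minimal sketch. It assumes synthetic Gaussian data, a 30% randomly missing rate, and column-mean imputation as a simple stand-in; it does not implement ImputePCA or DPER, and the helper names `mean_impute` and `correlation_rmse` are hypothetical. The RMSE here is taken over all entries of the correlation matrix, which is one plausible reading of the metric rather than the paper's confirmed definition.

```python
import numpy as np


def correlation_rmse(estimated, reference):
    """Root-mean-square error over all entries of two correlation matrices."""
    diff = np.asarray(estimated) - np.asarray(reference)
    return float(np.sqrt(np.mean(diff ** 2)))


def mean_impute(data):
    """Replace each NaN with the mean of the observed values in its column."""
    filled = data.copy()
    col_means = np.nanmean(filled, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


# Assumed setup: 500 samples of 3 correlated Gaussian features.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8, 0.3],
                     [0.8, 1.0, 0.5],
                     [0.3, 0.5, 1.0]])
complete = rng.multivariate_normal(np.zeros(3), true_cov, size=500)

# Randomly missing pattern: each entry is dropped independently
# with probability 0.3 (an assumed rate).
mask = rng.random(complete.shape) < 0.3
incomplete = complete.copy()
incomplete[mask] = np.nan

# Ground truth: correlation matrix of the complete data.
ground_truth = np.corrcoef(complete, rowvar=False)

# Correlation matrix computed after mean imputation.
imputed_corr = np.corrcoef(mean_impute(incomplete), rowvar=False)

rmse = correlation_rmse(imputed_corr, ground_truth)
print(f"RMSE of mean-imputed correlation matrix vs. ground truth: {rmse:.3f}")
```

Either matrix can then be passed to a heatmap routine to compare the resulting correlation plots visually, since a low RMSE alone does not guarantee the plot closest to the ground truth.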