Correlation visualization under missing values: a comparison between imputation and direct parameter estimation methods

Pham, Nhat-Hao, Vo, Khanh-Linh, Vu, Mai Anh, Nguyen, Thu, Riegler, Michael A., Halvorsen, Pål, Nguyen, Binh T.

Sep-5-2023–arXiv.org Machine Learning

Correlation matrix visualization is essential for understanding the relationships between variables in a dataset, but missing data can seriously affect this important data visualization tool. In this paper, we compare the effects of various missing data methods on the correlation plot, focusing on two missing-data patterns: randomly missing data and monotone missing data. We aim to provide practical strategies and recommendations for researchers and practitioners in creating and analyzing the correlation plot under missing data. Our experimental results suggest that while imputation is commonly used for missing data, using imputed data to plot the correlation matrix may lead to seriously misleading inferences about the relations between the features. In addition, the most accurate technique for computing a correlation matrix (in terms of RMSE) does not always give the correlation plot that most resembles the one based on complete data (the ground truth). Based on its performance in the experiments, we recommend using DPER [1], a direct parameter estimation approach, for plotting the correlation matrix.
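The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's experimental setup and it does not implement DPER: mean imputation stands in for "imputation", and a pairwise-complete correlation estimate stands in for a method that estimates the correlation directly from the observed values. The synthetic data, the 40% missing rate, the function names, and the RMSE taken over all matrix entries are all assumptions made for illustration.

```python
import numpy as np

def make_data(n=2000, seed=0):
    """Draw synthetic correlated Gaussian data (assumed toy example)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.8, 0.3],
                    [0.8, 1.0, 0.5],
                    [0.3, 0.5, 1.0]])
    return rng.multivariate_normal(np.zeros(3), cov, size=n)

def mask_at_random(X, rate=0.4, seed=1):
    """Set each entry to NaN independently with probability `rate`."""
    rng = np.random.default_rng(seed)
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def corr_mean_imputed(X_miss):
    """Fill NaNs with column means, then compute the correlation matrix."""
    col_means = np.nanmean(X_miss, axis=0)
    X_filled = np.where(np.isnan(X_miss), col_means, X_miss)
    return np.corrcoef(X_filled, rowvar=False)

def corr_pairwise(X_miss):
    """Correlation from rows where both features are observed.
    This is a simple stand-in, NOT the DPER algorithm."""
    p = X_miss.shape[1]
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            ok = ~np.isnan(X_miss[:, i]) & ~np.isnan(X_miss[:, j])
            R[i, j] = R[j, i] = np.corrcoef(X_miss[ok, i], X_miss[ok, j])[0, 1]
    return R

def rmse(A, B):
    """Root mean squared error over all matrix entries (assumed definition)."""
    return float(np.sqrt(np.mean((A - B) ** 2)))

X = make_data()
R_true = np.corrcoef(X, rowvar=False)   # ground truth from complete data
X_miss = mask_at_random(X)

rmse_imputed = rmse(corr_mean_imputed(X_miss), R_true)
rmse_pairwise = rmse(corr_pairwise(X_miss), R_true)
print(f"mean imputation RMSE:   {rmse_imputed:.3f}")
print(f"pairwise-complete RMSE: {rmse_pairwise:.3f}")
```

In this toy setting, mean imputation shrinks the off-diagonal correlations toward zero because the filled-in values carry no information about the other features, which is one way an imputed correlation plot can mislead. A low RMSE alone does not guarantee a plot that looks like the ground truth, so the resulting matrices would still need to be inspected visually.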

artificial intelligence, data quality, machine learning

Country:
- Asia > Vietnam (0.15)
- Europe > Norway (0.14)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Statistical Learning (1.00)
    - Representation & Reasoning (1.00)
  - Data Science > Data Quality (1.00)
