Imputing missing values with unsupervised random trees

Nov-21-2019–arXiv.org Machine Learning

When building statistical models from tabular data for supervised learning tasks such as regression or classification, some of the observations available for fitting are often missing values in one or more variables, usually because of poor data collection practices, loss of information, participants dropping out of a survey, or similar causes. Many methods, such as [2] or [4], overcome this issue by using heuristics to handle missing information. Decision tree methods in particular, because they split on one variable at a time, are well suited to handling missing data implicitly without a priori imputation ([16]). Other methods, such as generalized linear models or support vector machines, cannot handle missing values in the same way; when they are applied to a dataset with missing entries, those entries must be either dropped or imputed. Typical strategies for imputing the missing entries include: replacing them with the column mean or median; determining the most similar observations (nearest neighbors) according to the non-missing variables and taking a simple or weighted average of the missing variable(s) from them ([11]); producing a latent representation of the data through a low-rank matrix factorization that minimizes errors on the non-missing entries, from which the missing entries are then reconstructed ([10]); and iterative imputation, which starts with a basic imputation for all values and then cycles through each variable, building a model to predict its missing values from the non-missing observations, replacing the earlier imputation with the model prediction, and repeating until convergence ([3], [18]).
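
The first two baseline strategies listed above (column-mean replacement and nearest-neighbor averaging) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the method proposed in the paper or the procedure of any cited reference: the function names, the use of `None` to mark a missing entry, the root-mean-square distance over jointly observed columns, and the default `k=2` are all assumptions made for the example.

```python
import math


def impute_column_mean(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def _distance(a, b):
    """Root-mean-square difference over columns observed in both rows.

    Returns infinity when the two rows share no observed column.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def impute_knn(rows, k=2):
    """Fill each None with the simple average of that column over the k
    nearest rows in which the column is observed.

    Distances are computed on the original (unfilled) rows. An entry is
    left as None if no comparable neighbor exists.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            neighbors = [val for dist, val in candidates[:k] if dist < math.inf]
            if neighbors:
                filled[i][j] = sum(neighbors) / len(neighbors)
    return filled


if __name__ == "__main__":
    # Hypothetical toy data: two loose clusters, with one missing entry in each.
    data = [
        [1.0, 2.0, 3.0],
        [1.1, None, 3.2],
        [5.0, 6.0, None],
        [5.2, 6.1, 9.0],
        [0.9, 2.2, 2.9],
    ]
    print(impute_column_mean(data))
    print(impute_knn(data, k=2))
```

On the toy data, the column mean pulls both missing entries toward the global average, whereas the nearest-neighbor version takes its values from rows that resemble the incomplete one on the observed variables. A weighted average, as mentioned in the text, would replace the simple mean over `neighbors` with weights that decrease with distance.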
