Conditional probabilistic graphical models provide a powerful framework for structured regression in spatio-temporal datasets with complex correlation patterns. However, in real-life applications a large fraction of observations is often missing, which can severely limit the representational power of these models. In this paper we propose a Marginalized Gaussian Conditional Random Fields (m-GCRF) structured regression model for dealing with missing labels in partially observed temporal attributed graphs. This method is aimed at learning with both labeled and unlabeled parts and effectively predicting future values in a graph. The method is even capable of learning from nodes for which the response variable is never observed in history, which poses problems for many state-of-the-art models that can handle missing data. The proposed model is characterized for various missingness mechanisms on 500 synthetic graphs. The benefits of the new method are also demonstrated on a challenging application for predicting precipitation based on partial observations of climate variables in a temporal graph that spans the entire continental US. We also show that the method can be useful for optimizing the costs of data collection in climate applications via active reduction of the number of weather stations to consider. In experiments on these real-world and synthetic datasets we show that the proposed model is consistently more accurate than alternative semi-supervised structured models, as well as models that either use imputation to deal with missing values or simply ignore them altogether.
Imputation of missing data is a common application in various classification problems where the feature training matrix has missingness. A widely used solution to this imputation problem is based on the lazy learning technique, $k$-nearest neighbor (kNN) approach. However, most of the previous work on missing data does not take into account the presence of the class label in the classification problem. Also, existing kNN imputation methods use variants of Minkowski distance as a measure of distance, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on class weighted grey distance between the missing datum and all the training data. Grey distance works well in heterogeneous data with missing instances. The distance is weighted by Mutual Information (MI) which is a measure of feature relevance between the features and the class label. This ensures that the imputation of the training data is directed towards improving classification performance. This class weighted grey kNN imputation algorithm demonstrates improved performance when compared to other kNN imputation algorithms, as well as standard imputation algorithms such as MICE and missForest, in imputation and classification problems. These problems are based on simulated scenarios and UCI datasets with various rates of missingness.
Blood lactate concentration is a strong indicator of mortality risk in critically ill patients. While frequent lactate measurements are necessary to assess patient's health state, the measurement is an invasive procedure that can increase risk of hospital-acquired infections. For this reason we formally define the problem of lactate prediction as a clinically relevant benchmark problem for machine learning community so as to assist clinical decision making in blood lactate testing. Accordingly, we demonstrate the relevant challenges of the problem and its data in addition to the adopted solutions. Also, we evaluate the performance of different prediction algorithms on a large dataset of ICU patients from the multi-centre eICU database. More specifically, we focus on investigating the impact of missing value imputation methods in lactate prediction for each algorithm. The experimental analysis shows promising prediction results that encourages further investigation of this problem.
Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness proportions and distributions. Evaluation on several real life datasets show our proposed model significantly outperforms current state-of-the-art methods under varying conditions while simultaneously improving end of the line analytics.
Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Methods available at present for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recover loss to followup information. Designed to work with high volume data, results on various simulated and real life datasets show our model is appropriate under varying dataset and loss to followup conditions and outperforms the state-of-the-art methods by a wide margin ($\ge 20\%$ in some scenarios) while preserving the dataset utility for final analysis.