Regression
Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
Visani, Giorgio, Bagli, Enrico, Chesani, Federico, Poluzzi, Alessandro, Capuzzo, Davide
Nowadays we are witnessing a transformation of the business processes towards a more computation driven approach. The ever increasing usage of Machine Learning techniques is the clearest example of such trend. This sort of revolution is often providing advantages, such as an increase in prediction accuracy and a reduced time to obtain the results. However, these methods present a major drawback: it is very difficult to understand on what grounds the algorithm took the decision. To address this issue we consider the LIME method. We give a general background on LIME then, we focus on the stability issue: employing the method repeated times, under the same conditions, may yield to different explanations. Two complementary indices are proposed, to measure LIME stability. It is important for the practitioner to be aware of the issue, as well as to have a tool for spotting it. Stability guarantees LIME explanations to be reliable, therefore a stability assessment, made through the proposed indices, is crucial. As a case study, we apply both Machine Learning and classical statistical techniques to Credit Risk data. We test LIME on the Machine Learning algorithm and check its stability. Eventually, we examine the goodness of the explanations returned.
Boosting Algorithms for Estimating Optimal Individualized Treatment Rules
Wang, Duzhe, Fu, Haoda, Loh, Po-Ling
The proposed algorithms are based on the XGBoost algorithm, which is known as one of the most powerful algorithms in the machine learning literature. Our main idea is to model the conditional mean of clinical outcome or the decision rule via additive regression trees, and use the boosting technique to estimate each single tree iteratively. Our approaches overcome the challenge of correct model specification, which is required in current parametric methods. The major contribution of our proposed algorithms is providing efficient and accurate estimation of the highly nonlinear and complex optimal individualized treatment rules that often arise in practice. Finally, we illustrate the superior performance of our algorithms by extensive simulation studies and conclude with an application to the real data from a diabetes Phase III trial. 1 Introduction Precision medicine, as an emerging medical approach for disease treatment and prevention, has received more and more attention among government, healthcare industry and academia in recent years. It is a well-known fact that there exists a significant heterogeneity for patients in response to treatments. For example, as demonstrated in [9], for patients who are infected with human immunodeficiency virus and tuberculosis, their optimal timing of antiretroviral therapy (ART) varies significantly.
Stable Prediction with Model Misspecification and Agnostic Distribution Shift
Kuang, Kun, Xiong, Ruoxuan, Cui, Peng, Athey, Susan, Li, Bo
For many machine learning algorithms, two main assumptions are required to guarantee performance. One is that the test data are drawn from the same distribution as the training data, and the other is that the model is correctly specified. In real applications, however, we often have little prior knowledge on the test data and on the underlying true model. Under model misspecification, agnostic distribution shift between training and test data leads to inaccuracy of parameter estimation and instability of prediction across unknown test data. To address these problems, we propose a novel Decor-related Weighting Regression (DWR) algorithm which jointly optimizes a variable decorrelation regularizer and a weighted regression model. The variable decorrelation regularizer estimates a weight for each sample such that variables are decor-related on the weighted training data. Then, these weights are used in the weighted regression to improve the accuracy of estimation on the effect of each variable, thus help to improve the stability of prediction across unknown test data. Extensive experiments clearly demonstrate that our DWR algorithm can significantly improve the accuracy of parameter estimation and stability of prediction with model misspecification and agnostic distribution shift.
Cost-Sensitive Logistic Regression for Imbalanced Classification
Logistic regression does not support imbalanced classification directly. Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account. This can be achieved by specifying a class weighting configuration that is used to influence the amount that logistic regression coefficients are updated during training. The weighting can penalize the model less for errors made on examples from the majority class and penalize the model more for errors made on examples from the minority class. The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.
A Sparsity Inducing Nuclear-Norm Estimator (SpINNEr) for Matrix-Variate Regression in Brain Connectivity Analysis
Brzyski, Damian, Hu, Xixi, Goni, Joaquin, Ances, Beau, Randolph, Timothy W., Harezlak, Jaroslaw
For example, it is of clinical interest to understand associations between: (a) alcoholism and the electrical activity of different brain regions over time collected from electroencephalography (EEG) (Li et al., 2010); (b) cognitive function and three-dimensional white-matter structure data collected from diffusion tensor imaging (DTI) (Goldsmith et al., 2014) for patients with multiple sclerosis (MS); and (c) cognitive impairment and brain's metabolic activity data collected from three-dimensional positron emission tomography (PET) imaging (Wang et al., 2014). Our work focuses on the problem of identifying brain network connections that are associated with neurocognitive measures for HIVinfected individuals. The outcome (response) is a continuous variable and the predictors are matrix representations of functional connectivity between the brain's cortical regions. Biophysical considerations motivate our interest in estimating a matrix of regression coefficients that has the following two properties: (i) it should be relatively sparse, since we aim to identify connections that most strongly predict the outcome; and more importantly, (ii) the response-related connections form clusters, since brain activity networks are known to consist of densely connected regions. These two properties translate to the coefficient matrix having relatively small clusters, or blocks of nonzero entries, which implies that it is low-rank. Hence, we aim to solve the matrix regression problem by estimating a coefficient matrix that is both sparse and low-rank. To further illustrate our approach, consider the three matrices in Figure 1. The one in the left panel is sparse, but full-rank, the one on the right panel is low-rank, but not sparse, while the one in the middle panel is both low-rank and sparse, which is the structure we are interested in. To find such a solution, we propose a regularization method called SParsity Inducing Nuclear Norm EstimatoR (SpINNEr).
A Hybrid Two-layer Feature Selection Method Using GeneticAlgorithm and Elastic Net
Feature selection, as a critical pre-processing step for machine learning, aims at determining representative predictors from a high-dimensional feature space dataset to improve the prediction accuracy. However, the increase in feature space dimensionality, comparing to the number of observations, poses a severe challenge to many existing feature selection methods with respect to computational efficiency and prediction performance. This paper presents a new hybrid two-layer feature selection approach that combines a wrapper and an embedded method in constructing an appropriate subset of predictors. In the first layer of the proposed method, the Genetic Algorithm(GA) has been adopted as a wrapper to search for the optimal subset of predictors, which aims to reduce the number of predictors and the prediction error. As one of the meta-heuristic approaches, GA is selected due to its computational efficiency; however, GAs do not guarantee the optimality. To address this issue, a second layer is added to the proposed method to eliminate any remaining redundant/irrelevant predictors to improve the prediction accuracy. Elastic Net(EN) has been selected as the embedded method in the second layer because of its flexibility in adjusting the penalty terms in regularization process and time efficiency. This hybrid two-layer approach has been applied on a Maize genetic dataset from NAM population, which consists of multiple subsets of datasets with different ratio of the number of predictors to the number of observations. The numerical results confirm the superiority of the proposed model.
An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions
Tosatto, Samuele, Akrour, Riad, Peters, Jan
The Nadaraya-Watson kernel estimator is among the most popular nonparameteric regression technique thanks to its simplicity. Its asymptotic bias has been studied by Rosenblatt in 1969 and has been reported in a number of related literature. However, Rosenblatt's analysis is only valid for infinitesimal bandwidth. In contrast, we propose in this paper an upper bound of the bias which holds for finite bandwidths. Moreover, contrarily to the classic analysis we allow for discontinuous first order derivative of the regression function, we extend our bounds for multidimensional domains and we include the knowledge of the bound of the regression function when it exists and if it is known, to obtain a tighter bound. We believe that this work has potential applications in those fields where some hard guarantees on the error are needed
How to Compare Machine Learning Algorithms
Under the RAM model [1], the "time" an algorithm takes is measured by the elementary operations of the algorithm. While users and developers may concern more about the wall clock time an algorithm takes to train the models, it would be fairer to use the standard worst case computational time complexity to compare the time the models take to train. Using computational complexity has the benefits of ignoring the differences like the computer power and architecture used at runtime and the underlying programming language, allowing users to focus on the fundamental differences of the elementary operations of the algorithms. Note that the time complexity can be very different during training and testing. For example, parametric models like linear regression could have long training time but they are efficient during test time.
Extreme Algorithm Selection With Dyadic Feature Representation
Tornede, Alexander, Wever, Marcel, Hüllermeier, Eyke
Algorithm selection (AS) deals with selecting an algorithm from a fixed set of candidate algorithms most suitable for a specific instance of an algorithmic problem, e.g., choosing solvers for SA T problems. Benchmark suites for AS usually comprise candidate sets consisting of at most tens of algorithms, whereas in combined algorithm selection and hyperparameter optimization problems the number of candidates becomes intractable, impeding to learn effective meta-models and thus requiring costly online performance evaluations. Therefore, here we propose the setting of extreme algorithm selection (XAS) where we consider fixed sets of thousands of candidate algorithms, facilitating meta learning. W e assess the applicability of state-of-the-art AS techniques to the XAS setting and propose approaches leveraging a dyadic feature representation in which both problem instances and algorithms are described. W e find the latter to improve significantly over the current state of the art in various metrics.
Kernel Selection for Modal Linear Regression: Optimal Kernel and IRLS Algorithm
Yamasaki, Ryoya, Tanaka, Toshiyuki
Abstract--Modal linear regression (MLR) is a method for obtaining a conditional mode predictor as a linear model. We study kernel selection for MLR from two perspectives: "which kernel achieves smaller error?" and "which kernel is computationally efficient?". First, we show that a Biweight kernel is optimal in the sense of minimizing an asymptotic mean squared error of a resulting MLR parameter. This result is derived from our refined analysis of an asymptotic statistical behavior of MLR. Secondly, we provide a kernel class for which iteratively reweighted least-squares algorithm (IRLS) is guaranteed to converge, and especially prove that IRLS with an Epanechnikov kernel terminates in a finite number of iterations. Simulation studies empirically verified that using a Biweight kernel provides good estimation accuracy and that using an Epanechnikov kernel is computationally efficient. Our results improve MLR of which existing studies often stick to a Gaussian kernel and modal EM algorithm specialized for it, by providing guidelines of kernel selection.