Regression
Phase Transitions in Transfer Learning for High-Dimensional Perceptrons
Dhifallah, Oussama, Lu, Yue M.
Transfer learning seeks to improve the generalization performance of a target task by exploiting the knowledge learned from a related source task. Central questions include deciding what information one should transfer and when transfer can be beneficial. The latter question is related to the so-called negative transfer phenomenon, where the transferred source information actually reduces the generalization performance of the target task. This happens when the two tasks are sufficiently dissimilar. In this paper, we present a theoretical analysis of transfer learning by studying a pair of related perceptron learning tasks. Despite the simplicity of our model, it reproduces several key phenomena observed in practice. Specifically, our asymptotic analysis reveals a phase transition from negative transfer to positive transfer as the similarity of the two tasks moves past a well-defined threshold. Transfer learning [1]-[5] is a promising approach to improving the performance of machine learning tasks. It does so by exploiting the knowledge gained from a previously-learned model, referred to as the source task, to improve the generalization performance of a related learning problem, referred to as the target task.
Structured Machine Learning Tools for Modelling Characteristics of Guided Waves
Haywood-Alexander, Marcus, Dervilis, Nikolaos, Worden, Keith, Cross, Elizabeth J., Mills, Robin S., Rogers, Timothy J.
The use of ultrasonic guided waves to probe materials and structures for damage continues to increase in popularity for non-destructive evaluation (NDE) and structural health monitoring (SHM). High-frequency waves such as these offer an advantage over low-frequency methods because of their ability to detect damage on a smaller scale. However, in order to assess damage in a structure, and implement any NDE or SHM tool, knowledge of the behaviour of a guided wave throughout the material/structure is important (especially when designing sensor placement for SHM systems). Determining this behaviour is extremely difficult in complex materials, such as fibre-matrix composites, where unique phenomena such as continuous mode conversion take place. This paper introduces a novel method for modelling the feature-space of guided waves in a composite material. This technique is based on a data-driven model, where prior physical knowledge can be used to create structured machine learning tools, in which constraints are applied to provide said structure. The method shown makes use of Gaussian processes, a full Bayesian analysis tool, and in this paper it is shown how physical knowledge of the guided waves can be utilised in modelling using an ML tool. This paper shows that through careful consideration when applying machine learning techniques, more robust models can be generated which offer advantages such as extrapolation ability and physical interpretation.
Explainable AI and Adoption of Algorithmic Advisors: an Experimental Study
David, Daniel Ben, Resheff, Yehezkel S., Tron, Talia
Machine learning is becoming a commonplace part of our technological experience. The notion of explainable AI (XAI) is attractive when regulatory or usability considerations necessitate the ability to back decisions with a coherent explanation. A large body of research has addressed algorithmic methods of XAI, but it is still unclear how to determine what is best suited to create human cooperation and adoption of automatic systems. Here we develop an experimental methodology where participants play a web-based game, during which they receive advice from either a human or algorithmic advisor, accompanied by explanations that vary in nature between experimental conditions. We use a reference-dependent decision-making framework, and evaluate the game results over time and in various key situations, to determine whether the different types of explanations affect the readiness to adopt, the willingness to pay for, and the trust in a financial AI consultant. We find that the types of explanations that promote adoption during a first encounter differ from those that are most successful following failure or when cost is involved. Furthermore, participants are willing to pay more for AI advice that includes explanations. These results add to the literature on the importance of XAI for algorithmic adoption and trust.
Weight-of-evidence 2.0 with shrinkage and spline-binning
Raymaekers, Jakob, Verbeke, Wouter, Verdonck, Tim
In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise and interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear methods, however, are not well equipped to handle categorical predictors with high cardinality or to exploit non-linear relations in the data. As a solution, data preprocessing methods such as weight-of-evidence are typically used for transforming the predictors. The binning procedure that underlies the weight-of-evidence approach, however, has been little researched and typically relies on ad-hoc or expert-driven procedures. The objective in this paper, therefore, is to propose a formalized, data-driven and powerful method. To this end, we explore the discretization of continuous variables through the binning of spline functions, which allows for capturing non-linear effects in the predictor variables and yields highly interpretable predictors taking only a small number of discrete values. Moreover, we extend the weight-of-evidence approach and propose to estimate the proportions using shrinkage estimators. Together, this offers an improved ability to exploit both non-linear and categorical predictors for achieving increased classification precision, while maintaining interpretability of the resulting model and decreasing the risk of overfitting. We present the results of a series of experiments in a fraud detection setting, which illustrate the effectiveness of the presented approach. We facilitate reproduction of the presented results and adoption of the proposed approaches by providing both the dataset and the code for implementing the experiments and the presented approach.
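As a rough illustration of how shrinkage can stabilise weight-of-evidence values, the sketch below computes a per-category weight of evidence from shrunken class proportions. The abstract does not specify the shrinkage estimator, so additive pseudo-counts that pull each proportion toward the uniform share 1/K are used here purely as an assumed stand-in; the function name `woe_with_shrinkage`, the `strength` parameter, and the sign convention (log of positive share over negative share) are likewise illustrative choices.

```python
import math
from collections import Counter


def woe_with_shrinkage(categories, labels, strength=1.0):
    """Return {category: weight-of-evidence} with additive shrinkage.

    Assumption (not taken from the paper): each class-conditional
    proportion is shrunk toward the uniform share 1/K by adding
    `strength` pseudo-counts spread evenly over the K categories.
    Labels are expected to be 0 (negative) or 1 (positive).
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    levels = sorted(set(categories))
    k = len(levels)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())

    woe = {}
    for c in levels:
        # Shrunken share of positives / negatives falling in category c.
        p_pos = (pos[c] + strength / k) / (n_pos + strength)
        p_neg = (neg[c] + strength / k) / (n_neg + strength)
        woe[c] = math.log(p_pos / p_neg)
    return woe
```

With `strength` set to zero this reduces to the unshrunken weight of evidence, which is undefined whenever a category contains only one class; any positive `strength` keeps every value finite, and larger values pull all categories toward zero.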
Weighting-Based Treatment Effect Estimation via Distribution Learning
Zhang, Dongcheng, Zhang, Kunpeng
Existing weighting methods for treatment effect estimation are often built upon the idea of propensity scores or covariate balance. They usually impose strong assumptions on the treatment assignment or outcome model to obtain unbiased estimation, such as linearity or specific functional forms, which easily leads to model misspecification. In this paper, we aim to alleviate these issues by developing a distribution learning-based weighting method. We first learn the true underlying distribution of covariates conditioned on treatment assignment, then leverage the ratio of the covariates' density in the treatment group to that in the control group as the weight for estimating treatment effects. Specifically, we propose to approximate the distribution of covariates in both treatment and control groups through invertible transformations via change of variables. To demonstrate the superiority, robustness, and generalizability of our method, we conduct extensive experiments using synthetic and real data. From the experimental results, we find that our method for estimating the average treatment effect on the treated (ATT) with observational data outperforms several cutting-edge weighting-only benchmarking methods, and it maintains its advantage under a doubly-robust estimation framework that combines weighting with some advanced outcome modeling methods.
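The density-ratio weighting idea can be sketched in a few lines. The abstract describes learning the covariate densities with invertible transformations; the sketch below swaps in a single multivariate Gaussian per group as a deliberately simple stand-in, so it illustrates only the weighting step, not the authors' density model. The function names `gaussian_logpdf` and `att_density_ratio` are hypothetical, and the estimate is only meaningful under the usual unconfoundedness and overlap assumptions.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian, evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    solved = np.linalg.solve(cov, diff.T).T
    quad = np.sum(diff * solved, axis=1)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def att_density_ratio(x, t, y):
    """ATT estimate that reweights controls by p(x | T=1) / p(x | T=0).

    Assumption: each group's covariate density is approximated by one
    Gaussian (a stand-in for a learned invertible-transformation model).
    """
    x, t, y = np.asarray(x, float), np.asarray(t), np.asarray(y, float)
    treated, control = t == 1, t == 0
    x1, x0 = x[treated], x[control]

    mean1, cov1 = x1.mean(axis=0), np.atleast_2d(np.cov(x1, rowvar=False))
    mean0, cov0 = x0.mean(axis=0), np.atleast_2d(np.cov(x0, rowvar=False))

    # Log density ratio evaluated at the control units.
    log_w = gaussian_logpdf(x0, mean1, cov1) - gaussian_logpdf(x0, mean0, cov0)
    # Subtracting the max is for numerical stability; it cancels below.
    w = np.exp(log_w - log_w.max())

    # Weighted control mean estimates E[Y(0) | T=1].
    y0_for_treated = np.sum(w * y[control]) / np.sum(w)
    return y[treated].mean() - y0_for_treated
```

Here `x` is an (n, d) covariate matrix, `t` a 0/1 treatment indicator, and `y` the observed outcome. Normalising the weights to sum to one makes the estimate insensitive to any constant factor in the density ratio.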
Identifying the latent space geometry of network models through analysis of curvature
Lubold, Shane, Chandrasekhar, Arun G., McCormick, Tyler H.
Statistically modeling networks, across numerous disciplines and contexts, is fundamentally challenging because of (often high-order) dependence between connections. A common approach assigns each person in the graph to a position on a low-dimensional manifold. Distance between individuals in this (latent) space is inversely proportional to the likelihood of forming a connection. The choice of the latent geometry (the manifold class, dimension, and curvature) has consequential impacts on the substantive conclusions of the model. More positive curvature in the manifold, for example, encourages more and tighter communities; negative curvature induces repulsion among nodes. Currently, however, the choice of the latent geometry is an a priori modeling assumption and there is limited guidance about how to make these choices in a data-driven way. In this work, we present a method to consistently estimate the manifold type, dimension, and curvature from an empirically relevant class of latent spaces: simply connected, complete Riemannian manifolds of constant curvature. Our core insight comes by representing the graph as a noisy distance matrix based on the ties between cliques. Leveraging results from statistical geometry, we develop hypothesis tests to determine whether the observed distances could plausibly be embedded isometrically in each of the candidate geometries. We explore the accuracy of our approach with simulations and then apply our approach to data-sets from economics and sociology as well as neuroscience.
Gaussian Function On Response Surface Estimation
Toutiaee, Mohammadhossein, Miller, John
We propose a new framework for interpreting black-box machine learning models in two dimensions (features and samples) via a metamodeling technique, by which we study the relationship between the inputs and outputs of the underlying machine learning model. The metamodel can be estimated from data generated via a trained complex model by running the computer experiment on samples of data in the region of interest. We utilize a Gaussian process as a surrogate to capture the response surface of a complex model, in which we incorporate two parts in the process: interpolated values that are modeled by a stationary Gaussian process Z governed by a prior covariance function, and a mean function mu that captures the known trends in the underlying model. The variable importance parameter theta is estimated by maximizing the likelihood function. This theta corresponds to the correlation of individual variables with the target response. There is no need for any pre-assumed models since it depends on empirical observations. Experiments demonstrate the potential of the interpretable model through quantitative assessment of the predicted samples.
An automatic procedure to determine groups of nonparametric regression curves
Villanueva, Nora M., Sestelo, Marta, Ordóñez, Celestino, Roca-Pardiñas, Javier
One of the main goals of statistical modelling is to understand the dependence of a response variable, Y, on an explanatory variable, X. This type of dependence can be studied through nonparametric regression models, where the relationship between Y and X is modelled without specifying in advance the function that links them. Within this framework, the study of the regression curves can be useful in the comparison of two or more groups, which is an important problem associated with statistical inference. In particular, the topic of testing the equality of mean functions has been widely investigated in the literature; see, for instance, the review that González-Manteiga and Crujeiras (2013) offer about this topic. Relevant papers on this topic are Hall and Hart (1990); King et al. (1991); Delgado (1993); Kulasekera (1995); Young and Bowman (1995); Dette and Neumeyer (2001); Pardo-Fernández et al. (2007); Srihera and Stute (2010), among others. Furthermore, in order to compare the values of a response variable across several groups in the presence of a covariate effect, nonparametric analysis of covariance or a factor-by-curve interaction test can be used. Young and Bowman (1995) generalized the one-way analysis of variance test to the nonparametric regression setting, and Dette and Neumeyer (2001) proposed to use Young and Bowman's test also in the situation of a heteroscedastic error. In addition, Park and Kang (2008) developed a SiZer tool based on an analysis of variance type test statistic that is capable of comparing multiple curves based on the residuals. The evolution of this procedure is based on the comparison using the original regression curves (Park et al., 2014).
Inference for Low-rank Tensors -- No Need to Debias
Xia, Dong, Zhang, Anru R., Zhou, Yuchen
In this paper, we consider statistical inference for several low-rank tensor models. Specifically, in the Tucker low-rank tensor PCA or regression model, provided with any estimates achieving some attainable error rate, we develop data-driven confidence regions for the singular subspace of the parameter tensor based on the asymptotic distribution of an updated estimate by two-iteration alternating minimization. The asymptotic distributions are established under some essential conditions on the signal-to-noise ratio (in the PCA model) or sample size (in the regression model). If the parameter tensor is further orthogonally decomposable, we develop the methods and theory for inference on each individual singular vector. For the rank-one tensor PCA model, we establish the asymptotic distribution for general linear forms of principal components and a confidence interval for each entry of the parameter tensor. Finally, numerical simulations are presented to corroborate our theoretical discoveries. In all these models, we observe that, unlike in many matrix/vector settings in existing work, debiasing is not required to establish the asymptotic distribution of estimates or to make statistical inference on low-rank tensors. In fact, due to the widely observed statistical-computational gap for low-rank tensor estimation, one usually requires stronger conditions than the statistical (or information-theoretic) limit to ensure that computationally feasible estimation is achievable. Surprisingly, such conditions "incidentally" render a feasible low-rank tensor inference without debiasing.
Behavior of linear L2-boosting algorithms in the vanishing learning rate asymptotic
Dombry, Clément, Esstafa, Youssef
In the past decades, boosting has become a major and powerful prediction method in machine learning. The success of the classification algorithm AdaBoost by Freund and Schapire (1999) demonstrated the possibility of combining many weak learners sequentially in order to produce better predictions, with widespread applications in gene expression (Dudoit et al., 2002) or music genre identification (Bergstra et al., 2006), to name only a few. Friedman et al. (2000) identified a wider statistical framework that led to gradient boosting (Friedman, 2001), where a weak learner (e.g., regression trees) is used to optimize a loss function in a sequential procedure akin to gradient descent. Choosing the loss function according to the statistical problem at hand results in a versatile and efficient tool that can handle classification, regression, quantile regression or survival analysis. The popularity of gradient boosting is also due to its efficient implementation in the R package gbm by Ridgeway (2007). Alongside the methodological developments, strong theoretical results have justified the good performance of boosting. Consistency of boosting algorithms, i.e. their ability to achieve the optimal Bayes error rate for large samples, is considered in Breiman (2004), Zhang and Yu (2005) or Bartlett and Traskin (2007). The present paper is strongly influenced by Bühlmann and Yu (2003), which proposes an analysis of regression boosting algorithms built on linear base learners thanks to explicit formulas for the boosted predictor and its error rate. In this paper, we focus on gradient boosting for regression with square loss and we briefly describe the corresponding algorithm.
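For readers unfamiliar with boosting under square loss, the following minimal sketch shows the iteration for a linear base learner and the explicit formula it admits. It assumes the base learner is a linear smoother with matrix S, so that each step fits the current residuals and adds a fraction `learning_rate` of that fit. Because every step multiplies the residual by (I - learning_rate * S), the fitted values after m steps equal y - (I - learning_rate * S)^m y. The choice of a ridge smoother and all function names are illustrative assumptions, not the paper's setup.

```python
import numpy as np


def ridge_smoother(X, lam):
    """Smoother matrix S = X (X'X + lam I)^{-1} X' of ridge regression.

    Assumption: ridge regression is used only as an example of a
    linear base learner.
    """
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)


def l2_boost(S, y, learning_rate, n_steps):
    """Iterative L2-boosting: repeatedly fit residuals with the smoother S."""
    fit = np.zeros_like(y, dtype=float)
    for _ in range(n_steps):
        residual = y - fit                        # negative gradient of square loss
        fit = fit + learning_rate * (S @ residual)
    return fit


def l2_boost_closed_form(S, y, learning_rate, n_steps):
    """Same fitted values via the formula y - (I - learning_rate * S)^m y."""
    n = S.shape[0]
    shrink = np.linalg.matrix_power(np.eye(n) - learning_rate * S, n_steps)
    return y - shrink @ y
```

The two functions should agree up to floating-point error. For a symmetric smoother with eigenvalues in [0, 1], such as the ridge example, the training residual norm cannot increase as the number of steps grows.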