Statistical Learning
Learning with Symmetric Label Noise: The Importance of Being Unhinged
Rooyen, Brendan van, Menon, Aditya, Williamson, Robert C.
Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2008] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2008] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the unhinged lossโ SLN-robustness.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Weston, Jason, Bordes, Antoine, Chopra, Sumit, Rush, Alexander M., van Merriรซnboer, Bart, Joulin, Armand, Mikolov, Tomas
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
Evolving Non-linear Stacking Ensembles for Prediction of Go Player Attributes
Moudลรญk, Josef, Neruda, Roman
The paper presents an application of non-linear stacking ensembles for prediction of Go player attributes. An evolutionary algorithm is used to form a diverse ensemble of base learners, which are then aggregated by a stacking ensemble. This methodology allows for an efficient prediction of different attributes of Go players from sets of their games. These attributes can be fairly general, in this work, we used the strength and style of the players.
Selecting Near-Optimal Learners via Incremental Data Allocation
Sabharwal, Ashish, Samulowitz, Horst, Tesauro, Gerald
We study a novel machine learning (ML) problem setting of sequentially allocating small subsets of training data amongst a large set of classifiers. The goal is to select a classifier that will give near-optimal accuracy when trained on all data, while also minimizing the cost of misallocated samples. This is motivated by large modern datasets and ML toolkits with many combinations of learning algorithms and hyper-parameters. Inspired by the principle of "optimism under uncertainty," we propose an innovative strategy, Data Allocation using Upper Bounds (DAUB), which robustly achieves these objectives across a variety of real-world datasets. We further develop substantial theoretical support for DAUB in an idealized setting where the expected accuracy of a classifier trained on $n$ samples can be known exactly. Under these conditions we establish a rigorous sub-linear bound on the regret of the approach (in terms of misallocated data), as well as a rigorous bound on suboptimality of the selected classifier. Our accuracy estimates using real-world datasets only entail mild violations of the theoretical scenario, suggesting that the practical behavior of DAUB is likely to approach the idealized behavior.
Strategies and Principles of Distributed Machine Learning on Big Data
Xing, Eric P., Ho, Qirong, Xie, Pengtao, Dai, Wei
The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful predictive analytics thereupon. In order to run ML algorithms at such scales, on a distributed cluster with 10s to 1000s of machines, it is often the case that significant engineering efforts are required --- and one might fairly ask if such engineering truly falls within the domain of ML research or not. Taking the view that Big ML systems can benefit greatly from ML-rooted statistical and algorithmic insights --- and that ML researchers should therefore not shy away from such systems design --- we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application, to engineering, and to theoretical research and development of Big ML systems and architectures, with the goal of understanding how to make them efficient, generally-applicable, and supported with convergence and scaling guarantees. They concern four key questions which traditionally receive little attention in ML research: How to distribute an ML program over a cluster? How to bridge ML computation with inter-machine communication? How to perform such communication? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software as well as general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and grow the area that lies between ML and systems.
Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices
Acharya, Ayan, Ghosh, Joydeep, Zhou, Mingyuan
A gamma process dynamic Poisson factor analysis model is proposed to factorize a dynamic count matrix, whose columns are sequentially observed count vectors. The model builds a novel Markov chain that sends the latent gamma random variables at time $(t-1)$ as the shape parameters of those at time $t$, which are linked to observed or latent counts under the Poisson likelihood. The significant challenge of inferring the gamma shape parameters is fully addressed, using unique data augmentation and marginalization techniques for the negative binomial distribution. The same nonparametric Bayesian model also applies to the factorization of a dynamic binary matrix, via a Bernoulli-Poisson link that connects a binary observation to a latent count, with closed-form conditional posteriors for the latent counts and efficient computation for sparse observations. We apply the model to text and music analysis, with state-of-the-art results.
Nonparametric mixture of Gaussian graphical models
Graphical model has been widely used to investigate the complex dependence structure of high-dimensional data, and it is common to assume that observed data follow a homogeneous graphical model. However, observations usually come from different resources and have heterogeneous hidden commonality in real-world applications. Thus, it is of great importance to estimate heterogeneous dependencies and discover subpopulation with certain commonality across the whole population. In this work, we introduce a novel regularized estimation scheme for learning nonparametric mixture of Gaussian graphical models, which extends the methodology and applicability of Gaussian graphical models and mixture models. We propose a unified penalized likelihood approach to effectively estimate nonparametric functional parameters and heterogeneous graphical parameters. We further design an efficient generalized effective EM algorithm to address three significant challenges: high-dimensionality, non-convexity, and label switching. Theoretically, we study both the algorithmic convergence of our proposed algorithm and the asymptotic properties of our proposed estimators. Numerically, we demonstrate the performance of our method in simulation studies and a real application to estimate human brain functional connectivity from ADHD imaging data, where two heterogeneous conditional dependencies are explained through profiling demographic variables and supported by existing scientific findings.
Sharp Computational-Statistical Phase Transitions via Oracle Computational Model
Wang, Zhaoran, Gu, Quanquan, Liu, Han
We study the fundamental tradeoffs between computational tractability and statistical accuracy for a general family of hypothesis testing problems with combinatorial structures. Based upon an oracle model of computation, which captures the interactions between algorithms and data, we establish a general lower bound that explicitly connects the minimum testing risk under computational budget constraints with the intrinsic probabilistic and combinatorial structures of statistical problems. This lower bound mirrors the classical statistical lower bound by Le Cam (1986) and allows us to quantify the optimal statistical performance achievable given limited computational budgets in a systematic fashion. Under this unified framework, we sharply characterize the statistical-computational phase transition for two testing problems, namely, normal mean detection and sparse principal component detection. For normal mean detection, we consider two combinatorial structures, namely, sparse set and perfect matching. For these problems we identify significant gaps between the optimal statistical accuracy that is achievable under computational tractability constraints and the classical statistical lower bounds. Compared with existing works on computational lower bounds for statistical problems, which consider general polynomial-time algorithms on Turing machines, and rely on computational hardness hypotheses on problems like planted clique detection, we focus on the oracle computational model, which covers a broad range of popular algorithms, and do not rely on unproven hypotheses. Moreover, our result provides an intuitive and concrete interpretation for the intrinsic computational intractability of high-dimensional statistical problems. One byproduct of our result is a lower bound for a strict generalization of the matrix permanent problem, which is of independent interest.
The Poisson Gamma Belief Network
Zhou, Mingyuan, Cong, Yulai, Chen, Bo
To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.
Recursive Partitioning for Heterogeneous Causal Effects
In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g. regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation which compares predictions to actual outcomes in test samples, in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.