Regression
Unbiased Bayes for Big Data: Paths of Partial Posteriors
Strathmann, Heiko, Sejdinovic, Dino, Girolami, Mark
A key quantity of interest in Bayesian inference are expectations of functions with respect to a posterior distribution. Markov Chain Monte Carlo is a fundamental tool to consistently compute these expectations via averaging samples drawn from an approximate posterior. However, its feasibility is being challenged in the era of so called Big Data as all data needs to be processed in every iteration. Realising that such simulation is an unnecessarily hard problem if the goal is estimation, we construct a computationally scalable methodology that allows unbiased estimation of the required expectations -- without explicit simulation from the full posterior. The scheme's variance is finite by construction and straightforward to control, leading to algorithms that are provably unbiased and naturally arrive at a desired error tolerance. This is achieved at an average computational complexity that is sub-linear in the size of the dataset and its free parameters are easy to tune. We demonstrate the utility and generality of the methodology on a range of common statistical models applied to large-scale benchmark and real-world datasets.
From Predictive to Prescriptive Analytics
Bertsimas, Dimitris, Kallus, Nathan
In this paper, we combine ideas from machine learning (ML) and operations research and management science (OR/MS) in developing a framework, along with specific methods, for using data to prescribe decisions in OR/MS problems. In a departure from other work on data-driven optimization and reflecting our practical experience with the data available in applications of OR/MS, we consider data consisting, not only of observations of quantities with direct effect on costs/revenues, such as demand or returns, but predominantly of observations of associated auxiliary quantities. The main problem of interest is a conditional stochastic optimization problem, given imperfect observations, where the joint probability distributions that specify the problem are unknown. We demonstrate that our proposed solution methods are generally applicable to a wide range of decision problems. We prove that they are computationally tractable and asymptotically optimal under mild conditions even when data is not independent and identically distributed (iid) and even for censored observations. As an analogue to the coefficient of determination $R^2$, we develop a metric $P$ termed the coefficient of prescriptiveness to measure the prescriptive content of data and the efficacy of a policy from an operations perspective. To demonstrate the power of our approach in a real-world setting we study an inventory management problem faced by the distribution arm of an international media conglomerate, which ships an average of 1 billion units per year. We leverage both internal data and public online data harvested from IMDb, Rotten Tomatoes, and Google to prescribe operational decisions that outperform baseline measures. Specifically, the data we collect, leveraged by our methods, accounts for an 88% improvement as measured by our coefficient of prescriptiveness.
A mixture Cox-Logistic model for feature selection from survival and classification data
Branders, Samuel, D'Ambrosio, Roberto, Dupont, Pierre
This paper presents an original approach for jointly fitting survival times and classifying samples into subgroups. The Coxlogit model is a generalized linear model with a common set of selected features for both tasks. Survival times and class labels are here assumed to be conditioned by a common risk score which depends on those features. Learning is then naturally expressed as maximizing the joint probability of subgroup labels and the ordering of survival events, conditioned to a common weight vector. The model is estimated by minimizing a regularized log-likelihood through a coordinate descent algorithm. Validation on synthetic and breast cancer data shows that the proposed approach outperforms a standard Cox model or logistic regression when both predicting the survival times and classifying new samples into subgroups. It is also better at selecting informative features for both tasks.
High-Dimensional Longitudinal Classification with the Multinomial Fused Lasso
Adhikari, Samrachana, Lecci, Fabrizio, Becker, James T., Junker, Brian W., Kuller, Lewis H., Lopez, Oscar L., Tibshirani, Ryan J.
We study regularized estimation in high-dimensional longitudinal classification problems, using the lasso and fused lasso regularizers. The constructed coefficient estimates are piecewise constant across the time dimension in the longitudinal problem, with adaptively selected change points (break points). We present an efficient algorithm for computing such estimates, based on proximal gradient descent. We apply our proposed technique to a longitudinal data set on Alzheimer's disease from the Cardiovascular Health Study Cognition Study, and use this data set to motivate and demonstrate several practical considerations such as the selection of tuning parameters, and the assessment of model stability.
Online Nonparametric Regression with General Loss Functions
Rakhlin, Alexander, Sridharan, Karthik
This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses. We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to the interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same. In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.
MACHINE INTELLIGENCE 2
C. COOPER 21 3 Data representation--the key to conceptualisation: D. B. VIGOR 33 MECHANISED MATHEMATICS 45 4 An approach to analytic integration using ordered algebraic expressions: L. I. HODGSON 47 5 Some theorem-proving strategies based on the resolution principle: J. L DARLINGTON 57 MACHINE LEARNING AND HEURISTIC PROGRAMMING 73 6 Automatic description and recognition of board patterns in Go-Moku: A. M. MURRAY and E. W. Etcomc
MACHINE INTELLIGENCE 13
The two outstanding figures in the history of computer science are Alan Turing and John von Neumann, and they shared the view that logic was the key to understanding and automating computation. In particular, it was Turing who gave us in the mid-1930s the fundamental analysis, and the logical definition, of the concept of'computability by machine' and who discovered the surprising and beautiful basic fact that there exist universal machines which by suitable programming can be made to t This essay is an expanded and revised version of one entitled The Role of Logic in Computer Science and Artificial Intelligence, which was completed in January 1992 (and was later published in the Proceedings of the Fifth Generation computer Systems 1992 Conference). Since completing that essay I have had the benefit of extremely helpful discussions on many of the details with Professor Donald Michie and Professor I. J. Good, both of whom knew Turing well during the war years at Bletchley Park. Professor J. A. N. Lee, whose knowledge of the literature and archives of the history of computing is encyclopedic, also provided additional information, some of which is still unpublished. Further light has very recently been shed on the von Neumann side of the story by Norman Macrae's excellent biography John von Neumann (Macrae 1992). Accordingly, it seemed appropriate to undertake a more complete and thorough version of the FGCS'92 essay, focussing somewhat more on the interesting historical and biographical issues. I am grateful to Donald Michie and Stephen Muggleton for inviting me to contribute such a'second edition' to the present volume, and I would also like to thank the Institute for New Computer Technology (ICOT) for kind permission to make use of the FGCS'92 essay in this way. 1 LOGIC, COMPUTERS, TURING, AND VON NEUMANN
13 A Comparative Study of Classification Algorithms: Statistical, Machine Learning and Neural Network R. D. King R. Henery
The aim of the Stat Log project is to compare the performance of statistical, machine learning, and neural network algorithms, on large real world problems. This paper describes the completed work on classification in the StatLog project. Classification is here defined to be the problem, given a set of multivariate data with assigned classes, of estimating the probability from a set of attributes describing a new example sampled from the same source that it has a pre-defined class. We gathered together a representative collection of algorithms from statistics (Naive Bayes, K-nearest Neighbour, Kernel density, Linear discriminant, Quadratic discriminant, Logistic regression, Projection pursuit, Bayesian networks), machine learning (CART, C4.5, NewID, AC2, CAL5, CN2, ITrule -- only propositional symbolic algorithms were considered), and neural networks (Backpropagation, Radial basis functions, Kohonen).
Implementable confidence sets in high dimensional regression
We consider the setting of linear regression in high dimension. We focus on the problem of constructing adaptive and honest confidence sets for the sparse parameter \theta, i.e. we want to construct a confidence set for theta that contains theta with high probability, and that is as small as possible. The l_2 diameter of a such confidence set should depend on the sparsity S of \theta - the larger S, the wider the confidence set. However, in practice, S is unknown. This paper focuses on constructing a confidence set for \theta which contains \theta with high probability, whose diameter is adaptive to the unknown sparsity S, and which is implementable in practice.