Statistical Learning
Fused Lasso Additive Model
Petersen, Ashley, Witten, Daniela, Simon, Noah
Ashley Petersen, Daniela Witten†, and Noah Simon ‡ Department of Biostatistics, University of Washington, Seattle W A 98195 March 29, 2018 We consider the problem of predicting an outcome variable usingp covariates that are measured onn independent observations, in the setting in which flexible and interpretable fits are desirable. We propose the fused lasso additive model (FLAM), in which each additive function is estimated to be piecewise constant with a small number of adaptively-chosen knots. FLAM is the solution to a convex optimization problem, for which a simple algorithm with guaranteed convergence to the global optimum is provided. FLAM is shown to be consistent in high dimensions, and an unbiased estimator of its degrees of freedom is proposed. We evaluate the performance of FLAM in a simulation study and on two data sets. Keywords: additive model, feature selection, high-dimensional, nonparametric regression, piecewise constant, sparsity 1 Introduction In this paper, we consider the task of predicting a response variable usingp features measured on n independent observations. In this paper, we propose a method that balances the tradeoff between inter-pretability and flexibility, while also allowing for sparsity in high dimensions whenp n . It selects a subset of features to include in the model, and for these features it fits piecewise constant functions with knots that are chosen adaptively based on the data. We now introduce some notation. We letX denote ann p matrix, for whichx j is the j th column (feature), and for which thei th element (observation) isx ij .
Particle Metropolis-Hastings using gradient and Hessian information
Dahlin, Johan, Lindsten, Fredrik, Schön, Thomas B.
Particle Metropolis-Hastings (PMH) allows for Bayesian parameter inference in nonlinear state space models by combining Markov chain Monte Carlo (MCMC) and particle filtering. The latter is used to estimate the intractable likelihood. In its original formulation, PMH makes use of a marginal MCMC proposal for the parameters, typically a Gaussian random walk. However, this can lead to a poor exploration of the parameter space and an inefficient use of the generated particles. We propose a number of alternative versions of PMH that incorporate gradient and Hessian information about the posterior into the proposal. This information is more or less obtained as a byproduct of the likelihood estimation. Indeed, we show how to estimate the required information using a fixed-lag particle smoother, with a computational cost growing linearly in the number of particles. We conclude that the proposed methods can: (i) decrease the length of the burn-in phase, (ii) increase the mixing of the Markov chain at the stationary phase, and (iii) make the proposal distribution scale invariant which simplifies tuning.
A Method for Stopping Active Learning Based on Stabilizing Predictions and the Need for User-Adjustable Stopping
Bloodgood, Michael, Vijay-Shanker, K.
A survey of existing methods for stopping active learning (AL) reveals the needs for methods that are: more widely applicable; more aggressive in saving annotations; and more stable across changing datasets. A new method for stopping AL based on stabilizing predictions is presented that addresses these needs. Furthermore, stopping methods are required to handle a broad range of different annotation/performance tradeoff valuations. Despite this, the existing body of work is dominated by conservative methods with little (if any) attention paid to providing users with control over the behavior of stopping methods. The proposed method is shown to fill a gap in the level of aggressiveness available for stopping AL and supports providing users with control over stopping behavior.
Ensembles of Random Sphere Cover Classifiers
Bagnall, Anthony, Younsi, Reda
We propose and evaluate alternative ensemble schemes for a new instance based learning classifier, the Randomised Sphere Cover (RSC) classifier. RSC fuses instances into spheres, then bases classification on distance to spheres rather than distance to instances. The randomised nature of RSC makes it ideal for use in ensembles. We propose two ensemble methods tailored to the RSC classifier; $\alpha \beta$RSE, an ensemble based on instance resampling and $\alpha$RSSE, a subspace ensemble. We compare $\alpha \beta$RSE and $\alpha$RSSE to tree based ensembles on a set of UCI datasets and demonstrates that RSC ensembles perform significantly better than some of these ensembles, and not significantly worse than the others. We demonstrate via a case study on six gene expression data sets that $\alpha$RSSE can outperform other subspace ensemble methods on high dimensional data when used in conjunction with an attribute filter. Finally, we perform a set of Bias/Variance decomposition experiments to analyse the source of improvement in comparison to a base classifier.
Hardness of parameter estimation in graphical models
Bresler, Guy, Gamarnik, David, Shah, Devavrat
We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).
Collapsed Variational Bayes Inference of Infinite Relational Model
Ishiguro, Katsuhiko, Sato, Issei, Ueda, Naonori
The Infinite Relational Model (IRM) is a probabilistic model for relational data clustering that partitions objects into clusters based on observed relationships. This paper presents Averaged CVB (ACVB) solutions for IRM, convergence-guaranteed and practically useful fast Collapsed Variational Bayes (CVB) inferences. We first derive ordinary CVB and CVB0 for IRM based on the lower bound maximization. CVB solutions yield deterministic iterative procedures for inferring IRM given the truncated number of clusters. Our proposal includes CVB0 updates of hyperparameters including the concentration parameter of the Dirichlet Process, which has not been studied in the literature. To make the CVB more practically useful, we further study the CVB inference in two aspects. First, we study the convergence issues and develop a convergence-guaranteed algorithm for any CVB-based inferences called ACVB, which enables automatic convergence detection and frees non-expert practitioners from difficult and costly manual monitoring of inference processes. Second, we present a few techniques for speeding up IRM inferences. In particular, we describe the linear time inference of CVB0, allowing the IRM for larger relational data uses. The ACVB solutions of IRM showed comparable or better performance compared to existing inference methods in experiments, and provide deterministic, faster, and easier convergence detection.
Taking into Account the Differences between Actively and Passively Acquired Data: The Case of Active Learning with Support Vector Machines for Imbalanced Datasets
Bloodgood, Michael, Vijay-Shanker, K.
Actively sampled data can have very different characteristics than passively sampled data. Therefore, it's promising to investigate using different inference procedures during AL than are used during passive learning (PL). This general idea is explored in detail for the focused case of AL with cost-weighted SVMs for imbalanced data, a situation that arises for many HLT tasks. The key idea behind the proposed InitPA method for addressing imbalance is to base cost models during AL on an estimate of overall corpus imbalance computed via a small unbiased sample rather than the imbalance in the labeled training data, which is the leading method used during PL.
Multivariate Comparison of Classification Algorithms
Yildiz, Olcay Taner, Alpaydin, Ethem
Statistical tests that compare classification algorithms are univariate and use a single performance measure, e.g., misclassification error, $F$ measure, AUC, and so on. In multivariate tests, comparison is done using multiple measures simultaneously. For example, error is the sum of false positives and false negatives and a univariate test on error cannot make a distinction between these two sources, but a 2-variate test can. Similarly, instead of combining precision and recall in $F$ measure, we can have a 2-variate test on (precision, recall). We use Hotelling's multivariate $T^2$ test for comparing two algorithms, and when we have three or more algorithms we use the multivariate analysis of variance (MANOVA) followed by pairwise post hoc tests. In our experiments, we see that multivariate tests have higher power than univariate tests, that is, they can detect differences that univariate tests cannot. We also discuss how multivariate analysis allows us to automatically extract performance measures that best distinguish the behavior of multiple algorithms.
Continuum limit of total variation on point clouds
Trillos, Nicolás García, Slepčev, Dejan
We consider point clouds obtained as random samples of a measure on a Euclidean domain. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points they connect. Our goal is to develop mathematical tools needed to study the consistency, as the number of available data points increases, of graph-based machine learning algorithms for tasks such as clustering. In particular, we study when is the cut capacity, and more generally total variation, on these graphs a good approximation of the perimeter (total variation) in the continuum setting. We address this question in the setting of $\Gamma$-convergence. We obtain almost optimal conditions on the scaling, as number of points increases, of the size of the neighborhood over which the points are connected by an edge for the $\Gamma$-convergence to hold. Taking the limit is enabled by a transportation based metric which allows to suitably compare functionals defined on different point clouds.
ICE: Enabling Non-Experts to Build Models Interactively for Large-Scale Lopsided Problems
Simard, Patrice, Chickering, David, Lakshmiratan, Aparna, Charles, Denis, Bottou, Leon, Suarez, Carlos Garcia Jurado, Grangier, David, Amershi, Saleema, Verwey, Johan, Suh, Jina
Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.