Statistical Learning
Document stream clustering: experimenting an incremental algorithm and AR-based tools for highlighting dynamic trends
Lelu, Alain, Cadot, Martine, Cuxac, Pascal
We address here two major challenges presented by dynamic data mining: 1) the stability challenge: we have implemented a rigorous incremental density-based clustering algorithm, independent from any initial conditions and ordering of the data-vectors stream, 2) the cognitive challenge: we have implemented a stringent selection process of association rules between clusters at time t-1 and time t for directly generating the main conclusions about the dynamics of a data-stream. We illustrate these points with an application to a two years and 2600 documents scientific information database.
Gibbs posterior for variable selection in high-dimensional classification and data mining
Jiang, Wenxin, Tanner, Martin A.
In the popular approach of "Bayesian variable selection" (BVS), one uses prior and posterior distributions to select a subset of candidate variables to enter the model. A completely new direction will be considered here to study BVS with a Gibbs posterior originating in statistical mechanics. The Gibbs posterior is constructed from a risk function of practical interest (such as the classification error) and aims at minimizing a risk function without modeling the data probabilistically. This can improve the performance over the usual Bayesian approach, which depends on a probability model which may be misspecified. Conditions will be provided to achieve good risk performance, even in the presence of high dimensionality, when the number of candidate variables "$K$" can be much larger than the sample size "$n$." In addition, we develop a convenient Markov chain Monte Carlo algorithm to implement BVS with the Gibbs posterior.
A Novel Clustering Algorithm Based on a Modified Model of Random Walk
Li, Qiang, He, Yan, Jiang, Jing-ping
We introduce a modified model of random walk, and then develop two novel clustering algorithms based on it. In the algorithms, each data point in a dataset is considered as a particle which can move at random in space according to the preset rules in the modified model. Further, this data point may be also viewed as a local control subsystem, in which the controller adjusts its transition probability vector in terms of the feedbacks of all data points, and then its transition direction is identified by an event-generating function. Finally, the positions of all data points are updated. As they move in space, data points collect gradually and some separating parts emerge among them automatically. As a consequence, data points that belong to the same class are located at a same position, whereas those that belong to different classes are away from one another. Moreover, the experimental results have demonstrated that data points in the test datasets are clustered reasonably and efficiently, and the comparison with other algorithms also provides an indication of the effectiveness of the proposed algorithms.
Choice of neighbor order in nearest-neighbor classification
Hall, Peter, Park, Byeong U., Samworth, Richard J.
The $k$th-nearest neighbor rule is arguably the simplest and most intuitively appealing nonparametric classification procedure. However, application of this method is inhibited by lack of knowledge about its properties, in particular, about the manner in which it is influenced by the value of $k$; and by the absence of techniques for empirical choice of $k$. In the present paper we detail the way in which the value of $k$ determines the misclassification error. We consider two models, Poisson and Binomial, for the training samples. Under the first model, data are recorded in a Poisson stream and are "assigned" to one or other of the two populations in accordance with the prior probabilities. In particular, the total number of data in both training samples is a Poisson-distributed random variable. Under the Binomial model, however, the total number of data in the training samples is fixed, although again each data value is assigned in a random way. Although the values of risk and regret associated with the Poisson and Binomial models are different, they are asymptotically equivalent to first order, and also to the risks associated with kernel-based classifiers that are tailored to the case of two derivatives. These properties motivate new methods for choosing the value of $k$.
Statistical Learning Theory: Models, Concepts, and Results
von Luxburg, Ulrike, Schoelkopf, Bernhard
Statistical learning theory provides the theoretical basis for many of today's machine learning algorithms and is arguably one of the most beautifully developed branches of artificial intelligence in general. It originated in Russia in the 1960s and gained wide popularity in the 1990s following the development of the so-called Support Vector Machine (SVM), which has become a standard tool for pattern recognition in a variety of domains ranging from computer vision to computational biology. Providing the basis of new learning algorithms, however, was not the only motivation for developing statistical learning theory. It was just as much a philosophical one, attempting to answer the question of what it is that allows us to draw valid conclusions from empirical data. In this article we attempt to give a gentle, nontechnical overview over the key ideas and insights of statistical learning theory. We do not assume that the reader has a deep background in mathematics, statistics, or computer science. Given the nature of the subject matter, however, some familiarity with mathematical concepts and notations and some intuitive understanding of basic probability is required. There exist many excellent references to more technical surveys of the mathematics of statistical learning theory: the monographs by one of the founders of statistical learning theory (Vapnik, 1995, Vapnik, 1998), a brief overview over statistical learning theory in Section 5 of Schรถlkopf and Smola (2002), more technical overview papers such as Bousquet et al. (2003), Mendelson (2003), Boucheron et al. (2005), Herbrich and Williamson (2002), and the monograph Devroye et al. (1996).
Online Coordinate Boosting
Pelossof, Raphael, Jones, Michael, Vovsha, Ilia, Rudin, Cynthia
We present a new online boosting algorithm for adapting the weights of a boosted classifier, which yields a closer approximation to Freund and Schapire's AdaBoost algorithm than previous online boosting algorithms. We also contribute a new way of deriving the online algorithm that ties together previous online boosting work. We assume that the weak hypotheses were selected beforehand, and only their weights are updated during online boosting. The update rule is derived by minimizing AdaBoost's loss when viewed in an incremental form. The equations show that optimization is computationally expensive. However, a fast online approximation is possible. We compare approximation error to batch AdaBoost on synthetic datasets and generalization error on face datasets and the MNIST dataset.
On Similarities between Inference in Game Theory and Machine Learning
Rezek, I., Leslie, D. S., Reece, S., Roberts, S. J., Rogers, A., Dash, R. K., Jennings, N. R.
In this paper, we elucidate the equivalence between inference in game theory and machine learning. Our aim in so doing is to establish an equivalent vocabulary between the two domains so as to facilitate developments at the intersection of both ๏ฌelds, and as proof of the usefulness of this approach, we use recent developments in each ๏ฌeld to make useful improvements to the other. More speci๏ฌcally, we consider the analogies between smooth best responses in ๏ฌctitious play and Bayesian inference methods. Initially, we use these insights to develop and demonstrate an improved algorithm for learning in games based on probabilistic moderation. That is, by integrating over the distribution of opponent strategies (a Bayesian approach within machine learning) rather than taking a simple empirical average (the approach used in standard ๏ฌctitious play) we derive a novel moderated ๏ฌctitious play algorithm and show that it is more likely than standard ๏ฌctitious play to converge to a payoff-dominant but risk-dominated Nash equilibrium in a simple coordination game. Furthermore we consider the converse case, and show how insights from game theory can be used to derive two improved mean ๏ฌeld variational learning algorithms. We ๏ฌrst show that the standard update rule of mean ๏ฌeld variational learning is analogous to a Cournot adjustment within game theory. By analogy with ๏ฌctitious play, we then suggest an improved update rule, and show that this results in ๏ฌctitious variational play, an improved mean ๏ฌeld variational learning algorithm that exhibits better convergence in highly or strongly connected graphical models. Second, we use a recent advance in ๏ฌctitious play, namely dynamic ๏ฌctitious play, to derive a derivative action variational learning algorithm, that exhibits superior convergence properties on a canonical machine learning problem (clustering a mixture distribution).
Clustering and Feature Selection using Sparse Principal Component Analysis
Luss, Ronny, d'Aspremont, Alexandre
This paper focuses on applications of sparse principal component analysis to clustering and feature selection problems, with a particular focus on gene expression data analysis. Sparse methods have had a significant impact in many areas of statistics, in particular regression and classification (see [CT05], [DT05] and [Vap95] among others). As in these areas, our motivation for developing sparse multivariate visualization tools is the potential of these methods for yielding statistical results that are both more interpretable and more robust than classical analyses, while giving up little statistical efficiency. Principal component analysis (PCA) is a classic tool for analyzing large scale multivariate data. It seeks linear combinations of the data variables (often called factors or principal components) that capture a maximum amount of variance.
An Information Geometric Framework for Dimensionality Reduction
Carter, Kevin M., Raich, Raviv, Hero, Alfred O. III
This report concerns the problem of dimensionality reduction through information geometric methods on statistical manifolds. While there has been considerable work recently presented regarding dimensionality reduction for the purposes of learning tasks such as classification, clustering, and visualization, these methods have focused primarily on Riemannian manifolds in Euclidean space. While sufficient for many applications, there are many high-dimensional signals which have no straightforward and meaningful Euclidean representation. In these cases, signals may be more appropriately represented as a realization of some distribution lying on a statistical manifold, or a manifold of probability density functions (PDFs). We present a framework for dimensionality reduction that uses information geometry for both statistical manifold reconstruction as well as dimensionality reduction in the data domain.
Simulated annealing for weighted polygon packing
Xu, Yi-Chun, Xiao, Ren-Bin, Amos, Martyn
In this paper we present a new algorithm for a layout optimization problem: this concerns the placement of weighted polygons inside a circular container, the two objectives being to minimize imbalance of mass and to minimize the radius of the container. This problem carries real practical significance in industrial applications (such as the design of satellites), as well as being of significant theoretical interest. Previous work has dealt with circular or rectangular objects, but here we deal with the more realistic case where objects may be represented as polygons and the polygons are allowed to rotate. We present a solution based on simulated annealing and first test it on instances with known optima. Our results show that the algorithm obtains container radii that are close to optimal. We also compare our method with existing algorithms for the (special) rectangular case. Experimental results show that our approach out-performs these methods in terms of solution quality.