Statistical Learning
Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Jordan, Michael I., Wainwright, Martin J.
We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropybound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problemthat can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problemis strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial marginfor certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3].
Learning Spectral Clustering
Bach, Francis R., Jordan, Michael I.
Spectral clustering refers to a class of techniques which rely on the eigenstructure ofa similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clustershaving low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.
An Iterative Improvement Procedure for Hierarchical Clustering
Kauchak, David, Dasgupta, Sanjoy
We describe a procedure which finds a hierarchical clustering by hillclimbing. Thecost function we use is a hierarchical extension of the k-means cost; our local moves are tree restructurings and node reorderings. Weshow these can be accomplished efficiently, by exploiting special properties of squared Euclidean distances and by using techniques from scheduling algorithms.
Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Shental, Noam, Bar-hillel, Aharon, Hertz, Tomer, Weinshall, Daphna
Density estimation with Gaussian Mixture Models is a popular generative techniqueused also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Suchconstraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using aMarkov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
Warped Gaussian Processes
Snelson, Edward, Ghahramani, Zoubin, Rasmussen, Carl E.
This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm choosesa nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation.
Sparse Greedy Minimax Probability Machine Classification
Strohmann, Thomas R., Belitski, Andrei, Grudic, Gregory Z., DeCoste, Dennis
The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracybound ฮฉ. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally addingbasis functions (i.e.
Wormholes Improve Contrastive Divergence
Welling, Max, Mnih, Andriy, Hinton, Geoffrey E.
In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model's distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
New Algorithms for Efficient High Dimensional Non-parametric Classification
liu, Ting, Moore, Andrew W., Gray, Alexander
This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, andwe investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phaseof SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN.
Policy Search by Dynamic Programming
Bagnell, J. A., Kakade, Sham M., Schneider, Jeff G., Ng, Andrew Y.
We consider the policy search approach to reinforcement learning. We show that if a "baseline distribution" is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide nontrivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.