Goto

Collaborating Authors

 Markov Models


Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Neural Information Processing Systems

When used to learn high dimensional parametric probabilistic models, the clas- sical maximum likelihood (ML) learning often suffers from computational in- tractability, which motivates the active developments of non-ML learning meth- ods. Yet, because of their divergent motivations and forms, the objective func- tions of many non-ML learning methods are seemingly unrelated, and there lacks a unified framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learn- ing methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score match- ing [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be unified under the minimum KL con- traction framework with different choices of the KL contraction operators.


High-dimensional Sparse Inverse Covariance Estimation using Greedy Methods

arXiv.org Machine Learning

In this paper we consider the task of estimating the non-zero pattern of the sparse inverse covariance matrix of a zero-mean Gaussian random vector from a set of iid samples. Note that this is also equivalent to recovering the underlying graph structure of a sparse Gaussian Markov Random Field (GMRF). We present two novel greedy approaches to solving this problem. The first estimates the non-zero covariates of the overall inverse covariance matrix using a series of global forward and backward greedy steps. The second estimates the neighborhood of each node in the graph separately, again using greedy forward and backward steps, and combines the intermediate neighborhoods to form an overall estimate. The principal contribution of this paper is a rigorous analysis of the sparsistency, or consistency in recovering the sparsity pattern of the inverse covariance matrix. Surprisingly, we show that both the local and global greedy methods learn the full structure of the model with high probability given just $O(d\log(p))$ samples, which is a \emph{significant} improvement over state of the art $\ell_1$-regularized Gaussian MLE (Graphical Lasso) that requires $O(d^2\log(p))$ samples. Moreover, the restricted eigenvalue and smoothness conditions imposed by our greedy methods are much weaker than the strong irrepresentable conditions required by the $\ell_1$-regularization based methods. We corroborate our results with extensive simulations and examples, comparing our local and global greedy methods to the $\ell_1$-regularized Gaussian MLE as well as the Neighborhood Greedy method to that of nodewise $\ell_1$-regularized linear regression (Neighborhood Lasso).


Asynchronous Stochastic Approximation with Differential Inclusions

arXiv.org Machine Learning

The asymptotic pseudo-trajectory approach to stochastic approximation of Benaim, Hofbauer and Sorin is extended for asynchronous stochastic approximations with a set-valued mean field. The asynchronicity of the process is incorporated into the mean field to produce convergence results which remain similar to those of an equivalent synchronous process. In addition, this allows many of the restrictive assumptions previously associated with asynchronous stochastic approximation to be removed. The framework is extended for a coupled asynchronous stochastic approximation process with set-valued mean fields. Two-timescales arguments are used here in a similar manner to the original work in this area by Borkar. The applicability of this approach is demonstrated through learning in a Markov decision process.


Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization

Journal of Artificial Intelligence Research

Many problems in artificial intelligence require adaptively making a sequence of decisions with uncertain outcomes under partial observability. Solving such stochastic optimization problems is a fundamental but notoriously difficult challenge. In this paper, we introduce the concept of adaptive submodularity, generalizing submodular set functions to adaptive policies. We prove that if a problem satisfies this property, a simple adaptive greedy algorithm is guaranteed to be competitive with the optimal policy. In addition to providing performance guarantees for both stochastic maximization and coverage, adaptive submodularity can be exploited to drastically speed up the greedy algorithm by using lazy evaluations. We illustrate the usefulness of the concept by giving several examples of adaptive submodular objectives arising in diverse AI applications including management of sensing resources, viral marketing and active learning. Proving adaptive submodularity for these problems allows us to recover existing results in these applications as special cases, improve approximation guarantees and handle natural generalizations.


Self-Avoiding Random Dynamics on Integer Complex Systems

arXiv.org Machine Learning

This paper introduces a new specialized algorithm for equilibrium Monte Carlo sampling of binary-valued systems, which allows for large moves in the state space. This is achieved by constructing self-avoiding walks (SAWs) in the state space. As a consequence, many bits are flipped in a single MCMC step. We name the algorithm SARDONICS, an acronym for Self-Avoiding Random Dynamics on Integer Complex Systems. The algorithm has several free parameters, but we show that Bayesian optimization can be used to automatically tune them. SARDONICS performs remarkably well in a broad number of sampling tasks: toroidal ferromagnetic and frustrated Ising models, 3D Ising models, restricted Boltzmann machines and chimera graphs arising in the design of quantum computers.


Joint Modeling of Multiple Related Time Series via the Beta Process

arXiv.org Machine Learning

We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our approach is based on the discovery of a set of latent, shared dynamical behaviors. Using a beta process prior, the size of the set and the sharing pattern are both inferred from data. We develop efficient Markov chain Monte Carlo methods based on the Indian buffet process representation of the predictive distribution of the beta process, without relying on a truncated model. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth and death proposals. We examine the benefits of our proposed feature-based model on several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.


Learning to Make Predictions In Partially Observable Environments Without a Generative Model

Journal of Artificial Intelligence Research

When faced with the problem of learning a model of a high-dimensional environment, a common approach is to limit the model to make only a restricted set of predictions, thereby simplifying the learning problem. These partial models may be directly useful for making decisions or may be combined together to form a more complete, structured model. However, in partially observable (non-Markov) environments, standard model-learning methods learn generative models, i.e. models that provide a probability distribution over all possible futures (such as POMDPs). It is not straightforward to restrict such models to make only certain predictions, and doing so does not always simplify the learning problem. In this paper we present prediction profile models: non-generative partial models for partially observable systems that make only a given set of predictions, and are therefore far simpler than generative models in some cases. We formalize the problem of learning a prediction profile model as a transformation of the original model-learning problem, and show empirically that one can learn prediction profile models that make a small set of important predictions even in systems that are too complex for standard generative models.


Spectral Methods for Learning Multivariate Latent Tree Structure

arXiv.org Machine Learning

This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics.


Model Selection in Undirected Graphical Models with the Elastic Net

arXiv.org Machine Learning

Structure learning in random fields has attracted considerable attention due to its difficulty and importance in areas such as remote sensing, computational biology, natural language processing, protein networks, and social network analysis. We consider the problem of estimating the probabilistic graph structure associated with a Gaussian Markov Random Field (GMRF), the Ising model and the Potts model, by extending previous work on $l_1$ regularized neighborhood estimation to include the elastic net $l_1+l_2$ penalty. Additionally, we show numerical evidence that the edge density plays a role in the graph recovery process. Finally, we introduce a novel method for augmenting neighborhood estimation by leveraging pair-wise neighborhood union estimates.


Planning with State Uncertainty via Contingency Planning and Execution Monitoring

AAAI Conferences

An example is a Mars rover: The major problem with applying POMDP approaches to thanks to low-level control and obstacle avoidance, rovers realistic planning problems like the Mars rovers is the sheer can be expected to reach their destinations reliably, and can size of the problems. Using point-based approximations and collect and communicate data, but they do not know in advance structured representations similar to those used in classical which science targets are interesting and hence will planning (Poupart 2005), problems with tens of millions provide valuable data. Similarly, robots performing tasks of states can be solved approximately, but even that corresponds such as security or cognitive assistance are generally able to to a classical planning problem with only 25 binary navigate reliably, but use unreliable vision algorithms to detect variables, which is a quite small problem by the standards the people and objects with which they are supposed of classical deterministic planning. The alternative we propose to interact. Following Besse and Chaib-draa (2009), we in this paper is to construct a series of classical deterministic will refer to problems with deterministic actions but stochastic planning problems from the quasi-deterministic observations as quasi-deterministic problems, which differ problem. By solving each of these deterministic problems from Deterministic-POMDPs (DET-POMDPS) (Bonet we construct a contingent plan--one that contains branches 2009) by taking into account of uncertainty from observation to be chosen between at run-time.