Goto

Collaborating Authors

 Markov Models


Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures

arXiv.org Machine Learning

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and \emph{unfold} the inference iterations as layers in a deep network. Rather than optimizing the original model, we \emph{untie} the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.


Joint modeling of multiple time series via the beta process with application to motion capture segmentation

arXiv.org Machine Learning

We propose a Bayesian nonparametric approach to the problem of jointly modeling multiple related time series. Our model discovers a latent set of dynamical behaviors shared among the sequences, and segments each time series into regions defined by a subset of these behaviors. Using a beta process prior, the size of the behavior set and the sharing pattern are both inferred from data. We develop Markov chain Monte Carlo (MCMC) methods based on the Indian buffet process representation of the predictive distribution of the beta process. Our MCMC inference algorithm efficiently adds and removes behaviors via novel split-merge moves as well as data-driven birth and death proposals, avoiding the need to consider a truncated model. We demonstrate promising results on unsupervised segmentation of human motion capture data.


Projecting Markov Random Field Parameters for Fast Mixing

arXiv.org Machine Learning

Markov chain Monte Carlo (MCMC) algorithms are simple and extremely powerful techniques to sample from almost arbitrary distributions. The flaw in practice is that it can take a large and/or unknown amount of time to converge to the stationary distribution. This paper gives sufficient conditions to guarantee that univariate Gibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a precise sense. Further, an algorithm is given to project onto this set of fast-mixing parameters in the Euclidean norm. Following recent work, we give an example use of this to project in various divergence measures, comparing univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.


Marginal Pseudo-Likelihood Learning of Markov Network structures

arXiv.org Machine Learning

Undirected graphical models known as Markov networks are popular for a wide variety of applications ranging from statistical physics to computational biology. Traditionally, learning of the network structure has been done under the assumption of chordality which ensures that efficient scoring methods can be used. In general, non-chordal graphs have intractable normalizing constants which renders the calculation of Bayesian and other scores difficult beyond very small-scale systems. Recently, there has been a surge of interest towards the use of regularized pseudo-likelihood methods for structural learning of large-scale Markov network models, as such an approach avoids the assumption of chordality. The currently available methods typically necessitate the use of a tuning parameter to adapt the level of regularization for a particular dataset, which can be optimized for example by cross-validation. Here we introduce a Bayesian version of pseudo-likelihood scoring of Markov networks, which enables an automatic regularization through marginalization over the nuisance parameters in the model. We prove consistency of the resulting MPL estimator for the network structure via comparison with the pseudo information criterion. Identification of the MPL-optimal network on a prescanned graph space is considered with both greedy hill climbing and exact pseudo-Boolean optimization algorithms. We find that for reasonable sample sizes the hill climbing approach most often identifies networks that are at a negligible distance from the restricted global optimum. Using synthetic and existing benchmark networks, the marginal pseudo-likelihood method is shown to generally perform favorably against recent popular inference methods for Markov networks.


Large-Margin Determinantal Point Processes

arXiv.org Machine Learning

Determinantal point processes (DPPs) offer a powerful approach to modeling diversity in many applications where the goal is to select a diverse subset. We study the problem of learning the parameters (the kernel matrix) of a DPP from labeled training data. We make two contributions. First, we show how to reparameterize a DPP's kernel matrix with multiple kernel functions, thus enhancing modeling flexibility. Second, we propose a novel parameter estimation technique based on the principle of large margin separation. In contrast to the state-of-the-art method of maximum likelihood estimation, our large-margin loss function explicitly models errors in selecting the target subsets, and it can be customized to trade off different types of errors (precision vs. recall). Extensive empirical studies validate our contributions, including applications on challenging document and video summarization, where flexibility in modeling the kernel matrix and balancing different errors is indispensable.


Stochastic Variational Inference for Hidden Markov Models

arXiv.org Machine Learning

Variational inference algorithms have proven successful for Bayesian analysis in large data settings, with recent advances using stochastic variational inference (SVI). However, such methods have largely been studied in independent or exchangeable data settings. We develop an SVI algorithm to learn the parameters of hidden Markov models (HMMs) in a time-dependent data setting. The challenge in applying stochastic optimization in this setting arises from dependencies in the chain, which must be broken to consider minibatches of observations. We propose an algorithm that harnesses the memory decay of the chain to adaptively bound errors arising from edge effects. We demonstrate the effectiveness of our algorithm on synthetic experiments and a large genomics dataset where a batch algorithm is computationally infeasible.


Active Inference for Binary Symmetric Hidden Markov Models

arXiv.org Machine Learning

We consider active maximum a posteriori (MAP) inference problem for Hidden Markov Models (HMM), where, given an initial MAP estimate of the hidden sequence, we select to label certain states in the sequence to improve the estimation accuracy of the remaining states. We develop an analytical approach to this problem for the case of binary symmetric HMMs, and obtain a closed form solution that relates the expected error reduction to model parameters under the specified active inference scheme. We then use this solution to determine most optimal active inference scheme in terms of error reduction, and examine the relation of those schemes to heuristic principles of uncertainty reduction and solution unicity.



Discovering and Characterizing Emerging Events in Big Data

AAAI Conferences

We describe a novel system for discovering and characterizing emerging events. We define event emergence to be a developing situation comprised of a series of sub-events. To detect sub-events from a very large, continuous textual input stream, we use two techniques: (1) frequency-based detection of sub-events that are potentially entailed by an emerging event; and (2) anomaly-based detection of other sub-events that are potentially indicative of an emerging event. Identifying emerging events from detected sub-events involves connecting sub-events to each other and to the relevant emerging events within the event models and estimating the likelihood of possible emerging events. Each sub-event can be part of a number of emerging events and supports various event models to varying degrees. We adopt a coherent and compact model that probabilistically identifies emerging events. The innovative aspect of our work is a well-defined framework where statistical Big Data techniques are informed by event semantics and inference techniques (and vice versa). Our work is strongly grounded in semantics and knowledge representation, which enables us to produce more reliable results than would otherwise be possible with a purely statistical approach.


Hidden Parameter Markov Decision Processes: An Emerging Paradigm for Modeling Families of Related Tasks

AAAI Conferences

The goal of transfer is to use knowledge obtained by solving one task to improvea robot's (or software agent's) performance in future tasks. In general, we do not expect this to work; for transfer to be feasible, there must be something in common between the source task(s) and goal task(s). The question at the core of the transfer learning enterprise is therefore: what makes two tasks related?, or more generally, how do you define a family of related tasks? Given a precise definition of how a particular family of tasks is related, we can formulate clear optimizationmethods for selecting source tasks and determining what knowledge should be imported from the source task(s), and how it should be used in the target task(s). This paper describes one model that has appeared in several different research scenarios where an agent is faced with afamily of tasks that have similar, but not identical, dynamics (or reward functions). For example, a human learning to play baseball may, over the course of their career,be exposed to several different bats, each with slightly different weights and lengths.A human who has learned to play baseball well with one bat would be expected to be able to pick up any similar bat and use it.Similarly, when learning to drive a car, one may learn in more than one car, and then be expected to be able to drive any make and model of car (within reasonablevariations) with little or no relearning. These examples are instances of exactly the kind of flexible, reliable,and sample-efficient behavior that we should be aiming to achieve in robotics applications. One way to model such a family of tasks is to posit that they are generated by asmall set of latent parameters (e.g., the length and weight of the bat, or parametersdescribing the various physical properties of the car's steering system and clutch) thatare fixed for each problem instance (e.g., for each bat, or car), but are not directlyobservable by the agent. Defining a distributionover these latent parameters results in a family of related tasks, and transferis feasible to the extent that the number of latent variables is small, the task dynamics(or reward function) vary smoothly with them, and to the extent to which they can eitherbe ignored or identified using transition data from the task.This model has appeared under several different names in the literature; we refer to it as a hidden-parameterMarkov decision process (or HIP-MDP).