Goto

Collaborating Authors

 Markov Models


Entropy Rate Estimation for Markov Chains with Large State Space

arXiv.org Machine Learning

Estimating the entropy based on data is one of the prototypical problems in distribution property testing and estimation. For estimating the Shannon entropy of a distribution on $S$ elements with independent samples, [Paninski2004] showed that the sample complexity is sublinear in $S$, and [Valiant--Valiant2011] showed that consistent estimation of Shannon entropy is possible if and only if the sample size $n$ far exceeds $\frac{S}{\log S}$. In this paper we consider the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that: (1) As long as the Markov chain mixes not too slowly, i.e., the relaxation time is at most $O(\frac{S}{\ln^3 S})$, consistent estimation is achievable when $n \gg \frac{S^2}{\log S}$. (2) As long as the Markov chain has some slight dependency, i.e., the relaxation time is at least $1+\Omega(\frac{\ln^2 S}{\sqrt{S}})$, consistent estimation is impossible when $n \lesssim \frac{S^2}{\log S}$. Under both assumptions, the optimal estimation accuracy is shown to be $\Theta(\frac{S^2}{n \log S})$. In comparison, the empirical entropy rate requires at least $\Omega(S^2)$ samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.


Nonparametric Bayesian Sparse Graph Linear Dynamical Systems

arXiv.org Machine Learning

A nonparametric Bayesian sparse graph linear dynamical system (SGLDS) is proposed to model sequentially observed multivariate data. SGLDS uses the Bernoulli-Poisson link together with a gamma process to generate an infinite dimensional sparse random graph to model state transitions. Depending on the sparsity pattern of the corresponding row and column of the graph affinity matrix, a latent state of SGLDS can be categorized as either a non-dynamic state or a dynamic one. A normal-gamma construction is used to shrink the energy captured by the non-dynamic states, while the dynamic states can be further categorized into live, absorbing, or noise-injection states, which capture different types of dynamical components of the underlying time series. The state-of-the-art performance of SGLDS is demonstrated with experiments on both synthetic and real data.


Variational Sequential Monte Carlo

arXiv.org Machine Learning

Many recent advances in large scale probabilistic inference rely on variational methods. The success of variational approaches depends on (i) formulating a flexible parametric family of distributions, and (ii) optimizing the parameters to find the member of this family that most closely approximates the exact posterior. In this paper we present a new approximating family of distributions, the variational sequential Monte Carlo (VSMC) family, and show how to optimize it in variational inference. VSMC melds variational inference (VI) and sequential Monte Carlo (SMC), providing practitioners with flexible, accurate, and powerful Bayesian inference. The VSMC family is a variational family that can approximate the posterior arbitrarily well, while still allowing for efficient optimization of its parameters. We demonstrate its utility on state space models, stochastic volatility models for financial data, and deep Markov models of brain neural circuits.


Rover Descent: Learning to optimize by learning to navigate on prototypical loss surfaces

arXiv.org Machine Learning

Learning to optimize - the idea that we can learn from data algorithms that optimize a numerical criterion - has recently been at the heart of a growing number of research efforts. One of the most challenging issues within this approach is to learn a policy that is able to optimize over classes of functions that are fairly different from the ones that it was trained on. We propose a novel way of framing learning to optimize as a problem of learning a good navigation policy on a partially observable loss surface. To this end, we develop Rover Descent, a solution that allows us to learn a fairly broad optimization policy from training on a small set of prototypical two-dimensional surfaces that encompasses the classically hard cases such as valleys, plateaus, cliffs and saddles and by using strictly zero-order information. We show that, without having access to gradient or curvature information, we achieve state-of-the-art convergence speed on optimization problems not presented at training time such as the Rosenbrock function and other hard cases in two dimensions. We extend our framework to optimize over high dimensional landscapes, while still handling only two-dimensional local landscape information and show good preliminary results.


Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections

arXiv.org Machine Learning

The paper describes a new algorithm to generate minimal, stable, and symbolic corrections to an input that will cause a neural network with ReLU neurons to change its output. We argue that such a correction is a useful way to provide feedback to a user when the neural network produces an output that is different from a desired output. Our algorithm generates such a correction by solving a series of linear constraint satisfaction problems. The technique is evaluated on a neural network that has been trained to predict whether an applicant will pay a mortgage.


Differentiable Dynamic Programming for Structured Prediction and Attention

arXiv.org Machine Learning

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.


Approximating Partition Functions in Constant Time

arXiv.org Machine Learning

We study approximations of the partition function of dense graphical models. Partition functions of graphical models play a fundamental role is statistical physics, in statistics and in machine learning. Two of the main methods for approximating the partition function are Markov Chain Monte Carlo and Variational Methods. An impressive body of work in mathematics, physics and theoretical computer science provides conditions under which Markov Chain Monte Carlo methods converge in polynomial time. These methods often lead to polynomial time approximation algorithms for the partition function in cases where the underlying model exhibits correlation decay. There are very few theoretical guarantees for the performance of variational methods. One exception is recent results by Risteski (2016) who considered dense graphical models and showed that using variational methods, it is possible to find an $O(\epsilon n)$ additive approximation to the log partition function in time $n^{O(1/\epsilon^2)}$ even in a regime where correlation decay does not hold. We show that under essentially the same conditions, an $O(\epsilon n)$ additive approximation of the log partition function can be found in constant time, independent of $n$. In particular, our results cover dense Ising and Potts models as well as dense graphical models with $k$-wise interaction. They also apply for low threshold rank models.


Measuring Sample Quality with Diffusions

arXiv.org Machine Learning

Stein's method for measuring convergence to a continuous target distribution relies on an operator characterizing the target and Stein factor bounds on the solutions of an associated differential equation. While such operators and bounds are readily available for a diversity of univariate targets, few multivariate targets have been analyzed. We introduce a new class of characterizing operators based on Ito diffusions and develop explicit multivariate Stein factor bounds for any target with a fast-coupling Ito diffusion. As example applications, we develop computable and convergence-determining diffusion Stein discrepancies for log-concave, heavy-tailed, and multimodal targets and use these quality measures to select the hyperparameters of biased Markov chain Monte Carlo (MCMC) samplers, compare random and deterministic quadrature rules, and quantify bias-variance tradeoffs in approximate MCMC. Our results establish a near-linear relationship between diffusion Stein discrepancies and Wasserstein distances, improving upon past work even for strongly log-concave targets. The exposed relationship between Stein factors and Markov process coupling may be of independent interest.


Heron Inference for Bayesian Graphical Models

arXiv.org Machine Learning

Bayesian graphical models have been shown to be a powerful tool for discovering uncertainty and causal structure from real-world data in many application fields. Current inference methods primarily follow different kinds of trade-offs between computational complexity and predictive accuracy. At one end of the spectrum, variational inference approaches perform well in computational efficiency, while at the other end, Gibbs sampling approaches are known to be relatively accurate for prediction in practice. In this paper, we extend an existing Gibbs sampling method, and propose a new deterministic Heron inference (Heron) for a family of Bayesian graphical models. In addition to the support for nontrivial distributability, one more benefit of Heron is that it is able to not only allow us to easily assess the convergence status but also largely improve the running efficiency. We evaluate Heron against the standard collapsed Gibbs sampler and state-of-the-art state augmentation method in inference for well-known graphical models. Experimental results using publicly available real-life data have demonstrated that Heron significantly outperforms the baseline methods for inferring Bayesian graphical models.


Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

arXiv.org Machine Learning

We present a unified framework to analyze the global convergence of Langevin dynamics based algorithms for nonconvex finite-sum optimization with $n$ component functions. At the core of our analysis is a direct analysis of the ergodicity of the numerical approximations to Langevin dynamics, which leads to faster convergence rates. Specifically, we show that gradient Langevin dynamics (GLD) and stochastic gradient Langevin dynamics (SGLD) converge to the almost minimizer within $\tilde O\big(nd/(\lambda\epsilon) \big)$ and $\tilde O\big(d^7/(\lambda^5\epsilon^5) \big)$ stochastic gradient evaluations respectively, where $d$ is the problem dimension, and $\lambda$ is the spectral gap of the Markov chain generated by GLD. Both of the results improve upon the best known gradient complexity results. Furthermore, for the first time we prove the global convergence guarantee for variance reduced stochastic gradient Langevin dynamics (VR-SGLD) to the almost minimizer after $\tilde O\big(\sqrt{n}d^5/(\lambda^4\epsilon^{5/2})\big)$ stochastic gradient evaluations, which outperforms the gradient complexities of GLD and SGLD in a wide regime. Our theoretical analyses shed some light on using Langevin dynamics based algorithms for nonconvex optimization with provable guarantees.