Markov Models
Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems
We study Reinforcement Learning for partially observable systems using function approximation. We propose a new PO-bilinear framework, that is general enough to include models such as undercomplete tabular Partially Observable Markov Decision Processes (POMDPs), Linear Quadratic Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs. Under this framework, we propose an actor-critic style algorithm that is capable to performing agnostic policy learning. Given a policy class that consists of memory based policies (i.e., policy that looks at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy among the policy class. For certain examples such as undercomplete POMDPs and LQGs, by leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon.
Reviews: Learning Chordal Markov Networks via Branch and Bound
The authors present a branch and bound algorithm for learning Chordal Markov networks. The prior state of the art algorithm is a dynamic programming approach based on a recursive characterization of clique tress and storing in memory the scores of already-solved subproblems. The proposed algorithm uses a branch and bound algorithm to search for an optimal chordal Markov network. The algorithm first uses a dynamic programming algorithm to enumerate Bayesian network structures, which are later used as pruning bounds. A symmetry breaking technique is introduced to prune the search space.
Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition
Zheng, Zhong, Zhang, Haochen, Xue, Lingzhou
We study the gap-dependent bounds of two important algorithms for on-policy Q-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for Q-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for Q-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in $T$ and improve upon existing ones for Q-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for Q-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for Q-learning.
Lifted Weighted Mini-Bucket
Many graphical models, such as Markov Logic Networks (MLNs) with evidence, possess highly symmetric substructures but no exact symmetries. Unfortunately, there are few principled methods that exploit these symmetric substructures to perform efficient approximate inference. In this paper, we present a lifted variant of the Weighted Mini-Bucket elimination algorithm which provides a principled way to (i) exploit the highly symmetric substructure of MLN models, and (ii) incorporate high-order inference terms which are necessary for high quality approximate inference. Our method has significant control over the accuracy-time trade-off of the approximation, allowing us to generate any-time approximations. Experimental results demonstrate the utility of this class of approximations, especially in models with strong repulsive potentials.
Reviews: Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo
I like the fact that the method is very simple to understand and implement (see my summary), and does not require any major changes to the base SG-MCMC algorithm. Also, this seems very general and applies to a large class of SG-MCMC algorithms, and is therefore potentially very impactful to the Stochastic Gradient MCMC community. Novelty: Although Richardson-Romberg extrapolation is well known in numerical analysis, it is not widely known in the machine learning / stochastic gradient MCMC community. Clarity: The paper is well written and the presentation is clear. Comments/questions: - Can this technique be directly applied to all SG-MCMC methods?
Reviews: Collapsed variational Bayes for Markov jump processes
The authors present a variational inference algorithm for continuous time Markov jump processes. Following previous work, they use "uniformization" to produce a discrete time skeleton at which they infer the latent states. Unlike previous work, however, the authors propose to learn this skeleton (a point estimate, via random search) and to integrate out, or collapse, the transition matrix during latent state inference. They compare their algorithm to existing MCMC schemes, which also use uniformization, but which do not collapse out the transition matrix. While this work is well motivated, I found it difficult to tease out which elements of the inference algorithm led to the observed improvement.
Reviews: rho-POMDPs have Lipschitz-Continuous epsilon-Optimal Value Functions
The paper addresses the problem of rho-POMDPs non-convex reward functions, proving that indeed under some cases they, and their resulting value functions, are Lipschitz-continuous (LC) for finite horizons. The paper also proposes and uses a more general vector form of LC, too. This result allows value function approximations of the optimal V * to be used, as well as upper and lower bounds (U and L) on value as in HSVI, and a wide array of new algorithms to be developed. This is analogous to the PWLC result for standard POMDPs, as LC is more general, allowing for similar contraction operators with Banach's fixed point theorem as in (PO)MDPs, and finite horizon approximations of the infinite horizon objective criteria. Once the paper establishes the main result, it discusses approximations of U and L using min or max, respectively, over sets of cones.
Reviews: Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
The authors conduct an analysis of CTC trained acoustic models to determine how information related to phonetic categories is preserved in CTC-based models which directly output graphemes. The work follows a long line of research that has analyzed neural network representations to determine how they model phonemic representations, although to the best of my knowledge this has not been done previously for CTC-based end-to-end architectures. The results and analysis presented by the authors is interesting, although there are some concerns I have with the conclusions that the authors draw that I would like to clarify these points. Please see my detailed comments below. In the paper, the authors conclude that (Line 159--164) "... after the 5th recurrent layer accuracy goes down again. One possible explanation to this may be that higher layers in the model are more sensitive to long distance information that is needed for the speech recognition task, whereas the local information which is needed for classifying phones is better captured in lower layers."
Reviews: On Learning Markov Chains
Summary: The paper's goal is to study the minimax rates for learning problems on Markovian data. The author(s) consider an interesting setting where the sequence of data observed follow a Markovian dependency pattern. They consider discrete state Markov chains with a state space [k] and study the minimax error rates for the following two tasks: Prediction: Given a trajectory X_1 - X_2 … - X_n from an unknown chain M, predict the probability distribution of the next state X_n 1, i.e., P(. X_1…n)) ] where the expectation is over the trajectory X_1…X_n. The loss function the paper focuses on is KL-divergence and presents a conjecture for how the L_1 loss should scale with respect to k and n.
Reviews: Information Theoretic Properties of Markov Random Fields, and their Algorithmic Applications
This paper is concerned with learning Markov random fields (MRF). It is a theoretical paper, which is ultimately focused with proving a particular statement: given a variable X in an MRF, and given some of its Markov blanket variables B, there exists another variable Y that is conditionally dependent on X given the subset of B. In general this statement is not true; so the goal here is to identify some conditions where this is true. Most of this paper is centered around this, from which the ability to learn an MRF follows. The paper is mostly technical; my main complaint is that I do not think it is very intuitive. It appears that central to the results is the assumption on non-degeneracy, which I believe should be explained in higher level terms.