Markov Models
A phase transition in sampling from Restricted Boltzmann Machines
Kwon, Youngwoo, Qin, Qian, Wang, Guanyang, Wei, Yuchen
Restricted Boltzmann Machines are a class of undirected graphical models that play a key role in deep learning and unsupervised learning. In this study, we prove a phase transition phenomenon in the mixing time of the Gibbs sampler for a one-parameter Restricted Boltzmann Machine. Specifically, the mixing time varies logarithmically, polynomially, and exponentially with the number of vertices depending on whether the parameter $c$ is above, equal to, or below a critical value $c_\star\approx-5.87$. A key insight from our analysis is the link between the Gibbs sampler and a dynamical system, which we utilize to quantify the former based on the behavior of the latter. To study the critical case $c= c_\star$, we develop a new isoperimetric inequality for the sampler's stationary distribution by showing that the distribution is nearly log-concave.
Offline Hierarchical Reinforcement Learning via Inverse Optimization
Schmidt, Carolin, Gammelli, Daniele, Harrison, James, Pavone, Marco, Rodrigues, Filipe
Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.
The Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis
We study the sample complexity of the plug-in approach for learning $\varepsilon$-optimal policies in average-reward Markov decision processes (MDPs) with a generative model. The plug-in approach constructs a model estimate then computes an average-reward optimal policy in the estimated model. Despite representing arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed. Unlike the more well-studied discounted MDP reduction method, the plug-in approach requires no prior problem information or parameter tuning. Our results fill this gap and address the limitations of prior approaches, as we show that the plug-in approach is optimal in several well-studied settings without using prior knowledge. Specifically it achieves the optimal diameter- and mixing-based sample complexities of $\widetilde{O}\left(SA \frac{D}{\varepsilon^2}\right)$ and $\widetilde{O}\left(SA \frac{\tau_{\mathrm{unif}}}{\varepsilon^2}\right)$, respectively, without knowledge of the diameter $D$ or uniform mixing time $\tau_{\mathrm{unif}}$. We also obtain span-based bounds for the plug-in approach, and complement them with algorithm-specific lower bounds suggesting that they are unimprovable. Our results require novel techniques for analyzing long-horizon problems which may be broadly useful and which also improve results for the discounted plug-in approach, removing effective-horizon-related sample size restrictions and obtaining the first optimal complexity bounds for the full range of sample sizes without reward perturbation.
Equilibrium and non-Equilibrium regimes in the learning of Restricted Boltzmann Machines
Training Restricted Boltzmann Machines (RBMs) has been challenging for a long time due to the difficulty of computing precisely the log-likelihood gradient. Over the past decades, many works have proposed more or less successful recipes but without studying systematically the crucial quantity of the problem: the mixing time i.e. the number of MCMC iterations needed to sample completely new configurations from a model. In this work, we show that this mixing time plays a crucial role in the behavior and stability of the trained model, and that RBMs operate in two well-defined distinct regimes, namely equilibrium and out-of-equilibrium, depending on the interplay between this mixing time of the model and the number of MCMC steps, k, used to approximate the gradient. We further show empirically that this mixing time increases along the learning, which often implies a transition from one regime to another as soon as k becomes smaller than this time.In particular, we show that using the popular k (persistent) contrastive divergence approaches, with k small, the dynamics of the fitted model are extremely slow and often dominated by strong out-of-equilibrium effects. On the contrary, RBMs trained in equilibrium display much faster dynamics, and a smooth convergence to dataset-like configurations during the sampling.Finally, we discuss how to exploit in practice both regimes depending on the task one aims to fulfill: (i) short k s can be used to generate convincing samples in short learning times, (ii) large k (or increasingly large) must be used to learn the correct equilibrium distribution of the RBM.
Regime Switching Bandits
We study a multi-armed bandit problem where the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated from all arms are modulated by a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the transition matrix and the reward distributions. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, belief error control in partially observable Markov decision processes and upper-confidence-bound methods for online learning. We also establish an upper bound O(T {2/3}\sqrt{\log T}) for the proposed learning algorithm where T is the learning horizon. Finally, we conduct proof-of-concept experiments to illustrate the performance of the learning algorithm.
Towards Instance-Optimal Offline Reinforcement Learning with Pessimism
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy \mu . In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the information-theoretical lower bounds based on different data-coverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. Here \pi \star is a optimal policy, \mu is the behavior policy and d(s_h,a_h) is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results.
Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems
This manuscript contributes a general and practical framework for casting a Markov process model of a system at equilibrium as a structural causal model, and carrying out counterfactual inference. Markov processes mathematically describe the mechanisms in the system, and predict the system's equilibrium behavior upon intervention, but do not support counterfactual inference. In contrast, structural causal models support counterfactual inference, but do not identify the mechanisms. We define the structural causal models in terms of the parameters and the equilibrium dynamics of the Markov process models, and counterfactual inference flows from these settings. The proposed approach alleviates the identifiability drawback of the structural causal models, in that the counterfactual inference is consistent with the counterfactual trajectories simulated from the Markov process model.
Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes
In this paper, a rather general online problem called \emph{dynamic resource allocation with capacity constraints (DRACC)} is introduced and studied in the realm of posted price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. As the existing online learning techniques do not yield vanishing-regret mechanisms for this problem, we develop a novel online learning framework defined over deterministic Markov decision processes with \emph{dynamic} state transition and reward functions. We then prove that if the Markov decision process is guaranteed to admit an oracle that can simulate any given policy from any initial state with bounded loss --- a condition that is satisfied in the DRACC problem --- then the online learning problem can be solved with vanishing regret. Our proof technique is based on a reduction to online learning with \emph{switching cost}, in which an online decision maker incurs an extra cost every time she switches from one arm to another.
Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies -- act by 1-step planning -- can achieve tight minimax performance in terms of regret, O(\sqrt{HSAT}). Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to such that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.
Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
Models of many real-life applications, such as queueing models of communication networks or computing systems, have a countably infinite state-space. Algorithmic and learning procedures that have been developed to produce optimal policies mainly focus on finite state settings, and do not directly apply to these models. To overcome this lacuna, in this work we study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes (MDPs) governed by an unknown parameter \theta\in\Theta, and defined on a countably-infinite state-space \mathcal X \mathbb{Z}_ d, with finite action space \mathcal A, and an unbounded cost function. We take a Bayesian perspective with the random unknown parameter \boldsymbol{\theta} * generated via a given fixed prior distribution on \Theta . To optimally control the unknown MDP, we propose an algorithm based on Thompson sampling with dynamically-sized episodes: at the beginning of each episode, the posterior distribution formed via Bayes' rule is used to produce a parameter estimate, which then decides the policy applied during the episode.