Markov Models
Reviews: Scaling Factorial Hidden Markov Models: Stochastic Variational Inference without Messages
Technical quality: This paper tackles the problem of inference and learning of factored HMMs on large sequences and with large latent dimensionality. The primary contribution of the paper is integrating several existing approaches together to enable large-scale learning of FHMMs without a loss in modeling performance. The technical details of the components of the approach (the bivariate Gaussian copula variation posterior, the recognition network, the SVI learning approach) appear to be technically correct. The experimentation touches on the correct points including the accuracy of the learned models and the scalability of the proposed approach. The accuracy of the learned models is assessed using log likelihood on held-out test data. The experiments show that the model performs similarly to the SMF approach on both simulated and real (the Bach Corals) data.
Reviews: Wasserstein Training of Restricted Boltzmann Machines
The paper refers to [2] and says that those authors proved statistical consistency. However, I am then surprised to see in section 4.3 that non-zero shrinkage is obtained (including for gamma 0) for the very simple case of modelling a N(0,I) distribution with N(0, sigma 2 I). What is going on here?? A failure of consistency would be a serious flaw in the formulation of a statistical learning criterion. Also in sec 3 (Stability and KL regularization) the authors say that at least for learning based on samples (\hat{p}_{theta}) that some regularization wrt the KL divergence is required. This clearly weakens the "purity" of the smoothed Wasserstein objective fn.
Reviews: An Online Sequence-to-Sequence Model Using Partial Conditioning
This is a well-done paper. It attacks a problem that is worthwhile: how to construct and train a sequence-to-sequence model that can operate on-line instead of waiting for an entire input to be received. It clearly describes an architecture for solving the problem, and walks the reader through the issues in the design of each component in the architecture: next-step prediction, the attention mechanism, and modeling the ends of blocks. It clearly explains the challenges that need to be overcome train the model and perform inference with it, and proposes reasonable approximate algorithms for training and inference. The speech recognition experiments used to demonstrate the utility of the transducer model and to explore design issues such as maintenance of recurrent state across block boundaries, block size, design of the attention mechanism, and depth of the model are reasonable.
Reviews: PAC Reinforcement Learning with Rich Observations
Contextual MDPs are a specific type of POMDPs with the restriction that the optimal q-function depends only on the most recent observation (instead of the belief state). The authors show that Contextual MDPs are not poly PAC learneable even when either memoryless policies are considered or value function approximation is used. However, when both memoryless policies and value function approximation is used and the transitions are deterministic, then the model is PAC learnable in a polynomial number of episodes (and the complexity is independent of the number of observations). The paper is well written overall. The proofs are quite clear and quite thorough. I am not quite sure that the 16 pages of technical proofs in the appendix are suitable for a conference; the paper may better fit a journal format.
Online POMDP Planning with Anytime Deterministic Guarantees
Autonomous agents operating in real-world scenarios frequently encounter uncertainty and make decisions based on incomplete information. Planning under uncertainty can be mathematically formalized using partially observable Markov decision processes (POMDPs). However, finding an optimal plan for POMDPs can be computationally expensive and is feasible only for small tasks. In recent years, approximate algorithms, such as tree search and sample-based methodologies, have emerged as state-of-the-art POMDP solvers for larger problems. Despite their effectiveness, these algorithms offer only probabilistic and often asymptotic guarantees toward the optimal solution due to their dependence on sampling.
Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor \gamma of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free \gamma -rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the \gamma -rate is optimal for PMD methods as well as PI and that the adaptive step-size is necessary to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity.
Deep Gaussian Markov Random Fields for Graph-Structured Dynamical Systems
Probabilistic inference in high-dimensional state-space models is computationally challenging. For many spatiotemporal systems, however, prior knowledge about the dependency structure of state variables is available. We leverage this structure to develop a computationally efficient approach to state estimation and learning in graph-structured state-space models with (partially) unknown dynamics and limited historical data. Building on recent methods that combine ideas from deep learning with principled inference in Gaussian Markov random fields (GMRF), we reformulate graph-structured state-space models as Deep GMRFs defined by simple spatial and temporal graph layers. This results in a flexible spatiotemporal prior that can be learned efficiently from a single time sequence via variational inference.
A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes
The proximal policy optimization (PPO) algorithm stands as one of the most prosperous methods in the field of reinforcement learning (RL). Despite its success, the theoretical understanding of PPO remains deficient. Specifically, it is unclear whether PPO or its optimistic variants can effectively solve linear Markov decision processes (MDPs), which are arguably the simplest models in RL with function approximation. To bridge this gap, we propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback, and establish a \tilde{\mathcal{O}}(d {3/4}H 2K {3/4}) regret for it. Here d is the ambient dimension of linear MDPs, H is the length of each episode, and K is the number of episodes. Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both stochastic linear MDPs and adversarial linear MDPs with full information.
Finding Safe Zones of Markov Decision Processes Policies
Given a policy of a Markov Decision Process, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of a SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones, and show that in general, the problem is computationally hard. For this reason, we concentrate on finding approximate SafeZones.
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
Yuan, Siyu, Chen, Zehui, Xi, Zhiheng, Ye, Junjie, Du, Zhengyin, Chen, Jiecao
Large Language Models (LLMs) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).