Markov Models
Model-Based Exploration in Monitored Markov Decision Processes
Kazemipour, Alireza, Parisi, Simone, Taylor, Matthew E., Bowling, Michael
A tenet of reinforcement learning is that rewards are always observed by the agent. However, this is not true in many realistic settings, e.g., a human observer may not always be able to provide rewards, a sensor to observe rewards may be limited or broken, or rewards may be unavailable during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed as a model of such settings. Yet, Mon-MDP algorithms developed thus far do not fully exploit the problem structure, cannot take advantage of a known monitor, have no worst-case guarantees for ``unsolvable'' Mon-MDPs without specific initialization, and only have asymptotic proofs of convergence. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses all of these shortcomings. The algorithm uses two instances of model-based interval estimation, one to guarantee that observable rewards are indeed observed, and another to learn the optimal policy. Second, empirical results demonstrate these advantages, showing faster convergence than prior algorithms in over two dozen benchmark settings, and even more dramatic improvements when the monitor process is known. Third, we present the first finite-sample bound on performance and show convergence to an optimal worst-case policy when some rewards are never observable.
In-context learning of evolving data streams with tabular foundational models
Lourenรงo, Afonso, Gama, Joรฃo, Xing, Eric P., Marreiros, Goreti
State-of-the-art data stream mining in supervised classification has traditionally relied on ensembles of incremental decision trees. However, the emergence of large tabular models, i.e., transformers designed for structured numerical data, marks a significant paradigm shift. These models move beyond traditional weight updates, instead employing in-context learning through prompt tuning. By using on-the-fly sketches to summarize unbounded streaming data, one can feed this information into a pre-trained model for efficient processing. This work bridges advancements from both areas, highlighting how transformers' implicit meta-learning abilities, pre-training on drifting natural data, and reliance on context optimization directly address the core challenges of adaptive learning in dynamic environments. Exploring real-time model adaptation, this research demonstrates that TabPFN, coupled with a simple sliding memory strategy, consistently outperforms ensembles of Hoeffding trees across all non-stationary benchmarks. Several promising research directions are outlined in the paper. The authors urge the community to explore these ideas, offering valuable opportunities to advance in-context stream learning.
The Role of Sparsity for Length Generalization in Transformers
Golowich, Noah, Jelassi, Samy, Brandfonbrener, David, Kakade, Sham M., Malach, Eran
Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call $k$-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model justifies certain techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is a ``sparse'' dependency structure of each token on the previous ones. Inspired by our theory, we introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which position coupling can successfully be applied to achieve length generalization.
Towards Optimal Adversarial Robust Reinforcement Learning with Infinity Measurement Error
Li, Haoran, Zhang, Zicheng, Luo, Wang, Han, Congying, Lv, Jiayu, Guo, Tiande, Hu, Yudong
Ensuring the robustness of deep reinforcement learning (DRL) agents against adversarial attacks is critical for their trustworthy deployment. Recent research highlights the challenges of achieving state-adversarial robustness and suggests that an optimal robust policy (ORP) does not always exist, complicating the enforcement of strict robustness constraints. In this paper, we further explore the concept of ORP. We first introduce the Intrinsic State-adversarial Markov Decision Process (ISA-MDP), a novel formulation where adversaries cannot fundamentally alter the intrinsic nature of state observations. ISA-MDP, supported by empirical and theoretical evidence, universally characterizes decision-making under state-adversarial paradigms. We rigorously prove that within ISA-MDP, a deterministic and stationary ORP exists, aligning with the Bellman optimal policy. Our findings theoretically reveal that improving DRL robustness does not necessarily compromise performance in natural environments. Furthermore, we demonstrate the necessity of infinity measurement error (IME) in both $Q$-function and probability spaces to achieve ORP, unveiling vulnerabilities of previous DRL algorithms that rely on $1$-measurement errors. Motivated by these insights, we develop the Consistent Adversarial Robust Reinforcement Learning (CAR-RL) framework, which optimizes surrogates of IME. We apply CAR-RL to both value-based and policy-based DRL algorithms, achieving superior performance and validating our theoretical analysis.
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Feng, Yunhai, Han, Jiaming, Yang, Zhuoran, Yue, Xiangyu, Levine, Sergey, Luo, Jianlan
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism - it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning
Xu, Yang, Mondal, Washim Uddin, Aggarwal, Vaneet
We present the first finite-sample analysis for policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior works in this setting have established only asymptotic convergence guarantees, leaving open the question of sample complexity. In this work, we address this gap by establishing that the robust Bellman operator is a contraction under the span semi-norm, and developing a stochastic approximation framework with controlled bias. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution, ensuring a finite constant sample complexity while maintaining a small bias that decays exponentially with the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average reward estimation, marking a significant advancement in robust reinforcement learning theory.
Quadruped Robot Simulation Using Deep Reinforcement Learning -- A step towards locomotion policy
Jadoon, Nabeel Ahmad Khan, Ekpanyapong, Mongkol
We present a novel reinforcement learning method to train the quadruped robot in a simulated environment. The idea of controlling quadruped robots in a dynamic environment is quite challenging and my method presents the optimum policy and training scheme with limited resources and shows considerable performance. The report uses the raisimGymTorch open-source library and proprietary software RaiSim for the simulation of ANYmal robot. My approach is centered on formulating Markov decision processes using the evaluation of the robot walking scheme while training. Resulting MDPs are solved using a proximal policy optimization algorithm used in actor-critic mode and collected thousands of state transitions with a single desktop machine. This work also presents a controller scheme trained over thousands of time steps shown in a simulated environment. This work also sets the base for early-stage researchers to deploy their favorite algorithms and configurations. Keywords: Legged robots, deep reinforcement learning, quadruped robot simulation, optimal control
Monotonicity Testing of High-Dimensional Distributions with Subcube Conditioning
Chakrabarty, Deeparnab, Chen, Xi, Ristic, Simeon, Seshadhri, C., Waingarten, Erik
We study monotonicity testing of high-dimensional distributions on $\{-1,1\}^n$ in the model of subcube conditioning, suggested and studied by Canonne, Ron, and Servedio~\cite{CRS15} and Bhattacharyya and Chakraborty~\cite{BC18}. Previous work shows that the \emph{sample complexity} of monotonicity testing must be exponential in $n$ (Rubinfeld, Vasilian~\cite{RV20}, and Aliakbarpour, Gouleakis, Peebles, Rubinfeld, Yodpinyanee~\cite{AGPRY19}). We show that the subcube \emph{query complexity} is $\tilde{\Theta}(n/\varepsilon^2)$, by proving nearly matching upper and lower bounds. Our work is the first to use directed isoperimetric inequalities (developed for function monotonicity testing) for analyzing a distribution testing algorithm. Along the way, we generalize an inequality of Khot, Minzer, and Safra~\cite{KMS18} to real-valued functions on $\{-1,1\}^n$. We also study uniformity testing of distributions that are promised to be monotone, a problem introduced by Rubinfeld, Servedio~\cite{RS09} , using subcube conditioning. We show that the query complexity is $\tilde{\Theta}(\sqrt{n}/\varepsilon^2)$. Our work proves the lower bound, which matches (up to poly-logarithmic factors) the uniformity testing upper bound for general distributions (Canonne, Chen, Kamath, Levi, Waingarten~\cite{CCKLW21}). Hence, we show that monotonicity does not help, beyond logarithmic factors, in testing uniformity of distributions with subcube conditional queries.
Risk-Averse Reinforcement Learning: An Optimal Transport Perspective on Temporal Difference Learning
-- The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance, frequently without considering risk or safety. In contrast, safe reinforcement learning seeks to reduce or avoid unsafe states. This letter introduces a risk-averse temporal difference algorithm that uses optimal transport theory to direct the agent toward predictable behavior . By incorporating a risk indicator, the agent learns to favor actions with predictable consequences. We evaluate the proposed algorithm in several case studies and show its effectiveness in the presence of uncertainty. The results demonstrate that our method reduces the frequency of visits to risky states while preserving performance. I. INTRODUCTION Reinforcement learning (RL) algorithms focus on maximizing performance, primarily through long-term reward optimization. However, this objective alone does not always prevent negative or high-risk outcomes.
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
Yang, Yue, Zhao, Linfeng, Ding, Mingyu, Bertasius, Gedas, Szafir, Daniel
Robotics has long sought to develop visual-servoing robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To validate OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift", "Accumulated Predicate Shift", and "Skill Chaining", each designed to assess a different aspect of OSS's negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate a potential solution to OSS that scales up the training data for each skill with a larger and more visually diverse set of demonstrations, with our results showing it is not sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/