Goto

Collaborating Authors

 Markov Models


Generalized Linear Markov Decision Process

arXiv.org Machine Learning

The linear Markov Decision Process (MDP) framework offers a principled foundation for reinforcement learning (RL) with strong theoretical guarantees and sample efficiency. However, its restrictive assumption-that both transition dynamics and reward functions are linear in the same feature space-limits its applicability in real-world domains, where rewards often exhibit nonlinear or discrete structures. Motivated by applications such as healthcare and e-commerce, where data is scarce and reward signals can be binary or count-valued, we propose the Generalized Linear MDP (GLMDP) framework-an extension of the linear MDP framework-that models rewards using generalized linear models (GLMs) while maintaining linear transition dynamics. We establish the Bellman completeness of GLMDPs with respect to a new function class that accommodates nonlinear rewards and develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI) that utilizes both labeled and unlabeled trajectories. Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.


Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

arXiv.org Machine Learning

In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $ฯ€^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $ฮฒ\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI) which leads to $(ฮต,ฮด)$-PAC-bounds on $\|Q^*-Q^k\|$, and $\|V^*-V^{ฯ€_k}\|$ where $Q_k$ is the output of MB-RS-QVI after k iterations and $ฯ€_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-ฮณ}$ and the strength of this dependence grows with the learners risk-sensitivity $|ฮฒ|$. We also provide two lower bounds which shows that exponential dependence on $|ฮฒ|\frac{1}{1-ฮณ}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in $\varepsilon$ and $ฮด$ and that the PAC-bound on $Q$-learning is tight in the number of actions $A$, and that the PAC-bound on policy-learning is nearly tight in $A$.


WoMAP: World Models For Embodied Open-Vocabulary Object Localization

arXiv.org Artificial Intelligence

Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that: (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense rewards signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and rewards prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a broad range of zero-shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.


Supervised Optimism Correction: Be Confident When LLMs Are Sure

arXiv.org Artificial Intelligence

In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction(SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.


Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games

arXiv.org Artificial Intelligence

Artificial agents are increasingly central to complex interactions and decision-making tasks, yet aligning their behaviors with desired human values remains an open challenge. In this work, we investigate how human-like personality traits influence agent behavior and performance within text-based interactive environments. We introduce PANDA: Personality Adapted Neural Decision Agents, a novel method for projecting human personality traits onto agents to guide their behavior. To induce personality in a text-based game agent, (i) we train a personality classifier to identify what personality type the agent's actions exhibit, and (ii) we integrate the personality profiles directly into the agent's policy-learning pipeline. By deploying agents embodying 16 distinct personality types across 25 text-based games and analyzing their trajectories, we demonstrate that an agent's action decisions can be guided toward specific personality profiles. Moreover, certain personality types, such as those characterized by higher levels of Openness, display marked advantages in performance. These findings underscore the promise of personality-adapted agents for fostering more aligned, effective, and human-centric decision-making in interactive environments.


WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks

arXiv.org Artificial Intelligence

Powered by a large language model (LLM), a web browsing agent operates web browsers in a human-like manner and offers a highly transparent path toward automating a wide range of everyday tasks. As web agents become increasingly capable and demonstrate proficiency in general browsing tasks, a critical question emerges: Can they go beyond general browsing to robustly handle tasks that are tedious and complex, or chores that humans often avoid doing themselves? In this paper, we introduce WebChoreArena, a new fully reproducible benchmark comprising 532 carefully curated tasks designed to extend the scope of WebArena beyond general browsing to more labor-intensive and tedious tasks. WebChoreArena systematically integrates three key challenges: (i) Massive Memory tasks requiring accurate retrieval of large amounts of information in the observations, (ii) Calculation tasks demanding precise mathematical reasoning, and (iii) Long-Term Memory tasks necessitating long-term memory across multiple webpages. Built on top of the fully reproducible and widely adopted four WebArena simulation environments, WebChoreArena ensures strict reproducibility and enables fair, direct comparisons with the established WebArena benchmark, offering key insights into agent progress. Our experimental results demonstrate that as LLMs evolve, represented by GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro, significant improvements in performance are observed on WebChoreArena. These findings suggest that WebChoreArena is well-suited to measure the advancement of state-of-the-art LLMs with greater clarity. Nevertheless, the results also indicate that even with Gemini 2.5 Pro, there remains substantial room for improvement compared to WebArena, highlighting the increased challenges posed by WebChoreArena.


Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

arXiv.org Artificial Intelligence

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive theoretical understanding of this phenomenon remains limited. In this work, we investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability. Taking explorations on a typical sequence model, i.e, Hidden Markov Models, which are fundamental to many language tasks, we observe that: first, lower layers of Transformers focus on extracting feature representations, primarily influenced by neighboring tokens; second, on the upper layers, features become decoupled, exhibiting a high degree of time disentanglement. Building on these empirical insights, we provide theoretical analysis for the expressiveness power of Transformers. Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.


Slow Feature Analysis on Markov Chains from Goal-Directed Behavior

arXiv.org Artificial Intelligence

Slow Feature Analysis is a unsupervised representation learning method that extracts slowly varying features from temporal data and can be used as a basis for subsequent reinforcement learning. Often, the behavior that generates the data on which the representation is learned is assumed to be a uniform random walk. Less research has focused on using samples generated by goal-directed behavior, as commonly the case in a reinforcement learning setting, to learn a representation. In a spatial setting, goal-directed behavior typically leads to significant differences in state occupancy between states that are close to a reward location and far from a reward location. Through the perspective of optimal slow features on ergodic Markov chains, this work investigates the effects of these differences on value-function approximation in an idealized setting. Furthermore, three correction routes, which can potentially alleviate detrimental scaling effects, are evaluated and discussed. In addition, the special case of goal-averse behavior is considered.


DriveMind: A Dual-VLM based Reinforcement Learning Framework for Autonomous Driving

arXiv.org Artificial Intelligence

Recent advances in autonomous vehicles have shifted development from rigid pipelines to end-to-end neural policies mapping raw sensor streams directly to control commands [1-3]. While these models offer streamlined architectures and strong benchmark performance, they raise critical deployment concerns. Their internal logic is opaque, complicating validation in safety-critical settings. They struggle to generalize to rare events like severe weather or infrastructure damage and lack formal guarantees on kinematic properties such as speed limits and lane-keeping. Further, they provide no natural interface for human oversight or explanation. These challenges motivate frameworks that combine deep network expressiveness with transparency, robustness, and provable safety. Meanwhile, Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated human-level reasoning and visual grounding [4-6]. Recent works like VLM-SR (Shaped Rewards) [7], VLM-RM (Reward Models) [8], and RoboCLIP (Language-Conditioned Robot Learning via Contrastive Language-Image Pretraining) [9] inject semantic feedback into Reinforcement Learning (RL), but rely on static prompts unsuited to evolving road conditions and overlook vehicle dynamics.


Learning Juntas under Markov Random Fields

arXiv.org Artificial Intelligence

We give an algorithm for learning $O(\log n)$ juntas in polynomial-time with respect to Markov Random Fields (MRFs) in a smoothed analysis framework where only the external field has been randomly perturbed. This is a broad generalization of the work of Kalai and Teng, who gave an algorithm that succeeded with respect to smoothed product distributions (i.e., MRFs whose dependency graph has no edges). Our algorithm has two phases: (1) an unsupervised structure learning phase and (2) a greedy supervised learning algorithm. This is the first example where algorithms for learning the structure of an undirected graphical model lead to provably efficient algorithms for supervised learning.