Goto

Collaborating Authors

 Markov Models


Error Analysis of Triangular Optimal Transport Maps for Filtering

arXiv.org Machine Learning

We present a systematic analysis of estimation errors for a class of optimal transport based algorithms for filtering and data assimilation. Along the way, we extend previous error analyses of Brenier maps to the case of conditional Brenier maps that arise in the context of simulation based inference. We then apply these results in a filtering scenario to analyze the optimal transport filtering algorithm of Al-Jarrah et al. (2024, ICML). An extension of that algorithm along with numerical benchmarks on various non-Gaussian and high-dimensional examples are provided to demonstrate its effectiveness and practical potential.


Joint Optimization of Cooperation Efficiency and Communication Covertness for Target Detection with AUVs

arXiv.org Artificial Intelligence

This paper investigates underwater cooperative target detection using autonomous underwater vehicles (AUVs), with a focus on the critical trade-off between cooperation efficiency and communication covertness. To tackle this challenge, we first formulate a joint trajectory and power control optimization problem, and then present an innovative hierarchical action management framework to solve it. According to the hierarchical formulation, at the macro level, the master AUV models the agent selection process as a Markov decision process and deploys the proximal policy optimization algorithm for strategic task allocation. At the micro level, each selected agent's decentralized decision-making is modeled as a partially observable Markov decision process, and a multi-agent proximal policy optimization algorithm is used to dynamically adjust its trajectory and transmission power based on its local observations. Under the centralized training and decentralized execution paradigm, our target detection framework enables adaptive covert cooperation while satisfying both energy and mobility constraints. By comprehensively modeling the considered system, the involved signals and tasks, as well as energy consumption, theoretical insights and practical solutions for the efficient and secure operation of multiple AUVs are provided, offering significant implications for the execution of underwater covert communication tasks.


Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration

arXiv.org Artificial Intelligence

Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.


Combining Cost-Constrained Runtime Monitors for AI Safety

arXiv.org Artificial Intelligence

Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to efficiently combine multiple runtime monitors into a single monitoring protocol. The protocol's objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors' performance and cost as given, we develop an algorithm to find the best protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a code review setting. We also show that combining two monitors can Pareto dominate using either monitor alone. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior in cost-sensitive settings.


Steering Generative Models with Experimental Data for Protein Fitness Optimization

arXiv.org Artificial Intelligence

Protein fitness optimization involves finding a protein sequence that maximizes desired quantitative properties in a combinatorially large design space of possible sequences. Recent advances in steering protein generative models (e.g., diffusion models and language models) with labeled data offer a promising approach. However, most previous studies have optimized surrogate rewards and/or utilized large amounts of labeled data for steering, making it unclear how well existing methods perform and compare to each other in real-world optimization campaigns where fitness is measured through low-throughput wet-lab assays. In this study, we explore fitness optimization using small amounts (hundreds) of labeled sequence-fitness pairs and comprehensively evaluate strategies such as classifier guidance and posterior sampling for guiding generation from different discrete diffusion models of protein sequences. We also demonstrate how guidance can be integrated into adaptive sequence selection akin to Thompson sampling in Bayesian optimization, showing that plug-and-play guidance strategies offer advantages over alternatives such as reinforcement learning with protein language models. Overall, we provide practical insights into how to effectively steer modern generative models for next-generation protein fitness optimization.


Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

arXiv.org Artificial Intelligence

Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.


StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking

arXiv.org Artificial Intelligence

Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client - mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance - remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives (click and keypress) with no semantic hints; and (ii) tool-assisted control, where higher-level intents can be mapped to primitives by detectors and OCR outputs provide optional textualized observations to ease UI grounding. To mirror human practice, StarBench also includes an ask-or-act diagnostic that measures whether and when agents choose to request brief guidance before proceeding, and how that choice affects subsequent performance. We report reference baselines for contemporary VLMs and a human reference. Results expose sizable gaps in perception-to-control fidelity in the direct regime, while showing that judicious information seeking correlates with improved success, establishing StarBench as a reproducible yardstick for agentic information seeking and multimodal decision-making in real-client play.


Probabilistic Modeling of Intentions in Socially Intelligent LLM Agents

arXiv.org Artificial Intelligence

We present a probabilistic intent modeling framework for large language model (LLM) agents in multi-turn social dialogue. The framework maintains a belief distribution over a partner's latent intentions, initialized from contextual priors and dynamically updated through likelihood estimation after each utterance. The evolving distribution provides additional contextual grounding for the policy, enabling adaptive dialogue strategies under uncertainty. Preliminary experiments in the SOTOPIA environment show consistent improvements: the proposed framework increases the Overall score by 9.0% on SOTOPIA-All and 4.1% on SOTOPIA-Hard compared with the Qwen2.5-7B baseline, and slightly surpasses an oracle agent that directly observes partner intentions. These early results suggest that probabilistic intent modeling can contribute to the development of socially intelligent LLM agents.


PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion

arXiv.org Artificial Intelligence

State-of-the-art perceptive Reinforcement Learning controllers for legged robots either (i) impose oscillator or IK-based gait priors that constrain the action space, add bias to the policy optimization and reduce adaptability across robot morphologies, or (ii) operate "blind", which struggle to anticipate hind-leg terrain, and are brittle to noise. In this paper, we propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL approach that overcomes these limitations by enforcing gait structure purely through reward shaping, thereby reducing inductive bias in policy learning compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase as a cubic Hermite spline that adapts swing height to local heightmap statistics and adds a swing-phase contact penalty, while the policy acts directly in joint space supporting morphology-agnostic deployment. Trained in MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and domain randomization, PGTT achieves the highest success under push disturbances (median +7.5% vs. the next best method) and on discrete obstacles (+9%), with comparable velocity tracking, and converging to an effective policy roughly 2x faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2 using a real-time LiDAR elevation-to-heightmap pipeline, and we report preliminary results on ANYmal-C obtained with the same hyperparameters. These findings indicate that terrain-adaptive, phase-guided reward shaping is a simple and general mechanism for robust perceptive locomotion across platforms.


Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

arXiv.org Artificial Intelligence

The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $γ< 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $γ= 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $γ= 1$) can be replaced with a new object that we call the transient visitation measure.