Markov Models
Multiple Ships Cooperative Navigation and Collision Avoidance using Multi-agent Reinforcement Learning with Communication
In the real world, unmanned surface vehicles (USV) often need to coordinate with each other to accomplish specific tasks. However, achieving cooperative control in multi-agent systems is challenging due to issues such as non-stationarity and partial observability. Recent advancements in Multi-Agent Reinforcement Learning (MARL) provide new perspectives to address these challenges. Therefore, we propose using the multi-agent deep deterministic policy gradient (MADDPG) algorithm with communication to address multiple ships' cooperation problems under partial observability. We developed two tasks based on OpenAI's gym environment: cooperative navigation and cooperative collision avoidance. In these tasks, ships must not only learn effective control strategies but also establish communication protocols with other agents. We analyze the impact of external noise on communication, the effect of inter-agent communication on performance, and the communication patterns learned by the agents. The results demonstrate that our proposed framework effectively addresses cooperative navigation and collision avoidance among multiple vessels, significantly outperforming traditional single-agent algorithms. Agents establish a consistent communication protocol, enabling them to compensate for missing information through shared observations and achieve better coordination.
Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
Porada, Ian, Cheung, Jackie Chi Kit
Challenge sets such as the Winograd Schema Challenge (WSC) are used to benchmark systems' ability to resolve ambiguities in natural language. If one assumes as in existing work that solving a given challenge set is at least as difficult as solving some more general task, then high performance on the challenge set should indicate high performance on the general task overall. However, we show empirically that this assumption of difficulty does not always hold. In particular, we demonstrate that despite the strong performance of prompted language models (LMs) on the WSC and its variants, these same modeling techniques perform relatively poorly at resolving certain pronominal ambiguities attested in OntoNotes and related datasets that are perceived to be easier. Motivated by these findings, we propose a method for ensembling a prompted LM with a supervised, task-specific system that is overall more accurate at resolving pronominal coreference across datasets. Finally, we emphasize that datasets involving the same linguistic phenomenon draw on distinct, but overlapping, capabilities, and evaluating on any one dataset alone does not provide a complete picture of a system's overall capability.
Conformal Prediction: A Data Perspective
Zhou, Xiaofan, Chen, Baiting, Gui, Yu, Cheng, Lu
The recent rapid development of well-designed and powerful machine learning (ML) models has significantly transformed our lives. However, the success of these models is often evaluated based on the accuracy of their predictions, which, while important, is not sufficient in many real-world scenarios. In high-stakes applications, it is equally critical to assess the uncertainty of model outputs. Uncertainty quantification (UQ) has long been a central problem in fields like statistics and ML. Several well-established methods, such as Bayesian inference and resampling techniques, have been widely adopted to address UQ. However, Bayesian posterior intervals are only valid if the parametric assumptions of the model are correctly specified, which may not always be the case in practical applications.
Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics
Chehab, Omar, Korba, Anna, Stromme, Austin, Vacher, Adrien
Sampling from a target distribution ฯ whose density is known up to a normalizing constant is a challenging problem in statistics and machine learning, and is currently the subject of intense interest due to applications in Bayesian statistics [19] and energy-based models in deep learning [68], among other areas. In these settings, the normalizing constant of the target distribution ฯ is typically intractable, and Markov Chain Monte Carlo (MCMC) algorithms [58, 57] are commonly used to generate Markov chains in the ambient space, whose law eventually approximates the target distribution. Among MCMC algorithms, the Unadjusted Langevin Algorithm (ULA), which corresponds to a time discretization of a Langevin diffusion process, has attracted considerable attention due to its simplicity, theoretical grounding, and utility in practice [59, 76, 23, 66]. For example, ULA can be proven to converge quickly when the target distribution ฯ is smooth and strongly log-concave [23]. However, many cases in practice require to sample from distributions which are not log-concave, and indeed potentially even multi-modal [54, 82]. In such settings, the convergence of ULA is governed by functional inequalities which effectively quantify the convexity, or lack thereof, of the target distribution [75]. Nonetheless, truly multi-modal target distributions generally have poor functional inequalities, thus leading to weak convergence guarantees for ULA. This phenomenon is not merely a theoretical artifact, and it is wellknown amongst practitioners that when sampling from multi-modal distributions, algorithms based on ULA can get stuck in local modes and suffer from slow convergence [20]. Tempering or annealing is a popular technique [51, 27, 69] to overcome the deficiencies of ULA and other MCMC methods in the multi-modal setting.
Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.
Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs
We present two elegant solutions for modeling continuous-time dynamics, in a novel model-based reinforcement learning (RL) framework for semi-Markov decision processes (SMDPs), using neural ordinary differential equations (ODEs). Our models accurately characterize continuous-time dynamics and enable us to develop high-performing policies using a small amount of data. We also develop a model-based approach for optimizing time schedules to reduce interaction rates with the environment while maintaining the near-optimal performance, which is not possible for model-free methods. We experimentally demonstrate the efficacy of our methods across various continuous-time domains.
Sample-Efficient Reinforcement Learning of Partially Observable Markov Games
This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs---weakly revealing POMGs---in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies.
Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes
Many real-world problems that require making optimal sequences of decisions under uncertainty involve costs when the agent wishes to obtain information about its environment. We design and analyze algorithms for reinforcement learning (RL) in Action-Contingent Noiselessly Observable MDPs (ACNO-MDPs), a special class of POMDPs in which the agent can choose to either (1) fully observe the state at a cost and then act; or (2) act without any immediate observation information, relying on past observations to infer the underlying state. ACNO-MDPs arise frequently in important real-world application domains like healthcare, in which clinicians must balance the value of information gleaned from medical tests (e.g., blood-based biomarkers) with the costs of gathering that information (e.g., the costs of labor and materials required to administer such tests). We develop a PAC RL algorithm for tabular ACNO-MDPs that provides substantially tighter bounds, compared to generic POMDP-RL algorithms, on the total number of episodes exhibiting worse than near-optimal performance. For continuous-state ACNO-MDPs, we propose a novel method of incorporating observation information that, when coupled with modern RL algorithms, yields significantly faster learning compared to other POMDP-RL algorithms in several simulated environments.
Particle Gibbs for Infinite Hidden Markov Models
Infinite Hidden Markov Models (iHMM's) are an attractive, nonparametric generalization of the classical Hidden Markov Model which can automatically infer the number of hidden states in the system. However, due to the infinite-dimensional nature of the transition dynamics, performing inference in the iHMM is difficult. In this paper, we present an infinite-state Particle Gibbs (PG) algorithm to resample state trajectories for the iHMM. The proposed algorithm uses an efficient proposal optimized for iHMMs, and leverages ancestor sampling to improve the mixing of the standard PG algorithm. Our algorithm demonstrates significant convergence improvements on synthetic and real world data sets.
Bayesian Risk Markov Decision Processes
We consider finite-horizon Markov Decision Processes where parameters, such as transition probabilities, are unknown and estimated from data. The popular distributionally robust approach to addressing the parameter uncertainty can sometimes be overly conservative. In this paper, we propose a new formulation, Bayesian risk Markov decision process (BR-MDP), to address parameter uncertainty in MDPs, where a risk functional is applied in nested form to the expected total cost with respect to the Bayesian posterior distributions of the unknown parameters. The proposed formulation provides more flexible risk attitudes towards parameter uncertainty and takes into account the availability of data in future time stages. To solve the proposed formulation with the conditional value-at-risk (CVaR) risk functional, we propose an efficient approximation algorithm by deriving an analytical approximation of the value function and utilizing the convexity of CVaR.