product mdp
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- (11 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Reinforcement Learning with $\omega$-Regular Objectives and Constraints
Wagner, Dominik, Witzman, Leon, Ong, Luke
Reinforcement learning (RL) commonly relies on scalar rewards, which have limited ability to express temporal, conditional, or safety-critical goals and can lead to reward hacking. Temporal logic, which is expressible via the more general class of $\omega$-regular objectives, addresses this by precisely specifying rich behavioural properties. Even so, measuring performance by a single scalar (be it reward or satisfaction probability) masks the safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining $\omega$-regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming which, in the limit, produces a policy maximising the probability of satisfying an $\omega$-regular objective while also adhering to $\omega$-regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees.
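As a concrete illustration of the linear-programming machinery this abstract alludes to, here is a minimal sketch of the classic occupancy-measure LP for a constrained MDP. It uses a discounted surrogate on a made-up two-state MDP rather than the paper's product construction with $\omega$-regular automata; the dynamics, rewards, costs, and budget are all illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.95
# Two states, two actions: action 0 stays put, action 1 jumps to the other state.
P = np.zeros((2, 2, 2))                       # P[s, a, s']
P[0, 0] = [1, 0]; P[0, 1] = [0, 1]
P[1, 0] = [0, 1]; P[1, 1] = [1, 0]
reward = np.array([[0.0, 0.0], [1.0, 1.0]])   # objective: occupancy of state 1
cost = np.array([[0.0, 0.0], [1.0, 1.0]])     # constraint charges that same occupancy
mu0 = np.array([1.0, 0.0])                    # start in state 0
budget = 10.0                                 # threshold on expected discounted cost

n_s, n_a = 2, 2
# Flow conservation for occupancy variables x(s, a):
#   sum_a x(s', a) - gamma * sum_{s, a} P(s' | s, a) x(s, a) = mu0(s')
A_eq = np.zeros((n_s, n_s * n_a))
for s in range(n_s):
    for a in range(n_a):
        j = s * n_a + a
        A_eq[s, j] += 1.0
        A_eq[:, j] -= gamma * P[s, a]

res = linprog(-reward.ravel(),                # maximize reward occupancy
              A_ub=cost.ravel()[None, :], b_ub=[budget],
              A_eq=A_eq, b_eq=mu0, bounds=(0, None))
x = res.x.reshape(n_s, n_a)
policy = x / x.sum(axis=1, keepdims=True)     # randomized policy from occupancies
print("objective:", reward.ravel() @ res.x)
print("policy:\n", policy)
```

With these numbers the constraint binds: unconstrained, the agent would accumulate discounted occupancy $\gamma/(1-\gamma) = 19$ in the rewarding state, so the LP instead returns a randomized policy whose objective value equals the budget of 10, exactly the kind of safety-performance trade-off the abstract describes.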
- Europe > Austria > Vienna (0.14)
- Asia > Singapore (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (6 more...)
Adaptive Bi-Level Multi-Robot Task Allocation and Learning under Uncertainty with Temporal Logic Constraints
This work addresses the problem of multi-robot coordination under unknown robot transition models, ensuring that tasks specified by Time Window Temporal Logic are satisfied with user-defined probability thresholds. We present a bi-level framework that integrates (i) high-level task allocation, where tasks are assigned based on the robots' estimated task completion probabilities and expected rewards, and (ii) low-level distributed policy learning and execution, where robots independently optimize auxiliary rewards while fulfilling their assigned tasks. To handle uncertainty in robot dynamics, our approach leverages real-time task execution data to iteratively refine expected task completion probabilities and rewards, enabling adaptive task allocation without explicit robot transition models. We theoretically validate the proposed algorithm, demonstrating that the task assignments meet the desired probability thresholds with high confidence. Finally, we demonstrate the effectiveness of our framework through comprehensive simulations.
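The adaptation loop described here can be made concrete with a small sketch. The names, the Laplace-smoothed estimator, and the epsilon-greedy assignment below are all illustrative stand-ins, not the paper's algorithm: completion-probability estimates are refined from execution outcomes, and tasks go to the robot whose estimate best supports the required threshold.

```python
import random
from collections import defaultdict

robots = ["r1", "r2"]
tasks = {"t1": 0.9, "t2": 0.6}                     # task -> required success probability
true_p = {("r1", "t1"): 0.95, ("r1", "t2"): 0.5,   # unknown ground truth, made up
          ("r2", "t1"): 0.70, ("r2", "t2"): 0.8}
succ, trials = defaultdict(int), defaultdict(int)

def estimate(r, t):
    # Laplace-smoothed empirical completion probability (Beta(1, 1) prior).
    return (succ[r, t] + 1) / (trials[r, t] + 2)

random.seed(0)
for episode in range(300):
    for t in tasks:
        # Mostly exploit the current estimates; explore occasionally so that
        # every robot-task pair keeps being refined from execution data.
        r = (random.choice(robots) if random.random() < 0.1
             else max(robots, key=lambda rb: estimate(rb, t)))
        trials[r, t] += 1
        succ[r, t] += random.random() < true_p[r, t]

for t, threshold in tasks.items():
    r = max(robots, key=lambda rb: estimate(rb, t))
    print(f"{t} -> {r}, estimate {estimate(r, t):.2f}, "
          f"meets threshold: {estimate(r, t) >= threshold}")
```

No transition model appears anywhere in the sketch, which mirrors the abstract's point: only observed task executions drive the reallocation.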
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Republic of Türkiye > Aksaray Province > Aksaray (0.05)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
- Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.88)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.68)
Probabilistic Satisfaction of Temporal Logic Constraints in Reinforcement Learning via Adaptive Policy-Switching
Lin, Xiaoshan, Yüksel, Sadık Bera, Yazıcıoğlu, Yasin, Aksaray, Derya
Constrained Reinforcement Learning (CRL) is a subset of machine learning that introduces constraints into the traditional reinforcement learning (RL) framework. Unlike conventional RL, which aims solely to maximize cumulative rewards, CRL incorporates additional constraints that represent specific mission requirements or limitations the agent must comply with during the learning process. In this paper, we address a type of CRL problem where an agent aims to learn the optimal policy to maximize reward while ensuring a desired level of temporal logic constraint satisfaction throughout the learning process. We propose a novel framework that relies on switching between pure learning (reward maximization) and constraint satisfaction. The framework estimates the probability of constraint satisfaction from earlier trials and adjusts the probability of switching between the learning and constraint-satisfaction policies accordingly. We theoretically validate the correctness of the proposed algorithm and demonstrate its performance through comprehensive simulations.
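A minimal sketch of the switching idea, with an assumed adaptation rule (the paper's exact scheme and guarantees are not reproduced here): before each episode the agent picks the pure-learning policy with probability p_learn and the constraint-satisfaction policy otherwise, then nudges p_learn based on the running empirical satisfaction rate.

```python
import random

desired, margin = 0.8, 0.05       # required satisfaction level plus slack
sat = episodes = 0
p_learn = 1.0                     # probability of picking the learning policy

def run_episode(use_learning):
    # Stand-in for a real rollout: here the pure-learning policy satisfies
    # the constraint less often than the dedicated satisfaction policy.
    return random.random() < (0.6 if use_learning else 0.97)

random.seed(1)
for _ in range(2000):
    use_learning = random.random() < p_learn
    sat += run_episode(use_learning)
    episodes += 1
    # Adapt the switching probability using the running satisfaction estimate.
    if sat / episodes < desired + margin:
        p_learn = max(0.0, p_learn - 0.01)
    else:
        p_learn = min(1.0, p_learn + 0.01)

print(f"satisfaction rate {sat / episodes:.3f}, final p_learn {p_learn:.2f}")
```

The switching probability settles where the mixture of the two policies keeps the empirical rate near the desired level, leaving as much room as possible for reward-maximizing exploration.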
- Asia > Middle East > Republic of Türkiye > Aksaray Province > Aksaray (0.06)
- North America > United States > Minnesota (0.04)
- North America > United States > Massachusetts (0.04)
Reinforcement Learning with LTL and $\omega$-Regular Objectives via Optimality-Preserving Translation to Average Rewards
Le, Xuan-Bach, Wagner, Dominik, Witzman, Leon, Rabinovich, Alexander, Ong, Luke
Linear temporal logic (LTL) and, more generally, $\omega$-regular objectives are alternatives to the traditional discounted-sum and average-reward objectives in reinforcement learning (RL), offering the advantage of greater comprehensibility and hence explainability. In this work, we study the relationship between these objectives. Our main result is that each RL problem for $\omega$-regular objectives can be reduced to a limit-average reward problem in an optimality-preserving fashion, via (finite-memory) reward machines. Furthermore, we demonstrate the efficacy of this approach by showing that optimal policies for limit-average problems can be found asymptotically by solving a sequence of discounted-sum problems approximately. Consequently, we resolve an open problem: optimal policies for LTL and $\omega$-regular objectives can be learned asymptotically.
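The second ingredient, approximating a limit-average problem by discounted problems as the discount factor tends to 1, is easy to see on a toy MDP. The example below is made up for illustration: a myopic discount prefers the immediate payoff, while discounts close to 1 recover the average-reward-optimal (Blackwell-optimal) policy.

```python
import numpy as np

P = np.zeros((2, 2, 2))                 # P[s, a, s']; action 0 stays, 1 switches
P[0, 0] = [1, 0]; P[0, 1] = [0, 1]
P[1, 0] = [0, 1]; P[1, 1] = [1, 0]
R = np.array([[0.3, 0.0],               # staying in state 0 pays 0.3 now...
              [0.5, 0.0]])              # ...but state 1's loop pays 0.5 forever

for gamma in [0.5, 0.9, 0.99, 0.999]:
    V = np.zeros(2)
    for _ in range(20000):              # value iteration at this discount
        Q = R + gamma * (P @ V)
        V = Q.max(axis=1)
    print(f"gamma={gamma}: greedy policy {Q.argmax(axis=1)}")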
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (10 more...)
- Transportation (0.46)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Directed Exploration in Reinforcement Learning from Linear Temporal Logic
Bagatella, Marco, Krause, Andreas, Martius, Georg
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning, as it allows describing objectives beyond the expressivity of conventional discounted return formulations. Nonetheless, recent works have shown that LTL formulas can be translated into a variable rewarding and discounting scheme, whose optimization produces a policy maximizing a lower bound on the probability of formula satisfaction. However, the synthesized reward signal remains fundamentally sparse, making exploration challenging. We aim to overcome this limitation, which can prevent current algorithms from scaling beyond low-dimensional, short-horizon problems. We show how better exploration can be achieved by further leveraging the LTL specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process, thus enabling a form of high-level value estimation. By taking a Bayesian perspective over LDBA dynamics and proposing a suitable prior distribution, we show that the values estimated through this procedure can be treated as a shaping potential and mapped to informative intrinsic rewards. Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.
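A minimal sketch of the shaping idea: the sketch replaces the paper's Bayesian posterior over LDBA dynamics with a single assumed per-step progress probability, solves the resulting Markov reward process for automaton-state values, and uses those values as a potential. The automaton, the progress prior, and all numbers are illustrative.

```python
import numpy as np

# Three automaton states: q0 -> q1 -> q2, with q2 accepting and absorbing.
progress = 0.2                           # assumed per-step progress probability
T = np.array([[1 - progress, progress, 0.0],
              [0.0, 1 - progress, progress],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])            # reward only in the accepting state
gamma = 0.99
# Markov-reward-process values: phi = (I - gamma T)^{-1} r
phi = np.linalg.solve(np.eye(3) - gamma * T, r)

def intrinsic_reward(q, q_next):
    # Potential-based shaping (Ng et al., 1999): rewarding automaton progress
    # this way leaves the set of optimal policies unchanged.
    return gamma * phi[q_next] - phi[q]

print("potentials:", phi.round(2))
print("bonus for q0 -> q1:", round(intrinsic_reward(0, 1), 2))
```

Because the bonus is a potential difference, it densifies the sparse automaton reward without biasing which policies are optimal, which is exactly why it can serve as an intrinsic exploration signal.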
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Portugal > Braga > Braga (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
On the Uniqueness of Solution for the Bellman Equation of LTL Objectives
Xuan, Zetong, Bozkurt, Alper Kamil, Pajic, Miroslav, Wang, Yu
Surrogate rewards are commonly utilized in planning problems for linear temporal logic (LTL) objectives. In a widely adopted surrogate-reward approach, two discount factors are used to ensure that the expected return approximates the satisfaction probability of the LTL objective. The expected return can then be estimated by methods that use Bellman updates, such as reinforcement learning. However, the uniqueness of the solution to the Bellman equation with two discount factors has not been explicitly discussed. We demonstrate with an example that when one of the discount factors is set to one, as allowed in many previous works, the Bellman equation may have multiple solutions, leading to inaccurate evaluation of the expected return. We then propose a condition for the Bellman equation to have the expected return as the unique solution, requiring the solutions for states inside a rejecting bottom strongly connected component (BSCC) to be 0. We prove that this condition is sufficient by showing that, under it, the solutions for states with discounting can be separated from those for states without discounting.
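The non-uniqueness is easy to check numerically on a three-state chain of the kind the abstract describes. The instance below is illustrative (the surrogate pays 1 - gamma_B in accepting states and discounts them by gamma_B, while the second discount factor is set to one): two different value vectors are both exact fixed points of the Bellman equation, and only pinning the rejecting-BSCC value to 0 singles out the one equal to the satisfaction probability.

```python
import numpy as np

gamma_B = 0.9
T = np.array([[0.0, 0.5, 0.5],   # state 0: transient, splits between the BSCCs
              [0.0, 1.0, 0.0],   # state 1: accepting BSCC (absorbing)
              [0.0, 0.0, 1.0]])  # state 2: rejecting BSCC (absorbing)
accepting = np.array([False, True, False])
r = np.where(accepting, 1 - gamma_B, 0.0)
G = np.where(accepting, gamma_B, 1.0)    # second discount factor set to one

def bellman_residual(V):
    return V - (r + G * (T @ V))

V_true = np.array([0.5, 1.0, 0.0])       # the satisfaction probabilities
V_bogus = np.array([1.5, 1.0, 2.0])      # rejecting-BSCC value not zero
print(bellman_residual(V_true))           # [0. 0. 0.] -- a fixed point
print(bellman_residual(V_bogus))          # [0. 0. 0.] -- so is this one
```

The rejecting state satisfies V = 0 + 1 * V for any value, and that freedom propagates to the transient state, which is precisely the failure mode the paper's zero-on-rejecting-BSCC condition rules out.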
- North America > United States > Florida > Alachua County > Gainesville (0.14)
- North America > United States > North Carolina > Durham County > Durham (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
Learning Task Automata for Reinforcement Learning using Hidden Markov Models
Abate, Alessandro, Almulla, Yousif, Fox, James, Hyland, David, Wooldridge, Michael
Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state "task automata" from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP (both initially unknown), by treating the product MDP as a partially observable MDP and using the well-known Baum-Welch algorithm for learning hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well-suited for use in transfer learning. Finally, we provide experimental results, compared against two baselines, to illustrate our algorithm's performance in different environments and tasks.
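The first insight, fitting the unknown product MDP as a hidden Markov model with Baum-Welch, can be sketched with an off-the-shelf HMM library. Here hmmlearn's CategoricalHMM is assumed as a stand-in for the paper's learner, and the episode data, symbol alphabet, and hidden-state bound are all made up for illustration.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Observation symbols stand in for the high-level propositions the agent
# sees at each step of an episode (3 symbols, 50 episodes of length 20).
episodes = [rng.integers(0, 3, size=20) for _ in range(50)]
X = np.concatenate(episodes).reshape(-1, 1)
lengths = [len(e) for e in episodes]

# Hidden states play the role of product-MDP states (environment state
# crossed with automaton state); 4 is an assumed bound on that product size.
model = hmm.CategoricalHMM(n_components=4, n_iter=100, random_state=0)
model.fit(X, lengths)    # Baum-Welch (EM) over the pooled episodes

print("learned transition matrix:\n", model.transmat_.round(2))
# Distilling the task DFA (the paper's second insight) would then operate
# on these learned product dynamics; that step is omitted here.
```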
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Topological Guided Actor-Critic Modular Learning of Continuous Systems with Temporal Objectives
This work investigates the formal policy synthesis of continuous-state stochastic dynamic systems given high-level specifications in linear temporal logic. To learn an optimal policy that maximizes the satisfaction probability, we take a product between the dynamic system and the translated automaton to construct a product system, on which we solve an optimal planning problem. Since this product system has a hybrid product state space that results in reward sparsity, we introduce a generalized optimal backup order, in reverse of the topological order, to guide the value backups and accelerate the learning process. We prove the optimality of using this generalized backup order in the planning problem. This paper further presents an actor-critic reinforcement learning algorithm for the case where a topological order applies; the algorithm enjoys hyperparameter self-tuning. We provide proofs of the optimality and convergence of our proposed reinforcement learning algorithm. We use neural networks to approximate the value function and policy function over the hybrid product state space. Furthermore, we observe that assigning integer numbers to automaton states can impose an unintended ordinal ranking on the value or policy functions approximated by neural networks. To break this ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, termed modular learning. We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole. Second, we demonstrate the empirical performance of our formal policy synthesis framework on the motion planning of a Dubins car with a temporal specification.
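The modular-learning idea at the end of the abstract admits a short sketch. Tiny linear value heads stand in for the paper's neural networks, and all shapes and the TD rule are illustrative: no approximator ever receives the automaton state's integer id as an input, so no spurious ordering among automaton states is imposed.

```python
import numpy as np

n_automaton_states, obs_dim, lr, gamma = 3, 4, 0.1, 0.99
# One independent linear value head per automaton state q.
W = {q: np.zeros(obs_dim) for q in range(n_automaton_states)}

def value(q, x):
    return W[q] @ x

def td_update(q, x, reward, q_next, x_next):
    # Semi-gradient TD(0) on the module selected by the automaton state:
    # only W[q] moves, so modules for different q never interfere.
    target = reward + gamma * value(q_next, x_next)
    W[q] += lr * (target - value(q, x)) * x

rng = np.random.default_rng(0)
x, x_next = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
td_update(q=0, x=x, reward=1.0, q_next=1, x_next=x_next)
print({q: w.round(3) for q, w in W.items()})
```

Selecting a module by automaton state instead of encoding the state as a number is the whole point: the automaton's labels are categorical, and the dictionary lookup treats them that way.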
- North America > United States > Massachusetts > Worcester County > Worcester (0.04)
- Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)