optimality guarantee
Non-Linear Coordination Graphs
Value decomposition multi-agent reinforcement learning methods learn the global value function as a mixing of each agent's individual utility functions. Coordination graphs (CGs) represent a higher-order decomposition by incorporating pairwise payoff functions and thus is supposed to have a more powerful representational capacity. However, CGs decompose the global value function linearly over local value functions, severely limiting the complexity of the value function class that can be represented. In this paper, we propose the first non-linear coordination graph by extending CG value decomposition beyond the linear case. One major challenge is to conduct greedy action selections in this new function class to which commonly adopted DCOP algorithms are no longer applicable. We study how to solve this problem when mixing networks with LeakyReLU activation are used. An enumeration method with a global optimality guarantee is proposed and motivates an efficient iterative optimization method with a local optimality guarantee. We find that our method can achieve superior performance on challenging multi-agent coordination tasks like MACO.
Practical, Utilitarian Algorithm Configuration
Graham, Devon, Velez, Eros Rojas, Leyton-Brown, Kevin
Utilitarian algorithm configuration identifies a parameter setting for a given algorithm that maximizes a user's utility. Utility functions offer a theoretically well-grounded approach to optimizing decision-making under uncertainty and are flexible enough to capture a user's preferences over algorithm runtimes (e.g., they can describe a sharp cutoff after which a solution is no longer required, a per-hour cost for compute, or diminishing returns from algorithms that take longer to run). COUP is a recently-introduced utilitarian algorithm configuration procedure which was designed mainly to offer strong theoretical guarantees about the quality of the configuration it returns, with less attention paid to its practical performance. This paper closes that gap, bringing theoretically-grounded, utilitarian algorithm configuration to the point where it is competitive with widely used, heuristic configuration procedures that offer no performance guarantees. We present a series of improvements to COUP that improve its empirical performance without degrading its theoretical guarantees and demonstrate their benefit experimentally. Using a case study, we also illustrate ways of exploring the robustness of a given solution to the algorithm selection problem to variations in the utility function.
A Scalable Post-Processing Pipeline for Large-Scale Free-Space Multi-Agent Path Planning with PiBT
Chakravarty, Arjo, Grey, Michael X., Muthugala, M. A. Viraj J., Elara, Mohan Rajesh
Free-space multi-agent path planning remains challenging at large scales. Most existing methods either offer optimality guarantees but do not scale beyond a few dozen agents, or rely on grid-world assumptions that do not generalize well to continuous space. In this work, we propose a hybrid, rule-based planning framework that combines Priority Inheritance with Backtracking (PiBT) with a novel safety-aware path smoothing method. Our approach extends PiBT to 8-connected grids and selectively applies string-pulling based smoothing while preserving collision safety through local interaction awareness and a fallback collision resolution step based on Safe Interval Path Planning (SIPP). This design allows us to reduce overall path lengths while maintaining real-time performance. We demonstrate that our method can scale to over 500 agents in large free-space environments, outperforming existing any-angle and optimal methods in terms of runtime, while producing near-optimal trajectories in sparse domains. Our results suggest this framework is a promising building block for scalable, real-time multi-agent navigation in robotics systems operating beyond grid constraints.
Non-Linear Coordination Graphs
Value decomposition multi-agent reinforcement learning methods learn the global value function as a mixing of each agent's individual utility functions. Coordination graphs (CGs) represent a higher-order decomposition by incorporating pairwise payoff functions and thus is supposed to have a more powerful representational capacity. However, CGs decompose the global value function linearly over local value functions, severely limiting the complexity of the value function class that can be represented. In this paper, we propose the first non-linear coordination graph by extending CG value decomposition beyond the linear case. One major challenge is to conduct greedy action selections in this new function class to which commonly adopted DCOP algorithms are no longer applicable.
Search-Based Path Planning among Movable Obstacles
Ren, Zhongqiang, Suvonov, Bunyod, Chen, Guofei, He, Botao, Liao, Yijie, Fermuller, Cornelia, Zhang, Ji
This paper investigates Path planning Among Movable Obstacles (PAMO), which seeks a minimum cost collision-free path among static obstacles from start to goal while allowing the robot to push away movable obstacles (i.e., objects) along its path when needed. To develop planners that are complete and optimal for PAMO, the planner has to search a giant state space involving both the location of the robot as well as the locations of the objects, which grows exponentially with respect to the number of objects. The main idea in this paper is that, only a small fraction of this giant state space needs to be explored during planning as guided by a heuristic, and most of the objects far away from the robot are intact, which thus leads to runtime efficient algorithms. Based on this idea, this paper introduces two PAMO formulations, i.e., bi-objective and resource constrained problems in an occupancy grid, and develops PAMO*, a search method with completeness and solution optimality guarantees, to solve the two problems. We then further extend PAMO* to hybrid-state PAMO* to plan in continuous spaces with high-fidelity interaction between the robot and the objects. Our results show that, PAMO* can often find optimal solutions within a second in cluttered environments with up to 400 objects.
Collaborative Graph Exploration with Reduced Pose-SLAM Uncertainty via Submodular Optimization
Bai, Ruofei, Yuan, Shenghai, Guo, Hongliang, Yin, Pengyu, Yau, Wei-Yun, Xie, Lihua
This paper considers the collaborative graph exploration problem in GPS-denied environments, where a group of robots are required to cover a graph environment while maintaining reliable pose estimations in collaborative simultaneous localization and mapping (SLAM). Considering both objectives presents challenges for multi-robot pathfinding, as it involves the expensive covariance inference for SLAM uncertainty evaluation, especially considering various combinations of robots' paths. To reduce the computational complexity, we propose an efficient two-stage strategy where exploration paths are first generated for quick coverage, and then enhanced by adding informative and distance-efficient loop-closing actions, called loop edges, along the paths for reliable pose estimation. We formulate the latter problem as a non-monotone submodular maximization problem by relating SLAM uncertainty with pose graph topology, which (1) facilitates more efficient evaluation of SLAM uncertainty than covariance inference, and (2) allows the application of approximation algorithms in submodular optimization to provide optimality guarantees. We further introduce the ordering heuristics to improve objective values while preserving the optimality bound. Simulation experiments over randomly generated graph environments verify the efficiency of our methods in finding paths for quick coverage and enhanced pose graph reliability, and benchmark the performance of the approximation algorithms and the greedy-based algorithm in the loop edge selection problem. Our implementations will be open-source at https://github.com/bairuofei/CGE.
A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback
Kim, Kihyun, Zhang, Jiawei, Ozdaglar, Asuman, Parrilo, Pablo A.
Reward learning involves inferring and shaping the underlying reward function from observed human demonstrations and feedback. Inverse reinforcement learning (IRL) and reinforcement learning from human feedback (RLHF, also known as preference-based reinforcement learning) are key methodologies in reward learning, applied in various sequential decision-making tasks such as games [1-3], robotics [4-6], and language models [7-11]. Particularly in the recent drastic development of large language models (LLMs), RLHF has played a crucial role in fine-tuning models to better align with human preferences [10]. However, despite the notable empirical success of these algorithms, a significant gap remains in the theoretical analysis of IRL and RLHF, limiting us to guarantee their reliability. This work aims to bridge this gap by proposing a novel theoretical framework for offline IRL and RLHF. IRL aims to infer a reward function that aligns with an expert behavior from demonstrations [12, 13]. Typical IRL algorithms employ a bi-level optimization framework within the context of maximum likelihood estimation (MLE). In this framework, the inner optimization evaluates the policy based on the current reward parameters, while the outer optimization updates these parameters to better match observed expert behavior. These algorithms have been extensively explored in the literature [14-18], and their convergence is studied in both online settings [17] and offline settings [18].
Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees
Shao, Daqian, Kwiatkowska, Marta
Linear Temporal Logic (LTL) is widely used to specify high-level objectives for system policies, and it is highly desirable for autonomous systems to learn the optimal policy with respect to such specifications. However, learning the optimal policy from LTL specifications is not trivial. We present a model-free Reinforcement Learning (RL) approach that efficiently learns an optimal policy for an unknown stochastic system, modelled using Markov Decision Processes (MDPs). We propose a novel and more general product MDP, reward structure and discounting mechanism that, when applied in conjunction with off-the-shelf model-free RL algorithms, efficiently learn the optimal policy that maximizes the probability of satisfying a given LTL specification with optimality guarantees. We also provide improved theoretical results on choosing the key parameters in RL to ensure optimality. To directly evaluate the learned policy, we adopt probabilistic model checker PRISM to compute the probability of the policy satisfying such specifications. Several experiments on various tabular MDP environments across different LTL tasks demonstrate the improved sample efficiency and optimal policy convergence.
CONFIG: Constrained Efficient Global Optimization for Closed-Loop Control System Optimization with Unmodeled Constraints
Xu, Wenjie, Jiang, Yuning, Svetozarevic, Bratislav, Jones, Colin N.
In this paper, the CONFIG algorithm, a simple and provably efficient constrained global optimization algorithm, is applied to optimize the closed-loop control performance of an unknown system with unmodeled constraints. Existing Gaussian process based closed-loop optimization methods, either can only guarantee local convergence (e.g., SafeOPT), or have no known optimality guarantee (e.g., constrained expected improvement) at all, whereas the recently introduced CONFIG algorithm has been proven to enjoy a theoretical global optimality guarantee. In this study, we demonstrate the effectiveness of CONFIG algorithm in the applications. The algorithm is first applied to an artificial numerical benchmark problem to corroborate its effectiveness. It is then applied to a classical constrained steady-state optimization problem of a continuous stirred-tank reactor. Simulation results show that our CONFIG algorithm can achieve performance competitive with the popular CEI (Constrained Expected Improvement) algorithm, which has no known optimality guarantee. As such, the CONFIG algorithm offers a new tool, with both a provable global optimality guarantee and competitive empirical performance, to optimize the closed-loop control performance for a system with soft unmodeled constraints. Last, but not least, the open-source code is available as a python package to facilitate future applications.
Wordle-solving state of the art: all optimality results so far -- Laurent's notes
Most mathematical questions one could have about Wordle are settled by now, and a few remain open. I summarize here what is known, as far as I can tell. First, let's clarify a few things about the game: Wordle comes with a dictionary of 12972 words that the player is allowed to use as guesses. They are essentially all 5-letter combinations one could reasonably argue are English words. The "secret" word that the player has to discover is also always in that dictionary.