Farinelli, Alessandro
Curriculum Learning for Safe Mapless Navigation
Marzari, Luca, Corsi, Davide, Marchesini, Enrico, Farinelli, Alessandro
This work investigates the effects of Curriculum Learning (CL)-based approaches on the agent's performance. In particular, we focus on the safety aspect of robotic mapless navigation, comparing over a standard end-to-end (E2E) training strategy. To this end, we present a CL approach that leverages Transfer of Learning (ToL) and fine-tuning in a Unity-based simulation with the Robotnik Kairos as a robotic agent. For a fair comparison, our evaluation considers an equal computational demand for every learning approach (i.e., the same number of interactions and difficulty of the environments) and confirms that our CL-based method that uses ToL outperforms the E2E methodology. In particular, we improve the average success rate and the safety of the trained policy, resulting in 10% fewer collisions in unseen testing scenarios. To further confirm these results, we employ a formal verification tool to quantify the number of correct behaviors of Reinforcement Learning policies over desired specifications.
Benchmarking Safe Deep Reinforcement Learning in Aquatic Navigation
Marchesini, Enrico, Corsi, Davide, Farinelli, Alessandro
We propose a novel benchmark environment for Safe Reinforcement Learning focusing on aquatic navigation. Aquatic navigation is an extremely challenging task due to the non-stationary environment and the uncertainties of the robotic platform, hence it is crucial to consider the safety aspect of the problem, by analyzing the behavior of the trained network to avoid dangerous situations (e.g., collisions). To this end, we consider a value-based and policy-gradient Deep Reinforcement Learning (DRL) and we propose a crossover-based strategy that combines gradient-based and gradient-free DRL to improve sample-efficiency. Moreover, we propose a verification strategy based on interval analysis that checks the behavior of the trained models over a set of desired properties. Our results show that the crossover-based training outperforms prior DRL approaches, while our verification allows us to quantify the number of configurations that violate the behaviors that are described by the properties. Crucially, this will serve as a benchmark for future research in this domain of applications.
Centralizing State-Values in Dueling Networks for Multi-Robot Reinforcement Learning Mapless Navigation
Marchesini, Enrico, Farinelli, Alessandro
We study the problem of multi-robot mapless navigation in the popular Centralized Training and Decentralized Execution (CTDE) paradigm. This problem is challenging when each robot considers its path without explicitly sharing observations with other robots and can lead to non-stationary issues in Deep Reinforcement Learning (DRL). The typical CTDE algorithm factorizes the joint action-value function into individual ones, to favor cooperation and achieve decentralized execution. Such factorization involves constraints (e.g., monotonicity) that limit the emergence of novel behaviors in an individual as each agent is trained starting from a joint action-value. In contrast, we propose a novel architecture for CTDE that uses a centralized state-value network to compute a joint state-value, which is used to inject global state information in the value-based updates of the agents. Consequently, each model computes its gradient update for the weights, considering the overall state of the environment. Our idea follows the insights of Dueling Networks as a separate estimation of the joint state-value has both the advantage of improving sample efficiency, while providing each robot information whether the global state is (or is not) valuable. Experiments in a robotic navigation task with 2 4, and 8 robots, confirm the superior performance of our approach over prior CTDE methods (e.g., VDN, QMIX).
Rule-based Shielding for Partially Observable Monte-Carlo Planning
Mazzi, Giulio, Castellini, Alberto, Farinelli, Alessandro
Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm able to generate approximate policies for large Partially Observable Markov Decision Processes. The online nature of this method supports scalability by avoiding complete policy representation. The lack of an explicit representation however hinders policy interpretability and makes policy verification very complex. In this work, we propose two contributions. The first is a method for identifying unexpected actions selected by POMCP with respect to expert prior knowledge of the task. The second is a shielding approach that prevents POMCP from selecting unexpected actions. The first method is based on Satisfiability Modulo Theory (SMT). It inspects traces (i.e., sequences of belief-action-observation triplets) generated by POMCP to compute the parameters of logical formulas about policy properties defined by the expert. The second contribution is a module that uses online the logical formulas to identify anomalous actions selected by POMCP and substitutes those actions with actions that satisfy the logical formulas fulfilling expert knowledge. We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to velocity regulation in mobile robot navigation. Results show that the shielded POMCP outperforms the standard POMCP in a case study in which a wrong parameter of POMCP makes it select wrong actions from time to time. Moreover, we show that the approach keeps good performance also if the parameters of the logical formula are optimized using trajectories containing some wrong actions.
Identification of Unexpected Decisions in Partially Observable Monte-Carlo Planning: a Rule-Based Approach
Mazzi, Giulio, Castellini, Alberto, Farinelli, Alessandro
Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm able to generate approximate policies for large Partially Observable Markov Decision Processes. The online nature of this method supports scalability by avoiding complete policy representation. The lack of an explicit representation however hinders interpretability. In this work, we propose a methodology based on Satisfiability Modulo Theory (SMT) for analyzing POMCP policies by inspecting their traces, namely sequences of belief-action-observation triplets generated by the algorithm. The proposed method explores local properties of policy behavior to identify unexpected decisions. We propose an iterative process of trace analysis consisting of three main steps, i) the definition of a question by means of a parametric logical formula describing (probabilistic) relationships between beliefs and actions, ii) the generation of an answer by computing the parameters of the logical formula that maximize the number of satisfied clauses (solving a MAX-SMT problem), iii) the analysis of the generated logical formula and the related decision boundaries for identifying unexpected decisions made by POMCP with respect to the original question. We evaluate our approach on Tiger, a standard benchmark for POMDPs, and a real-world problem related to mobile robot navigation. Results show that the approach can exploit human knowledge on the domain, outperforming state-of-the-art anomaly detection methods in identifying unexpected decisions. An improvement of the Area Under Curve up to 47\% has been achieved in our tests.
Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification
Corsi, Davide, Marchesini, Enrico, Farinelli, Alessandro
Groundbreaking successes have been achieved by Deep Reinforcement Learning (DRL) in solving practical decision-making problems. Robotics, in particular, can involve high-cost hardware and human interactions. Hence, scrupulous evaluations of trained models are required to avoid unsafe behaviours in the operational environment. However, designing metrics to measure the safety of a neural network is an open problem, since standard evaluation parameters (e.g., total reward) are not informative enough. In this paper, we present a semi-formal verification approach for decision-making tasks, based on interval analysis, that addresses the computational demanding of previous verification frameworks and design metrics to measure the safety of the models. Our method obtains comparable results over standard benchmarks with respect to formal verifiers, while drastically reducing the computation time. Moreover, our approach allows to efficiently evaluate safety properties for decision-making models in practical applications such as mapless navigation for mobile robots and trajectory generation for manipulators.
A COP Model For Graph-Constrained Coalition Formation
Bistaffa, Filippo, Farinelli, Alessandro
We consider Graph-Constrained Coalition Formation (GCCF), a widely studied subproblem of coalition formation in which the set of valid coalitions is restricted by a graph. We propose COP-GCCF, a novel approach that models GCCF as a COP, and we solve such COP with a highly-parallel approach based on Bucket Elimination executed on the GPU, which is able to exploit the high constraint tightness of COP-GCCF. Results show that our approach outperforms state of the art algorithms (i.e., DyCE and IDPG) by at least one order of magnitude on realistic graphs, i.e., a crawl of the Twitter social graph, both in terms of runtime and memory.
Algorithms for Graph-Constrained Coalition Formation in the Real World
Bistaffa, Filippo, Farinelli, Alessandro, Cerquides, Jesús, Rodríguez-Aguilar, Juan A., Ramchurn, Sarvapali D.
Coalition formation typically involves the coming together of multiple, heterogeneous, agents to achieve both their individual and collective goals. In this paper, we focus on a special case of coalition formation known as Graph-Constrained Coalition Formation (GCCF) whereby a network connecting the agents constrains the formation of coalitions. We focus on this type of problem given that in many real-world applications, agents may be connected by a communication network or only trust certain peers in their social network. We propose a novel representation of this problem based on the concept of edge contraction, which allows us to model the search space induced by the GCCF problem as a rooted tree. Then, we propose an anytime solution algorithm (CFSS), which is particularly efficient when applied to a general class of characteristic functions called $m+a$ functions. Moreover, we show how CFSS can be efficiently parallelised to solve GCCF using a non-redundant partition of the search space. We benchmark CFSS on both synthetic and realistic scenarios, using a real-world dataset consisting of the energy consumption of a large number of households in the UK. Our results show that, in the best case, the serial version of CFSS is 4 orders of magnitude faster than the state of the art, while the parallel version is 9.44 times faster than the serial version on a 12-core machine. Moreover, CFSS is the first approach to provide anytime approximate solutions with quality guarantees for very large systems of agents (i.e., with more than 2700 agents).
Sharing Rides with Friends: A Coalition Formation Algorithm for Ridesharing
Bistaffa, Filippo (University of Verona) | Farinelli, Alessandro (University of Verona) | Ramchurn, Sarvapali D. (University of Southampton)
We consider the Social Ridesharing (SR) problem, where a set of commuters, connected through a social network, arrange one-time rides at short notice. In particular, we focus on the associated optimisation problem of forming cars to minimise the travel cost of the overall system modelling such problem as a graph constrained coalition formation (GCCF) problem, where the set of feasible coalitions is restricted by a graph (i.e., the social network). Moreover, we significantly extend the state of the art algorithm for GCCF, i.e., the CFSS algorithm, to solve our GCCF model of the SR problem. Our empirical evaluation uses a real dataset for both spatial (GeoLife) and social data (Twitter), to validate the applicability of our approach in a realistic application scenario. Empirical results show that our approach computes optimal solutions for systems of medium scale (up to 100 agents) providing significant cost reductions (up to -36.22%). Moreover, we can provide approximate solutions for very large systems (i.e., up to 2000 agents) and good quality guarantees (i.e., with an approximation ratio of 1.41 in the worst case) within minutes (i.e., 100 seconds).
Worst-case bounds on the quality of max-product fixed-points
Vinyals, Meritxell, Cerquides, Jes\', us, Farinelli, Alessandro, Rodríguez-aguilar, Juan A.
We study worst-case bounds on the quality of any fixed point assignment of the max-product algorithm for Markov Random Fields (MRF). We start proving a bound independent of the MRF structure and parameters. Afterwards, we show how this bound can be improved for MRFs with particular structures such as bipartite graphs or grids. Our results provide interesting insight into the behavior of max-product. For example, we prove that max-product provides very good results (at least 90% of the optimal) on MRFs with large variable-disjoint cycles (MRFs in which all cycles are variable-disjoint, namely that they do not share any edge and in which each cycle contains at least 20 variables).