Goto

Collaborating Authors

 Restelli, Marcello


Scalable Multi-Agent Offline Reinforcement Learning and the Role of Information

arXiv.org Artificial Intelligence

Offline Reinforcement Learning (RL) focuses on learning policies solely from a batch of previously collected data. of- fering the potential to leverage such datasets effectively without the need for costly or risky active exploration. While recent advances in Offline Multi-Agent RL (MARL) have shown promise, most existing methods either rely on large datasets jointly collected by all agents or agent-specific datasets collected independently. The former approach ensures strong performance but raises scalability concerns, while the latter emphasizes scalability at the expense of performance guarantees. In this work, we propose a novel scalable routine for both dataset collection and offline learning. Agents first collect diverse datasets coherently with a pre-specified information-sharing network and subsequently learn coherent localized policies without requiring either full observability or falling back to complete decentralization. We theoretically demonstrate that this structured approach allows a multi-agent extension of the seminal Fitted Q-Iteration (FQI) algorithm to globally converge, in high probability, to near-optimal policies. The convergence is subject to error terms that depend on the informativeness of the shared information. Furthermore, we show how this approach allows to bound the inherent error of the supervised-learning phase of FQI with the mutual information between shared and unshared information. Our algorithm, SCAlable Multi-agent FQI (SCAM-FQI), is then evaluated on a distributed decision-making problem. The empirical results align with our theoretical findings, supporting the effectiveness of SCAM-FQI in achieving a balance between scalability and policy performance.


Towards Principled Multi-Agent Task Agnostic Exploration

arXiv.org Artificial Intelligence

In reinforcement learning, we typically refer to task-agnostic exploration when we aim to explore the environment without access to the task specification a priori. In a single-agent setting the problem has been extensively studied and mostly understood. A popular approach cast the task-agnostic objective as maximizing the entropy of the state distribution induced by the agent's policy, from which principles and methods follows. In contrast, little is known about task-agnostic exploration in multi-agent settings, which are ubiquitous in the real world. How should different agents explore in the presence of others? In this paper, we address this question through a generalization to multiple agents of the problem of maximizing the state distribution entropy. First, we investigate alternative formulations, highlighting respective positives and negatives. Then, we present a scalable, decentralized, trust-region policy search algorithm to address the problem in practical settings. Finally, we provide proof of concept experiments to both corroborate the theoretical findings and pave the way for task-agnostic exploration in challenging multi-agent settings.


Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models

arXiv.org Machine Learning

Reinforcement Learning (RL) (Sutton and Barto, We tackle average-reward infinite-horizon 2018) tackles the sequential decision-making problem POMDPs with an unknown transition model of an agent interacting with an unknown or partially but a known observation model, a setting known environment with the goal of maximizing the that has been previously addressed in two long-term sum of rewards. The RL agent should tradeoff limiting ways: (i) frequentist methods relying between exploring the environment to learn its on suboptimal stochastic policies having structure and exploiting the estimates to compute a a minimum probability of choosing each action, policy that maximizes the reward. This problem has and (ii) Bayesian approaches employing been successfully addressed in past works under the the optimal policy class but requiring MDP formulation (Bartlett and Tewari, 2009; Jaksch strong assumptions about the consistency et al., 2010; Zanette and Brunskill, 2019). MDPs assume of employed estimators. Our work removes full observability of the state space but this assumption these limitations by proving convenient estimation is often violated in many real-world scenarios guarantees for the transition model such as robotics or finance, where only a partial and introducing an optimistic algorithm that observation of the environment is available. In this leverages the optimal class of deterministic case, it is more appropriate to model the problem using belief-based policies. We introduce modifications Partially-Observable MDPs (Sondik, 1978).


A parametric algorithm is optimal for non-parametric regression of smooth functions

arXiv.org Artificial Intelligence

We address the regression problem for a general function $f:[-1,1]^d\to \mathbb R$ when the learner selects the training points $\{x_i\}_{i=1}^n$ to achieve a uniform error bound across the entire domain. In this setting, known historically as nonparametric regression, we aim to establish a sample complexity bound that depends solely on the function's degree of smoothness. Assuming periodicity at the domain boundaries, we introduce PADUA, an algorithm that, with high probability, provides performance guarantees optimal up to constant or logarithmic factors across all problem parameters. Notably, PADUA is the first parametric algorithm with optimal sample complexity for this setting. Due to this feature, we prove that, differently from the non-parametric state of the art, PADUA enjoys optimal space complexity in the prediction phase. To validate these results, we perform numerical experiments over functions coming from real audio data, where PADUA shows comparable performance to state-of-the-art methods, while requiring only a fraction of the computational time.


Statistical Analysis of Policy Space Compression Problem

arXiv.org Machine Learning

Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. However, the complexity of exploring vast policy spaces can lead to significant inefficiencies. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness. Our research focuses on determining the necessary sample size to learn this compressed set accurately. We employ R\'enyi divergence to measure the similarity between true and estimated policy distributions, establishing error bounds for good approximations. To simplify the analysis, we employ the $l_1$ norm, determining sample size requirements for both model-based and model-free settings. Finally, we correlate the error bounds from the $l_1$ norm with those from R\'enyi divergence, distinguishing between policies near the vertices and those in the middle of the policy space, to determine the lower and upper bounds for the required sample sizes.


A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust, Reliable, and Safe Learning Techniques for Real-world Robotics

arXiv.org Artificial Intelligence

Machine learning methods have a groundbreaking impact in many application domains, but their application on real robotic platforms is still limited. Despite the many challenges associated with combining machine learning technology with robotics, robot learning remains one of the most promising directions for enhancing the capabilities of robots. When deploying learning-based approaches on real robots, extra effort is required to address the challenges posed by various real-world factors. To investigate the key factors influencing real-world deployment and to encourage original solutions from different researchers, we organized the Robot Air Hockey Challenge at the NeurIPS 2023 conference. We selected the air hockey task as a benchmark, encompassing low-level robotics problems and high-level tactics. Different from other machine learning-centric benchmarks, participants need to tackle practical challenges in robotics, such as the sim-to-real gap, low-level control issues, safety problems, real-time requirements, and the limited availability of real-world data. Furthermore, we focus on a dynamic environment, removing the typical assumption of quasi-static motions of other real-world benchmarks. The competition's results show that solutions combining learning-based approaches with prior knowledge outperform those relying solely on data when real-world deployment is challenging. Our ablation study reveals which real-world factors may be overlooked when building a learning-based solution. The successful real-world air hockey deployment of best-performing agents sets the foundation for future competitions and follow-up research directions.


Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs

arXiv.org Artificial Intelligence

Achieving the no-regret property for Reinforcement Learning (RL) problems in continuous state and action-space environments is one of the major open problems in the field. Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes. Furthermore, many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret, which makes any possible solution unfeasible in practice. In this paper, we identify local linearity as the feature that makes Markov Decision Processes (MDPs) both learnable (sublinear regret) and feasible (regret that is polynomial in $H$). We define a novel MDP representation class, namely Locally Linearizable MDPs, generalizing other representation classes like Linear MDPs and MDPS with low inherent Belmman error. Then, i) we introduce Cinderella, a no-regret algorithm for this general representation class, and ii) we show that all known learnable and feasible MDP families are representable in this class. We first show that all known feasible MDPs belong to a family that we call Mildly Smooth MDPs. Then, we show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation. This way, Cinderella is shown to achieve state-of-the-art regret bounds for all previously known (and some new) continuous MDPs for which RL is learnable and feasible.


Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

arXiv.org Artificial Intelligence

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., \emph{truncated}. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator's error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.


Exploiting Risk-Aversion and Size-dependent fees in FX Trading with Fitted Natural Actor-Critic

arXiv.org Artificial Intelligence

In recent years, the popularity of artificial intelligence has surged due to its widespread application in various fields. The financial sector has harnessed its advantages for multiple purposes, including the development of automated trading systems designed to interact autonomously with markets to pursue different aims. In this work, we focus on the possibility of recognizing and leveraging intraday price patterns in the Foreign Exchange market, known for its extensive liquidity and flexibility. Our approach involves the implementation of a Reinforcement Learning algorithm called Fitted Natural Actor-Critic. This algorithm allows the training of an agent capable of effectively trading by means of continuous actions, which enable the possibility of executing orders with variable trading sizes. This feature is instrumental to realistically model transaction costs, as they typically depend on the order size. Furthermore, it facilitates the integration of risk-averse approaches to induce the agent to adopt more conservative behavior. The proposed approaches have been empirically validated on EUR-USD historical data.


Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting

arXiv.org Machine Learning

Dealing with Partially Observable Markov Decision Processes is notably a challenging task. We face an average-reward infinite-horizon POMDP setting with an unknown transition model, where we assume the knowledge of the observation model. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. Then, we propose the OAS-UCRL algorithm that implicitly balances the exploration-exploitation trade-off following the $\textit{optimism in the face of uncertainty}$ principle. The algorithm runs through episodes of increasing length. For each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples that will be used in the next episode by the OAS estimation procedure to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm. We compare against the oracle playing the optimal stochastic belief-based policy and show the efficient scaling of our approach with respect to the dimensionality of the state, action, and observation space. We finally conduct numerical simulations to validate and compare the proposed technique with other baseline approaches.