Metelli, Alberto Maria
A Refined Analysis of UCBVI
Drago, Simone, Mussi, Marco, Metelli, Alberto Maria
In this work, we provide a refined analysis of the UCBVI algorithm (Azar et al., 2017), improving both the bonus terms and the regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation demonstrates that improving the multiplicative constants in the bounds has significant positive effects on the empirical performance of the algorithms.
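To make the role of the bonus terms concrete, the following is a minimal sketch of a generic count-based, Hoeffding-style bonus and of the optimistic Bellman backup used by UCBVI-like algorithms; the function names, the logarithmic term, and the constants are illustrative assumptions and do not reproduce the refined bonuses derived in the paper.

import numpy as np

def exploration_bonus(visit_counts, horizon, n_states, n_actions, total_steps, delta=0.05):
    # Generic Hoeffding-style bonus, roughly H * sqrt(log(S*A*H*T / delta) / N(s, a)).
    # Illustrative only: the paper's contribution is a sharper (smaller-constant) bonus.
    log_term = np.log(n_states * n_actions * horizon * max(total_steps, 1) / delta)
    return horizon * np.sqrt(log_term / np.maximum(visit_counts, 1.0))

def optimistic_backup(r_hat, p_hat, v_next, bonus, horizon):
    # One step of optimistic value iteration: empirical Bellman backup plus bonus,
    # clipped to the largest achievable value H.
    # Shapes (assumed): r_hat, bonus -> (S, A); p_hat -> (S, A, S); v_next -> (S,).
    q = r_hat + p_hat @ v_next + bonus
    return np.minimum(q, horizon)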
Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models
Russo, Alessio, Metelli, Alberto Maria, Restelli, Marcello
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications
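Because the observation model is known, the belief over hidden states can be updated in closed form after every interaction; the snippet below is a minimal, generic Bayesian belief filter of the kind such belief-based policies rely on. Array names and shapes are illustrative and are not the paper's notation.

import numpy as np

def belief_update(belief, action, obs, T_hat, O):
    # belief: (S,) current belief over hidden states
    # T_hat:  (A, S, S) estimated transition model, T_hat[a, s, s'] = P(s' | s, a)
    # O:      (S, Z) known observation model, O[s', z] = P(z | s')
    predicted = belief @ T_hat[action]      # predictive distribution over next states
    unnormalized = predicted * O[:, obs]    # weight by the observation likelihood
    return unnormalized / unnormalized.sum()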
On the Partial Identifiability in Reward Learning: Choosing the Best Reward
Lazzati, Filippo, Metelli, Alberto Maria
When the feedback is not informative enough, the target reward is only partially identifiable, i.e., there only exists a set of rewards (the feasible set) that are equally compatible with the feedback. In this paper, we show that there exists a choice of reward, not necessarily contained in the feasible set, that, depending on the ReL application, improves the performance w.r.t.
Rising Rested Bandits: Lower Bounds and Efficient Algorithms
Fiandri, Marco, Metelli, Alberto Maria, Trovò, Francesco
This paper lies in the field of stochastic Multi-Armed Bandits (MABs), i.e., sequential selection techniques that learn online using only the feedback given by the chosen option (a.k.a. $arm$). We study a particular case of rested bandits in which the arms' expected reward is monotonically non-decreasing and concave in the number of pulls. We study the inherent sample complexity of the regret minimization problem by deriving suitable regret lower bounds. Then, we design an algorithm for the rested case, $\textit{R-ed-UCB}$, providing a regret bound that depends on the properties of the instance and, under certain circumstances, is of order $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithm with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem on a real-world dataset.
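In symbols, writing $\mu_i(n)$ for the expected reward of arm $i$ at its $n$-th pull (a notation introduced here purely for illustration), the rested rising assumption of non-decreasing and concave expected rewards reads:

$$
\mu_i(n+1) \ge \mu_i(n),
\qquad
\mu_i(n+2) - \mu_i(n+1) \le \mu_i(n+1) - \mu_i(n),
\qquad \forall\, i, \ \forall\, n \ge 1 .
$$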
Statistical Analysis of Policy Space Compression Problem
Molaei, Majid, Restelli, Marcello, Metelli, Alberto Maria, Papini, Matteo
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems. However, the complexity of exploring vast policy spaces can lead to significant inefficiencies. Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process. This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness. Our research focuses on determining the necessary sample size to learn this compressed set accurately. We employ R\'enyi divergence to measure the similarity between true and estimated policy distributions, establishing error bounds for good approximations. To simplify the analysis, we employ the $l_1$ norm, determining sample size requirements for both model-based and model-free settings. Finally, we correlate the error bounds from the $l_1$ norm with those from R\'enyi divergence, distinguishing between policies near the vertices and those in the middle of the policy space, to determine the lower and upper bounds for the required sample sizes.
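For reference, the Rényi divergence of order $\alpha$ between two discrete distributions $P$ and $Q$, the quantity used above to compare true and estimated policy distributions, takes the standard form below; the specific order $\alpha$ and the estimation procedure adopted in the paper are not reproduced here.

$$
D_{\alpha}(P \,\|\, Q) \;=\; \frac{1}{\alpha - 1} \log \sum_{x} P(x)^{\alpha} \, Q(x)^{1-\alpha},
\qquad \alpha > 0, \ \alpha \neq 1 .
$$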
Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs
Maran, Davide, Metelli, Alberto Maria, Papini, Matteo, Restelli, Marcello
Achieving the no-regret property for Reinforcement Learning (RL) problems in continuous state and action space environments is one of the major open problems in the field. Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes. Furthermore, many structural assumptions are known to suffer from a provably unavoidable exponential dependence of the regret on the time horizon $H$, which makes any possible solution infeasible in practice. In this paper, we identify local linearity as the feature that makes Markov Decision Processes (MDPs) both learnable (sublinear regret) and feasible (regret that is polynomial in $H$). We define a novel MDP representation class, namely Locally Linearizable MDPs, generalizing other representation classes like Linear MDPs and MDPs with low inherent Bellman error. Then, (i) we introduce Cinderella, a no-regret algorithm for this general representation class, and (ii) we show that all known learnable and feasible MDP families are representable in this class. We first show that all known feasible MDPs belong to a family that we call Mildly Smooth MDPs. Then, we show how any Mildly Smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation. In this way, Cinderella is shown to achieve state-of-the-art regret bounds for all previously known (and some new) continuous MDPs for which RL is learnable and feasible.
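As a point of reference for the representation classes being generalized, the standard Linear MDP condition (Jin et al., 2020) requires a known feature map $\varphi$ under which transitions and rewards factorize linearly, as below; Locally Linearizable MDPs relax this global requirement to a local one, whose exact definition is given in the paper and not reproduced here.

$$
P(s' \mid s, a) \;=\; \langle \varphi(s, a), \, \mu(s') \rangle,
\qquad
r(s, a) \;=\; \langle \varphi(s, a), \, \theta \rangle .
$$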
Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach
Poiani, Riccardo, Nobili, Nicole, Metelli, Alberto Maria, Restelli, Marcello
Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths, i.e., \emph{truncated}. Specifically, this surrogate shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially can allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the error of the final estimate. Building on these findings, we present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO). The main intuition behind RIDO is to split the available interaction budget into mini-batches. At each round, the agent determines the most convenient schedule of trajectories that minimizes an empirical and robust version of the surrogate of the estimator's error. After discussing the theoretical properties of our method, we conclude by assessing its performance across multiple domains. Our results show that RIDO can adapt its trajectory schedule toward timesteps where more sampling is required to increase the quality of the final estimation.
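The sketch below illustrates, in deliberately simplified form, the budget-splitting idea behind RIDO: a mini-batch of transitions is allocated across timesteps proportionally to a per-timestep error estimate and then converted into a feasible schedule of truncated trajectories. The error estimate, the proportional rule, and the names are placeholder assumptions; the actual algorithm minimizes an empirical, robust surrogate of the estimator's error.

import numpy as np

def allocate_minibatch(per_step_error, minibatch_budget):
    # Toy allocation: spend the mini-batch budget of transitions proportionally to a
    # rough per-timestep error estimate (stand-in for RIDO's robust MSE surrogate).
    weights = per_step_error / per_step_error.sum()
    coverage = np.floor(weights * minibatch_budget).astype(int)
    # A trajectory truncated at length h covers timesteps 1..h, so the number of
    # trajectories covering timestep h must be non-increasing in h: enforce it.
    coverage = np.minimum.accumulate(coverage)
    # n_h = number of trajectories of exact length h (last entry keeps its coverage).
    schedule = coverage - np.append(coverage[1:], 0)
    return schedule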
Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting
Russo, Alessio, Metelli, Alberto Maria, Restelli, Marcello
Dealing with Partially Observable Markov Decision Processes is notably a challenging task. We face an average-reward infinite-horizon POMDP setting with an unknown transition model, where we assume the knowledge of the observation model. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. Then, we propose the OAS-UCRL algorithm that implicitly balances the exploration-exploitation trade-off following the $\textit{optimism in the face of uncertainty}$ principle. The algorithm runs through episodes of increasing length. For each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples that will be used in the next episode by the OAS estimation procedure to compute a new estimate of the POMDP parameters. Given the estimated model, an optimization oracle computes the new optimal policy. We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm. We compare against the oracle playing the optimal stochastic belief-based policy and show the efficient scaling of our approach with respect to the dimensionality of the state, action, and observation space. We finally conduct numerical simulations to validate and compare the proposed technique with other baseline approaches.
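The loop below is a schematic rendering of the episode structure described above: plan with the current model estimate, interact with the environment for an episode of increased length, then re-estimate the POMDP parameters from the newly collected samples. The `oracle`, `estimator`, and `env` interfaces, as well as the doubling episode lengths, are placeholder assumptions rather than the algorithm's actual specification.

def oas_ucrl_skeleton(env, oracle, estimator, num_episodes, base_length=100):
    model = estimator.initial_model()                       # initial POMDP estimate
    policy = None
    for k in range(num_episodes):
        policy = oracle.optimal_belief_based_policy(model)  # planning on the estimate
        episode_length = base_length * 2 ** k               # episodes of increasing length
        samples = env.rollout(policy, episode_length)       # interaction
        model = estimator.update(samples)                   # OAS-style re-estimation
    return model, policy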
Learning Utilities from Demonstrations in Markov Decision Processes
Lazzati, Filippo, Metelli, Alberto Maria
Our goal is to extract useful knowledge from demonstrations of behavior in sequential decision-making problems. Although it is well-known that humans commonly engage in risk-sensitive behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a risk-neutral agent. Beyond introducing model misspecification, these models do not directly capture the risk attitude of the observed agent, which can be crucial in many applications. In this paper, we propose a novel model of behavior in Markov Decision Processes (MDPs) that explicitly represents the agent's risk attitude through a utility function. We then define the Utility Learning (UL) problem as the task of inferring the observed agent's risk attitude, encoded via a utility function, from demonstrations in MDPs, and we analyze the partial identifiability of the agent's utility. Furthermore, we devise two provably efficient algorithms for UL in a finite-data regime, and we analyze their sample complexity. We conclude with proof-of-concept experiments that empirically validate both our model and our algorithms.
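One schematic way to write the kind of objective such a utility-based model induces, with $U$ applied to the cumulative return (whether this matches the paper's exact formulation is not claimed here), is the following, contrasted with the risk-neutral objective assumed by most IRL models:

$$
J_U(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \, U\!\Big( \sum_{t=1}^{H} r_t \Big) \right]
\qquad \text{vs.} \qquad
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \, \sum_{t=1}^{H} r_t \right].
$$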
State and Action Factorization in Power Grids
Losapio, Gianvito, Beretta, Davide, Mussi, Marco, Metelli, Alberto Maria, Restelli, Marcello
The increase in renewable energy generation towards the zero-emission target is making the problem of controlling power grids more and more challenging. The recent series of competitions Learning To Run a Power Network (L2RPN) has encouraged the use of Reinforcement Learning (RL) to assist human dispatchers in operating power grids. All the solutions proposed so far severely restrict the action space and are based either on a single agent acting on the entire grid or on multiple independent agents acting at the substation level. In this work, we propose a domain-agnostic algorithm that estimates correlations between state and action components entirely from data. Highly correlated state-action pairs are grouped together to create simpler, possibly independent subproblems that can lead to distinct learning processes with lower computational and data requirements. The algorithm is validated on a power grid benchmark obtained with the Grid2Op simulator, which has been used throughout the aforementioned competitions, showing that our algorithm is in line with domain-expert analysis. Based on these results, we lay a theoretically grounded foundation for using distributed reinforcement learning to improve the existing solutions.
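The snippet below is a toy rendering of the data-driven grouping step: compute a dependence score between every state component and every action component from logged interactions, and keep the highly dependent pairs. Plain Pearson correlation and the threshold are stand-in assumptions for whatever dependence measure and grouping criterion the actual algorithm employs.

import numpy as np

def correlated_pairs(states, actions, threshold=0.3):
    # states:  (N, dS) logged state vectors; actions: (N, dA) logged action vectors.
    dS, dA = states.shape[1], actions.shape[1]
    # Cross-correlation block between state components (rows) and action components (columns).
    corr = np.corrcoef(states.T, actions.T)[:dS, dS:]
    return [(i, j) for i in range(dS) for j in range(dA)
            if abs(corr[i, j]) >= threshold]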