Castro, Pablo Samuel
A functional mirror ascent view of policy gradient methods with function approximation
Vaswani, Sharan, Bachem, Olivier, Totaro, Simone, Mueller, Robert, Geist, Matthieu, Machado, Marlos C., Castro, Pablo Samuel, Roux, Nicolas Le
We use functional mirror ascent to propose a general framework (referred to as FMA-PG) for designing policy gradient methods. The functional perspective distinguishes between a policy's functional representation (what are its sufficient statistics) and its parameterization (how are these statistics represented), and naturally results in computationally efficient off-policy updates. For simple policy parameterizations, the FMA-PG framework ensures that the optimal policy is a fixed point of the updates. It also allows us to handle complex policy parameterizations (e.g., neural networks) while guaranteeing policy improvement. Our framework unifies several PG methods and opens the way for designing sample-efficient variants of existing methods. Moreover, it recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) in a principled way. With a softmax functional representation, FMA-PG results in a variant of TRPO with additional desirable properties. It also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on MuJoCo. We also evaluate the algorithms instantiated by FMA-PG via experiments on simple reinforcement learning problems.
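As a rough illustration (a generic mirror-ascent template rather than the paper's exact derivation), the update acts on the policy's functional representation by maximizing a regularized linearization of the expected return J:

$$
\pi_{t+1} \;=\; \arg\max_{\pi}\; \big\langle \nabla_{\pi} J(\pi_t),\, \pi - \pi_t \big\rangle \;-\; \frac{1}{\eta}\, D_{\Phi}(\pi, \pi_t),
$$

where D_Phi is a Bregman divergence (e.g., a KL divergence under the softmax representation) and eta a step size; the policy's parameters are then updated by approximately solving this surrogate with several gradient steps, which is what yields updates that remain valid off-policy. The specific surrogates instantiated by FMA-PG differ in their details.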
MICo: Learning improved representations via sampling-based state similarity for Markov decision processes
Castro, Pablo Samuel, Kastner, Tyler, Panangaden, Prakash, Rowland, Mark
The success of reinforcement learning (RL) algorithms in large-scale, complex tasks depends on forming useful representations of the environment with which the algorithms interact. Feature selection and feature learning have long been an important subdomain of RL, and with the advent of deep reinforcement learning there has been much recent interest in understanding and improving the representations learnt by RL agents. Much of the work in representation learning has taken place from the perspective of auxiliary tasks [Jaderberg et al., 2017, Bellemare et al., 2017, Fedus et al., 2019]; in addition to the primary reinforcement learning task, the agent may attempt to predict and control additional aspects of the environment. Auxiliary tasks shape the agent's representation of the environment implicitly, typically via gradient descent on the additional learning objectives. As such, while auxiliary tasks continue to play an important role in improving the performance of deep RL algorithms, our understanding of the effects of auxiliary tasks on representations in RL is still in its infancy.
Metrics and continuity in reinforcement learning
Lan, Charline Le, Bellemare, Marc G., Castro, Pablo Samuel
In most practical applications of reinforcement learning, it is untenable to maintain direct estimates for individual states; in continuous-state systems, it is impossible. Instead, researchers often leverage state similarity (whether explicitly or implicitly) to build models that can generalize well from a limited set of samples. The notion of state similarity used, and the neighbourhoods and topologies it induces, is thus of crucial importance, as it directly affects the performance of the algorithms. Indeed, a number of recent works introduce algorithms assuming the existence of "well-behaved" neighbourhoods, but leave the full specification of such topologies for future work. In this paper we introduce a unified formalism for defining these topologies through the lens of metrics. We establish a hierarchy amongst these metrics and demonstrate their theoretical implications on the Markov Decision Process specifying the reinforcement learning problem. We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
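As an illustration of why the choice of metric matters (a standard continuity argument rather than a result specific to this paper), a metric d over states supports generalization when value functions are Lipschitz with respect to it:

$$
|V^{\pi}(x) - V^{\pi}(y)| \;\le\; K_d \, d(x, y),
$$

where the constant K_d, and whether such a bound holds at all, depends on the metric chosen: bisimulation metrics, for instance, are constructed so that the optimal value function satisfies a bound of this form, whereas coarser notions of similarity need not control value differences at all.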
Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning
Agarwal, Rishabh, Machado, Marlos C., Castro, Pablo Samuel, Bellemare, Marc G.
Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.
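Schematically (and glossing over the exact distances used on action and next-state distributions), the policy similarity metric is defined by a recursion that mixes a local comparison of optimal behaviour with a discounted comparison of future behaviour:

$$
d^{*}(x, y) \;=\; \mathrm{DIST}\big(\pi^{*}(\cdot \mid x),\, \pi^{*}(\cdot \mid y)\big) \;+\; \gamma\, \mathcal{W}_{1}(d^{*})\big(P^{\pi^{*}}(\cdot \mid x),\, P^{\pi^{*}}(\cdot \mid y)\big),
$$

so two states are close when the optimal policy acts similarly in them and leads them to similarly behaving future states; the contrastive procedure then trains an embedding whose similarities reproduce this metric.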
Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research
Obando-Ceron, Johan S., Castro, Pablo Samuel
Since the introduction of DQN, the vast majority of reinforcement learning research has focused on the use of deep neural networks as function approximators. New methods are typically evaluated on a set of environments that have now become standard, such as Atari 2600 games. While these benchmarks help standardize evaluation, their computational cost has the unfortunate side effect of widening the gap between those with ample access to computational resources and those without. In this work we argue that, despite the community's emphasis on large-scale environments, the traditional small-scale environments can still yield valuable scientific insights and can help reduce the barriers to entry for underprivileged communities. To substantiate our claims, we empirically revisit the paper which introduced the Rainbow algorithm [Hessel et al., 2018] and present some new insights into the algorithms used by Rainbow.
GANterpretations
Castro, Pablo Samuel
Since the introduction of Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] there has been a regular stream of both technical advances (e.g., Arjovsky et al. [2017]) and creative uses of these generative models (e.g., [Karras et al., 2019, Zhu et al., 2017, Jin et al., 2017]). In this work we propose an approach for using the power of GANs to automatically generate videos to accompany audio recordings by aligning to spectral properties of the recording. This allows musicians to explore new forms of multi-modal creative expression, where musical performance can induce an AI-generated music video that is guided by said performance, as well as a medium for creating a visual narrative to follow a storyline (similar to what was proposed by Frosst and Kereliuk [2019]). When trained properly, GANs learn a structured latent space, where nearby points generate similar images. For our work we make use of the BigGAN family of models [Brock et al., 2019], which are class-conditional generative models.
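A minimal sketch of this kind of audio-reactive latent walk (using librosa for the spectral analysis; the generate(z, y) callable stands in for a pre-trained class-conditional generator such as a BigGAN-style model and is an assumption for illustration, not the paper's code):

import numpy as np
import librosa

def audio_reactive_frames(generate, audio_path, z_dim=128, fps=30, step_scale=0.5, seed=0):
    """Walk a class-conditional GAN's latent space in step with an audio track.

    `generate(z, y)` is a hypothetical callable returning one image for a
    latent vector z and class embedding y (e.g., a BigGAN-style model).
    """
    rng = np.random.default_rng(seed)
    audio, sr = librosa.load(audio_path, sr=None)
    hop = max(1, int(sr / fps))                     # one spectral frame per video frame
    mag = np.abs(librosa.stft(audio, hop_length=hop)).mean(axis=0)
    mag = mag / (mag.max() + 1e-8)                  # normalized per-frame spectral energy

    z = rng.standard_normal(z_dim)
    y = rng.standard_normal(z_dim)                  # fixed class embedding: one visual "theme"
    frames = []
    for m in mag:
        # Louder passages take larger steps through the latent space, so the
        # structured latent geometry yields similar images during quiet moments
        # and faster visual change when the music is intense.
        z = z + step_scale * m * rng.standard_normal(z_dim)
        frames.append(generate(z, y))
    return frames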
Inverse Reinforcement Learning with Multiple Ranked Experts
Castro, Pablo Samuel, Li, Shijian, Zhang, Daqing
We consider the problem of learning to behave optimally in a Markov Decision Process when a reward function is not specified, but instead we have access to a set of demonstrators of varying performance. We assume the demonstrators are classified into one of k ranks, and use ideas from ordinal regression to find a reward function that maximizes the margin between the different ranks. This approach is based on the idea that agents should not only learn how to behave from experts, but also how not to behave from non-experts. We show there are MDPs where important differences in the reward function would be hidden from existing algorithms by the behaviour of the expert. Our method is particularly useful for problems where we have access to a large set of agent behaviours with varying degrees of expertise (such as through GPS or cellphones). We highlight the differences between our approach and existing methods using a simple grid domain and demonstrate its efficacy on determining passenger-finding strategies for taxi drivers, using a large dataset of GPS trajectories.
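Schematically (a sketch in the spirit of max-margin ordinal regression, not necessarily the paper's exact program), if mu_i denotes the expected discounted feature counts of demonstrator i under a linear reward r(s) = w^T phi(s), one can seek the weight vector that best separates the ranks:

$$
\max_{\|w\| \le 1,\ \xi}\ \ \xi
\quad \text{subject to} \quad
w^{\top} \mu_i \;\ge\; w^{\top} \mu_j + \xi
\quad \text{whenever } \mathrm{rank}(i) > \mathrm{rank}(j),
$$

so higher-ranked demonstrators must look strictly better under the learned reward than lower-ranked ones, by the largest achievable margin xi; the non-experts thereby constrain the reward just as the experts do.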
A Comparative Analysis of Expected and Distributional Reinforcement Learning
Lyle, Clare, Castro, Pablo Samuel, Bellemare, Marc G.
Since their introduction a year ago, distributional approaches to reinforcement learning (distributional RL) have produced strong results relative to the standard approach, which models expected values (expected RL). However, aside from convergence guarantees, there have been few theoretical results investigating the reasons behind the improvements distributional RL provides. In this paper we begin the investigation into this fundamental question by analyzing the differences in the tabular, linear approximation, and non-linear approximation settings. We prove that in many realizations of the tabular and linear approximation settings, distributional RL behaves exactly the same as expected RL; in the cases where the two methods do behave differently, distributional RL can in fact hurt performance. We then continue with an empirical analysis comparing distributional and expected RL methods in control settings with non-linear approximators to tease apart where the improvements from distributional RL methods are coming from.
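The coincidence in the tabular setting can be seen from the two Bellman operators (standard definitions, not the paper's notation): the expected operator acts on value functions while the distributional operator acts on return distributions,

$$
(T^{\pi} V)(x) \;=\; \mathbb{E}\big[\,R(x, A) + \gamma\, V(X')\,\big],
\qquad
(T^{\pi} Z)(x) \;\overset{D}{=}\; R(x, A) + \gamma\, Z(X'),
$$

with A drawn from the policy and X' the next state. Taking expectations of the distributional update recovers the expected one, E[(T^pi Z)(x)] = (T^pi E[Z])(x), so whenever the distribution parameterization preserves the mean, the two methods induce the same value estimates and hence the same behaviour.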
Distributional reinforcement learning with linear function approximation
Bellemare, Marc G., Roux, Nicolas Le, Castro, Pablo Samuel, Moitra, Subhodeep
Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)'s analysis of the C51 algorithm in terms of the Cramér distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramér distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramér-based and can be combined with linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramér-based distributional methods may perform worse than directly approximating the value function.
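For reference, a minimal implementation of the standard Cramér (l2) distance between two discrete distributions on a shared, evenly spaced support (the paper's extension to arbitrary, not-necessarily-normalized vectors is not reproduced here):

import numpy as np

def cramer_distance(p, q, delta_z=1.0):
    """Cramer (l2) distance between probability vectors p and q defined on
    the same evenly spaced atoms with spacing delta_z: the l2 norm of the
    difference of their cumulative distribution functions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cdf_gap = np.cumsum(p) - np.cumsum(q)   # CDF difference at each atom
    return np.sqrt(delta_z * np.sum(cdf_gap ** 2))

# Example: two categorical value distributions on the same 3-atom support.
print(cramer_distance([0.1, 0.4, 0.5], [0.3, 0.3, 0.4], delta_z=0.5))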
A Geometric Perspective on Optimal Representations for Reinforcement Learning
Bellemare, Marc G., Dabney, Will, Dadashi, Robert, Taiga, Adrien Ali, Castro, Pablo Samuel, Roux, Nicolas Le, Schuurmans, Dale, Lattimore, Tor, Lyle, Clare
This paper proposes a new approach to representation learning based on geometric properties of the space of value functions. We study a two-part approximation of the value function: a nonlinear map from states to vectors, or representation, followed by a linear map from vectors to values. Our formulation considers adapting the representation to minimize the (linear) approximation error of the value functions of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We argue that these AVFs make excellent auxiliary tasks, and use them to construct a loss which can be efficiently minimized to find a near-optimal representation for reinforcement learning. We highlight characteristics of the method in a series of experiments on the four-room domain.
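Schematically, using the paper's terminology but not its exact objective: the two-part approximation writes values as a linear function of a learned representation, and an adversarial value function is the value of a policy that is optimal for some direction delta over states,

$$
V(s) \;\approx\; \theta^{\top} \phi(s),
\qquad
\pi_{\delta} \in \arg\max_{\pi}\ \delta^{\top} V^{\pi},
\qquad \delta \in \mathbb{R}^{|\mathcal{S}|},
$$

so the AVFs are the functions V^{pi_delta}; the representation phi is then trained so that a sampled set of such AVFs can be fit well by the linear map, which serves as the auxiliary-task loss.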