MacAlpine, Patrick
Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL
Mazoure, Bogdan, Ahmed, Ahmed M., MacAlpine, Patrick, Hjelm, R Devon, Kolobov, Andrey
A highly desirable property of a reinforcement learning (RL) agent -- and a major difficulty for deep RL approaches -- is the ability to generalize policies learned on a few tasks over a high-dimensional observation space to similar tasks not seen during training. Many promising approaches to this challenge consider RL as a process of training two functions simultaneously: a complex nonlinear encoder that maps high-dimensional observations to a latent representation space, and a simple linear policy over this space. We posit that a superior encoder for zero-shot generalization in RL can be trained solely with an auxiliary self-supervised learning (SSL) objective, provided the training process encourages the encoder to map behaviorally similar observations to similar representations, since a reward-based signal can cause overfitting in the encoder (Raileanu et al., 2021). We propose Cross-Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations by applying a novel SSL objective to pairs of trajectories from the agent's policies. CTRL can be viewed as having the same effect as inducing a pseudo-bisimulation metric but, crucially, avoids the use of rewards and the associated overfitting risks. Our experiments ablate various components of CTRL and demonstrate that, in combination with PPO, it achieves better generalization performance on the challenging Procgen benchmark suite (Cobbe et al., 2020).
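To make the pairing idea concrete, here is a minimal sketch of a reward-free SSL loss over pairs of trajectory segments. It uses a simple contrastive (InfoNCE-style) objective as a stand-in for CTRL's actual objective, and the encoder, projector, segment shapes, and function names are illustrative assumptions rather than the paper's implementation.

    import torch
    import torch.nn.functional as F

    def cross_trajectory_ssl_loss(encoder, projector, seg_a, seg_b, temperature=0.1):
        """Simplified, reward-free SSL loss over paired trajectory segments.

        seg_a, seg_b: (batch, k, obs_dim) tensors of trajectory segments assumed
        to be behaviorally similar (e.g., generated by the agent's policies from
        nearby states). No reward signal appears anywhere in this loss.
        """
        # Encode each segment by averaging its per-step representations.
        z_a = F.normalize(projector(encoder(seg_a).mean(dim=1)), dim=-1)  # (batch, d)
        z_b = F.normalize(projector(encoder(seg_b).mean(dim=1)), dim=-1)  # (batch, d)

        # InfoNCE: each segment's positive is its paired segment; all other
        # segments in the batch serve as negatives.
        logits = z_a @ z_b.t() / temperature
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage with linear modules standing in for the nonlinear encoder.
    enc, proj = torch.nn.Linear(64, 128), torch.nn.Linear(128, 32)
    loss = cross_trajectory_ssl_loss(enc, proj,
                                     torch.randn(8, 5, 64), torch.randn(8, 5, 64))
    loss.backward()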
Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark
Mohanty, Sharada, Poonganam, Jyotish, Gaidon, Adrien, Kolobov, Andrey, Wulfe, Blake, Chakraborty, Dipam, Šemetulskis, Gražvydas, Schapke, João, Kubilius, Jonas, Pašukonis, Jurgis, Klimas, Linas, Hausknecht, Matthew, MacAlpine, Patrick, Tran, Quang Nhat, Tumiel, Thomas, Tang, Xiaocheng, Chen, Xinwei, Hesse, Christopher, Hilton, Jacob, Guss, William Hebgen, Genc, Sahika, Schulman, John, Cobbe, Karl
The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring sample efficiency and generalization in reinforcement learning. Generalization remains one of the most fundamental challenges in deep reinforcement learning, yet the community has few benchmarks with which to measure its progress on this front. We present the design of a centralized reinforcement learning benchmark that measures sample efficiency and generalization by performing end-to-end evaluation of the training and rollout phases of thousands of user-submitted codebases in a scalable way. We built the benchmark on top of the existing Procgen Benchmark by defining clear tasks and standardizing the end-to-end evaluation setup. The design aims to maximize the flexibility available to researchers who wish to design future iterations of such benchmarks, while imposing the practical constraints necessary for such a system to scale. This paper presents the competition setup along with details and analysis of the top solutions identified in the 2020 iteration of the competition at NeurIPS.
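For context, the underlying Procgen Benchmark measures generalization by training on a fixed set of procedurally generated levels and evaluating on the full level distribution. A minimal sketch of that split using the procgen package's gym registration follows; the game, mode, and level counts are illustrative rather than the competition's exact settings.

    import gym

    # Train on a fixed set of 200 levels of one game...
    train_env = gym.make("procgen:procgen-coinrun-v0",
                         num_levels=200, start_level=0,
                         distribution_mode="easy")

    # ...and evaluate on the full level distribution (num_levels=0), so test
    # levels are almost surely unseen during training.
    test_env = gym.make("procgen:procgen-coinrun-v0",
                        num_levels=0, start_level=0,
                        distribution_mode="easy")

    obs = train_env.reset()
    obs, reward, done, info = train_env.step(train_env.action_space.sample())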
Multi-Preference Actor Critic
Durugkar, Ishan, Hausknecht, Matthew, Swaminathan, Adith, MacAlpine, Patrick
Policy gradient algorithms typically combine discounted future rewards with an estimated value function to compute the direction and magnitude of parameter updates. However, for most reinforcement learning tasks, humans can provide additional insight to constrain the policy learning. We introduce a general method for incorporating multiple such feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), these different types of feedback are implemented as constraints on the policy. We use a Lagrangian relaxation to satisfy these constraints via gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that the constraints are respected and can accelerate the learning process.
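As a rough illustration of the Lagrangian relaxation, the sketch below folds several preference constraints into a single loss that the policy parameters descend while non-negative multipliers ascend; the function name, shapes, and update scheme are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def multi_preference_loss(policy_loss, constraint_costs, log_multipliers, limits):
        """Lagrangian-relaxed loss in the spirit of M-PAC (illustrative only).

        policy_loss:      scalar actor-critic loss to be minimized.
        constraint_costs: (n,) tensor of per-constraint violation estimates for
                          the current batch, one entry per feedback channel.
        log_multipliers:  (n,) learnable tensor; softplus keeps multipliers >= 0.
        limits:           (n,) tensor of allowed violation levels.
        """
        lams = F.softplus(log_multipliers)
        violation = constraint_costs - limits
        # The policy descends the Lagrangian with the multipliers held fixed...
        primal = policy_loss + (lams.detach() * violation).sum()
        # ...while the multipliers ascend it with the costs held fixed.
        dual = -(lams * violation.detach()).sum()
        return primal + dual

    # Toy usage: a single optimizer over both the policy parameters and the
    # multipliers performs the primal-dual update thanks to the detach() calls.
    log_multipliers = torch.zeros(2, requires_grad=True)
    policy_loss = torch.tensor(1.3, requires_grad=True)  # stand-in for the PG loss
    loss = multi_preference_loss(policy_loss,
                                 constraint_costs=torch.tensor([0.4, 0.1]),
                                 log_multipliers=log_multipliers,
                                 limits=torch.tensor([0.2, 0.2]))
    loss.backward()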
UT Austin Villa 2014: RoboCup 3D Simulation League Champion via Overlapping Layered Learning
MacAlpine, Patrick (University of Texas at Austin) | Depinet, Mike (University of Texas at Austin) | Stone, Peter (University of Texas at Austin)
Layered learning is a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. A key feature of layered learning is that higher layers directly depend on the learned lower layers. In its original formulation, lower layers were frozen prior to learning higher layers. This paper considers an extension to the paradigm that allows certain behaviors to be learned independently and then stitched together by learning at the "seams" where their influences overlap. The UT Austin Villa 2014 RoboCup 3D simulation team, using such overlapping layered learning, learned a total of 19 layered behaviors for a simulated soccer-playing robot, organized both in series and in parallel. To the best of our knowledge, this is more than three times the number of layered behaviors in any prior layered learning system. Furthermore, the complete learning process was repeated on four different robot body types, showcasing its generality as a paradigm for efficient behavior learning. The resulting team won the RoboCup 2014 championship with an undefeated record, scoring 52 goals and conceding none. This paper includes a detailed experimental analysis of the team's performance and of the overlapping layered learning approach that led to its success.
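The toy sketch below illustrates the overlapping layered learning recipe in miniature: learn behaviors independently with lower layers frozen, then re-optimize only the parameters at the "seam" where the behaviors interact. The behaviors, parameters, fitness functions, and hill-climbing optimizer are all invented for illustration; the team's actual behaviors and optimization framework are far richer.

    import random

    def hill_climb(params, fitness, iters=300, step=0.1):
        """Minimal stand-in for a policy-search optimizer; the real system would
        score candidates by running the robot in simulation."""
        best, best_fit = dict(params), fitness(params)
        for _ in range(iters):
            cand = {k: v + random.gauss(0, step) for k, v in best.items()}
            cand_fit = fitness(cand)
            if cand_fit > best_fit:
                best, best_fit = cand, cand_fit
        return best

    # One layer: learn a walk on its own, then freeze it.
    walk = hill_climb({"step_len": 0.0, "hip_gain": 0.0},
                      fitness=lambda p: -(p["step_len"] - 0.5) ** 2
                                        - (p["hip_gain"] - 1.0) ** 2)

    # A parallel layer: learn a kick independently, also frozen afterwards.
    kick = hill_climb({"kick_power": 0.0},
                      fitness=lambda p: -(p["kick_power"] - 2.0) ** 2)

    # Overlapping "seam" learning: re-open only the parameters where the two
    # behaviors interact (e.g., the approach step into the kick) and optimize
    # them jointly while everything else stays frozen.
    seam = hill_climb({"step_len": walk["step_len"],
                       "kick_power": kick["kick_power"]},
                      fitness=lambda p: -(p["step_len"] * p["kick_power"] - 1.0) ** 2)
    print(walk, kick, seam)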
SCRAM: Scalable Collision-avoiding Role Assignment with Minimal-Makespan for Formational Positioning
MacAlpine, Patrick (University of Texas at Austin) | Price, Eric (University of Texas at Austin) | Stone, Peter (University of Texas at Austin)
Teams of mobile robots often need to divide up subtasks efficiently. In spatial domains, a key criterion for doing so may depend on the distances between robots and the subtasks' locations. This paper considers a specific such criterion, namely how to assign interchangeable robots, represented as point masses, to a set of target goal locations within an open two-dimensional space such that the makespan (the time for all robots to reach their target locations) is minimized while also preventing collisions among robots. We present scalable (polynomial-time) role assignment algorithms that we classify as SCRAM (Scalable Collision-avoiding Role Assignment with Minimal-makespan). SCRAM role assignment algorithms use a graph-theoretic approach to map agents to target goal locations such that our objectives of both minimizing the makespan and avoiding agent collisions are met. A system using SCRAM role assignment was originally designed to allow decentralized coordination among physically realistic simulated humanoid soccer-playing robots in the partially observable, non-deterministic, noisy, dynamic, and limited-communication setting of the RoboCup 3D simulation league. In its current form, SCRAM role assignment generalizes well to many realistic and real-world multiagent systems and scales to thousands of agents.
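The makespan-minimization core can be sketched as a bottleneck assignment: binary-search over candidate distances for the smallest threshold that still admits a perfect agent-to-target matching. This is an illustrative reconstruction under that framing, not the paper's exact algorithms, which additionally break ties among minimal-makespan assignments so as to avoid collisions.

    import math

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def perfect_matching(allowed):
        """Augmenting-path bipartite matching; allowed[i] is the set of targets
        agent i may take. Returns (found, match) with match[j] = assigned agent."""
        n = len(allowed)
        match = [-1] * n

        def augment(i, seen):
            for j in allowed[i]:
                if j not in seen:
                    seen.add(j)
                    if match[j] == -1 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        return all(augment(i, set()) for i in range(n)), match

    def min_makespan_assignment(agents, targets):
        """Assign agents to targets so the maximum travel distance is minimized."""
        assert len(agents) == len(targets)
        n = len(agents)
        d = [[dist(a, t) for t in targets] for a in agents]
        thresholds = sorted({d[i][j] for i in range(n) for j in range(n)})
        lo, hi, best = 0, len(thresholds) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            allowed = [{j for j in range(n) if d[i][j] <= thresholds[mid]}
                       for i in range(n)]
            ok, match = perfect_matching(allowed)
            if ok:
                best, hi = match, mid - 1
            else:
                lo = mid + 1
        return best  # best[j] = index of the agent sent to targets[j]

    print(min_makespan_assignment([(0, 0), (4, 0)], [(1, 0), (5, 0)]))  # -> [0, 1]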
Design and Optimization of an Omnidirectional Humanoid Walk: A Winning Approach at the RoboCup 2011 3D Simulation Competition
MacAlpine, Patrick (University of Texas at Austin) | Barrett, Samuel (University of Texas at Austin) | Urieli, Daniel (University of Texas at Austin) | Vu, Victor (University of Texas at Austin) | Stone, Peter (University of Texas at Austin)
This paper presents the design and learning architecture for an omnidirectional walk used by a humanoid robot soccer agent acting in the RoboCup 3D simulation environment. The walk, which was originally designed for and tested on an actual Nao robot before being employed in the 2011 RoboCup 3D simulation competition, was the crucial component in the UT Austin Villa team's victory in that competition. To the best of our knowledge, this is the first time that a robot behavior has been conceived and constructed on a real robot for the end purpose of being used in simulation. The walk is based on a double linear inverted pendulum model, and multiple sets of its parameters are optimized via a novel framework. The framework optimizes parameters for different tasks in conjunction with one another, a little-understood problem with substantial practical significance. Detailed experiments show that the UT Austin Villa agent significantly outperforms all the other agents in the competition, with the optimized walk being the key to its success.
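A minimal sketch of the kind of multi-task parameter optimization the framework performs is given below, with a simple evolutionary search and a placeholder fitness function; the walk engine, task names, and optimizer are illustrative assumptions rather than the team's actual implementation.

    import numpy as np

    def simulate_walk(params, task):
        """Placeholder for rolling out the walk engine on one optimization task
        and returning a fitness score; a toy quadratic stands in for simulation."""
        target = {"goToTarget": 1.0, "sprint": 2.0, "positioning": 0.5}[task]
        return -float(np.sum((params - target) ** 2))

    def fitness(params, tasks=("goToTarget", "sprint", "positioning")):
        # Optimize the tasks in conjunction with one another: a candidate walk
        # parameter set is scored by its combined performance across all tasks.
        return sum(simulate_walk(params, t) for t in tasks)

    def evolve(dim=5, pop=20, gens=50, sigma=0.3, seed=0):
        """Simple (mu, lambda)-style evolutionary search standing in for the
        black-box optimizer used to tune the walk parameters."""
        rng = np.random.default_rng(seed)
        mean = np.zeros(dim)
        for _ in range(gens):
            cands = mean + sigma * rng.standard_normal((pop, dim))
            scores = np.array([fitness(c) for c in cands])
            mean = cands[np.argsort(scores)[-pop // 4:]].mean(axis=0)
        return mean

    best_walk_params = evolve()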