dd-ppo
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement Erik Wijmans 1,2 Georgia Institute of Technology 2
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL) enabling high throughput. We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency. We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensible require any navigation.
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL) enabling high throughput.We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO.
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL) enabling high throughput.We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO.
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement
Wijmans, Erik, Essa, Irfan, Batra, Dhruv
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogenous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). VER learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL). VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of art distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0 VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency. We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensible require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
Researchers Improve Robotic Arm Used in Surgery
Facebook has recently created an algorithm that enhances an AI agent's ability to navigate an environment, letting the agent determine the shortest route through new environments without access to a map. While mobile robots typically have a map programmed into them, the new algorithm that Facebook designed could enable the creation of robots that can navigate environments without the need for maps. According to a post created by Facebook researchers, a major challenge for robot navigation is endowing AI systems with the ability to navigate through novel environments and reaching programmed destinations without a map. In order to tackle this challenge, Facebook created a reinforcement learning algorithm distributed across multiple learners. The algorithm was called decentralized distributed proximal policy optimization (DD-PPO).
Near-perfect point-goal navigation from 2.5 billion frames of experience
The AI community has a long-term goal of building intelligent machines that interact effectively with the physical world, and a key challenge is teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination -- without a preprovided map. We are announcing today that Facebook AI has created a new large-scale distributed reinforcement learning (RL) algorithm called DD-PPO, which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data. Agents trained with DD-PPO (which stands for decentralized distributed proximal policy optimization) achieve nearly 100 percent success in a variety of virtual environments, such as houses and office buildings. We have also successfully tested our model with tasks in real-world physical settings using a LoCoBot and Facebook AI's PyRobot platform. An unfortunate fact about maps is that they become outdated the moment they are created.
Facebook AI Researchers Achieve a 107x Speedup for Training Virtual Agents – NVIDIA Developer News Center
Navigating a new indoor space without any prior knowledge or even a map is a challenging task for a human, let alone a robot. To help develop intelligent machines that interact more effectively with complex 3D environments, Facebook researchers developed a GPU-accelerated deep reinforcement learning model that achieves near 100 percent success in navigating a variety of virtual environments without a pre-provided map. To achieve this breakthrough, the team focused their work on developing an efficient approach to scaling RL models, which require a significant number of training samples, using multi-node distribution. "A single parameter server and thousands of (typically CPU) workers may be fundamentally incompatible with the needs of modern computer vision and robotics communities," the researchers explained in their post, Near-perfect point-goal navigation from 2.5 billion frames of experience. "Unlike Gym or Atari, 3D simulators require GPU acceleration…. The desired agents operate from high-dimensional inputs (pixels) and use deep networks, such as ResNet50, which strain the parameter server. Thus, existing distributed RL architectures do not scale and there is a need to develop a new distributed architecture."
Facebook AI gives maps the brushoff in helping robots find the way
Facebook has scored an impressive feat involving AI that can navigate without any map. Facebook's wish for bragging rights, although they said they have a way to go, were evident in its blog post, "Near-perfect point-goal navigation from 2.5 billion frames of experience." Long story short, Facebook has delivered an algorithm that, quoting MIT Technology Review, lets robots find the shortest route in unfamiliar environments, opening the door to robots that can work inside homes and offices." And, in line with the plain-and-simple, Ubergizmo's Tyler Lee also remarked: "Facebook believes that with this new algorithm, it will be capable of creating robots that can navigate an area without the need for maps...in theory, you could place a robot in a room or an area without a map and it should be able to find its way to its destination." Erik Wijmans and Abhishek Kadian in the Facebook Jan. 21 post said that, well, after all, one of the technology key challenges is "teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination--without a preprovided map." Facebook has taken on the challenge. The two announced that Facebook AI created a large-scale distributed reinforcement learning algorithm called DD-PPO, "which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data," they wrote. DD-PPO stands for decentralized distributed proximal policy optimization. This is what Facebook is using to train agents and results seen in virtual environments such as houses and office buildings were encouraging. The bloggers pointed out that "even failing 1 out of 100 times is not acceptable in the physical world, where a robot agent might damage itself or its surroundings by making an error." Beyond DD-PPO, the authors gave credit to Facebook AI's open source AI Habitat platform for its "state-of-the-art speed and fidelity." AI Habitat made its open source announcement last year as a simulation platform to train embodied agents such as virtual robots in photo-realistic 3-D environments. Facebook said it was part of "Facebook AI's ongoing effort to create systems that are less reliant on large annotated data sets used for supervised training." InfoQ had said in July that "The technology was taking a different approach than relying upon static data sets which other researchers have traditionally used and that Facebook decided to open-source this technology to move this subfield forward." Jon Fingas in Engadget looked at how the team worked toward AI navigation (and this is where that 25 billion number comes in). "Previous projects tend to struggle without massive computational power.
Decentralized Distributed PPO: Solving PointGoal Navigation
Wijmans, Erik, Kadian, Abhishek, Morcos, Ari, Lee, Stefan, Essa, Irfan, Parikh, Devi, Savva, Manolis, Batra, Dhruv
DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever'stale'), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim (Savva et al., 2019), DD-PPO exhibits near-linear scaling - achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 Billion steps of experience (the equivalent of 80 years of human experience) - over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of art on Habitat Autonomous Navigation Challenge 2019, but essentially'solves' the task - near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS Compass sensor. Fortuitously, error vs computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks - the analog of'ImageNet pre-training task-specific fine-tuning' for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models code will be publicly available). 1 I NTRODUCTION Recent advances in deep reinforcement learning (RL) have given rise to systems that can outperform human experts at variety of games (Silver et al., 2017; Tian et al., 2019; OpenAI, 2018). These advances, even more-so than those from supervised learning, rely on significant numbers of training samples, making them impractical without large-scale, distributed parallelization. Thus, scaling RL via multi-node distribution is of importance to AI - that is the focus of this work. Several works have proposed systems for distributed RL (Heess et al., 2017; Liang et al., 2018; Tian et al., 2019; Silver et al., 2016; OpenAI, 2018; Espeholt et al., 2018). These works utilize two core components: 1) workers that collect experience ('rollout workers'), and 2) a parameter server that optimizes the model. The rollout workers are then distributed across, potentially, thousands of CPUs 1 .