hard exploration game
Playing hard exploration games by watching YouTube
Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent's exact environment setup and the demonstrator's action and reward trajectories. Here we propose a method that overcomes these limitations in two stages. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e.
Review for NeurIPS paper: Learning Guidance Rewards with Trajectory-space Smoothing
Weaknesses: While the experimental results are promising in the episodic Mujoco tasks, it is unclear whether IRCR works in a more complicated setting (e.g., hard exploration games in Atari, sparse reward tasks, robotic arm pick-and-place tasks, etc). Specifically, in the Mujoco tasks considered in the paper, the episodic reward can be obtained by performing simple, repetitive patterns (i.e., keep going forward). But often, the solution to an environment may involve action sequences that are hardly overlapped or repetitive (e.g., a robotic agent that needs to pick up a block and place it in a specified location, a maze-exploring agent that needs to pick up a key to open the exit). In such cases, the algorithm needs to correctly associate the non-repetitive events with the final return to solve the problem. I wonder if IRCR works in such cases as well.
Reviews: Playing hard exploration games by watching YouTube
While usually supervision has to be intentionally provided by a human, the authors instead use YouTube videos as a form of supervision. They first align videos to a shared representation, using two concept that do not require any labeling: i) predicting the temporal distance between frames in the same video, and ii) predicting the temporal distance between a video and audio frame of the same video. Subsequently, they use the embedding on a novel video to generate checkpoints, which serve a intermediate rewards for an RL agent. Experiments show state-of-the-art performance amongst learning from demonstration approaches. Strength: - The paper is clearly written, with logical steps between the sections, and good motivations.
Playing hard exploration games by watching YouTube
Aytar, Yusuf, Pfaff, Tobias, Budden, David, Paine, Thomas, Wang, Ziyu, Freitas, Nando de
Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent's exact environment setup and the demonstrator's action and reward trajectories. Here we propose a method that overcomes these limitations in two stages. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e.
Never Give Up: Learning Directed Exploration Strategies
Badia, Adriร Puigdomรจnech, Sprechmann, Pablo, Vitvitskyi, Alex, Guo, Daniel, Piot, Bilal, Kapturowski, Steven, Tieleman, Olivier, Arjovsky, Martรญn, Pritzel, Alexander, Bolt, Andew, Blundell, Charles
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated to run with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment
Taรฏga, Adrien Ali, Fedus, William, Machado, Marlos C., Courville, Aaron, Bellemare, Marc G.
This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentives exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses in the agent's performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma's Revenge which has gathered a lot of interest from the exploration community, across the the set of seven games identified by Bellemare et al. (2016) as challenging for exploration, and easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma's Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than $\epsilon$-greedy exploration.
Playing hard exploration games by watching YouTube
Aytar, Yusuf, Pfaff, Tobias, Budden, David, Paine, Tom Le, Wang, Ziyu, de Freitas, Nando
Deep reinforcement learning methods traditionally struggle with tasks where environment rewards are particularly sparse. One successful method of guiding exploration in these domains is to imitate trajectories provided by a human demonstrator. However, these demonstrations are typically collected under artificial conditions, i.e. with access to the agent's exact environment setup and the demonstrator's action and reward trajectories. Here we propose a two-stage method that overcomes these limitations by relying on noisy, unaligned footage without access to such data. First, we learn to map unaligned videos from multiple sources to a common representation using self-supervised objectives constructed over both time and modality (i.e. vision and sound). Second, we embed a single YouTube video in this representation to construct a reward function that encourages an agent to imitate human gameplay. This method of one-shot imitation allows our agent to convincingly exceed human-level performance on the infamously hard exploration games Montezuma's Revenge, Pitfall! and Private Eye for the first time, even if the agent is not presented with any environment rewards.