"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.
The past few years have witnessed breakthroughs in reinforcement learning (RL). From the first successful use of RL by a deep learning model for learning a policy from pixel input in 2013 to the OpenAI Dexterity program in 2019, we live in an exciting moment in RL research. Consequently, we need, as RL researchers, to create more and more complex environments and Unity helps us to do that. Unity ML-Agents toolkit is a new plugin based on the game engine Unity that allows us to use the Unity Game Engine as an environment builder to train agents. From playing football, learning to walk, to jump big walls, to train a cute doggy to catch sticks, Unity ML-Agents Toolkit provides a ton of amazing pre-made environment.
Facebook has recently created an algorithm that enhances an AI agent's ability to navigate an environment, letting the agent determine the shortest route through new environments without access to a map. While mobile robots typically have a map programmed into them, the new algorithm that Facebook designed could enable the creation of robots that can navigate environments without the need for maps. According to a post created by Facebook researchers, a major challenge for robot navigation is endowing AI systems with the ability to navigate through novel environments and reaching programmed destinations without a map. In order to tackle this challenge, Facebook created a reinforcement learning algorithm distributed across multiple learners. The algorithm was called decentralized distributed proximal policy optimization (DD-PPO).
Reinforcement learning, which spurs AI to complete goals using rewards or punishments, is a form of training that's led to gains in robotics, speech synthesis, and more. Unfortunately, it's data-intensive, which motivated research teams -- one from Google Brain (one of Google's AI research divisions) and the other from Alphabet's DeepMind -- to prototype more efficient means of executing it. In a pair of preprint papers, the researchers propose Adaptive Behavior Policy Sharing (ABPS), an algorithm that allows the sharing of experience adaptively selected from a pool of AI agents, and a framework -- Universal Value Function Approximators (UVFA) -- that simultaneously learns directed exploration policies with the same AI, with different trade-offs between exploration and exploitation. The teams claim ABPS achieves superior performance in several Atari games, reducing variance on top agents by 25%. As for UVFA, it doubles the performance of base agents in "hard exploration" in many of the same games while maintaining a high score across the remaining games; it's the first algorithm to achieve a high score in Pitfall without human demonstrations or hand-crafted features.
From microelectronics to mechanics and machine learning, the modern-day robots are a marvel of multiple engineering disciplines. They use sensors, image processing and reinforcement learning algorithms to move the objects around and move around the obstacles as well. However, this is not the case when it comes to handling objects such as glass. The surface properties of glass are transparent, and non-uniform light reflection makes it difficult for the sensors mounted on the robot to understand how to engage in a simple pick and place operation. To address this problem, researchers at Google AI along with Synthesis AI and Columbia University devised a novel machine-learning algorithm called ClearGrasp, that is capable of estimating accurate 3D data of transparent objects from RGB-D images.
"Generalization" is an AI buzzword these days for good reason: most scientists would love to see the models they're training in simulations and video game environments evolve and expand to take on meaningful real-world challenges -- for example in safety, conservation, medicine, etc. One concerned research area is deep reinforcement learning (DRL), which implements deep learning architectures with reinforcement learning algorithms to enable AI agents to learn the best actions possible to attain their goals in virtual environments. DRL has been widely applied in games and robotics. Such DRL agents have an impressive track record on Starcraft II and Dota-2. But because they were trained in fixed environments, studies suggest DRL agents can fail to generalize to even slight variations of their training environments.
Reinforcement learning is field that keeps growing and not only because of the breakthroughs in deep learning. Sure if we talk about deep reinforcement learning, it uses neural networks underneath, but there is more to it than that. In our journey through the world of reinforcement learning we focused on one of the most popular reinforcement learning algorithms out there Q-Learning. This approach is considered one of the biggest breakthroughs in Temporal Difference control. In this article, we are going to explore one variation and improvement of this algorithm – Double Q-Learning.
Working memory is a central topic of cognitive neuroscience because it is critical for solving real world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating" framework is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We bring together gating with ideas from machine learning about using finite memory systems in more general problems. Thus we present a normative Gating model that learns, by online temporal difference methods, to use working memory to maximize discounted future rewards in general partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in human experiments. Moreover, the model introduces a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards."
For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching.
Dealing with uncertainty is essential for efficient reinforcement learning. There is a growing literature on uncertainty estimation for deep learning from fixed datasets, but many of the most popular approaches are poorly-suited to sequential decision problems. Other methods, such as bootstrap sampling, have no mechanism for uncertainty that does not come from the observed data. We highlight why this can be a crucial shortcoming and propose a simple remedy through addition of a randomized untrainable prior' network to each ensemble member. We prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations and show that this approach scales to large-scale problems far better than previous attempts.