Reinforcement Learning
Elon Musk's lab forced bots to create their own language
Have you ever experienced the dread of overhearing two people, speaking a language you don't understand, begin laughing wildly? You just have to wonder what it is they're talking about, and if it's a joke at your expense. Heck, maybe you even check your teeth to make sure you aren't walking around with half of your lunchtime ham sandwich stuck to your gums. As Wired reports, researchers at OpenAI have made some huge strides in getting bots to communicate with each other, and without actually telling them how to do so. The group published a research paper earlier this week explaining exactly how they were able to accomplish the complex task, and it's all based on reinforcement learning.
Learning from the Hindsight Plan -- Episodic MPC Improvement
Tamar, Aviv, Thomas, Garrett, Zhang, Tianhao, Levine, Sergey, Abbeel, Pieter
Model predictive control (MPC) is a popular control method that has proved effective for robotics, among other fields. MPC performs re-planning at every time step. Re-planning is done with a limited horizon per computational and real-time constraints and often also for robustness to potential model errors. However, the limited horizon leads to suboptimal performance. In this work, we consider the iterative learning setting, where the same task can be repeated several times, and propose a policy improvement scheme for MPC. The main idea is that between executions we can, offline, run MPC with a longer horizon, resulting in a hindsight plan. To bring the next real-world execution closer to the hindsight plan, our approach learns to re-shape the original cost function with the goal of satisfying the following property: short horizon planning (as realistic during real executions) with respect to the shaped cost should result in mimicking the hindsight plan. This effectively consolidates long-term reasoning into the short-horizon planning. We empirically evaluate our approach in contact-rich manipulation tasks both in simulated and real environments, such as peg insertion by a real PR2 robot.
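The re-planning loop described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical toy (a 1-D point, three discrete controls, brute-force search over action sequences) and not the authors' implementation; it only shows "plan over a limited horizon, apply the first action, re-plan", and it does not implement the cost-shaping step.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete controls for a 1-D point


def step(x, u):
    """Toy dynamics: the state moves by the chosen control."""
    return x + u


def cost(x, u, goal):
    """Toy stage cost: distance to the goal plus a small control penalty."""
    return abs(x - goal) + 0.1 * abs(u)


def plan(x, goal, horizon):
    """Enumerate every action sequence of length `horizon`; return the cheapest."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        xi, total = x, 0.0
        for u in seq:
            xi = step(xi, u)
            total += cost(xi, u, goal)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq


def run_mpc(x0, goal, horizon, n_steps):
    """Re-plan at every time step and apply only the first planned action."""
    x, trajectory = x0, [x0]
    for _ in range(n_steps):
        u = plan(x, goal, horizon)[0]
        x = step(x, u)
        trajectory.append(x)
    return trajectory


short_horizon_run = run_mpc(x0=0, goal=3, horizon=2, n_steps=5)
hindsight_plan = plan(x=0, goal=3, horizon=4)  # longer horizon, computed offline
```

In this toy problem the short-horizon run already matches the longer plan; the setting in the abstract is the harder case where the two differ and the cost must be re-shaped to close the gap.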
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
Serban, Iulian Vlad, Lowe, Ryan, Henderson, Peter, Charlin, Laurent, Pineau, Joelle
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
Value Iteration Networks
Tamar, Aviv, Wu, Yi, Thomas, Garrett, Levine, Sergey, Abbeel, Pieter
We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN-based policies on discrete and continuous path-planning domains, and on a natural-language-based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
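For readers unfamiliar with the algorithm being approximated, here is a minimal sketch of classical tabular value iteration on a small grid. It is not the VIN itself: the grid, the deterministic four-neighbour moves, and the reward layout are assumptions chosen for illustration. Each sweep combines the rewards and values of neighbouring cells and then takes a maximum over actions, which is the kind of local, repeated computation the abstract says can be represented as a convolutional network.

```python
def value_iteration(reward, gamma=0.9, n_iters=50):
    """Tabular value iteration on a 2-D grid with deterministic 4-neighbour moves.

    reward[r][c] is the reward for entering cell (r, c). A move that would
    leave the grid keeps the agent in its current cell.
    """
    rows, cols = len(reward), len(reward[0])
    value = [[0.0] * cols for _ in range(rows)]
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    for _ in range(n_iters):
        new_value = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                q_values = []
                for dr, dc in moves:
                    # Clamp the target cell to the grid boundary.
                    nr = min(max(r + dr, 0), rows - 1)
                    nc = min(max(c + dc, 0), cols - 1)
                    q_values.append(reward[nr][nc] + gamma * value[nr][nc])
                # Bellman backup: keep the best action value.
                new_value[r][c] = max(q_values)
        value = new_value
    return value


# A one-row corridor whose rightmost cell pays a reward of 1 on entry.
corridor_values = value_iteration([[0.0, 0.0, 1.0]], gamma=0.9, n_iters=100)
```

Cells closer to the rewarding cell end up with higher values, so a greedy policy over these values walks toward the reward.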
Top 10 technologies for 2017
The technologies making waves in 2017 include brain implants and quantum computers. Here is a list of the top 10 technologies that are expected to be prevalent this year, according to MIT. At the top of the list is behavior-reinforced artificial intelligence, whether that means mastering the complex game of Go and beating a champion or learning to merge a self-driving car into traffic. The technology is based on reinforcement learning, documented more than 100 years ago by psychologist Edward Thorndike.
Revisiting stochastic off-policy action-value gradients
Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable as policy improvement occurs along the direction of steepest ascent. This has been studied extensively within the context of natural gradient actor-critic algorithms and more recently within the context of deterministic policy gradients. In this paper we briefly discuss the off-policy stochastic counterpart to deterministic action-value gradients, as well as an incremental approach for following the policy gradient in lieu of the natural gradient.
Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning
Anschel, Oron, Baram, Nir, Shimkin, Nahum
Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance. Averaged-DQN is a simple extension to the DQN algorithm, based on averaging previously learned Q-value estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. To understand the effect of the algorithm, we examine the source of value function estimation errors and provide an analytical comparison within a simplified model. We further present experiments on the Arcade Learning Environment benchmark that demonstrate significantly improved stability and performance due to the proposed extension.
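The averaging idea can be sketched in a few lines. This is a tabular illustration under stated assumptions, not the authors' code: `q_snapshots` is a hypothetical list of previously learned Q-tables (here plain dicts mapping a state to a list of action values), and the target is the usual one-step bootstrap computed from their average instead of from the latest table alone.

```python
def averaged_q(q_snapshots, state, action):
    """Average the estimates of one (state, action) pair over stored snapshots."""
    return sum(q[state][action] for q in q_snapshots) / len(q_snapshots)


def averaged_target(q_snapshots, reward, next_state, n_actions,
                    gamma=0.99, done=False):
    """One-step target that bootstraps from the averaged Q-values."""
    if done:
        return reward
    best = max(averaged_q(q_snapshots, next_state, a) for a in range(n_actions))
    return reward + gamma * best


# Two noisy snapshots that disagree about which action is better in state 0.
snapshots = [{0: [1.0, 3.0]}, {0: [3.0, 1.0]}]
target_avg = averaged_target(snapshots, reward=1.0, next_state=0,
                             n_actions=2, gamma=0.5)
target_last = averaged_target(snapshots[-1:], reward=1.0, next_state=0,
                              n_actions=2, gamma=0.5)
```

With these toy numbers the averaged target is 2.0, while the target from the latest snapshot alone is 2.5, because taking the maximum over a single noisy table picks up its upward error.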
What can you do with a rock? Affordance extraction via word embeddings
Fulda, Nancy, Ricks, Daniel, Murdoch, Ben, Wingate, David
Autonomous agents must often detect affordances: the set of behaviors enabled by a situation. Affordance detection is particularly helpful in domains with large action spaces, allowing the agent to prune its search space by avoiding futile behaviors. This paper presents a method for affordance extraction via word embeddings trained on a Wikipedia corpus. The resulting word vectors are treated as a common knowledge database which can be queried using linear algebra. We apply this method to a reinforcement learning agent in a text-only environment and show that affordance-based action selection improves performance most of the time. Our method increases the computational complexity of each learning step but significantly reduces the total number of steps needed. In addition, the agent's action selections begin to resemble those a human would choose.
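As an illustration of querying word vectors with linear algebra, the sketch below ranks candidate verbs for a noun using a vector-offset analogy and cosine similarity. The abstract does not specify the exact query used, so the analogy form is an assumption, and the four-dimensional vectors are hand-made toy values, not embeddings trained on Wikipedia.

```python
import math

# Hand-made toy vectors (hypothetical; real embeddings would be learned from a corpus).
VECTORS = {
    "song":  [1.0, 0.0, 0.0, 0.0],
    "sing":  [1.0, 0.0, 1.0, 0.0],
    "sword": [0.0, 1.0, 0.0, 0.0],
    "swing": [0.0, 1.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
    "eat":   [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def affordant_verbs(noun, verbs, example=("sing", "song"), top_k=1):
    """Rank verbs by similarity to the query: example_verb - example_noun + noun."""
    ex_verb, ex_noun = example
    query = [v - n + x for v, n, x in
             zip(VECTORS[ex_verb], VECTORS[ex_noun], VECTORS[noun])]
    ranked = sorted(verbs, key=lambda w: cosine(query, VECTORS[w]), reverse=True)
    return ranked[:top_k]


candidates = ["sing", "swing", "eat"]
sword_actions = affordant_verbs("sword", candidates)
apple_actions = affordant_verbs("apple", candidates)
```

An agent could use such a ranking to try only the top few verbs for each object it encounters, which is one way to prune a large action space.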
Sample Efficient Feature Selection for Factored MDPs
Guo, Zhaohan Daniel, Brunskill, Emma
In reinforcement learning, the state of the real world is often represented by feature vectors. However, not all of the features may be pertinent for solving the current task. We propose Feature Selection Explore and Exploit (FS-EE), an algorithm that automatically selects the necessary features while learning a Factored Markov Decision Process, and prove that under mild assumptions, its sample complexity scales with the in-degree of the dynamics of just the necessary features, rather than the in-degree of all features. This can result in a much better sample complexity when the in-degree of the necessary features is smaller than the in-degree of all features.