reinforcement learning


Google's AI picks which machine learning models will produce the best results

#artificialintelligence

Leave it to the folks at Google to devise AI capable of predicting which machine learning models will produce the best results. In a newly published paper ("Off-Policy Evaluation via Off-Policy Classification") and blog post, a team of Google AI researchers proposes what they call "off-policy classification," or OPC, which evaluates the performance of AI-driven agents by treating evaluation as a classification problem. The team notes that their approach -- a variant of reinforcement learning, which employs rewards to drive software policies toward goals -- works with image inputs and scales to tasks including vision-based robotic grasping. "Fully off-policy reinforcement learning is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot," writes Robotics at Google software engineer Alex Irpan. "With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one."
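
As a rough illustration of scoring learned Q-functions as classifiers on fixed off-policy data, the sketch below ranks candidate models by an AUC-style score over state-action pairs labelled effective or ineffective. The labelling scheme, the `models` callables, and the score itself are illustrative assumptions, not the paper's exact OPC metric.

```python
# Illustrative sketch only: rank candidate Q-functions by how well their values
# separate "effective" from "ineffective" state-action pairs in a fixed dataset.
import numpy as np

def ranking_score(q_values, labels):
    """AUC-style score: probability that a randomly chosen effective (label 1)
    pair receives a higher Q-value than a randomly chosen ineffective (label 0) pair."""
    pos = q_values[labels == 1]
    neg = q_values[labels == 0]
    diffs = pos[:, None] - neg[None, :]          # all pairwise comparisons
    return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()

def select_best_model(models, states, actions, labels):
    """Pick the candidate whose Q-values best separate effective from ineffective pairs."""
    labels = np.asarray(labels)
    scores = [ranking_score(np.asarray(m(states, actions)), labels) for m in models]
    return models[int(np.argmax(scores))], scores
```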


Control What You Can: Intrinsically Motivated Task-Planning Agent

arXiv.org Artificial Intelligence

We present a novel intrinsically motivated agent that learns how to control the environment in the fastest possible manner by optimizing learning progress. It learns what can be controlled, how to allocate time and attention, and the relations between objects using surprise-based motivation. The effectiveness of our method is demonstrated in both a synthetic and a robotic manipulation environment, yielding considerably improved performance and smaller sample complexity. In a nutshell, our work combines several task-level planning agent structures (backtracking search on a task graph, probabilistic roadmaps, allocation of search efforts) with intrinsic motivation to achieve learning from scratch.
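
A minimal sketch of one ingredient, learning-progress-based allocation of time and attention across subtasks, is given below; the `LearningProgressScheduler` class, its window size, and the progress measure (drop in recent prediction error) are assumptions for illustration rather than the authors' implementation.

```python
# Hedged sketch of learning-progress-driven task selection: track each subtask's
# recent prediction errors and prefer the subtask whose error is dropping fastest.
from collections import deque
import numpy as np

class LearningProgressScheduler:
    def __init__(self, n_tasks, window=20):
        self.errors = [deque(maxlen=window) for _ in range(n_tasks)]

    def record(self, task, prediction_error):
        self.errors[task].append(prediction_error)

    def progress(self, task):
        e = list(self.errors[task])
        if len(e) < 4:
            return float("inf")               # force initial exploration of every task
        half = len(e) // 2
        return np.mean(e[:half]) - np.mean(e[half:])   # error drop = learning progress

    def pick_task(self):
        return int(np.argmax([self.progress(t) for t in range(len(self.errors))]))
```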


Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

arXiv.org Artificial Intelligence

We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.
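
The sketch below illustrates the per-state switching idea in a tabular setting: it keeps running statistics of Monte Carlo returns per state and falls back to the MC return whenever the TD target leaves that confidence band. The interval construction here is a deliberate simplification; the paper learns its confidence intervals rather than deriving them from MC statistics.

```python
# Tabular sketch (assumed simplification): per-state choice between a TD target
# and the Monte Carlo return, based on a running confidence band of MC returns.
import numpy as np

def adaptive_td_mc_update(V, counts, mc_mean, mc_sq, trajectory,
                          gamma=0.99, alpha=0.1, z=2.0):
    """V, counts, mc_mean, mc_sq: arrays indexed by state.
    trajectory: list of (state, reward, next_state) triples from one episode."""
    # compute Monte Carlo returns by sweeping the episode backwards
    G, returns = 0.0, []
    for s, r, s_next in reversed(trajectory):
        G = r + gamma * G
        returns.append((s, r, s_next, G))
    for s, r, s_next, G in reversed(returns):
        # update per-state running statistics of the MC return
        counts[s] += 1
        mc_mean[s] += (G - mc_mean[s]) / counts[s]
        mc_sq[s] += (G * G - mc_sq[s]) / counts[s]
        std = np.sqrt(max(mc_sq[s] - mc_mean[s] ** 2, 1e-8))
        td_target = r + gamma * V[s_next]
        # if the TD target falls outside the MC confidence band, trust the MC return
        lo, hi = mc_mean[s] - z * std, mc_mean[s] + z * std
        target = td_target if lo <= td_target <= hi else G
        V[s] += alpha * (target - V[s])
```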


When to Trust Your Model: Model-Based Policy Optimization

arXiv.org Artificial Intelligence

Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.
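
A hedged sketch of the branched short-rollout procedure appears below; the `model.step`, `policy.act`, and buffer interfaces are assumed for illustration rather than taken from the authors' code.

```python
# Illustrative sketch: generate short model rollouts branched from real states,
# so that compounding model error stays bounded by the rollout horizon.
import random

def branched_rollouts(real_buffer, model_buffer, model, policy,
                      n_starts=400, horizon=5):
    """Start short model-generated rollouts from states sampled out of real data."""
    starts = random.sample(real_buffer, min(n_starts, len(real_buffer)))
    for transition in starts:
        state = transition["state"]              # branch from a real state
        for _ in range(horizon):                 # keep rollouts short
            action = policy.act(state)
            next_state, reward, done = model.step(state, action)
            model_buffer.append({"state": state, "action": action,
                                 "reward": reward, "next_state": next_state,
                                 "done": done})
            if done:
                break
            state = next_state
```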


Wasserstein Reinforcement Learning

arXiv.org Machine Learning

We propose behavior-driven optimization via Wasserstein distances (WDs) to improve several classes of state-of-the-art reinforcement learning (RL) algorithms. We show that WD regularizers acting on appropriate policy embeddings efficiently incorporate behavioral characteristics into policy optimization. We demonstrate that they improve Evolution Strategy methods by encouraging more efficient exploration, can be applied in imitation learning, and speed up training of Trust Region Policy Optimization methods. Since the exact computation of WDs is expensive, we develop approximate algorithms that combine the dual formulation of the optimal transport problem, alternating optimization, and random feature maps to effectively replace exact WD computations in the RL tasks considered. We provide theoretical analysis of our algorithms and an exhaustive empirical evaluation in a variety of RL settings.
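
As a toy illustration of a Wasserstein regularizer on behavioural embeddings, the sketch below scores Evolution Strategy candidates with a 1-D empirical W1 distance between equal-length embedding samples; the embedding choice, the additive scoring rule, and the coefficient `beta` are assumptions, and the paper's approximations (dual formulation, alternating optimization, random feature maps) are not reproduced here.

```python
# Toy sketch: 1-D empirical Wasserstein distance between behavioural embeddings,
# used as a diversity bonus when scoring Evolution Strategy candidates.
import numpy as np

def wasserstein_1d(x, y):
    """Empirical W1 distance between two equal-length 1-D samples:
    the mean absolute difference of their sorted values."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def regularised_score(reward, candidate_embedding, reference_embedding, beta=0.1):
    """Trade off raw return against behavioural novelty relative to the current policy."""
    return reward + beta * wasserstein_1d(candidate_embedding, reference_embedding)
```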


Wasserstein Adversarial Imitation Learning

arXiv.org Machine Learning

Imitation Learning describes the problem of recovering an expert policy from demonstrations. While inverse reinforcement learning approaches are known to be very sample-efficient in terms of expert demonstrations, they usually require problem-dependent reward functions or (task-)specific reward-function regularization. In this paper, we show a natural connection between inverse reinforcement learning approaches and Optimal Transport that enables more general reward functions with desirable properties (e.g., smoothness). Based on this observation, we propose a novel approach called Wasserstein Adversarial Imitation Learning. Our approach uses the Kantorovich potentials as a reward function and further leverages regularized optimal transport to enable large-scale applications. In several robotic experiments, our approach outperforms the baselines in terms of average cumulative rewards and shows a significant improvement in sample efficiency, requiring just one expert demonstration.
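
The sketch below is a minimal stand-in for the potential-as-reward idea, using a linear potential trained to separate expert from agent features; the linear form, the L2 penalty in place of the paper's regularized optimal transport machinery, and the feature inputs are all assumptions.

```python
# Minimal sketch (assumed linear potential, not the paper's regularised OT solver):
# learn a Kantorovich-style potential f that separates expert from agent features,
# then hand f out as the imitation reward for the RL learner.
import numpy as np

def update_potential(w, expert_feats, agent_feats, lr=1e-2, l2=1e-3):
    """One ascent step on E_expert[f] - E_agent[f] with f(x) = w . x,
    plus an L2 penalty standing in for the smoothness/regularisation term."""
    grad = expert_feats.mean(axis=0) - agent_feats.mean(axis=0) - l2 * w
    return w + lr * grad

def imitation_reward(w, feats):
    """Reward signal: the current potential evaluated at (state, action) features."""
    return feats @ w
```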


QXplore: Q-learning Exploration by Maximizing Temporal Difference Error

arXiv.org Machine Learning

A major challenge in reinforcement learning for continuous state-action spaces is exploration, especially when reward landscapes are very sparse. Several recent methods provide an intrinsic motivation to explore by directly encouraging RL agents to seek novel states. A potential disadvantage of pure state novelty-seeking behavior is that unknown states are treated equally regardless of their potential for future reward. In this paper, we propose that the temporal difference error of predicting the primary reward can serve as a secondary reward signal for exploration. This leads to novelty-seeking in the absence of primary reward, and at the same time accelerates exploration of reward-rich regions in sparse (but nonzero) reward landscapes compared to state novelty-seeking. This objective draws inspiration from dopaminergic pathways in the brain that influence animal behavior. We implement this idea with an adversarial method in which Q and Qx are the action-value functions for primary and secondary rewards, respectively. The secondary reward is given by the absolute value of the TD error of Q. Training is off-policy, based on a replay buffer containing a mixture of trajectories induced by Q and Qx. We characterize performance on a suite of continuous control benchmark tasks against recent state-of-the-art exploration methods and demonstrate comparable or better performance on all tasks, with much faster convergence for Q.
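
A compact sketch of the two-value-function setup follows; the `q_net`/`qx_net` interfaces and update rules are assumed placeholders, while the secondary reward is the absolute TD error of the primary Q, as described above.

```python
# Sketch under assumed interfaces: Qx is trained on a secondary reward equal to
# the absolute TD error of the primary Q; both are updated off-policy from a
# replay buffer mixing trajectories collected by Q and by Qx.
import numpy as np

def secondary_reward(q_net, s, a, r, s_next, gamma=0.99):
    """|TD error| of the primary Q-function serves as the exploration reward."""
    td_error = r + gamma * np.max(q_net.value(s_next)) - q_net.value(s)[a]
    return abs(td_error)

def train_step(q_net, qx_net, batch, gamma=0.99):
    """One off-policy update over a minibatch of (s, a, r, s_next, done) tuples."""
    for s, a, r, s_next, done in batch:
        q_net.update(s, a, r, s_next, done)                    # primary reward
        rx = secondary_reward(q_net, s, a, r, s_next, gamma)   # exploration reward
        qx_net.update(s, a, rx, s_next, done)
```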


Learning to Plan Hierarchically from Curriculum

arXiv.org Artificial Intelligence

We present a framework for learning to plan hierarchically in domains with unknown dynamics. We enhance planning performance by exploiting problem structure in several ways: (i) We simplify the search over plans by leveraging knowledge of skill objectives, (ii) Shorter plans are generated by enforcing aggressively hierarchical planning, (iii) We learn transition dynamics with sparse local models for better generalisation. Our framework decomposes transition dynamics into skill effects and success conditions, which allows fast planning by reasoning on effects, while learning conditions from interactions with the world. We propose a simple method for learning new abstract skills, using successful trajectories stemming from completing the goals of a curriculum. Learned skills are then refined to leverage other abstract skills and enhance subsequent planning. We show that both conditions and abstract skills can be learned simultaneously while planning, even in stochastic domains. Our method is validated in experiments of increasing complexity, with up to 2^100 states, showing superior planning to classic non-hierarchical planners or reinforcement learning methods. Applicability to real-world problems is demonstrated in a simulation-to-real transfer experiment on a robotic manipulator.
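
As a toy illustration of planning over skill effects and success conditions, the sketch below runs a breadth-first search over abstract states described as sets of facts; the `Skill` record and the search itself are simplifications of the paper's planner and learned models.

```python
# Toy sketch of planning over abstract skill effects (hypothetical Skill records;
# the paper's condition learning and sparse dynamics models are not shown).
from collections import namedtuple, deque

Skill = namedtuple("Skill", ["name", "preconditions", "effects"])  # sets of facts

def plan(start, goal, skills, max_depth=20):
    """Breadth-first search over abstract states (sets of facts): apply any skill
    whose success conditions hold, add its effects, stop when the goal is reached."""
    frontier = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while frontier:
        state, path = frontier.popleft()
        if set(goal) <= state:
            return path
        if len(path) >= max_depth:
            continue
        for sk in skills:
            if set(sk.preconditions) <= state:
                nxt = frozenset(state | set(sk.effects))
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, path + [sk.name]))
    return None   # no plan found within the depth limit
```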


Inferred successor maps for better transfer learning

arXiv.org Artificial Intelligence

Humans and animals show remarkable flexibility in adjusting their behaviour when their goals, or rewards in the environment, change. While such flexibility is a hallmark of intelligent behaviour, these multi-task scenarios remain an important challenge for machine learning algorithms and neurobiological models alike. Factored representations can enable flexible behaviour by abstracting away general aspects of a task from those prone to change, while nonparametric methods provide a principled way of using similarity to past experiences to guide current behaviour. Here we combine the successor representation (SR), which factors the value of actions into expected outcomes and corresponding rewards, with evaluating task similarity through nonparametric inference and clustering the space of rewards. The proposed algorithm improves SR's transfer capabilities by inverting a generative model over tasks, while also explaining important neurobiological signatures of place cell representation in the hippocampus. It dynamically samples from a flexible number of distinct SR maps while accumulating evidence about the current reward context, and outperforms competing algorithms in settings with both known and unsignalled reward changes. It reproduces the "flickering" behaviour of hippocampal maps seen when rodents navigate to changing reward locations, and gives a quantitative account of trajectory-dependent hippocampal representations (so-called splitter cells) and their dynamics. We thus provide a novel algorithmic approach for multi-task learning, as well as a common normative framework that links together these different characteristics of the brain's spatial representation.
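
The tabular sketch below shows the successor-representation backbone that the method builds on: a TD update of the occupancy matrix and value computation by combining it with a reward vector. The nonparametric inference over multiple SR maps and reward contexts is not shown.

```python
# Tabular sketch of the successor representation: M[s, s'] estimates the expected
# discounted future occupancy of s' when starting from s.
import numpy as np

def sr_td_update(M, s, s_next, gamma=0.95, alpha=0.1):
    """TD update of the successor matrix M after observing transition s -> s_next."""
    one_hot = np.zeros(M.shape[0])
    one_hot[s] = 1.0
    M[s] += alpha * (one_hot + gamma * M[s_next] - M[s])

def value_from_sr(M, reward_vector):
    """Values factorise into occupancy times reward, so a new reward vector can be
    swapped in without relearning M; this is the source of SR's transfer ability."""
    return M @ reward_vector
```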


Towards White-box Benchmarks for Algorithm Control

arXiv.org Artificial Intelligence

The performance of many algorithms in the fields of hard combinatorial problem solving, machine learning, or AI in general depends on tuned hyperparameter configurations. Automated methods have been proposed to relieve users of the tedious and error-prone task of manually searching for performance-optimized configurations across a set of problem instances. However, there is still considerable untapped potential in adjusting an algorithm's hyperparameters online, since different hyperparameters are potentially optimal at different stages of the algorithm. We formulate the problem of adjusting an algorithm's hyperparameters for a given instance on the fly as a contextual MDP, making reinforcement learning (RL) the prime candidate for solving the resulting algorithm control problem in a data-driven way. Furthermore, inspired by applications of algorithm configuration, we introduce new white-box benchmarks suitable for studying algorithm control. We show that on short sequences algorithm configuration is a valid choice, but that with increasing sequence length a black-box view of the problem quickly becomes infeasible and RL performs better.
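
A hedged sketch of framing online hyperparameter adjustment as an MDP is given below; the `algorithm` hooks (`run_one_step`, `objective`, `statistics`, `finished`) are hypothetical, and the reward definition (per-step improvement on a minimisation objective) is one plausible choice rather than the benchmarks' exact formulation.

```python
# Illustrative sketch: wrap a running algorithm as an environment so that an RL
# agent can pick the next hyperparameter value at every step.
class AlgorithmControlEnv:
    """State: the algorithm's internal statistics; action: a hyperparameter value
    for the next step; reward: per-step improvement of the objective."""

    def __init__(self, algorithm):
        self.algorithm = algorithm

    def reset(self):
        self.algorithm.reset()
        self.best = self.algorithm.objective()
        return self.algorithm.statistics()

    def step(self, hyperparameter):
        self.algorithm.run_one_step(hyperparameter)
        value = self.algorithm.objective()
        reward = self.best - value          # improvement on a minimisation objective
        self.best = min(self.best, value)
        done = self.algorithm.finished()
        return self.algorithm.statistics(), reward, done
```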