Reinforcement Learning
Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter
Kurenkov, Andrey, Taglic, Joseph, Kulkarni, Rohun, Dominguez-Kuhne, Marcus, Garg, Animesh, Martín-Martín, Roberto, Savarese, Silvio
When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way, fully reveal the object of interest, and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcomes is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution to map observations (e.g. images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample-efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.
PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
Agarwal, Alekh, Henaff, Mikael, Kakade, Sham, Sun, Wen
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case $\ell_{\infty}$ assumptions; this includes approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.
Reinforcement Learning with Trajectory Feedback
Efroni, Yonathan, Merlis, Nadav, Mannor, Shie
The computational model of reinforcement learning is based upon the ability to query a score of every visited state-action pair, i.e., to observe a per state-action reward signal. However, in practice, it is often the case that such a score is not readily available to the algorithm designer. In this work, we relax this assumption and require a weaker form of feedback, which we refer to as trajectory feedback. Instead of observing the reward from every visited state-action pair, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent. We study natural extensions of reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing the regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a computationally efficient algorithm.
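The least-squares idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a small tabular problem in which each trajectory is a list of state-action indices and the observed score is the noise-free sum of the hidden per-pair rewards. The function names, the ridge term, and the toy data are illustrative assumptions, not the authors' algorithm, and the regret-minimizing exploration part is left out entirely.

```python
import random

def visit_counts(trajectory, n_pairs):
    """Count how often each state-action index appears in one trajectory."""
    counts = [0.0] * n_pairs
    for idx in trajectory:
        counts[idx] += 1.0
    return counts

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def estimate_rewards(trajectories, scores, n_pairs, ridge=1e-6):
    """Ridge-regularized least squares: (X^T X + ridge * I) r = X^T y."""
    gram = [[ridge if i == j else 0.0 for j in range(n_pairs)]
            for i in range(n_pairs)]
    rhs = [0.0] * n_pairs
    for trajectory, score in zip(trajectories, scores):
        x = visit_counts(trajectory, n_pairs)
        for i in range(n_pairs):
            rhs[i] += x[i] * score
            for j in range(n_pairs):
                gram[i][j] += x[i] * x[j]
    return solve_linear(gram, rhs)

# Toy data (assumed): 4 state-action pairs, 200 random trajectories of length 5.
rng = random.Random(0)
true_rewards = [0.0, 1.0, 0.5, -0.2]
trajectories = [[rng.randrange(4) for _ in range(5)] for _ in range(200)]
# Only one score per trajectory is observed, never the per-step rewards.
scores = [sum(true_rewards[i] for i in t) for t in trajectories]
estimated = estimate_rewards(trajectories, scores, n_pairs=4)
```

Each trajectory contributes one linear equation in the unknown rewards, so the per-pair values become identifiable once the visit-count vectors span all pairs.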
Evaluating the Performance of Reinforcement Learning Algorithms
Jordan, Scott M., Chandak, Yash, Cohen, Daniel, Zhang, Mengxue, Thomas, Philip S.
Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that the inconsistency of performance stems from the use of flawed evaluation metrics. Taking a step towards ensuring that reported results are consistent, we propose a new comprehensive evaluation methodology for reinforcement learning algorithms that produces reliable measurements of performance both on a single environment and when aggregated across environments. We demonstrate this method by evaluating a broad class of reinforcement learning algorithms on standard benchmark tasks.
DeepMind & Google Are Betting Big On Reinforcement Learning
Recently, researchers from DeepMind and Google introduced methods for choosing the best policy in offline reinforcement learning (ORL), known as offline hyperparameter selection (OHS). OHS picks the best policy from a set of many policies trained using different hyperparameters, given only logged data. Reinforcement learning has become one of the most critical techniques in AI and is often cited as a possible route toward Artificial General Intelligence. Offline reinforcement learning has now become a fundamental approach for deploying RL techniques in real-world scenarios. According to this blog post, offline reinforcement learning can assist in pre-training a reinforcement learning agent using the existing data.
Training a Deep Reinforcement Learning Agent to Play Snake
Those of us who used a Nokia mobile phone two decades ago will remember the Snake game that was first introduced on the Nokia 6110. An adaptation of an arcade game from 1976, it eventually found itself on 400 million phones. Indeed, there is even a "World Snake Day" for nostalgic fans to remember this bygone era. But can you train a deep reinforcement learning agent to play the game? Data scientist Hennie de Harder decided to find out and chronicled her journey of pitting an agent against a Python version of the game in a blog post on Towards Data Science. One of three basic machine learning paradigms, reinforcement learning is an area of machine learning concerned with software agents that take action based on maximizing predefined rewards.
Model-Based Offline Planning
Argenson, Arthur, Dulac-Arnold, Gabriel
Offline learning is a key part of making reinforcement learning (RL) usable in real systems. Offline RL looks at scenarios where there is data from a system's operation, but no direct access to the system when learning a policy. Recent work on training RL policies from offline data has shown results both with model-free policies learned directly from the data and with planning on top of learnt models of the data. Model-free policies tend to be more performant, but are more opaque, harder to command externally, and less easy to integrate into larger systems. We propose an offline learner that generates a model that can be used to control the system directly through planning. This allows us to have easily controllable policies directly from data, without ever interacting with the system. We show the performance of our algorithm, Model-Based Offline Planning (MBOP), on a series of robotics-inspired tasks, and demonstrate its ability to leverage planning to respect environmental constraints. We are able to find near-optimal policies for certain simulated systems from as little as 50 seconds of real-time system interaction, and create zero-shot goal-conditioned policies on a series of environments.
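To make "control through planning on a model" concrete, here is a toy sketch. It does not reproduce MBOP's planner: it swaps in exhaustive search over short action sequences, and `toy_model` is a hand-written stand-in for a dynamics and reward model that would, in the offline setting, be fit to logged data. The goal, action set, and `limit` constraint are all assumptions chosen for illustration.

```python
from itertools import product

GOAL = 5.0                   # assumed target position on a 1-D line
ACTIONS = (-1.0, 0.0, 1.0)   # assumed discrete action set

def toy_model(state, action):
    """Stand-in for a learned model: returns (next_state, predicted_reward)."""
    next_state = state + action
    return next_state, -abs(GOAL - next_state)

def plan(state, model, horizon=3, limit=None):
    """Return the action sequence with the highest predicted return.

    Sequences that drive |state| above `limit` are discarded, which shows how
    a constraint can be imposed at planning time without retraining anything.
    Returns None if no sequence is feasible.
    """
    best_return, best_sequence = float("-inf"), None
    for sequence in product(ACTIONS, repeat=horizon):
        s, total, feasible = state, 0.0, True
        for a in sequence:
            s, r = model(s, a)
            if limit is not None and abs(s) > limit:
                feasible = False
                break
            total += r
        if feasible and total > best_return:
            best_return, best_sequence = total, sequence
    return best_sequence

unconstrained = plan(0.0, toy_model)
constrained = plan(0.0, toy_model, limit=1.0)
```

In a receding-horizon loop only the first action of the returned sequence would be executed before replanning from the next observed state.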
Leveraging Automated Mixed-Low-Precision Quantization for tiny edge microcontrollers
Rusci, Manuele, Fariselli, Marco, Capotondi, Alessandro, Benini, Luca
The severe on-chip memory limitations are currently preventing the deployment of the most accurate Deep Neural Network (DNN) models on tiny MicroController Units (MCUs), even when leveraging an effective 8-bit quantization scheme. To tackle this issue, in this paper we present an automated mixed-precision quantization flow based on the HAQ framework but tailored for the memory and computational characteristics of MCU devices. Specifically, a Reinforcement Learning agent searches for the best uniform quantization levels, among 2, 4, 8 bits, of individual weight and activation tensors, under the tight constraints on RAM and FLASH embedded memory sizes. We conduct an experimental analysis on MobileNetV1, MobileNetV2 and MNasNet models for ImageNet classification. Concerning the quantization policy search, the RL agent selects quantization policies that maximize the memory utilization. Given an MCU-class memory bound of 2MB for weight-only quantization, the compressed models produced by the mixed-precision engine are as accurate as the state-of-the-art solutions quantized with a non-uniform function, which is not tailored for CPUs featuring integer-only arithmetic. This demonstrates the viability of uniform quantization, required for MCU deployments, for deep weight compression. When also limiting the activation memory budget to 512kB, the best MobileNetV1 model scores up to 68.4% on ImageNet thanks to the found quantization policy, making it 4% more accurate than the other 8-bit networks fitting the same memory constraints.
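The two quantities this abstract trades off, per-tensor bit width and total memory, can be sketched as follows. This is a generic min-max uniform quantizer plus a footprint calculation under an assumed per-layer bit assignment; the function names and layer sizes are made up for illustration, and neither the HAQ search nor the authors' MCU-specific flow is shown.

```python
def quantize_uniform(values, bits):
    """Map floats onto 2**bits evenly spaced levels between min and max.

    Returns the integer codes and the de-quantized approximations.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    restored = [lo + c * scale for c in codes]
    return codes, restored

def footprint_bytes(layer_sizes, layer_bits):
    """Storage needed when layer i keeps layer_sizes[i] values at layer_bits[i] bits."""
    total_bits = sum(n * b for n, b in zip(layer_sizes, layer_bits))
    return (total_bits + 7) // 8  # round up to whole bytes

# Hypothetical three-layer network: parameter counts per layer.
layer_sizes = [1000, 4000, 16000]
uniform_8bit = footprint_bytes(layer_sizes, [8, 8, 8])
mixed = footprint_bytes(layer_sizes, [8, 4, 2])

# Quantizing a few weights to 2 bits leaves only four distinct levels.
weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, restored = quantize_uniform(weights, bits=2)
```

A search agent of the kind described above would choose the per-layer bit list; the footprint function is the sort of check that decides whether a candidate assignment fits the memory budget.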
Anomaly Detection through Reinforcement Learning
As Artificial Intelligence is becoming a mainstream and easily available commercial technology, both organizations and criminals are trying to take full advantage of it. In particular, there are predictions by cyber security experts that going forward, the world will witness many AI-powered cyber attacks. This mandates the development of more sophisticated cyber defense systems using autonomous agents which are capable of generating and executing effective policies against such attacks, without human feedback in the loop. In this series of blog posts, we plan to write about such next generation cyber defense systems. One effective approach to detecting many types of cyber threats is to treat detection as an anomaly detection problem and use machine learning or signature-based approaches to build detection systems.
Deep Reinforcement Learning with Interactive Feedback in a Human-Robot Environment
Moreira, Ithan, Rivas, Javier, Cruz, Francisco, Dazeley, Richard, Ayala, Angel, Fernandes, Bruno
Robots are extending their presence in domestic environments every day, and it is increasingly common to see them carrying out tasks in home scenarios. In the future, robots are expected to perform increasingly complex tasks and, therefore, to be able to acquire experience from different sources as quickly as possible. A plausible approach to address this issue is interactive feedback, where a trainer advises a learner on which actions should be taken from specific states to speed up the learning process. Moreover, deep reinforcement learning has recently been widely utilized in robotics to learn the environment and acquire new skills autonomously. However, an open issue when using deep reinforcement learning is the excessive time needed to learn a task from raw input images. In this work, we propose a deep reinforcement learning approach with interactive feedback to learn a domestic task in a human-robot scenario. We compare three different learning methods using a simulated robotic arm for the task of organizing different objects; the proposed methods are (i) deep reinforcement learning (DeepRL); (ii) interactive deep reinforcement learning using a previously trained artificial agent as an advisor (agent-IDeepRL); and (iii) interactive deep reinforcement learning using a human advisor (human-IDeepRL). We demonstrate that interactive approaches provide advantages for the learning process. The obtained results show that a learner agent, using either agent-IDeepRL or human-IDeepRL, completes the given task earlier and makes fewer mistakes compared to the autonomous DeepRL approach.
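One simple way to picture interactive feedback is as an action-selection rule in which an advisor's suggestion occasionally overrides the learner's own choice. The sketch below assumes this probabilistic override together with epsilon-greedy exploration; `advice_prob`, `epsilon`, and the function name are illustrative assumptions and are not taken from the paper.

```python
import random

def choose_action(q_values, advice, rng, advice_prob=0.3, epsilon=0.1):
    """Pick an action index for the learner.

    q_values: the learner's current value estimate for each action.
    advice:   action index suggested by the trainer, or None if no advice.
    """
    # Follow the advisor (human or pre-trained agent) some of the time.
    if advice is not None and rng.random() < advice_prob:
        return advice
    # Otherwise explore at random with probability epsilon.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise act greedily on the learner's own estimates.
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2]
always_advised = choose_action(q_values, advice=2, rng=rng, advice_prob=1.0)
greedy = choose_action(q_values, advice=None, rng=rng, epsilon=0.0)
```

Under this rule the advisor only shapes which experience is collected; the learning update itself is unchanged, so the same function covers both an artificial and a human advisor.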