# Reinforcement Learning

### Facebook Research is developing touchy-feely curious robots

"Much of our work in robotics is focused on self-supervised learning, in which systems learn directly from raw data so they can adapt to new tasks and new circumstances," a team of researchers from FAIR (Facebook AI Research) wrote in a blog post. "In robotics, we're advancing techniques such as model-based reinforcement learning (RL) to enable robots to teach themselves through trial and error using direct input from sensors." Specifically, the team has been trying to get a six-legged robot to teach itself to walk without any outside assistance. "Generally speaking, locomotion is a very difficult task in robotics and this is what it makes it very exciting from our perspective," Roberto Calandra, a FAIR researcher, told Engadget. "We have been able to design algorithms for AI and actually test them on a really challenging problem that we otherwise don't know how to solve."

### The False Promise of Off-Policy Reinforcement Learning Algorithms

We have all witnessed the rapid development of reinforcement learning methods in the last couple of years. The most attention has gone to off-policy methods, and the reason is quite obvious: they scale really well in comparison to other methods. Off-policy algorithms can (in principle) learn from data without interacting with the environment. This is a nice property: it means that we can collect our data by any means we see fit and infer the optimal policy completely offline. In other words, we use a different behavioral policy than the one we are optimizing. Unfortunately, this doesn't work out of the box like most people think, as I will describe in this article.
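As a rough illustration of what "learning completely offline" means, here is a minimal sketch of tabular Q-learning run over a fixed batch of logged transitions. The three-state chain, the function name `batch_q_learning`, and all hyperparameters are hypothetical and chosen only for illustration. Note that the batch here covers every state-action pair, an assumption that may not hold for real logged data.

```python
import numpy as np

def batch_q_learning(transitions, n_states, n_actions, gamma=0.9, lr=0.5, sweeps=200):
    """Tabular Q-learning applied repeatedly to a fixed batch of transitions.

    transitions: list of (state, action, reward, next_state, done) tuples,
    logged by any behaviour policy. The environment is never queried here.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Bootstrapped target uses the greedy action, not the logged one.
            target = r if done else r + gamma * q[s_next].max()
            q[s, a] += lr * (target - q[s, a])
    return q

# Hypothetical chain: action 1 moves right, action 0 stays put;
# entering state 2 ends the episode with reward 1.
transitions = []
for s in (0, 1):
    for a in (0, 1):
        s_next = s + 1 if a == 1 else s
        done = s_next == 2
        transitions.append((s, a, 1.0 if done else 0.0, s_next, done))

q = batch_q_learning(transitions, n_states=3, n_actions=2)
greedy_policy = q.argmax(axis=1)  # policy inferred purely from the batch
```

The greedy policy is recovered without any further interaction, but only because the toy batch is exhaustive; with partial coverage the `max` over actions can rely on values that were never updated from data.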

### Top 5 Books on AI and ML to Grab Today

It has been popularly noted that artificial intelligence would be like the ultimate version of Google. With recent advancements in research and technology, Artificial Intelligence (AI) and Machine Learning (ML) are slowly becoming a part of our routine. The pace at which technology is growing is unfathomable. As these smart technologies engulf our lives, staying up to date with them is essential. So, here's Packt's selection of the finest books in artificial intelligence and machine learning to help you gain an edge in these fields. Reinforcement learning is one of the trending and most promising branches of artificial intelligence.

### Generalizable Deep Reinforcement Learning

Transfer learning is all the rage in the machine learning community these days. Transfer learning serves as the basis for many of the managed AutoML services that Google, Salesforce, IBM, and Azure provide. It now figures prominently in the latest NLP research, appearing in Google's Bidirectional Encoder Representations from Transformers (BERT) model and in Sebastian Ruder and Jeremy Howard's Universal Language Model Fine-tuning for Text Classification (ULMFIT). As Sebastian writes in his blog post, 'NLP's ImageNet moment has arrived'. We're also starting to see examples of neural networks that can handle multiple tasks using transfer learning across domains. Paras Chopra has an excellent tutorial for one PyTorch network that can conduct an image search based on a textual description, search for similar images and words, and write captions for images (link to his post below).

### On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience to estimate *only* the gradient of the expected return $U(\theta)$ at the current policy parameterization $\theta$, rather than to obtain a more complete estimate of $U(\theta)$, and (b) using past experience under the current policy *only* rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines, a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
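The connection described in this abstract can be sketched in two lines. The notation is assumed rather than taken from the paper: $\tau$ is a trajectory, $R(\tau)$ its return, $P(\tau \mid \theta)$ the trajectory distribution under parameters $\theta$, and $\theta_{\text{old}}$ the parameters that generated the data. Importance sampling gives an estimate of the expected return at any $\theta$ from data gathered under $\theta_{\text{old}}$:

$$
U(\theta) = \mathbb{E}_{\tau \sim P(\cdot \mid \theta)}\left[R(\tau)\right] = \mathbb{E}_{\tau \sim P(\cdot \mid \theta_{\text{old}})}\left[\frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{\text{old}})}\, R(\tau)\right]
$$

Differentiating and evaluating at $\theta = \theta_{\text{old}}$, where the ratio equals one, recovers the familiar likelihood ratio gradient:

$$
\nabla_\theta U(\theta)\Big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_{\tau \sim P(\cdot \mid \theta_{\text{old}})}\left[\nabla_\theta \log P(\tau \mid \theta)\Big|_{\theta = \theta_{\text{old}}}\, R(\tau)\right]
$$

Read this way, the standard gradient uses only the slope of the importance-sampled estimate at a single point, which is the under-use of past experience that point (a) refers to.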

### LSTD with Random Projections

We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. Papers published at the Neural Information Processing Systems Conference.
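To make the setting concrete, the following is a minimal sketch of LSTD solved in a randomly projected feature space, with more features than samples. It is not the paper's exact algorithm: the Gaussian projection with variance `1/d`, the small ridge term `reg`, the function name, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lstd_random_projection(phi, phi_next, rewards, d, gamma=0.95, reg=1e-6, seed=0):
    """LSTD in a d-dimensional random projection of D-dimensional features.

    phi, phi_next: (n, D) feature matrices for states and successor states.
    rewards: (n,) observed rewards.
    Returns the (D, d) projection matrix and the (d,) weight vector.
    """
    rng = np.random.default_rng(seed)
    n_features = phi.shape[1]
    # Assumed choice: i.i.d. Gaussian entries with variance 1/d.
    proj = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_features, d))
    psi = phi @ proj            # projected features of current states
    psi_next = phi_next @ proj  # projected features of next states
    # Standard LSTD system A w = b, now only d x d instead of D x D.
    a = psi.T @ (psi - gamma * psi_next) + reg * np.eye(d)
    b = psi.T @ rewards
    w = np.linalg.solve(a, b)
    return proj, w

# Synthetic data with far more features (D=500) than samples (n=50).
rng = np.random.default_rng(1)
n, D, d = 50, 500, 10
phi = rng.normal(size=(n, D))
phi_next = rng.normal(size=(n, D))
rewards = rng.normal(size=n)

proj, w = lstd_random_projection(phi, phi_next, rewards, d=d)
values = (phi @ proj) @ w  # value estimates at the sampled states
```

Solving in the original 500-dimensional space would give a rank-deficient system with only 50 samples, whereas the projected system is 10 by 10, which is the practical motivation for the projection.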

### Similarities between policy gradient methods (PGM) in Reinforcement learning (RL) and supervised learning (SL)

Reinforcement learning (RL) is about sequential decision making and is traditionally contrasted with supervised learning (SL) and unsupervised learning (USL). In RL, given the current state, the agent makes a decision that may influence the next state, as opposed to SL (and USL), where the next state remains the same regardless of the decisions taken, in either batch or online learning. Although this difference between SL and RL is fundamental, there are connections that have been overlooked. In particular, we prove in this paper that policy gradient methods can be cast as a supervised learning problem where true labels are replaced with discounted rewards. We provide a new proof of policy gradient methods (PGM) that emphasizes the tight link with the cross entropy and supervised learning. We provide a simple experiment where we interchange labels and pseudo rewards. We conclude that other relationships with SL could be made if we modify the reward functions wisely.
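One common way to see this link, sketched here under assumptions and not necessarily in the paper's exact formulation, is that the REINFORCE surrogate loss for a softmax policy has the form of a cross-entropy loss in which the taken actions play the role of labels and each term is weighted by its discounted return. The function names and the tiny episode below are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one episode.
    out = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def weighted_cross_entropy(logits, actions, weights):
    # Cross-entropy against the taken actions, each term scaled by its weight.
    probs = softmax(logits)
    log_p = np.log(probs[np.arange(len(actions)), actions])
    return -(weights * log_p).sum()

def weighted_cross_entropy_grad(logits, actions, weights):
    # Gradient w.r.t. the logits: weight * (softmax - one_hot(action)).
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[actions]
    return weights[:, None] * (probs - one_hot)

# Hypothetical 3-step episode with 2 actions.
logits = np.array([[0.1, -0.3], [0.5, 0.2], [0.0, 0.0]])
actions = np.array([0, 1, 1])
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)

loss = weighted_cross_entropy(logits, actions, returns)
grad = weighted_cross_entropy_grad(logits, actions, returns)
```

Setting every weight to 1 gives back ordinary cross-entropy on the action labels, so in this sketch the only difference from supervised classification is the return-based weighting.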

### Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving

Behavior and motion planning play an important role in automated driving. Traditionally, behavior planners instruct local motion planners with predefined behaviors. Due to the high scene complexity in urban environments, unpredictable situations may occur in which behavior planners fail to match predefined behavior templates. Recently, general-purpose planners have been introduced, combining behavior and local motion planning. These general-purpose planners allow behavior-aware motion planning given a single reward function. However, two challenges arise: First, this function has to map a complex feature space into rewards. Second, the reward function has to be manually tuned by an expert. Manually tuning this reward function becomes a tedious task. In this paper, we propose an approach that relies on human driving demonstrations to automatically tune reward functions. This study offers important insights into the driving style optimization of general-purpose planners with maximum entropy inverse reinforcement learning. We evaluate our approach based on the expected value difference between learned and demonstrated policies. Furthermore, we compare the similarity of human-driven trajectories with optimal policies of our planner under learned and expert-tuned reward functions. Our experiments show that we are able to learn reward functions exceeding the level of manual expert tuning without prior domain knowledge.

### Information-Theoretic Considerations in Batch Reinforcement Learning

Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity ("why do we need them?") and the naturalness ("when do they hold?") of such assumptions have largely eluded the literature. In this paper, we revisit these assumptions and provide theoretical results towards answering the above questions, and make steps towards a deeper understanding of value-function approximation.

### Efficient Model-free Reinforcement Learning in Metric Spaces

Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human-level performance in applications such as video games [Mnih et al. 15]. Recently, equipped with the idea of optimism in the face of uncertainty, Q-learning algorithms [Jin, Allen-Zhu, Bubeck, Jordan 18] can be proven to be sample efficient for discrete tabular Markov Decision Processes (MDPs), which have a finite number of states and actions. In this work, we present an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space, hence extending efficient model-free Q-learning algorithms to continuous state-action spaces. Compared to previous model-based RL algorithms for metric spaces [Kakade, Kearns, Langford 03], our algorithm does not require access to a black-box planning oracle.