AITopics

2006.05443

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)

Genre: Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceJun-23-2020

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Matsushima, Tatsuya, Furuta, Hiroki, Matsuo, Yutaka, Nachum, Ofir, Gu, Shixiang

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that na\"{i}vely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Codes and pre-trained models are available at https://github.com/matsuolab/BREMEN .

bremen, machine learning, reinforcement learning, (14 more...)

2006.03647

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry:

Health & Medicine (0.48)
Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceJun-23-2020

Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation

W, Aaron Sonabend, Lu, Junwei, Celi, Leo A., Cai, Tianxi, Szolovits, Peter

Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or unfeasible. However, the adoption of such policies in practice is often challenging, as they are hard to interpret within the application context, and lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we have three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk aversion within the application context, and finally, 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL) that account for any risk aversion threshold. We further propose an offline version of PSRL as a special case of ESRL.

machine learning, posterior distribution, reinforcement learning, (14 more...)

2006.13189

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Tian, Yi, Qian, Jian, Sra, Suvrit

We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2006.13405

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Workflow (0.46)
Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.81)

On the Global Optimality of Model-Agnostic Meta-Learning

Wang, Lingxiao, Cai, Qi, Yang, Zhuoran, Wang, Zhaoran

Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel optimization problem, where the inner level solves each subtask based on a shared prior, while the outer level searches for the optimal shared prior by optimizing its aggregated performance over all the subtasks. Despite its empirical success, MAML remains less understood in theory, especially in terms of its global optimality, due to the nonconvexity of the meta-objective (the outer-level objective). To bridge such a gap between theory and practice, we characterize the optimality gap of the stationary points attained by MAML for both reinforcement learning and supervised learning, where the inner-level and outer-level problems are solved via first-order optimization methods. In particular, our characterization connects the optimality gap of such stationary points with (i) the functional geometry of inner-level objectives and (ii) the representation power of function approximators, including linear models and neural networks. To the best of our knowledge, our analysis establishes the global optimality of MAML with nonconvex meta-objectives for the first time.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2006.13182

Country:

North America > United States > Massachusetts (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Kamalaruban, Parameswaran, Devidze, Rati, Cevher, Volkan, Singla, Adish

Environment Shaping in Reinforcement Learning using State Abstraction

One of the central challenges faced by a reinforcement learning (RL) agent is to effectively learn a (near-)optimal policy in environments with large state spaces having sparse and noisy feedback signals. In real-world applications, an expert with additional domain knowledge can help in speeding up the learning process via \emph{shaping the environment}, i.e., making the environment more learner-friendly. A popular paradigm in literature is \emph{potential-based reward shaping}, where the environment's reward function is augmented with additional local rewards using a potential function. However, the applicability of potential-based reward shaping is limited in settings where (i) the state space is very large, and it is challenging to compute an appropriate potential function, (ii) the feedback signals are noisy, and even with shaped rewards the agent could be trapped in local optima, and (iii) changing the rewards alone is not sufficient, and effective shaping requires changing the dynamics. We address these limitations of potential-based shaping methods and propose a novel framework of \emph{environment shaping using state abstraction}. Our key idea is to compress the environment's large state space with noisy signals to an abstracted space, and to use this abstraction in creating smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent's policy learnt in the shaped environment preserves near-optimal behavior in the original environment.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.1316

Country: North America > United States > Illinois (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Hussenot, Léonard, Dadashi, Robert, Geist, Matthieu, Pietquin, Olivier

Show me the Way: Intrinsic Motivation from Demonstrations

The study of exploration in Reinforcement Learning (RL) has a long history but it remains an unsolved problem. Recent approaches applied to Deep RL are based on the concept of intrinsic motivation and are implemented in the shape of an exploration bonus, added to the environment reward, that encourages visiting exhaustively the whole state-action space as fast as possible. This approach is supported by the vast theory of RL for which convergence to optimality assumes exhaustive exploration. Yet, Human Beings and mammals do not exhaustively explore the world and their motivation is not only based on novelty but also on diverse other factors (e.g., curiosity, fun, style, pleasure, safety, competition, etc.). They optimize for life-long learning and train to learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose a method for learning an exploration bonus from demonstrations that could transfer these motivations to an artificial agent without explicitly modeling them. Using an inverse RL approach, we show that different exploration behaviors can be learnt and efficiently used by RL agents to solve tasks for which exhaustive exploration is prohibitive.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.12917

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment > Games (0.68)
Education > Educational Setting > Continuing Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Campero, Andres, Raileanu, Roberta, Küttler, Heinrich, Tenenbaum, Joshua B., Rocktäschel, Tim, Grefenstette, Edward

Learning with AMIGo: Adversarially Motivated Intrinsic Goals

arXiv.org Artificial IntelligenceJun-22-2020

A key challenge for reinforcement learning (RL) consists of learning in environments with sparse extrinsic rewards. In contrast to current RL methods, humans are able to learn new skills with little or no reward by using various forms of intrinsic motivation. We propose AMIGo, a novel agent incorporating a goal-generating teacher that proposes Adversarially Motivated Intrinsic Goals to train a goal-conditioned "student" policy in the absence of (or alongside) environment reward. Specifically, through a simple but effective "constructively adversarial" objective, the teacher learns to propose increasingly challenging---yet achievable---goals that allow the student to learn general skills for acting in a new environment, independent of the task to be solved. We show that our method generates a natural curriculum of self-proposed goals which ultimately allows the agent to solve challenging procedurally-generated tasks where other forms of intrinsic motivation and state-of-the-art RL methods fail.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.12122

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Ghosh, Dibya, Machado, Marlos C., Roux, Nicolas Le

An operator view of policy gradient methods

arXiv.org Artificial IntelligenceJun-22-2020

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2006.11266

Country:

North America > Canada > Quebec > Montreal (0.14)
North America > United States > New Jersey > Mercer County > Princeton (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Petrenko, Aleksei, Huang, Zhehui, Kumar, Tushar, Sukhatme, Gaurav, Koltun, Vladlen

Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning

arXiv.org Artificial IntelligenceJun-22-2020

Increasing the scale of reinforcement learning experiments has allowed researchers to achieve unprecedented results in both training sophisticated agents for video games, and in sim-to-real transfer for robotics. Typically such experiments rely on large distributed systems and require expensive hardware setups, limiting wider access to this exciting area of research. In this work we aim to solve this problem by optimizing the efficiency and resource utilization of reinforcement learning algorithms instead of relying on distributed computation. We present the "Sample Factory", a high-throughput training system optimized for a single-machine setting. Our architecture combines a highly efficient, asynchronous, GPU-based sampler with off-policy correction techniques, allowing us to achieve throughput higher than $10^5$ environment frames/second on non-trivial control problems in 3D without sacrificing sample efficiency. We extend Sample Factory to support self-play and population-based training and apply these techniques to train highly capable agents for a multiplayer first-person shooter game. The source code is available at https://github.com/alex-petrenko/sample-factory

machine learning, reinforcement learning, sample factory, (17 more...)

2006.11751

Country:

North America > United States > California (0.14)
Europe > Austria > Vienna (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)