AITopics

1905.12916

Country:

North America > United States (0.69)
North America > Canada > Quebec (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Diagnostic Medicine (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learning Compositional Neural Programs with Recursive Tree Search and Planning

Pierrot, Thomas, Ligner, Guillaume, Reed, Scott, Sigaud, Olivier, Perrin, Nicolas, Laterre, Alexandre, Kas, David, Beguir, Karim, de Freitas, Nando

We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disks.
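The sparse reward described in the abstract (1 when the execution satisfies the specification, 0 otherwise) can be illustrated for the sorting task. This is a minimal sketch, not the authors' code: the function name `sparse_reward` and the list-based representation of the environment state are assumptions made for illustration.

```python
def sparse_reward(initial, final):
    """Return 1.0 if `final` is the sorted version of `initial`, else 0.0.

    Hypothetical helper: the agent gets no intermediate signal, only this
    end-of-execution check against the program specification.
    """
    return 1.0 if list(final) == sorted(initial) else 0.0


# Usage: a correct execution earns 1.0, anything else earns 0.0.
print(sparse_reward([3, 1, 2], [1, 2, 3]))
print(sparse_reward([3, 1, 2], [1, 3, 2]))
```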

artificial intelligence, machine learning, reinforcement learning

1905.12941

Country: Asia > Vietnam > Hanoi > Hanoi (0.26)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning, May-30-2019

REGAL: Transfer Learning For Fast Optimization of Computation Graphs

Paliwal, Aditya, Gimeno, Felix, Nair, Vinod, Li, Yujia, Lubin, Miles, Kohli, Pushmeet, Vinyals, Oriol

We present a deep reinforcement learning approach to optimizing the execution cost of computation graphs in a static compiler. The key idea is to combine a neural network policy with a genetic algorithm, the Biased Random-Key Genetic Algorithm (BRKGA). The policy is trained to predict, given an input graph to be optimized, the node-level probability distributions for sampling mutations and crossovers in BRKGA. Our approach, "REINFORCE-based Genetic Algorithm Learning" (REGAL), uses the policy's ability to transfer to new graphs to significantly improve the solution quality of the genetic algorithm for the same objective evaluation budget. As a concrete application, we show results for minimizing peak memory in TensorFlow graphs by jointly optimizing device placement and scheduling. REGAL achieves on average 3.56% lower peak memory than BRKGA on previously unseen graphs, outperforming all the algorithms we compare to, and giving a 4.4x bigger improvement than the next best algorithm. We also evaluate REGAL on a production compiler team's performance benchmark of XLA graphs and achieve on average 3.74% lower peak memory than BRKGA, again outperforming all others. Our approach and analysis are made possible by collecting a dataset of 372 unique real-world TensorFlow graphs, more than an order of magnitude more data than previous work.
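The two BRKGA ingredients the abstract mentions, random keys and biased crossover, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the per-node list `elite_prob` stands in for what a learned policy might supply (standard BRKGA uses a single scalar bias), and the decoder ignores the precedence constraints a real computation graph would impose.

```python
import random


def decode(keys):
    """Decode random keys into a node ordering by ascending key value."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


def biased_crossover(elite, non_elite, elite_prob, rng):
    """Build a child by taking each key from the elite parent with a
    per-node probability, otherwise from the non-elite parent."""
    return [
        e if rng.random() < p else n
        for e, n, p in zip(elite, non_elite, elite_prob)
    ]


# Usage with a seeded generator so the run is repeatable.
rng = random.Random(0)
elite = [0.7, 0.1, 0.4]
non_elite = [0.2, 0.9, 0.5]
child = biased_crossover(elite, non_elite, [0.9, 0.5, 0.1], rng)
print(decode(child))
```

Passing node-level probabilities rather than one global bias is the design choice that lets a policy favour the elite parent's decision on some nodes more than on others.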

evolutionary algorithm, machine learning, reinforcement learning

1905.02494

Country:

Oceania > Australia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)

Provably Efficient Q-Learning with Low Switching Cost

Bai, Yu, Xie, Tengyang, Jiang, Nan, Wang, Yu-Xiang

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of local switching cost. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for H-step episodic MDPs that achieves sublinear regret and whose local switching cost in K episodes is $O(H^3SA\log K)$, and we provide a lower bound of $\Omega(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting, which yields nontrivial results that improve upon prior work in certain aspects.
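The logarithmic dependence on K in the switching-cost bound can be made intuitive with a delayed-update schedule: if the policy for a state-action pair is refreshed only when its visit count has grown by a constant factor, the number of refreshes grows logarithmically in the number of visits. The sketch below illustrates that idea only; the function `switch_times` and the `growth` factor are assumptions, and the paper's actual triggering schedule (which depends on H) is not reproduced here.

```python
import math


def switch_times(num_visits, growth=1.5):
    """Visit counts at which the policy for one state-action pair is refreshed.

    Hypothetical geometric schedule: after a refresh at visit t, the next
    refresh happens at roughly growth * t.
    """
    times = []
    next_switch = 1
    for t in range(1, num_visits + 1):
        if t == next_switch:
            times.append(t)
            next_switch = max(t + 1, math.ceil(t * growth))
    return times


# Usage: the number of refreshes grows slowly compared with the visit count.
for k in (100, 10_000, 1_000_000):
    print(k, len(switch_times(k)))
```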

artificial intelligence, machine learning, reinforcement learning

1905.12849

Country:

North America > United States > Illinois (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine Learning, May-30-2019

Imitation Learning as $f$-Divergence Minimization

Ke, Liyiming, Barnes, Matt, Sun, Wen, Lee, Gilwoo, Choudhury, Sanjiban, Srinivasa, Siddhartha

We address the problem of imitation learning with multi-modal demonstrations. Instead of attempting to learn all modes, we argue that in many tasks it is sufficient to imitate any one of them. We show that state-of-the-art methods such as GAIL and behavior cloning, due to their choice of loss function, often incorrectly interpolate between such modes. Our key insight is to minimize the right divergence between the learner and the expert state-action distributions, namely the reverse KL divergence or I-projection. We propose a general imitation learning framework for estimating and minimizing any $f$-divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen-Shannon) and DAGGER (Total Variation). Empirical results show that our approximate I-projection technique is able to imitate multi-modal behaviors more reliably than GAIL and behavior cloning.
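The contrast between the forward KL divergence and the reverse KL divergence (I-projection) can be seen in a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the paper's method: the bimodal expert, the two candidate learners, and the grid-based `kl` estimator are all hypothetical choices. The forward KL prefers the broad learner that sits between the modes, while the reverse KL prefers the learner that commits to a single mode.

```python
import math


def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def expert_pdf(x):
    # Hypothetical bimodal expert: equal mixture of two narrow Gaussians.
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


def interpolating(x):
    # Broad learner centred between the two expert modes.
    return normal_pdf(x, 0.0, 2.06)


def single_mode(x):
    # Narrow learner that commits to one expert mode.
    return normal_pdf(x, 2.0, 0.5)


def kl(p, q, lo=-10.0, hi=10.0, steps=4000):
    """Riemann-sum estimate of KL(p || q) on a fixed grid."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        px, qx = p(x), q(x)
        if px > 0.0 and qx > 0.0:
            total += px * math.log(px / qx) * dx
    return total


# Forward KL(expert || learner) versus reverse KL(learner || expert).
for name, learner in (("interpolating", interpolating), ("single_mode", single_mode)):
    print(name, kl(expert_pdf, learner), kl(learner, expert_pdf))
```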

artificial intelligence, machine learning, reinforcement learning

1905.12888

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Leisure & Entertainment > Sports > Motorsports (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Prasad, Niranjani, Engelhardt, Barbara E, Doshi-Velez, Finale

Defining Admissible Rewards for High Confidence Policy Evaluation

arXiv.org Machine Learning, May-30-2019

A key impediment to reinforcement learning (RL) in real applications with limited, batch data is defining a reward function that reflects what we implicitly know about reasonable behaviour for a task and allows for robust off-policy evaluation. In this work, we develop a method to identify an admissible set of reward functions for policies that (a) do not diverge too far from past behaviour, and (b) can be evaluated with high confidence, given only a collection of past trajectories. Together, these ensure that we propose policies that we trust to be implemented in high-risk settings. We demonstrate our approach to reward design on synthetic domains as well as in a critical care context, for a reward that consolidates clinical objectives to learn a policy for weaning patients from mechanical ventilation.

artificial intelligence, machine learning, reinforcement learning

1905.13167

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.96)
Health & Medicine > Therapeutic Area > Immunology > HIV (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Tiwari, Nilay, Ghosh, Arnob, Aggarwal, Vaneet

Reinforcement Learning for Mean Field Game

Stochastic games provide a framework for interactions among multiple agents and enable a myriad of applications. In these games, agents decide on actions simultaneously, the state of each agent moves to the next state, and each agent receives a reward. However, finding an equilibrium (if one exists) in such a game is often difficult when the number of agents becomes large. This paper focuses on finding a mean-field equilibrium (MFE) in an action-coupled stochastic game setting in an episodic framework. It is assumed that the impact of the other agents can be captured by the empirical distribution of the mean of the actions. All agents know the action distribution and employ lower-myopic best response dynamics to choose the optimal oblivious strategy. This paper proposes a posterior-sampling-based approach for reinforcement learning in the mean-field game, where each agent samples a transition probability from the previous transitions. We show that the policy and action distributions converge to the optimal oblivious strategy and the limiting distribution, respectively, which constitute an MFE.

artificial intelligence, machine learning, reinforcement learning

1905.13357

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.49)

Asadi, Kavosh, Misra, Dipendra, Kim, Seungchan, Littman, Michael L.

Combating the Compounding-Error Problem with a Multi-step Model

Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state (a one-step model). This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction errors can get magnified, leading to unacceptable inaccuracy. This compounding-error problem plagues planning and undermines model-based reinforcement learning. In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions. Novel theoretical and empirical results indicate that the multi-step model is more conducive to efficient value-function estimation, and it yields better action selection compared to the one-step model. These results make a strong case for using multi-step models in the context of model-based reinforcement learning.
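The compounding-error problem can be reproduced in a few lines. This is a toy sketch under stated assumptions: the scalar linear dynamics, the size of the model's bias, and all function names are hypothetical and are not taken from the paper. A one-step model with a small coefficient error is composed with itself, and its relative error grows with the rollout horizon; a model that predicts the multi-step outcome directly would make a single prediction instead of a chain of them.

```python
def true_step(state, action):
    # Hypothetical scalar dynamics used only for this illustration.
    return 1.05 * state + action


def one_step_model(state, action, bias=0.02):
    # Learned one-step model with a small error in the state coefficient.
    return (1.05 + bias) * state + action


def rollout(step_fn, state, actions):
    # Compose a step function with itself along an action sequence.
    for action in actions:
        state = step_fn(state, action)
    return state


def relative_error(horizon, bias=0.02):
    """Relative error of the composed one-step model after `horizon` steps."""
    actions = [0.0] * horizon
    truth = rollout(true_step, 1.0, actions)
    predicted = rollout(lambda s, a: one_step_model(s, a, bias), 1.0, actions)
    return abs(predicted - truth) / abs(truth)


# Usage: the same per-step bias yields a larger error at longer horizons.
for horizon in (1, 5, 20):
    print(horizon, relative_error(horizon))
```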

artificial intelligence, machine learning, reinforcement learning

1905.13320

Country: North America > United States (0.28)

Genre:

Workflow (0.68)
Research Report > New Finding (0.46)

Industry: Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Zombori, Zsolt, Csiszárik, Adrián, Michalewski, Henryk, Kaliszyk, Cezary, Urban, Josef

Towards Finding Longer Proofs

We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). FLoP focuses on generalizing from short proofs to longer ones of similar structure. To achieve that, FLoP uses state-of-the-art RL approaches that were previously not applied in theorem proving. In particular, we show that curriculum learning significantly outperforms previous learning-based proof guidance on a synthetic dataset of increasingly difficult arithmetic problems.

logic & formal reasoning, machine learning, reinforcement learning

1905.13100

Country:

Europe (1.00)
North America > United States > New York (0.28)
North America > United States > California (0.28)

Genre: Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Education (0.46)
Leisure & Entertainment > Games (0.46)
Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Han, Seungyul, Sung, Youngchul

Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning

arXiv.org Artificial Intelligence, May-29-2019

In importance sampling (IS)-based reinforcement learning algorithms such as Proximal Policy Optimization (PPO), IS weights are typically clipped to avoid large variance in learning. However, the policy update from clipped statistics induces large bias in tasks with high action dimensions, and bias from clipping makes it difficult to reuse old samples with large IS weights. In this paper, we consider PPO, a representative on-policy algorithm, and propose an improvement based on dimension-wise IS weight clipping, which separately clips the IS weight of each action dimension to avoid large bias and adaptively controls the IS weight to bound the policy update from the current policy. This new technique enables efficient learning for high action-dimensional tasks and reuse of old samples, as in off-policy learning, to increase sample efficiency. Numerical results show that the proposed algorithm outperforms PPO and other RL algorithms in various OpenAI Gym tasks.
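The difference between clipping the joint IS weight and clipping each action dimension separately can be sketched for a factorized Gaussian policy. This is one plausible reading of the abstract, not the authors' implementation: the function names, the shared standard deviation, and the choice to average the per-dimension terms are assumptions. With many action dimensions, per-dimension ratios that are individually close to 1 can multiply into a joint ratio far outside the clipping range.

```python
import math


def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def per_dim_ratios(action, new_mean, old_mean, std=1.0):
    """IS weight of each action dimension under a factorized Gaussian policy."""
    return [
        math.exp(gaussian_logpdf(a, mn, std) - gaussian_logpdf(a, mo, std))
        for a, mn, mo in zip(action, new_mean, old_mean)
    ]


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def joint_clipped_surrogate(ratios, advantage, eps=0.2):
    # PPO-style clipping applied to the product of all per-dimension ratios.
    joint = math.prod(ratios)
    return min(joint * advantage, clip(joint, 1 - eps, 1 + eps) * advantage)


def dimension_wise_surrogate(ratios, advantage, eps=0.2):
    # Clip each dimension separately; averaging the terms is an assumption.
    terms = [min(r * advantage, clip(r, 1 - eps, 1 + eps) * advantage) for r in ratios]
    return sum(terms) / len(terms)


# Usage: eight dimensions, each shifted slightly in the same direction.
ratios = per_dim_ratios([1.0] * 8, [0.1] * 8, [0.0] * 8)
print(max(ratios), math.prod(ratios))
print(joint_clipped_surrogate(ratios, 1.0), dimension_wise_surrogate(ratios, 1.0))
```

In this example every per-dimension ratio stays inside the clipping range while the joint ratio does not, so the joint objective is saturated by the clip and the dimension-wise one is not.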

machine learning, reinforcement learning, time step

1905.02363

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)