AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq

Vajjha, Koundinya, Shinnar, Avraham, Pestun, Vasily, Trager, Barry, Fulton, Nathan

arXiv.org Artificial IntelligenceSep-23-2020

Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on formalization of reinforcement learning algorithms.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

2009.11403

Country:

North America > United States > Oregon > Multnomah County > Portland (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
(6 more...)

Genre: Research Report (0.40)

Industry: Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.73)

Add feedback

MOPO: Model-based Offline Policy Optimization

Yu, Tianhe, Thomas, Garrett, Yu, Lantao, Ermon, Stefano, Zou, James, Levine, Sergey, Finn, Chelsea, Ma, Tengyu

arXiv.org Artificial IntelligenceSep-23-2020

Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. The code is available at https://github.com/tianheyu927/mopo.

arxiv preprint arxiv, machine learning, reinforcement learning, (13 more...)

arXiv.org Artificial Intelligence

2005.13239

Country:

North America > United States > Massachusetts (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > China > Shandong Province > Dongying (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Exploring exploration: comparing children with RL agents in unified environments

AIHubSep-22-2020, 08:00:00 GMT

Despite recent advances in artificial intelligence (AI) research, human children are still by far the best learners we know of, learning impressive skills like language and high-level reasoning from very little data. Children's learning is supported by highly efficient, hypothesis-driven exploration: in fact, they explore so well that many machine learning researchers have been inspired to put videos like the one below in their talks to motivate research into exploration methods. However, because applying results from studies in developmental psychology can be difficult, this video is often the extent to which such research actually connects with human cognition. Why is directly applying research from developmental psychology to problems in AI so hard? For one, taking inspiration from developmental studies can be difficult because the environments that human children and artificial agents are typically studied in can be very different. Traditionally, reinforcement learning (RL) research takes place in grid-world-like settings or other 2D games, whereas children act in the real world which is rich and 3-dimensional.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

AIHub

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)

Add feedback

[R] GRAC: Self-Guided and Self-Regularized Actor-Critic

#artificialintelligenceSep-22-2020, 04:02:07 GMT

Abstract: Deep reinforcement learning (DRL) algorithms have successfully been demonstrated on a range of challenging decision making and control tasks. One dominant component of recent deep reinforcement learning algorithms is the target network which mitigates the divergence when learning the Q function. However, target networks can slow down the learning process due to delayed function updates. Another dominant component especially in continuous domains is the policy gradient method which models and optimizes the policy directly. However, when Q functions are approximated with neural networks, their landscapes can be complex and therefore mislead the local gradient.

machine learning, reinforcement learning, self-guided and self-regularized actor-critic, (4 more...)

#artificialintelligence

Industry: Media > News (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

UC Berkeley Reward-Free RL Beats SOTA Reward-Based RL

#artificialintelligenceSep-22-2020, 00:40:15 GMT

End-to-end Deep Reinforcement Learning (DRL) is a trending training approach in the field of computer vision, where it has proven successful at solving a wide range of complex tasks that were previously regarded as out of reach. End-to-end DRL is now being applied in domains ranging from real-world and simulated robotics to sophisticated video games. However, as appealing as end-to-end DRL methods are, most rely heavily on reward functions in order to learn visual features. This means feature-learning suffers when rewards are sparse, which is the case in most real-world scenarios. ATC trains a convolutional encoder to associate pairs of observations separated by a short time difference. Random shift, a stochastic data augmentation to the observations is applied within each training batch.

machine learning, reinforcement learning, rl beat sota reward-based rl, (8 more...)

#artificialintelligence

Country:

Asia > China (0.08)
North America > United States > California > Alameda County > Berkeley (0.06)

Industry:

Leisure & Entertainment > Games > Computer Games (0.57)
Education > Educational Setting > Higher Education (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.98)

Add feedback

Approximate exploitability: Learning a best response in large games

Timbers, Finbarr, Lockhart, Edward, Lanctot, Marc, Schmid, Martin, Schrittwieser, Julian, Hubert, Thomas, Bowling, Michael

arXiv.org Machine LearningSep-22-2020

A standard metric used to measure the approximate optimality of policies in imperfect information games is exploitability, i.e. the performance of a policy against its worst-case opponent. However, exploitability is intractable to compute in large games as it requires a full traversal of the game tree to calculate a best response to the given policy. We introduce a new metric, approximate exploitability, that calculates an analogous metric using an approximate best response; the approximation is done by using search and reinforcement learning. This is a generalization of local best response, a domain specific evaluation metric used in poker. We provide empirical results for a specific instance of the method, demonstrating that our method converges to exploitability in the tabular and function approximation settings for small games. In large games, our method learns to exploit both strong and weak agents, learning to exploit an AlphaZero agent.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2004.09677

Country:

North America > Canada > Alberta (0.14)
North America > United States > Texas (0.05)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games > Chess (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Games (1.00)

Add feedback

When Deep Reinforcement Learning Meets Federated Learning: Intelligent Multi-Timescale Resource Management for Multi-access Edge Computing in 5G Ultra Dense Network

Yu, Shuai, Chen, Xu, Zhou, Zhi, Gong, Xiaowen, Wu, Di

arXiv.org Artificial IntelligenceSep-22-2020

Ultra-dense edge computing (UDEC) has great potential, especially in the 5G era, but it still faces challenges in its current solutions, such as the lack of: i) efficient utilization of multiple 5G resources (e.g., computation, communication, storage and service resources); ii) low overhead offloading decision making and resource allocation strategies; and iii) privacy and security protection schemes. Thus, we first propose an intelligent ultra-dense edge computing (I-UDEC) framework, which integrates blockchain and Artificial Intelligence (AI) into 5G ultra-dense edge computing networks. First, we show the architecture of the framework. Then, in order to achieve real-time and low overhead computation offloading decisions and resource allocation strategies, we design a novel two-timescale deep reinforcement learning (\textit{2Ts-DRL}) approach, consisting of a fast-timescale and a slow-timescale learning process, respectively. The primary objective is to minimize the total offloading delay and network resource usage by jointly optimizing computation offloading, resource allocation and service caching placement. We also leverage federated learning (FL) to train the \textit{2Ts-DRL} model in a distributed manner, aiming to protect the edge devices' data privacy. Simulation results corroborate the effectiveness of both the \textit{2Ts-DRL} and FL in the I-UDEC framework and prove that our proposed algorithm can reduce task execution time up to 31.87%.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2009.10601

Country:

North America > United States > New York (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > United States > Alabama > Lee County > Auburn (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Is Q-Learning Provably Efficient? An Extended Analysis

Rastogi, Kushagra, Lee, Jonathan, Harel-Canada, Fabrice, Joglekar, Aditya

arXiv.org Artificial IntelligenceSep-22-2020

This work extends the analysis of the theoretical results presented within the paper Is Q-Learning Provably Efficient? by Jin et al. We include a survey of related research to contextualize the need for strengthening the theoretical guarantees related to perhaps the most important threads of model-free reinforcement learning. We also expound upon the reasoning used in the proofs to highlight the critical steps leading to the main result showing that Q-learning with UCB exploration achieves a sample efficiency that matches the optimal regret that can be achieved by any model-based approach.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2009.10396

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.29)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
(5 more...)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

The relationship between dynamic programming and active inference: the discrete, finite-horizon case

Da Costa, Lancelot, Sajid, Noor, Parr, Thomas, Friston, Karl, Smith, Ryan

arXiv.org Artificial IntelligenceSep-22-2020

Active inference is a normative framework for generating behaviour based upon the free energy principle, a theory of self-organisation. This framework has been successfully used to solve reinforcement learning and stochastic control problems, yet, the formal relation between active inference and reward maximisation has not been fully explicated. In this paper, we consider the relation between active inference and dynamic programming under the Bellman equation, which underlies many approaches to reinforcement learning and control. We show that, on partially observable Markov decision processes, dynamic programming is a limiting case of active inference. In active inference, agents select actions to minimise expected free energy. In the absence of ambiguity about states, this reduces to matching expected states with a target distribution encoding the agent's preferences. When target states correspond to rewarding states, this maximises expected reward, as in reinforcement learning. When states are ambiguous, active inference agents will choose actions that simultaneously minimise ambiguity. This allows active inference agents to supplement their reward maximising (or exploitative) behaviour with novelty-seeking (or exploratory) behaviour. This clarifies the connection between active inference and reinforcement learning, and how both frameworks may benefit from each other.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2009.08111

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.14)
Asia > Middle East > Jordan (0.05)
(5 more...)

Genre:

Research Report (0.50)
Instructional Material > Course Syllabus & Notes (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

IALE: Imitating Active Learner Ensembles

Löffler, Christoffer, Mutschler, Christopher

arXiv.org Machine LearningSep-22-2020

However, the performance of AL heuristics depends on the structure of the underlying classifier model and the data. We propose an imitation learning scheme that imitates the selection of the best expert heuristic at each stage of the AL cycle in a batch-mode pool-based setting. With multiple AL heuristics as experts, the policy is able to reflect the choices of the best AL heuristics given the current state of the AL process. Our experiment on well-known datasets show that we both outperform state of the art imitation learners and heuristics. The high performance of deep learning on various tasks from computer vision (Voulodimos et al., 2018) to natural language processing (NLP) (Barrault et al., 2019) also comes with disadvantages. One of their main drawbacks is the large amount of labeled training data they require. Obtaining such data is expensive and time-consuming and often requires domain expertise (Löffler et al., 2020). Active Learning (AL) is an iterative process where during every iteration an oracle (e.g. a human) is asked to label the most informative unlabeled data sample(s). In pool-based AL all data samples are available (while most of them are unlabeled). In batch-mode pool-based AL, we select unlabeled data samples from the pool in acquisition batches greater than 1. Batch-mode AL decreases the number of AL iterations required and makes it easier for an oracle to label the data samples (Settles, 2009).

dataset, iale, learning, (15 more...)

arXiv.org Machine Learning

2007.04637

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(17 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback