AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation

Lu, Chaochao, Huang, Biwei, Wang, Ke, Hernández-Lobato, José Miguel, Zhang, Kun, Schölkopf, Bernhard

arXiv.org Machine Learning, Dec-16-2020

Reinforcement learning (RL) algorithms usually require a substantial amount of interaction data and perform well only for specific tasks in a fixed environment. In some scenarios such as healthcare, however, usually only a few records are available for each patient, and patients may show different responses to the same treatment, impeding the application of current RL algorithms to learn optimal policies. To address the issues of mechanism heterogeneity and related data scarcity, we propose a data-efficient RL algorithm that exploits structural causal models (SCMs) to model the state dynamics, which are estimated by leveraging both commonalities and differences across subjects. The learned SCM enables us to counterfactually reason what would have happened had another treatment been taken. It helps avoid real (possibly risky) exploration and mitigates the issue that limited experiences lead to biased policies. We propose counterfactual RL algorithms to learn both population-level and individual-level policies. We show that counterfactual outcomes are identifiable under mild conditions and that Q-learning on the counterfactual-based augmented data set converges to the optimal value function. Experimental results on synthetic and real-world data demonstrate the efficacy of the proposed approach.
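
The idea of running Q-learning on a counterfactually augmented data set can be illustrated with a minimal sketch. This is not the authors' algorithm: the function `model_step` is a hypothetical stand-in for a learned SCM (here it is simply the known dynamics of a toy 5-state chain), and the environment, hyperparameters, and helper names are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Toy 1-D chain (assumed for illustration): states 0..4, actions -1 / +1,
# reward 1 for reaching the terminal goal state 4.
ACTIONS, GOAL = (-1, +1), 4

def model_step(state, action):
    """Hypothetical stand-in for a learned SCM: predicts (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, float(next_state == GOAL)

def augment(observed):
    """For each observed transition, add the predicted outcome of the action not taken."""
    augmented = list(observed)
    for state, action, _, _ in observed:
        for other in ACTIONS:
            if other != action:
                next_state, reward = model_step(state, other)
                augmented.append((state, other, reward, next_state))
    return augmented

def q_learning(transitions, sweeps=200, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning over a fixed batch of (s, a, r, s') tuples."""
    rng = random.Random(seed)
    q = defaultdict(float)
    data = list(transitions)
    for _ in range(sweeps):
        rng.shuffle(data)
        for s, a, r, s2 in data:
            # The goal state is terminal, so it contributes no bootstrap term.
            future = 0.0 if s2 == GOAL else gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (r + future - q[(s, a)])
    return q

# Observed data: only "move left" (-1) was ever tried in each non-goal state.
observed = []
for s in range(GOAL):
    s2, r = model_step(s, -1)
    observed.append((s, -1, r, s2))

q = q_learning(augment(observed))
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

In this toy setting the observed data never contains a rewarding transition, so Q-learning on `observed` alone could not prefer moving right; the counterfactual transitions supply that missing experience without any further real interaction.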

arxiv preprint arxiv, causal model, dynamic model, (13 more...)

2012.09092

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Middle East > Jordan (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Introduction to Machine Learning

#artificialintelligence, Dec-15-2020, 06:10:10 GMT

This course introduces principles, algorithms, and applications of machine learning from the point of view of modeling and prediction. It includes formulation of learning problems and concepts of representation, over-fitting, and generalization. These concepts are exercised in supervised learning and reinforcement learning, with applications to images and to temporal sequences. Learn how to perform supervised and reinforcement learning, with images and temporal sequences. This course includes lectures, lecture notes, exercises, labs, and homework problems.

machine learning, reinforcement learning, temporal sequence, (2 more...)

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.40)

Genre: Instructional Material > Course Syllabus & Notes (1.00)

Industry:

Education > Focused Education > Special Education (0.42)
Education > Educational Technology > Educational Software > Computer Based Training (0.40)
Education > Educational Setting > Online (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.62)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.40)

Facebook Open Sources ReBeL, a New Reinforcement Learning Agent - KDnuggets

#artificialintelligence, Dec-15-2020, 02:41:06 GMT

I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Poker has been considered by many to be the core inspiration for the formalization of game theory. John von Neumann was reportedly an avid poker fan and used many analogies to the card game while creating the foundations of game theory.

imperfect-information game, rebel, reinforcement, (10 more...)

Country: North America > United States > Texas (0.07)

Industry: Leisure & Entertainment > Games (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.92)

Towards a 6G AI-Native Air Interface

Hoydis, Jakob, Aoudia, Fayçal Ait, Valcarce, Alvaro, Viswanathan, Harish

arXiv.org Artificial Intelligence, Dec-15-2020

Each generation of cellular communication systems is marked by a defining disruptive technology of its time, such as orthogonal frequency division multiplexing (OFDM) for 4G or Massive multiple-input multiple-output (MIMO) for 5G. Since artificial intelligence (AI) is the defining technology of our time, it is natural to ask what role it could play for 6G. While it is clear that 6G must cater to the needs of large distributed learning systems, it is less certain if AI will play a defining role in the design of 6G itself. The goal of this article is to paint a vision of a new air interface which is partially designed by AI to enable optimized communication schemes for any hardware, radio environment, and application.

application, protocol, receiver, (12 more...)

2012.08285

Country:

North America > United States > New York > Tompkins County > Ithaca (0.04)
North America > United States > Illinois (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(4 more...)

Genre: Research Report (0.64)

Industry: Telecommunications (0.93)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

A Reinforcement Learning Formulation of the Lyapunov Optimization: Application to Edge Computing Systems with Queue Stability

Bae, Sohee, Han, Seungyul, Sung, Youngchul

arXiv.org Artificial Intelligence, Dec-15-2020

In this paper, a deep reinforcement learning (DRL)-based approach to the Lyapunov optimization is considered to minimize the time-average penalty while maintaining queue stability. A proper construction of state and action spaces is provided to form a proper Markov decision process (MDP) for the Lyapunov optimization. A condition for the reward function of reinforcement learning (RL) for queue stability is derived. Based on the analysis and practical RL with reward discounting, a class of reward functions is proposed for the DRL-based approach to the Lyapunov optimization. The proposed DRL-based approach to the Lyapunov optimization does not require complicated optimization at each time step and operates with general non-convex and discontinuous penalty functions. Hence, it provides an alternative to the conventional drift-plus-penalty (DPP) algorithm for the Lyapunov optimization. The proposed DRL-based approach is applied to resource allocation in edge computing systems with queue stability, and numerical results demonstrate its successful operation.
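
For context, the conventional drift-plus-penalty (DPP) baseline that the abstract contrasts against can be sketched for a single queue. This is the textbook form of DPP, not the paper's DRL method: at each slot the controller picks the service rate `b` minimizing `V * penalty(b) - queue * b`, and the queue evolves as `max(queue - b, 0) + arrival`. The service rates, quadratic penalty, arrival process, and the value of `V` are all assumptions chosen for illustration.

```python
import random

def dpp_action(queue, actions, penalty, V):
    """Pick the service rate b minimizing V * penalty(b) - queue * b."""
    return min(actions, key=lambda b: V * penalty(b) - queue * b)

def simulate(steps=10000, V=10.0, seed=0):
    """Run DPP on one queue; return (time-average penalty, time-average backlog)."""
    rng = random.Random(seed)
    actions = [0, 1, 2, 3]            # hypothetical service rates per slot
    penalty = lambda b: b ** 2        # hypothetical power cost of serving at rate b
    queue, total_penalty, total_queue = 0.0, 0.0, 0.0
    for _ in range(steps):
        arrival = rng.choice([0, 1, 2])   # mean arrival 1, below the max service rate
        b = dpp_action(queue, actions, penalty, V)
        total_penalty += penalty(b)
        queue = max(queue - b, 0.0) + arrival   # standard queue update
        total_queue += queue
    return total_penalty / steps, total_queue / steps

avg_penalty, avg_queue = simulate()
```

The parameter `V` trades penalty against backlog: a larger `V` lets the queue grow longer before the controller pays for faster service. The per-slot minimization is trivial here because the action set is tiny; the abstract's point is that this step can become a hard optimization when the penalty is non-convex or discontinuous.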

edge node, node, ptq, (15 more...)

2012.07279

Country:

North America > United States > Massachusetts > Plymouth County > Hanover (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)

Genre: Research Report (0.69)

Industry:

Telecommunications (0.67)
Energy > Power Industry (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

BeBold: Exploration Beyond the Boundary of Explored Regions

Zhang, Tianjun, Xu, Huazhe, Wang, Xiaolong, Wu, Yi, Keutzer, Kurt, Gonzalez, Joseph E., Tian, Yuandong

arXiv.org Machine Learning, Dec-15-2020

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. To guide exploration, previous work makes extensive use of intrinsic reward (IR). There are many heuristics for IR, including visitation counts, curiosity, and state-difference. In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR. The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment. The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning. In comparison, the previous SoTA only solves 50% of the tasks. BeBold also achieves SoTA on multiple tasks in NetHack, a popular rogue-like game that contains more challenging procedurally-generated environments.
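
A minimal sketch of an intrinsic reward built from the difference of inverse visitation counts follows. It assumes that "regulated" means clipping the difference at zero, and it uses lifetime counts over hashable states; any further restrictions the paper applies are omitted, and the function and variable names are hypothetical.

```python
from collections import Counter

def intrinsic_reward(counts, state, next_state):
    """Clipped difference of inverse visitation counts: max(1/N(s') - 1/N(s), 0)."""
    return max(1.0 / counts[next_state] - 1.0 / counts[state], 0.0)

counts = Counter()
trajectory = ["a", "b", "a", "b", "c"]   # toy sequence of visited states
counts[trajectory[0]] += 1               # count the starting state

rewards = []
for s, s2 in zip(trajectory, trajectory[1:]):
    counts[s2] += 1                      # count the state just entered
    rewards.append(intrinsic_reward(counts, s, s2))

print(rewards)  # [0.0, 0.0, 0.0, 0.5]
```

Only the last step, which moves from the twice-visited state `b` to the never-visited state `c`, earns a reward. Moving back into well-explored states gives a negative difference that the clip turns into zero, so the agent is paid for crossing the frontier rather than for bouncing back and forth near it.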

arxiv preprint arxiv, bebold, exploration, (12 more...)

2012.08621

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment > Games (0.68)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.88)

Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Zhou, Dongruo, Gu, Quanquan, Szepesvari, Csaba

arXiv.org Machine Learning, Dec-15-2020

We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved in Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
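
For readers new to the setting, the linear mixture assumption mentioned in the abstract is commonly written as follows. The symbols $\mathbb{P}_i$ (the individual basis kernels) and $\theta^{*}$ (the unknown mixing parameter) are notation assumed here for illustration and may differ from the paper's own.

```latex
\mathbb{P}(s' \mid s, a) \;=\; \sum_{i=1}^{d} \theta^{*}_{i}\, \mathbb{P}_{i}(s' \mid s, a),
\qquad \theta^{*} \in \mathbb{R}^{d}
```

Under this reading, the agent knows the $d$ basis kernels, through the integration or sampling oracle, but not $\theta^{*}$, which is why the regret bounds scale with the dimension $d$ rather than with the number of states.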

algorithm, inequality hold, probability, (10 more...)

2012.08507

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > Canada > Alberta (0.14)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.84)

Adapting Behavior via Intrinsic Reward: A Survey and Empirical Study

Linke, Cam (University of Alberta) | Ady, Nadia M. (University of Alberta) | White, Martha (University of Alberta) | Degris, Thomas (DeepMind) | White, Adam (University of Alberta)

Journal of Artificial Intelligence Research, Dec-14-2020

Learning about many things can provide numerous benefits to a reinforcement learning system. For example, learning many auxiliary value functions, in addition to optimizing the environmental reward, appears to improve both exploration and representation learning. The question we tackle in this paper is how to sculpt the stream of experience—how to adapt the learning system’s behavior—to optimize the learning of a collection of value functions. A simple answer is to compute an intrinsic reward based on the statistics of each auxiliary learner, and use reinforcement learning to maximize that intrinsic reward. Unfortunately, implementing this simple idea has proven difficult, and thus has been the focus of decades of study. It remains unclear which of the many possible measures of learning would work well in a parallel learning setting where environmental reward is extremely sparse or absent. In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel-learning testbed. We discuss the interaction between reward and prediction learners and highlight the importance of introspective prediction learners: those that increase their rate of learning when progress is possible, and decrease when it is not. We provide a comprehensive empirical comparison of 14 different rewards, including well-known ideas from reinforcement learning and active learning. Our results highlight a simple but seemingly powerful principle: intrinsic rewards based on the amount of learning can generate useful behavior, if each individual learner is introspective.
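
The closing principle can be illustrated with a rough sketch that uses the absolute size of each update as an "amount of learning" reward. This is not the paper's testbed or any of its 14 rewards as specified there. The learner is a simple running estimate of a noisy target, and a `1/n` step size is used as a crude, assumed stand-in for an introspective learner that slows down when no further progress is possible.

```python
import random

def mean_weight_change(target, step_size, steps=500, seed=0):
    """Mean |weight change| over the last 100 updates of a single scalar learner."""
    rng = random.Random(seed)
    w, changes = 0.0, []
    for n in range(1, steps + 1):
        delta = step_size(n) * (target(rng) - w)   # incremental update toward the sample
        w += delta
        changes.append(abs(delta))                 # "amount of learning" on this step
    return sum(changes[-100:]) / 100

# A pure-noise target: beyond its mean, there is nothing left to learn.
noisy_target = lambda rng: rng.gauss(0.0, 1.0)

fixed = mean_weight_change(noisy_target, lambda n: 0.1)         # constant step size
decaying = mean_weight_change(noisy_target, lambda n: 1.0 / n)  # assumed "introspective" proxy
```

With a constant step size, the learner keeps chasing noise, so a reward based on weight change stays high and would keep attracting the agent to an unlearnable target. With the decaying step size, the same reward fades once the mean has been learned, which is the behavior the abstract attributes to introspective learners.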

learner, machine learning, reinforcement learning, (20 more...)

doi: 10.1613/jair.1.12087

AI Access Foundation

Country:

North America > Canada > Alberta (0.14)
North America > United States > Wisconsin (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Education (1.00)
Energy > Oil & Gas > Upstream (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Active Hierarchical Imitation and Reinforcement Learning

Niu, Yaru, Gu, Yijun

arXiv.org Artificial Intelligence, Dec-14-2020

Humans can leverage hierarchical structures to split a task into sub-tasks and solve problems efficiently. Both imitation and reinforcement learning, or a combination of them with hierarchical structures, have been shown to be an efficient way for robots to learn complex tasks with sparse rewards. However, in previous work on hierarchical imitation and reinforcement learning, the tested environments are relatively simple 2D games, and the action spaces are discrete. Furthermore, many imitation learning works focus on improving policies learned from expert policies that are hard-coded or trained by reinforcement learning algorithms, rather than from human experts. In human-robot interaction scenarios, humans can be required to provide demonstrations to teach the robot, so it is crucial to improve learning efficiency to reduce expert effort and to understand humans' perception of the learning/training process. In this project, we explored different imitation learning algorithms and designed active learning algorithms upon the hierarchical imitation and reinforcement learning framework we have developed. We performed an experiment where five participants were asked to guide a randomly initialized agent to a random goal in a maze. Our experimental results showed that using DAgger and a reward-based active learning method can achieve better performance while saving more human effort, both physical and mental, during the training process.

algorithm, learning, reinforcement learning, (12 more...)

2012.0733

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report > New Finding (0.55)

Industry:

Health & Medicine (0.46)
Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL

Zanette, Andrea

arXiv.org Artificial Intelligence, Dec-14-2020

Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive exponential information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function even if 1) realizability holds, 2) the batch algorithm observes the exact reward and transition functions, and 3) the batch algorithm is given the best a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts then the lower bounds hold even if all policies admit a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an online algorithm, showing that there exist RL problems where batch RL is exponentially harder than online RL even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample efficient RL with function approximation. A second corollary is the exponential separation between finite and infinite horizon batch problems under our assumptions. On a technical level, this work helps formalize the issue known as deadly triad and explains that the bootstrapping problem is potentially more severe than the extrapolation issue for RL because unlike the latter, bootstrapping cannot be mitigated by adding more samples.

algorithm, mdp, target policy, (16 more...)

2012.08005

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
