AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Finite-Sample Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation

Sun, Jun, Wang, Gang, Giannakis, Georgios B., Yang, Qinmin, Yang, Zaiyue

arXiv.org Machine LearningNov-3-2019

Thanks to its generality, RL has been widely studied in many areas, such as control theory, game theory, operations research, multi-agent systems, machine learning, artificial intelligence, and statistics [23]. In recent years, combining with deep learning, RL has demonstrated its great potential in addressing challenging practical control and optimization problems [17, 21]. Among all possible algorithms, the temporal difference (TD) learning has arguably become one of the most popular RL algorithms so far, which is further dominated by the celebrated TD(0) algorithm [22]. TD learning provides an iterative process to update an estimate of the so-termed value function v π(s) with respect to a given policy π based on temporally successive samples. Dealing with a finite state space, the classical version of the TD(0) algorithm adopts a tabular representation for v π(s), which stores entry-wise value estimates on a per state basis. J. Sun and Q. Yang are with the College of Control Science and Engineering, and the State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, China. G. Wang and G. B. Giannakis are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. Z. Yang is with the Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen, China.

algorithm, decentralized td, function approximation, (13 more...)

arXiv.org Machine Learning

1911.00934

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.88)
Asia > China > Zhejiang Province > Hangzhou (0.24)
Asia > China > Guangdong Province > Shenzhen (0.24)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.42)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Non-Cooperative Inverse Reinforcement Learning

Zhang, Xiangyuan, Zhang, Kaiqing, Miehling, Erik, Başar, Tamer

arXiv.org Artificial IntelligenceNov-3-2019

Making decisions in the presence of a strategic opponent requires one to take into account the opponent's ability to actively mask its intended objective. To describe such strategic situations, we introduce the non-cooperative inverse reinforcement learning (N-CIRL) formalism. The N-CIRL formalism consists of two agents with completely misaligned objectives, where only one of the agents knows the true objective function. Formally, we model the N-CIRL formalism as a zero-sum Markov game with one-sided incomplete information. Through interacting with the more informed player, the less informed player attempts to both infer, and act according to, the true objective function. As a result of the one-sided incomplete information, the multi-stage game can be decomposed into a sequence of single-stage games expressed by a recursive formula. Solving this recursive formula yields the value of the N-CIRL game and the more informed player's equilibrium strategy. Another recursive formula, constructed by forming an auxiliary game, termed the dual game, yields the less informed player's strategy. Building upon these two recursive formulas, we develop a computationally tractable algorithm to approximately solve for the equilibrium strategies. Finally, we demonstrate the benefits of our N-CIRL formalism over the existing multi-agent IRL formalism via extensive numerical simulation in a novel cyber security setting.

attacker, defender, information, (16 more...)

arXiv.org Artificial Intelligence

1911.0422

Country:

North America > United States > Illinois (0.04)
North America > Canada (0.04)
Asia > Malaysia (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

Add feedback

Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs

Zanette, Andrea, Brunskill, Emma

arXiv.org Artificial IntelligenceNov-3-2019

In order to make good decision under uncertainty an agent must learn from observations. To do so, two of the most common frameworks are Contextual Bandits and Markov Decision Processes (MDPs). In this paper, we study whether there exist algorithms for the more general framework (MDP) which automatically provide the best performance bounds for the specific problem at hand without user intervention and without modifying the algorithm. In particular, it is found that a very minor variant of a recently proposed reinforcement learning algorithm for MDPs already matches the best possible regret bound $\tilde O (\sqrt{SAT})$ in the dominant term if deployed on a tabular Contextual Bandit problem despite the agent being agnostic to such setting.

algorithm, failure event, identifying bandit structure, (14 more...)

arXiv.org Artificial Intelligence

1911.00954

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning

Cohen, Andrew, Yu, Lei, Qiao, Xingye, Tong, Xiangrong

arXiv.org Artificial IntelligenceNov-3-2019

Two hitherto disconnected threads of research, diverse exploration (DE) and maximum entropy RL have addressed a wide range of problems facing reinforcement learning algorithms via ostensibly distinct mechanisms. In this work, we identify a connection between these two approaches. First, a discriminator-based diversity objective is put forward and connected to commonly used divergence measures. We then extend this objective to the maximum entropy framework and propose an algorithm Maximum Entropy Diverse Exploration (MEDE) which provides a principled method to learn diverse behaviors. A theoretical investigation shows that the set of policies learned by MEDE capture the same modalities as the optimal maximum entropy policy. In effect, the proposed algorithm disentangles the maximum entropy policy into its diverse, constituent policies. Experiments show that MEDE is superior to the state of the art in learning high performing and diverse policies.

discriminator, international conference, objective, (11 more...)

arXiv.org Artificial Intelligence

1911.00828

Country: North America > United States > New York > Broome County > Binghamton (0.05)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Maximum Entropy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Part 4 -- Neural Network Q Learning, a Tic Tac Toe player that learns -- kind of

#artificialintelligenceNov-2-2019, 13:49:25 GMT

Not as good as I would have hoped given that mastering Tic Tac Toe is not a particularly hard challenge -- at least for a human, even of very young age.

board state, linear function, neural network, (14 more...)

#artificialintelligence

Industry: Leisure & Entertainment > Games > Tic-Tac-Toe (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

On Solving the 2-Dimensional Greedy Shooter Problem for UAVs

Anderson, Loren, Senapathy, Sahitya

arXiv.org Machine LearningNov-2-2019

Unmanned Aerial Vehicles (UAVs), autonomously-guided aircraft, are widely used for tasks involving surveillance and reconnaissance. A version of the pursuit-evasion problems centered around UAVs and its variants has been extensively studied in recent years due to numerous breakthroughs in AI. We present an approach to UAV pursuit-evasion in a 2D aerial-engagement environment using reinforcement learning (RL), a machine learning paradigm concerned with goal-oriented algorithms. In this work, a UAV wielding the greedy shooter strategy engages with a UAV trained using deep Q-learning techniques. Simulated results show that the latter UAV wins every engagement in which the UAVs are suffciently separated during initialization. This approach highlights an exhaustive and robust application of reinforcement learning to pursuit-evasion that provides insight into effective strategies for UAV flight and interaction.

agent, gs agent, rl agent, (15 more...)

arXiv.org Machine Learning

1911.01419

Country:

North America > United States > Minnesota (0.04)
North America > United States > Maryland > Anne Arundel County > Annapolis (0.04)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Games (0.95)
Information Technology (0.86)
Aerospace & Defense (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (0.54)

Add feedback

Learning from Trajectories via Subgoal Discovery

Paul, Sujoy, van Baar, Jeroen, Roy-Chowdhury, Amit K.

arXiv.org Machine LearningNov-2-2019

Learning to solve complex goal-oriented tasks with sparse terminal-only rewards often requires an enormous number of samples. In such cases, using a set of expert trajectories could help to learn faster. However, Imitation Learning (IL) via supervised pre-training with these trajectories may not perform as well and generally requires additional finetuning with expert-in-the-loop. In this paper, we propose an approach which uses the expert trajectories and learns to decompose the complex main task into smaller sub-goals. We learn a function which partitions the state-space into sub-goals, which can then be used to design an extrinsic reward function. We follow a strategy where the agent first learns from the trajectories using IL and then switches to Reinforcement Learning (RL) using the identified sub-goals, to alleviate the errors in the IL step. To deal with states which are under-represented by the trajectory set, we also learn a function to modulate the sub-goal predictions. We show that our method is able to solve complex goal-oriented tasks, which other RL, IL or their combinations in literature are not able to solve.

expert trajectory, reward function, trajectory, (14 more...)

arXiv.org Machine Learning

1911.07224

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California > Riverside County > Riverside (0.04)
North America > Canada (0.04)

Genre: Research Report (0.64)

Industry:

Automobiles & Trucks (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Reinforcement Learning Algorithms with Python: Learn, understand, and develop smart algorithms for addressing AI challenges: Andrea Lonza: 9781789131116: Amazon.com: Books

#artificialintelligenceNov-1-2019, 15:17:27 GMT

Andrea Lonza is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.

andrea lonza, develop smart algorithm, reinforcement learning algorithm, (3 more...)

#artificialintelligence

Industry: Retail > Online (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.39)

Add feedback

AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning

#artificialintelligenceNov-1-2019, 03:12:58 GMT

Exploration is another key challenge in complex environments such as StarCraft. There are up to 1026 possible actions available to one of our agents at each time step, and the agent must make thousands of actions before learning if it has won or lost the game. Finding winning strategies is challenging in such a massive solution space. Even with a strong self-play system and a diverse league of main and exploiter agents, there would be almost no chance of a system developing successful strategies in such a complex environment without some prior knowledge. Learning human strategies, and ensuring that the agents keep exploring those strategies throughout self-play, was key to unlocking AlphaStar's performance.

alphastar, grandmaster level, multi-agent reinforcement, (3 more...)

#artificialintelligence

Industry: Leisure & Entertainment > Games > Computer Games (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

Add feedback

Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Zanette, Andrea, Brandfonbrener, David, Pirotta, Matteo, Lazaric, Alessandro

arXiv.org Machine LearningNov-1-2019

A key challenge in reinforcement learning (RL) is how to bala nce exploration and exploitation in order to efficiently learn to make good sequences of decisions in a way that is both computationally tractable and statistically efficient. In the tabular case, the exploration-exploitation problem is well-understood for a number of settings (e.g., finite-hori zon, average reward, infinite horizon with discount), explorati on objectives (e.g., regret minimization and probably approximately correct), and for different algorithmic appro aches, where optimism-under-uncertainty [JOA10, FPLO18] and Thompson sampling (TS) [OBPVR16, Rus19] are the most pop ular principles. For instance, in the finite-horizon setting, [AOM17] and [ZB19] recently derived minimax optim al and structure adaptive regret bounds for optimistic exploration algorithms. TSbased algorithms have mainly b een analyzed in tabular MDPs in terms of Bayesian regret [OBPVR16, OR17, OGNJ17], which assumes that the MDP is s ampled from a known prior distribution. These bounds do not hold against a fixed MDP and algorithms with smal l Bayesian regret may still suffer high regret in some hard-to-learn MDPs within the chosen prior. In the tabu lar setting, frequentist (or worst-case) regret analysis h as been developed for TSbased algorithms both in the average r eward [GM15, AJ17] and finite-horizon case [Rus19]. Despite the fact that TSbased approaches have slightly wor se regret bounds compared to optimism-based algorithms, their empirical performance is often superior [CL11, OR17] . Unfortunately, the performance of tabular exploration met hods rapidly degrades with the number of states and actions, thus making them unfeasible in large or continuous MD Ps.

artificial intelligence, ps 1, upstream oil & gas, (20 more...)

arXiv.org Machine Learning

1911.00567

Country: North America > United States (0.14)

Genre: Research Report (0.63)

Industry: Energy > Oil & Gas > Upstream (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.34)

Add feedback