AITopics

2002.12292

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
Europe > Sweden > Stockholm > Stockholm (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry:

Education (0.67)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Machine LearningFeb-28-2020

Mixed Reinforcement Learning with Additive Stochastic Uncertainty

Mu, Yao, Li, Shengbo Eben, Liu, Chang, Sun, Qi, Nie, Bingbing, Cheng, Bo, Peng, Baiyu

Reinforcement learning (RL) methods often rely on massive exploration data to search optimal policies, and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm by simultaneously using dual representations of environmental dynamics to search the optimal policy with the purpose of improving both learning accuracy and training speed. The dual representations indicate the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of the mixed RL, the compensation of the additive stochastic model uncertainty is embedded inside the policy iteration RL framework by using explored state-action data via iterative Bayesian estimator (IBE). The optimal policy is then computed in an iterative way by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of the mixed RL is proved using the Bellman's principle of optimality, and the recursive stability of the generated policy is proved via the Lyapunov's direct method. The effectiveness of the mixed RL is demonstrated by a typical optimal control problem of stochastic non-affine nonlinear systems (i.e., double lane change task with an automated vehicle).

algorithm, mixed rl, university, (15 more...)

2003.00848

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > China > Beijing > Beijing (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
(8 more...)

Genre: Research Report (0.50)

Industry: Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.47)

Abachi, Romina, Ghavamzadeh, Mohammad, Farahmand, Amir-massoud

Policy-Aware Model Learning for Policy Gradient Methods

arXiv.org Artificial IntelligenceFeb-28-2020

A model-based reinforcement learning (MBRL) agent gradually learns a model of the environment as it interacts with it, and uses the learned model to plan and find a good policy. This can be done by planning with samples coming from the model, instead of or in addition to the samples from the environment, e.g., Sutton (1990); Peng & Williams (1993); Sutton et al. (2008); Deisenroth et al. (2015); Talvitie (2017); Ha & Schmidhuber (2018). If learning a model is easier than learning the policy or value function in a model-free manner, MBRL will lead to a reduction in the number of required interactions with the real-world and will improve the sample complexity of the agent. However, this is contingent on the ability of the agent to learn an accurate model of the real environment. Therefore, the problem of learning a good model of the environment is of paramount importance in the success of MBRL. This paper addresses the question of how we can approach the problem of learning a model of the environment, and proposes a method called policy-aware model learning (PAML). The conventional approach to model learning in MBRL is to learn a model that is a good predictor of the environment. If the learned model is accurate enough, this leads to a value function or a policy that is close to the optimal one. Learning a good predictive model can be achieved by minimizing some form of a probabilistic loss.

dimension, paml, value function, (14 more...)

2003.0003

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Greige, Laura, Chin, Peter

Reinforcement Learning in FlipIt

arXiv.org Artificial IntelligenceFeb-28-2020

Reinforcement learning has shown much success in games such as chess, backgammon and Go [1, 2, 3]. However, in most of these games, agents have full knowledge of the environment at all times. In this paper, we describe a deep learning model that successfully optimizes its score using reinforcement learning in a game with incomplete and imperfect information. We apply our model to FlipIt [4], a two-player game in which both players, the attacker and the defender, compete for ownership of a shared resource and only receive information on the current state (such as the current owner of the resource, or the time since the opponent last moved, etc.) upon making a move. Our model is a deep neural network combined with Q-learning and is trained to maximize the defender's time of ownership of the resource. Despite the imperfect observations, our model successfully learns an optimal cost-effective counter-strategy and shows the advantages of the use of deep reinforcement learning in game theoretic scenarios. Our results show that it outperforms the Greedy strategy against distributions such as periodic and exponential distributions without any prior knowledge of the opponent's strategy, and we generalize the model to n-player games.

agent, flipit, opponent, (15 more...)

2002.12909

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.54)

Industry:

Information Technology > Security & Privacy (1.00)
Leisure & Entertainment > Games > Backgammon (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Tschantz, Alexander, Millidge, Beren, Seth, Anil K., Buckley, Christopher L.

Reinforcement Learning through Active Inference

arXiv.org Artificial IntelligenceFeb-28-2020

The central tenet of reinforcement learning (RL) is that agents seek to maximize the sum of cumulative rewards. In contrast, active inference, an emerging framework within cognitive and computational neuroscience, proposes that agents act to maximize the evidence for a biased generative model. Here, we illustrate how ideas from active inference can augment traditional RL approaches by (i) furnishing an inherent balance of exploration and exploitation, and (ii) providing a more flexible conceptualization of reward. Inspired by active inference, we develop and implement a novel objective for decision making, which we term the free energy of the expected future. We demonstrate that the resulting algorithm successfully balances exploration and exploitation, simultaneously achieving robust performance on several challenging RL benchmarks with sparse, well-shaped, and no rewards.

active inference, bridging ai and cognitive science, inference, (10 more...)

2002.12636

Country: Europe > United Kingdom > England > East Sussex > Brighton (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Wang, Chi-Hua, Cheng, Guang

Online Batch Decision-Making with High-Dimensional Covariates

We propose and investigate a class of new algorithms for sequential decision making that interacts with \textit{a batch of users} simultaneously instead of \textit{a user} at each decision epoch. This type of batch models is motivated by interactive marketing and clinical trial, where a group of people are treated simultaneously and the outcomes of the whole group are collected before the next stage of decision. In such a scenario, our goal is to allocate a batch of treatments to maximize treatment efficacy based on observed high-dimensional user covariates. We deliver a solution, named \textit{Teamwork LASSO Bandit algorithm}, that resolves a batch version of explore-exploit dilemma via switching between teamwork stage and selfish stage during the whole decision process. This is made possible based on statistical properties of LASSO estimate of treatment efficacy that adapts to a sequence of batch observations. In general, a rate of optimal allocation condition is proposed to delineate the exploration and exploitation trade-off on the data collection scheme, which is sufficient for LASSO to identify the optimal treatment for observed user covariates. An upper bound on expected cumulative regret of the proposed algorithm is provided.

health & medicine, inequality, upstream oil & gas, (20 more...)

2002.09438

Country: Europe > Italy (0.14)

Genre:

Research Report > Experimental Study (0.34)
Research Report > New Finding (0.34)

Industry:

Energy > Oil & Gas > Upstream (0.48)
Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.90)
Information Technology > Data Science > Data Mining > Big Data (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Liu, Anji, Liang, Yitao, Broeck, Guy Van den

Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Off-policy reinforcement learning (RL) is concerned with learning a rewarding policy by executing another policy that gathers samples of experience. While the former policy (i.e. target policy) is rewarding but in-expressive (in most cases, deterministic), doing well in the latter task, in contrast, requires an expressive policy (i.e. behavior policy) that offers guided and effective exploration. Contrary to most methods that make a trade-off between optimality and expressiveness, disentangled frameworks explicitly decouple the two objectives, which each is dealt with by a distinct separate policy. Although being able to freely design and optimize the two policies with respect to their own objectives, naively disentangling them can lead to inefficient learning or stability issues. To mitigate this problem, our proposed method Analogous Disentangled Actor-Critic (ADAC) designs analogous pairs of actors and critics. Specifically, ADAC leverages a key property about Stein variational gradient descent (SVGD) to constraint the expressive energy-based behavior policy with respect to the target one for effective exploration. Additionally, an analogous critic pair is introduced to incorporate intrinsic rewards in a principled manner, with theoretical guarantees on the overall learning stability and effectiveness. We empirically evaluate environment-reward-only ADAC on 14 continuous-control tasks and report the state-of-the-art on 10 of them. We further demonstrate ADAC, when paired with intrinsic rewards, outperform alternatives in exploration-challenging tasks.

adac, artificial intelligence, upstream oil & gas, (20 more...)

2002.10738

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Brazdil, Tomas, Chatterjee, Krishnendu, Novotny, Petr, Vahala, Jiri

Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

arXiv.org Artificial IntelligenceFeb-27-2020

Markov decision processes (MDPs) are the defacto frame-work for sequential decision making in the presence ofstochastic uncertainty. A classical optimization criterion forMDPs is to maximize the expected discounted-sum pay-off, which ignores low probability catastrophic events withhighly negative impact on the system. On the other hand,risk-averse policies require the probability of undesirableevents to be below a given threshold, but they do not accountfor optimization of the expected payoff. We consider MDPswith discounted-sum payoff with failure states which repre-sent catastrophic outcomes. The objective of risk-constrainedplanning is to maximize the expected discounted-sum payoffamong risk-averse policies that ensure the probability to en-counter a failure state is below a desired threshold. Our maincontribution is an efficient risk-constrained planning algo-rithm that combines UCT-like search with a predictor learnedthrough interaction with the MDP (in the style of AlphaZero)and with a risk-constrained action selection via linear pro-gramming. We demonstrate the effectiveness of our approachwith experiments on classical MDPs from the literature, in-cluding benchmarks with an order of 10^6 states.

machine learning, payoff, reinforcement learning, (18 more...)

2002.12086

Country:

Europe > Austria (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games

Hughes, Edward, Anthony, Thomas W., Eccles, Tom, Leibo, Joel Z., Balduzzi, David, Bachrach, Yoram

Zero-sum games have long guided artificial intelligence research, since they possess both a rich strategy space of best-responses and a clear evaluation metric. What's more, competition is a vital mechanism in many real-world multi-agent systems capable of generating intelligent innovations: Darwinian evolution, the market economy and the AlphaZero algorithm, to name a few. In two-player zero-sum games, the challenge is usually viewed as finding Nash equilibrium strategies, safeguarding against exploitation regardless of the opponent. While this captures the intricacies of chess or Go, it avoids the notion of cooperation with co-players, a hallmark of the major transitions leading from unicellular organisms to human civilization. Beyond two players, alliance formation often confers an advantage; however this requires trust, namely the promise of mutual cooperation in the face of incentives to defect. Successful play therefore requires adaptation to co-players rather than the pursuit of non-exploitability. Here we argue that a systematic study of many-player zero-sum games is a crucial element of artificial intelligence research. Using symmetric zero-sum matrix games, we demonstrate formally that alliance formation may be seen as a social dilemma, and empirically that na\"ive multi-agent reinforcement learning therefore fails to form alliances. We introduce a toy model of economic competition, and show how reinforcement learning may be augmented with a peer-to-peer contract mechanism to discover and enforce alliances. Finally, we generalize our agent model to incorporate temporally-extended contracts, presenting opportunities for further work.

agent, contract, dilemma, (13 more...)

2003.00799

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.05)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Leisure & Entertainment > Games > Chess (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

van der Pol, Elise, Kipf, Thomas, Oliehoek, Frans A., Welling, Max

Plannable Approximations to MDP Homomorphisms: Equivariance under Actions

This work exploits action equivariance for representation learning in reinforcement learning. Equivariance under actions states that transitions in the input space are mirrored by equivalent transitions in latent space, while the map and transition functions should also commute. We introduce a contrastive loss function that enforces action equivariance on the learned representations. We prove that when our loss is zero, we have a homomorphism of a deterministic Markov Decision Process (MDP). Learning equivariant maps leads to structured latent spaces, allowing us to build a model on which we plan through value iteration. We show experimentally that for deterministic MDPs, the optimal policy in the abstract MDP can be successfully lifted to the original MDP. Moreover, the approach easily adapts to changes in the goal states. Empirically, we show that in such MDPs, we obtain better representations in fewer epochs compared to representation learning approaches using reconstructions, while generalizing better to new goals than model-free approaches.

international conference, learning, representation, (12 more...)

2002.11963

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
North America > United States > Massachusetts (0.04)
Europe > Netherlands > South Holland > Delft (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.49)