AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Online Apprenticeship Learning

Shani, Lior, Zahavy, Tom, Mannor, Shie

arXiv.org Machine LearningFeb-13-2021

In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms: one for policy optimization and another for learning the worst case cost. To this end, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, and an additional linear error term that depends on the amount of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical compared to prior AL methods. Finally, we implement a deep variant of our algorithm which shares some similarities to GAIL \cite{ho2016generative}, but where the discriminator is replaced with the costs learned by the OAL problem. Our simulations demonstrate our theoretically grounded approach outperforms the baselines.

algorithm, iteration, optimization, (16 more...)

arXiv.org Machine Learning

2102.06924

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Education > Educational Setting > Online (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

Interactive Learning from Activity Description

Nguyen, Khanh, Misra, Dipendra, Schapire, Robert, Dudík, Miro, Shafto, Patrick

arXiv.org Artificial IntelligenceFeb-13-2021

We present a novel interactive learning protocol that enables training request-fulfilling agents by verbally describing their activities. Our protocol gives rise to a new family of interactive learning algorithms that offer complementary advantages against traditional algorithms like imitation learning (IL) and reinforcement learning (RL). We develop an algorithm that practically implements this protocol and employ it to train agents in two challenging request-fulfilling problems using purely language-description feedback. Empirical results demonstrate the strengths of our algorithm: compared to RL baselines, it is more sample-efficient; compared to IL baselines, it achieves competitive success rates while not requiring feedback providers to have agent-specific expertise. We also provide theoretical guarantees of the algorithm under certain assumptions on the teacher and the environment.

agent, execution, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2102.07024

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(14 more...)

Genre: Research Report > New Finding (0.87)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

Add feedback

Modelling Cooperation in Network Games with Spatio-Temporal Complexity

Bakker, Michiel A., Everett, Richard, Weidinger, Laura, Gabriel, Iason, Isaac, William S., Leibo, Joel Z., Hughes, Edward

arXiv.org Artificial IntelligenceFeb-13-2021

The real world is awash with multi-agent problems that require collective action by self-interested agents, from the routing of packets across a computer network to the management of irrigation systems. Such systems have local incentives for individuals, whose behavior has an impact on the global outcome for the group. Given appropriate mechanisms describing agent interaction, groups may achieve socially beneficial outcomes, even in the face of short-term selfish incentives. In many cases, collective action problems possess an underlying graph structure, whose topology crucially determines the relationship between local decisions and emergent global effects. Such scenarios have received great attention through the lens of network games. However, this abstraction typically collapses important dimensions, such as geometry and time, relevant to the design of mechanisms promoting cooperation. In parallel work, multi-agent deep reinforcement learning has shown great promise in modelling the emergence of self-organized cooperation in complex gridworld domains. Here we apply this paradigm in graph-structured collective action problems. Using multi-agent deep reinforcement learning, we simulate an agent society for a variety of plausible mechanisms, finding clear transitions between different equilibria over time. We define analytic tools inspired by related literatures to measure the social outcomes, and use these to draw conclusions about the efficacy of different environmental interventions. Our methods have implications for mechanism design in both human and artificial agent systems.

agent, processing center, supply chain, (16 more...)

arXiv.org Artificial Intelligence

2102.06911

Country:

North America > United States > Virginia (0.04)
North America > United States > Michigan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.34)

Add feedback

Improved Corruption Robust Algorithms for Episodic Reinforcement Learning

Chen, Yifang, Du, Simon S., Jamieson, Kevin

arXiv.org Artificial IntelligenceFeb-13-2021

The majority of the literature in learning in MDPs studies stationary environments, where the underlying unknown We study episodic reinforcement learning under transition function and reward function are fixed. The rewards unknown adversarial corruptions in both and the next states are independently and identically the rewards and the transition probabilities of distributed given the current state and the learner's chosen the underlying system. We propose new algorithms action. Under this setting, the goal is to minimize which, compared to the existing results the regret, which is the difference between the learner's in (Lykouris et al., 2020), achieve strictly better cumulative rewards and the total rewards of the optimal regret bounds in terms of total corruptions for policy (Brafman & Tennenholtz, 2002; Azar et al., 2017; the tabular setting. To be specific, firstly, our Jin et al., 2018; Ok et al., 2018; Zanette & Brunskill, 2019; regret bounds depend on more precise numerical Simchowitz & Jamieson, 2019; Zhang et al., 2020). However, values of total rewards corruptions and transition these techniques are vulnerable to corruptions on the corruptions, instead of only on the total rewards or the transitions.

improved corruption robust algorithm, probability, transition function, (8 more...)

arXiv.org Artificial Intelligence

2102.06875

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.86)

Add feedback

Disturbing Reinforcement Learning Agents with Corrupted Rewards

Majadas, Rubén, García, Javier, Fernández, Fernando

arXiv.org Artificial IntelligenceFeb-12-2021

Reinforcement Learning (RL) algorithms have led to recent successes in solving complex games, such as Atari or Starcraft, and to a huge impact in real-world applications, such as cybersecurity or autonomous driving. In the side of the drawbacks, recent works have shown how the performance of RL algorithms decreases under the influence of soft changes in the reward function. However, little work has been done about how sensitive these disturbances are depending on the aggressiveness of the attack and the learning exploration strategy. In this paper, we propose to fill this gap in the literature analyzing the effects of different attack strategies based on reward perturbations, and studying the effect in the learner depending on its exploration strategy. In order to explain all the behaviors, we choose a sub-class of MDPs: episodic, stochastic goal-only-rewards MDPs, and in particular, an intelligible grid domain as a benchmark. In this domain, we demonstrate that smoothly crafting adversarial rewards are able to mislead the learner, and that using low exploration probability values, the policy learned is more robust to corrupt rewards. Finally, in the proposed learning scenario, a counterintuitive result arises: attacking at each learning episode is the lowest cost attack strategy.

computer game, probability, upstream oil & gas, (19 more...)

arXiv.org Artificial Intelligence

2102.06587

Country:

Europe > Spain (0.14)
Oceania > Australia (0.14)

Genre: Research Report (0.82)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.90)
Energy > Oil & Gas > Upstream (0.71)
Leisure & Entertainment > Games > Computer Games (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning

Li, Gen, Cai, Changxiao, Chen, Yuxin, Gu, Yuantao, Wei, Yuting, Chi, Yuejie

arXiv.org Machine LearningFeb-12-2021

Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (such that independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made recently towards understanding the sample efficiency of Q-learning. To yield an entrywise $\varepsilon$-accurate estimate of the optimal Q-function, state-of-the-art theory requires at least an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$ samples for a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$. In this work, we sharpen the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to some logarithmic factor) for any $0<\varepsilon <1$, leading to an order-wise improvement in terms of the effective horizon $\frac{1}{1-\gamma}$. Analogous results are derived for finite-horizon MDPs as well. Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage. A key ingredient of our analysis lies in the establishment of novel error decompositions and recursions, which might shed light on how to analyze finite-sample performance of other Q-learning variants.

log 4, q-learning, sample complexity, (13 more...)

arXiv.org Machine Learning

2102.06548

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.54)

Add feedback

LTL2Action: Generalizing LTL Instructions for Multi-Task RL

Vaezipoor, Pashootan, Li, Andrew, Icarte, Rodrigo Toro, McIlraith, Sheila

arXiv.org Artificial IntelligenceFeb-12-2021

We address the problem of teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments. We employ a well-known formal language -- linear temporal logic (LTL) -- to specify instructions, using a domain-specific vocabulary. We propose a novel approach to learning that exploits the compositional syntax and the semantics of LTL, enabling our RL agent to learn task-conditioned policies that generalize to new instructions, not observed during training. The expressive power of LTL supports the specification of a diversity of complex temporally extended behaviours that include conditionals and alternative realizations. Experiments on discrete and continuous domains demonstrate the strength of our approach in learning to solve (unseen) tasks, given LTL instructions.

agent, instruction, proceedings, (12 more...)

arXiv.org Artificial Intelligence

2102.06858

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Republic of Türkiye > Aksaray Province > Aksaray (0.04)

Genre: Research Report (1.00)

Industry: Education (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Equilibrium Inverse Reinforcement Learning for Ride-hailing Vehicle Network

Oda, Takuma

arXiv.org Artificial IntelligenceFeb-12-2021

Ubiquitous mobile computing have enabled ride-hailing services to collect vast amounts of behavioral data of riders and drivers and optimize supply and demand matching in real time. While these mobility service providers have some degree of control over the market by assigning vehicles to requests, they need to deal with the uncertainty arising from self-interested driver behavior since workers are usually free to drive when they are not assigned tasks. In this work, we formulate the problem of passenger-vehicle matching in a sparsely connected graph and proposed an algorithm to derive an equilibrium policy in a multi-agent environment. Our framework combines value iteration methods to estimate the optimal policy given expected state visitation and policy propagation to compute multi-agent state visitation frequencies. Furthermore, we developed a method to learn the driver's reward function transferable to an environment with significantly different dynamics from training data. We evaluated the robustness to changes in spatio-temporal supply-demand distributions and deterioration in data quality using a real-world taxi trajectory dataset; our approach significantly outperforms several baselines in terms of imitation accuracy. The computational time required to obtain an equilibrium policy shared by all vehicles does not depend on the number of agents, and even on the scale of real-world services, it takes only a few seconds on a single CPU.

agent, reward function, visitation count, (12 more...)

arXiv.org Artificial Intelligence

2102.06854

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.64)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)

Add feedback

Robust and Efficient Planning using Adaptive Entropy Tree Search

Kozakowski, Piotr, Pacek, Mikołaj, Miłoś, Piotr

arXiv.org Artificial IntelligenceFeb-12-2021

In this paper, we present the Adaptive EntropyTree Search (ANTS) algorithm. ANTS builds on recent successes of maximum entropy planning while mitigating its arguably major drawback - sensitivity to the temperature setting. We endow ANTS with a mechanism, which adapts the temperature to match a given range of action selection entropy in the nodes of the planning tree. With this mechanism, the ANTS planner enjoys remarkable hyper-parameter robustness, achieves high scores on the Atari benchmark, and is a capable component of a planning-learning loop akin to AlphaZero. We believe that all these features make ANTS a compelling choice for a general planner for complex tasks.

algorithm, node, robust and efficient planning, (13 more...)

arXiv.org Artificial Intelligence

2102.06808

Country:

North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.85)

Add feedback

Reinforcement Learning For Data Poisoning on Graph Neural Networks

Dineen, Jacob, Haque, A S M Ahsan-Ul, Bielskas, Matthew

arXiv.org Artificial IntelligenceFeb-12-2021

Adversarial Machine Learning has emerged as a substantial subfield of Computer Science due to a lack of robustness in the models we train along with crowdsourcing practices that enable attackers to tamper with data. In the last two years, interest has surged in adversarial attacks on graphs yet the Graph Classification setting remains nearly untouched. Since a Graph Classification dataset consists of discrete graphs with class labels, related work has forgone direct gradient optimization in favor of an indirect Reinforcement Learning approach. We will study the novel problem of Data Poisoning (training time) attack on Neural Networks for Graph Classification using Reinforcement Learning Agents.

algorithm, graph, learning, (14 more...)

arXiv.org Artificial Intelligence

2102.068

Country:

North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.51)

Industry: Information Technology > Security & Privacy (0.36)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback