 Reinforcement Learning


Efficiently Solving MDPs with Stochastic Mirror Descent

arXiv.org Machine Learning

Markov decision processes (MDPs) are a fundamental mathematical abstraction for sequential decision making under uncertainty and they serve as a basic modeling tool in reinforcement learning (RL) and stochastic control [5, 24, 30]. Two prominent classes of MDPs are average-reward MDPs (AMDPs) and discounted MDPs (DMDPs). Each has been studied extensively; AMDPs are applicable to optimal control, learning automata, and various real-world reinforcement learning settings [17, 3, 22] and DMDPs have a number of nice theoretical properties including reward convergence and operator monotonicity [6]. In this paper we consider the prevalent computational learning problem of finding an approximately optimal policy of an MDP given only restricted access to the model. In particular, we consider the problem of computing an ɛ-optimal policy, i.e. a policy with an additive ɛ error in expected cumulative reward over infinite horizon, under the standard assumption of a generative model [14, 13], which allows one to sample from state-transitions given the current state-action pair. This problem is well-studied and there are multiple known upper and lower bounds on its sample complexity [4, 32, 28, 31]. In this work, we provide a unified framework based on primal-dual stochastic mirror descent (SMD) for learning ɛ-optimal policies for both AMDPs and DMDPs with a generative model.


Reinforcement Learning with Quantum Variational Circuits

arXiv.org Machine Learning

The development of quantum computational techniques has advanced greatly in recent years, parallel to the advancements in techniques for deep reinforcement learning. This work explores the potential for quantum computing to facilitate reinforcement learning problems. Quantum computing approaches offer important potential improvements in time and space complexity over traditional algorithms because of their ability to exploit the quantum phenomena of superposition and entanglement. Specifically, we investigate the use of quantum variational circuits, a form of quantum machine learning. We present our techniques for encoding classical data for a quantum variational circuit, and we further explore pure and hybrid quantum algorithms for DQN and Double DQN. Our results indicate that both hybrid and pure quantum variational circuits have the ability to solve reinforcement learning tasks with a smaller parameter space. These comparisons are conducted with two OpenAI Gym environments: CartPole and Blackjack. The success of this work is indicative of a strong future relationship between quantum machine learning and deep reinforcement learning.


Delay-Aware Multi-Agent Reinforcement Learning for Cooperative and Competitive Environments

arXiv.org Machine Learning

Action and observation delays are prevalent in real-world cyber-physical systems and may pose challenges in reinforcement learning design. Handling them is particularly arduous in multi-agent systems, where the delay of one agent could spread to other agents. To resolve this problem, this paper proposes a novel framework to deal with delays as well as the non-stationary training issue of multi-agent tasks with model-free deep reinforcement learning. We formally define the Delay-Aware Markov Game that incorporates the delays of all agents in the environment. To solve Delay-Aware Markov Games, we apply centralized training and decentralized execution that allows agents to use extra information to ease the non-stationarity issue of the multi-agent systems during training, without the need for a centralized controller during execution. Experiments are conducted in multi-agent particle environments including cooperative communication, cooperative navigation, and competitive experiments. We also test the proposed algorithm in traffic scenarios that require coordination of all autonomous vehicles to show the practical value of delay-awareness. Results show that the proposed delay-aware multi-agent reinforcement learning algorithm greatly alleviates the performance degradation introduced by delay. Code and demo videos are available at: https://github.com/baimingc/delay-aware-MARL.


Sample Efficiency in Sparse Reinforcement Learning: Or Your Money Back

arXiv.org Artificial Intelligence

Sparse rewards present a difficult problem in reinforcement learning and may be inevitable in certain domains with complex dynamics such as real-world robotics. Hindsight Experience Replay (HER) is a recent replay memory development that allows agents to learn in sparse settings by altering memories to show them as successful even though they may not be. While, empirically, HER has shown some success, it does not provide guarantees around the makeup of samples drawn from an agent's replay memory. This may result in minibatches that contain only memories with zero-valued rewards or agents learning an undesirable policy that completes HER-adjusted goals instead of the actual goal. In this paper, we introduce Or Your Money Back (OYMB), a replay memory sampler designed to work with HER. OYMB improves training efficiency in sparse settings by providing a direct interface to the agent's replay memory that allows for control over minibatch makeup, as well as a preferential lookup scheme that prioritizes real-goal memories before HER-adjusted memories. We test our approach on five tasks across three unique environments. Our results show that using HER in combination with OYMB outperforms using HER alone and leads to agents that learn to complete the real goal more quickly.
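The sampling idea described above can be illustrated with a minimal sketch. This is not the OYMB implementation: the `Memory` record, the `sample_minibatch` function, and the `min_rewarded` parameter are hypothetical names chosen for illustration, and a memory is assumed to be "rewarded" when its reward is nonzero. The sketch reserves part of each minibatch for rewarded memories, taking real-goal memories before HER-adjusted ones, and fills the remainder uniformly at random.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    """One replay transition (hypothetical record, for illustration only)."""
    reward: float
    her_adjusted: bool  # True if HER relabeled the goal of this memory


def sample_minibatch(buffer: List[Memory], batch_size: int, min_rewarded: int,
                     rng: Optional[random.Random] = None) -> List[Memory]:
    """Draw a minibatch with up to `min_rewarded` nonzero-reward memories,
    preferring real-goal memories over HER-adjusted ones."""
    rng = rng or random.Random()
    if batch_size > len(buffer):
        raise ValueError("batch_size exceeds buffer size")

    # Split rewarded memories by origin so real-goal ones can be taken first.
    real = [m for m in buffer if m.reward != 0 and not m.her_adjusted]
    her = [m for m in buffer if m.reward != 0 and m.her_adjusted]
    rng.shuffle(real)
    rng.shuffle(her)

    # Preferential lookup: real-goal memories come before HER-adjusted ones.
    chosen = (real + her)[:min(min_rewarded, batch_size)]

    # Fill the rest of the batch uniformly from memories not yet chosen.
    chosen_ids = {id(m) for m in chosen}
    rest = [m for m in buffer if id(m) not in chosen_ids]
    chosen += rng.sample(rest, batch_size - len(chosen))
    return chosen
```

With such a sampler, a batch can no longer consist solely of zero-reward memories whenever the buffer holds at least one rewarded memory and `min_rewarded` is positive.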


Neural mechanisms resolving exploitation-exploration dilemmas in the medial prefrontal cortex

Science

Successful behavior in an uncertain, changing, and open-ended environment critically relies on the ability to decide between continuing with the ongoing strategy or exploring new options. Neuroimaging studies have shown that the human medial prefrontal cortex (mPFC) is the part of the brain that primarily deals with this dilemma. However, the contribution of the different mPFC regions remains largely unknown. Domenech et al. recorded neuronal activity in six epileptic patients with depth electrodes in this brain area (see the Perspective by Steixner-Kumar and Gläscher). The ventral mPFC inferred the reliability of the ongoing action plan according to action outcomes. It proactively flagged outcomes either as learning signals to better exploit this plan or as potential triggers to explore new ones. The dorsal mPFC then evaluated action outcomes and generated an adaptive behavioral strategy.

Science, this issue p. [eabb0184][1]; see also p. [1056][2]

### INTRODUCTION

Everyday life often requires arbitrating between pursuing an ongoing action plan by possibly adjusting it versus exploring new action plans instead. Resolving this so-called exploitation-exploration dilemma is critical to gradually build a repertoire of action plans for efficient adaptive behavior in uncertain, changing, and open-ended everyday environments. Previous studies have shown that its resolution primarily involves the medial prefrontal cortex (mPFC). Human functional magnetic resonance imaging shows that activations in the ventromedial PFC (vmPFC) reflect the subjective value of the ongoing plan according to action outcomes, whereas the dorsomedial PFC (dmPFC) exhibits activations when this value drops and the plan is abandoned for exploring new ones. However, the neural mechanisms that resolve the dilemma and make the decision to exploit versus explore remain largely unknown.

### RATIONALE

We addressed this issue by recording neuronal activity in participants using intracranial electroencephalography while they were performing a task that induced systematic exploitation-exploration dilemmas in an uncertain, changing, and open-ended environment. Participants were six epileptic patients with electrodes implanted in the vmPFC and dmPFC (see the figure), who were eventually diagnosed with temporal or parietal lobe epilepsy with no impacts in the PFC. Using computational modeling, we identified from participants’ behavior the so-called stay trials, when participants adjusted and exploited their ongoing action plan through reinforcement learning, and the switch trials, when action outcomes instead led participants to covertly switch away from this plan and explore new ones in the following trials. We then analyzed vmPFC and dmPFC neural activity in both stay and switch trials.

### RESULTS

vmPFC neural activity in the high-gamma frequency band (>50 Hz) that reflects local processing was found to encode outcome expectations after action selection. This vmPFC high-gamma activity further encoded the prior and posterior reliability of the ongoing action plan relative to action outcomes, which, according to the computational model, subserved the arbitration between exploiting and exploring. Notably, this reliability encoding yielded vmPFC activity to proactively flag forthcoming action outcomes as potential triggers to explore rather than as learning signals to exploit. Preceding the occurrence of action outcomes, switch trials, unlike stay trials, witnessed an increased neural activity in the beta frequency band (13 to 30 Hz) that reflects top-down neural processing (see the figure). Following action outcomes in switch compared with stay trials, dmPFC neural activity then decreased in the theta frequency band (4 to 8 Hz), which indicates that the dmPFC was then configured to respond to action outcomes according to this vmPFC proactive construct.

In stay trials, outcome expectations encoded in the vmPFC were transmitted to the dmPFC, so that from 300 ms after action outcomes, dmPFC neural activity in the high-gamma frequency band encoded the reward prediction error (i.e., the discrepancy between expected and actual outcomes that scales reinforcement learning). In switch trials, by contrast, this encoding was disrupted through reconfiguring dmPFC activity in the alpha frequency band (8 to 12 Hz) to release the inhibition bearing upon alternative action plans from 250 ms after action outcomes.

### CONCLUSION

The medial PFC resolves exploitation-exploration dilemmas through a predictive coding mechanism that was originally proposed for perception. The vmPFC monitors the reliability of the ongoing action plan to proactively set the functional signification of forthcoming action outcomes as either learning signals to exploit or potential triggers to explore. The dmPFC responds to action outcomes according to this functional construct, yielding to either stay and adjust the ongoing plan through reinforcement learning or switch away from this plan to explore new ones. This predictive coding mechanism has the advantage of speeding up the abandonment of ongoing action plans and preventing action outcomes that trigger exploration from inappropriately acting as learning signals. These findings support the idea that predictive coding also operates within the prefrontal executive system and constitutes a general mechanism that underlies information processing across the cerebral cortex. In perceptual neural systems, predictive coding operates so that observers’ prior beliefs about a scene alter how they perceive the scene. Our findings suggest that within the prefrontal executive system, predictive coding operates by proactively altering the functional signification of behavioral events according to the agents’ beliefs about their own behavior.

Figure: Action outcomes triggering exploration. Neural activity around outcome onsets in switch compared with stay trials recorded in ventromedial (orange, vmPFC) and dorsomedial (blue, dmPFC) prefrontal electrodes implanted in the six patients. Electrode localizations are shown on a canonical sagittal brain slice [Montreal Neurological Institute (MNI) coordinate: x = −10], and neural activity is shown against time according to its spectral decomposition. vmPFC activity reflecting top-down neural processing increased and proactively flagged action outcomes as potential triggers to explore rather than as learning signals to exploit. dmPFC activity followed action outcomes triggering exploration through reconfiguring neural processing. Stim, stimulus.

Everyday life often requires arbitrating between pursuing an ongoing action plan by possibly adjusting it versus exploring a new action plan instead. Resolving this so-called exploitation-exploration dilemma involves the medial prefrontal cortex (mPFC). Using human intracranial electrophysiological recordings, we discovered that neural activity in the ventral mPFC infers and tracks the reliability of the ongoing plan to proactively encode upcoming action outcomes as either learning signals or potential triggers to explore new plans. By contrast, the dorsal mPFC exhibits neural responses to action outcomes, which results in either improving or abandoning the ongoing plan. Thus, the mPFC resolves the exploitation-exploration dilemma through a two-stage, predictive coding process: a proactive ventromedial stage that constructs the functional signification of upcoming action outcomes and a reactive dorsomedial stage that guides behavior in response to action outcomes.

[1]: /lookup/doi/10.1126/science.abb0184
[2]: /lookup/doi/10.1126/science.abd7258


Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI

arXiv.org Artificial Intelligence

Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI.


Is Deep Reinforcement Learning Ready for Practical Applications in Healthcare? A Sensitivity Analysis of Duel-DDQN for Hemodynamic Management in Sepsis Patients

arXiv.org Machine Learning

The potential of Reinforcement Learning (RL) has been demonstrated through successful applications to games such as Go and Atari. However, while it is straightforward to evaluate the performance of an RL algorithm in a game setting by simply using it to play the game, evaluation is a major challenge in clinical settings where it could be unsafe to follow RL policies in practice. Thus, understanding the sensitivity of RL policies to the host of decisions made during implementation is an important step toward building the type of trust in RL required for eventual clinical uptake. In this work, we perform a sensitivity analysis on a state-of-the-art RL algorithm (Dueling Double Deep Q-Networks) applied to hemodynamic stabilization treatment strategies for septic patients in the ICU. We consider the sensitivity of learned policies to input features, embedding model architecture, time discretization, reward function, and random seeds. We find that varying these settings can significantly impact learned policies, which suggests a need for caution when interpreting RL agent output.


learn2learn: A Library for Meta-Learning Research

arXiv.org Machine Learning

Meta-learning researchers face two fundamental issues in their empirical work: prototyping and reproducibility. Researchers are prone to make mistakes when prototyping new algorithms and tasks because modern meta-learning methods rely on unconventional functionalities of machine learning frameworks. In turn, reproducing existing results becomes a tedious endeavour -- a situation exacerbated by the lack of standardized implementations and benchmarks. As a result, researchers spend inordinate amounts of time on implementing software rather than understanding and developing new ideas. This manuscript introduces learn2learn, a library for meta-learning research focused on solving those prototyping and reproducibility issues. learn2learn provides low-level routines common across a wide range of meta-learning techniques (e.g. meta-descent, meta-reinforcement learning, few-shot learning), and builds standardized interfaces to algorithms and benchmarks on top of them. In releasing learn2learn under a free and open source license, we hope to foster a community around standardized software for meta-learning research.


The Advantage Regret-Matching Actor-Critic

arXiv.org Artificial Intelligence

Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
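The regret-matching step mentioned above can be sketched in isolation. This is the standard regret-matching rule applied to a list of per-action advantage estimates, not the ARMAC algorithm itself; the function name `regret_matching_policy` is hypothetical, and the advantage values are assumed to be given. Each action receives probability proportional to the positive part of its advantage, with a uniform fallback when no advantage is positive.

```python
from typing import List


def regret_matching_policy(advantages: List[float]) -> List[float]:
    """Map per-action advantage (regret) estimates to action probabilities."""
    # Only actions that look better than the current behavior get weight.
    positive = [max(a, 0.0) for a in advantages]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive advantage: fall back to the uniform policy.
        return [1.0 / len(advantages)] * len(advantages)
    return [p / total for p in positive]


policy = regret_matching_policy([1.0, -2.0, 3.0])
print(policy)  # [0.25, 0.0, 0.75]
```

Actions with negative estimated advantage receive zero probability, so the resulting policy shifts mass toward actions that hindsight suggests were under-played.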


Reinforcement Learning Market detailed strategies, Competitive landscaping and developments …

#artificialintelligence

Reinforcement learning is a part of machine learning which helps the software agents to take actions in environments to maximize the notion of …