AITopics

Policy evaluation algorithms are essential to reinforcement learning because they predict the performance of a policy. However, two long-standing issues in this prediction problem remain to be tackled: off-policy stability and on-policy efficiency. The conventional temporal difference (TD) algorithm is known to perform very well in the on-policy setting, yet is not off-policy stable. On the other hand, the gradient TD and emphatic TD algorithms are off-policy stable, but are not on-policy efficient. This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method. Experimental results on various domains validate the effectiveness of the proposed approach.
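
For readers unfamiliar with the conventional TD algorithm this abstract refers to, the following is a minimal sketch of one semi-gradient TD(0) update with linear features. It is not the oblique projection method from the paper. The function name `td0_update`, the default step size and discount, and the use of a per-step importance-sampling ratio `rho` for the off-policy case are all illustrative assumptions.

```python
def td0_update(theta, phi, phi_next, reward, gamma=0.9, alpha=0.1, rho=1.0):
    """One semi-gradient TD(0) step for a linear estimate v(s) = theta . phi(s).

    rho is an assumed importance-sampling ratio pi(a|s) / mu(a|s);
    rho = 1.0 corresponds to the on-policy case.
    """
    # Current and next-state value estimates under the linear model.
    v = sum(t * p for t, p in zip(theta, phi))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    # TD error: bootstrapped target minus current estimate.
    td_error = reward + gamma * v_next - v
    # Move the weights along the feature vector, scaled by the TD error.
    new_theta = [t + alpha * rho * td_error * p for t, p in zip(theta, phi)]
    return new_theta, td_error


# Hypothetical two-feature example: a transition with reward 1.
theta, delta = td0_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], reward=1.0)
```

With `rho` different from 1 the same update is applied to data from a behavior policy, which is the setting where plain TD is known to risk divergence.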

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.03978

Country:

North America > Canada > Alberta (0.14)
North America > United States > Alabama > Lee County > Auburn (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Regularized Off-Policy TD-Learning

Liu, Bo, Mahadevan, Sridhar, Liu, Ji

We present a novel $l_1$-regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. The experiments illustrate the off-policy convergence, sparse feature selection capability, and low computational cost of the RO-TD algorithm.
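
To illustrate why $l_1$ regularization yields sparse weights, here is a minimal sketch of soft-thresholding, the standard proximal operator of the $l_1$ norm. This is not the RO-TD algorithm itself, which the abstract describes as a saddle-point method; the function name `soft_threshold` and the example values are assumptions for illustration only.

```python
def soft_threshold(weights, threshold):
    """Proximal operator of threshold * ||w||_1, applied element-wise.

    Each weight is shrunk toward zero by `threshold`; weights whose
    magnitude does not exceed the threshold become exactly zero.
    """
    out = []
    for w in weights:
        shrunk = abs(w) - threshold
        if shrunk <= 0.0:
            out.append(0.0)  # small weights are zeroed: this gives sparsity
        else:
            out.append(shrunk if w > 0 else -shrunk)
    return out


# Hypothetical weight vector: the small entry is zeroed, large ones shrink.
sparse = soft_threshold([0.5, -0.05, 0.0, -1.0], 0.25)
```

Features whose weights stay at zero after such a step are effectively deselected, which is the sense in which $l_1$ regularization performs feature selection.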

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2006.05314

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

Liu, Bo, Gemp, Ian, Ghavamzadeh, Mohammad, Liu, Ji, Mahadevan, Sridhar, Petrik, Marek

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2006.03976

Country:

North America > Canada > Alberta (0.14)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(6 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Energy > Energy Storage (1.00)
Electrical Industrial Apparatus (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Learning to Model Opponent Learning

Davies, Ian, Tian, Zheng, Wang, Jun

Multi-Agent Reinforcement Learning (MARL) considers settings in which a set of coexisting agents interact with one another and their environment. The adaptation and learning of other agents induces non-stationarity in the environment dynamics. This poses a great challenge for value function-based algorithms whose convergence usually relies on the assumption of a stationary environment. Policy search algorithms also struggle in multi-agent settings as the partial observability resulting from an opponent's actions not being known introduces high variance to policy training. Modelling an agent's opponent(s) is often pursued as a means of resolving the issues arising from the coexistence of learning opponents. An opponent model provides an agent with some ability to reason about other agents to aid its own decision making. Most prior works learn an opponent model by assuming the opponent is employing a stationary policy or switching between a set of stationary policies. Such an approach can reduce the variance of training signals for policy search algorithms. However, in the multi-agent setting, agents have an incentive to continually adapt and learn. This means that the assumptions concerning opponent stationarity are unrealistic. In this work, we develop a novel approach to modelling an opponent's learning dynamics which we term Learning to Model Opponent Learning (LeMOL). We show our structured opponent model is more accurate and stable than naive behaviour cloning baselines. We further show that opponent modelling can improve the performance of algorithmic agents in multi-agent settings.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2006.03923

Country: Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.80)

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Kallus, Nathan, Uehara, Masatoshi

Offline reinforcement learning, wherein one uses off-policy data logged by a fixed behavior policy to evaluate and learn new policies, is crucial in applications where experimentation is limited, such as medicine. We study the estimation of the policy value and gradient of a deterministic policy from off-policy data when actions are continuous. Targeting deterministic policies, for which the action is a deterministic function of the state, is crucial since optimal policies are always deterministic (up to ties). In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. To circumvent this issue, we propose several new doubly robust estimators based on different kernelization approaches. We analyze the asymptotic mean-squared error of each of these under mild rate conditions for nuisance estimators. Specifically, we demonstrate how to obtain a rate that is independent of the horizon length.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2006.039

Country:

North America > United States > Wisconsin (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

A Multi-step and Resilient Predictive Q-learning Algorithm for IoT with Human Operators in the Loop: A Case Study in Water Supply Networks

Grammatopoulou, Maria, Kanellopoulos, Aris, Vamvoudakis, Kyriakos G., Lau, Nathan

We consider the problem of recommending resilient and predictive actions for an IoT network in the presence of faulty components, where human operators manipulate the information about the environment that the agent sees for containment purposes. The IoT network is formulated as a directed graph with a known topology whose objective is to maintain a constant and resilient flow between a source and a destination node. The optimal route through this network is evaluated via a predictive and resilient Q-learning algorithm which takes into account historical data about irregular operation due to faults, as well as feedback from the human operators, who are considered to have extra information about the status of the network concerning locations likely to be targeted by attacks. To showcase our method, we utilize anonymized data from Arlington County, Virginia, to compute predictive and resilient scheduling policies for a smart water supply system, while avoiding (i) all the locations indicated to be attacked according to human operators and (ii) as many neighborhoods as possible that are detected to have leaks or other faults. This method incorporates both the adaptability of the human and the computation capability of the machine to achieve optimal implementation of containment and recovery actions in water distribution.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2006.03899

Country:

North America > United States > Virginia > Arlington County (0.25)
North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
North America > United States > Virginia > Montgomery County > Blacksburg (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Water & Waste Management > Water Management > Water Supplies & Services (0.85)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

Zhang, Zihan, Zhou, Yuan, Ji, Xiangyang

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-Advantage and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information theoretic lower bound up to logarithmic factors. We also show that UCB-Advantage achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2004.10019

Country:

North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

arXiv.org Artificial Intelligence, Jun-5-2020

DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip (SoC) Resource Scheduling

Sung, Tegg Taekyong, Ha, Jeongsoo, Kim, Jeewoo, Yahja, Alex, Sohn, Chae-Bong, Ryu, Bo

In this paper, we present a novel scheduling solution for a class of System-on-Chip (SoC) systems where heterogeneous chip resources (DSP, FPGA, GPU, etc.) must be efficiently scheduled for continuously arriving hierarchical jobs with their tasks represented by a directed acyclic graph. Traditionally, heuristic algorithms have been widely used for many resource scheduling domains, and Heterogeneous Earliest Finish Time (HEFT) has been a dominating state-of-the-art technique across a broad range of heterogeneous resource scheduling domains over many years. Despite their long-standing popularity, HEFT-like algorithms are known to be vulnerable to a small amount of noise added to the environment. Our Deep Reinforcement Learning (DRL)-based SoC Scheduler (DeepSoCS), capable of learning the "best" task ordering under dynamic environment changes, overcomes the brittleness of rule-based schedulers such as HEFT with significantly higher performance across different types of jobs. We describe a DeepSoCS design process using a real-time heterogeneous SoC scheduling emulator, discuss major challenges, and present two novel neural network design features that lead to outperforming HEFT: (i) hierarchical job- and task-graph embedding; and (ii) efficient use of real-time task information in the state space. Furthermore, we introduce effective techniques to address two fundamental challenges present in our environment: delayed consequences and joint actions. Through an extensive simulation study, we show that DeepSoCS achieves significantly better job execution time than HEFT, with a higher level of robustness under realistic noise conditions. We conclude with a discussion of the potential improvements for our DeepSoCS neural scheduler.

deep learning, neural network, upstream oil & gas, (22 more...)

doi: 10.3390/electronics9060936

2005.07666

Country:

North America > United States (1.00)
Asia (0.93)
Europe (0.68)

Genre: Research Report > Promising Solution (0.34)

Industry:

Semiconductors & Electronics (0.85)
Energy > Oil & Gas > Upstream (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine Learning, Jun-5-2020

Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning

Lee, Donghoon

Entropy augmented to reward is known to soften the greedy argmax policy to a softmax policy. Entropy augmentation is reformulated, which motivates introducing an additional entropy term to the objective function, in the form of a KL-divergence, to regularize the optimization process. This results in a policy which monotonically improves while interpolating from the current policy to the softmax greedy policy. This policy is used to build a continuously parameterized algorithm which optimizes the policy and Q-function simultaneously and whose extreme limits correspond to policy gradient and Q-learning, respectively. Experiments show that there can be a performance gain using an intermediate algorithm. Both Q-learning [15] and policy gradient (PG) [13] update the policy towards the greedy one, whether the policy is explicit or not.
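
The softening from argmax to softmax mentioned in this abstract can be sketched as follows. This is a generic illustration, not the paper's algorithm: the function name `softmax_policy`, the `temperature` parameter (standing in for the weight on the entropy term), and the example Q-values are all assumptions.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q(s, a) / temperature).

    A small temperature approaches the greedy argmax policy; a large
    temperature approaches the uniform policy.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    exps = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Q-values for three actions in one state.
warm = softmax_policy([1.0, 2.0, 3.0], temperature=1.0)   # soft preference
cold = softmax_policy([1.0, 2.0, 3.0], temperature=0.01)  # nearly greedy
```

Lowering the temperature concentrates probability on the highest-valued action, which is the sense in which the softmax policy interpolates toward the greedy one.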

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2005.08844

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

State Action Separable Reinforcement Learning

Zhang, Ziyao, Ma, Liang, Leung, Kin K., Poularakis, Konstantinos, Srivatsa, Mudhakar

arXiv.org Artificial Intelligence, Jun-5-2020

Reinforcement Learning (RL) based methods have achieved notable success in solving serial decision-making and control problems in recent years. For conventional RL formulations, the Markov Decision Process (MDP) and the state-action-value function are the basis for problem modeling and policy evaluation. However, several challenging issues still remain. Among the most cited issues, the enormity of the state/action space is an important factor that causes inefficiency in accurately approximating the state-action-value function. We observe that although actions directly define the agents' behaviors, for many problems the next state after a state transition matters more than the action taken in determining the return of such a state transition. In this regard, we propose a new learning paradigm, State Action Separable Reinforcement Learning (sasRL), wherein the action space is decoupled from the value function learning process for higher efficiency. Then, a light-weight transition model is learned to help the agent determine the action that triggers the associated state transition. In addition, our convergence analysis reveals that under certain conditions, the convergence time of sasRL is $O(T^{1/k})$, where $T$ is the convergence time for updating the value function in the MDP-based formulation and $k$ is a weighting factor. Experiments on several gaming scenarios show that sasRL outperforms state-of-the-art MDP-based RL algorithms by up to $75\%$.

machine learning, reinforcement learning, state transition, (16 more...)

2006.03713

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
North America > United States > Connecticut > New Haven County > New Haven (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)