Reinforcement Learning
Is the Policy Gradient a Gradient?
Nota, Chris, Thomas, Philip S.
The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods do not use the discount factor in the manner originally prescribed, and therefore do not optimize the discounted objective. It has been an open question in RL as to which, if any, objective they optimize instead. We show that the direction followed by these methods is not the gradient of any objective, and reclassify them as semi-gradient methods with respect to the undiscounted objective. Further, we show that they are not guaranteed to converge to a locally optimal policy, and construct an counterexample where they will converge to the globally pessimal policy with respect to both the discounted and undiscounted objectives.
Learning-Driven Exploration for Reinforcement Learning
Usama, Muhammad, Chang, Dong Eui
Deep reinforcement learning algorithms have been shown to learn complex skills using only high-dimensional observations and scalar reward. Effective and intelligent exploration still remains an unresolved problem for reinforcement learning. Most contemporary reinforcement learning relies on simple heuristic strategies such as $\epsilon$-greedy exploration or adding Gaussian noise to actions. These heuristics, however, are unable to intelligently distinguish the well explored and the unexplored regions of the state space, which can lead to inefficient use of training time. We introduce entropy-based exploration (EBE) that enables an agent to explore efficiently the unexplored regions of the state space. EBE quantifies the agent's learning in a state using merely state dependent action values and adaptively explores the state space, i.e. more exploration for the unexplored region of the state space. We perform experiments on many environments including a simple linear environment, a simpler version of the breakout game and multiple first-person shooter (FPS) games of VizDoom platform. We demonstrate that EBE enables efficient exploration that ultimately results in faster learning without having to tune hyperparameters.
Hierarchical Soft Actor-Critic: Adversarial Exploration via Mutual Information Optimization
We describe a novel extension of soft actor-critics for hierarchical Deep Q-Networks (HDQN) architectures using mutual information metric. The proposed extension provides a suitable framework for encouraging explorations in such hierarchical networks. A natural utilization of this framework is an adversarial setting, where meta-controller and controller play minimax over the mutual information objective but cooperate on maximizing expected rewards.
Mo\"ET: Interpretable and Verifiable Reinforcement Learning via Mixture of Expert Trees
Vasic, Marko, Petrovic, Andrija, Wang, Kaiyuan, Nikolic, Mladen, Singh, Rishabh, Khurshid, Sarfraz
Deep Reinforcement Learning (DRL) has led to many recent breakthroughs on complex control tasks, such as defeating the best human player in the game of Go. However, decisions made by the DRL agent are not explainable, hindering its applicability in safety-critical settings. Viper, a recently proposed technique, constructs a decision tree policy by mimicking the DRL agent. Decision trees are interpretable as each action made can be traced back to the decision rule path that lead to it. However, one global decision tree approximating the DRL policy has significant limitations with respect to the geometry of decision boundaries. We propose Mo\"ET, a more expressive, yet still interpretable model based on Mixture of Experts, consisting of a gating function that partitions the state space, and multiple decision tree experts that specialize on different partitions. We propose a training procedure to support non-differentiable decision tree experts and integrate it into imitation learning procedure of Viper. We evaluate our algorithm on four OpenAI gym environments, and show that the policy constructed in such a way is more performant and better mimics the DRL agent by lowering mispredictions and increasing the reward. We also show that Mo\"ET policies are amenable for verification using off-the-shelf automated theorem provers such as Z3.
On Value Functions and the Agent-Environment Boundary
When function approximation is deployed in reinforcement learning (RL), the same problem may be formulated in different ways, often by treating a pre-processing step as a part of the environment or as part of the agent. As a consequence, fundamental concepts in RL, such as (optimal) value functions, are not uniquely defined as they depend on where we draw this agent-environment boundary, causing problems in theoretical analyses that provide optimality guarantees. We address this issue via a simple and novel boundary-invariant analysis of Fitted Q-Iteration, a representative RL algorithm, where the assumptions and the guarantees are invariant to the choice of boundary. We also discuss closely related issues on state resetting and Monte-Carlo Tree Search, deterministic vs stochastic systems, imitation learning, and the verifiability of theoretical assumptions from data.
Solution of Two-Player Zero-Sum Game by Successive Relaxation
Diddigi, Raghuram Bharadwaj, Kamanchi, Chandramouli, Bhatnagar, Shalabh
We consider the problem of two-player zero-sum game. In this setting, there are two agents working against each other. Both the agents observe the same state and the objective of the agents is to compute a strategy profile that maximizes their rewards. However, the reward of the second agent is negative of reward obtained by the first agent. Therefore, the objective of the second agent is to minimize the total reward obtained by the first agent. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the max-min reward (of first player), starting from a given state is called the equilibrium value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation. Successive relaxation has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the two-player zero-sum games. We prove that, under a special structure, this technique computes the optimal solution faster than the techniques in the literature. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm.
A gray-box approach for curriculum learning
Foglino, Francesco, Leonetti, Matteo, Sagratella, Simone, Seccia, Ruggiero
Curriculum learning is often employed in deep reinforcement learning to let the agent progress more quickly towards better behaviors. Numerical methods for curriculum learning in the literature provides only initial heuristic solutions, with little to no guarantee on their quality. We define a new gray-box function that, including a suitable scheduling problem, can be effectively used to reformulate the curriculum learning problem. We propose different efficient numerical methods to address this gray-box reformulation. Preliminary numerical results on a benchmark task in the curriculum learning literature show the viability of the proposed approach.
Learning When-to-Treat Policies
Nie, Xinkun, Brunskill, Emma, Wager, Stefan
Any solution to the "policy learning" problem needs to deal with numerous difficulties, including how to incorporate robustness to potential selection bias as well as fairness constraints articulated by stakeholders, and there have been several notable advances that address these difficulties over the past few years. One limitation of this line of work, however, is that the results cited above all focus on a static setting where a decision-maker only sees each subject once and immediately decides how to treat the subject. In contrast, many problems of applied interest involve a dynamic component whereby the decision-maker makes a series of decisions based on time-varying covariates. In medicine, if a patient has a disease for which all known cures are invasive and have serious side effects, their doctor may choose to monitor disease progression for some time before prescribing one of these invasive treatments. Meanwhile, a health inspector needs to not only choose which restaurants to inspect, but also when to carry out these inspections.
Reinforcement Learning Driven Heuristic Optimization
Cai, Qingpeng, Hang, Will, Mirhoseini, Azalia, Tucker, George, Wang, Jingtao, Wei, Wei
Heuristic algorithms such as simulated annealing, Concorde, and METIS are effective and widely used approaches to find solutions to combinatorial optimization problems. However, they are limited by the high sample complexity required to reach a reasonable solution from a cold-start. In this paper, we introduce a novel framework to generate better initial solutions for heuristic algorithms using reinforcement learning (RL), named RLHO. We augment the ability of heuristic algorithms to greedily improve upon an existing initial solution generated by RL, and demonstrate novel results where RL is able to leverage the performance of heuristics as a learning signal to generate better initialization. We apply this framework to Proximal Policy Optimization (PPO) and Simulated Annealing (SA). We conduct a series of experiments on the well-known NP-complete bin packing problem, and show that the RLHO method outperforms our baselines. We show that on the bin packing problem, RL can learn to help heuristics perform even better, allowing us to combine the best parts of both approaches.
Injecting Prior Knowledge for Transfer Learning into Reinforcement Learning Algorithms using Logic Tensor Networks
Badreddine, Samy, Spranger, Michael
Human ability at solving complex tasks is helped by priors on object and event semantics of their environment. This paper investigates the use of similar prior knowledge for transfer learning in Reinforcement Learning agents. In particular, the paper proposes to use a first-order-logic language grounded in deep neural networks to represent facts about objects and their semantics in the real world. Facts are provided as background knowledge a priori to learning a policy for how to act in the world. The priors are injected with the conventional input in a single agent architecture. As proof-of-concept, the paper tests the system in simple experiments that show the importance of symbolic abstraction and flexible fact derivation. The paper shows that the proposed system can learn to take advantage of both the symbolic layer and the image layer in a single decision selection module.