Towards An Understanding of What is Learned: Extracting Multi-Abstraction-Level Knowledge from Learning Agents

AAAI Conferences

Machine Learning approaches used in the context of agents (like Reinforcement Learning) commonly result in weighted state-action pair representations (where the weights determine which action should be performed, given a perceived state). The weighted state-action pairs are stored, e.g., in tabular form or as approximated functions, which makes the learned knowledge hard for humans to comprehend, since the number of state-action pairs can be extremely high. In this paper, a knowledge extraction approach is presented which extracts compact and comprehensible knowledge bases from such weighted state-action pairs. For this purpose, so-called Hierarchical Knowledge Bases are described which allow for a top-down view on the learned knowledge at an adequate level of abstraction. The approach can be applied to gain structural insights into a problem and its solution, and it can be easily transformed into common knowledge representation formalisms, like normal logic programs.

Reinforcement Learning for Mixed Open-loop and Closed-loop Control

Neural Information Processing Systems

Closed-loop control relies on sensory feedback that is usually assumed to be free. But if sensing incurs a cost, it may be cost-effective to take sequences of actions in open-loop mode. We describe a reinforcement learning algorithm that learns to combine open-loop and closed-loop control when sensing incurs a cost. Although we assume reliable sensors, use of open-loop control means that actions must sometimes be taken when the current state of the controlled system is uncertain. This is a special case of the hidden-state problem in reinforcement learning, and to cope, our algorithm relies on short-term memory.

Asynchronous n-steps Q-learning


Q-learning is the most famous Temporal Difference algorithm. The original Q-learning algorithm tries to determine the state-action value function that minimizes the squared temporal-difference error

    L(theta) = ( r + gamma * max_a' Q(s', a'; theta) - Q(s, a; theta) )^2

We will use an optimizer (the simplest one, gradient descent) to compute the parameters of the state-action value function. First of all we need to compute the gradient of the loss function. Gradient descent finds a minimum of a function by subtracting the gradient of the loss, taken with respect to the parameters of the function, from the parameters.
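The update described above can be sketched as a semi-gradient descent step on the squared TD error. This is a minimal illustration, not the article's implementation: it assumes a tabular Q-function (equivalently, a linear function with one-hot state-action features), and the function and parameter names (`q_learning_step`, `theta`, `alpha`) are hypothetical.

```python
import numpy as np

def q_learning_step(theta, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One gradient-descent step on the squared TD error
    L = (r + gamma * max_a' Q(s', a') - Q(s, a))^2,
    treating the bootstrap target as a constant (semi-gradient).
    theta[s, a] holds the current estimate of Q(s, a)."""
    target = r + gamma * np.max(theta[s_next])
    td_error = target - theta[s, a]
    # dL/dtheta[s, a] = -2 * td_error, so subtracting the gradient
    # (scaled by the step size alpha) moves Q(s, a) toward the target.
    theta[s, a] += alpha * td_error
    return theta

# Tiny example: 3 states, 2 actions, all values initialized to zero.
theta = np.zeros((3, 2))
theta = q_learning_step(theta, s=0, a=1, r=1.0, s_next=2)
```

With zero-initialized values, the bootstrap term vanishes, so the single update moves Q(0, 1) from 0 toward the reward 1.0 by the step size 0.1.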

PAC Reinforcement Learning With an Imperfect Model

AAAI Conferences

Reinforcement learning (RL) methods have proved to be successful in many simulated environments. The common approaches, however, are often too sample intensive to be applied directly in the real world. A promising approach to addressing this issue is to train an RL agent in a simulator and transfer the solution to the real environment. When a high-fidelity simulator is available we would expect significant reduction in the amount of real trajectories needed for learning. In this work we aim at better understanding the theoretical nature of this approach. We start with a perhaps surprising result that, even if the approximate model (e.g., a simulator) only differs from the real environment in a single state-action pair (but which one is unknown), such a model could be information-theoretically useless and the sample complexity (in terms of real trajectories) still scales with the total number of states in the worst case. We investigate the hard instances and come up with natural conditions that avoid the pathological situations. We then propose two conceptually simple algorithms that enjoy polynomial sample complexity guarantees with no dependence on the size of the state-action space, and prove some foundational results to provide insights into this important problem.

Online abstraction with MDP homomorphisms for Deep Learning

Abstraction of Markov Decision Processes is a useful tool for solving complex problems, as it can ignore unimportant aspects of an environment, simplifying the process of learning an optimal policy. In this paper, we propose a new algorithm for finding abstract MDPs in environments with continuous state spaces. It is based on MDP homomorphisms, a structure-preserving mapping between MDPs. We demonstrate our algorithm's ability to learn abstractions from collected experience and show how to reuse the abstractions to guide exploration in new tasks the agent encounters. Our novel task transfer method beats a baseline based on a deep Q-network.