relative entropy policy search
Online learning in episodic Markovian decision processes by relative entropy policy search
We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\A$ and the state space $\X$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L\nX\nA T\log(\nX\nA/L)}$ in the bandit setting and $2L\sqrt{T\log(\nX\nA/L)}$ in the full information setting. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search
We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L|X ||A|T log(|X ||A|/L) in the bandit setting and 2L T log(|X ||A|/L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
Online learning in episodic Markovian decision processes by relative entropy policy search
Zimin, Alexander, Neu, Gergely
We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\A$ and the state space $\X$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L X A T\log( X A/L)}$ in the bandit setting and $2L\sqrt{T\log( X A/L)}$ in the full information setting. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
Learning walk and trot from the same objective using different types of exploration
Liu, Zinan, Ploeger, Kai, Stark, Svenja, Rueckert, Elmar, Peters, Jan
In nature, animals have developed extensive gaits to adapt to the different terrestrial terrain and situations, such as a horse galloping for faster speed, or a lizard trotting for a stable locomotion. In recent years, quadrupedal gait learning has attracted some research interest in robotics. Quadruped gaits offer a wide range of different movement patterns. As the cyclic movements of all four legs are similar, the gaits can be categorized mainly by the timing and order of the footfall, which can be represented as phase gaps among the trajectories of each leg. In the presented work we learn open loop control policies for various gaits, focusing on walk and trot. In walk the leg trajectories are separated by quarter-phase gaps, resulting in an equidistant footfall, whereas in trot diagonal pairs of legs move synchronously and are separated by half-phase gaps. Other gaits that can be learned using the described approach are bound and pace. We show how these symmetry properties can be encoded in the parameter space of the chosen policy representation, in order to enhance the initial exploration and reliably learn the chosen gaits. Neither do we fully define the gait in the policy representation as in [7, 8, 12, 3], nor do we learn random gaits [6] which could lead to a highly non convex problem.
Stochastic Search In Changing Situations
Abdolmaleki, Abbas (University of Aveiro) | Simoes, David (University of Aveiro) | Lau, Nuno (University of Aveiro) | Reis, Luis Paulo (University of Minho) | Price, Bob (PARC) | Neumann, Gerhard (Technische Universität Darmstadt)
Stochastic search algorithms are black-box optimizer of an objective function. They have recently gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. However, when the task or objective function slightly changes, many stochastic search algorithms require complete re-learning in order to adapt thesolution to the new objective function or the new context. As such, we consider the contextual stochastic search paradigm. Here, we want to find good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective function might change slightly for each parameter vector evaluation. In this paper, we investigate a contextual stochastic search algorithm known as Contextual Relative Entropy Policy Search (CREPS), an information-theoretic algorithm that can learn from multiple tasks simultaneously. We show the application of CREPS for simulated robotic tasks.
Model-Free Preference-Based Reinforcement Learning
Wirth, Christian (Technische Universität Darmstadt) | Fürnkranz, Johannes (Technische Universität Darmstadt) | Neumann, Gerhard (Technische Universität Darmstadt)
Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. Additionally, we also compare to a linear solvable approximation, based on inverse RL. We show that both approaches perform favourably to the current state-of-the-art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences.
Online learning in episodic Markovian decision processes by relative entropy policy search
Zimin, Alexander, Neu, Gergely
We study the problem of online learning in finite episodic Markov decision processes (MDPs)where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L X A T log( X A /L) in the bandit setting and 2L T log( X A /L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions andcannot be significantly improved under general assumptions.
Relative Entropy Policy Search
Peters, Jan (Max Planck Institute for Biological Cybernetics) | Mulling, Katharina (Max Planck Institute for Biological Cybernetics) | Altun, Yasemin (Max Planck Institute for Biological Cybernetics)
Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest the Relative Entropy Policy Search (REPS) method. The resulting method differs significantly from previous policy gradient approaches and yields an exact update step. It can be shown to work well on typical reinforcement learning benchmark problems.