AITopics | relative entropy policy search

Online learning in episodic Markovian decision processes by relative entropy policy search

Neural Information Processing Systems, Sep-30-2025, 11:51:17 GMT

We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\mathcal{A}$ and the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
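The regret described in words above can be written out as follows. This is a hedged restatement, not the paper's exact notation: the symbols $\ell_t$ (loss function in episode $t$), $\pi_t$ (the learner's policy in episode $t$), $\mathbf{x}_l, \mathbf{a}_l$ (state and action in layer $l$), and the layer indexing $l = 0, \dots, L-1$ are assumptions introduced for illustration. The sign is chosen so that a smaller value means a better learner.

```latex
\[
  \hat{R}_T \;=\;
  \sum_{t=1}^{T} \mathbb{E}\!\left[\sum_{l=0}^{L-1}
      \ell_t\bigl(\mathbf{x}_l^{(t)}, \mathbf{a}_l^{(t)}\bigr) \,\middle|\, \pi_t\right]
  \;-\;
  \min_{\pi} \sum_{t=1}^{T} \mathbb{E}\!\left[\sum_{l=0}^{L-1}
      \ell_t\bigl(\mathbf{x}_l, \mathbf{a}_l\bigr) \,\middle|\, \pi\right]
\]
```

The minimum ranges over stationary policies, so the comparator is the single best fixed policy in hindsight across all $T$ episodes.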

artificial intelligence, episodic markovian decision process, machine learning, (7 more...)

Neural Information Processing Systems

Industry: Education > Educational Setting > Online (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.78)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)

Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search

Neural Information Processing Systems, Mar-13-2024, 17:03:16 GMT

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\mathcal{A}$ and the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
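To get a feel for how the two guarantees scale, the sketch below simply evaluates the bandit bound 2·sqrt(L·|X|·|A|·T·log(|X||A|/L)) and the full-information bound 2·L·sqrt(T·log(|X||A|/L)) for a given problem size. The function name and the example sizes are illustrative assumptions; nothing here implements the learning algorithm itself.

```python
import math


def regret_bounds(num_layers, num_states, num_actions, num_episodes):
    """Evaluate the bandit and full-information regret bounds from the abstract.

    Hypothetical helper: it only plugs numbers into the two closed-form bounds.
    """
    if num_states * num_actions <= num_layers:
        raise ValueError("need |X||A| > L so that the logarithm is positive")
    log_term = math.log(num_states * num_actions / num_layers)
    bandit = 2.0 * math.sqrt(
        num_layers * num_states * num_actions * num_episodes * log_term
    )
    full_info = 2.0 * num_layers * math.sqrt(num_episodes * log_term)
    return bandit, full_info


# Example sizes are arbitrary and chosen only for illustration.
bandit, full_info = regret_bounds(
    num_layers=5, num_states=50, num_actions=4, num_episodes=10_000
)
print(f"bandit bound:           {bandit:.1f}")
print(f"full-information bound: {full_info:.1f}")
```

Both bounds grow as the square root of the number of episodes, and the bandit bound is larger than the full-information bound by a factor of sqrt(|X||A|/L).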

algorithm, decision process, learner, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > Alberta (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Hungary > Budapest > Budapest (0.04)
(3 more...)

Industry: Education > Educational Setting > Online (0.72)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.62)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.36)

Online learning in episodic Markovian decision processes by relative entropy policy search

Zimin, Alexander, Neu, Gergely

Neural Information Processing Systems, Feb-14-2020, 17:26:51 GMT

We study the problem of online learning in finite episodic Markov decision processes where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\mathcal{A}$ and the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

episodic markovian decision process, online, relative entropy policy search, (3 more...)

Neural Information Processing Systems

Industry: Education > Educational Setting > Online (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.89)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.65)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.64)

Learning walk and trot from the same objective using different types of exploration

Liu, Zinan, Ploeger, Kai, Stark, Svenja, Rueckert, Elmar, Peters, Jan

arXiv.org Machine Learning, Apr-28-2019

In nature, animals have developed a wide range of gaits to adapt to different terrains and situations, such as a horse galloping for speed or a lizard trotting for stable locomotion. In recent years, quadrupedal gait learning has attracted some research interest in robotics. Quadruped gaits offer a wide range of different movement patterns. As the cyclic movements of all four legs are similar, the gaits can be categorized mainly by the timing and order of the footfall, which can be represented as phase gaps among the trajectories of each leg. In the presented work we learn open-loop control policies for various gaits, focusing on walk and trot. In walk the leg trajectories are separated by quarter-phase gaps, resulting in an equidistant footfall, whereas in trot diagonal pairs of legs move synchronously and are separated by half-phase gaps. Other gaits that can be learned using the described approach are bound and pace. We show how these symmetry properties can be encoded in the parameter space of the chosen policy representation, in order to enhance the initial exploration and reliably learn the chosen gaits. Neither do we fully define the gait in the policy representation as in [7, 8, 12, 3], nor do we learn random gaits [6], which could lead to a highly non-convex problem.
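The phase-gap description of walk and trot can be made concrete with a small sketch. The leg labels (LF, RF, LH, RH), the footfall order chosen for walk, and the sinusoidal toy trajectory are illustrative assumptions and are not taken from the paper's policy representation; only the quarter-phase and half-phase gaps come from the abstract.

```python
import math

# Phase offsets as fractions of one gait cycle.
# Leg names and the walk ordering are assumptions made for illustration.
GAIT_PHASES = {
    # Walk: four legs separated by quarter-phase gaps (equidistant footfall).
    "walk": {"LF": 0.0, "RH": 0.25, "RF": 0.5, "LH": 0.75},
    # Trot: diagonal pairs move together, pairs separated by a half-phase gap.
    "trot": {"LF": 0.0, "RH": 0.0, "RF": 0.5, "LH": 0.5},
}


def leg_phase(gait, leg, t, period=1.0):
    """Phase in [0, 1) of `leg` at time `t` for the chosen gait."""
    return (t / period + GAIT_PHASES[gait][leg]) % 1.0


def leg_height(gait, leg, t, period=1.0, amplitude=1.0):
    """Toy cyclic trajectory: one shared sinusoid, shifted per leg by its phase."""
    return amplitude * math.sin(2.0 * math.pi * leg_phase(gait, leg, t, period))


for gait in ("walk", "trot"):
    heights = {leg: round(leg_height(gait, leg, t=0.1), 3) for leg in GAIT_PHASES[gait]}
    print(gait, heights)
```

Because every leg follows the same cyclic trajectory, switching gait only changes the table of offsets, which is the sense in which the gait is determined by timing and order of footfall.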

artificial intelligence, gait, machine learning, (16 more...)

arXiv.org Machine Learning

1904.12336

Country: Europe > Germany (0.15)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.92)

Stochastic Search In Changing Situations

AAAI Conferences, Feb-4-2017

Stochastic search algorithms are black-box optimizers of an objective function. They have recently gained a lot of attention in operations research, machine learning and policy search of robot motor skills due to their ease of use and their generality. However, when the task or objective function slightly changes, many stochastic search algorithms require complete re-learning in order to adapt the solution to the new objective function or the new context. As such, we consider the contextual stochastic search paradigm. Here, we want to find good parameter vectors for multiple related tasks, where each task is described by a continuous context vector. Hence, the objective function might change slightly for each parameter vector evaluation. In this paper, we investigate a contextual stochastic search algorithm known as Contextual Relative Entropy Policy Search (CREPS), an information-theoretic algorithm that can learn from multiple tasks simultaneously. We show the application of CREPS for simulated robotic tasks.

algorithm, artificial intelligence, optimization problem, (16 more...)

AAAI Conferences

Workshops at the Thirty-First AAAI Conference on Artificial Intelligence

Country:

Europe > Portugal > Aveiro > Aveiro (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)

Model-Free Preference-Based Reinforcement Learning

Wirth, Christian (Technische Universität Darmstadt) | Fürnkranz, Johannes (Technische Universität Darmstadt) | Neumann, Gerhard (Technische Universität Darmstadt)

AAAI Conferences, Apr-19-2016

Specifying a numeric reward function for reinforcement learning typically requires a lot of hand-tuning from a human expert. In contrast, preference-based reinforcement learning (PBRL) utilizes only pairwise comparisons between trajectories as a feedback signal, which are often more intuitive to specify. Currently available approaches to PBRL for control problems with continuous state/action spaces require a known or estimated model, which is often not available and hard to learn. In this paper, we integrate preference-based estimation of the reward function into a model-free reinforcement learning (RL) algorithm, resulting in a model-free PBRL algorithm. Our new algorithm is based on Relative Entropy Policy Search (REPS), enabling us to utilize stochastic policies and to directly control the greediness of the policy update. REPS decreases exploration of the policy slowly by limiting the relative entropy of the policy update, which ensures that the algorithm is provided with a versatile set of trajectories, and consequently with informative preferences. The preference-based estimation is computed using a sample-based Bayesian method, which can also estimate the uncertainty of the utility. We also compare to a linear solvable approximation, based on inverse RL. We show that both approaches compare favourably to the current state of the art. The overall result is an algorithm that can learn non-parametric continuous action policies from a small number of preferences.

Online learning in episodic Markovian decision processes by relative entropy policy search

Zimin, Alexander, Neu, Gergely

Neural Information Processing Systems, Dec-31-2013

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\mathcal{A}$ and the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

algorithm, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Country: Europe > Hungary (0.14)

Industry: Education > Educational Setting > Online (0.71)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.62)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.36)

Relative Entropy Policy Search

Peters, Jan (Max Planck Institute for Biological Cybernetics) | Mulling, Katharina (Max Planck Institute for Biological Cybernetics) | Altun, Yasemin (Max Planck Institute for Biological Cybernetics)

AAAI Conferences, Jul-15-2010

Policy search is a successful approach to reinforcement learning. However, policy improvements often result in the loss of information. Hence, it has been marred by premature convergence and implausible solutions. As first suggested in the context of covariant policy gradients, many of these problems may be addressed by constraining the information loss. In this paper, we continue this path of reasoning and suggest the Relative Entropy Policy Search (REPS) method. The resulting method differs significantly from previous policy gradient approaches and yields an exact update step. It can be shown to work well on typical reinforcement learning benchmark problems.
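The idea of constraining the information loss can be sketched in a simplified, episode-based form (not the paper's full state-action formulation). Sampled returns are reweighted in proportion to exp(R/eta), and the temperature eta is chosen so that the KL divergence of the reweighted distribution from the uniform sampling distribution equals a bound epsilon. The function names, the bisection solver, and the example values are assumptions made for illustration, not the authors' implementation.

```python
import math


def reps_weights(returns, eta):
    """Normalized weights proportional to exp(R / eta), computed stably."""
    best = max(returns)
    unnorm = [math.exp((r - best) / eta) for r in returns]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_to_uniform(weights):
    """KL divergence of `weights` from the uniform distribution over samples."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)


def solve_eta(returns, epsilon, lo=1e-6, hi=1e6, iters=200):
    """Find eta with KL(weights || uniform) = epsilon by geometric bisection.

    The KL shrinks as eta grows, so a KL above the bound means eta is too small.
    Assumes epsilon is below the largest achievable KL for these samples.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if kl_to_uniform(reps_weights(returns, mid)) > epsilon:
            lo = mid  # update too greedy: raise the temperature
        else:
            hi = mid  # update too cautious: lower the temperature
    return math.sqrt(lo * hi)


# Example returns and bound are arbitrary illustrative values.
returns = [1.0, 2.0, 3.0, 4.0]
eta = solve_eta(returns, epsilon=0.5)
weights = reps_weights(returns, eta)
print("eta:", eta)
print("weights:", weights)
```

A small epsilon keeps the weights close to uniform and so preserves exploration, while a large epsilon concentrates the weight on the best samples; a new policy would then be fitted to the weighted samples.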

artificial intelligence, machine learning, reinforcement learning, (10 more...)

AAAI Conferences

Twenty-Fourth AAAI Conference on Artificial Intelligence

Country:

Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
