Online learning in episodic Markovian decision processes by relative entropy policy search

Dec-31-2013–Neural Information Processing Systems

We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space $\mathcal{A}$ and that the state space $\mathcal{X}$ has a layered structure with $L$ layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after $T$ episodes is $2\sqrt{L|\mathcal{X}||\mathcal{A}|T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the bandit setting and $2L\sqrt{T\log(|\mathcal{X}||\mathcal{A}|/L)}$ in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.
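
To give a feel for how the two guarantees scale, the following minimal sketch evaluates the bandit bound 2*sqrt(L |X| |A| T log(|X||A|/L)) and the full-information bound 2L*sqrt(T log(|X||A|/L)) for a given problem size. The function names and the example sizes are illustrative assumptions, not taken from the paper, and the code only evaluates the closed-form bounds; it does not implement the Relative Entropy Policy Search algorithm itself.

```python
import math


def _log_term(num_layers, num_states, num_actions):
    # log(|X||A| / L); the bounds are only meaningful when |X||A| > L.
    ratio = num_states * num_actions / num_layers
    if ratio <= 1.0:
        raise ValueError("need |X| * |A| > L for a positive log term")
    return math.log(ratio)


def regret_bound_bandit(num_layers, num_states, num_actions, num_episodes):
    """Bandit-setting bound: 2 * sqrt(L |X| |A| T log(|X||A| / L))."""
    log_term = _log_term(num_layers, num_states, num_actions)
    return 2.0 * math.sqrt(
        num_layers * num_states * num_actions * num_episodes * log_term
    )


def regret_bound_full_info(num_layers, num_states, num_actions, num_episodes):
    """Full-information bound: 2 * L * sqrt(T log(|X||A| / L))."""
    log_term = _log_term(num_layers, num_states, num_actions)
    return 2.0 * num_layers * math.sqrt(num_episodes * log_term)


if __name__ == "__main__":
    # Hypothetical problem size chosen only for illustration.
    L, X, A, T = 4, 20, 3, 10_000
    print("bandit bound:   ", regret_bound_bandit(L, X, A, T))
    print("full-info bound:", regret_bound_full_info(L, X, A, T))
```

Both bounds grow as the square root of T, so quadrupling the number of episodes doubles either bound. Their ratio is sqrt(|X||A|/L), which is the price of observing only the losses along the visited trajectory rather than the whole loss function.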

algorithm, artificial intelligence, machine learning


