Online Convex Optimization in Adversarial Markov Decision Processes
Rosenberg, Aviv, Mansour, Yishay
We consider online learning in episodic loopfree We propose a novel algorithm for the adversarial MDP Markov decision processes (MDPs), where model where the transition function is unknown to the the loss function can change arbitrarily between learner and the losses change arbitrarily over time. Our episodes, and the transition function is not known algorithm, UC-O-REPS, uses two important ingredients, to the learner. We show Õ(L X A T) regret the first is Online Mirror Descent (OMD) (Shalev-Shwartz, bound, where T is the number of episodes, X 2012) and the second is UCRL-2 (Auer et al., 2008). A is the state space, A is the action space, and L major challenge in this work is to handle convex performance is the length of each episode. Our online algorithm criteria, which model different ways of aggregating is implemented using entropic regularization the losses of each episode. In order to handle convex performance methodology, which allows to extend the criteria, we use the methodology of OMD, which is original adversarial MDP model to handle convex widely used for online convex optimization, and we implement performance criteria (different ways to aggregate it in the adversarial MDP setting. In order to overcome the losses of a single episode), as well as the unknown dynamics (stochastic transition function) improve previous regret bounds.
May-19-2019
- Country:
- North America > United States (0.46)
- Asia > Middle East
- Israel (0.14)
- Genre:
- Research Report (0.40)
- Industry:
- Education (0.34)