Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
–Neural Information Processing Systems
We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O( T log |Π| + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem.
Neural Information Processing Systems
Mar-13-2024, 16:30:24 GMT
- Country:
- Oceania > Australia
- Queensland (0.04)
- North America
- Canada > Alberta (0.14)
- United States > New York
- New York County > New York City (0.04)
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Oceania > Australia
- Industry:
- Education > Educational Setting > Online (0.62)