Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Mar-13-2024, 16:30:24 GMT–Neural Information Processing Systems

We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O( T log |Π| + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem.

algorithm, loss function, shortest path problem, (12 more...)

Neural Information Processing Systems

Mar-13-2024, 16:30:24 GMT

Conferences PDF

Add feedback

Country:
- Oceania > Australia
  - Queensland (0.04)
- North America
  - Canada > Alberta (0.14)
  - United States > New York
    - New York County > New York City (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Industry:
- Education > Educational Setting > Online (0.62)

Technology:
- Information Technology
  - Enterprise Applications > Human Resources
    - Learning Management (0.62)
  - Artificial Intelligence > Machine Learning
    - Learning Graphical Models > Undirected Networks > Markov Models (0.71)