Online Markov Decision Processes under Bandit Feedback

Gergely Neu, András Antos, András György, Csaba Szepesvári

Feb-15-2020, 02:44:14 GMT–Neural Information Processing Systems

We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is a no-regret algorithm.
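The interaction protocol described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm from the paper: the names `run_online_mdp`, `expected_return`, and `best_deterministic_stationary` are hypothetical, the MDP is a toy example, and the comparator is restricted to deterministic stationary policies found by brute-force enumeration, which is a simplifying assumption.

```python
import random
from itertools import product

def run_online_mdp(P, rewards, policy, x0=0, seed=0):
    """Simulate the protocol. P[x][a] is the known next-state distribution;
    rewards[t][x][a] is fixed in advance (oblivious adversary)."""
    rng = random.Random(seed)
    x, total, history = x0, 0.0, []
    for r_t in rewards:
        a = policy(x, history)
        r = r_t[x][a]              # bandit feedback: only this entry is revealed
        history.append((x, a, r))
        total += r
        x = rng.choices(range(len(P)), weights=P[x][a])[0]
    return total, history

def expected_return(P, rewards, pi, x0=0):
    """Expected total reward of a deterministic stationary policy pi
    (pi[x] is the action in state x), by propagating the state distribution."""
    n = len(P)
    d = [1.0 if x == x0 else 0.0 for x in range(n)]
    total = 0.0
    for r_t in rewards:
        total += sum(d[x] * r_t[x][pi[x]] for x in range(n))
        d = [sum(d[x] * P[x][pi[x]][y] for x in range(n)) for y in range(n)]
    return total

def best_deterministic_stationary(P, rewards, x0=0):
    """Brute-force comparator (assumption: deterministic policies only)."""
    n, m = len(P), len(P[0])
    return max((expected_return(P, rewards, pi, x0), pi)
               for pi in product(range(m), repeat=n))

# Toy MDP (assumed for illustration): action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
# Reward 1 in state 1 and 0 in state 0, for three time steps.
rewards = [[[0.0, 0.0], [1.0, 1.0]] for _ in range(3)]

agent_total, _ = run_online_mdp(P, rewards, lambda x, history: 0)
best_value, best_pi = best_deterministic_stationary(P, rewards)
regret = best_value - agent_total
```

In this toy run the agent always plays action 0 and never leaves state 0, so its regret against the enumerated comparator equals the comparator's full return. A no-regret algorithm in the abstract's sense would keep this gap sublinear in the number of time steps.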

bandit feedback, online markov decision process, time step

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.40)