Non-Stochastic Control with Bandit Feedback

Gradu, Paula, Hallman, John, Hazan, Elad

arXiv.org Machine Learning 

The crucial component in RL/control that allows learning is the feedback, or reward/penalty, which the agent iteratively observes and reacts to. While some signal is necessary for learning, different applications provide different kinds of feedback to the learning agent. In many reinforcement learning and control problems it is unrealistic to assume that the learner has feedback for actions other than its own. One example is game-playing, such as the game of Chess, where a player can observe the adversary's response to their own choice of play, but cannot expect to know how the adversary would have responded to any other possible move. This type of feedback is commonly known in the learning literature as "bandit feedback". Learning in Markov Decision Processes (MDPs) is a general and difficult problem for which there are no known algorithms with sublinear dependence on the number of states. For this reason we look at structured MDPs, and in particular the model of control in Linear Dynamical Systems (LDS), a highly structured special case that is known to admit more efficient methods than general RL. In this paper we study learning in linear dynamical systems with bandit feedback.
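To make the setting concrete, the following is a minimal sketch (not from the paper) of the interaction protocol in LDS control with bandit feedback: the state evolves as x_{t+1} = A x_t + B u_t + w_t, an adversary picks a cost function each round, and the learner observes only the scalar cost of its own state-action pair, never the full cost function. The system matrices, the fixed linear policy, and the quadratic cost below are hypothetical placeholders used purely for illustration; the paper's actual algorithm is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-dimensional linear dynamical system:
#   x_{t+1} = A x_t + B u_t + w_t
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])

x = np.zeros(2)
observed_costs = []
for t in range(100):
    u = -0.5 * x[1:]                    # a fixed, hypothetical linear policy
    w = 0.01 * rng.standard_normal(2)   # stand-in for bounded adversarial noise
    c_t = x @ x + float(u @ u)          # adversary's cost at (x_t, u_t)
    observed_costs.append(c_t)          # bandit feedback: only this scalar is seen
    x = A @ x + B @ u + w               # state transition

print(len(observed_costs))
```

The key point the sketch illustrates is the information constraint: the learner's log contains only the realized scalar costs, so it cannot evaluate counterfactual controls the way full-information (gradient) feedback would allow.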
