Online Learning in MDPs with Linear Function Approximation and Bandit Feedback

Neural Information Processing Systems 

After each action, the state of the environment changes according to the transition function of the underlying MDP, that is, as a function of the previous state and the action taken by the learner.
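
The transition mechanism described above can be illustrated with a minimal sketch. It assumes a small tabular MDP with a known transition table; the table `P`, the helper `step`, and the fixed action sequence are all hypothetical and are not taken from the paper, which concerns linear function approximation rather than this toy setting.

```python
import random

def step(P, state, action, rng):
    """Sample the next state from the transition distribution P[state][action].

    P[state][action] is a list of probabilities over next states
    (an illustrative, assumed representation of the transition function).
    """
    probs = P[state][action]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical MDP with 2 states and 2 actions.
# P[s][a][s2] is the probability of moving from s to s2 under action a.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # transitions from state 0
    [[0.5, 0.5], [0.0, 1.0]],  # transitions from state 1
]

rng = random.Random(0)  # seeded for reproducibility
state = 0
trajectory = [state]
for action in [1, 1, 0, 1]:  # actions chosen by the learner (fixed here)
    # The next state depends only on the previous state and the action taken.
    state = step(P, state, action, rng)
    trajectory.append(state)

print(trajectory)
```

The next state is drawn using only the previous state and the learner's action, which is the Markov property the sentence above refers to.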
