Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

Neural Information Processing Systems 

We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions change over time and are chosen adversarially, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback).
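To make the feedback model concrete, the following is a minimal sketch of the interaction protocol only, not of the algorithm studied in the paper. It assumes an episodic formulation with a tiny two-state, two-action MDP; the names `run_episode`, `transitions`, `losses`, and `policy` are hypothetical, and the random losses merely stand in for an arbitrary adversarial sequence.

```python
import random

def run_episode(transitions, losses, policy, horizon, rng, start_state=0):
    """Play one episode and return only what bandit feedback reveals.

    The return value is a list of (state, action, loss) triples for the
    state-action pairs that were actually visited.
    """
    state = start_state
    feedback = []
    for _ in range(horizon):
        action = policy[state]
        # Bandit feedback: only the loss of the visited pair is observed.
        feedback.append((state, action, losses[(state, action)]))
        next_states, probs = transitions[(state, action)]
        state = rng.choices(next_states, weights=probs, k=1)[0]
    return feedback

rng = random.Random(0)
states, actions = [0, 1], [0, 1]

# Assumed toy dynamics: action 0 moves to a uniformly random state,
# action 1 stays in the current state.
transitions = {}
for s in states:
    transitions[(s, 0)] = ([0, 1], [0.5, 0.5])
    transitions[(s, 1)] = ([s], [1.0])

policy = {0: 0, 1: 1}  # a fixed deterministic policy, for illustration only

total_loss = 0.0
for episode in range(3):
    # The loss function changes every episode; random values are a
    # placeholder for losses chosen by an adversary.
    losses = {(s, a): rng.random() for s in states for a in actions}
    feedback = run_episode(transitions, losses, policy, horizon=3, rng=rng)
    total_loss += sum(loss for _, _, loss in feedback)
```

Under this policy the learner never sees the losses of the pairs `(0, 1)` and `(1, 0)`, which is what distinguishes bandit feedback from full-information feedback, where the whole `losses` table would be revealed after each episode.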
