Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback
–Neural Information Processing Systems
We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions change over time and are chosen adversarially, and the learner only observes the losses for the visited state-action pairs (i.e., bandit feedback).
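To make the feedback model concrete, here is a minimal sketch of a single interaction round under bandit feedback. It is not the paper's algorithm: the episodic layout with a fixed horizon, the fixed initial state, the uniform policy, the random transitions, and every name below (`run_episode`, `loss`, `feedback`) are assumptions chosen for illustration only.

```python
import random

def run_episode(loss, transition, policy, horizon, rng):
    """Simulate one episode and return only the losses the learner observes.

    Under bandit feedback the learner sees the loss of each visited
    (step, state, action) triple and nothing else.
    """
    state = 0  # assumed fixed initial state
    observed = {}
    for h in range(horizon):
        action = policy(h, state, rng)
        observed[(h, state, action)] = loss[h][state][action]
        state = transition(h, state, action, rng)
    return observed

# Hypothetical toy instance: 3 steps, 2 states, 2 actions.
H, S, A = 3, 2, 2
rng = random.Random(0)

# Losses for this episode; random numbers stand in for the adversary's choice.
loss = [[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]

def uniform_policy(h, s, rng):
    return rng.randrange(A)

def random_transition(h, s, a, rng):
    return rng.randrange(S)

feedback = run_episode(loss, random_transition, uniform_policy, H, rng)
```

Here `feedback` holds exactly `H` entries, one per step, while the full loss table has `H * S * A` entries. The remaining entries stay hidden from the learner, which is what separates bandit feedback from full-information feedback.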
Feb-8-2026, 18:28:24 GMT