Reviews: Worst-Case Regret Bounds for Exploration via Randomized Value Functions
–Neural Information Processing Systems
The paper gives a frequentist regret bound for the RLSVI algorithm. While the bound is not minimax optimal (and can potentially be improved), this is the first frequentist guarantee for this algorithm, and the proof contains new technical insights that may be useful in future work. Further, the result demonstrates that algorithmic strategies and paradigms other than, say, optimism may yield provably sample-efficient RL methods. Thanks for notifying us about the bug you found in the proof! I discussed it with the reviewers, and we all agreed it was not a deal-breaker.