Reviews: Worst-Case Regret Bounds for Exploration via Randomized Value Functions
–Neural Information Processing Systems
The paper gives a frequentist regret bound for the RLSVI algorithm. While the bound is not minimax optimal (and can potentially be improved), this is the first frequentist guarantee for this algorithm, and the proof contains new technical insights that may be useful in future work. Further, the result demonstrates that algorithmic strategies and paradigms other than, say, optimism may yield provably sample-efficient RL methods. Thanks for notifying us about the bug you found in the proof! I discussed it with the reviewers, and we all agreed it was not a deal-breaker.