First-Explore, then Exploit: Meta-Learning to Solve Hard Exploration-Exploitation Trade-Offs
Ben Norman
Neural Information Processing Systems
The objective is to maximize the total reward accumulated over all episodes (e.g., the number of games won).
Aug-17-2025, 02:58:47 GMT