related work
–Neural Information Processing Systems
Deterministic RL. Deterministic systems are often the starting point in the study of sample-efficient algorithms, where the exploration-exploitation trade-off is more clearly revealed since both the transition kernel and the reward function are deterministic. The seminal work [81] proposes a sample-efficient algorithm for Q-learning that works for a family of function classes. Recently, [21] studies the agnostic setting, where the optimal Q-function can only be approximated by a function class up to some approximation error. The algorithm in [21] learns the optimal policy with a number of trajectories that is linear in the eluder dimension.

Consider an MDP $\mathcal{M}$ where the transition is deterministic. Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. For any $t \in (0,1)$, if $d \geq \Omega(\log(B_W/\lambda))$ and $n \geq d \cdot \mathrm{poly}(\kappa, k, \lambda, B_W, B_\phi, H, \log(d/t))$, then with probability at least $1 - t$, Algorithm 1 returns the optimal policy $\pi^*$.
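To illustrate why determinism is a natural starting case, the following minimal sketch treats a tabular, finite-horizon problem: a single query of each state-action pair recovers the transition and reward exactly, after which backward induction yields an optimal policy. This is not Algorithm 1 and involves no function approximation. The names `learn_deterministic_mdp`, `plan`, and `toy_step`, the three-state chain, and the assumption that any state-action pair can be queried directly (as with a simulator) are all hypothetical and introduced only for illustration.

```python
from itertools import product


def learn_deterministic_mdp(step, states, actions):
    """Query each (state, action) pair once.

    Because transitions and rewards are deterministic, one sample per
    pair determines the model exactly (assumes simulator-style access).
    """
    return {(s, a): step(s, a) for s, a in product(states, actions)}


def plan(model, states, actions, horizon):
    """Finite-horizon backward induction on the learned model.

    Returns a list of per-step greedy policies (index 0 is the first
    step) and the optimal value of each state at the first step.
    """
    value = {s: 0.0 for s in states}
    policy = []
    for _ in range(horizon):
        # Q(s, a) = reward + value of the (unique) next state.
        q = {
            (s, a): model[(s, a)][1] + value[model[(s, a)][0]]
            for s, a in product(states, actions)
        }
        greedy = {s: max(actions, key=lambda a, s=s: q[(s, a)]) for s in states}
        value = {s: q[(s, greedy[s])] for s in states}
        policy.insert(0, greedy)
    return policy, value


# Hypothetical toy chain: action 1 moves right, action 0 stays put;
# reward 1.0 is paid whenever the next state is the last state.
STATES = [0, 1, 2]
ACTIONS = [0, 1]
HORIZON = 3


def toy_step(s, a):
    next_state = min(s + 1, STATES[-1]) if a == 1 else s
    reward = 1.0 if next_state == STATES[-1] else 0.0
    return next_state, reward


model = learn_deterministic_mdp(toy_step, STATES, ACTIONS)
policy, value = plan(model, STATES, ACTIONS, HORIZON)
```

The sketch uses exactly one query per state-action pair, which is only possible because nothing is random. With stochastic transitions, or with large state spaces that require function approximation as in the setting above, this enumeration no longer suffices.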
Apr-25-2026, 18:37:41 GMT