Supplementary Material for: An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
–Neural Information Processing Systems
We first verify the statement for the terminal state f. Observe that at the terminal state f, regardless of the action taken, the next state is always f and the reward is always 0. Hence Q h(f,) = V h(f) = 0 for all h [H]. Thus Q h(f,) = hφ(f,),v(a)i= 0. We now verify realizability for other states via induction on h = H,H 1,,1. Next, note that h, (2) follows from (1). In other words, (1) implies that a is always the optimal action.
Neural Information Processing Systems
Apr-25-2026, 20:32:50 GMT
- Technology: